NMDC Core Database
National Microbiome Data Collaborative (NMDC) core data including samples, studies, omics data, annotations, and embeddings. Contains data from GOLD, metabolomics, metagenomics, and other microbiome sources. Supports multi-modal analysis through unified embeddings and tokenization.

DATABASE STATISTICS (as of 2024):
- 48 research studies with comprehensive metadata
- 3,129,061+ metabolomics feature records
- 48,196 annotation terms (GO, EC, KEGG, COG, MetaCyc)
- 39,354 non-obsolete GO terms
- 256-dimensional unified embeddings per biosample

ANNOTATION TERM COUNTS BY SOURCE:
| Source | Terms | Description |
|---|---|---|
| GO | 48,196 | Gene Ontology terms |
| EC | 8,813 | Enzyme Commission numbers |
| KEGG KO | 8,104 | KEGG Orthology functional orthologs |
| MetaCyc | 1,538 | MetaCyc metabolic pathways |
| KEGG Module | 370 | KEGG functional modules |
| KEGG Pathway | 306 | KEGG metabolic pathways |
| COG | 26 | Clusters of Orthologous Groups |

GO TERM DISTRIBUTION BY NAMESPACE:
| Namespace | Count | Fraction |
|---|---|---|
| biological_process | 30,817 | 64% |
| molecular_function | 12,805 | 27% |
| cellular_component | 4,573 | 9% |

ECOSYSTEM COVERAGE (studies with ecosystem data):
- Environmental/Terrestrial/Soil (3 studies)
- Environmental/Aquatic/Freshwater (3 studies)
- Environmental/Terrestrial/Deep subsurface (1 study)
- Host-associated/Plants (1 study)

KEY FEATURES:
- Multi-modal embeddings for similarity search across samples
- Unified annotation vocabulary across GO, KEGG, EC, COG, MetaCyc
- Pre-computed GO hierarchy for efficient ancestor queries
- Mass spectrometry metabolomics with compound identification
- Taxonomic classification from multiple tools (Kraken, GOTTCHA, Centrifuge)

USAGE:
- For functional analysis, start with annotation_terms_unified.
- For metabolomics, use metabolomics_gold.
- For similarity search, use embeddings_v1.
- For GO-based enrichment, use go_hierarchy_flat for closure queries.
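
The closure queries mentioned for go_hierarchy_flat can be illustrated with a minimal sketch. It assumes rows expose the `go_id` and semicolon-separated `all_ancestors` slots listed in the Slots table; the toy rows, the made-up GO IDs, and the helper names `build_closure` and `propagate_counts` are hypothetical and not part of the schema. The idea is that a gene annotated to a term also counts toward every ancestor of that term, which is the usual first step before an enrichment test.

```python
from collections import Counter

# Hypothetical rows shaped like go_hierarchy_flat (toy IDs, not real GO terms).
TOY_HIERARCHY = [
    {"go_id": "GO:9000003", "all_ancestors": "GO:9000001;GO:9000002"},
    {"go_id": "GO:9000002", "all_ancestors": "GO:9000001"},
    {"go_id": "GO:9000001", "all_ancestors": ""},
]


def build_closure(rows):
    """Map each GO ID to the set holding itself plus all of its ancestors."""
    closure = {}
    for row in rows:
        # Split the semicolon-separated string and drop empty pieces.
        ancestors = {a for a in row["all_ancestors"].split(";") if a}
        closure[row["go_id"]] = ancestors | {row["go_id"]}
    return closure


def propagate_counts(direct_annotations, closure):
    """Count each gene once for every term in the closure of its annotations."""
    counts = Counter()
    for gene, terms in direct_annotations.items():
        expanded = set()
        for term in terms:
            # Terms missing from the hierarchy fall back to themselves only.
            expanded |= closure.get(term, {term})
        counts.update(expanded)
    return counts


closure = build_closure(TOY_HIERARCHY)
annotations = {"geneA": ["GO:9000003"], "geneB": ["GO:9000002"]}
term_counts = propagate_counts(annotations, closure)
print(dict(term_counts))
```

Because the ancestors are pre-computed, the propagation needs no graph traversal: one lookup per direct annotation is enough.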
URI: https://w3id.org/kbase/nmdc_core
Name: nmdc_core
Classes
| Class | Description |
|---|---|
| AbioticEmbeddings | Abiotic factor embeddings from environmental measurements (pH, temperature, m... |
| AbioticFeatures | Abiotic environmental features for machine learning |
| AnnotationCrossrefs | Cross-references between annotation databases |
| AnnotationHierarchiesUnified | Unified annotation hierarchies across sources |
| AnnotationTermsUnified | Unified annotation terms across sources (GO, KEGG, EC, COG, MetaCyc) |
| BiochemicalEmbeddings | Biochemical feature embeddings from metabolomics data |
| BiochemicalFeatures | Biochemical features from metabolomics for machine learning |
| CentrifugeGold | Centrifuge taxonomic classifications for GOLD samples |
| CogCategories | COG functional categories with descriptions and colors |
| ContigTaxonomy | Contig-level taxonomic assignments from metagenome assemblies |
| EcTerms | Enzyme Commission (EC) number terms |
| EmbeddingsV1 | Vector embeddings version 1 for samples and entities |
| GoHierarchyFlat | Flattened GO hierarchy for efficient ancestor/descendant queries |
| GoTerms | Gene Ontology terms with full metadata |
| GottchaGold | GOTTCHA taxonomic classifications for GOLD samples |
| KeggKoTerms | KEGG Orthology (KO) terms |
| KeggPathwayTerms | KEGG pathway definitions with category classification |
| KrakenGold | Kraken taxonomic classifications for GOLD samples |
| LipidomicsGold | Lipidomics data linked to GOLD samples |
| MetabolomicsGold | Metabolomics data linked to GOLD samples |
| MetacycPathwayReactions | Reactions within MetaCyc pathways |
| MetacycPathways | MetaCyc metabolic pathways with hierarchical classification |
| MetatranscriptomicsGold | Metatranscriptomics expression data linked to GOLD samples |
| NomFeatureMetadata | NOM feature metadata including molecular formulas, exact masses, and compound... |
| NomGold | Natural organic matter data linked to GOLD samples |
| NomMatrixOptimized | Optimized NOM feature matrix for efficient queries |
| OmicsFilesTable | Inventory of omics data files with metadata and URLs |
| ProteomicsGold | Proteomics data linked to GOLD samples |
| RheaCrossrefs | Cross-references from Rhea reactions to other databases |
| RheaReactions | Rhea biochemical reactions database |
| SampleFileLookup | Sample to file mapping for data retrieval |
| SampleFileSelections | User-curated file selections per sample for analysis |
| SampleTokensV1 | Sample-level token assignments from vocabulary |
| StudyTable | NMDC research studies with ecosystem classification and investigator informat... |
| TaxonomyDim | Taxonomic hierarchy dimension table using NCBI taxonomy |
| TaxonomyEmbeddings | Taxonomic profile embeddings - vector representation of community composition... |
| TaxonomyFeatures | Taxonomy-derived features for machine learning |
| TraitEmbeddings | Trait-based embeddings derived from functional annotations |
| TraitFeatures | Trait-derived features for machine learning |
| TraitSources | Sources for trait data (databases, literature, predictions) |
| TraitTaxonomyMapping | Mapping between traits and taxonomic groups |
| TraitUnified | Unified trait annotations across samples from multiple sources |
| UnifiedEmbeddings | Unified multi-modal embeddings combining taxonomy, traits, abiotic factors, a... |
| VocabRegistryV1 | Vocabulary registry for multi-modal tokenization |
Slots
| Slot | Description |
|---|---|
| index_level_0 | Sample identifier (biosample ID) |
| all_ancestors | Semicolon-separated all ancestor GO IDs (transitive closure) |
| all_parents | Semicolon-separated direct parent GO IDs (immediate is_a/part_of parents) |
| associated_dois | JSON array of DOI objects with doi_value, doi_category (dataset_doi, award_do... |
| category | KEGG functional category |
| category_code | Single-letter COG category code (A-Z) |
| category_name | Full category name |
| chebi | ChEBI compound ID |
| class | Class name |
| cog_id | Internal COG category ID (same as category_code) |
| color_code | Hex color code for visualization (without # prefix) |
| confidence_level | Confidence assessment of annotations from this source |
| confidence_score | Confidence score for this mapping (0-1) |
| coverage_count | Number of trait-taxon assignments from this source |
| definition | Full term definition with citations in double quotes |
| depth | Maximum depth from root (root terms have depth 1) |
| description | Term description or definition |
| dim_0 | First embedding dimension |
| dim_255 | Last embedding dimension (256 total dimensions, 0-255) |
| ec_id | EC number in X |
| ecosystem | Top-level ecosystem classification |
| ecosystem_category | Ecosystem category (second level) |
| ecosystem_subtype | Ecosystem subtype for further classification |
| ecosystem_type | Specific ecosystem type (third level) |
| entity_key | Unique key for this entity within its type |
| entity_type | Type of entity this token represents |
| evidence_type | Type of evidence supporting this mapping |
| family | Family name |
| feature_id | Metabolite feature ID (unique within file) |
| file_id | NMDC data object ID for source file |
| file_name | Original CSV file name |
| funding_sources | JSON array of funding sources |
| genus | Genus name |
| go_id | GO term ID in GO:NNNNNNN format |
| gold_study_identifiers | JSON array of GOLD study IDs |
| human_name | Human-readable name for the token |
| inchi | InChI (International Chemical Identifier) string |
| inchikey | InChIKey - 27-character hash of InChI for database searching |
| Intensity | Signal intensity (peak height) |
| is_obsolete | Whether term is deprecated and should not be used for new annotations |
| kegg | KEGG compound ID (C##### format) |
| kingdom | Kingdom/superkingdom name |
| ko_id | KEGG Orthology ID in KXXXXX format |
| mapping_id | Unique identifier for this trait-taxon mapping |
| modality_id | Data modality (taxonomy, trait, abiotic, biochemical) |
| Molecular_Formula | Chemical formula when determined from isotope patterns |
| mz | Mass-to-charge ratio (m/z) |
| name | Human-readable term name/label |
| namespace | Ontology namespace (primarily for GO terms) |
| order | Order name |
| organism_name | Organism name from source database |
| parent_pathway | Parent pathway in hierarchy |
| pathway_id | KEGG pathway ID (ko or map prefix) |
| phylum | Phylum name |
| principal_investigator_name | PI name |
| principal_investigator_orcid | PI ORCID identifier for disambiguation |
| Retention_Time_min | Chromatographic retention time in minutes |
| RHEA_ID_BI | Bidirectional reaction ID |
| RHEA_ID_LR | Left-to-right reaction ID |
| RHEA_ID_MASTER | Master reaction ID |
| RHEA_ID_RL | Right-to-left reaction ID |
| rule_definition | For rule-based sources, the logical rule definition |
| sample_id | Sample identifier |
| smiles | SMILES notation for chemical structure |
| source | Source ontology/database for this term |
| source_database | Database of origin |
| source_id | Unique identifier for the trait source |
| source_modality | Data modality this token comes from |
| source_name | Human-readable source name |
| source_type | Type of source (curated, rule_based, literature, computed) |
| species | Species name (binomial or with identifier) |
| specific_ecosystem | Most specific ecosystem classification |
| study_category | Category of study |
| study_id | NMDC study identifier |
| synonyms | Semicolon-separated list of term synonyms |
| taxid | NCBI taxonomy ID (integer) |
| taxon_id | NCBI taxonomy ID or IMG taxon_oid |
| term_id | Term identifier with format varying by source |
| title | Formal study title (may differ from name) |
| token_id | Unique token ID in vocabulary |
| trait_category | Category of trait (phenotype, metabolism, energy_source, oxygen_req, cell_sha... |
| trait_id | Unified trait identifier |
| trait_name | Human-readable trait name |
| value | Token value/weight (e |
| websites | JSON array of associated website URLs |
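
As an illustration of how the `dim_0` through `dim_255` slots might support similarity search, the following sketch ranks samples by cosine similarity. It assumes each embedding row is keyed by `index_level_0` and carries one numeric column per dimension; cosine similarity is an assumed metric, not one the schema prescribes, and the toy rows, the 4-dimension shortcut, and the helper names are hypothetical.

```python
import math

# The schema describes 256 dimensions (dim_0 .. dim_255); the toy data
# below uses 4 so the example stays readable.
N_DIMS = 4

TOY_ROWS = [
    {"index_level_0": "sample_a", "dim_0": 1.0, "dim_1": 0.0, "dim_2": 0.0, "dim_3": 0.0},
    {"index_level_0": "sample_b", "dim_0": 0.9, "dim_1": 0.1, "dim_2": 0.0, "dim_3": 0.0},
    {"index_level_0": "sample_c", "dim_0": 0.0, "dim_1": 1.0, "dim_2": 0.0, "dim_3": 0.0},
]


def row_to_vector(row, n_dims=N_DIMS):
    """Collect dim_0 .. dim_{n_dims-1} into an ordered list of floats."""
    return [float(row[f"dim_{i}"]) for i in range(n_dims)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query_id, rows, top_k=2):
    """Return up to top_k (sample_id, similarity) pairs, most similar first."""
    vectors = {row["index_level_0"]: row_to_vector(row) for row in rows}
    query = vectors[query_id]
    scored = [
        (other_id, cosine_similarity(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id  # skip the query sample itself
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


ranking = most_similar("sample_a", TOY_ROWS)
print(ranking)
```

For the full 256-dimensional tables, a vectorized library or a dedicated vector index would likely be a better fit than this pure-Python loop.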
Enumerations
| Enumeration | Description |
|---|---|
| AnnotationSource | Source databases for annotation terms |
| CogFunctionalCategory | COG single-letter functional categories (A-Z) |
| Ecosystem | Top-level ecosystem classification from GOLD |
| EcosystemCategory | Ecosystem category within top-level ecosystem classification |
| EcosystemType | Specific ecosystem types within categories |
| GoNamespace | Gene Ontology namespaces (aspects) |
| SourceModality | Data modality from which tokens/embeddings are derived |
| StudyCategory | Study organization type - individual research or consortium |
| TokenEntityType | Types of entities in the vocabulary registry for tokenization |
Types
| Type | Description |
|---|---|
| Boolean | A binary (true or false) value |
| Curie | A compact URI |
| Date | A date (year, month and day) in an idealized calendar |
| DateOrDatetime | Either a date or a datetime |
| Datetime | The combination of a date and time |
| Decimal | A real number with arbitrary precision that conforms to the xsd:decimal speci... |
| Double | A real number that conforms to the xsd:double specification |
| Float | A real number that conforms to the xsd:float specification |
| Integer | An integer |
| Jsonpath | A string encoding a JSON Path |
| Jsonpointer | A string encoding a JSON Pointer |
| Ncname | Prefix part of CURIE |
| Nodeidentifier | A URI, CURIE or BNODE that represents a node in a model |
| Objectidentifier | A URI or CURIE that represents an object in the model |
| Sparqlpath | A string encoding a SPARQL Property Path |
| String | A character string |
| Time | A time object represents a (local) time of day, independent of any particular... |
| Uri | A complete URI |
| Uriorcurie | A URI or a CURIE |
Subsets
| Subset | Description |
|---|---|