NMDC Core Database

National Microbiome Data Collaborative (NMDC) core data, including samples, studies, omics data, annotations, and embeddings. Contains data from GOLD, metabolomics, metagenomics, and other microbiome sources. Supports multi-modal analysis through unified embeddings and tokenization.

DATABASE STATISTICS (as of 2024):
- 48 research studies with comprehensive metadata
- 3,129,061+ metabolomics feature records
- 48,196 annotation terms (GO, EC, KEGG, COG, MetaCyc)
- 39,354 non-obsolete GO terms
- 256-dimensional unified embeddings per biosample

ANNOTATION TERM COUNTS BY SOURCE:

| Source       | Terms  | Description                         |
|--------------|--------|-------------------------------------|
| GO           | 48,196 | Gene Ontology terms                 |
| EC           | 8,813  | Enzyme Commission numbers           |
| KEGG KO      | 8,104  | KEGG Orthology functional orthologs |
| MetaCyc      | 1,538  | MetaCyc metabolic pathways          |
| KEGG Module  | 370    | KEGG functional modules             |
| KEGG Pathway | 306    | KEGG metabolic pathways             |
| COG          | 26     | Clusters of Orthologous Groups      |

GO TERM DISTRIBUTION BY NAMESPACE:

| Namespace          | Count  | Fraction |
|--------------------|--------|----------|
| biological_process | 30,817 | 64%      |
| molecular_function | 12,805 | 27%      |
| cellular_component | 4,573  | 9%       |

ECOSYSTEM COVERAGE (studies with ecosystem data):
- Environmental/Terrestrial/Soil (3 studies)
- Environmental/Aquatic/Freshwater (3 studies)
- Environmental/Terrestrial/Deep subsurface (1 study)
- Host-associated/Plants (1 study)

KEY FEATURES:
- Multi-modal embeddings for similarity search across samples
- Unified annotation vocabulary across GO, KEGG, EC, COG, MetaCyc
- Pre-computed GO hierarchy for efficient ancestor queries
- Mass spectrometry metabolomics with compound identification
- Taxonomic classification from multiple tools (Kraken, GOTTCHA, Centrifuge)

USAGE:
For functional analysis, start with annotation_terms_unified. For metabolomics, use metabolomics_gold. For similarity search, use embeddings_v1. For GO-based enrichment, use go_hierarchy_flat for closure queries.
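
The closure query mentioned for go_hierarchy_flat can be sketched in a few lines. This is a minimal illustration, assuming each row exposes a go_id and a semicolon-separated all_ancestors string as described in the Slots table; the rows are held in memory as plain dictionaries, the GO IDs are made-up placeholders, and the helper names `parse_id_list` and `go_closure` are hypothetical rather than part of nmdc_core.

```python
def parse_id_list(value):
    """Split a semicolon-separated ID string into a list, dropping blanks."""
    if not value:
        return []
    return [part.strip() for part in value.split(";") if part.strip()]


def go_closure(annotated_ids, hierarchy_rows):
    """Return the annotated GO IDs plus every ancestor listed for them.

    hierarchy_rows is assumed to be an iterable of dicts with the keys
    "go_id" and "all_ancestors" (transitive closure, semicolon-separated).
    """
    ancestors_by_id = {
        row["go_id"]: parse_id_list(row.get("all_ancestors"))
        for row in hierarchy_rows
    }
    closed = set()
    for go_id in annotated_ids:
        closed.add(go_id)
        # Terms missing from the hierarchy table contribute only themselves.
        closed.update(ancestors_by_id.get(go_id, []))
    return closed


# Hypothetical rows: the IDs follow the GO:NNNNNNN pattern but are invented.
rows = [
    {"go_id": "GO:9000003", "all_ancestors": "GO:9000002;GO:9000001"},
    {"go_id": "GO:9000002", "all_ancestors": "GO:9000001"},
    {"go_id": "GO:9000001", "all_ancestors": ""},
]

terms_with_ancestors = go_closure(["GO:9000003"], rows)
```

Because the ancestors are pre-computed per term, the closure needs one lookup per annotated term and no graph traversal, which is the usual motivation for a flattened hierarchy table.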

URI: https://w3id.org/kbase/nmdc_core

Name: nmdc_core

Classes

| Class | Description |
|-------|-------------|
| AbioticEmbeddings | Abiotic factor embeddings from environmental measurements (pH, temperature, m... |
| AbioticFeatures | Abiotic environmental features for machine learning |
| AnnotationCrossrefs | Cross-references between annotation databases |
| AnnotationHierarchiesUnified | Unified annotation hierarchies across sources |
| AnnotationTermsUnified | Unified annotation terms across sources (GO, KEGG, EC, COG, MetaCyc) |
| BiochemicalEmbeddings | Biochemical feature embeddings from metabolomics data |
| BiochemicalFeatures | Biochemical features from metabolomics for machine learning |
| CentrifugeGold | Centrifuge taxonomic classifications for GOLD samples |
| CogCategories | COG functional categories with descriptions and colors |
| ContigTaxonomy | Contig-level taxonomic assignments from metagenome assemblies |
| EcTerms | Enzyme Commission (EC) number terms |
| EmbeddingsV1 | Vector embeddings version 1 for samples and entities |
| GoHierarchyFlat | Flattened GO hierarchy for efficient ancestor/descendant queries |
| GoTerms | Gene Ontology terms with full metadata |
| GottchaGold | GOTTCHA taxonomic classifications for GOLD samples |
| KeggKoTerms | KEGG Orthology (KO) terms |
| KeggPathwayTerms | KEGG pathway definitions with category classification |
| KrakenGold | Kraken taxonomic classifications for GOLD samples |
| LipidomicsGold | Lipidomics data linked to GOLD samples |
| MetabolomicsGold | Metabolomics data linked to GOLD samples |
| MetacycPathwayReactions | Reactions within MetaCyc pathways |
| MetacycPathways | MetaCyc metabolic pathways with hierarchical classification |
| MetatranscriptomicsGold | Metatranscriptomics expression data linked to GOLD samples |
| NomFeatureMetadata | NOM feature metadata including molecular formulas, exact masses, and compound... |
| NomGold | Natural organic matter data linked to GOLD samples |
| NomMatrixOptimized | Optimized NOM feature matrix for efficient queries |
| OmicsFilesTable | Inventory of omics data files with metadata and URLs |
| ProteomicsGold | Proteomics data linked to GOLD samples |
| RheaCrossrefs | Cross-references from Rhea reactions to other databases |
| RheaReactions | Rhea biochemical reactions database |
| SampleFileLookup | Sample to file mapping for data retrieval |
| SampleFileSelections | User-curated file selections per sample for analysis |
| SampleTokensV1 | Sample-level token assignments from vocabulary |
| StudyTable | NMDC research studies with ecosystem classification and investigator informat... |
| TaxonomyDim | Taxonomic hierarchy dimension table using NCBI taxonomy |
| TaxonomyEmbeddings | Taxonomic profile embeddings - vector representation of community composition... |
| TaxonomyFeatures | Taxonomy-derived features for machine learning |
| TraitEmbeddings | Trait-based embeddings derived from functional annotations |
| TraitFeatures | Trait-derived features for machine learning |
| TraitSources | Sources for trait data (databases, literature, predictions) |
| TraitTaxonomyMapping | Mapping between traits and taxonomic groups |
| TraitUnified | Unified trait annotations across samples from multiple sources |
| UnifiedEmbeddings | Unified multi-modal embeddings combining taxonomy, traits, abiotic factors, a... |
| VocabRegistryV1 | Vocabulary registry for multi-modal tokenization |
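
Similarity search over the embedding classes (for example EmbeddingsV1 or UnifiedEmbeddings) could look roughly like the sketch below. It assumes each row carries columns named dim_0 through dim_255, as listed under Slots, and that rows are keyed by a sample_id column; the key column, the cosine metric, and the helpers `make_row`, `row_to_vector`, and `most_similar` are illustrative assumptions, not a documented nmdc_core API.

```python
import math

N_DIMS = 256  # columns dim_0 .. dim_255


def make_row(sample_id, values):
    """Build a synthetic row dict shaped like an embedding table record."""
    row = {"sample_id": sample_id}
    for i, value in enumerate(values):
        row[f"dim_{i}"] = value
    return row


def row_to_vector(row):
    """Collect dim_0 .. dim_255 from a row into an ordered list of floats."""
    return [float(row[f"dim_{i}"]) for i in range(N_DIMS)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, rows, top_k=5):
    """Rank all other samples by cosine similarity to the query sample."""
    vectors = {row["sample_id"]: row_to_vector(row) for row in rows}
    query = vectors[query_id]
    scores = [
        (sample_id, cosine_similarity(query, vector))
        for sample_id, vector in vectors.items()
        if sample_id != query_id
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Synthetic data only: three made-up samples with simple unit-like vectors.
demo_rows = [
    make_row("sample_a", [1.0] + [0.0] * (N_DIMS - 1)),
    make_row("sample_b", [2.0] + [0.0] * (N_DIMS - 1)),
    make_row("sample_c", [0.0, 1.0] + [0.0] * (N_DIMS - 2)),
]
ranked = most_similar("sample_a", demo_rows, top_k=2)
```

A brute-force scan like this is adequate for small sample counts; for larger tables a vectorized or indexed approach would be the more natural choice.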

Slots

| Slot | Description |
|------|-------------|
| index_level_0 | Sample identifier (biosample ID) |
| all_ancestors | Semicolon-separated all ancestor GO IDs (transitive closure) |
| all_parents | Semicolon-separated direct parent GO IDs (immediate is_a/part_of parents) |
| associated_dois | JSON array of DOI objects with doi_value, doi_category (dataset_doi, award_do... |
| category | KEGG functional category |
| category_code | Single-letter COG category code (A-Z) |
| category_name | Full category name |
| chebi | ChEBI compound ID |
| class | Class name |
| cog_id | Internal COG category ID (same as category_code) |
| color_code | Hex color code for visualization (without # prefix) |
| confidence_level | Confidence assessment of annotations from this source |
| confidence_score | Confidence score for this mapping (0-1) |
| coverage_count | Number of trait-taxon assignments from this source |
| definition | Full term definition with citations in double quotes |
| depth | Maximum depth from root (root terms have depth 1) |
| description | Term description or definition |
| dim_0 | First embedding dimension |
| dim_255 | Last embedding dimension (256 total dimensions, 0-255) |
| ec_id | EC number in X |
| ecosystem | Top-level ecosystem classification |
| ecosystem_category | Ecosystem category (second level) |
| ecosystem_subtype | Ecosystem subtype for further classification |
| ecosystem_type | Specific ecosystem type (third level) |
| entity_key | Unique key for this entity within its type |
| entity_type | Type of entity this token represents |
| evidence_type | Type of evidence supporting this mapping |
| family | Family name |
| feature_id | Metabolite feature ID (unique within file) |
| file_id | NMDC data object ID for source file |
| file_name | Original CSV file name |
| funding_sources | JSON array of funding sources |
| genus | Genus name |
| go_id | GO term ID in GO:NNNNNNN format |
| gold_study_identifiers | JSON array of GOLD study IDs |
| human_name | Human-readable name for the token |
| inchi | InChI (International Chemical Identifier) string |
| inchikey | InChIKey - 27-character hash of InChI for database searching |
| Intensity | Signal intensity (peak height) |
| is_obsolete | Whether term is deprecated and should not be used for new annotations |
| kegg | KEGG compound ID (C##### format) |
| kingdom | Kingdom/superkingdom name |
| ko_id | KEGG Orthology ID in KXXXXX format |
| mapping_id | Unique identifier for this trait-taxon mapping |
| modality_id | Data modality (taxonomy, trait, abiotic, biochemical) |
| Molecular_Formula | Chemical formula when determined from isotope patterns |
| mz | Mass-to-charge ratio (m/z) |
| name | Human-readable term name/label |
| namespace | Ontology namespace (primarily for GO terms) |
| order | Order name |
| organism_name | Organism name from source database |
| parent_pathway | Parent pathway in hierarchy |
| pathway_id | KEGG pathway ID (ko or map prefix) |
| phylum | Phylum name |
| principal_investigator_name | PI name |
| principal_investigator_orcid | PI ORCID identifier for disambiguation |
| Retention_Time_min | Chromatographic retention time in minutes |
| RHEA_ID_BI | Bidirectional reaction ID |
| RHEA_ID_LR | Left-to-right reaction ID |
| RHEA_ID_MASTER | Master reaction ID |
| RHEA_ID_RL | Right-to-left reaction ID |
| rule_definition | For rule-based sources, the logical rule definition |
| sample_id | Sample identifier |
| smiles | SMILES notation for chemical structure |
| source | Source ontology/database for this term |
| source_database | Database of origin |
| source_id | Unique identifier for the trait source |
| source_modality | Data modality this token comes from |
| source_name | Human-readable source name |
| source_type | Type of source (curated, rule_based, literature, computed) |
| species | Species name (binomial or with identifier) |
| specific_ecosystem | Most specific ecosystem classification |
| study_category | Category of study |
| study_id | NMDC study identifier |
| synonyms | Semicolon-separated list of term synonyms |
| taxid | NCBI taxonomy ID (integer) |
| taxon_id | NCBI taxonomy ID or IMG taxon_oid |
| term_id | Term identifier with format varying by source |
| title | Formal study title (may differ from name) |
| token_id | Unique token ID in vocabulary |
| trait_category | Category of trait (phenotype, metabolism, energy_source, oxygen_req, cell_sha... |
| trait_id | Unified trait identifier |
| trait_name | Human-readable trait name |
| value | Token value/weight (e |
| websites | JSON array of associated website URLs |
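
Several slot descriptions above state an identifier format (go_id, ko_id, kegg, inchikey, color_code). The sketch below shows one way to check values against those formats before loading or joining data. The regular expressions are an interpretation of the descriptions: "KXXXXX" and "C#####" are read as a letter followed by five digits, the InChIKey pattern follows the standard 14-10-1 block layout, and a six-digit hex color is assumed. All function names are hypothetical.

```python
import re

# Patterns inferred from the slot descriptions; treat them as assumptions.
GO_ID = re.compile(r"GO:\d{7}")                         # GO:NNNNNNN
KO_ID = re.compile(r"K\d{5}")                           # KXXXXX
KEGG_COMPOUND = re.compile(r"C\d{5}")                   # C#####
INCHIKEY = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")     # 27 characters total
HEX_COLOR = re.compile(r"[0-9A-Fa-f]{6}")               # stored without "#"


def is_go_id(value):
    return bool(GO_ID.fullmatch(value))


def is_ko_id(value):
    return bool(KO_ID.fullmatch(value))


def is_kegg_compound(value):
    return bool(KEGG_COMPOUND.fullmatch(value))


def is_inchikey(value):
    return bool(INCHIKEY.fullmatch(value))


def css_color(color_code):
    """Add the '#' prefix that color_code omits, after validating the digits."""
    if not HEX_COLOR.fullmatch(color_code):
        raise ValueError(f"not a six-digit hex color: {color_code!r}")
    return "#" + color_code


checks = {
    "go_id": is_go_id("GO:0008150"),
    "ko_id": is_ko_id("K00001"),
    "kegg": is_kegg_compound("C00001"),
    "inchikey": is_inchikey("XLYOFNOQVPJJNP-UHFFFAOYSA-N"),
}
```

Using `fullmatch` rather than `match` rejects values with trailing characters, which matters when IDs arrive inside semicolon-separated strings that have not been split yet.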

Enumerations

| Enumeration | Description |
|-------------|-------------|
| AnnotationSource | Source databases for annotation terms |
| CogFunctionalCategory | COG single-letter functional categories (A-Z) |
| Ecosystem | Top-level ecosystem classification from GOLD |
| EcosystemCategory | Ecosystem category within top-level ecosystem classification |
| EcosystemType | Specific ecosystem types within categories |
| GoNamespace | Gene Ontology namespaces (aspects) |
| SourceModality | Data modality from which tokens/embeddings are derived |
| StudyCategory | Study organization type - individual research or consortium |
| TokenEntityType | Types of entities in the vocabulary registry for tokenization |
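
When working with these enumerations in client code, a small typed wrapper can catch unexpected values early. The sketch below mirrors GoNamespace and SourceModality as Python enums. The permissible values are not listed on this page, so the members shown are assumptions taken from the GO namespace table in the description and from the modality_id slot description; the `parse_namespace` helper is hypothetical.

```python
from enum import Enum


class GoNamespace(Enum):
    # Values assumed from the "GO TERM DISTRIBUTION BY NAMESPACE" table.
    BIOLOGICAL_PROCESS = "biological_process"
    MOLECULAR_FUNCTION = "molecular_function"
    CELLULAR_COMPONENT = "cellular_component"


class SourceModality(Enum):
    # Values assumed from the modality_id slot description.
    TAXONOMY = "taxonomy"
    TRAIT = "trait"
    ABIOTIC = "abiotic"
    BIOCHEMICAL = "biochemical"


def parse_namespace(value):
    """Convert a raw namespace string to GoNamespace.

    Raises ValueError for anything outside the three assumed namespaces.
    """
    return GoNamespace(value.strip().lower())


namespace = parse_namespace(" Biological_Process ")
```

Looking a member up by value, as `GoNamespace(...)` does, raises `ValueError` on unknown input, so malformed rows surface at parse time rather than later in an analysis.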

Types

| Type | Description |
|------|-------------|
| Boolean | A binary (true or false) value |
| Curie | A compact URI |
| Date | A date (year, month and day) in an idealized calendar |
| DateOrDatetime | Either a date or a datetime |
| Datetime | The combination of a date and time |
| Decimal | A real number with arbitrary precision that conforms to the xsd:decimal speci... |
| Double | A real number that conforms to the xsd:double specification |
| Float | A real number that conforms to the xsd:float specification |
| Integer | An integer |
| Jsonpath | A string encoding a JSON Path |
| Jsonpointer | A string encoding a JSON Pointer |
| Ncname | Prefix part of CURIE |
| Nodeidentifier | A URI, CURIE or BNODE that represents a node in a model |
| Objectidentifier | A URI or CURIE that represents an object in the model |
| Sparqlpath | A string encoding a SPARQL Property Path |
| String | A character string |
| Time | A time object represents a (local) time of day, independent of any particular... |
| Uri | A complete URI |
| Uriorcurie | A URI or a CURIE |

Subsets

| Subset | Description |
|--------|-------------|