Microbial Traits Across Databases
This document explains how microbial traits (phenotypes, metabolic capabilities) are modeled across different databases in the JGI and KBase lakehouses.
Overview
Trait data exists in multiple databases with different modeling approaches:
| Database | Approach | Tables | Coverage |
|---|---|---|---|
| GOLD | Curated phenotypes | cvphenotype, organism_phenotype |
7,113 mappings |
| IMG | Pathway-based rules | phenotype_rule, phenotype_rule_taxons |
2M+ predictions |
| GapMind | Pathway completeness | gapmind_pathways |
463M assessments |
| NMDC | Unified (planned) | trait_unified, trait_sources |
Not yet implemented |
Not Available: MicroTraits database is not currently integrated into these lakehouses.
GOLD: Curated Phenotypes
GOLD stores manually curated phenotype annotations assigned by experts during genome registration.
Schema
cvphenotype (159 terms)
│
│ phenotype_id
▼
organism_phenotype (7,113 mappings)
│
│ organism_id
▼
organism_v2 (organism metadata)
Controlled Vocabulary (cvphenotype)
159 phenotype terms organized into functional categories:
| Category | Example Terms | Count |
|---|---|---|
| Pathogenicity | Pathogen, Non-Pathogen, Opportunistic Pathogen | ~10 |
| Extremophiles | Acidophile, Alkaliphile, Thermoacidophile, Piezophile | ~8 |
| Metabolism | Nitrogen cycle, Sulfur cycle, Carbon sequestration | ~5 |
| Biochemical | Catalase positive, Catalase negative, Beta-hemolytic | ~15 |
| Ecological | Symbiont, Probiotic, Biofilm, Bioluminescent | ~20 |
Top Phenotypes by Usage
SELECT p.term, COUNT(*) as organism_count
FROM "gold-db-2 postgresql".gold.organism_phenotype op
JOIN "gold-db-2 postgresql".gold.cvphenotype p ON op.phenotype_id = p.id
GROUP BY p.term
ORDER BY organism_count DESC
LIMIT 10
| Phenotype | Organisms |
|---|---|
| Pathogen | 3,132 |
| Non-Pathogen | 576 |
| Catalase negative | 520 |
| Catalase positive | 414 |
| Opportunistic Pathogen | 279 |
| Beta-hemolytic | 236 |
| Parasite | 146 |
| Acidophile | 143 |
| Intracellular pathogen | 123 |
| Probiotic | 106 |
Related GOLD Trait Tables
GOLD uses the same pattern for other organism characteristics:
| CV Table | Terms | Mapping Table | Assignments |
|---|---|---|---|
cvphenotype |
159 | organism_phenotype |
7,113 |
cvmetabolism |
157 | organism_metabolism |
2,423 |
cvhabitat |
957 | organism_habitat |
46,296 |
cvenergy_source |
31 | organism_energy_source |
8,316 |
cvbiotic_relationship |
3 | organism_biotic_rel |
9,434 |
cvdisease |
434 | organism_disease |
12,520 |
IMG: Pathway-Based Phenotype Rules
IMG predicts phenotypes computationally based on pathway presence/absence in genomes.
Schema
img_pathway (pathway definitions)
│
│ pathway_oid
▼
img_pathway_reactions (ordered reaction steps)
│
│ rxn_oid
▼
img_reaction → img_reaction_catalysts → img_term (enzyme definitions)
│
│ term_oid
▼
gene_img_functions (gene → enzyme annotations)
│
│ gene_oid
▼
gene (individual genes in genomes)
Phenotype Rules
65 logical rules that predict traits from pathway presence:
SELECT rule_id, name, cv_type, cv_value, rule
FROM "img-db-2 postgresql".img_ext.phenotype_rule
LIMIT 10
| rule_id | name | cv_type | cv_value | rule |
|---|---|---|---|---|
| 2 | Aerobe | oxygen_req | Aerobe | (768\|769\|770) |
| 4 | Denitrifier | metabolism | Denitrifying | (764),(765),(766),(767) |
| 6 | Carbon fixation | metabolism | Carbon fixation | (527\|335\|285\|...) |
| 7 | L-lysine auxotroph | metabolism | Auxotroph | (!170),(!169),(!199),... |
| 22 | L-histidine auxotroph | metabolism | Auxotroph | (!162) |
| 52 | Dissimilatory sulfate reduction | metabolism | Sulfate reducer | (870),(864),(871) |
| 53 | Nitrogen fixer | metabolism | Nitrogen fixation | (798) |
Rule Syntax
Rules use IMG pathway IDs with logical operators:
| Operator | Meaning | Example |
|---|---|---|
(N) |
Pathway N present | (798) = has nitrogenase |
(!N) |
Pathway N absent | (!162) = lacks histidine synthesis |
\| |
OR (within group) | (768\|769\|770) = has ANY quinol oxidase |
, |
AND (between groups) | (764),(765) = has BOTH pathways |
Example: Lysine Auxotroph Rule
The rule (!170),(!169),(!199),(!333),(!465),(!171) means "lacks ALL lysine biosynthesis pathways":
| Pathway | Name | If Absent |
|---|---|---|
| 170 | L-lysine synthesis (acetylated intermediates) | Missing DAP-acetyl route |
| 169 | L-lysine synthesis (succinylated intermediates) | Missing DAP-succinyl route |
| 199 | L-lysine synthesis by fungal aminoadipate | Missing fungal route |
| 333 | L-lysine synthesis by prokaryotic aminoadipate | Missing bacterial AAA route |
| 465 | L-lysine synthesis with DAP aminotransferase | Missing DAP-AT route |
| 171 | L-lysine synthesis with DAP dehydrogenase | Missing DAP-DH route |
An organism is predicted as L-lysine auxotroph if it lacks ALL six pathways.
How Pathways Are Defined
Each pathway is defined as ordered reactions with required enzymes:
-- Pathway 170: L-lysine synthesis (acetylated intermediates)
SELECT pr.rxn_order, r.rxn_name, t.term as enzyme, te.enzymes as ec_number
FROM "img-db-2 postgresql".img_ext.img_pathway_reactions pr
JOIN "img-db-2 postgresql".img_ext.img_reaction r ON pr.rxn = r.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_reaction_catalysts rc ON r.rxn_oid = rc.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_term t ON rc.catalysts = t.term_oid
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te ON t.term_oid = te.term_oid
WHERE pr.pathway_oid = 170
ORDER BY pr.rxn_order
| Step | Enzyme | EC Number |
|---|---|---|
| 1 | Aspartate kinase | EC:2.7.2.4 |
| 2 | Aspartate semialdehyde dehydrogenase | EC:1.2.1.11 |
| 3 | Dihydrodipicolinate synthase | EC:4.2.1.52 |
| 4 | Dihydrodipicolinate reductase | EC:1.3.1.26 |
| 5 | THDP N-acetyltransferase | EC:2.3.1.89 |
| 6 | Acetyldiaminopimelate aminotransferase | EC:2.6.1.- |
| 7 | N-acetyl-DAP deacetylase | EC:3.5.1.47 |
| 8 | Diaminopimelate epimerase | EC:5.1.1.7 |
| 9 | Diaminopimelate decarboxylase | EC:4.1.1.20 |
Gene Annotation Pipeline
Genes are annotated to enzymes via multiple evidence types:
SELECT gif.evidence, COUNT(*) as gene_count
FROM "img-db-2 postgresql".img_ext.gene_img_functions gif
GROUP BY gif.evidence
| Evidence | Description |
|---|---|
| BLAST | Sequence similarity to characterized proteins |
| Inferred | Transferred from ortholog annotations |
| HMM | Hidden Markov Model profile matches |
Phenotype Rule Coverage
SELECT pr.name, COUNT(prt.taxon) as taxon_count
FROM "img-db-2 postgresql".img_ext.phenotype_rule pr
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON pr.rule_id = prt.rule_id
GROUP BY pr.name
ORDER BY taxon_count DESC
LIMIT 10
| Phenotype Rule | Taxa Count |
|---|---|
| L-histidine auxotroph | 74,695 |
| L-glutamate prototroph | 73,791 |
| L-phenylalanine auxotroph | 72,742 |
| L-tyrosine auxotroph | 72,589 |
| Non-biotin synthesizer | 71,742 |
| L-lysine auxotroph | 69,651 |
| Non-selenocysteine synthesizer | 67,786 |
| Glycine prototroph | 66,760 |
| L-leucine auxotroph | 63,982 |
| Aerobe | 45,956 |
GapMind: Pathway Completeness Scoring
GapMind (Price et al. 2020) assesses metabolic pathway completeness by searching for characterized enzymes and transporters in genome sequences.
Schema
gapmind_pathways (463M assessments)
│
│ genome_id
▼
kbase_ke_pangenome.genome (293K genomes)
Pathway Categories
80 metabolic pathways organized into three categories:
| Category | Description | Example Pathways |
|---|---|---|
aa |
Amino acid biosynthesis | arg, asn, cys, gln, his, ile, leu, lys, met, phe, pro, ser, thr, trp, tyr, val |
carbon |
Carbon source utilization | sucrose, lactose, maltose, trehalose, cellobiose, glucose |
aromatic |
Aromatic compound metabolism | 4-hydroxybenzoate, phenylacetate |
Scoring System
| Field | Description |
|---|---|
nHi |
High-confidence gene hits (strong matches) |
nMed |
Medium-confidence gene hits |
nLo |
Low-confidence gene hits (weak matches) |
score |
Overall score (positive = complete, negative = incomplete) |
score_category |
present, partial, or not_present |
score_simplified |
1.0 (present), 0.5 (partial), 0.0 (not_present) |
Example Query
-- Find genomes that can synthesize all amino acids
SELECT genome_id, COUNT(*) as complete_pathways
FROM globalusers_gapmind_pathways.gapmind_pathways
WHERE metabolic_category = 'aa'
AND score_category = 'present'
GROUP BY genome_id
HAVING COUNT(*) >= 20
Top Pathways by Assessment Count
| Pathway | Assessments | Category |
|---|---|---|
| phenylalanine | 18,603,662 | aa |
| arginine | 15,945,996 | aa |
| citrulline | 14,617,163 | aa |
| 4-hydroxybenzoate | 14,617,163 | aromatic |
| threonine | 14,617,163 | aa |
| sucrose | 13,288,330 | carbon |
| lactose | 13,288,330 | carbon |
GapMind vs IMG Phenotype Rules
Both predict auxotrophy/prototrophy but differ in approach:
| Aspect | GapMind | IMG Rules |
|---|---|---|
| Method | Score-based pathway completeness | Boolean pathway presence |
| Output | present/partial/not_present | auxotroph/prototroph |
| Granularity | Individual pathways with scores | Binary predictions |
| Coverage | 463M assessments, 80 pathways | 2M taxon mappings, 65 rules |
| Reference | Price et al. 2020 | IMG internal |
Cross-Validation Example
-- Compare GapMind lysine scores with IMG lysine auxotroph predictions
SELECT
g.genome_id,
gm.score_category as gapmind_lys,
CASE WHEN prt.taxon IS NOT NULL THEN 'auxotroph' ELSE 'prototroph' END as img_lys
FROM kbase_ke_pangenome.genome g
LEFT JOIN globalusers_gapmind_pathways.gapmind_pathways gm
ON g.genome_id = gm.genome_id AND gm.pathway = 'lys'
LEFT JOIN "img-db-2 postgresql".img_ext.taxon t
ON SUBSTRING(g.genome_id FROM 4) = t.ncbi_assembly_accession
LEFT JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt
ON t.taxon_oid = prt.taxon AND prt.rule_id = 7 -- L-lysine auxotroph rule
LIMIT 100
NMDC: Unified Trait Model (Planned)
NMDC aims to unify trait data from multiple sources with provenance tracking.
Planned Schema
trait_sources (provenance)
│
│ source_id
▼
trait_taxonomy_mapping (trait-taxon assignments)
│
│ trait_id
▼
trait_unified (unified vocabulary)
Intended Data Flow
GOLD cvphenotype ─────┐
(159 terms) │
├──► trait_unified (combined vocabulary)
IMG phenotype_rule ───┤ │
(65 rules) │ ▼
│ trait_taxonomy_mapping
Literature mining ────┘ │
▼
trait_sources (provenance)
Source Types
| Source Type | Origin | Confidence | Coverage |
|---|---|---|---|
gold_curated |
GOLD organism_phenotype | High | 7,113 |
img_rule_based |
IMG phenotype_rule_taxons | Medium-High | 2,059,203 |
literature |
PubMed NLP extraction | Variable | TBD |
computed |
ML predictions | Requires validation | TBD |
Comparing Approaches
| Aspect | GOLD | IMG Rules | GapMind |
|---|---|---|---|
| Method | Manual curation | Boolean pathway rules | Scored pathway completeness |
| Confidence | High (expert) | Medium-high | Medium-high |
| Coverage | Limited (7K) | Broad (2M+) | Very broad (463M) |
| Granularity | Organism-level | Genome-level | Genome × pathway |
| Output | Categorical traits | Binary predictions | Scores + categories |
| Update frequency | Slow (manual) | Automatic | Automatic |
| Best for | Pathogenicity, extremophiles | Auxotrophy/prototrophy | Metabolic potential |
Cross-Database Queries
Find organisms with both GOLD and IMG phenotype data
SELECT
g.organism_name,
gp.term as gold_phenotype,
pr.name as img_predicted_trait
FROM "gold-db-2 postgresql".gold.organism_v2 g
JOIN "gold-db-2 postgresql".gold.organism_phenotype op ON g.organism_id = op.organism_id
JOIN "gold-db-2 postgresql".gold.cvphenotype gp ON op.phenotype_id = gp.id
JOIN "img-db-2 postgresql".img_ext.taxon t ON g.ncbi_taxon_id = t.ncbi_taxon_id
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON t.taxon_oid = prt.taxon
JOIN "img-db-2 postgresql".img_ext.phenotype_rule pr ON prt.rule_id = pr.rule_id
WHERE gp.term = 'Nitrogen fixation'
AND pr.name = 'Nitrogen fixer'
LIMIT 10
Validate IMG rules against GOLD curation
-- Find genomes where IMG predicts auxotrophy but GOLD says prototrophic
-- (potential false positives or annotation errors)
SELECT
g.organism_name,
pr.name as img_prediction,
gp.term as gold_annotation
FROM "gold-db-2 postgresql".gold.organism_v2 g
JOIN "img-db-2 postgresql".img_ext.taxon t ON g.ncbi_taxon_id = t.ncbi_taxon_id
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON t.taxon_oid = prt.taxon
JOIN "img-db-2 postgresql".img_ext.phenotype_rule pr ON prt.rule_id = pr.rule_id
JOIN "gold-db-2 postgresql".gold.organism_phenotype op ON g.organism_id = op.organism_id
JOIN "gold-db-2 postgresql".gold.cvphenotype gp ON op.phenotype_id = gp.id
WHERE pr.cv_value = 'Auxotroph'
AND gp.term LIKE '%prototroph%'
Key Tables Reference
GOLD (gold-db-2)
| Table | Description | Rows |
|---|---|---|
cvphenotype |
Phenotype vocabulary | 159 |
organism_phenotype |
Organism-phenotype mappings | 7,113 |
cvmetabolism |
Metabolism vocabulary | 157 |
organism_metabolism |
Organism-metabolism mappings | 2,423 |
cvenergy_source |
Energy source vocabulary | 31 |
IMG (img-db-2)
| Table | Schema | Description | Rows |
|---|---|---|---|
phenotype_rule |
img_ext | Rule definitions | 65 |
phenotype_rule_taxons |
img_ext | Rule-taxon mappings | 2,059,203 |
img_pathway |
img_ext | Pathway definitions | ~1,000 |
img_pathway_reactions |
img_ext | Pathway-reaction links | ~10,000 |
img_term |
img_ext | Enzyme term definitions | ~10,000 |
gene_img_functions |
img_ext | Gene-enzyme annotations | Millions |
GapMind (KBase)
| Table | Description | Rows |
|---|---|---|
gapmind_pathways |
Pathway completeness assessments | 463,729,001 |
Reference: Price et al. 2020 "GapMind: Automated Annotation of Amino Acid Biosynthesis" mSystems 5:e00291-20
Recommendations
- For curated, high-confidence traits: Use GOLD
organism_phenotype - For broad coverage predictions: Use IMG
phenotype_rule_taxons - For amino acid auxotrophy/prototrophy: Both IMG rules and GapMind are comprehensive
- For pathway completeness with scores: Use GapMind
gapmind_pathways - For pathogenicity: GOLD curation is more reliable
- For cross-validation: Join sources on NCBI taxonomy ID or assembly accession
Data Availability Note
MicroTraits database is not currently integrated into the JGI or KBase lakehouses. If you need MicroTraits data, you may need to access it from its original source.