Microbial Traits Across Databases

This document explains how microbial traits (phenotypes, metabolic capabilities) are modeled across different databases in the JGI and KBase lakehouses.

Overview

Trait data exists in multiple databases with different modeling approaches:

Database	Approach	Tables	Coverage
GOLD	Curated phenotypes	`cvphenotype`, `organism_phenotype`	7,113 mappings
IMG	Pathway-based rules	`phenotype_rule`, `phenotype_rule_taxons`	2M+ predictions
GapMind	Pathway completeness	`gapmind_pathways`	463M assessments
NMDC	Unified (planned)	`trait_unified`, `trait_sources`	Not yet implemented

Not Available: MicroTraits database is not currently integrated into these lakehouses.

GOLD: Curated Phenotypes

GOLD stores manually curated phenotype annotations assigned by experts during genome registration.

Schema

cvphenotype (159 terms)
    │
    │ phenotype_id
    ▼
organism_phenotype (7,113 mappings)
    │
    │ organism_id
    ▼
organism_v2 (organism metadata)

Controlled Vocabulary (cvphenotype)

159 phenotype terms organized into functional categories:

Category	Example Terms	Count
Pathogenicity	Pathogen, Non-Pathogen, Opportunistic Pathogen	~10
Extremophiles	Acidophile, Alkaliphile, Thermoacidophile, Piezophile	~8
Metabolism	Nitrogen cycle, Sulfur cycle, Carbon sequestration	~5
Biochemical	Catalase positive, Catalase negative, Beta-hemolytic	~15
Ecological	Symbiont, Probiotic, Biofilm, Bioluminescent	~20

Top Phenotypes by Usage

SELECT p.term, COUNT(*) as organism_count
FROM "gold-db-2 postgresql".gold.organism_phenotype op
JOIN "gold-db-2 postgresql".gold.cvphenotype p ON op.phenotype_id = p.id
GROUP BY p.term
ORDER BY organism_count DESC
LIMIT 10

Phenotype	Organisms
Pathogen	3,132
Non-Pathogen	576
Catalase negative	520
Catalase positive	414
Opportunistic Pathogen	279
Beta-hemolytic	236
Parasite	146
Acidophile	143
Intracellular pathogen	123
Probiotic	106

GOLD uses the same pattern for other organism characteristics:

CV Table	Terms	Mapping Table	Assignments
`cvphenotype`	159	`organism_phenotype`	7,113
`cvmetabolism`	157	`organism_metabolism`	2,423
`cvhabitat`	957	`organism_habitat`	46,296
`cvenergy_source`	31	`organism_energy_source`	8,316
`cvbiotic_relationship`	3	`organism_biotic_rel`	9,434
`cvdisease`	434	`organism_disease`	12,520

IMG: Pathway-Based Phenotype Rules

IMG predicts phenotypes computationally based on pathway presence/absence in genomes.

Schema

img_pathway (pathway definitions)
    │
    │ pathway_oid
    ▼
img_pathway_reactions (ordered reaction steps)
    │
    │ rxn_oid
    ▼
img_reaction → img_reaction_catalysts → img_term (enzyme definitions)
    │
    │ term_oid
    ▼
gene_img_functions (gene → enzyme annotations)
    │
    │ gene_oid
    ▼
gene (individual genes in genomes)

Phenotype Rules

65 logical rules that predict traits from pathway presence:

SELECT rule_id, name, cv_type, cv_value, rule
FROM "img-db-2 postgresql".img_ext.phenotype_rule
LIMIT 10

rule_id	name	cv_type	cv_value	rule
2	Aerobe	oxygen_req	Aerobe	`(768\\|769\\|770)`
4	Denitrifier	metabolism	Denitrifying	`(764),(765),(766),(767)`
6	Carbon fixation	metabolism	Carbon fixation	`(527\\|335\\|285\\|...)`
7	L-lysine auxotroph	metabolism	Auxotroph	`(!170),(!169),(!199),...`
22	L-histidine auxotroph	metabolism	Auxotroph	`(!162)`
52	Dissimilatory sulfate reduction	metabolism	Sulfate reducer	`(870),(864),(871)`
53	Nitrogen fixer	metabolism	Nitrogen fixation	`(798)`

Rule Syntax

Rules use IMG pathway IDs with logical operators:

Operator	Meaning	Example
`(N)`	Pathway N present	`(798)` = has nitrogenase
`(!N)`	Pathway N absent	`(!162)` = lacks histidine synthesis
`\\|`	OR (within group)	`(768\\|769\\|770)` = has ANY quinol oxidase
`,`	AND (between groups)	`(764),(765)` = has BOTH pathways

Example: Lysine Auxotroph Rule

The rule (!170),(!169),(!199),(!333),(!465),(!171) means "lacks ALL lysine biosynthesis pathways":

Pathway	Name	If Absent
170	L-lysine synthesis (acetylated intermediates)	Missing DAP-acetyl route
169	L-lysine synthesis (succinylated intermediates)	Missing DAP-succinyl route
199	L-lysine synthesis by fungal aminoadipate	Missing fungal route
333	L-lysine synthesis by prokaryotic aminoadipate	Missing bacterial AAA route
465	L-lysine synthesis with DAP aminotransferase	Missing DAP-AT route
171	L-lysine synthesis with DAP dehydrogenase	Missing DAP-DH route

An organism is predicted as L-lysine auxotroph if it lacks ALL six pathways.

How Pathways Are Defined

Each pathway is defined as ordered reactions with required enzymes:

-- Pathway 170: L-lysine synthesis (acetylated intermediates)
SELECT pr.rxn_order, r.rxn_name, t.term as enzyme, te.enzymes as ec_number
FROM "img-db-2 postgresql".img_ext.img_pathway_reactions pr
JOIN "img-db-2 postgresql".img_ext.img_reaction r ON pr.rxn = r.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_reaction_catalysts rc ON r.rxn_oid = rc.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_term t ON rc.catalysts = t.term_oid
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te ON t.term_oid = te.term_oid
WHERE pr.pathway_oid = 170
ORDER BY pr.rxn_order

Step	Enzyme	EC Number
1	Aspartate kinase	EC:2.7.2.4
2	Aspartate semialdehyde dehydrogenase	EC:1.2.1.11
3	Dihydrodipicolinate synthase	EC:4.2.1.52
4	Dihydrodipicolinate reductase	EC:1.3.1.26
5	THDP N-acetyltransferase	EC:2.3.1.89
6	Acetyldiaminopimelate aminotransferase	EC:2.6.1.-
7	N-acetyl-DAP deacetylase	EC:3.5.1.47
8	Diaminopimelate epimerase	EC:5.1.1.7
9	Diaminopimelate decarboxylase	EC:4.1.1.20

Gene Annotation Pipeline

Genes are annotated to enzymes via multiple evidence types:

SELECT gif.evidence, COUNT(*) as gene_count
FROM "img-db-2 postgresql".img_ext.gene_img_functions gif
GROUP BY gif.evidence

Evidence	Description
BLAST	Sequence similarity to characterized proteins
Inferred	Transferred from ortholog annotations
HMM	Hidden Markov Model profile matches

Phenotype Rule Coverage

SELECT pr.name, COUNT(prt.taxon) as taxon_count
FROM "img-db-2 postgresql".img_ext.phenotype_rule pr
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON pr.rule_id = prt.rule_id
GROUP BY pr.name
ORDER BY taxon_count DESC
LIMIT 10

Phenotype Rule	Taxa Count
L-histidine auxotroph	74,695
L-glutamate prototroph	73,791
L-phenylalanine auxotroph	72,742
L-tyrosine auxotroph	72,589
Non-biotin synthesizer	71,742
L-lysine auxotroph	69,651
Non-selenocysteine synthesizer	67,786
Glycine prototroph	66,760
L-leucine auxotroph	63,982
Aerobe	45,956

GapMind: Pathway Completeness Scoring

GapMind (Price et al. 2020) assesses metabolic pathway completeness by searching for characterized enzymes and transporters in genome sequences.

Schema

gapmind_pathways (463M assessments)
    │
    │ genome_id
    ▼
kbase_ke_pangenome.genome (293K genomes)

Pathway Categories

80 metabolic pathways organized into three categories:

Category	Description	Example Pathways
`aa`	Amino acid biosynthesis	arg, asn, cys, gln, his, ile, leu, lys, met, phe, pro, ser, thr, trp, tyr, val
`carbon`	Carbon source utilization	sucrose, lactose, maltose, trehalose, cellobiose, glucose
`aromatic`	Aromatic compound metabolism	4-hydroxybenzoate, phenylacetate

Scoring System

Field	Description
`nHi`	High-confidence gene hits (strong matches)
`nMed`	Medium-confidence gene hits
`nLo`	Low-confidence gene hits (weak matches)
`score`	Overall score (positive = complete, negative = incomplete)
`score_category`	`present`, `partial`, or `not_present`
`score_simplified`	1.0 (present), 0.5 (partial), 0.0 (not_present)

Example Query

-- Find genomes that can synthesize all amino acids
SELECT genome_id, COUNT(*) as complete_pathways
FROM globalusers_gapmind_pathways.gapmind_pathways
WHERE metabolic_category = 'aa'
  AND score_category = 'present'
GROUP BY genome_id
HAVING COUNT(*) >= 20

Top Pathways by Assessment Count

Pathway	Assessments	Category
phenylalanine	18,603,662	aa
arginine	15,945,996	aa
citrulline	14,617,163	aa
4-hydroxybenzoate	14,617,163	aromatic
threonine	14,617,163	aa
sucrose	13,288,330	carbon
lactose	13,288,330	carbon

GapMind vs IMG Phenotype Rules

Both predict auxotrophy/prototrophy but differ in approach:

Aspect	GapMind	IMG Rules
Method	Score-based pathway completeness	Boolean pathway presence
Output	present/partial/not_present	auxotroph/prototroph
Granularity	Individual pathways with scores	Binary predictions
Coverage	463M assessments, 80 pathways	2M taxon mappings, 65 rules
Reference	Price et al. 2020	IMG internal

Cross-Validation Example

-- Compare GapMind lysine scores with IMG lysine auxotroph predictions
SELECT
    g.genome_id,
    gm.score_category as gapmind_lys,
    CASE WHEN prt.taxon IS NOT NULL THEN 'auxotroph' ELSE 'prototroph' END as img_lys
FROM kbase_ke_pangenome.genome g
LEFT JOIN globalusers_gapmind_pathways.gapmind_pathways gm
    ON g.genome_id = gm.genome_id AND gm.pathway = 'lys'
LEFT JOIN "img-db-2 postgresql".img_ext.taxon t
    ON SUBSTRING(g.genome_id FROM 4) = t.ncbi_assembly_accession
LEFT JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt
    ON t.taxon_oid = prt.taxon AND prt.rule_id = 7  -- L-lysine auxotroph rule
LIMIT 100

NMDC: Unified Trait Model (Planned)

NMDC aims to unify trait data from multiple sources with provenance tracking.

Planned Schema

trait_sources (provenance)
    │
    │ source_id
    ▼
trait_taxonomy_mapping (trait-taxon assignments)
    │
    │ trait_id
    ▼
trait_unified (unified vocabulary)

Intended Data Flow

GOLD cvphenotype ─────┐
     (159 terms)      │
                      ├──► trait_unified (combined vocabulary)
IMG phenotype_rule ───┤         │
     (65 rules)       │         ▼
                      │    trait_taxonomy_mapping
Literature mining ────┘         │
                               ▼
                        trait_sources (provenance)

Source Types

Source Type	Origin	Confidence	Coverage
`gold_curated`	GOLD organism_phenotype	High	7,113
`img_rule_based`	IMG phenotype_rule_taxons	Medium-High	2,059,203
`literature`	PubMed NLP extraction	Variable	TBD
`computed`	ML predictions	Requires validation	TBD

Comparing Approaches

Aspect	GOLD	IMG Rules	GapMind
Method	Manual curation	Boolean pathway rules	Scored pathway completeness
Confidence	High (expert)	Medium-high	Medium-high
Coverage	Limited (7K)	Broad (2M+)	Very broad (463M)
Granularity	Organism-level	Genome-level	Genome × pathway
Output	Categorical traits	Binary predictions	Scores + categories
Update frequency	Slow (manual)	Automatic	Automatic
Best for	Pathogenicity, extremophiles	Auxotrophy/prototrophy	Metabolic potential

Cross-Database Queries

Find organisms with both GOLD and IMG phenotype data

SELECT
    g.organism_name,
    gp.term as gold_phenotype,
    pr.name as img_predicted_trait
FROM "gold-db-2 postgresql".gold.organism_v2 g
JOIN "gold-db-2 postgresql".gold.organism_phenotype op ON g.organism_id = op.organism_id
JOIN "gold-db-2 postgresql".gold.cvphenotype gp ON op.phenotype_id = gp.id
JOIN "img-db-2 postgresql".img_ext.taxon t ON g.ncbi_taxon_id = t.ncbi_taxon_id
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON t.taxon_oid = prt.taxon
JOIN "img-db-2 postgresql".img_ext.phenotype_rule pr ON prt.rule_id = pr.rule_id
WHERE gp.term = 'Nitrogen fixation'
  AND pr.name = 'Nitrogen fixer'
LIMIT 10

Validate IMG rules against GOLD curation

-- Find genomes where IMG predicts auxotrophy but GOLD says prototrophic
-- (potential false positives or annotation errors)
SELECT
    g.organism_name,
    pr.name as img_prediction,
    gp.term as gold_annotation
FROM "gold-db-2 postgresql".gold.organism_v2 g
JOIN "img-db-2 postgresql".img_ext.taxon t ON g.ncbi_taxon_id = t.ncbi_taxon_id
JOIN "img-db-2 postgresql".img_ext.phenotype_rule_taxons prt ON t.taxon_oid = prt.taxon
JOIN "img-db-2 postgresql".img_ext.phenotype_rule pr ON prt.rule_id = pr.rule_id
JOIN "gold-db-2 postgresql".gold.organism_phenotype op ON g.organism_id = op.organism_id
JOIN "gold-db-2 postgresql".gold.cvphenotype gp ON op.phenotype_id = gp.id
WHERE pr.cv_value = 'Auxotroph'
  AND gp.term LIKE '%prototroph%'

Key Tables Reference

GOLD (gold-db-2)

Table	Description	Rows
`cvphenotype`	Phenotype vocabulary	159
`organism_phenotype`	Organism-phenotype mappings	7,113
`cvmetabolism`	Metabolism vocabulary	157
`organism_metabolism`	Organism-metabolism mappings	2,423
`cvenergy_source`	Energy source vocabulary	31

IMG (img-db-2)

Table	Schema	Description	Rows
`phenotype_rule`	img_ext	Rule definitions	65
`phenotype_rule_taxons`	img_ext	Rule-taxon mappings	2,059,203
`img_pathway`	img_ext	Pathway definitions	~1,000
`img_pathway_reactions`	img_ext	Pathway-reaction links	~10,000
`img_term`	img_ext	Enzyme term definitions	~10,000
`gene_img_functions`	img_ext	Gene-enzyme annotations	Millions

GapMind (KBase)

Table	Description	Rows
`gapmind_pathways`	Pathway completeness assessments	463,729,001

Reference: Price et al. 2020 "GapMind: Automated Annotation of Amino Acid Biosynthesis" mSystems 5:e00291-20

Recommendations

For curated, high-confidence traits: Use GOLD organism_phenotype
For broad coverage predictions: Use IMG phenotype_rule_taxons
For amino acid auxotrophy/prototrophy: Both IMG rules and GapMind are comprehensive
For pathway completeness with scores: Use GapMind gapmind_pathways
For pathogenicity: GOLD curation is more reliable
For cross-validation: Join sources on NCBI taxonomy ID or assembly accession

Data Availability Note

MicroTraits database is not currently integrated into the JGI or KBase lakehouses. If you need MicroTraits data, you may need to access it from its original source.

Microbial Traits Across Databases

Overview

GOLD: Curated Phenotypes

Schema

Controlled Vocabulary (cvphenotype)

Top Phenotypes by Usage

Related GOLD Trait Tables

IMG: Pathway-Based Phenotype Rules

Schema

Phenotype Rules

Rule Syntax

Example: Lysine Auxotroph Rule

How Pathways Are Defined

Gene Annotation Pipeline

Phenotype Rule Coverage

GapMind: Pathway Completeness Scoring

Schema

Pathway Categories

Scoring System

Example Query

Top Pathways by Assessment Count

GapMind vs IMG Phenotype Rules

Cross-Validation Example

NMDC: Unified Trait Model (Planned)

Planned Schema

Intended Data Flow

Source Types

Comparing Approaches

Cross-Database Queries

Find organisms with both GOLD and IMG phenotype data

Validate IMG rules against GOLD curation

Key Tables Reference

GOLD (gold-db-2)

IMG (img-db-2)

GapMind (KBase)

Recommendations

Data Availability Note