Functional Annotation Across Databases

This document explains how functional annotations (gene functions, pathways, enzymes) are modeled across the JGI and KBase lakehouses.

Overview

Functional annotation connects genes to their biological roles through multiple ontologies and classification systems:

Database	Primary Annotations	Coverage
IMG	IMG Terms, KO, EC, COG, Pfam	Billions of gene annotations
GOLD	Organism-level metadata	Project context
KEGG	Pathways, Modules, KO	8K+ orthologs, 300+ pathways
GO	Gene Ontology terms	48K terms
NMDC	Unified annotations	67K terms across sources

IMG: The Annotation Backbone

IMG (Integrated Microbial Genomes) provides the most comprehensive functional annotation system in the JGI lakehouse.

Annotation Hierarchy

Gene (gene_oid)
    │
    ├──► IMG Terms (gene_img_functions)
    │        └──► EC Numbers (img_term_enzymes)
    │
    ├──► KEGG KO (gene_ko_terms)
    │        └──► KEGG Pathways (ko_term_pathways)
    │
    ├──► COG (gene_cog_groups)
    │        └──► COG Categories (cog_function)
    │
    ├──► Pfam (gene_pfam_families)
    │
    └──► TIGRfam (gene_tigrfam_families)

Gene Annotation Tables

IMG Terms (Custom Enzyme Definitions)

IMG maintains its own enzyme term vocabulary with curated definitions:

SELECT t.term_oid, t.term, te.enzymes as ec_number
FROM "img-db-2 postgresql".img_ext.img_term t
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te
    ON t.term_oid = te.term_oid
WHERE t.term LIKE '%aspartate%'
LIMIT 10

term_oid	term	ec_number
297	aspartate semialdehyde dehydrogenase	EC:1.2.1.11
296	aspartate kinase	EC:2.7.2.4

Gene-to-term mappings with evidence:

SELECT gif.gene_oid, gif.function as term_oid, gif.evidence, t.term
FROM "img-db-2 postgresql".img_ext.gene_img_functions gif
JOIN "img-db-2 postgresql".img_ext.img_term t ON gif.function = t.term_oid
WHERE gif.taxon = 637000001  -- specific genome
LIMIT 10

Evidence Type	Description
BLAST	Sequence similarity to characterized proteins
Inferred	Transferred from ortholog annotations
HMM	Hidden Markov Model profile matches

KEGG Orthology (KO)

KO terms link genes to KEGG pathways and modules:

-- Gene KO assignments
SELECT gk.gene_oid, gk.ko_terms, gk.percent_identity, gk.evalue
FROM "img-db-2 postgresql".img_core_v400.gene_ko_terms gk
WHERE gk.taxon = 637000001
LIMIT 10

Field	Description
`ko_terms`	KO identifier (e.g., KO:K00001)
`percent_identity`	Sequence identity to reference
`evalue`	BLAST E-value

-- KO term definitions
SELECT ko_id, ko_name, definition
FROM "img-db-2 postgresql".img_core_v400.ko_term
WHERE ko_id IN ('KO:K00928', 'KO:K00133', 'KO:K01714')

ko_id	ko_name	definition
KO:K00928	lysC	aspartate kinase [EC:2.7.2.4]
KO:K00133	asd	aspartate-semialdehyde dehydrogenase [EC:1.2.1.11]
KO:K01714	dapA	4-hydroxy-tetrahydrodipicolinate synthase [EC:4.3.3.7]

COG Functional Categories

COG assigns genes to 26 single-letter categories for broad functional classification:

SELECT gc.gene_oid, gc.cog, cf.function_code, cf.definition
FROM "img-db-2 postgresql".img_core_v400.gene_cog_groups gc
JOIN "img-db-2 postgresql".img_core_v400.cog_function cf
    ON gc.cog = cf.cog_id
LIMIT 10

Code	Category	Description
J	Information	Translation, ribosomal structure and biogenesis
K	Information	Transcription
L	Information	Replication, recombination and repair
C	Metabolism	Energy production and conversion
G	Metabolism	Carbohydrate transport and metabolism
E	Metabolism	Amino acid transport and metabolism
P	Metabolism	Inorganic ion transport and metabolism
S	Unknown	Function unknown
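The category table above can be turned into a small lookup for rolling per-gene COG codes up into top-level groups. The following is a minimal sketch, assuming the `function_code` values have already been fetched into a Python list; the `COG_GROUP` dictionary and the `summarize_cog_codes` helper are hypothetical names, and only the eight codes shown in the table are covered.

```python
from collections import Counter

# Top-level group for each COG code listed in the table above.
# Only the eight codes shown there are included; the remaining
# codes would need to be added for a complete mapping.
COG_GROUP = {
    "J": "Information", "K": "Information", "L": "Information",
    "C": "Metabolism", "G": "Metabolism", "E": "Metabolism", "P": "Metabolism",
    "S": "Unknown",
}


def summarize_cog_codes(codes):
    """Count genes per top-level group; codes not in COG_GROUP go to 'Other'."""
    return Counter(COG_GROUP.get(code, "Other") for code in codes)


# Hypothetical function_code values, one entry per gene
example_codes = ["J", "K", "C", "E", "E", "S", "X"]
summary = summarize_cog_codes(example_codes)
print(dict(summary))
```

Codes outside the table fall into an explicit "Other" bucket rather than being dropped, so gaps in the mapping stay visible in the totals.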

Pathway Systems

IMG Pathways

IMG defines pathways as ordered sequences of reactions:

img_pathway
    │
    │ pathway_oid
    ▼
img_pathway_reactions (ordered steps)
    │
    │ rxn_oid
    ▼
img_reaction (biochemical equations)
    │
    │ rxn_oid
    ▼
img_reaction_catalysts
    │
    │ term_oid
    ▼
img_term (enzyme definitions)
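To make the table chain above concrete, the sketch below rebuilds it in an in-memory SQLite database and walks it with the same join keys (`pathway_oid`, `rxn`/`rxn_oid`, `catalysts`/`term_oid`). It is an illustration only: the tables are reduced to the columns needed for the joins, the `rxn_oid` values and reaction names are hypothetical placeholders, and the wiring of terms 296 and 297 to pathway 170 is assumed for the demo rather than read from IMG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE img_pathway_reactions (pathway_oid INTEGER, rxn_order INTEGER, rxn INTEGER);
CREATE TABLE img_reaction (rxn_oid INTEGER, rxn_name TEXT);
CREATE TABLE img_reaction_catalysts (rxn_oid INTEGER, catalysts INTEGER);
CREATE TABLE img_term (term_oid INTEGER, term TEXT);

-- rxn_oid values 1001/1002 and the reaction names are hypothetical placeholders
INSERT INTO img_pathway_reactions VALUES (170, 1, 1001), (170, 2, 1002);
INSERT INTO img_reaction VALUES (1001, 'reaction A'), (1002, 'reaction B');
INSERT INTO img_reaction_catalysts VALUES (1001, 296), (1002, 297);
INSERT INTO img_term VALUES
    (296, 'aspartate kinase'),
    (297, 'aspartate semialdehyde dehydrogenase');
""")

# Follow pathway -> ordered reactions -> catalysts -> enzyme terms
rows = conn.execute("""
SELECT pr.rxn_order, t.term
FROM img_pathway_reactions pr
JOIN img_reaction r ON pr.rxn = r.rxn_oid
JOIN img_reaction_catalysts rc ON r.rxn_oid = rc.rxn_oid
JOIN img_term t ON rc.catalysts = t.term_oid
WHERE pr.pathway_oid = 170
ORDER BY pr.rxn_order
""").fetchall()

for rxn_order, term in rows:
    print(rxn_order, term)
```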

Example: Lysine Biosynthesis Pathway

SELECT
    pr.rxn_order,
    r.rxn_name,
    t.term as enzyme,
    te.enzymes as ec
FROM "img-db-2 postgresql".img_ext.img_pathway_reactions pr
JOIN "img-db-2 postgresql".img_ext.img_reaction r ON pr.rxn = r.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_reaction_catalysts rc ON r.rxn_oid = rc.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_term t ON rc.catalysts = t.term_oid
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te ON t.term_oid = te.term_oid
WHERE pr.pathway_oid = 170  -- L-lysine synthesis (acetylated)
ORDER BY pr.rxn_order

Step	Enzyme	EC
1	aspartate kinase	EC:2.7.2.4
2	aspartate semialdehyde dehydrogenase	EC:1.2.1.11
3	dihydrodipicolinate synthase	EC:4.2.1.52
4	dihydrodipicolinate reductase	EC:1.3.1.26
5	THDP N-acetyltransferase	EC:2.3.1.89
6	acetyldiaminopimelate aminotransferase	EC:2.6.1.-
7	N-acetyl-DAP deacetylase	EC:3.5.1.47
8	diaminopimelate epimerase	EC:5.1.1.7
9	diaminopimelate decarboxylase	EC:4.1.1.20
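Given the ordered steps above, a simple completeness score for a genome is the fraction of steps whose EC number appears among that genome's annotations. The sketch below is one way to compute it, under two assumptions: EC numbers are compared as exact strings (so the partial `EC:2.6.1.-` only matches itself), and the genome's EC set has already been collected. The `pathway_completeness` helper and the example genome are hypothetical.

```python
# EC numbers for the nine steps, copied from the table above
PATHWAY_170_ECS = [
    "EC:2.7.2.4", "EC:1.2.1.11", "EC:4.2.1.52", "EC:1.3.1.26", "EC:2.3.1.89",
    "EC:2.6.1.-", "EC:3.5.1.47", "EC:5.1.1.7", "EC:4.1.1.20",
]


def pathway_completeness(step_ecs, genome_ecs):
    """Return (fraction of steps covered, list of missing ECs in step order)."""
    if not step_ecs:
        raise ValueError("pathway has no steps")
    present = set(genome_ecs)
    missing = [ec for ec in step_ecs if ec not in present]
    fraction = (len(step_ecs) - len(missing)) / len(step_ecs)
    return fraction, missing


# Hypothetical genome lacking the three acetyl-branch enzymes (steps 5-7)
genome_ecs = {
    "EC:2.7.2.4", "EC:1.2.1.11", "EC:4.2.1.52",
    "EC:1.3.1.26", "EC:5.1.1.7", "EC:4.1.1.20",
}
fraction, missing = pathway_completeness(PATHWAY_170_ECS, genome_ecs)
print(f"{fraction:.2f}", missing)
```

Returning the missing steps alongside the score makes it easier to tell a genuinely absent branch from a single unannotated gene.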

KEGG Pathways

KEGG organizes metabolism into hierarchical pathway maps:

-- KO to pathway mapping
SELECT kp.ko_id, kt.ko_name, kp.pathway_oid, kp.image_id
FROM "img-db-2 postgresql".img_core_v400.ko_term_pathways kp
JOIN "img-db-2 postgresql".img_core_v400.ko_term kt ON kp.ko_id = kt.ko_id
WHERE kp.image_id = 'map00300'  -- Lysine biosynthesis

Pathway	Map ID	Description
Glycolysis	map00010	Glycolysis / Gluconeogenesis
TCA cycle	map00020	Citrate cycle
Lysine biosynthesis	map00300	Lysine biosynthesis
Nitrogen metabolism	map00910	Nitrogen metabolism

NMDC Unified Annotations

NMDC provides a unified view across annotation sources:

annotation_terms_unified

SELECT source, term_id, name, namespace, is_obsolete
FROM nmdc_core.annotation_terms_unified
WHERE name LIKE '%kinase%'
LIMIT 10

Source	Count	ID Format
GO	48,196	GO:0000001
EC	8,813	1.1.1.1
KEGG KO	8,104	K00001
MetaCyc	1,538	pathway-id
KEGG Module	370	M00001
KEGG Pathway	306	ko00010
COG	26	J, K, L, ...
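The ID formats above differ from the ones IMG returns: IMG examples in this document carry prefixes (`KO:K00928`, `EC:2.7.2.4`) and use `map`-style pathway IDs, while the NMDC formats are bare (`K00001`, `1.1.1.1`) and use `ko`-style pathway IDs. A small normalizer can help when joining across the two. This is a sketch; `to_nmdc_id` is a hypothetical helper, and the `map` to `ko` rewrite assumes both lakehouses follow the KEGG convention of sharing the five-digit pathway number.

```python
import re


def to_nmdc_id(img_id):
    """Convert an IMG-style identifier to the NMDC-style format.

    Identifiers that match none of the known patterns are returned unchanged.
    """
    # IMG prefixes KO and EC identifiers; the NMDC formats above are bare
    if img_id.startswith("KO:") or img_id.startswith("EC:"):
        return img_id[3:]
    # Assumption: map00300 and ko00300 refer to the same KEGG pathway
    if re.fullmatch(r"map\d{5}", img_id):
        return "ko" + img_id[3:]
    return img_id


for example in ["KO:K00928", "EC:2.7.2.4", "map00300", "GO:0006096"]:
    print(example, "->", to_nmdc_id(example))
```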

GO Hierarchy

`go_hierarchy_flat` stores a pre-computed transitive closure of the GO graph, so ancestor queries need no recursive traversal:

-- Find all ancestors of a GO term
SELECT go_id, namespace, all_parents, all_ancestors, depth
FROM nmdc_core.go_hierarchy_flat
WHERE go_id = 'GO:0006096'  -- glycolytic process

Field	Description
`all_parents`	Direct parent terms (semicolon-separated)
`all_ancestors`	All ancestor terms (transitive closure)
`depth`	Distance from root (root = 1)
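Because the closure is stored as a delimited string, an "is X a descendant of Y" check becomes a set-membership test once the field is parsed. The sketch below assumes `all_ancestors` uses the same semicolon separator documented for `all_parents`. The helper names are hypothetical, and the example row holds an illustrative subset of ancestors, not the real contents of the table.

```python
def parse_terms(field):
    """Split a semicolon-separated term list into a set, ignoring blanks."""
    if not field:
        return set()
    return {term.strip() for term in field.split(";") if term.strip()}


def is_descendant_of(row, ancestor_id):
    """True if ancestor_id appears in the row's pre-computed ancestor list."""
    return ancestor_id in parse_terms(row["all_ancestors"])


# Illustrative row for GO:0006096 (glycolytic process); the ancestor list is
# a small hand-written subset for demonstration only
row = {
    "go_id": "GO:0006096",
    "all_ancestors": "GO:0008152;GO:0008150",
}

print(is_descendant_of(row, "GO:0008150"))  # biological_process
print(is_descendant_of(row, "GO:0003674"))  # molecular_function
```

For GO enrichment, the same parsed set can be used to propagate a gene's direct annotations up to every ancestor term before counting.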

Cross-Database Annotation Mapping

EC ↔ KO ↔ GO

Multiple ontologies annotate the same biological functions:

-- Illustrative literal rows (not a table lookup): alcohol dehydrogenase in three vocabularies
SELECT 'EC' as source, '1.1.1.1' as id, 'alcohol dehydrogenase' as name
UNION ALL
SELECT 'KO', 'K00001', 'alcohol dehydrogenase'
UNION ALL
SELECT 'GO', 'GO:0004022', 'alcohol dehydrogenase (NAD+) activity'

Mapping Tables

From	To	Table
IMG Term	EC	`img_term_enzymes`
KO	EC	Embedded in the KO `definition` text (e.g., `[EC:2.7.2.4]`)
KO	Pathway	`ko_term_pathways`
GO	EC	`go_terms` (via dbxref)
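Since the KO to EC mapping has no table of its own, the EC numbers have to be parsed out of the KO definition text, as in `aspartate kinase [EC:2.7.2.4]`. The following is a minimal sketch of that extraction. The `ec_numbers_from_definition` helper is hypothetical, and the handling of several space-separated EC numbers inside one bracket follows the usual KEGG convention, which is an assumption about how `ko_term.definition` is populated.

```python
import re

# Matches the bracketed EC block at the end of a KO definition
EC_BLOCK = re.compile(r"\[EC:([^\]]+)\]")


def ec_numbers_from_definition(definition):
    """Return the EC numbers embedded in a KO definition, IMG-style ('EC:x.x.x.x')."""
    match = EC_BLOCK.search(definition)
    if match is None:
        return []
    # Assumption: multiple EC numbers are separated by whitespace
    return ["EC:" + ec for ec in match.group(1).split()]


definitions = {
    "KO:K00928": "aspartate kinase [EC:2.7.2.4]",
    "KO:K00133": "aspartate-semialdehyde dehydrogenase [EC:1.2.1.11]",
    "KO:K01714": "4-hydroxy-tetrahydrodipicolinate synthase [EC:4.3.3.7]",
}
ko_to_ec = {ko: ec_numbers_from_definition(text) for ko, text in definitions.items()}
print(ko_to_ec)
```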

Annotation Pipeline Overview

IMG Annotation Process

1. Gene Calling
   └── Prodigal/GeneMark → gene coordinates

2. Functional Annotation
   ├── BLAST vs UniProt/RefSeq → top hits
   ├── HMM vs Pfam/TIGRfam → domain assignments
   ├── BLAST vs KO profiles → KO assignments
   └── COG assignment → functional categories

3. Pathway Inference
   ├── Map genes to IMG terms
   ├── Check pathway completeness
   └── Infer phenotypes (phenotype_rule)

Evidence Hierarchy

Evidence	Confidence	Source / Criteria
Experimentally characterized	Highest	Literature curation
High-confidence BLAST	High	>70% identity, full length
HMM above trusted cutoff	High	Pfam/TIGRfam
Medium BLAST	Medium	40-70% identity
Low BLAST / partial HMM	Low	Requires validation
Inferred from ortholog	Variable	Depends on ortholog evidence
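The BLAST rows of this hierarchy can be expressed as a small classifier over `percent_identity`. The sketch below is one reading of the thresholds, not the pipeline's actual logic: the table does not say how to treat exactly 70% identity or a high-identity partial hit, so both are assigned to "Medium" here as an assumption. The `blast_confidence` function and its `full_length` flag are hypothetical.

```python
def blast_confidence(percent_identity, full_length=True):
    """Map a BLAST hit to a confidence tier from the evidence table.

    Assumptions: exactly 70% identity counts as Medium, and a hit above
    70% that is not full length is downgraded to Medium.
    """
    if percent_identity > 70 and full_length:
        return "High"
    if percent_identity >= 40:
        return "Medium"
    return "Low"


# Hypothetical hits: (percent_identity, full_length)
for identity, full in [(85.0, True), (85.0, False), (55.0, True), (30.0, True)]:
    print(identity, full, blast_confidence(identity, full))
```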

Querying Functional Annotations

Find all genes in a pathway

-- All genes annotated to lysine biosynthesis enzymes
SELECT DISTINCT g.gene_oid, g.locus_tag, t.term, te.enzymes
FROM "img-db-2 postgresql".img_ext.gene_img_functions gif
JOIN "img-db-2 postgresql".img_core_v400.gene g ON gif.gene_oid = g.gene_oid
JOIN "img-db-2 postgresql".img_ext.img_term t ON gif.function = t.term_oid
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te ON t.term_oid = te.term_oid
JOIN "img-db-2 postgresql".img_ext.img_reaction_catalysts rc ON t.term_oid = rc.catalysts
JOIN "img-db-2 postgresql".img_ext.img_pathway_reactions pr ON rc.rxn_oid = pr.rxn
WHERE pr.pathway_oid = 170
  AND gif.taxon = 637000001

Count genes by COG category

SELECT cf.function_code, cf.definition, COUNT(DISTINCT gc.gene_oid) as gene_count
FROM "img-db-2 postgresql".img_core_v400.gene_cog_groups gc
JOIN "img-db-2 postgresql".img_core_v400.cog_function cf ON gc.cog = cf.cog_id
WHERE gc.taxon = 637000001
GROUP BY cf.function_code, cf.definition
ORDER BY gene_count DESC

Find enzymes for a reaction

-- What enzymes catalyze a specific reaction?
SELECT r.rxn_name, r.rxn_equation, t.term, te.enzymes
FROM "img-db-2 postgresql".img_ext.img_reaction r
JOIN "img-db-2 postgresql".img_ext.img_reaction_catalysts rc ON r.rxn_oid = rc.rxn_oid
JOIN "img-db-2 postgresql".img_ext.img_term t ON rc.catalysts = t.term_oid
LEFT JOIN "img-db-2 postgresql".img_ext.img_term_enzymes te ON t.term_oid = te.term_oid
WHERE r.rxn_name LIKE '%aspartate%'

Key Tables Reference

IMG Core (img_core_v400)

Table	Description
`gene`	Gene coordinates and basic info
`gene_ko_terms`	Gene → KO assignments
`gene_cog_groups`	Gene → COG assignments
`gene_pfam_families`	Gene → Pfam domains
`ko_term`	KO term definitions
`ko_term_pathways`	KO → pathway mappings
`cog_function`	COG category definitions

IMG Extended (img_ext)

Table	Description
`gene_img_functions`	Gene → IMG term with evidence
`img_term`	IMG enzyme term definitions
`img_term_enzymes`	IMG term → EC number
`img_pathway`	Pathway definitions
`img_pathway_reactions`	Pathway → reaction ordering
`img_reaction`	Reaction definitions with equations
`img_reaction_catalysts`	Reaction → enzyme (IMG term)

NMDC (nmdc_core)

Table	Description
`annotation_terms_unified`	All annotation terms (GO, EC, KO, etc.)
`go_terms`	GO term definitions
`go_hierarchy_flat`	Pre-computed GO ancestry
`ec_terms`	EC number definitions
`kegg_ko_terms`	KO definitions
`kegg_pathway_terms`	KEGG pathway definitions
`cog_categories`	COG category definitions

Recommendations

For gene-level annotations: Start with IMG gene_ko_terms or gene_img_functions
For pathway analysis: Use IMG pathway tables or GapMind scores
For broad functional categories: Use COG categories
For domain architecture: Use Pfam/TIGRfam assignments
For GO enrichment: Use NMDC go_hierarchy_flat for efficient ancestor queries
For cross-database lookups: Use NMDC annotation_terms_unified