Querying Genes Across Lakehouse Tables

This document describes how to query for genes across all available lakehouse tables in the JGI, KBase, and NMDC data ecosystems.

Overview

Gene data is distributed across multiple databases with different schemas, identifiers, and levels of detail:

Database	Location	Primary Gene Table	Gene Count	Key Features
IMG Core	JGI Dremio	`img_core_v400.gene`	~billions	Main microbial gene database
IMG VR	JGI Dremio	`imgvr_prod.uvig`	millions	Viral sequences
NMDC	NMDC API	`functional_annotation_set`	varies	Metagenome annotations
KBase	KBase API	`kbase_ke_pangenome.gene`	varies	Pangenomic analyses

JGI IMG Core Database

The IMG (Integrated Microbial Genomes) database is the primary source for microbial gene data.

Main Gene Table

Schema: "img-db-2 postgresql".img_core_v400.gene

Key columns: - gene_oid (BIGINT) - Primary identifier - gene_symbol - Standard gene symbol (e.g., "tauC", "leuC2") - gene_display_name - Full gene name/description - product_name - Protein product name - locus_tag - Genome-specific locus identifier - taxon - Foreign key to taxon table - scaffold - Scaffold location - start_coord, end_coord, strand - Genomic coordinates

Searching by Gene Name

SELECT gene_oid, gene_symbol, gene_display_name, locus_tag, taxon
FROM "img-db-2 postgresql".img_core_v400.gene
WHERE LOWER(gene_display_name) LIKE '%kinase%'
LIMIT 100

Searching by Gene Symbol

SELECT gene_oid, gene_symbol, gene_display_name, locus_tag
FROM "img-db-2 postgresql".img_core_v400.gene
WHERE gene_symbol = 'tauC'

Gene Annotation Tables

IMG provides extensive annotation linkages through separate tables:

KEGG Orthologs (KO)

Table: img_core_v400.gene_ko_terms

SELECT g.gene_oid, g.gene_display_name, g.locus_tag, k.ko_terms, k.percent_identity
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_ko_terms k ON g.gene_oid = k.gene_oid
WHERE k.ko_terms = 'KO:K00001'  -- alcohol dehydrogenase
LIMIT 100

COG Groups

Table: img_core_v400.gene_cog_groups

SELECT g.gene_oid, g.gene_display_name, c.cog, c.percent_identity
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_cog_groups c ON g.gene_oid = c.gene_oid
WHERE c.cog = 'COG0001'  -- glutamate-1-semialdehyde aminomutase

Pfam Families

Table: img_core_v400.gene_pfam_families

SELECT g.gene_oid, g.gene_display_name, p.pfam_family, p.percent_identity
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_pfam_families p ON g.gene_oid = p.gene_oid
WHERE p.pfam_family = 'pfam00106'

GO Terms

Table: img_core_v400.gene_go_terms

SELECT g.gene_oid, g.gene_display_name, go.go_id, go.go_type, go.go_evidence
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_go_terms go ON g.gene_oid = go.gene_oid
WHERE go.go_id = 'GO:0003824'  -- catalytic activity

External Database Links

Table: img_core_v400.gene_ext_links

Links genes to external databases including: - UniProtKB, UniProtKB/Swiss-Prot, UniProtKB/TrEMBL - GeneID (NCBI) - PDB - InterPro - MGI, RGD, ZFIN (model organisms) - KEGG - And 70+ other databases

-- Find genes by UniProt ID
SELECT g.gene_oid, g.gene_display_name, e.db_name, e.id
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_ext_links e ON g.gene_oid = e.gene_oid
WHERE e.db_name = 'UniProtKB' AND e.id = 'Q62D23'

-- Count genes by external database
SELECT db_name, COUNT(*) as gene_count
FROM "img-db-2 postgresql".img_core_v400.gene_ext_links
GROUP BY db_name
ORDER BY gene_count DESC

Linking Genes to Taxonomy and GOLD

Genes link to taxonomic/genome information via the taxon column:

SELECT
    g.gene_oid,
    g.gene_display_name,
    g.locus_tag,
    t.taxon_display_name,
    t.ncbi_taxon_id,
    t.sample_gold_id,
    t.study_gold_id
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.taxon t ON g.taxon = t.taxon_oid
WHERE g.gene_symbol = 'dnaA'
LIMIT 10

IMG VR (Viral Genomes)

Schema: "img-db-1 mysql".imgvr_prod

Table	Description
`uvig`	Uncultivated viral genomes
`uvig_pfams`	Pfam annotations for viral genes
`uvig_hosts`	Predicted viral hosts
`uvig_lineage`	Viral taxonomy

SELECT uvig, taxon_oid, viral_otus, uvig_length, estimated_completeness
FROM "img-db-1 mysql".imgvr_prod.uvig
WHERE high_confidence = 'YES'
LIMIT 10

NMDC Gene Annotations

NMDC stores gene annotations in metagenome workflow outputs. Query via the NMDC API:

Using NMDC MCP Tools

# Find samples with specific functional annotations
from nmdc_mcp import get_samples_by_annotation

results = get_samples_by_annotation(
    gene_function_ids=["KEGG.ORTHOLOGY:K00001"],  # alcohol dehydrogenase
    max_records=10
)

Annotation ID Formats

PFAM: PFAM:PF04183
KEGG Orthology: KEGG.ORTHOLOGY:K00001
COG: COG:COG0001
GO: GO:GO0000001

GFF Files

NMDC metagenome annotations are stored in GFF3 format files: - *_functional_annotation.gff - Full functional annotations - *_ko.tsv - KEGG Orthology annotations - *_cog.gff - COG annotations - *_pfam.gff - Pfam annotations

Complete Gene Query Strategy

To comprehensively search for a gene across all lakehouses:

1. Search by Name/Description

-- IMG Core
SELECT gene_oid, gene_display_name, taxon
FROM "img-db-2 postgresql".img_core_v400.gene
WHERE LOWER(gene_display_name) LIKE '%your_gene%'

2. Search by Ortholog/Family

-- By KEGG Ortholog
SELECT g.gene_oid, g.gene_display_name, t.taxon_display_name
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_ko_terms k ON g.gene_oid = k.gene_oid
JOIN "img-db-2 postgresql".img_core_v400.taxon t ON g.taxon = t.taxon_oid
WHERE k.ko_terms = 'KO:KXXXXX'

3. Search by External ID

-- By UniProt
SELECT g.*, e.db_name, e.id
FROM "img-db-2 postgresql".img_core_v400.gene g
JOIN "img-db-2 postgresql".img_core_v400.gene_ext_links e ON g.gene_oid = e.gene_oid
WHERE e.id = 'YOUR_UNIPROT_ID'

4. Cross-Database Joins

-- Gene with all annotations
SELECT
    g.gene_oid,
    g.gene_display_name,
    g.locus_tag,
    k.ko_terms,
    c.cog,
    p.pfam_family,
    t.taxon_display_name
FROM "img-db-2 postgresql".img_core_v400.gene g
LEFT JOIN "img-db-2 postgresql".img_core_v400.gene_ko_terms k ON g.gene_oid = k.gene_oid
LEFT JOIN "img-db-2 postgresql".img_core_v400.gene_cog_groups c ON g.gene_oid = c.gene_oid
LEFT JOIN "img-db-2 postgresql".img_core_v400.gene_pfam_families p ON g.gene_oid = p.gene_oid
JOIN "img-db-2 postgresql".img_core_v400.taxon t ON g.taxon = t.taxon_oid
WHERE g.gene_symbol = 'dnaA'
LIMIT 100

Table	Description
`gene`	Main gene table
`gene_aliases`	Alternative gene names
`gene_biocyc_rxns`	BioCyc reactions
`gene_cathfam`	CATH protein families
`gene_cog_groups`	COG assignments
`gene_essential_genes`	Essential gene annotations
`gene_ext_links`	External database links
`gene_go_terms`	GO annotations
`gene_img_interpro_hits`	InterPro matches
`gene_ko_terms`	KEGG Ortholog assignments
`gene_kog_groups`	KOG (eukaryotic) assignments
`gene_paralogs`	Paralog relationships
`gene_pdb_xrefs`	PDB structure links
`gene_pfam_families`	Pfam assignments
`gene_sig_peptides`	Signal peptide predictions
`gene_smart`	SMART domain assignments
`gene_superfam`	SUPERFAMILY assignments
`gene_tigrfams`	TIGRfam assignments
`kegg_gene`	KEGG gene reference data

CLI Commands

Using linkml-store:

# Direct SQL query
linkml-store -d "dremio-rest://lakehouse.jgi.lbl.gov" query \
    --sql 'SELECT * FROM "img-db-2 postgresql".img_core_v400.gene WHERE gene_symbol = '\''dnaA'\'' LIMIT 10' \
    -O yaml

# List all gene tables
linkml-store -d "dremio-rest://lakehouse.jgi.lbl.gov" query \
    --sql 'SELECT TABLE_NAME FROM INFORMATION_SCHEMA."TABLES" WHERE TABLE_SCHEMA LIKE '\''%img_core%'\'' AND TABLE_NAME LIKE '\''%gene%'\''' \
    -O csv

Performance Tips

Always use LIMIT - Gene tables are massive (billions of rows)
Index on gene_oid - Primary key lookups are fastest
Filter early - Apply WHERE clauses before JOINs when possible
Use specific schemas - Avoid wildcard schema searches
Cache taxon lookups - Taxon table is relatively small

Critical Limitation: Full Table Scans on Gene Tables

Dremio cannot use indexes on federated PostgreSQL tables. For massive tables like gene, this means:

COUNT(*) will timeout
DISTINCT column requires a full table scan, even for 3-value columns like strand
GROUP BY aggregations require scanning all rows

Table Size Reference

Table	Row Count	Aggregations
`gold.study`	70K	✅ Fast
`gold.biosample`	273K	✅ Fast
`img_core.taxon`	287K	✅ Fast
`gold.project`	705K	✅ Fast
`img_core.gene`	billions	❌ Timeout
`img_core.gene_*` (annotation tables)	billions	❌ Timeout

Examples

-- THIS WILL TIMEOUT (full table scan of billions of rows):
SELECT DISTINCT strand FROM "img-db-2 postgresql".img_core_v400.gene

-- THIS WORKS - filter to a specific taxon first:
SELECT DISTINCT strand FROM "img-db-2 postgresql".img_core_v400.gene
WHERE taxon = 637000057

-- GOLD tables work fine for aggregations:
SELECT ecosystem, COUNT(*) FROM "gold-db-2 postgresql".gold.study
GROUP BY ecosystem ORDER BY COUNT(*) DESC

Workarounds

Filter first - Always filter by taxon, scaffold, or gene_oid range before aggregating
Use taxon table - Aggregate at the taxon level (287K rows) instead of gene level
Sample with LIMIT - Use LIMIT to sample before aggregating
Pre-computed stats - Check for statistics tables if available

Cross-Reference Summary

┌─────────────────────────────────────────────────────────────┐
│                     Gene Identifier Types                    │
├─────────────────────────────────────────────────────────────┤
│ IMG gene_oid ──────┬──→ taxon_oid ──→ GOLD IDs              │
│                    ├──→ KO terms                             │
│                    ├──→ COG groups                           │
│                    ├──→ Pfam families                        │
│                    ├──→ GO terms                             │
│                    └──→ External links (UniProt, NCBI, etc.) │
├─────────────────────────────────────────────────────────────┤
│ NMDC annotations ──────→ KEGG.ORTHOLOGY, PFAM, COG, GO      │
├─────────────────────────────────────────────────────────────┤
│ IMG VR uvig ───────────→ taxon_oid ──→ Pfam domains         │
└─────────────────────────────────────────────────────────────┘