Cross-Database Linkages in JGI Lakehouse

This document maps the identifier relationships across JGI lakehouse databases, enabling cross-database queries and data integration.

Database Overview

Database	Schema Path	Records	Primary ID	Description
SMC	`smc-db postgresql.public`	13.2M BGCs	`bgc_id`, `smc_id`	Secondary Metabolite Clusters
IMG Core	`img-db-2 postgresql.img_core_v400`	287K taxons	`taxon_oid`	Integrated Microbial Genomes
IMG ABC	`img-db-1 mysql.abc`	489K regions	`region_id`	Biosynthetic Gene Clusters in IMG
GOLD	`gold-db-2 postgresql.gold`	705K projects	`gold_id`	Genomes OnLine Database
KBase	via MCP REST API	varies	varies	DOE KBase databases

Linkage Diagram

┌─────────────────┐                              ┌─────────────────┐
│  KBase/NMDC     │                              │   MIBiG         │
│  Pangenome      │                              │ (BGC0000xxx)    │
│  (293K genomes) │                              └────────┬────────┘
└────────┬────────┘                                       │ id
         │                                                ▼
         │ genome_id (GCA_*/GCF_*)            ┌─────────────────┐
         │ gold_study_identifiers (NMDC)      │   IMG ABC       │
         │                                    │  bcg_region_    │
         ▼                                    │  ext_links      │
┌─────────────────┐                           │  (1.3K MIBiG)   │
│ GOLD            │                           └────────┬────────┘
│ ncbi_assembly   │                                    │ taxon_oid
│ (3.1M rows)     │                                    │
└────────┬────────┘                                    ▼
         │                          ┌─────────────────────────────────┐
         │ taxid                    │          IMG Core               │
         ▼                          │           taxon                 │
┌─────────────────┐                 │        (287K rows)              │
│      SMC        │ img_accession   │                                 │
│  source_data    │────────────────►│  ◄── taxon_smc_stats (22.5K) ──►│
│  (1.36M rows)   │   (52K)         │                                 │
└────────┬────────┘                 │  ◄── kbase_pangenome (27.7K) ──►│
         │                          └────────────────┬────────────────┘
         │ refseq/genbank                            │
         │ accession (1.3M)          sequencing_gold_id (284K)
         │                           study_gold_id
         ▼                                           │
    ┌─────────┐                                      ▼
    │  NCBI   │                         ┌─────────────────┐
    │ RefSeq/ │                         │  GOLD project   │◄──► GOLD study
    │ GenBank │                         │  (705K rows)    │     (70K rows)
    └─────────┘                         └────────┬────────┘
                                                 │ img_taxon_id
                                                 ▼
                                        ┌─────────────────┐
                                        │ GOLD analysis_  │
                                        │    project      │
                                        │  (620K rows)    │
                                        └─────────────────┘

Primary Linkages

1. SMC ↔ IMG Core (Direct)

Path: source_data.img_accession_id → taxon.taxon_oid

Source	Field	Target	Field	Coverage
SMC source_data	`img_accession_id`	IMG taxon	`taxon_oid`	52,226 (3.8%)

-- SMC to IMG via img_accession_id
SELECT s.smc_id, s.tax_name, t.taxon_display_name, t.sequencing_gold_id
FROM "smc-db postgresql".public.source_data s
JOIN "img-db-2 postgresql".img_core_v400.taxon t
  ON CAST(s.img_accession_id AS BIGINT) = t.taxon_oid
WHERE s.img_accession_id <> ''

2. SMC ↔ IMG Core (via taxon_smc_stats bridge)

Path: taxon_smc_stats.smc_id ↔ taxon_smc_stats.taxon_oid

The taxon_smc_stats table in IMG Core provides a direct bridge:

Field	Type	Links To
`taxon_oid`	BIGINT	IMG taxon.taxon_oid
`smc_id`	INTEGER	SMC source_data.smc_id
`bgc_count`	INTEGER	Number of BGCs

-- IMG to SMC via bridge table
SELECT t.taxon_display_name, ts.bgc_count, s.tax_genus
FROM "img-db-2 postgresql".img_core_v400.taxon t
JOIN "img-db-2 postgresql".img_core_v400.taxon_smc_stats ts
  ON t.taxon_oid = ts.taxon_oid
JOIN "smc-db postgresql".public.source_data s
  ON ts.smc_id = s.smc_id
LIMIT 10

Coverage: 22,534 IMG taxons linked to SMC

3. IMG Core ↔ GOLD

Multiple linkage paths exist:

IMG Field	GOLD Table	GOLD Field	Coverage
`taxon.sequencing_gold_id`	project	`gold_id`	283,727
`taxon.study_gold_id`	study	`gold_id`	~200K
`taxon.sample_gold_id`	biosample	`gold_id`	varies
`taxon.taxon_oid`	analysis_project	`img_taxon_id`	278,063

-- IMG to GOLD via sequencing_gold_id
SELECT t.taxon_oid, t.taxon_display_name, p.project_name, p.ncbi_bioproject_accession
FROM "img-db-2 postgresql".img_core_v400.taxon t
JOIN "gold-db-2 postgresql".gold.project p
  ON t.sequencing_gold_id = p.gold_id
WHERE t.sequencing_gold_id IS NOT NULL
LIMIT 10

4. IMG ABC (BCG) ↔ MIBiG

The bcg_region_ext_links table provides links to external databases:

db_name	Count	Example ID
MIBiG	1,325	BGC0000772
Genbank	1,332	AF047478

-- IMG BCG regions linked to MIBiG
SELECT r.region_id, r.bcg_type, e.id AS mibig_id, e.comments
FROM "img-db-1 mysql".abc.bcg_region r
JOIN "img-db-1 mysql".abc.bcg_region_ext_links e
  ON r.region_id = e.region_id
WHERE e.db_name = 'MIBiG'
LIMIT 10

External Database Cross-References

SMC External Accessions

The SMC source_data table contains cross-references to external databases:

Field	Description	Coverage
`img_accession_id`	IMG taxon_oid	52,226
`refseq_accession_id`	NCBI RefSeq (GCF_*)	213,897
`genbank_accession_id`	NCBI GenBank (GCA_*)	1,116,356
`mibig_accession_id`	MIBiG BGC IDs	0 (not populated)
`taxonomy_id`	NCBI Taxonomy ID	~1.3M

GOLD External Accessions

Table	Field	External DB
project	`ncbi_bioproject_accession`	NCBI BioProject
project	`ncbi_biosample_accession`	NCBI BioSample
project	`assembly_accession`	NCBI Assembly
project	`ncbi_sra_id`	NCBI SRA
ncbi_assembly	`assembly_accession`	RefSeq/GenBank

NCBI Taxonomy as Universal Bridge

NCBI Taxonomy ID serves as a universal linker across all databases:

SMC source_data.taxonomy_id ────┐
                                │
IMG taxon.ncbi_taxon_id ────────┼──► NCBI Taxonomy
                                │
GOLD ncbi_taxonomy.ncbi_tax_id ─┘

Caution: Taxonomy-based joins may produce many-to-many relationships (multiple genomes per species). Use for broad aggregation, not precise matching.

Recommended Join Strategies

For SMC Analysis

SMC → IMG: Use img_accession_id when available, otherwise use taxon_smc_stats bridge
SMC → GOLD: Chain through IMG (smc_id → taxon_oid → sequencing_gold_id)
SMC → External: Use refseq_accession_id or genbank_accession_id directly

For BGC Comparative Analysis

SMC BGCs: Query smc-db postgresql.public.bgc for antiSMASH/emeraldBGC predictions
IMG BGCs: Query img-db-1 mysql.abc.bcg_region for IMG-native BCG annotations
MIBiG reference: Join IMG BCG via bcg_region_ext_links

Multi-Hop Example: SMC → IMG → GOLD

-- Full chain: SMC source → IMG taxon → GOLD project
SELECT
    s.smc_id,
    s.tax_name AS smc_organism,
    s.bgc_count,
    t.taxon_oid,
    p.gold_id,
    p.project_name
FROM "smc-db postgresql".public.source_data s
JOIN "img-db-2 postgresql".img_core_v400.taxon_smc_stats ts
  ON s.smc_id = ts.smc_id
JOIN "img-db-2 postgresql".img_core_v400.taxon t
  ON ts.taxon_oid = t.taxon_oid
JOIN "gold-db-2 postgresql".gold.project p
  ON t.sequencing_gold_id = p.gold_id
LIMIT 10

KBase ↔ JGI Linkages

KBase databases connect to JGI primarily through NCBI assembly accessions and GOLD study IDs.

KBase Pangenome ↔ GOLD/IMG

Primary linkage: NCBI assembly accession (GCA_/GCF_)

KBase genome.genome_id: "RS_GCF_011881725.1"
                              ↓ strip RS_/GB_ prefix
GOLD ncbi_assembly.assembly_accession: "GCF_011881725.1"
                              ↓ via ncbi_assembly.taxid
IMG taxon.ncbi_taxon_id

Source	Field	Target	Field	Coverage
KBase pangenome	`pangenome_id`	GOLD ncbi_assembly	`assembly_accession`	17,202 (62%)
KBase members	`member_id`	GOLD ncbi_assembly	`assembly_accession`	89,290 (30%)

IMG has a KBase mirror table in img_sub schema:

-- IMG's copy of KBase pangenome data
SELECT pangenome_id, gtdb_species, gtdb_phylum
FROM "img-db-2 postgresql".img_sub.kbase_pangenome
LIMIT 10
-- 27,701 pangenomes, 293,089 members

-- KBase pangenome to GOLD lookup
SELECT k.pangenome_id, k.gtdb_species, g.organism_name, g.taxid
FROM "img-db-2 postgresql".img_sub.kbase_pangenome k
JOIN "gold-db-2 postgresql".gold.ncbi_assembly g
  ON k.pangenome_id = g.assembly_accession
LIMIT 10

NMDC ↔ GOLD

NMDC links to GOLD via study identifiers in the studies table:

NMDC Table	Field	Description
`studies`	`gold_study_identifiers`	JSON array of GOLD study IDs (e.g., `["gold:Gs0114675"]`)

-- NMDC study to GOLD linkage (conceptual - via KBase MCP)
-- studies.gold_study_identifiers contains: ["gold:Gs0114675"]
-- Extract Gs* ID and join to GOLD study table

Note on *_gold table naming: Tables like metabolomics_gold, proteomics_gold, lipidomics_gold are not cross-reference tables. The _gold suffix indicates these contain omics data derived from GOLD-registered samples, not linkages to GOLD. They contain actual measurement data (m/z ratios, expression values, etc.).

KBase Gene-Level Linkages

Gene-level joins are complex due to different ID systems:

Database	Gene ID Format	Example
KBase	`{nucleotide_accession}_{CDS_number}`	`NZ_JAATUR010000001.1_1`
IMG	`gene_oid` (integer) + `locus_tag`	`2500001234` / `ECOLI_0001`

The nucleotide accession prefix indicates source: - NC_: RefSeq complete genomic molecules - NZ_: RefSeq annotated genomic sequences (WGS)

Strategy: Extract nucleotide accession from KBase gene_id, match to IMG scaffold, then locate gene by position.

Recommended KBase → JGI Paths

KBase Pangenome → GOLD: Strip RS_/GB_ prefix from genome_id, join to ncbi_assembly.assembly_accession
KBase Pangenome → IMG: Use IMG's kbase_pangenome mirror table, or chain through GOLD
NMDC → GOLD: Extract GOLD study IDs from gold_study_identifiers JSON field
KBase → SMC: Chain: KBase → GOLD ncbi_assembly → IMG taxon → SMC source_data

KBase-JGI Coverage Summary

Linkage	Records	Notes
KBase pangenome → GOLD	17,202	62% of pangenomes have GOLD match
KBase members → GOLD	89,290	30% of member genomes
NMDC studies → GOLD	~48	Via `gold_study_identifiers` JSON field
IMG kbase_pangenome mirror	27,701	Full KBase pangenome copy in IMG

Coverage Summary (All Linkages)

Linkage	Records Linked	% of Source
SMC → IMG (direct)	52,226	3.8% of SMC
SMC → IMG (bridge)	22,534	1.7% of SMC
IMG → GOLD (project)	283,727	99% of IMG
IMG → GOLD (analysis)	278,063	97% of IMG
IMG BCG → MIBiG	1,325	0.3% of BCG regions
KBase → GOLD	17,202	62% of KBase pangenomes

GOLD ID Patterns

Prefix	Entity	Example
`Gp`	Project (sequencing)	Gp0295815
`Gs`	Study	Gs0133538
`Ga`	Analysis project	Ga0262978
`Go`	Organism	Go0000001
`Gb`	Biosample	Gb0000001