Schema Generation Methods

This document describes how LinkML schemas are generated and maintained for the JGI and KBase genomics lakehouses.

Overview

Schemas are generated through database introspection using linkml-store, which connects to data sources and extracts:

Table/view names → LinkML classes
Column names and types → LinkML slots with appropriate ranges
Foreign key constraints → Slot ranges pointing to target classes
PostgreSQL comments → Class and slot descriptions

Data Sources

JGI Lakehouse (Dremio)

JGI data is accessed via Dremio, a data lakehouse that federates multiple PostgreSQL and MySQL databases.

Database	Source Type	FK Support	Comments
GOLD	PostgreSQL	Yes (31 FKs)	Limited
IMG Core	PostgreSQL	Yes (141 FKs)	Limited
IMG Extended	PostgreSQL	Yes (80 FKs)	Limited
IMG Satellite	PostgreSQL	Yes (127 FKs)	Limited
IMG MySQL*	MySQL	No	No

*MySQL sources don't support FK introspection via Dremio.

KBase/BERDL (REST API)

KBase data is accessed via their MCP REST API, which doesn't expose PostgreSQL catalogs.

Database	FK Support	Notes
Pangenome	Manual	Manually curated with descriptions
NMDC Core	Manual	Manually curated with descriptions
Others	No	Column names only

Introspection Process

Step 1: Table Discovery

SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_TYPE IN ('TABLE', 'VIEW')

Step 2: Column Introspection

SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA."COLUMNS"
ORDER BY TABLE_NAME, ORDINAL_POSITION

Step 3: Foreign Key Detection (PostgreSQL only)

SELECT con.conname, src.relname, tgt.relname, con.conkey, con.confkey
FROM pg_catalog.pg_constraint con
JOIN pg_catalog.pg_class src ON con.conrelid = src.oid
JOIN pg_catalog.pg_class tgt ON con.confrelid = tgt.oid
WHERE con.contype = 'f'

Step 4: Comment Extraction (PostgreSQL only)

SELECT c.relname, a.attname, td.description, cd.description
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_description td ON td.objoid = c.oid AND td.objsubid = 0
LEFT JOIN pg_catalog.pg_description cd ON cd.objoid = c.oid AND cd.objsubid = a.attnum

Type Mappings

Dremio/SQL Type	LinkML Type
VARCHAR, CHAR	string
INTEGER, BIGINT, SMALLINT	integer
FLOAT, DOUBLE, DECIMAL	float
BOOLEAN	boolean
DATE	date
TIMESTAMP	datetime
LIST, STRUCT, MAP	string

Generated Artifacts

From each LinkML schema, we generate:

OWL ontologies (project/owl/) - For semantic web integration
Documentation (docs/) - Markdown with Mermaid class diagrams

Regeneration Commands

JGI Schemas (PostgreSQL sources)

# Single schema
linkml-store -d "dremio-rest://lakehouse.jgi.lbl.gov?schema=gold-db-2 postgresql.gold" \
  schema -O yaml -o src/bridge_schemas/schema/jgi/gold.linkml.yaml

# All schemas (use regeneration script)
python scripts/regenerate_jgi.py

Derived Artifacts

# Generate OWL
make gen-owl

# Generate documentation
make docs

Known Limitations

Dremio INFORMATION_SCHEMA Incompleteness

Issue: Dremio's INFORMATION_SCHEMA.COLUMNS does not return column metadata for all tables, even when those tables are queryable and contain data.

Impact: For GOLD, INFORMATION_SCHEMA.TABLES reports 374 tables, but INFORMATION_SCHEMA.COLUMNS only has metadata for ~57 tables. The generated schema is therefore incomplete.

Example of missing tables: - contact (17,505 rows) - not in COLUMNS - cvbiological_status (7 rows) - not in COLUMNS - cvphenotype (159 rows) - IS in COLUMNS ✓

Diagnostic queries:

-- Tables visible
SELECT COUNT(*) FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_SCHEMA = 'gold-db-2 postgresql.gold'
-- Returns: 374

-- Tables with column metadata
SELECT COUNT(DISTINCT TABLE_NAME) FROM INFORMATION_SCHEMA."COLUMNS"
WHERE TABLE_SCHEMA = 'gold-db-2 postgresql.gold'
-- Returns: 57

Workaround: Tables missing from INFORMATION_SCHEMA can still be queried directly with SELECT * FROM table LIMIT 0 to get column info, but this is slow for many tables.

Status: Reported to JGI Dremio administrators (Feb 2026).

Manual Curation

Some schemas require manual curation because:

No FK constraints defined - KBase REST API doesn't expose pg_catalog
Rich descriptions needed - Biological context requires expert annotation
Complex relationships - Some relationships aren't captured by simple FKs

Manually curated schemas: - kbase_ke_pangenome.linkml.yaml - Pangenomic data with GTDB taxonomy - nmdc_core.linkml.yaml - NMDC microbiome data

Do not regenerate these schemas - they will lose curated content.