Schema Generation Methods
This document describes how LinkML schemas are generated and maintained for the JGI and KBase genomics lakehouses.
Overview
Schemas are generated through database introspection using linkml-store, which connects to data sources and extracts:
- Table/view names → LinkML classes
- Column names and types → LinkML slots with appropriate ranges
- Foreign key constraints → Slot ranges pointing to target classes
- PostgreSQL comments → Class and slot descriptions
Data Sources
JGI Lakehouse (Dremio)
JGI data is accessed via Dremio, a data lakehouse that federates multiple PostgreSQL and MySQL databases.
| Database | Source Type | FK Support | Comments |
|---|---|---|---|
| GOLD | PostgreSQL | Yes (31 FKs) | Limited |
| IMG Core | PostgreSQL | Yes (141 FKs) | Limited |
| IMG Extended | PostgreSQL | Yes (80 FKs) | Limited |
| IMG Satellite | PostgreSQL | Yes (127 FKs) | Limited |
| IMG MySQL* | MySQL | No | No |
*MySQL sources don't support FK introspection via Dremio.
KBase/BERDL (REST API)
KBase data is accessed via their MCP REST API, which doesn't expose PostgreSQL catalogs.
| Database | FK Support | Notes |
|---|---|---|
| Pangenome | Manual | Manually curated with descriptions |
| NMDC Core | Manual | Manually curated with descriptions |
| Others | No | Column names only |
Introspection Process
Step 1: Table Discovery
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE
FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_TYPE IN ('TABLE', 'VIEW')
Step 2: Column Introspection
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, IS_NULLABLE
FROM INFORMATION_SCHEMA."COLUMNS"
ORDER BY TABLE_NAME, ORDINAL_POSITION
Step 3: Foreign Key Detection (PostgreSQL only)
SELECT con.conname, src.relname, tgt.relname, con.conkey, con.confkey
FROM pg_catalog.pg_constraint con
JOIN pg_catalog.pg_class src ON con.conrelid = src.oid
JOIN pg_catalog.pg_class tgt ON con.confrelid = tgt.oid
WHERE con.contype = 'f'
Step 4: Comment Extraction (PostgreSQL only)
SELECT c.relname, a.attname, td.description, cd.description
FROM pg_catalog.pg_class c
LEFT JOIN pg_catalog.pg_description td ON td.objoid = c.oid AND td.objsubid = 0
LEFT JOIN pg_catalog.pg_description cd ON cd.objoid = c.oid AND cd.objsubid = a.attnum
Type Mappings
| Dremio/SQL Type | LinkML Type |
|---|---|
| VARCHAR, CHAR | string |
| INTEGER, BIGINT, SMALLINT | integer |
| FLOAT, DOUBLE, DECIMAL | float |
| BOOLEAN | boolean |
| DATE | date |
| TIMESTAMP | datetime |
| LIST, STRUCT, MAP | string |
Generated Artifacts
From each LinkML schema, we generate:
- OWL ontologies (
project/owl/) - For semantic web integration - Documentation (
docs/) - Markdown with Mermaid class diagrams
Regeneration Commands
JGI Schemas (PostgreSQL sources)
# Single schema
linkml-store -d "dremio-rest://lakehouse.jgi.lbl.gov?schema=gold-db-2 postgresql.gold" \
schema -O yaml -o src/bridge_schemas/schema/jgi/gold.linkml.yaml
# All schemas (use regeneration script)
python scripts/regenerate_jgi.py
Derived Artifacts
# Generate OWL
make gen-owl
# Generate documentation
make docs
Known Limitations
Dremio INFORMATION_SCHEMA Incompleteness
Issue: Dremio's INFORMATION_SCHEMA.COLUMNS does not return column metadata for all tables,
even when those tables are queryable and contain data.
Impact: For GOLD, INFORMATION_SCHEMA.TABLES reports 374 tables, but INFORMATION_SCHEMA.COLUMNS
only has metadata for ~57 tables. The generated schema is therefore incomplete.
Example of missing tables:
- contact (17,505 rows) - not in COLUMNS
- cvbiological_status (7 rows) - not in COLUMNS
- cvphenotype (159 rows) - IS in COLUMNS ✓
Diagnostic queries:
-- Tables visible
SELECT COUNT(*) FROM INFORMATION_SCHEMA."TABLES"
WHERE TABLE_SCHEMA = 'gold-db-2 postgresql.gold'
-- Returns: 374
-- Tables with column metadata
SELECT COUNT(DISTINCT TABLE_NAME) FROM INFORMATION_SCHEMA."COLUMNS"
WHERE TABLE_SCHEMA = 'gold-db-2 postgresql.gold'
-- Returns: 57
Workaround: Tables missing from INFORMATION_SCHEMA can still be queried directly
with SELECT * FROM table LIMIT 0 to get column info, but this is slow for many tables.
Status: Reported to JGI Dremio administrators (Feb 2026).
Manual Curation
Some schemas require manual curation because:
- No FK constraints defined - KBase REST API doesn't expose pg_catalog
- Rich descriptions needed - Biological context requires expert annotation
- Complex relationships - Some relationships aren't captured by simple FKs
Manually curated schemas:
- kbase_ke_pangenome.linkml.yaml - Pangenomic data with GTDB taxonomy
- nmdc_core.linkml.yaml - NMDC microbiome data
Do not regenerate these schemas - they will lose curated content.