# Existing Canonical Schemas vs Bridge-Schemas
This document clarifies the relationship between canonical LinkML schemas maintained by upstream projects and the introspected schemas in this repository.
## The Core Problem
This repository generates LinkML schemas through database introspection—querying
APIs or pg_catalog to discover tables, columns, and constraints. This approach:
- Works across any database we can connect to
- Captures the actual deployed schema
- Loses semantic information not present in the database itself:
  - Rich descriptions and definitions
  - Logical relationships not expressed as foreign keys
  - Validation rules and enumerations
  - Inheritance and class hierarchies
Upstream projects often maintain canonical LinkML schemas with this rich semantic information—but we're not using them.
## Canonical Schema Sources

### KBase CDM Schema
Repository: github.com/kbase/cdm-schema
The KBase Common Data Model (CDM) schema is the authoritative LinkML specification for KBase's data structures. It defines ~80 classes including:
| Module | Classes | Purpose |
|---|---|---|
| `cdm_bioentity.yaml` | Entity, Sequence, Feature, Protein | Core biological entities |
| `cdm_protocol.yaml` | Protocol, ProtocolExecution, Measurement | Experimental workflows |
| `cdm_ontology.yaml` | Prefix, Statement, EntailedEdge | Ontology/vocabulary support |
| `cdm_components.yaml` | Sample, Contig, Cluster | Reusable components |
| `cdm_credit.yaml` | Contributor, FundingReference, License | Attribution |
| `cdm_join_tables.yaml` | Various `*_x_*` tables | Many-to-many relationships |
Current gap: We introspect KBase tables via the BERDL REST API, which returns only column names and types. The CDM schema contains descriptions, relationships, and constraints we're not capturing.
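The gap above is mechanical: an introspection pass can do little more than map a column listing onto bare LinkML attributes. A minimal sketch of that step, assuming a hypothetical response shape of (column, SQL type) pairs and an invented type map (neither is the actual BERDL API):

```python
# Hypothetical sketch: the (column, sql_type) response shape and TYPE_MAP
# are assumptions for illustration, not the real BERDL REST API contract.
TYPE_MAP = {"VARCHAR": "string", "BIGINT": "integer", "DOUBLE": "float"}

def columns_to_linkml_class(class_name, columns):
    """Build a bare LinkML class dict from (column, sql_type) pairs.

    Everything the canonical CDM schema adds (descriptions, enums,
    patterns, required flags) is absent, because the API never sends it.
    """
    return {
        class_name: {
            "attributes": {
                col: {"range": TYPE_MAP.get(sql_type, "string")}
                for col, sql_type in columns
            }
        }
    }

cls = columns_to_linkml_class(
    "AnnotationTermsUnified",
    [("source", "VARCHAR"), ("term_id", "VARCHAR"), ("name", "VARCHAR")],
)
```

The output matches the "impoverished" schema shown later in this document: every attribute collapses to a plain range with no semantics attached.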
### NMDC Schema
Repository: github.com/microbiomedata/nmdc-schema
The National Microbiome Data Collaborative (NMDC) schema is the authoritative LinkML specification for NMDC's MongoDB backend. Core classes include:
| Class | Purpose |
|---|---|
| Study | Research project container |
| Biosample | Biological material collected from the environment |
| ProcessedSample | Derived from biosamples via extraction/preparation |
| DataGeneration | Sequencing or analytical processes |
| WorkflowExecution | Computational analysis runs |
| DataObject | Actual data files and results |
| FieldResearchSite | Physical collection locations |
The NMDC schema is comprehensive, with rich descriptions, slot constraints, enumerations for environmental metadata (MIxS, GOLD ecosystem classification), and extensive cross-references.
### CORAL/ENIGMA Schema
Repository: github.com/realmarcin/linkml-coral
The CORAL (Common Ontology-based Resource for Annotation and Linking) schema is a LinkML implementation of the ENIGMA Common Data Model (CDM) for environmental molecular science data. It defines ~12 core classes for environmental sampling, sequencing, and microbial genomics:
| Class | Purpose |
|---|---|
| Location | Geographic sampling locations with coordinates and environmental context |
| Sample | Environmental samples with depth, material, date, and environmental package |
| Community | Microbial community samples (isolates, enrichments, assemblages) |
| Reads | Sequencing read data with read counts and technology metadata |
| Assembly | Genome assemblies with contig statistics |
| Genome | Annotated genomes with feature counts |
| Gene | Gene predictions with functional annotations |
| OTU (ASV) | 16S amplicon sequence variants for community profiling |
| Process | Provenance tracking for experimental workflows |
Key features:
- Semantic annotations: 69 microtype annotations (ME: terms) from `context_measurement_ontology.obo`
- Ontology integration: ENVO, UO, DA, and ME prefixes for standardized terms
- 23 enumerated types: Auto-generated from OBO, including ReadType, SequencingTechnology, Strand
- Provenance tracking: Complete lineage from samples through sequencing to analysis
- Foreign key validation: Explicit relationships between entities
- Enhanced validation: Regex patterns, range constraints, required fields
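The "enhanced validation" bullet can be illustrated with a small LinkML fragment. This is a hedged sketch: the slot names, pattern, and enum below are invented for illustration and are not copied from `linkml_coral.yaml`:

```yaml
# Illustrative only -- slot names and constraints are assumptions,
# not excerpts from the actual CORAL schema
classes:
  Sample:
    attributes:
      sample_id:
        identifier: true
        required: true
        pattern: "^ENIGMA:[A-Za-z0-9_]+$"
      depth:
        range: float
        minimum_value: 0
      material:
        range: MaterialTypeEnum   # one of the OBO-derived enumerated types
```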
Schema location: `src/linkml_coral/schema/linkml_coral.yaml`

CDM naming variant: Also available as `linkml_coral_cdm.yaml` with BERDL naming conventions (`sdt_*`, `sys_*`, `ddt_*`)
Data sources:
- CORAL typedef JSON (git submodule at CORAL/)
- KBase ENIGMA CDM parquet exports (~500MB, 44 tables)
- Supports loading into DuckDB via linkml-store
The CORAL schema bridges ENIGMA's original JSON-based type definitions with modern LinkML semantics, providing comprehensive metadata management for environmental microbiology datasets.
## The NMDC Confusion

This is the key source of confusion: the `nmdc_core` tables in KBase/BERDL are **not** the same as the classes in nmdc-schema.
### What nmdc-schema Defines
The canonical NMDC schema (github.com/microbiomedata/nmdc-schema) defines the
data model for NMDC's MongoDB backend. When NMDC stores a study or biosample,
it uses these class definitions.
```
nmdc-schema defines:
  Study, Biosample, DataObject, WorkflowExecution, ...
        ↓
Stored in NMDC MongoDB collections:
  biosample_set, study_set, data_object_set, ...
```
### What `nmdc_core` in BERDL Contains
The tables in `bridge_schemas/schema/kbase/nmdc_core.linkml.yaml` represent a second-order ingest of NMDC data into the KBase/BERDL data lake. These are derived/computed tables, not the raw NMDC entities:
```
NMDC MongoDB                      BERDL Data Lake
┌────────────────┐                ┌─────────────────────────────┐
│ biosample_set  │                │ annotation_terms_unified    │
│ study_set      │ ─── ETL ──►    │ go_terms                    │
│ data_object_set│                │ metabolomics_gold           │
│ workflow_exec  │                │ embeddings_v1               │
└────────────────┘                │ go_hierarchy_flat           │
                                  └─────────────────────────────┘
```
The BERDL tables are:
- Aggregated: Terms unified across studies
- Pre-computed: GO hierarchy flattened for efficient queries
- Enhanced: Embeddings computed from raw data
- Restructured: Optimized for analytical queries, not storage
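To make "pre-computed" concrete: a flattened hierarchy table like `go_hierarchy_flat` stores each term with its full set of ancestors, so queries avoid recursive traversal. A toy sketch of that flattening step (the three GO terms form a real is-a chain, but the function is illustrative, not the actual ETL code):

```python
# Illustrative sketch of ancestor-closure flattening; not the BERDL ETL.
def flatten_hierarchy(parents):
    """Expand child->parents edges into a child->all-ancestors mapping."""
    closure = {}

    def ancestors(term):
        if term not in closure:
            found = set()
            for p in parents.get(term, ()):
                found.add(p)
                found |= ancestors(p)  # recurse up the DAG
            closure[term] = found
        return closure[term]

    for term in parents:
        ancestors(term)
    return closure

# Toy is-a chain: catabolic process -> metabolic process -> biological_process
edges = {
    "GO:0009056": ["GO:0008152"],
    "GO:0008152": ["GO:0008150"],
}
flat = flatten_hierarchy(edges)
```

With the closure materialized, "all samples annotated below metabolic process" becomes a single equality join instead of a recursive query.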
### Table Name Confusion
| BERDL Table | What It Contains | NOT the same as |
|---|---|---|
| `annotation_terms_unified` | Merged GO/KEGG/EC terms across all NMDC samples | Any single NMDC class |
| `metabolomics_gold` | Mass spec features from GOLD-registered samples | DataObject or WorkflowExecution |
| `go_hierarchy_flat` | Pre-computed GO ancestor closure | OntologyClass |
| `embeddings_v1` | 256-dim sample embeddings for similarity | No equivalent |
| `studies` | Study metadata with GOLD linkages | Study (partial overlap) |
The `_gold` suffix on tables like `metabolomics_gold` indicates that the data comes from GOLD-registered samples, not that the tables link to the GOLD database.
## Consequences of Schema Introspection

### What We Capture (via BERDL API)
```yaml
# Introspected schema (impoverished)
classes:
  AnnotationTermsUnified:
    attributes:
      source:
        range: string
      term_id:
        range: string
      name:
        range: string
```
### What Canonical Schemas Provide
```yaml
# Canonical schema (rich)
classes:
  AnnotationTermsUnified:
    description: >-
      Unified annotation terms across sources (GO, KEGG, EC, COG, MetaCyc).
      Provides a single interface for querying functional annotations...
      TOTAL TERMS: 67,353 across all sources
    attributes:
      source:
        range: AnnotationSource  # Enum, not string!
        required: true
        description: >-
          Source ontology/database for this term. Determines ID format...
      term_id:
        identifier: true
        range: string
        required: true
        pattern: "^(GO:\\d{7}|K\\d{5}|\\d+\\.\\d+\\.\\d+\\.\\d+|...)$"
```
### Information Lost
| Aspect | Introspected | Canonical |
|---|---|---|
| Descriptions | None or minimal | Rich, contextual |
| Enumerations | All strings | Defined value sets |
| Patterns | None | Regex validation |
| Required fields | Sometimes | Explicit |
| Identifiers | Guessed | Declared |
| Foreign keys | API-dependent | Logical relationships |
| Inheritance | None | Class hierarchies |
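The "Patterns" row is easy to demonstrate. The canonical pattern above is partially elided (`|...`), so this sketch uses only the GO and KEGG ortholog alternatives; the point is that a declared regex rejects malformed identifiers that an introspected `range: string` would silently accept:

```python
import re

# Hedged sketch: only the GO and KEGG alternatives from the (elided)
# canonical pattern are reproduced here.
TERM_ID_PATTERN = re.compile(r"^(GO:\d{7}|K\d{5})$")

def valid_term_id(term_id):
    """Return True iff the id matches a declared canonical format."""
    return bool(TERM_ID_PATTERN.fullmatch(term_id))

# GO ids need exactly 7 digits, so the truncated "GO:8150" is rejected
checks = {t: valid_term_id(t) for t in ["GO:0008150", "K00001", "GO:8150"]}
```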
## Current Mitigation: Manual Curation
For critical schemas, we manually curate descriptions after initial introspection:
- `kbase_ke_pangenome.linkml.yaml` - 38 curated descriptions
- `nmdc_core.linkml.yaml` - 79 curated descriptions
**Warning:** Do not regenerate these schemas; the curated descriptions will be lost.
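One way to lift this restriction would be to re-apply saved curation after each regeneration. A hypothetical sketch (neither the function nor the workflow exists in this repo today; the schema dicts are toy examples):

```python
# Hypothetical sketch: merge previously curated descriptions back into a
# freshly introspected schema dict so regeneration no longer loses them.
def reapply_curation(fresh, curated):
    """Copy curated class/attribute descriptions into a regenerated schema."""
    for cname, cdef in fresh.get("classes", {}).items():
        old = curated.get("classes", {}).get(cname, {})
        if "description" in old:
            cdef["description"] = old["description"]
        for aname, adef in cdef.get("attributes", {}).items():
            old_attr = old.get("attributes", {}).get(aname, {})
            if "description" in old_attr:
                adef["description"] = old_attr["description"]
    return fresh

fresh = {"classes": {"Study": {"attributes": {"id": {"range": "string"}}}}}
curated = {
    "classes": {
        "Study": {
            "description": "NMDC study metadata",
            "attributes": {"id": {"description": "GOLD-linked id"}},
        }
    }
}
merged = reapply_curation(fresh, curated)
```

Classes or attributes dropped upstream simply lose their curation, which is arguably the correct behavior.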
## Recommended Path Forward

### Short-term: Document the Gap
This document serves to clarify:

1. Canonical schemas exist but are not being used
2. BERDL tables ≠ NMDC schema classes
3. The second-order ingest nature of NMDC data in BERDL
### Medium-term: Schema Alignment
Potential improvements:

1. Import canonical enums: Use AnnotationSource, GoNamespace, etc. from upstream schemas instead of regenerating them as strings
2. Link to canonical docs: Reference nmdc-schema and cdm-schema documentation for authoritative definitions
3. Distinguish derived tables: Clearly mark which tables are derived/computed vs. direct representations of upstream entities
### Long-term: Hybrid Approach
Combine introspection with canonical schemas:
```yaml
# Import canonical definitions
imports:
  - https://w3id.org/nmdc/nmdc-schema  # Enums, base types

# Extend with BERDL-specific tables
classes:
  EmbeddingsV1:
    description: BERDL-computed embeddings (not in nmdc-schema)
    ...
```
## Data Catalog Alternatives
Could enterprise data catalog tools help bridge the gap between introspected and canonical schemas? Here's an assessment of the major open-source options:
### Tool Comparison
| Tool | Schema Format | Semantic Enrichment | Complexity |
|---|---|---|---|
| OpenMetadata | JSON Schema | Collaborative curation, glossary | Medium |
| DataHub | Graph-based | Tags, terms, domains, lineage | High |
| Amundsen | Neo4j | Owners, tags, badges | Low |
| LinkML Registry | LinkML native | Discovery only | Low |
### How Data Catalogs Could Help
A data catalog like OpenMetadata or DataHub could serve as a semantic overlay:
```
BERDL Tables          Data Catalog              Bridge-Schemas
┌──────────────┐      ┌──────────────────┐      ┌──────────────┐
│ Introspected │ ───► │ + descriptions   │ ───► │ Enriched     │
│ columns      │      │ + glossary terms │      │ LinkML       │
│ + types      │      │ + lineage        │      │ schemas      │
└──────────────┘      │ + ownership      │      └──────────────┘
                      └──────────────────┘
```
Potential benefits:
- Collaborative curation (data producers add business context)
- Column-level lineage (track NMDC MongoDB → BERDL ETL transformations)
- Business glossary (define terms like "GOLD ecosystem", "MIxS package")
- Data quality alerts and deprecation notices
Limitations for our use case:
- Additional infrastructure to deploy and maintain
- JSON Schema based, not LinkML (a translation layer would be needed)
- No awareness of canonical upstream schemas (cdm-schema, nmdc-schema)
- Designed for enterprise data governance, not scientific schema alignment
### DataHub vs OpenMetadata
DataHub offers stronger governance features:
- Fine-grained column-level lineage
- dbt semantic layer integration
- Hierarchical business glossary
- But: more complex to operate
OpenMetadata is simpler to deploy:
- 100+ data source connectors
- Collaborative annotation workflows
- Built-in data quality framework
- But: less mature lineage tracking
### Recommendation for Bridge-Schemas
Given that canonical LinkML schemas already exist, direct schema integration is more appropriate than adopting a data catalog:
1. Skip data catalogs for this specific problem: they add complexity without solving the core issue (canonical schemas exist but are not used)

2. Import canonical schemas where they exist (nmdc-schema, cdm-schema)

3. Create explicit mappings between BERDL tables and canonical classes:

   ```yaml
   # mappings/berdl-to-canonical.yaml
   berdl_tables:
     nmdc_core.studies:
       canonical_class: nmdc:Study
       notes: "Subset of fields, adds gold_study_identifiers linkage"
     nmdc_core.annotation_terms_unified:
       canonical_class: null  # Derived table, no canonical equivalent
       derived_from:
         - nmdc:FunctionalAnnotation
         - nmdc:GeneProduct
     nmdc_core.embeddings_v1:
       canonical_class: null  # BERDL-computed, no upstream equivalent
   ```

4. Consider data catalogs later if collaborative curation becomes important across multiple teams or if lineage tracking for ETL pipelines is needed
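A mapping file like the one in step 3 is also easy to lint. A hedged sketch of one possible structural check, with the mapping hardcoded as a dict to avoid a YAML dependency (the rule itself is an assumption, not repo policy):

```python
# Hedged sketch: flag tables that neither name a canonical class nor
# declare a derived_from lineage. Field names follow the example mapping.
def unmapped_tables(berdl_tables):
    """Return tables with no canonical class and no derived_from lineage."""
    return sorted(
        table
        for table, entry in berdl_tables.items()
        if entry.get("canonical_class") is None and not entry.get("derived_from")
    )

mapping = {
    "nmdc_core.studies": {"canonical_class": "nmdc:Study"},
    "nmdc_core.annotation_terms_unified": {
        "canonical_class": None,
        "derived_from": ["nmdc:FunctionalAnnotation", "nmdc:GeneProduct"],
    },
    # BERDL-computed table with no upstream equivalent at all
    "nmdc_core.embeddings_v1": {"canonical_class": None},
}
flagged = unmapped_tables(mapping)
```

Tables like `embeddings_v1` would then need an explicit marker (e.g. a notes field) to pass the check, which keeps the "no canonical equivalent" claim deliberate rather than accidental.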
## See Also
- Schema Generation Methods - How introspection works
- Cross-Database Linkages - How NMDC links to GOLD
- KBase CDM Schema - Canonical KBase schema
- NMDC Schema - Canonical NMDC schema
- CORAL/ENIGMA Schema - ENIGMA environmental microbiology data model