Existing Canonical Schemas vs Bridge-Schemas

This document clarifies the relationship between canonical LinkML schemas maintained by upstream projects and the introspected schemas in this repository.

The Core Problem

This repository generates LinkML schemas through database introspection—querying APIs or pg_catalog to discover tables, columns, and constraints (a minimal sketch of this flow follows the list below). This approach:

  • Works across any database we can connect to
  • Captures the actual deployed schema
  • Loses semantic information not present in the database itself:
      • Rich descriptions and definitions
      • Logical relationships not expressed as foreign keys
      • Validation rules and enumerations
      • Inheritance and class hierarchies
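
A minimal sketch of that introspection flow, assuming a PostgreSQL connection via psycopg2; the DSN, schema name, and type map are illustrative, not the repository's actual implementation:

```python
import psycopg2
import yaml

conn = psycopg2.connect("dbname=berdl")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
""")

# Crude SQL-to-LinkML type mapping; anything unknown falls back to string,
# which is exactly how semantic detail gets lost.
TYPE_MAP = {"integer": "integer", "bigint": "integer", "text": "string",
            "character varying": "string", "double precision": "float",
            "boolean": "boolean"}

classes = {}
for table, column, dtype in cur.fetchall():
    attrs = classes.setdefault(table, {"attributes": {}})["attributes"]
    attrs[column] = {"range": TYPE_MAP.get(dtype, "string")}

print(yaml.safe_dump({"classes": classes}, sort_keys=False))
```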

Upstream projects often maintain canonical LinkML schemas with this rich semantic information—but we're not using them.

Canonical Schema Sources

KBase CDM Schema

Repository: github.com/kbase/cdm-schema

The KBase Common Data Model (CDM) schema is the authoritative LinkML specification for KBase's data structures. It defines ~80 classes including:

| Module | Classes | Purpose |
|---|---|---|
| cdm_bioentity.yaml | Entity, Sequence, Feature, Protein | Core biological entities |
| cdm_protocol.yaml | Protocol, ProtocolExecution, Measurement | Experimental workflows |
| cdm_ontology.yaml | Prefix, Statement, EntailedEdge | Ontology/vocabulary support |
| cdm_components.yaml | Sample, Contig, Cluster | Reusable components |
| cdm_credit.yaml | Contributor, FundingReference, License | Attribution |
| cdm_join_tables.yaml | Various `*_x_*` tables | Many-to-many relationships |

Current gap: We introspect KBase tables via the BERDL REST API, which returns only column names and types. The CDM schema contains descriptions, relationships, and constraints we're not capturing.
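
A quick way to see what introspection misses is to read a cdm-schema module directly; the raw-file path below is an assumption about the repository layout:

```python
import urllib.request
import yaml

# Path within kbase/cdm-schema is an assumption; adjust to the actual layout.
URL = ("https://raw.githubusercontent.com/kbase/cdm-schema/"
       "main/src/cdm_schema/schema/cdm_bioentity.yaml")
schema = yaml.safe_load(urllib.request.urlopen(URL))
for name, cls in (schema.get("classes") or {}).items():
    desc = ((cls or {}).get("description") or "").replace("\n", " ")
    print(f"{name}: {desc[:80]}")
```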

NMDC Schema

Repository: github.com/microbiomedata/nmdc-schema

The National Microbiome Data Collaborative (NMDC) schema is the authoritative LinkML specification for NMDC's MongoDB backend. Core classes include:

| Class | Purpose |
|---|---|
| Study | Research project container |
| Biosample | Biological material collected from environment |
| ProcessedSample | Derived from biosamples via extraction/preparation |
| DataGeneration | Sequencing or analytical processes |
| WorkflowExecution | Computational analysis runs |
| DataObject | Actual data files and results |
| FieldResearchSite | Physical collection locations |

The NMDC schema is comprehensive, with rich descriptions, slot constraints, enumerations for environmental metadata (MIxS, GOLD ecosystem classification), and extensive cross-references.
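
For example, linkml-runtime's SchemaView can load the canonical schema and expose that rich slot metadata; that the w3id URL resolves directly to the schema YAML is an assumption (a path to a local nmdc-schema checkout works the same way):

```python
from linkml_runtime import SchemaView

sv = SchemaView("https://w3id.org/nmdc/nmdc")  # or a local nmdc.yaml path
biosample = sv.get_class("Biosample")
print(biosample.description)
for slot_name in sv.class_slots("Biosample")[:10]:
    slot = sv.induced_slot(slot_name, "Biosample")
    print(f"  {slot_name}: range={slot.range}, required={slot.required}")
```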

CORAL/ENIGMA Schema

Repository: github.com/realmarcin/linkml-coral

The CORAL (Common Ontology-based Resource for Annotation and Linking) schema is a LinkML implementation of the ENIGMA Common Data Model (CDM) for environmental molecular science data. It defines ~12 core classes for environmental sampling, sequencing, and microbial genomics:

| Class | Purpose |
|---|---|
| Location | Geographic sampling locations with coordinates and environmental context |
| Sample | Environmental samples with depth, material, date, and environmental package |
| Community | Microbial community samples (isolates, enrichments, assemblages) |
| Reads | Sequencing read data with read counts and technology metadata |
| Assembly | Genome assemblies with contig statistics |
| Genome | Annotated genomes with feature counts |
| Gene | Gene predictions with functional annotations |
| OTU (ASV) | 16S amplicon sequence variants for community profiling |
| Process | Provenance tracking for experimental workflows |

Key features:

  • Semantic annotations: 69 microtype annotations (ME: terms) from context_measurement_ontology.obo
  • Ontology integration: ENVO, UO, DA, and ME prefixes for standardized terms
  • 23 enumerated types: Auto-generated from OBO including ReadType, SequencingTechnology, Strand
  • Provenance tracking: Complete lineage from samples through sequencing to analysis
  • Foreign key validation: Explicit relationships between entities
  • Enhanced validation: Regex patterns, range constraints, required fields

Schema location: src/linkml_coral/schema/linkml_coral.yaml

CDM naming variant: Also available as linkml_coral_cdm.yaml with BERDL naming conventions (sdt_*, sys_*, ddt_*)

Data sources:

  • CORAL typedef JSON (git submodule at CORAL/)
  • KBase ENIGMA CDM parquet exports (~500MB, 44 tables)

Loading into DuckDB via linkml-store is supported (sketched below).
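
A minimal sketch of the DuckDB route using duckdb's Python API directly; the parquet file name is illustrative, and linkml-store provides the higher-level path mentioned above:

```python
import duckdb

con = duckdb.connect("coral.duckdb")
con.execute(
    "CREATE TABLE sample AS SELECT * FROM read_parquet('sample.parquet')"
)
print(con.execute("SELECT COUNT(*) FROM sample").fetchone()[0])
```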

The CORAL schema bridges ENIGMA's original JSON-based type definitions with modern LinkML semantics, providing comprehensive metadata management for environmental microbiology datasets.

The NMDC Confusion

This is the key source of confusion: The nmdc_core tables in KBase/BERDL are NOT the same as the classes in nmdc-schema.

What nmdc-schema Defines

The canonical NMDC schema (github.com/microbiomedata/nmdc-schema) defines the data model for NMDC's MongoDB backend. When NMDC stores a study or biosample, it uses these class definitions.

nmdc-schema defines:
  Study, Biosample, DataObject, WorkflowExecution, ...
     ↓
  Stored in NMDC MongoDB collections:
    biosample_set, study_set, data_object_set, ...
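
For a concrete look at those collections, the public NMDC runtime API serves them over HTTP; the endpoint path and response shape below are assumptions about that API, so treat this as a sketch:

```python
import json
import urllib.request

# Endpoint path and "resources" response key are assumptions.
url = "https://api.microbiomedata.org/nmdcschema/biosample_set?max_page_size=1"
page = json.load(urllib.request.urlopen(url))
print(json.dumps(page["resources"][0], indent=2)[:500])
```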

What nmdc_core in BERDL Contains

The tables in bridge_schemas/schema/kbase/nmdc_core.linkml.yaml represent a 2nd-order ingest of NMDC data into the KBase/BERDL data lake. These are derived/computed tables, not the raw NMDC entities:

NMDC MongoDB                    BERDL Data Lake
┌────────────────┐              ┌─────────────────────────────┐
│ biosample_set  │              │ annotation_terms_unified    │
│ study_set      │  ─── ETL ──► │ go_terms                    │
│ data_object_set│              │ metabolomics_gold           │
│ workflow_exec  │              │ embeddings_v1               │
└────────────────┘              │ go_hierarchy_flat           │
                                └─────────────────────────────┘

The BERDL tables are:

  • Aggregated: terms unified across studies
  • Pre-computed: GO hierarchy flattened for efficient queries (see the sketch below)
  • Enhanced: embeddings computed from raw data
  • Restructured: optimized for analytical queries, not storage
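
To make the "pre-computed" point concrete: with a flattened ancestor closure, "all ancestors of a term" becomes a single filter rather than a recursive graph walk. A toy illustration, with column names assumed:

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE go_hierarchy_flat (term_id TEXT, ancestor_id TEXT)")
con.execute("""
    INSERT INTO go_hierarchy_flat VALUES
        ('GO:0006096', 'GO:0006006'),
        ('GO:0006096', 'GO:0008152')
""")
# One lookup replaces a recursive walk up the ontology graph.
ancestors = con.execute(
    "SELECT ancestor_id FROM go_hierarchy_flat WHERE term_id = 'GO:0006096'"
).fetchall()
print(ancestors)
```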

Table Name Confusion

| BERDL Table | What It Contains | NOT the same as |
|---|---|---|
| annotation_terms_unified | Merged GO/KEGG/EC terms across all NMDC samples | Any single NMDC class |
| metabolomics_gold | Mass spec features from GOLD-registered samples | DataObject or WorkflowExecution |
| go_hierarchy_flat | Pre-computed GO ancestor closure | OntologyClass |
| embeddings_v1 | 256-dim sample embeddings for similarity | No equivalent |
| studies | Study metadata with GOLD linkages | Study (partial overlap) |

The _gold suffix on tables like metabolomics_gold indicates the data comes from GOLD-registered samples, not that they link to the GOLD database.

Consequences of Schema Introspection

What We Capture (via BERDL API)

# Introspected schema (impoverished)
classes:
  AnnotationTermsUnified:
    attributes:
      source:
        range: string
      term_id:
        range: string
      name:
        range: string

What Canonical Schemas Provide

# Canonical schema (rich)
classes:
  AnnotationTermsUnified:
    description: >-
      Unified annotation terms across sources (GO, KEGG, EC, COG, MetaCyc).
      Provides a single interface for querying functional annotations...

      TOTAL TERMS: 67,353 across all sources
    attributes:
      source:
        range: AnnotationSource  # Enum, not string!
        required: true
        description: >-
          Source ontology/database for this term. Determines ID format...
      term_id:
        identifier: true
        range: string
        required: true
        pattern: "^(GO:\\d{7}|K\\d{5}|\\d+\\.\\d+\\.\\d+\\.\\d+|...)$"

Information Lost

| Aspect | Introspected | Canonical |
|---|---|---|
| Descriptions | None or minimal | Rich, contextual |
| Enumerations | All strings | Defined value sets |
| Patterns | None | Regex validation |
| Required fields | Sometimes | Explicit |
| Identifiers | Guessed | Declared |
| Foreign keys | API-dependent | Logical relationships |
| Inheritance | None | Class hierarchies |
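
One rough way to quantify this gap is to count enriched elements in each schema with SchemaView; the file names below are illustrative:

```python
from linkml_runtime import SchemaView

for label, path in [("introspected", "nmdc_core.linkml.yaml"),
                    ("canonical", "nmdc.yaml")]:
    sv = SchemaView(path)
    slots = sv.all_slots().values()
    described = sum(1 for s in slots if s.description)
    patterned = sum(1 for s in slots if s.pattern)
    print(f"{label}: {described} described slots, "
          f"{patterned} patterns, {len(sv.all_enums())} enums")
```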

Current Mitigation: Manual Curation

For critical schemas, we manually curate descriptions after initial introspection:

  • kbase_ke_pangenome.linkml.yaml - 38 curated descriptions
  • nmdc_core.linkml.yaml - 79 curated descriptions

Warning: Do not regenerate these schemas—curation will be lost.
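
One possible way to soften this constraint (a sketch, not an existing repository workflow) is to re-apply curated descriptions onto a freshly introspected schema before committing it; file names here are illustrative:

```python
import yaml

with open("nmdc_core.linkml.yaml") as f:        # curated version
    curated = yaml.safe_load(f)
with open("nmdc_core.introspected.yaml") as f:  # hypothetical fresh output
    fresh = yaml.safe_load(f)

for cname, cls in fresh.get("classes", {}).items():
    old = curated.get("classes", {}).get(cname, {})
    if old.get("description"):
        cls["description"] = old["description"]
    for aname, attr in (cls.get("attributes") or {}).items():
        old_attr = (old.get("attributes") or {}).get(aname, {})
        if old_attr.get("description"):
            attr["description"] = old_attr["description"]

with open("nmdc_core.merged.yaml", "w") as f:
    yaml.safe_dump(fresh, f, sort_keys=False)
```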

Short-term: Document the Gap

This document serves to clarify:

  1. Canonical schemas exist but are not being used
  2. BERDL tables ≠ NMDC schema classes
  3. The "2nd-order ingest" nature of NMDC data in BERDL

Medium-term: Schema Alignment

Potential improvements:

  1. Import canonical enums: Use AnnotationSource, GoNamespace, etc. from upstream schemas instead of regenerating them as strings (see the sketch after this list)

  2. Link to canonical docs: Reference nmdc-schema and cdm-schema documentation for authoritative definitions

  3. Distinguish derived tables: Clearly mark which tables are derived/computed vs. direct representations of upstream entities
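
For item 1, the upstream value sets are directly enumerable with SchemaView (same w3id URL assumption as above); any enum listed this way could be imported rather than re-derived as a plain string:

```python
from linkml_runtime import SchemaView

sv = SchemaView("https://w3id.org/nmdc/nmdc")  # or a local checkout
for name, enum in list(sv.all_enums().items())[:5]:
    values = list(enum.permissible_values)[:5]
    print(f"{name}: {values} ...")
```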

Long-term: Hybrid Approach

Combine introspection with canonical schemas:

# Import canonical definitions
imports:
  - https://w3id.org/nmdc/nmdc-schema  # Enums, base types

# Extend with BERDL-specific tables
classes:
  EmbeddingsV1:
    description: BERDL-computed embeddings (not in nmdc-schema)
    ...

Data Catalog Alternatives

Could enterprise data catalog tools help bridge the gap between introspected and canonical schemas? Here's an assessment of the major open-source options:

Tool Comparison

| Tool | Schema Format | Semantic Enrichment | Complexity |
|---|---|---|---|
| OpenMetadata | JSON Schema | Collaborative curation, glossary | Medium |
| DataHub | Graph-based | Tags, terms, domains, lineage | High |
| Amundsen | Neo4j | Owners, tags, badges | Low |
| LinkML Registry | LinkML native | Discovery only | Low |

How Data Catalogs Could Help

A data catalog like OpenMetadata or DataHub could serve as a semantic overlay:

BERDL Tables        Data Catalog           Bridge-Schemas
┌──────────────┐    ┌──────────────────┐   ┌──────────────┐
│ Introspected │───►│ + descriptions   │──►│ Enriched     │
│ columns      │    │ + glossary terms │   │ LinkML       │
│ + types      │    │ + lineage        │   │ schemas      │
└──────────────┘    │ + ownership      │   └──────────────┘
                    └──────────────────┘

Potential benefits:

  • Collaborative curation (data producers add business context)
  • Column-level lineage (track NMDC MongoDB → BERDL ETL transformations)
  • Business glossary (define terms like "GOLD ecosystem", "MIxS package")
  • Data quality alerts and deprecation notices

Limitations for our use case:

  • Additional infrastructure to deploy and maintain
  • JSON Schema based, not LinkML (translation layer needed)
  • No awareness of canonical upstream schemas (cdm-schema, nmdc-schema)
  • Designed for enterprise data governance, not scientific schema alignment

DataHub vs OpenMetadata

DataHub offers stronger governance features:

  • Fine-grained column-level lineage
  • dbt semantic layer integration
  • Hierarchical business glossary

The trade-off: DataHub is more complex to operate.

OpenMetadata is simpler to deploy:

  • 100+ data source connectors
  • Collaborative annotation workflows
  • Built-in data quality framework

The trade-off: its lineage tracking is less mature.

Recommendation for Bridge-Schemas

Given that canonical LinkML schemas already exist, direct schema integration is more appropriate than adopting a data catalog:

  1. Skip data catalogs for this specific problem—they add complexity without solving the core issue (canonical schemas exist but aren't used)

  2. Import canonical schemas where they exist (nmdc-schema, cdm-schema)

  3. Create explicit mappings between BERDL tables and canonical classes (a validation sketch follows this list):

# mappings/berdl-to-canonical.yaml
berdl_tables:
  nmdc_core.studies:
    canonical_class: nmdc:Study
    notes: "Subset of fields, adds gold_study_identifiers linkage"

  nmdc_core.annotation_terms_unified:
    canonical_class: null  # Derived table, no canonical equivalent
    derived_from:
      - nmdc:FunctionalAnnotation
      - nmdc:GeneProduct

  nmdc_core.embeddings_v1:
    canonical_class: null  # BERDL-computed, no upstream equivalent

  4. Consider data catalogs later if collaborative curation becomes important across multiple teams or if lineage tracking for ETL pipelines is needed
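
A hedged sanity check for the mapping file in item 3: confirm that each declared canonical_class actually exists upstream (same w3id URL assumption as earlier):

```python
import yaml
from linkml_runtime import SchemaView

sv = SchemaView("https://w3id.org/nmdc/nmdc")
with open("mappings/berdl-to-canonical.yaml") as f:
    mappings = yaml.safe_load(f)

for table, m in mappings["berdl_tables"].items():
    cls = m.get("canonical_class")
    if cls is None:
        print(f"{table}: derived/BERDL-specific, no canonical class")
        continue
    name = cls.split(":", 1)[1]  # drop the 'nmdc:' prefix
    found = sv.get_class(name) is not None
    print(f"{table} -> {cls}: {'ok' if found else 'MISSING in nmdc-schema'}")
```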

See Also