Skip to content

Class: VocabRegistryV1

Vocabulary registry for multi-modal tokenization. Maps features to token IDs for embedding models.

TOKEN TYPES: - special: System tokens ([PAD], [CLS], [SEP], [MASK], [UNK]) - taxon: Taxonomic entities - go_term: Gene Ontology terms - compound: Chemical compounds - environmental: Environmental parameters

SPECIAL TOKENS (IDs 0-4): - 0: [PAD] - padding token - 1: [CLS] - classification token - 2: [SEP] - separator token - 3: [MASK] - mask token for MLM - 4: [UNK] - unknown token

URI: https://w3id.org/kbase/nmdc_core/VocabRegistryV1

classDiagram class VocabRegistryV1 click VocabRegistryV1 href "../VocabRegistryV1/" VocabRegistryV1 : entity_key VocabRegistryV1 : entity_type VocabRegistryV1 --> "0..1" TokenEntityType : entity_type click TokenEntityType href "../TokenEntityType/" VocabRegistryV1 : human_name VocabRegistryV1 : source_modality VocabRegistryV1 --> "0..1" SourceModality : source_modality click SourceModality href "../SourceModality/" VocabRegistryV1 : token_id

Slots

Name Cardinality and Range Description Inheritance
token_id 1
Integer
Unique token ID in vocabulary direct
entity_type 0..1
TokenEntityType
Type of entity this token represents direct
entity_key 0..1
String
Unique key for this entity within its type direct
human_name 0..1
String
Human-readable name for the token direct
source_modality 0..1
SourceModality
Data modality this token comes from direct

Identifier and Mapping Information

Annotations

property value
source_table vocab_registry_v1

Schema Source

  • from schema: https://w3id.org/kbase/nmdc_core

Mappings

Mapping Type Mapped Value
self https://w3id.org/kbase/nmdc_core/VocabRegistryV1
native https://w3id.org/kbase/nmdc_core/VocabRegistryV1

LinkML Source

Direct

name: VocabRegistryV1
annotations:
  source_table:
    tag: source_table
    value: vocab_registry_v1
description: 'Vocabulary registry for multi-modal tokenization. Maps features to token
  IDs for embedding models.

  TOKEN TYPES: - special: System tokens ([PAD], [CLS], [SEP], [MASK], [UNK]) - taxon:
  Taxonomic entities - go_term: Gene Ontology terms - compound: Chemical compounds
  - environmental: Environmental parameters

  SPECIAL TOKENS (IDs 0-4): - 0: [PAD] - padding token - 1: [CLS] - classification
  token - 2: [SEP] - separator token - 3: [MASK] - mask token for MLM - 4: [UNK] -
  unknown token'
from_schema: https://w3id.org/kbase/nmdc_core
attributes:
  token_id:
    name: token_id
    description: Unique token ID in vocabulary
    examples:
    - value: '0'
      description: '[PAD] token'
    - value: '1'
      description: '[CLS] token'
    - value: '2'
      description: '[SEP] token'
    - value: '3'
      description: '[MASK] token'
    - value: '4'
      description: '[UNK] token'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    identifier: true
    domain_of:
    - VocabRegistryV1
    - SampleTokensV1
    range: integer
    required: true
    minimum_value: 0
  entity_type:
    name: entity_type
    description: Type of entity this token represents
    examples:
    - value: special
      description: System tokens
    - value: taxon
      description: Taxonomic entities
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    domain_of:
    - VocabRegistryV1
    range: TokenEntityType
  entity_key:
    name: entity_key
    description: Unique key for this entity within its type
    examples:
    - value: '[PAD]'
    - value: '[CLS]'
    - value: '[SEP]'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    domain_of:
    - VocabRegistryV1
    range: string
  human_name:
    name: human_name
    description: Human-readable name for the token
    examples:
    - value: '[PAD]'
    - value: '[CLS]'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    domain_of:
    - VocabRegistryV1
    range: string
  source_modality:
    name: source_modality
    description: Data modality this token comes from
    examples:
    - value: system
      description: System/special tokens
    - value: taxonomy
      description: Taxonomic features
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    domain_of:
    - VocabRegistryV1
    range: SourceModality

Induced

name: VocabRegistryV1
annotations:
  source_table:
    tag: source_table
    value: vocab_registry_v1
description: 'Vocabulary registry for multi-modal tokenization. Maps features to token
  IDs for embedding models.

  TOKEN TYPES: - special: System tokens ([PAD], [CLS], [SEP], [MASK], [UNK]) - taxon:
  Taxonomic entities - go_term: Gene Ontology terms - compound: Chemical compounds
  - environmental: Environmental parameters

  SPECIAL TOKENS (IDs 0-4): - 0: [PAD] - padding token - 1: [CLS] - classification
  token - 2: [SEP] - separator token - 3: [MASK] - mask token for MLM - 4: [UNK] -
  unknown token'
from_schema: https://w3id.org/kbase/nmdc_core
attributes:
  token_id:
    name: token_id
    description: Unique token ID in vocabulary
    examples:
    - value: '0'
      description: '[PAD] token'
    - value: '1'
      description: '[CLS] token'
    - value: '2'
      description: '[SEP] token'
    - value: '3'
      description: '[MASK] token'
    - value: '4'
      description: '[UNK] token'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    identifier: true
    alias: token_id
    owner: VocabRegistryV1
    domain_of:
    - VocabRegistryV1
    - SampleTokensV1
    range: integer
    required: true
    minimum_value: 0
  entity_type:
    name: entity_type
    description: Type of entity this token represents
    examples:
    - value: special
      description: System tokens
    - value: taxon
      description: Taxonomic entities
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    alias: entity_type
    owner: VocabRegistryV1
    domain_of:
    - VocabRegistryV1
    range: TokenEntityType
  entity_key:
    name: entity_key
    description: Unique key for this entity within its type
    examples:
    - value: '[PAD]'
    - value: '[CLS]'
    - value: '[SEP]'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    alias: entity_key
    owner: VocabRegistryV1
    domain_of:
    - VocabRegistryV1
    range: string
  human_name:
    name: human_name
    description: Human-readable name for the token
    examples:
    - value: '[PAD]'
    - value: '[CLS]'
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    alias: human_name
    owner: VocabRegistryV1
    domain_of:
    - VocabRegistryV1
    range: string
  source_modality:
    name: source_modality
    description: Data modality this token comes from
    examples:
    - value: system
      description: System/special tokens
    - value: taxonomy
      description: Taxonomic features
    from_schema: https://w3id.org/kbase/nmdc_core
    rank: 1000
    alias: source_modality
    owner: VocabRegistryV1
    domain_of:
    - VocabRegistryV1
    range: SourceModality