KBase Pangenome Database

Pangenomic data organized around GTDB species clades. Contains gene clusters, genome metadata, taxonomy, and functional annotations. Built from GTDB r214 with pangenome analysis using PPanGGOLiN-style methods.

DATABASE STATISTICS:
- 85,000+ species pangenomes
- 1,011,650,903 genes
- 132,531,501 gene clusters
- 93,558,330 EggNOG annotations

TOP SPECIES BY GENOME COUNT:
- S. aureus: 14,526 genomes, 2,083 core genes
- K. pneumoniae: 14,240 genomes, 4,199 core genes
- S. enterica: 11,402 genomes, 3,639 core genes
- S. pneumoniae: 8,434 genomes, 1,475 core genes
- M. tuberculosis: 6,903 genomes, 3,741 core genes

TAXONOMIC DISTRIBUTION:
- Pseudomonadota: 117,619 genomes
- Bacillota: 67,072 genomes
- Actinomycetota: 26,949 genomes
- Bacteroidota: 20,615 genomes

URI: https://w3id.org/kbase/kbase_ke_pangenome

Name: kbase_ke_pangenome

Classes

Class Description
EggnogMapperAnnotations EggNOG-mapper v2 functional annotations for genes
GapmindPathways GapMind metabolic pathway completeness scores
Gene Gene/CDS within a genome
GeneCluster Ortholog cluster at species level
GeneGeneclusterJunction Junction table linking genes to gene clusters
Genome Individual genome assembly from NCBI RefSeq or GenBank
GenomeAni Pairwise Average Nucleotide Identity (ANI) between genomes within species cla...
GtdbMetadata Comprehensive GTDB metadata with quality metrics, genome statistics, and NCBI...
GtdbSpeciesClade GTDB species-level grouping with representative genome
GtdbTaxonomyR214v1 GTDB release 214 taxonomy with parsed rank assignments
Pangenome Summary statistics for a species pangenome
Sample Links genomes to NCBI BioProject and BioSample accessions
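
The class list describes GeneGeneclusterJunction as a junction table linking genes to gene clusters. A minimal sketch of how such a table might be traversed, assuming each junction row carries the gene_id and gene_cluster_id slots listed under Slots. The row values and the helper name genes_by_cluster are hypothetical and only illustrate the shape of the relationship:

```python
from collections import defaultdict

# Hypothetical junction rows; these identifiers are made up for illustration
# and do not correspond to real records in the database.
junction_rows = [
    {"gene_id": "gene_A", "gene_cluster_id": "cluster_1"},
    {"gene_id": "gene_B", "gene_cluster_id": "cluster_1"},
    {"gene_id": "gene_C", "gene_cluster_id": "cluster_2"},
]


def genes_by_cluster(rows):
    """Group gene_id values under the gene_cluster_id they are linked to."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["gene_cluster_id"]].append(row["gene_id"])
    return dict(grouped)


clusters = genes_by_cluster(junction_rows)
```

At the scale reported above (over a billion genes), a grouping like this would normally be done inside a query engine rather than in memory; the sketch only shows the many-to-one link from genes to clusters.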

Slots

Slot Description
accession NCBI assembly accession with RS_/GB_ prefix
AF Alignment Fraction - proportion of genome that aligned
ANI Average Nucleotide Identity as percentage
ANI_circumscription_radius ANI threshold for species membership
checkm_completeness CheckM genome completeness estimate
checkm_contamination CheckM contamination estimate
class Class name with c__ prefix
COG_category COG functional category code(s)
contig_count Number of contigs in assembly
corrected_mean_completness Completeness after pangenome-based correction
Description Functional description from seed ortholog
domain Domain rank (d__Archaea or d__Bacteria)
EC EC enzyme numbers, comma-separated
eggNOG_OGs Hierarchical EggNOG ortholog groups from root to most specific
evalue E-value of seed ortholog match (lower = better match)
faa_file_path_nersc Absolute path to protein FASTA file at NERSC filesystem
family Family name with f__ prefix
fna_file_path_nersc Absolute path to nucleotide FASTA file at NERSC filesystem
gc_percentage GC content percentage
gene_cluster_id Unique cluster identifier
gene_id Composite gene identifier constructed from NCBI nucleotide accession and CDS ...
genome1_id First genome in pairwise comparison
genome2_id Second genome in pairwise comparison
genome_id Genome accession with source prefix and version
genome_size Total genome size in base pairs
genus Genus name with g__ prefix
GOs GO terms, comma-separated
gtdb_representative Whether this genome is the GTDB species representative
GTDB_species GTDB species name with s__ prefix
gtdb_species_clade_id Species clade ID combining species name and representative genome
GTDB_taxonomy Full GTDB lineage from domain to genus (species not repeated)
gtdb_taxonomy Full GTDB taxonomy string
gtdb_taxonomy_id Full GTDB taxonomy lineage string for this genome
is_auxiliary Present in some but not all genomes
is_core Present in all (or nearly all) genomes
is_singleton Present in only one genome
KEGG_ko KEGG Orthology IDs
KEGG_Pathway KEGG pathway IDs, comma-separated
likelihood Log-likelihood from PPanGGOLiN Bayesian partitioning model
mean_initial_completeness Mean CheckM completeness of input genomes before filtering
mean_intra_species_AF Mean alignment fraction - proportion of genome aligning in ANI calculations
mean_intra_species_ANI Mean pairwise ANI among all genomes
metabolic_category Category - amino acid (aa) or carbon source
min_intra_species_AF Minimum alignment fraction observed between any two genomes
min_intra_species_ANI Minimum pairwise ANI observed
ncbi_bioproject_accession_id NCBI BioProject accession
ncbi_biosample NCBI BioSample accession
ncbi_biosample_accession_id NCBI BioSample accession with sample metadata
ncbi_biosample_id NCBI BioSample accession linking to sample metadata including isolation sourc...
ncbi_organism_name NCBI organism name including strain
ncbi_taxid NCBI taxonomy ID
no_aux_genome Number of auxiliary (shell) gene clusters
no_clustered_genomes_filtered Genomes passing quality filters used in pangenome analysis
no_clustered_genomes_unfiltered Total genomes assigned to species before quality filtering
no_core Number of core gene clusters
no_gene_clusters Total gene clusters (core + auxiliary + singleton)
no_genomes Number of genomes in pangenome analysis
no_singleton_gene_clusters Number of singleton clusters
number_of_iterations PPanGGOLiN model training iterations (0 = converged early)
order Order name with o__ prefix
pathway Pathway/compound name
PFAMs PFAM domain annotations, comma-separated
phylum Phylum name with p__ prefix
Preferred_name Gene symbol when available, "-" if none
protein_count Number of predicted protein-coding genes
protocol_id Analysis protocol version identifier
query_name Gene ID - links to Gene
representative_genome_id Reference genome for this species
score Bit score of seed ortholog alignment
score_category Categorical assessment of pathway completeness
seed_ortholog Best matching seed ortholog from eggNOG database
species Species name with s__ prefix
total_sum_of_loglikelihood_ratios Model fit quality metric
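
Several slots above follow prefix conventions: taxonomic ranks carry d__, p__, c__, o__, f__, g__ and s__ prefixes, and accession carries an RS_ or GB_ source prefix. The following is a small sketch of how those conventions could be parsed. It assumes that lineage strings are semicolon-delimited (the usual GTDB layout, not stated on this page); the function names and the placeholder accession in the usage lines are hypothetical:

```python
# Rank prefixes as documented in the slot descriptions.
RANK_PREFIXES = {
    "d__": "domain",
    "p__": "phylum",
    "c__": "class",
    "o__": "order",
    "f__": "family",
    "g__": "genus",
    "s__": "species",
}


def parse_gtdb_lineage(lineage):
    """Map rank names to their prefixed values.

    Assumes a semicolon-delimited lineage string; parts with an
    unrecognized prefix are skipped.
    """
    ranks = {}
    for part in lineage.split(";"):
        part = part.strip()
        rank = RANK_PREFIXES.get(part[:3])
        if rank is not None:
            ranks[rank] = part
    return ranks


def split_accession(accession):
    """Split an accession into (source prefix, remainder).

    Only the RS and GB prefixes named in the accession slot are accepted.
    """
    prefix, sep, rest = accession.partition("_")
    if not sep or prefix not in ("RS", "GB"):
        raise ValueError(f"unexpected accession prefix: {accession!r}")
    return prefix, rest


# Usage with illustrative values (the accession digits are a placeholder).
lineage = parse_gtdb_lineage("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria")
source, bare = split_accession("RS_GCF_000000000.1")
```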

Enumerations

Enumeration Description
CogFunctionalCategory COG (Clusters of Orthologous Groups) single-letter functional categories
GapmindMetabolicCategory GapMind metabolic pathway categories
GapmindScoreCategory GapMind pathway completeness score categories
GtdbDomain GTDB taxonomic domains
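
As an illustration of how an enumeration such as GtdbDomain might be mirrored in client code, here is a hedged sketch using Python's standard enum module. The two values are taken from the domain slot description (d__Archaea or d__Bacteria); the member names and the helper to_domain are hypothetical, and the permissible values defined in the schema itself may differ:

```python
from enum import Enum


class GtdbDomain(Enum):
    """Client-side mirror of the two domain values named in the domain slot."""

    ARCHAEA = "d__Archaea"
    BACTERIA = "d__Bacteria"


def to_domain(value):
    """Convert a raw domain string to GtdbDomain; raises ValueError if unknown."""
    return GtdbDomain(value)


domain = to_domain("d__Bacteria")
```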

Types

Type Description
Boolean A binary (true or false) value
Curie a compact URI
Date a date (year, month and day) in an idealized calendar
DateOrDatetime Either a date or a datetime
Datetime The combination of a date and time
Decimal A real number with arbitrary precision that conforms to the xsd:decimal speci...
Double A real number that conforms to the xsd:double specification
Float A real number that conforms to the xsd:float specification
Integer An integer
Jsonpath A string encoding a JSON Path
Jsonpointer A string encoding a JSON Pointer
Ncname Prefix part of CURIE
Nodeidentifier A URI, CURIE or BNODE that represents a node in a model
Objectidentifier A URI or CURIE that represents an object in the model
Sparqlpath A string encoding a SPARQL Property Path
String A character string
Time A time object represents a (local) time of day, independent of any particular...
Uri a complete URI
Uriorcurie a URI or a CURIE

Subsets

Subset Description