BioStride and EMPIAR Alignment Analysis

Note: This document was generated using Claude (Anthropic's AI assistant) through automated analysis of documentation and web sources. While efforts have been made to ensure accuracy, there may be errors or outdated information. Please verify critical details with official EMPIAR documentation.

Overview

This document provides a detailed analysis of the alignment between the BioStride schema and EMPIAR (Electron Microscopy Public Image Archive), examining how BioStride can integrate with and complement EMPIAR's role as the global repository for raw electron microscopy and related imaging data.

Introduction to EMPIAR

What is EMPIAR?

EMPIAR is the global public archive for raw electron microscopy images and related imaging data. Key characteristics:

  • Purpose: Archive raw EM images underpinning 3D reconstructions
  • Content: 2,385 entries, >6.24 PiB of data (as of 2025)
  • Scope: Extends beyond cryo-EM to volume EM, X-ray tomography, and CLEM
  • Access: REST API, Aspera/Globus downloads, CC0 licensing
  • Integration: Cross-references with EMDB, PDB, BioImage Archive

Data Types Supported

EMPIAR has evolved beyond traditional cryo-EM to support:

  1. Single Particle Cryo-EM
       • Movie stacks (multiframe micrographs)
       • Single frame micrographs
       • Particle stacks
       • Class averages

  2. Electron Tomography
       • Tilt series
       • Tomograms
       • Subtomograms

  3. Volume EM Techniques
       • FIB-SEM (Focused Ion Beam SEM)
       • SBF-SEM (Serial Block Face SEM)
       • Array tomography datasets

  4. X-ray Microscopy
       • Soft X-ray tomography
       • Hard X-ray tomography
       • Cryo-soft X-ray data

  5. Correlative Microscopy
       • CLEM (Correlative Light and Electron Microscopy)
       • Multimodal datasets
       • Aligned image stacks

Current Status (2024-2025)

  • Growth: Exponential data growth reaching petabyte scale
  • Expansion: Support for volume EM and X-ray techniques
  • Integration: Improved workflow uploads from Scipion
  • Standards: JSON schema-based metadata model
  • Automation: Header extraction from common formats

EMPIAR Data Model

JSON Schema Structure

EMPIAR uses a structured JSON schema for deposition metadata:

{
  "admin": {
    "title": "Dataset title",
    "authors": [...],
    "references": [...]
  },
  "imagesets": [
    {
      "name": "Micrographs",
      "category": "T1",  // T1=single frame, T2=multiframe
      "format": "MRC",
      "dimensions": {
        "width": 5760,
        "height": 4092
      },
      "pixel_spacing": 0.85,
      "num_images": 8500
    }
  ],
  "specimen": {
    "cs": 2.7,
    "electron_dose": 40.0,
    "gain_reference": "..."
  }
}

Image Categories

| Code | Category                   | BioStride Equivalent      |
|------|----------------------------|---------------------------|
| T1   | Micrographs - single frame | Image2D (single exposure) |
| T2   | Micrographs - multiframe   | DataFile (movie stack)    |
| T9   | Tilt series                | DataFile (collection)     |
| T10  | Class averages             | DataFile (processed)      |
| T13  | Reconstructed volumes      | Image3D                   |
| T14  | Subtomograms               | DataFile (extracted)      |
| OT   | Other                      | DataFile (generic)        |
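
The category table lends itself to a small lookup. The following is a minimal sketch, assuming the codes and equivalents listed above; the dictionary layout and the function name `biostride_class_for` are illustrative and not part of either schema.

```python
# Category codes and BioStride equivalents copied from the table above.
# The dictionary layout and function name are illustrative assumptions.
EMPIAR_CATEGORIES = {
    "T1": ("Micrographs - single frame", "Image2D"),
    "T2": ("Micrographs - multiframe", "DataFile"),
    "T9": ("Tilt series", "DataFile"),
    "T10": ("Class averages", "DataFile"),
    "T13": ("Reconstructed volumes", "Image3D"),
    "T14": ("Subtomograms", "DataFile"),
    "OT": ("Other", "DataFile"),
}


def biostride_class_for(category_code):
    """Return the BioStride class for an EMPIAR category code.

    Codes not listed in the table fall back to the generic "OT" row.
    """
    _, biostride_class = EMPIAR_CATEGORIES.get(
        category_code, EMPIAR_CATEGORIES["OT"]
    )
    return biostride_class
```

Falling back to `OT` mirrors the "Other" row of the table; a stricter importer might prefer to raise an error on unknown codes instead.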

BioStride-EMPIAR Alignment

Structural Mapping

| EMPIAR Concept   | BioStride Equivalent   | Alignment Notes                     |
|------------------|------------------------|-------------------------------------|
| Entry            | Study/Dataset          | Complete experimental dataset       |
| Imagesets        | DataFile collections   | ✅ Groups of related images         |
| Admin metadata   | Dataset attributes     | Title, authors, references          |
| Specimen info    | Sample + ExperimentRun | ✅ Sample and collection parameters |
| Image metadata   | Image attributes       | Dimensions, pixel size, format      |
| File hierarchy   | DataFile.file_path     | Directory structure preservation    |
| Cross-references | external_ids           | ✅ EMDB, PDB, BioStudies links      |

Metadata Alignment

Sample and Specimen

EMPIAR Specimen:

{
  "specimen": {
    "cs": 2.7,
    "electron_dose": 40.0,
    "gain_reference": "gain.mrc"
  }
}

BioStride Equivalent:

CryoEMInstrument:
  cs_corrector: false  # A Cs value alone does not imply a corrector; 2.7 mm is typical of an uncorrected objective lens

ExperimentRun:
  data_collection_strategy:
    total_dose: 40.0

DataFile:
  file_name: "gain.mrc"
  data_type: calibration

Imageset Organization

EMPIAR Imagesets:

{
  "imagesets": [{
    "name": "Movies",
    "category": "T2",
    "format": "MRC",
    "num_images": 8500,
    "frames_per_image": 40
  }]
}

BioStride Equivalent:

DataFile:
  file_name: "movies.tar"
  file_format: mrc
  data_type: micrograph
  metadata:
    empiar_category: "T2"
    image_count: 8500
    frames_per_image: 40
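
A converter for this imageset mapping might look like the sketch below. The function name, the dict representation of a DataFile, and the `CATEGORY_TO_DATA_TYPE` table are assumptions made for illustration; only the field names come from the examples above.

```python
# Hypothetical mapping from EMPIAR category code to a BioStride data_type value.
CATEGORY_TO_DATA_TYPE = {
    "T1": "micrograph",
    "T2": "micrograph",
    "T9": "tilt_series",
}


def imageset_to_datafile(imageset, file_name):
    """Build a DataFile-style dict from one EMPIAR imageset dict."""
    image_format = imageset.get("format")
    category = imageset.get("category")
    return {
        "file_name": file_name,
        "file_format": image_format.lower() if image_format else None,
        # None when this sketch has no mapping for the category
        "data_type": CATEGORY_TO_DATA_TYPE.get(category),
        "metadata": {
            "empiar_category": category,
            "image_count": imageset.get("num_images"),
            "frames_per_image": imageset.get("frames_per_image"),
        },
    }


movies = {
    "name": "Movies",
    "category": "T2",
    "format": "MRC",
    "num_images": 8500,
    "frames_per_image": 40,
}
data_file = imageset_to_datafile(movies, "movies.tar")
```

Keeping the original category code in `metadata` means the EMPIAR classification survives a round trip even when the BioStride `data_type` is coarser.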

Key Differences and Complementarities

1. Raw Data Focus

EMPIAR: Specialized for raw data

  • Optimized for large file storage
  • Direct download infrastructure
  • Minimal processing metadata

BioStride: Workflow context

  • Tracks processing history
  • Links raw to processed data
  • Captures parameter evolution

2. Multi-modal Support

EMPIAR: Expanding modality coverage

  • Volume EM (FIB-SEM, SBF-SEM)
  • X-ray tomography
  • CLEM datasets

BioStride: Unified multi-modal schema

  • Native support for FTIR, fluorescence
  • Integrated workflow across techniques
  • Common metadata model

3. Access Patterns

EMPIAR: Bulk data access

  • Aspera/Globus for large transfers
  • REST API for metadata
  • Volume Browser for visualization

BioStride: Metadata-centric access

  • Query by workflow stage
  • Sample lineage tracking
  • Processing parameter search

Integration Strategies

BioStride → EMPIAR Deposition

def prepare_empiar_deposition(study):
    """Generate EMPIAR deposition metadata (as a dict) from a BioStride study.

    extract_authors, extract_publications and map_to_empiar_category are
    helpers assumed to be defined elsewhere.
    """
    empiar_json = {
        "admin": {
            "title": study.title,
            "authors": extract_authors(study),
            "references": extract_publications(study)
        },
        "imagesets": [],
        "specimen": {}
    }

    # Map raw-data DataFiles to imagesets; other data types are skipped
    for data_file in study.data_files:
        if data_file.data_type in ('micrograph', 'tilt_series'):
            imageset = {
                "name": data_file.file_name,
                "category": map_to_empiar_category(data_file),
                # Guard against a missing file_format or metadata block
                "format": (data_file.file_format or "").upper(),
                "num_images": (data_file.metadata or {}).get('image_count')
            }
            empiar_json["imagesets"].append(imageset)

    # Extract specimen parameters (with several cryo-EM runs, the last one wins)
    for exp_run in study.experiment_runs:
        if exp_run.technique == 'cryo_em':
            empiar_json["specimen"]["electron_dose"] = exp_run.data_collection_strategy.total_dose

    return empiar_json
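
The function above calls `map_to_empiar_category` without defining it. One possible shape for that helper is sketched below; the selection rules (an explicitly recorded category wins, then `data_type`, then frame count) are assumptions for illustration, not documented BioStride or EMPIAR behaviour.

```python
from types import SimpleNamespace


def map_to_empiar_category(data_file):
    """Pick an EMPIAR category code for a DataFile-like object.

    Hypothetical rules: a category recorded in metadata takes precedence;
    otherwise the code is derived from data_type and frames_per_image.
    """
    metadata = data_file.metadata or {}
    recorded = metadata.get("empiar_category")
    if recorded:
        return recorded
    if data_file.data_type == "tilt_series":
        return "T9"
    if data_file.data_type == "micrograph":
        frames = metadata.get("frames_per_image", 1)
        # More than one frame per image means a multiframe movie stack
        return "T2" if frames and frames > 1 else "T1"
    return "OT"


movie = SimpleNamespace(data_type="micrograph", metadata={"frames_per_image": 40})
category = map_to_empiar_category(movie)
```

`SimpleNamespace` stands in for a real BioStride `DataFile` object here so the sketch runs without the schema classes installed.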

EMPIAR → BioStride Import

# BioStride representation of EMPIAR entry
Study:
  title: "Imported from EMPIAR-12345"

  data_files:
    - file_name: "micrographs/"
      file_format: mrc
      data_type: micrograph
      file_size_bytes: 5400000000000
      external_ids:
        empiar_id: "EMPIAR-12345"
      metadata:
        empiar_category: "T2"
        image_count: 8500
        download_method: "aspera"

  experiment_runs:
    - technique: cryo_em
      quality_metrics:
        completeness: 100  # All raw data present

Advanced Integration Features

1. Volume EM Support

BioStride can track volume EM workflows:

ExperimentRun:
  technique: volume_em
  metadata:
    modality: "FIB-SEM"
    slice_thickness: 5.0  # nm
    volume_dimensions: [1000, 1000, 500]  # voxels

DataFile:
  data_type: volume_em_stack
  external_ids:
    empiar_id: "EMPIAR-11000"

2. Correlative Microscopy

Track CLEM experiments:

Study:
  title: "Correlative light and electron microscopy"

  experiment_runs:
    - technique: fluorescence
      instrument_id: "light_microscope_001"
    - technique: cryo_em
      instrument_id: "titan_krios_001"

  data_files:
    - data_type: clem_registration
      metadata:
        alignment_method: "CLEM-Reg"
        modalities: ["fluorescence", "cryo_em"]

3. X-ray Tomography

Support for soft X-ray data:

ExperimentRun:
  technique: soft_xray_tomography
  instrument_id: "synchrotron_beamline"

DataFile:
  data_type: xray_tomogram
  external_ids:
    empiar_id: "EMPIAR-11500"

REST API Integration

Accessing EMPIAR from BioStride

import requests

class EMPIARConnector:
    """Connect BioStride to EMPIAR REST API"""

    BASE_URL = "https://www.ebi.ac.uk/empiar/api"
    TIMEOUT = 30  # seconds; avoids hanging on an unresponsive server

    def _get(self, path):
        """GET a path under BASE_URL and return the decoded JSON body"""
        response = requests.get(f"{self.BASE_URL}/{path}", timeout=self.TIMEOUT)
        response.raise_for_status()  # surface HTTP errors instead of parsing an error page
        return response.json()

    def get_entry_metadata(self, empiar_id):
        """Retrieve EMPIAR entry metadata"""
        return self._get(f"entry/{empiar_id}")

    def check_entry_status(self, empiar_id):
        """Verify entry release status"""
        return self._get(f"entry_status/{empiar_id}")

    def get_emdb_linked_entries(self, emdb_id):
        """Find EMPIAR entries for EMDB map"""
        return self._get(f"search/emdb_id/{emdb_id}")
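
Before an identifier is placed into a request URL, it helps to normalize it. The sketch below assumes accessions have the form `EMPIAR-` followed by digits, as in the examples in this document; the function name and the accepted input variants are illustrative choices.

```python
import re

# Assumed accession shape: optional "EMPIAR-" prefix followed by digits only.
_EMPIAR_ID = re.compile(r"^(?:EMPIAR-)?(\d+)$", re.IGNORECASE)


def normalize_empiar_id(raw):
    """Return an accession in canonical 'EMPIAR-<digits>' form.

    Accepts a bare number, or a prefixed ID in any letter case.
    Raises ValueError for anything else.
    """
    match = _EMPIAR_ID.match(str(raw).strip())
    if match is None:
        raise ValueError(f"not an EMPIAR accession: {raw!r}")
    return f"EMPIAR-{match.group(1)}"
```

Rejecting malformed input early keeps typos from turning into confusing HTTP errors further down the call chain.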

Recommended Workflow

1. Data Collection Phase

BioStride: Track samples, instruments, parameters
         ↓
   Collect raw data (movies, micrographs)
         ↓
   Store locally with BioStride metadata

2. Processing Phase

BioStride: Document processing workflows
         ↓
   Generate processed data
         ↓
   Track quality metrics

3. Deposition Preparation

BioStride: Organize raw data for EMPIAR
         ↓
   Generate EMPIAR JSON metadata
         ↓
   Prepare data hierarchy

4. Archive Submission

EMPIAR: Deposit raw data
EMDB: Deposit 3D maps
PDB: Deposit atomic models
         ↓
   Update BioStride with accession IDs

5. Public Access

BioStride: Maintain complete workflow record
EMPIAR: Serve raw data to community
         ↓
   Enable reprocessing and validation
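
The "Update BioStride with accession IDs" step in phase 4 can be sketched as follows. The study is represented as a plain dict, and `record_accessions` is a hypothetical helper; only the `external_ids` field and the `empiar_id` key appear elsewhere in this document, so the other key names are assumptions.

```python
def record_accessions(study, **accessions):
    """Store archive accession IDs under study['external_ids'].

    Each keyword such as empiar="EMPIAR-12345" is stored as 'empiar_id'.
    Empty or None values are skipped so pending depositions leave no entry.
    """
    external_ids = study.setdefault("external_ids", {})
    for archive, accession in accessions.items():
        if accession:
            external_ids[f"{archive}_id"] = accession
    return study


study = {"title": "Example study"}
# The EMDB deposition is still pending in this example, so it is passed as None.
record_accessions(study, empiar="EMPIAR-12345", emdb=None)
```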

Future Opportunities

Technical Developments

  1. Automated Deposition
       • Direct EMPIAR submission from BioStride
       • Metadata validation before submission
       • Status tracking integration

  2. Enhanced Metadata
       • Richer processing provenance
       • Multi-modal experiment description
       • Sample preparation details

  3. Data Mining
       • Cross-archive queries
       • Workflow pattern analysis
       • Quality metric aggregation

Emerging Modalities

Support for new EMPIAR data types:

  • Time-resolved EM
  • In situ structural biology
  • Microcrystal electron diffraction
  • Correlative super-resolution data

Best Practices

For Data Producers

  1. Plan Early: Structure data for EMPIAR compatibility
  2. Track Metadata: Use BioStride for comprehensive documentation
  3. Organize Hierarchically: Maintain logical file organization
  4. Include Calibration: Deposit gain references and detector defect maps alongside the images

For Data Consumers

  1. Check References: Verify EMPIAR-EMDB-PDB links
  2. Review Metadata: Understand collection parameters
  3. Use APIs: Programmatic access for large studies
  4. Cite Properly: Acknowledge data sources

Validation and Quality

EMPIAR Requirements

  • Minimum metadata for reprocessing
  • Standard file formats (MRC, TIFF, DM4)
  • Organized directory structure
  • Clear imageset categorization

BioStride Enhancements

DataFile:
  validation:
    empiar_compliant: true
    format_valid: true
    metadata_complete: true
  quality_checks:
    - image_integrity
    - header_consistency
    - gain_reference_present
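
A pre-deposition check along these lines could be sketched as below. It covers only three of the points listed above (file format, imageset category, gain reference), represents DataFiles as plain dicts, and treats the format set as illustrative rather than an exhaustive list of what EMPIAR accepts; `check_deposition` is a hypothetical name.

```python
# Formats named in the requirements list above; illustrative, not exhaustive.
ACCEPTED_FORMATS = {"mrc", "tiff", "dm4"}


def check_deposition(data_files):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for data_file in data_files:
        name = data_file.get("file_name", "<unnamed>")
        if (data_file.get("file_format") or "").lower() not in ACCEPTED_FORMATS:
            problems.append(f"{name}: unsupported or missing file format")
        # Calibration files are not imagesets, so they need no category
        is_calibration = data_file.get("data_type") == "calibration"
        category = (data_file.get("metadata") or {}).get("empiar_category")
        if not is_calibration and not category:
            problems.append(f"{name}: no EMPIAR imageset category")
    if not any(f.get("data_type") == "calibration" for f in data_files):
        problems.append("no gain reference (calibration file) found")
    return problems
```

Returning a list of messages instead of a single boolean lets a depositor see every issue in one pass.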

Conclusion

BioStride and EMPIAR form a powerful complementary ecosystem for electron microscopy data management:

  • EMPIAR provides the definitive archive for raw EM data with:
      • Petabyte-scale storage infrastructure
      • Global accessibility
      • Expanding modality support
      • Community standard compliance

  • BioStride adds comprehensive workflow context through:
      • Sample-to-structure tracking
      • Multi-technique integration
      • Processing provenance
      • Flexible metadata capture

The optimal strategy involves:

  1. Using BioStride for active research and workflow management
  2. Depositing raw data to EMPIAR for long-term preservation
  3. Maintaining bidirectional links for complete provenance
  4. Leveraging both systems' strengths throughout the research lifecycle

This integration ensures that both the experimental process (BioStride) and the raw data products (EMPIAR) are properly documented, preserved, and made accessible to the global scientific community, enabling reproducibility, reanalysis, and methodological advancement in structural biology.