SODA-VEC: Scientific Literature Embeddings with VICReg Loss

Model Description

SODA-VEC is a sentence transformer model designed for scientific literature, trained on more than 26 million title-abstract pairs from PubMed Central (PMC). The model uses a custom VICReg-style loss configuration to learn robust representations of scientific text.

Key Features

  • Base Model: answerdotai/ModernBERT-base (768-dimensional embeddings)
  • Sequence Length: 512 tokens (optimized for memory efficiency)
  • Training Data: 26,209,161 title-abstract pairs from PMC
  • Loss Configuration: Custom VICReg loss with dot product, variance, and covariance regularization
  • Total Training: 550,000 steps (~3 epochs) with stable convergence
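
These specifications can be verified directly from the released checkpoint. A minimal check (repository name as given in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')
print(model.get_sentence_embedding_dimension())  # 768-dimensional embeddings
print(model.max_seq_length)                      # 512-token limit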

Model Performance

Training Results

Training remained stable throughout and converged to the following values:

  • Final Training Loss: 0.8022
  • Final Evaluation Loss: 0.8184
  • Training Duration: 2.915 days
  • Total Steps: 500,000 (resumed from step 100,000)
  • Evaluation Dataset: 50% of test set (132,369 samples)

Training Curves

The training loss progressed through three phases:

  1. Initial Phase (0-100k steps): Rapid decrease from ~0.95 to ~0.85-0.90
  2. Mid-Phase (100k-300k steps): Gradual improvement to ~0.82-0.85
  3. Final Phase (300k-500k steps): Stable convergence to ~0.80-0.82

The evaluation loss shows a clear downward trend from 0.89 to 0.8184, indicating excellent generalization.

Training History

Phase 1: Initial Training (0-106k steps)

  • Configuration: dot_std_cov loss (dot_loss + std_loss + cov_loss)
  • Learning Rate: 2e-4
  • Batch Size: 32 per device, 4 GPUs
  • Result: Training became unstable, with the loss exploding above 250 and the gradient norm going to NaN

Phase 2: Resumed Training (100k-550k steps)

  • Resume Point: Checkpoint-100000 (stable evaluation loss)
  • Configuration: Same dot_std_cov loss
  • Learning Rate: 8e-5 (reduced for stability)
  • Warmup Steps: 1,000
  • Max Grad Norm: 2.0 (increased for stability)
  • Evaluation: 50% of test dataset for efficiency
  • Result: Stable training with excellent convergence

Technical Specifications

Model Architecture

  • Base: ModernBERT-base (22 layers, 768 hidden size, 12 attention heads)
  • Pooling: Mean pooling
  • Embedding Dimension: 768
  • Max Sequence Length: 512 tokens
  • Vocabulary Size: 50,368 tokens
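
The mean pooling step can be illustrated with a short sketch. This is an assumption-level illustration using the plain transformers API; the released model already bundles the pooling module, so this only shows what mean pooling does:

import torch
from transformers import AutoModel, AutoTokenizer

# Load the base encoder (the released model adds a pooling module on top)
tokenizer = AutoTokenizer.from_pretrained('answerdotai/ModernBERT-base')
encoder = AutoModel.from_pretrained('answerdotai/ModernBERT-base')

batch = tokenizer(
    ["Machine learning approaches for drug discovery"],
    padding=True, truncation=True, max_length=512, return_tensors='pt',
)

with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state   # (batch, seq_len, 768)

# Mean pooling: average the token embeddings, ignoring padding positions
mask = batch['attention_mask'].unsqueeze(-1).float()         # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)                              # torch.Size([1, 768])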

Training Configuration

{
    "learning_rate": 8e-5,
    "warmup_steps": 1000,
    "max_grad_norm": 2.0,
    "weight_decay": 0.01,
    "batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": True,
    "lr_scheduler_type": "cosine",
    "max_steps": 500000,
    "eval_steps": 10000,
    "save_steps": 50000
}
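
These values map directly onto the Hugging Face training arguments used by Sentence Transformers v3. The sketch below is a hedged reconstruction, not the actual training script (the output directory name is illustrative):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir='soda-vec-dot-std-cov-losses',
    learning_rate=8e-5,
    warmup_steps=1000,
    max_grad_norm=2.0,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    fp16=True,                    # assumes GPU training (4 GPUs were used here)
    lr_scheduler_type='cosine',
    max_steps=500000,
    eval_strategy='steps',
    eval_steps=10000,
    save_steps=50000,
)

# Phase 2 was resumed from a saved checkpoint; with the Trainer API this is
# typically done via trainer.train(resume_from_checkpoint='checkpoint-100000').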

Loss Function

The model uses a custom VICReg loss configuration:

loss_config = "dot_std_cov"
# Components:
# - dot_loss: Encourages high cosine similarity between paired embeddings
# - std_loss: Variance regularization to prevent dimensional collapse
# - cov_loss: Covariance regularization to decorrelate features
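
A minimal sketch of how these three terms can be combined for a batch of paired embeddings, following the VICReg formulation. The term weights, margins, and epsilon below are illustrative assumptions; the exact implementation used for training is not published in this card:

import torch
import torch.nn.functional as F

def dot_std_cov_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Sketch of a VICReg-style loss on paired title/abstract embeddings.

    z_a, z_b: (batch_size, dim) embeddings of the two sides of each pair.
    """
    # dot_loss: pull paired embeddings together (high cosine similarity)
    dot_loss = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

    # std_loss: keep each dimension's std above 1 to prevent dimensional collapse
    def std_term(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return torch.relu(1.0 - std).mean()

    std_loss = std_term(z_a) + std_term(z_b)

    # cov_loss: penalize off-diagonal covariance to decorrelate features
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / z.shape[1]

    cov_loss = cov_term(z_a) + cov_term(z_b)

    return dot_loss + std_loss + cov_loss

# Example with random embeddings, just to show the expected shapes
loss = dot_std_cov_loss(torch.randn(32, 768), torch.randn(32, 768))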

Usage

Basic Usage

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

# Encode sentences
sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]
embeddings = model.encode(sentences)

# Compute similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])

Advanced Usage

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]

# Encode with custom parameters
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Compute the full pairwise similarity matrix between the embeddings
similarities = model.similarity(embeddings, embeddings)

Training Data

Dataset

  • Source: EMBO/soda-vec-data-full_pmc_title_abstract
  • Size: 26,209,161 training pairs, 264,739 test pairs
  • Content: Title-abstract pairs from PubMed Central
  • Language: English
  • Domain: Scientific literature across all biomedical fields

Data Processing

  • Train/Test Split: 99% training, 1% testing
  • Evaluation Fraction: 50% of test set used for evaluation
  • Text Processing: Standard sentence transformer preprocessing
  • Tokenization: ModernBERT tokenizer with 512 token limit
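
A hedged sketch of how this split could be reproduced with the datasets library (the exact preprocessing script is not included here; the split name and random seed are assumptions):

from datasets import load_dataset

# Dataset name as listed in the Dataset section above
ds = load_dataset('EMBO/soda-vec-data-full_pmc_title_abstract', split='train')

# 99% / 1% train-test split (the seed used for the released split is unknown)
split = ds.train_test_split(test_size=0.01, seed=42)
train_ds, test_ds = split['train'], split['test']

# Use 50% of the test split for evaluation, as described above
eval_ds = test_ds.select(range(len(test_ds) // 2))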

Evaluation

The model was evaluated on a held-out test set of 132,369 samples (50% of total test set):

  • Final Evaluation Loss: 0.8184
  • Train-Eval Gap: Small (0.8022 training loss vs 0.8184 evaluation loss)
  • Generalization: Consistent performance on unseen title-abstract pairs
  • Stability: No signs of overfitting observed
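
Beyond the loss values, a quick qualitative check is that matched title-abstract pairs score higher than mismatched ones. The snippet below is an illustrative sanity check with made-up sentences, not the evaluation protocol used above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

titles = [
    "Machine learning approaches for drug discovery",
    "CRISPR-based genome editing in human cells",
]
abstracts = [
    "We review recent deep learning methods applied to small-molecule screening.",
    "We describe Cas9-mediated editing strategies and their off-target effects.",
]

title_emb = model.encode(titles)
abstract_emb = model.encode(abstracts)

# Diagonal entries (matched pairs) should be higher than off-diagonal ones
scores = model.similarity(title_emb, abstract_emb)
print(scores)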

Model Files

The model includes all necessary components:

  • config.json: Model configuration
  • model.safetensors: Model weights (596MB)
  • tokenizer.json: Tokenizer configuration
  • tokenizer_config.json: Tokenizer settings
  • sentence_bert_config.json: Sentence transformer configuration
  • modules.json: Model architecture
  • special_tokens_map.json: Special tokens mapping

Citation

If you use this model in your research, please cite:

@misc{soda-vec-dot-std-cov-losses,
  title={SODA-VEC: Scientific Literature Embeddings with VICReg Loss},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-dot-std-cov-losses}
}

License

This model is released under the MIT License. See the LICENSE file for details.

Acknowledgments

  • Base Model: Built on answerdotai/ModernBERT-base
  • Training Framework: Hugging Face Transformers and Sentence Transformers
  • Data Source: PubMed Central (PMC) via Hugging Face Datasets
  • Infrastructure: EMBO computational resources

Contact

For questions or issues, please contact the EMBO team or open an issue on the model repository.


This model combines the ModernBERT encoder with a tuned VICReg-style loss to provide embeddings tailored to scientific literature.
