SODA-VEC: Scientific Literature Embeddings with VICReg Loss
Model Description
SODA-VEC is a sentence transformer model designed for scientific literature, trained on more than 26 million title-abstract pairs from PubMed Central (PMC). The model uses a custom VICReg loss configuration to learn robust representations of scientific text.
Key Features
- Base Model: answerdotai/ModernBERT-base (768-dimensional embeddings)
- Sequence Length: 512 tokens (optimized for memory efficiency)
- Training Data: 26,209,161 title-abstract pairs from PMC
- Loss Configuration: Custom VICReg loss with dot product, variance, and covariance regularization
- Total Training: 550,000 steps (~3 epochs) with stable convergence
Model Performance
Training Results
Training remained stable throughout and converged to closely matched training and evaluation losses:
- Final Training Loss: 0.8022
- Final Evaluation Loss: 0.8184
- Training Duration: 2.915 days
- Total Steps: 500,000 (resumed from step 100,000)
- Evaluation Dataset: 50% of test set (132,369 samples)
Training Curves
The training loss decreased steadily across three phases:
- Initial Phase (0-100k steps): Rapid learning from ~0.95 to ~0.85-0.90
- Mid-Phase (100k-300k steps): Gradual improvement to ~0.82-0.85
- Final Phase (300k-500k steps): Stable convergence to ~0.80-0.82
The evaluation loss shows a clear downward trend from 0.89 to 0.8184, indicating excellent generalization.
Training History
Phase 1: Initial Training (0-106k steps)
- Configuration: dot_std_cov loss (dot_loss + std_loss + cov_loss)
- Learning Rate: 2e-4
- Batch Size: 32 per device, 4 GPUs
- Result: Training became unstable; the loss exploded above 250 and the gradient norm went to NaN
Phase 2: Resumed Training (100k-550k steps)
- Resume Point: Checkpoint-100000 (stable evaluation loss)
- Configuration: Same dot_std_cov loss
- Learning Rate: 8e-5 (reduced for stability)
- Warmup Steps: 1,000
- Max Grad Norm: 2.0 (increased for stability)
- Evaluation: 50% of test dataset for efficiency
- Result: Stable training with excellent convergence
Technical Specifications
Model Architecture
- Base: ModernBERT-base (22 layers, 768 hidden size, 12 attention heads)
- Pooling: Mean pooling
- Embedding Dimension: 768
- Max Sequence Length: 512 tokens
- Vocabulary Size: 50,368 tokens
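For reference, an equivalent module stack can be assembled from the base model using the standard sentence-transformers Transformer and Pooling modules, as sketched below. This mirrors the architecture described above and does not reproduce the trained weights; loading the released checkpoint directly (see Usage) is the normal way to use the model.

from sentence_transformers import SentenceTransformer, models

# Sketch: ModernBERT-base encoder followed by mean pooling over token embeddings.
word_embedding_model = models.Transformer("answerdotai/ModernBERT-base", max_seq_length=512)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])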
Training Configuration
{
  "learning_rate": 8e-5,
  "warmup_steps": 1000,
  "max_grad_norm": 2.0,
  "weight_decay": 0.01,
  "batch_size": 32,
  "gradient_accumulation_steps": 1,
  "fp16": true,
  "lr_scheduler_type": "cosine",
  "max_steps": 500000,
  "eval_steps": 10000,
  "save_steps": 50000
}
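As an illustration of how these hyperparameters can be wired up, the sketch below maps them onto sentence-transformers v3 training arguments. The output_dir and eval_strategy values are placeholders added here for completeness; this is a sketch, not the exact training script.

from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative mapping of the configuration above; output_dir and eval_strategy
# are placeholders, and batch_size corresponds to per_device_train_batch_size.
args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-dot-std-cov",
    learning_rate=8e-5,
    warmup_steps=1000,
    max_grad_norm=2.0,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    fp16=True,
    lr_scheduler_type="cosine",
    max_steps=500_000,
    eval_strategy="steps",
    eval_steps=10_000,
    save_steps=50_000,
)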
Loss Function
The model uses a custom VICReg loss configuration:
loss_config = "dot_std_cov"
# Components:
# - dot_loss: Encourages high cosine similarity between paired embeddings
# - std_loss: Variance regularization to prevent dimensional collapse
# - cov_loss: Covariance regularization to decorrelate features
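The training implementation is not reproduced here, but the sketch below shows one way a dot_std_cov objective of this kind can be composed. The function name, weighting coefficients, and epsilon are illustrative assumptions rather than the values used to train SODA-VEC; z_a and z_b are assumed to be mean-pooled title and abstract embeddings of shape (batch, 768).

import torch
import torch.nn.functional as F

def dot_std_cov_loss(z_a, z_b, std_weight=1.0, cov_weight=1.0, eps=1e-4):
    """Illustrative VICReg-style objective combining the three components above.

    The weights and eps are placeholders, not the trained configuration.
    """
    # dot_loss: pull paired title/abstract embeddings together
    dot_loss = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

    # std_loss: keep each dimension's standard deviation from collapsing to zero
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    std_loss = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # cov_loss: penalize off-diagonal covariance to decorrelate feature dimensions
    def off_diagonal_cov_penalty(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / z.shape[1]

    cov_loss = off_diagonal_cov_penalty(z_a) + off_diagonal_cov_penalty(z_b)

    return dot_loss + std_weight * std_loss + cov_weight * cov_loss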
Usage
Basic Usage
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')
# Encode sentences
sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]
embeddings = model.encode(sentences)

# Compute similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
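Note that model.similarity operates on embeddings and returns a similarity matrix (here 1x1), computed with the model's configured similarity function (cosine by default in sentence-transformers).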
Advanced Usage
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]

# Encode with custom parameters
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Compute the full pairwise similarity matrix between embeddings
similarities = model.similarity(embeddings, embeddings)
Training Data
Dataset
- Source: EMBO/soda-vec-data-full_pmc_title_abstract
- Size: 26,209,161 training pairs, 264,739 test pairs
- Content: Title-abstract pairs from PubMed Central
- Language: English
- Domain: Scientific literature across all biomedical fields
Data Processing
- Train/Test Split: 99% training, 1% testing
- Evaluation Fraction: 50% of test set used for evaluation
- Text Processing: Standard sentence transformer preprocessing
- Tokenization: ModernBERT tokenizer with 512 token limit
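A minimal sketch of this preparation with the datasets library is shown below. The split name, random seed, and the assumption that the dataset exposes a single train split are illustrative and may differ from the actual processing pipeline.

from datasets import load_dataset

# Load the PMC title-abstract pairs (split name is an assumption).
dataset = load_dataset("EMBO/soda-vec-data-full_pmc_title_abstract", split="train")

# 99% / 1% train-test split, mirroring the processing described above.
splits = dataset.train_test_split(test_size=0.01, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

# Use 50% of the test split for evaluation, as described above.
eval_ds = test_ds.shuffle(seed=42).select(range(len(test_ds) // 2))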
Evaluation
The model was evaluated on a held-out test set of 132,369 samples (50% of total test set):
- Final Evaluation Loss: 0.8184
- Training-Eval Alignment: Close (0.8022 training vs 0.8184 evaluation)
- Generalization: Strong performance on unseen data
- Stability: No overfitting observed
Model Files
The model includes all necessary components:
- config.json: Model configuration
- model.safetensors: Model weights (596 MB)
- tokenizer.json: Tokenizer configuration
- tokenizer_config.json: Tokenizer settings
- sentence_bert_config.json: Sentence transformer configuration
- modules.json: Model architecture (module stack)
- special_tokens_map.json: Special tokens mapping
Citation
If you use this model in your research, please cite:
@misc{soda-vec-dot-std-cov-losses,
  title={SODA-VEC: Scientific Literature Embeddings with VICReg Loss},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-dot-std-cov-losses}
}
License
This model is released under the MIT License. See the LICENSE file for details.
Acknowledgments
- Base Model: Built on answerdotai/ModernBERT-base
- Training Framework: Hugging Face Transformers and Sentence Transformers
- Data Source: PubMed Central (PMC) via Hugging Face Datasets
- Infrastructure: EMBO computational resources
Contact
For questions or issues, please contact the EMBO team or open an issue on the model repository.
This model represents a significant advancement in scientific literature embeddings, combining the power of ModernBERT with carefully tuned VICReg loss for optimal representation learning.