SODA-VEC: Scientific Literature Embeddings with VICReg Loss
Model Description
SODA-VEC is a sentence transformer model designed for scientific literature, trained on more than 26 million title-abstract pairs from PubMed Central (PMC). The model uses a custom VICReg loss configuration to learn robust representations of scientific text.
Key Features
- Base Model: answerdotai/ModernBERT-base (768-dimensional embeddings)
- Sequence Length: 512 tokens (optimized for memory efficiency)
- Training Data: 26,209,161 title-abstract pairs from PMC
- Loss Configuration: Custom VICReg loss with dot product, variance, and covariance regularization
- Total Training: 550,000 steps (~3 epochs) with stable convergence
Model Performance
Training Results
Training remained stable throughout and converged to closely matched training and evaluation losses:
- Final Training Loss: 0.8022
- Final Evaluation Loss: 0.8184
- Training Duration: 2.915 days
- Total Steps: 500,000 (resumed from step 100,000)
- Evaluation Dataset: 50% of test set (132,369 samples)
Training Curves
The training loss decreased steadily across three phases:
- Initial Phase (0-100k steps): Rapid learning from ~0.95 to ~0.85-0.90
- Mid-Phase (100k-300k steps): Gradual improvement to ~0.82-0.85
- Final Phase (300k-500k steps): Stable convergence to ~0.80-0.82
The evaluation loss shows a clear downward trend from 0.89 to 0.8184, indicating excellent generalization.
Training History
Phase 1: Initial Training (0-106k steps)
- Configuration: dot_std_cov loss (dot_loss + std_loss + cov_loss)
- Learning Rate: 2e-4
- Batch Size: 32 per device, 4 GPUs
- Result: Training became unstable; the loss exploded above 250 and the gradient norm went to NaN
Phase 2: Resumed Training (100k-550k steps)
- Resume Point: Checkpoint-100000 (stable evaluation loss)
- Configuration: Same dot_std_cov loss
- Learning Rate: 8e-5 (reduced for stability)
- Warmup Steps: 1,000
- Max Grad Norm: 2.0 (increased for stability)
- Evaluation: 50% of test dataset for efficiency
- Result: Stable training with excellent convergence
Technical Specifications
Model Architecture
- Base: ModernBERT-base (22 layers, 768 hidden size, 12 attention heads)
- Pooling: Mean pooling
- Embedding Dimension: 768
- Max Sequence Length: 512 tokens
- Vocabulary Size: 50,368 tokens
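For reference, an equivalent module stack can be assembled from the base model using the standard sentence-transformers Transformer and Pooling modules, as sketched below. This mirrors the architecture described above and does not reproduce the trained weights; loading the released checkpoint directly (see Usage) is the normal way to use the model.

from sentence_transformers import SentenceTransformer, models

# Sketch: ModernBERT-base encoder followed by mean pooling over token embeddings.
word_embedding_model = models.Transformer("answerdotai/ModernBERT-base", max_seq_length=512)
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),  # 768
    pooling_mode="mean",
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])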
Training Configuration
{
  "learning_rate": 8e-5,
  "warmup_steps": 1000,
  "max_grad_norm": 2.0,
  "weight_decay": 0.01,
  "batch_size": 32,
  "gradient_accumulation_steps": 1,
  "fp16": true,
  "lr_scheduler_type": "cosine",
  "max_steps": 500000,
  "eval_steps": 10000,
  "save_steps": 50000
}
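As an illustration of how these hyperparameters can be wired up, the sketch below maps them onto sentence-transformers v3 training arguments. The output_dir and eval_strategy values are placeholders added here for completeness; this is a sketch, not the exact training script.

from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative mapping of the configuration above; output_dir and eval_strategy
# are placeholders, and batch_size corresponds to per_device_train_batch_size.
args = SentenceTransformerTrainingArguments(
    output_dir="soda-vec-dot-std-cov",
    learning_rate=8e-5,
    warmup_steps=1000,
    max_grad_norm=2.0,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    fp16=True,
    lr_scheduler_type="cosine",
    max_steps=500_000,
    eval_strategy="steps",
    eval_steps=10_000,
    save_steps=50_000,
)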
Loss Function
The model uses a custom VICReg loss configuration:
loss_config = "dot_std_cov"
# Components:
# - dot_loss: Encourages high cosine similarity between paired embeddings
# - std_loss: Variance regularization to prevent dimensional collapse
# - cov_loss: Covariance regularization to decorrelate features
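The training implementation is not reproduced here, but the sketch below shows one way a dot_std_cov objective of this kind can be composed. The function name, weighting coefficients, and epsilon are illustrative assumptions rather than the values used to train SODA-VEC; z_a and z_b are assumed to be mean-pooled title and abstract embeddings of shape (batch, 768).

import torch
import torch.nn.functional as F

def dot_std_cov_loss(z_a, z_b, std_weight=1.0, cov_weight=1.0, eps=1e-4):
    """Illustrative VICReg-style objective combining the three components above.

    The weights and eps are placeholders, not the trained configuration.
    """
    # dot_loss: pull paired title/abstract embeddings together
    dot_loss = 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()

    # std_loss: keep each dimension's standard deviation from collapsing to zero
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    std_loss = F.relu(1.0 - std_a).mean() + F.relu(1.0 - std_b).mean()

    # cov_loss: penalize off-diagonal covariance to decorrelate feature dimensions
    def off_diagonal_cov_penalty(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (z.shape[0] - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / z.shape[1]

    cov_loss = off_diagonal_cov_penalty(z_a) + off_diagonal_cov_penalty(z_b)

    return dot_loss + std_weight * std_loss + cov_weight * cov_loss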
Usage
Basic Usage
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')
# Encode sentences
sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]
embeddings = model.encode(sentences)

# Compute similarity between the two embeddings
similarity = model.similarity(embeddings[0], embeddings[1])
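Note that model.similarity operates on embeddings and returns a similarity matrix (here 1x1), computed with the model's configured similarity function (cosine by default in sentence-transformers).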
Advanced Usage
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('EMBO/soda-vec-dot-std-cov-losses')

sentences = [
    "Machine learning approaches for drug discovery",
    "Deep learning methods in computational biology"
]

# Encode with custom parameters
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=True,
    convert_to_numpy=True
)

# Compute the full pairwise similarity matrix between embeddings
similarities = model.similarity(embeddings, embeddings)
Training Data
Dataset
- Source: EMBO/soda-vec-data-full_pmc_title_abstract
- Size: 26,209,161 training pairs, 264,739 test pairs
- Content: Title-abstract pairs from PubMed Central
- Language: English
- Domain: Scientific literature across all biomedical fields
Data Processing
- Train/Test Split: 99% training, 1% testing
- Evaluation Fraction: 50% of test set used for evaluation
- Text Processing: Standard sentence transformer preprocessing
- Tokenization: ModernBERT tokenizer with 512 token limit
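A minimal sketch of this preparation with the datasets library is shown below. The split name, random seed, and the assumption that the dataset exposes a single train split are illustrative and may differ from the actual processing pipeline.

from datasets import load_dataset

# Load the PMC title-abstract pairs (split name is an assumption).
dataset = load_dataset("EMBO/soda-vec-data-full_pmc_title_abstract", split="train")

# 99% / 1% train-test split, mirroring the processing described above.
splits = dataset.train_test_split(test_size=0.01, seed=42)
train_ds, test_ds = splits["train"], splits["test"]

# Use 50% of the test split for evaluation, as described above.
eval_ds = test_ds.shuffle(seed=42).select(range(len(test_ds) // 2))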
Evaluation
The model was evaluated on a held-out test set of 132,369 samples (50% of total test set):
- Final Evaluation Loss: 0.8184
- Training-Eval Alignment: Close (0.8022 training vs 0.8184 evaluation)
- Generalization: Strong performance on unseen data
- Stability: No overfitting observed
Model Files
The model includes all necessary components:
- config.json: Model configuration
- model.safetensors: Model weights (596 MB)
- tokenizer.json: Tokenizer configuration
- tokenizer_config.json: Tokenizer settings
- sentence_bert_config.json: Sentence transformer configuration
- modules.json: Model architecture (module stack)
- special_tokens_map.json: Special tokens mapping
Citation
If you use this model in your research, please cite:
@misc{soda-vec-dot-std-cov-losses,
  title={SODA-VEC: Scientific Literature Embeddings with VICReg Loss},
  author={EMBO},
  year={2024},
  url={https://huggingface.co/EMBO/soda-vec-dot-std-cov-losses}
}
License
This model is released under the MIT License. See the LICENSE file for details.
Acknowledgments
- Base Model: Built on answerdotai/ModernBERT-base
- Training Framework: Hugging Face Transformers and Sentence Transformers
- Data Source: PubMed Central (PMC) via Hugging Face Datasets
- Infrastructure: EMBO computational resources
Contact
For questions or issues, please contact the EMBO team or open an issue on the model repository.
This model represents a significant advancement in scientific literature embeddings, combining the power of ModernBERT with carefully tuned VICReg loss for optimal representation learning.