---
license: mit
language:
- en
---
# mmBERT Checkpoints
[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2509.06888)
[Models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[Code](https://github.com/jhu-clsp/mmBERT)
This repository contains the raw training checkpoints for the mmBERT models. Each model's folder has three subfolders, `pretrain`, `ext`, and `decay`, one per training phase.
These are [Composer](https://github.com/mosaicml/composer) checkpoints and contain all of the state needed to resume pre-training. See the [ModernBERT repository](https://github.com/AnswerDotAI/ModernBERT) for usage details.
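For orientation, here is a minimal sketch of pulling one model's phase folder with `huggingface_hub` and where Composer's `Trainer` would pick it up via `load_path`. The repository ID, the `mmbert-base` folder name, and the checkpoint filename are illustrative assumptions; the ModernBERT repository has the actual training setup.

```python
from huggingface_hub import snapshot_download

# Fetch one model's decay-phase checkpoint files.
# NOTE: the repo_id and "mmbert-base" folder name are assumptions; substitute
# the actual repository ID and the model/phase you want.
ckpt_dir = snapshot_download(
    repo_id="jhu-clsp/mmbert-checkpoints",   # assumed repository ID
    allow_patterns=["mmbert-base/decay/*"],  # one model, one training phase
)

# Resuming with Composer: point Trainer's `load_path` at a checkpoint file.
# The filename is a placeholder; Composer names checkpoints by epoch/batch.
# trainer = composer.Trainer(
#     model=model, train_dataloader=train_dl, max_duration="1ep",
#     load_path=f"{ckpt_dir}/mmbert-base/decay/<checkpoint>.pt",
# )
```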
## 🔗 Related Resources
- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) (2.3T tokens; streaming sketch after this list)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Paper**: [arXiv](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)
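Since the phase datasets above are large (up to 2.3T tokens for Phase 1), streaming them avoids a full download. A minimal sketch with the `datasets` library, assuming a `train` split (check the dataset cards for the actual configurations):

```python
from datasets import load_dataset

# Stream the Phase 1 pre-training data instead of downloading it in full.
# NOTE: the split name is an assumption; see the dataset card for real configs.
ds = load_dataset(
    "jhu-clsp/mmbert-pretrain-p1-fineweb2-langs",
    split="train",   # assumed split name
    streaming=True,
)

# Peek at a few records without materializing the dataset.
for example in ds.take(3):
    print(example)
```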
## Citation
```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
year={2025},
eprint={2509.06888},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.06888},
}
```