|
|
--- |
|
|
library_name: transformers |
|
|
pipeline_tag: translation |
|
|
language: |
|
|
- bg |
|
|
- ca |
|
|
- cs |
|
|
- cy |
|
|
- da |
|
|
- de |
|
|
- el |
|
|
- en |
|
|
- es |
|
|
- et |
|
|
- eu |
|
|
- fi |
|
|
- fr |
|
|
- ga |
|
|
- gl |
|
|
- hr |
|
|
- hu |
|
|
- it |
|
|
- lt |
|
|
- lv |
|
|
- mt |
|
|
- nl |
|
|
- nb |
|
|
- 'no' |
|
|
- nn |
|
|
- oc |
|
|
- pl |
|
|
- pt |
|
|
- ro |
|
|
- ru |
|
|
- sl |
|
|
- sk |
|
|
- sr |
|
|
- sv |
|
|
- uk |
|
|
- ast |
|
|
- an |
|
|
- ar |
|
|
- ja |
|
|
- ko |
|
|
- is |
|
|
- zh |
|
|
- bho |
|
|
- hi |
|
|
base_model: |
|
|
- BSC-LT/salamandra-7b |
|
|
--- |
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
# SalamandraTA Model Card |
|
|
|
|
|
SalamandraTA-7b-instruct-WMT25 is an expanded version of [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) specifically created for the WMT25 General Translation Shared Task submission and released solely for **non-commercial research purposes**. |
|
|
This version of SalamandraTA has undergone an additional round of continual pre-training aimed at expanding the model's language coverage to include the languages featured in the WMT 2025 shared task that were not supported by the previous version of SalamandraTA-7B-Instruct. It has also undergone an extra instruction fine-tuning step.
|
|
|
|
|
Note that we additionally included the English-to-Hindi direction in order to support better transfer to related languages such as Bhojpuri.
|
|
|
|
|
|
|
|
To extend the language coverage, we performed vocabulary adaptation by training a new tokenizer from scratch on a corpus that includes the original languages, along with additional monolingual text for the seven new languages not present in the original SALAMANDRA tokenizer.
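
As an illustration only, the sketch below shows this kind of vocabulary adaptation with the Hugging Face `train_new_from_iterator` API; the corpus files and training settings are hypothetical and not necessarily those used for this model.

```python
from transformers import AutoTokenizer

def corpus_iterator(paths, batch_size=1000):
    """Yield batches of raw text covering the original languages plus
    monolingual data for the seven newly added languages (hypothetical files)."""
    batch = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                batch.append(line.strip())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:
        yield batch

# Start from the original Salamandra tokenizer and retrain its vocabulary on
# the mixed corpus; 256,000 matches the vocabulary size reported further below.
base_tokenizer = AutoTokenizer.from_pretrained("BSC-LT/salamandra-7b")
new_tokenizer = base_tokenizer.train_new_from_iterator(
    corpus_iterator(["original_languages.txt", "new_languages.txt"]),  # hypothetical paths
    vocab_size=256_000,
)
new_tokenizer.save_pretrained("salamandra-wmt25-tokenizer")
```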
|
|
|
|
|
For further details, we refer to our paper, which is due to be published in August 2025.
|
|
|
|
|
> [!WARNING] |
|
|
> **DISCLAIMER:** This version of Salamandra is tailored exclusively for translation tasks. It lacks chat capabilities and has not been trained with any chat instructions. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Architecture |
|
|
|
|
|
| | | |
|
|
|-------------------------|:--------------| |
|
|
| Total Parameters | 7,768,117,248 | |
|
|
| Embedding Parameters | 1,048,576,000 | |
|
|
| Layers | 32 | |
|
|
| Hidden size | 4,096 | |
|
|
| Attention heads | 32 | |
|
|
| Context length | 8,192 | |
|
|
| Vocabulary size | 256,000 | |
|
|
| Precision | bfloat16 | |
|
|
| Embedding type | RoPE | |
|
|
| Activation Function | SwiGLU | |
|
|
| Layer normalization | RMS Norm | |
|
|
| Flash attention | ✅ | |
|
|
| Grouped Query Attention | ✅ | |
|
|
| Num. query groups | 8 | |
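
These values can be cross-checked against the released checkpoint. A minimal sketch using `AutoConfig` follows; the attribute names assume the Llama-style configuration used by the Salamandra family, and access to the repository may require accepting the research-only license.

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("LangTech-MT/salamandraTA-7b-instruct-WMT25")

# Fields that should match the table above (Llama-style naming).
print("Layers:           ", config.num_hidden_layers)
print("Hidden size:      ", config.hidden_size)
print("Attention heads:  ", config.num_attention_heads)
print("Num. query groups:", config.num_key_value_heads)
print("Vocabulary size:  ", config.vocab_size)
print("Context length:   ", config.max_position_embeddings)
```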
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
The Model can be used for general machine translation tasks in any of the languages included in the training data.
|
|
The model is released under a **Research-Only License** that regulates its use. By accessing or using the Model, you acknowledge that you have read, understood, and agree to be bound by the terms below. |
|
|
|
|
|
### 1. Purpose |
|
|
|
|
|
The Model is provided solely for **non-commercial research purposes**. This includes academic research, experimentation, benchmarking, and publication of research results. |
|
|
|
|
|
### 2. Restrictions |
|
|
|
|
|
You may **not**: |
|
|
|
|
|
- Use the Model or any derivative works for commercial purposes, including offering it as a service or integrating it into commercial products. |
|
|
- Use the Model to intentionally generate or disseminate harmful, deceptive, or unlawful content. |
|
|
- Use the Model in a manner that violates applicable laws, regulations, or the rights of others. |
|
|
- Attempt to relicense, sell, or redistribute the Model outside the scope of this License. |
|
|
|
|
|
### 3. Derivative Works |
|
|
|
|
|
You may create modifications or derivative works of the Model **for research purposes only**. Any such works must be released under terms that are at least as restrictive as this License. |
|
|
|
|
|
### 4. Attribution |
|
|
|
|
|
Any publications, reports, or other outputs obtained by using the Model must acknowledge "SalamandraTA-7b-instruct-WMT25, released under a Research-Only License by MT Group, Language Technologies Lab, Barcelona Supercomputing Center, 2025." |
|
|
|
|
|
### 5. No Warranty |
|
|
|
|
|
The Model is provided "as is", without warranties of any kind, express or implied, including but not limited to accuracy, safety, or fitness for a particular purpose. |
|
|
|
|
|
### 6. Limitation of Liability |
|
|
|
|
|
The Licensor is not responsible for any damages arising from the use or inability to use the Model. |
|
|
|
|
|
### 7. Termination |
|
|
|
|
|
This License will terminate automatically if you fail to comply with any of its terms. Upon termination, you must cease all use of the Model and destroy any copies in your possession. |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Hardware and Software |
|
|
|
|
|
### Training Framework |
|
|
|
|
|
SalamandraTA-7b-base was continually pre-trained using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html), |
|
|
which leverages PyTorch Lightning for efficient model training in highly distributed settings. |
|
|
|
|
|
SalamandraTA-7b-instruct was produced with [FastChat](https://github.com/lm-sys/FastChat). |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
|
|
All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and |
|
|
operated by Barcelona Supercomputing Center. |
|
|
|
|
|
The accelerated partition is composed of 1,120 nodes with the following specifications: |
|
|
- 4x Nvidia Hopper GPUs with 64GB HBM2 memory |
|
|
- 2x Intel Sapphire Rapids 8460Y+ at 2.3GHz with 32 cores each (64 cores per node)
|
|
- 4x NDR200 (BW per node 800Gb/s) |
|
|
- 512 GB of main memory (DDR5)
|
|
- 460 GB of NVMe storage
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## How to use |
|
|
|
|
|
You can use this model to translate in the following directions: |
|
|
|
|
|
- Czech to Ukrainian |
|
|
- Czech to German |
|
|
- Japanese to Chinese (Simplified) |
|
|
- English to Arabic |
|
|
- English to Bhojpuri |
|
|
- English to Chinese (Simplified) |
|
|
- English to Czech |
|
|
- English to Estonian |
|
|
- English to Icelandic |
|
|
- English to Italian |
|
|
- English to Japanese |
|
|
- English to Korean |
|
|
- English to Russian |
|
|
- English to Serbian (Latin script) |
|
|
- English to Ukrainian |
|
|
|
|
|
The instruction-following model uses the commonly adopted ChatML template: |
|
|
|
|
|
``` |
|
|
<|im_start|>system |
|
|
{SYSTEM PROMPT}<|im_end|> |
|
|
<|im_start|>user |
|
|
{USER PROMPT}<|im_end|> |
|
|
<|im_start|>assistant |
|
|
{MODEL RESPONSE}<|im_end|> |
|
|
<|im_start|>user |
|
|
[...] |
|
|
``` |
|
|
|
|
|
The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet. |
|
|
|
|
|
```python |
|
|
from datetime import datetime |
|
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
import transformers |
|
|
import torch |
|
|
|
|
|
model_id = "LangTech-MT/salamandraTA-7b-instruct-WMT25" |
|
|
|
|
|
# Map from human-readable names to model language labels |
|
|
lang_map = {
    'English': 'English',
    'Czech': 'Czech',
    'Ukrainian': 'Ukrainian',
    'German': 'German',
    'Japanese': 'Japanese',
    'Chinese': 'Chinese',
    'Arabic': 'Arabic',
    'Bhojpuri': 'Bhojpuri',
    'Estonian': 'Estonian',
    'Icelandic': 'Icelandic',
    'Italian': 'Italian',
    'Korean': 'Korean',
    'Russian': 'Russian',
    'Serbian': 'Serbian_Latin'
}
|
|
|
|
|
# Input parameters |
|
|
source = 'English' |
|
|
target = 'Japanese' |
|
|
sentence = "Once when I was six years old I saw a magnificent picture in a book." |
|
|
|
|
|
# Map to model's expected language names |
|
|
source_lang = lang_map[source] |
|
|
target_lang = lang_map[target] |
|
|
|
|
|
text = f"Translate the following text from {source_lang} into {target_lang}.\n{source_lang}: {sentence} \n{target_lang}:" |
|
|
|
|
|
# Load tokenizer and model |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_id, |
|
|
device_map="auto", |
|
|
torch_dtype=torch.bfloat16 |
|
|
) |
|
|
|
|
|
# Construct prompt using chat template |
|
|
message = [ { "role": "user", "content": text } ] |
|
|
date_string = datetime.today().strftime('%Y-%m-%d') |
|
|
|
|
|
prompt = tokenizer.apply_chat_template( |
|
|
message, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True, |
|
|
date_string=date_string |
|
|
) |
|
|
|
|
|
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt") |
|
|
input_length = inputs.shape[1] |
|
|
|
|
|
# Generate output |
|
|
outputs = model.generate( |
|
|
input_ids=inputs.to(model.device), |
|
|
max_new_tokens=400, |
|
|
early_stopping=True, |
|
|
num_beams=5 |
|
|
) |
|
|
|
|
|
# Decode and print output |
|
|
print(tokenizer.decode(outputs[0, input_length:], skip_special_tokens=True)) |
|
|
# 6歳の時、ある本で素晴らしい絵を見たことがある。 |
|
|
``` |
|
|
|
|
|
Using this template, each turn is preceded by the `<|im_start|>` delimiter and the role of the entity (`system` for the system prompt, `user` for content supplied by the user, or `assistant` for the model's responses), and is finished with the `<|im_end|>` token.
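
For reference, the prompt can also be assembled by hand. The sketch below approximates what `apply_chat_template` produces for a single user turn; note that the bundled chat template additionally injects a system prompt and the current date, which are omitted here.

```python
def build_chatml_prompt(user_message: str) -> str:
    """Build a single-turn ChatML prompt ending with the assistant header,
    so that generation continues as the model's response."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = build_chatml_prompt(
    "Translate the following text from English into Japanese.\n"
    "English: Once when I was six years old I saw a magnificent picture in a book. \n"
    "Japanese:"
)
```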
|
|
|
|
|
#### Machine Translation Prompt |
|
|
|
|
|
You can use the following prompt template: |
|
|
|
|
|
``` |
|
|
Translate the following text from {source} into {target}. |
|
|
{source}: {source sentence} |
|
|
{target}: |
|
|
``` |
|
|
<details> |
|
|
<summary>Show an example</summary> |
|
|
|
|
|
```python |
|
|
source = 'English' |
|
|
target = 'Japanese' |
|
|
source_sentence = "Once when I was six years old I saw a magnificent picture in a book." |
|
|
|
|
|
text = f"Translate the following text from {source} into {target}.\n{source}: {source_sentence} \n{target}:" |
|
|
# Expected output: 6歳の時、ある本で素晴らしい絵を見たことがある。
|
|
``` |
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
## Data |
|
|
|
|
|
### Pretraining data |
|
|
|
|
|
We first conducted continual pretraining on SalamandraTA-7B, focusing on expanding coverage to include additional language pairs featured in the WMT 2025 shared task. |
|
|
This stage included 393,507,678 sentence pairs across 14 languages and 15 translation directions, totaling approximately 27 billion tokens. |
|
|
|
|
|
For the language pairs featured in the WMT 2025 shared task, we primarily used data sourced from the [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/). |
|
|
To mitigate the risk of catastrophic forgetting, we subsampled 20M sentence pairs for each of the directions already covered during the continual pretraining of [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) (EN→CS, EN→ET, EN→RU, EN→UK).
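
As a purely illustrative sketch of such per-direction subsampling, reservoir sampling draws a uniform random sample from a large corpus file without loading it into memory (the file name and sampling procedure here are assumptions, not the exact pipeline used):

```python
import random

def reservoir_sample(path, k=20_000_000, seed=42):
    """Uniformly sample k lines from a (potentially huge) corpus file."""
    rng = random.Random(seed)
    sample = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i < k:
                sample.append(line)
            else:
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample

# Hypothetical usage: one tab-separated source-target file per direction.
en_cs_subset = reservoir_sample("en-cs.parallel.tsv")
```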
|
|
For English-to-Serbian (Latin script), we combined two sources from the continual pretraining of [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct): |
|
|
- English–Serbian (Latin script) data |
|
|
- English–Serbian (Cyrillic script) data, which we converted to Latin script using rule-based transliteration (see the sketch below)
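
Serbian Cyrillic maps deterministically onto the Latin script (three letters become digraphs), so the rule-based transliteration can be sketched as below; this is an illustration, not necessarily the exact script used to prepare the data.

```python
# Serbian Cyrillic → Latin mapping (30 letters; Љ, Њ, Џ become digraphs).
CYR2LAT = {
    "А": "A", "Б": "B", "В": "V", "Г": "G", "Д": "D", "Ђ": "Đ", "Е": "E",
    "Ж": "Ž", "З": "Z", "И": "I", "Ј": "J", "К": "K", "Л": "L", "Љ": "Lj",
    "М": "M", "Н": "N", "Њ": "Nj", "О": "O", "П": "P", "Р": "R", "С": "S",
    "Т": "T", "Ћ": "Ć", "У": "U", "Ф": "F", "Х": "H", "Ц": "C", "Ч": "Č",
    "Џ": "Dž", "Ш": "Š",
}
CYR2LAT.update({k.lower(): v.lower() for k, v in CYR2LAT.items()})

def transliterate(text: str) -> str:
    """Convert Serbian Cyrillic text to Latin script, character by character."""
    return "".join(CYR2LAT.get(ch, ch) for ch in text)

print(transliterate("Добар дан, свете!"))  # -> Dobar dan, svete!
```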
|
|
|
|
|
We also included the English-to-Hindi direction, which is not part of this year’s shared task, to support transfer learning for related Indic languages such as Bhojpuri. |
|
|
English-to-Hindi data was sourced from multiple corpora hosted by [OPUS](https://opus.nlpl.eu/). |
|
|
|
|
|
Click the expand button below to see the full list of language pairs and the data sources included in the pretraining. |
|
|
|
|
|
<details id="pre-data-sources"> |
|
|
<summary>Data Sources</summary> |
|
|
|
|
|
| Language Pair | Count | Source | |
|
|
|---------------|--------------------:|------------------| |
|
|
| en → ar | 74,471,769 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/), [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → zh | 66,544,557 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| cs → de | 50,552,593 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| en → ko | 27,378,827 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| en → hi | 27,273,478 | [CCMatrix](https://opus.nlpl.eu/CCMatrix), [MultiHPLT](https://opus.nlpl.eu/MultiHPLT), [NLLB](https://opus.nlpl.eu/NLLB), [Samanantar](https://opus.nlpl.eu/Samanantar) | |
|
|
| en → cs | 20,000,000 | Sampled from [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) pretraining data | |
|
|
| en → et | 20,000,000 | Sampled from [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) pretraining data | |
|
|
| en → ru | 20,000,000 | Sampled from [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) pretraining data | |
|
|
| en → uk | 20,000,000 | Sampled from [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) pretraining data | |
|
|
| en → ja | 19,109,592 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| en → sh | 15,107,525 | [SalamandraTA-7B-Instruct](https://huggingface.co/BSC-LT/salamandraTA-7b-instruct) pretraining data |
|
|
| ja → zh | 13,123,870 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| en → is | 10,281,190 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| cs → uk | 8,963,484 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| en → bho | 700,793 | [WMT 2025 Translation Task Training Data](https://www2.statmt.org/wmt25/mtdata/) | |
|
|
| **Total** | **393,507,678** | | |
|
|
|
|
|
</details> |
|
|
|
|
|
<details id="pre-references"> |
|
|
<summary>References</summary> |
|
|
|
|
|
- Aulamo, M., Sulubacak, U., Virpioja, S., & Tiedemann, J. (2020). OpusTools and Parallel Corpus Diagnostics. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference (pp. 3782–3789). European Language Resources Association. https://aclanthology.org/2020.lrec-1.467 |
|
|
- de Gibert, O., Nail, G., Arefyev, N., Bañón, M., van der Linde, J., Ji, S., Zaragoza-Bernabeu, J., Aulamo, M., Ramírez-Sánchez, G., Kutuzov, A., Pyysalo, S., Oepen, S., & Tiedemann, J. (2024). A new massive multilingual dataset for high-performance language technologies. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 1116–1128). Torino, Italia: ELRA and ICCL. |
|
|
- NLLB Team, Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Mejia Gonzalez, G., Hansanti, P., Hoffman, J., Jarrett, S., Sadagopan, K. R., Rowe, D., Spruit, S., Tran, C., Andrews, P., Ayan, N. F., Bhosale, S., Edunov, S., Fan, A., Gao, C., Goswami, V., Guzmán, F., Koehn, P., Mourachko, A., Ropers, C., Saleem, S., Schwenk, H., & Wang, J. (2022). No language left behind: Scaling human-centered machine translation (No. arXiv: 2207.04672). arXiv. https://arxiv.org/abs/2207.04672 |
|
|
- Ramesh, G., Doddapaneni, S., Bheemaraj, A., Jobanputra, M., AK, R., Sharma, A., Sahoo, S., Diddee, H., Mahalakshmi, J., Kakwani, D., Kumar, N., Pradeep, A., Nagaraj, S., Deepak, K., Raghavan, V., Kunchukuttan, A., Kumar, P., & Khapra, M. S. (2022). Samanantar: The largest publicly available parallel corpora collection for 11 Indic languages. Transactions of the Association for Computational Linguistics, 10, 145–162. https://doi.org/10.1162/tacl_a_00461 |
|
|
- Schwenk, H., Wenzek, G., Edunov, S., Grave, E., & Joulin, A. (2020). CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB (No. arXiv:1911.04944). arXiv. https://doi.org/10.48550/arXiv.1911.04944 |
|
|
|
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
### Instruction Tuning Data |
|
|
|
|
|
|
|
|
This model has been fine-tuned on ~51k instructions, primarily targeting machine translation performance for the language pairs featured in the WMT 2025 shared task.
|
|
The corpus used for the instruction tuning round was built to focus on paragraph-level translation, context-aware machine translation, and sentence-level translation. |
|
|
Paragraph-level data were constructed by concatenating adjacent sentences (randomly grouping 2, 3, or 4) from the same article or document in [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus), [NTREX](https://github.com/MicrosoftTranslator/NTREX), and [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) (see the sketch after this paragraph).
|
|
Serbian Cyrillic data from Flores+200 dev was transliterated into Serbian Latin.
|
|
In addition, we included data from [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) that we considered relevant to our tasks. |
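
A minimal sketch of the sentence-grouping step, assuming sentences from one document are already extracted in order; the grouping is applied jointly to the source and target sides so the resulting paragraph pairs stay aligned. The function name and settings are illustrative.

```python
import random

def make_paragraphs(src_sentences, tgt_sentences, seed=0):
    """Concatenate adjacent sentences from the same document into groups
    of 2-4, applied jointly to source and target to keep pairs aligned."""
    rng = random.Random(seed)
    pairs, i = [], 0
    while i < len(src_sentences):
        n = rng.choice([2, 3, 4])
        if i + n > len(src_sentences):
            break  # drop a trailing incomplete group
        pairs.append((
            " ".join(src_sentences[i:i + n]),
            " ".join(tgt_sentences[i:i + n]),
        ))
        i += n
    return pairs

src = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four."]
tgt = ["Oración uno.", "Oración dos.", "Oración tres.", "Oración cuatro."]
print(make_paragraphs(src, tgt))
```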
|
|
|
|
|
|
|
|
Click the expand button below to see the full list of tasks included in the finetuning data. |
|
|
|
|
|
<details id="instr-data-sources"> |
|
|
<summary>Data Sources</summary> |
|
|
|
|
|
|
|
|
| Language pair | Count | Task | Source | |
|
|
|-------------------|--------:|-----------------------------|------------------| |
|
|
| en → ru | 22,112 | General Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → zh | 10,521 | General Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| mixed | 10,000 | Multi-reference Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → ko | 2,782 | General Machine Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → de | 558 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → ru | 454 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → es | 431 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → nl | 417 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → fr | 369 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → it | 348 | Context-Aware Translation | [TowerBlocks](https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.2) | |
|
|
| en → zh | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| cs → de | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → cs | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → de | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → ja | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| ja → zh | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → ru | 250 | Paragraph-level | [News-Commentary](https://opus.nlpl.eu/News-Commentary/corpus/version/News-Commentary) | |
|
|
| en → ja | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → uk | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| cs → uk | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| ja → zh | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → zh | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → ko | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → et | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → is | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → sh | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → cs | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| cs → de | 58 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → ru | 50 | Paragraph-level | [NTREX](https://github.com/MicrosoftTranslator/NTREX) | |
|
|
| en → ar | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → bho | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → ja | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| cs → uk | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| ja → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → zh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → ko | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → et | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → is | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → sh | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → cs | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| cs → de | 30 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| en → ru | 21 | Paragraph-level | [Flores+200 dev](https://huggingface.co/datasets/openlanguagedata/flores_plus) | |
|
|
| **Total** | **50,841** | | | |
|
|
|
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
|
|
|
|
|
<details id="instr-references"> |
|
|
<summary>References</summary> |
|
|
|
|
|
- Alves, D. M., Pombal, J., Guerreiro, N. M., Martins, P. H., Alves, J., Farajian, A., Peters, B., Rei, R., Fernandes, P., Agrawal, S., Colombo, P., de Souza, J. G. C., & Martins, A. F. T. (2024). Tower: An open multilingual large language model for translation-related tasks. arXiv. https://arxiv.org/abs/2402.17733 |
|
|
- Federmann, C., Kocmi, T., & Xin, Y. (2022). NTREX-128 – News test references for MT evaluation of 128 languages. Proceedings of the First Workshop on Scaling Up Multilingual Evaluation, 21–24. Association for Computational Linguistics. https://aclanthology.org/2022.sumeval-1.4 |
|
|
|
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
|
|
|
## Evaluation |
|
|
|
|
|
Below are the evaluation results compared against the state-of-the-art [TOWER-V2 7B](https://huggingface.co/collections/Unbabel/tower-659eaedfe36e6dd29eb1805c) ([Rei, R., et al.](https://www2.statmt.org/wmt24/pdf/2024.wmt-1.12.pdf)), the [MADLAD400-7B-mt model](https://huggingface.co/google/madlad400-7b-mt) ([Kudugunta, S., et al.](https://arxiv.org/abs/2309.04662)), [NLLB 3.3B](https://huggingface.co/facebook/nllb-200-3.3B) ([NLLB Team, et al.](https://arxiv.org/abs/2207.04672)), and the SalamandraTA-2b and SalamandraTA-7b base models.
|
|
The evaluation was conducted using [MT-Lens](https://github.com/langtech-bsc/mt-evaluation) on the [WMT++ dataset](https://huggingface.co/datasets/google/wmt24pp) ([Deutsch, D., et al.](https://arxiv.org/html/2502.12404v1)), following the standard setting (beam search with beam size 5, limiting the translation length to 500 tokens). |
|
|
An exception is the English to Bhojpuri direction, which is not included in WMT++; for this direction, we used the [Flores+200 devtest set](https://huggingface.co/datasets/openlanguagedata/flores_plus). We report the following metrics:
|
|
|
|
|
<details> |
|
|
<summary>Click to show metrics details</summary> |
|
|
|
|
|
- `BLEU`: Sacrebleu implementation. Signature: `nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1`
|
|
- `ChrF`: Sacrebleu implementation. |
|
|
- `Comet`: Model checkpoint: "Unbabel/wmt22-comet-da". |
|
|
- `Comet-kiwi`: Model checkpoint: "Unbabel/wmt22-cometkiwi-da". |
|
|
- `Bleurt`: Model checkpoint: "lucadiliello/BLEURT-20". |
|
|
- `MetricX`: Model checkpoint: "google/metricx-23-xl-v2p0". |
|
|
- `MetricX-QE`: Model checkpoint: "google/metricx-23-qe-xl-v2p0". |
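
As a rough illustration, the string-based metrics above can be reproduced with the `sacrebleu` package (the hypothesis and reference lists below are placeholders; COMET, BLEURT, and MetricX instead require loading the model checkpoints listed above):

```python
import sacrebleu

# Placeholder system outputs and references for one translation direction.
hypotheses = ["Dobar dan, svete!", "Vidimo se sutra."]
references = [["Dobar dan, svete!", "Vidimo se sutra ujutru."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="13a")
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}  ChrF: {chrf.score:.2f}")
```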
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
<details> |
|
|
<summary>BLEU</summary> |
|
|
|
|
|
#### BLEU
|
|
|
|
|
This section presents the BLEU scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_bleu.png" target="_blank"> |
|
|
<img src="./images/eval_bleu.png" alt="BLEU"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>ChrF</summary> |
|
|
|
|
|
#### ChrF
|
|
|
|
|
This section presents the ChrF scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_chrf.png" target="_blank"> |
|
|
<img src="./images/eval_chrf.png" alt="ChrF"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Comet</summary> |
|
|
|
|
|
#### Comet |
|
|
|
|
|
This section presents the Comet scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_comet.png" target="_blank"> |
|
|
<img src="./images/eval_comet.png" alt="Comet"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>Comet-Kiwi</summary> |
|
|
|
|
|
#### Comet-Kiwi |
|
|
|
|
|
This section presents the Comet-Kiwi scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_comet-kiwi.png" target="_blank"> |
|
|
<img src="./images/eval_comet-kiwi.png" alt="Comet-Kiwi"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>BLEURT</summary> |
|
|
|
|
|
#### BLEURT |
|
|
|
|
|
This section presents the BLEURT scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_bleurt.png" target="_blank"> |
|
|
<img src="./images/eval_bleurt.png" alt="BLEURT"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>MetricX</summary> |
|
|
|
|
|
#### MetricX
|
|
|
|
|
This section presents the MetricX scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_metricx.png" target="_blank"> |
|
|
<img src="./images/eval_metricx.png" alt="MetricX"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
<details> |
|
|
<summary>MetricX-QE</summary> |
|
|
|
|
|
#### MetricX-QE |
|
|
|
|
|
This section presents the MetricX-QE scores on the WMT++ test set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_metricx-qe.png" target="_blank"> |
|
|
<img src="./images/eval_metricx-qe.png" alt="MetricX-QE"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
|
|
|
<details> |
|
|
<summary>The case of Bhojpuri</summary> |
|
|
|
|
|
#### The case of Bhojpuri |
|
|
|
|
|
This section presents the BLEU and ChrF scores on the Flores+200 devtest set. |
|
|
|
|
|
<a href="https://huggingface.co/LangTech-MT/salamandraTA-7b-instruct-WMT25/blob/main/images/eval_bho.png" target="_blank"> |
|
|
<img src="./images/eval_bho.png" alt="Bhojpuri"/> |
|
|
</a> |
|
|
|
|
|
</details> |
|
|
|
|
|
|
|
|
## Ethical Considerations and Limitations |
|
|
|
|
|
Detailed information on the work done to examine the presence of unwanted social and cognitive biases in the base model can be found in the [Salamandra-7B model card](https://huggingface.co/BSC-LT/salamandra-7b).
|
|
No specific analysis has yet been carried out to evaluate potential biases or limitations in translation
|
|
accuracy across different languages, dialects, or domains. However, we recognize the importance of identifying and addressing any harmful stereotypes, |
|
|
cultural inaccuracies, or systematic performance discrepancies that may arise in Machine Translation. As such, we plan to continue performing more analyses |
|
|
as we implement the necessary metrics and methods within our evaluation framework [MT-Lens](https://github.com/langtech-bsc/mt-evaluation). |
|
|
Note that the model has only undergone preliminary instruction tuning. |
|
|
We urge developers to consider potential limitations and conduct safety testing and tuning tailored to their specific applications. |
|
|
|
|
|
## Additional information |
|
|
|
|
|
### Author |
|
|
The Language Technologies Unit at the Barcelona Supercomputing Center.
|
|
|
|
|
### Contact |
|
|
For further information, please send an email to <[email protected]>. |
|
|
|
|
|
### Copyright |
|
|
Copyright(c) 2025 by Language Technologies Unit, Barcelona Supercomputing Center. |
|
|
|
|
|
### Funding |
|
|
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Modelos del Lenguaje. |
|
|
|
|
|
This work has been promoted and financed by the Government of Catalonia through the [Aina project](https://projecteaina.cat/). |
|
|
|
|
|
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215337. |
|
|
|
|
|
|
|
|
### Disclaimer |
|
|
Be aware that the model may contain biases or other unintended distortions. |
|
|
When third parties deploy systems or provide services based on this model, or use the model themselves, |
|
|
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, |
|
|
including those governing the use of Artificial Intelligence. |
|
|
|
|
|
The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use. |
|
|
|
|
|
### License |
|
|
Research-Only |
|
|
|
|
|
|
|
|
|