# EstLLM Prototype 0825
`llama-estllm-protype-0825` is the first artifact produced by the EstLLM project. This release is intended to evaluate the first prototype in a conversational, ChatbotArena-style setting on baromeeter.ai and thereby establish a baseline for future improvements.

The model was continuously pre-trained from Llama-3.1-8B on approximately 35B tokens, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO).
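As a rough illustration of what these post-training stages look like in code, the sketch below uses Hugging Face TRL (assuming a recent version); the checkpoint paths, dataset names, and hyperparameters are hypothetical placeholders, not the EstLLM project's actual recipe.

```python
# Hypothetical sketch of the SFT -> DPO stages with TRL. All paths and
# dataset names are invented for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "cpt-checkpoint"  # hypothetical result of continued pre-training on ~35B tokens
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 2: supervised fine-tuning on instruction-response pairs
sft = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(base),
    train_dataset=load_dataset("my-org/et-instructions", split="train"),  # hypothetical
    args=SFTConfig(output_dir="sft-checkpoint"),
    processing_class=tokenizer,
)
sft.train()
sft.save_model("sft-checkpoint")

# Stage 3: direct preference optimization on (prompt, chosen, rejected) triples
dpo = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("sft-checkpoint"),
    train_dataset=load_dataset("my-org/et-preferences", split="train"),  # hypothetical
    args=DPOConfig(output_dir="dpo-checkpoint", beta=0.1),
    processing_class=tokenizer,
)
dpo.train()
```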
## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/llama-estllm-protype-0825"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
# To use on Apple silicon, load the model as follows instead:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}  # "Do you speak Estonian?"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the EOS token so generation stops at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids includes the input tokens as well, so we decode only the new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
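The high-level `pipeline` API is a standard transformers alternative that applies the chat template and decoding in one call; the snippet below is a generic pattern rather than something from this card:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tartuNLP/llama-estllm-protype-0825",
    device_map="auto",
)
out = generator(
    [{"role": "user", "content": "Kas sa räägid eesti keelt?"}],  # "Do you speak Estonian?"
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
)
# the pipeline returns the full chat, with the assistant reply last
print(out[0]["generated_text"][-1]["content"])
```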
## Model Details

### Model Description
- Developed by: TartuNLP and TalTechNLP research groups
- Funded by: Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- Model type: Causal Language Model, Instruction-following
- Language(s) (NLP): Estonian, English
- License: Llama 3.1 Community License Agreement
- Finetuned from model: meta-llama/Llama-3.1-8B
## Evaluation

### Logits-based
Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.
### Generative

Every benchmark in this category is treated as a generative problem: evaluation is performed on the model's responses generated with temperature 0 (greedy decoding), not on its logits. The top score in each column is highlighted in bold, and the second-best score in bold italic. Rows are sorted in descending order by model parameter count, not by score. The test split of each dataset is used for evaluation unless noted otherwise.
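As an illustration of this setting, a temperature-0 response can be produced as follows, reusing `model` and `tokenizer` from the usage snippet above; this sketches only the decoding configuration, not the actual evaluation harness:

```python
def greedy_response(prompt: str, max_new_tokens: int = 256) -> str:
    """Deterministic, temperature-0 (greedy) response of the kind scored here."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # do_sample=False selects the argmax token at every step (greedy decoding)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```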
#### Instruction-following

Instruction-level strict accuracy is reported for IFEval-et; a sketch of this metric follows the table.
| Model (# parameters ↓) | IFEval-et |
|---|---|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | ***0.7705*** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
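A minimal sketch of the metric: every verifiable instruction attached to any prompt counts once, and a response must pass each check exactly (strict mode). The verifiers below are invented placeholders, not the official IFEval-et checks:

```python
# Each example pairs a model response with the verifiable instructions
# attached to its prompt; each instruction has a boolean verifier.
def instruction_level_strict_accuracy(examples):
    followed = total = 0
    for response, checks in examples:
        for check in checks:  # one verifier per instruction
            followed += bool(check(response))
            total += 1
    return followed / total

examples = [
    # instruction: "answer in at least five words" (not followed here)
    ("Liiga lühike vastus.", [lambda r: len(r.split()) >= 5]),
    # instructions: "answer in all caps" and "do not use commas" (both followed)
    ("VASTUS SUURTÄHTEDEGA", [str.isupper, lambda r: "," not in r]),
]
print(instruction_level_strict_accuracy(examples))  # 2 of 3 instructions -> 0.666...
```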
#### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode; Winogrande-et is evaluated in 3-shot mode. Exact-match accuracy is reported for every dataset. A sketch of the k-shot prompt construction and the metric follows the table.
| Model (# parameters ↓) | Winogrande-et | Trivia-et | Grammar-et | Inflection-et | Word-Meanings-et |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.916** | ***0.6458*** | **0.9689** |
| deepseek-ai/DeepSeek-V3-0324 | ***0.8042*** | 0.27 | 0.364 | 0 | 0 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | ***0.818*** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.817 | 0.5934 | 0.9529 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.764 | 0.367 | 0.9258 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5812 | ***0.425*** | 0.692 | 0.5188 | ***0.9569*** |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.529 | 0.2289 | 0.5326 |
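A minimal sketch of what k-shot prompting and exact-match scoring involve, assuming an invented prompt format ("Küsimus"/"Vastus", i.e. Question/Answer); the actual Winogrande-et prompts may differ:

```python
# Build a k-shot prompt from demonstration pairs (k = 0 or 3 above).
# The format below is a placeholder, not the real benchmark template.
def build_prompt(demos, question):
    shots = "".join(f"Küsimus: {q}\nVastus: {a}\n\n" for q, a in demos)
    return f"{shots}Küsimus: {question}\nVastus:"

# Exact match: a prediction scores 1 only if it equals the reference answer.
def exact_match_accuracy(predictions, references):
    pairs = list(zip(predictions, references))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)
```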
#### Translation

##### English to Estonian

| Model | wmt24pp (BLEU ↓) |
|---|---|
| BSC-LT/salamandraTA-7b-instruct | **0.2713** |
| tartuNLP/llama-estllm-prototype-0825 | ***0.264*** |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
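For reference, BLEU of this kind can be computed with sacrebleu as sketched below; the hypothesis and reference strings are placeholders, and the division by 100 maps sacrebleu's 0-100 output to the 0-1 scale used in the table:

```python
import sacrebleu

hypotheses = ["Ma räägin eesti keelt."]    # model translations (placeholder)
references = [["Ma räägin eesti keelt."]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # rescale from 0-100 to the 0-1 scale used in the table
```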
## Limitations

This is an early prototype. Accordingly, it has limitations in addition to the base Llama limitations:
- Relatively short context of 4096 tokens; the model is not expected to perform well beyond that window (see the truncation sketch after this list).
- Multi-turn conversations are not supported in this version.
- Trained with the original Llama 3.1 system prompt, which has a hard-coded knowledge cut-off date.
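Given the 4096-token window, one defensive pattern is to truncate the prompt at tokenization time so that prompt plus generation stays within the context. This sketch reuses `text`, `tokenizer`, and `model` from the usage example above; the budget split is an arbitrary assumption:

```python
# Reserve part of the 4096-token context for the generated tokens.
MAX_CONTEXT = 4096
MAX_NEW_TOKENS = 128

inputs = tokenizer(
    text,  # prompt string built with apply_chat_template above
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - MAX_NEW_TOKENS,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
```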
## Citation
TBA