# EstLLM Prototype 0825
`llama-estllm-protype-0825` is the first artifact produced by the EstLLM project. This release is intended to evaluate the first prototype in a conversational, ChatbotArena-style setting on baromeeter.ai and thereby establish a baseline for future improvements.

The model was continuously pre-trained from Llama-3.1-8B on approximately 35B tokens, followed by supervised fine-tuning (SFT) and direct preference optimization (DPO).
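As a rough illustration of what these post-training stages look like in code, the sketch below uses Hugging Face TRL (assuming a recent version); the checkpoint paths, dataset names, and hyperparameters are hypothetical placeholders, not the EstLLM project's actual recipe.

```python
# Hypothetical sketch of the SFT -> DPO stages with TRL. All paths and
# dataset names are invented for illustration.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

base = "cpt-checkpoint"  # hypothetical result of continued pre-training on ~35B tokens
tokenizer = AutoTokenizer.from_pretrained(base)

# Stage 2: supervised fine-tuning on instruction-response pairs
sft = SFTTrainer(
    model=AutoModelForCausalLM.from_pretrained(base),
    train_dataset=load_dataset("my-org/et-instructions", split="train"),  # hypothetical
    args=SFTConfig(output_dir="sft-checkpoint"),
    processing_class=tokenizer,
)
sft.train()
sft.save_model("sft-checkpoint")

# Stage 3: direct preference optimization on (prompt, chosen, rejected) triples
dpo = DPOTrainer(
    model=AutoModelForCausalLM.from_pretrained("sft-checkpoint"),
    train_dataset=load_dataset("my-org/et-preferences", split="train"),  # hypothetical
    args=DPOConfig(output_dir="dpo-checkpoint", beta=0.1),
    processing_class=tokenizer,
)
dpo.train()
```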
## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/llama-estllm-protype-0825"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
# To use on Apple silicon, load the model as follows instead:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}  # "Do you speak Estonian?"
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the EOS token so generation stops at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids includes the input tokens as well, so we decode only the new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
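The high-level `pipeline` API is a standard transformers alternative that applies the chat template and decoding in one call; the snippet below is a generic pattern rather than something from this card:

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="tartuNLP/llama-estllm-protype-0825",
    device_map="auto",
)
out = generator(
    [{"role": "user", "content": "Kas sa räägid eesti keelt?"}],  # "Do you speak Estonian?"
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
)
# the pipeline returns the full chat, with the assistant reply last
print(out[0]["generated_text"][-1]["content"])
```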
## Model Details

### Model Description
- Developed by: TartuNLP and TalTechNLP research groups
- Funded by: Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- Model type: Causal Language Model, Instruction-following
- Language(s) (NLP): Estonian, English
- License: Llama 3.1 Community License Agreement
- Finetuned from model: meta-llama/Llama-3.1-8B
## Evaluation

### Logits-based
Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.
### Generative

Every benchmark in this category is treated as a generative problem: evaluation is performed on the model's responses generated with temperature 0 (greedy decoding), not on its logits. The top score in each column is highlighted in bold, and the second-best score in bold italic. Rows are sorted in descending order by model parameter count, not by score. The test split of each dataset is used for evaluation unless noted otherwise.
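As an illustration of this setting, a temperature-0 response can be produced as follows, reusing `model` and `tokenizer` from the usage snippet above; this sketches only the decoding configuration, not the actual evaluation harness:

```python
def greedy_response(prompt: str, max_new_tokens: int = 256) -> str:
    """Deterministic, temperature-0 (greedy) response of the kind scored here."""
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    # do_sample=False selects the argmax token at every step (greedy decoding)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```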
#### Instruction-following

Instruction-level strict accuracy is reported for IFEval-et; a sketch of this metric follows the table.
| Model (# parameters ↓) | IFEval-et |
|---|---|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | ***0.7705*** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
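A minimal sketch of the metric: every verifiable instruction attached to any prompt counts once, and a response must pass each check exactly (strict mode). The verifiers below are invented placeholders, not the official IFEval-et checks:

```python
# Each example pairs a model response with the verifiable instructions
# attached to its prompt; each instruction has a boolean verifier.
def instruction_level_strict_accuracy(examples):
    followed = total = 0
    for response, checks in examples:
        for check in checks:  # one verifier per instruction
            followed += bool(check(response))
            total += 1
    return followed / total

examples = [
    # instruction: "answer in at least five words" (not followed here)
    ("Liiga lühike vastus.", [lambda r: len(r.split()) >= 5]),
    # instructions: "answer in all caps" and "do not use commas" (both followed)
    ("VASTUS SUURTÄHTEDEGA", [str.isupper, lambda r: "," not in r]),
]
print(instruction_level_strict_accuracy(examples))  # 2 of 3 instructions -> 0.666...
```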
#### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode; Winogrande-et is evaluated in 3-shot mode. Exact-match accuracy is reported for every dataset. A sketch of the k-shot prompt construction and the metric follows the table.
| Model (# parameters ↓) | Winogrande-et | Trivia-et | Grammar-et | Inflection-et | Word-Meanings-et |
|---|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.916** | ***0.6458*** | **0.9689** |
| deepseek-ai/DeepSeek-V3-0324 | ***0.8042*** | 0.27 | 0.364 | 0 | 0 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | ***0.818*** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.817 | 0.5934 | 0.9529 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.764 | 0.367 | 0.9258 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/llama-estllm-prototype-0825 | 0.5812 | ***0.425*** | 0.692 | 0.5188 | ***0.9569*** |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.529 | 0.2289 | 0.5326 |
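A minimal sketch of what k-shot prompting and exact-match scoring involve, assuming an invented prompt format ("Küsimus"/"Vastus", i.e. Question/Answer); the actual Winogrande-et prompts may differ:

```python
# Build a k-shot prompt from demonstration pairs (k = 0 or 3 above).
# The format below is a placeholder, not the real benchmark template.
def build_prompt(demos, question):
    shots = "".join(f"Küsimus: {q}\nVastus: {a}\n\n" for q, a in demos)
    return f"{shots}Küsimus: {question}\nVastus:"

# Exact match: a prediction scores 1 only if it equals the reference answer.
def exact_match_accuracy(predictions, references):
    pairs = list(zip(predictions, references))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs)
```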
#### Translation

##### English to Estonian

| Model | wmt24pp (BLEU ↓) |
|---|---|
| BSC-LT/salamandraTA-7b-instruct | **0.2713** |
| tartuNLP/llama-estllm-prototype-0825 | ***0.264*** |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |
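For reference, BLEU of this kind can be computed with sacrebleu as sketched below; the hypothesis and reference strings are placeholders, and the division by 100 maps sacrebleu's 0-100 output to the 0-1 scale used in the table:

```python
import sacrebleu

hypotheses = ["Ma räägin eesti keelt."]    # model translations (placeholder)
references = [["Ma räägin eesti keelt."]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score / 100)  # rescale from 0-100 to the 0-1 scale used in the table
```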
## Limitations

This is an early prototype. Accordingly, it has limitations in addition to the base Llama limitations:
- Relatively short context of 4096 tokens; the model is not expected to perform well beyond that window (see the truncation sketch after this list).
- Multi-turn conversations are not supported in this version.
- Trained with the original Llama 3.1 system prompt, which has a hard-coded knowledge cut-off date.
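Given the 4096-token window, one defensive pattern is to truncate the prompt at tokenization time so that prompt plus generation stays within the context. This sketch reuses `text`, `tokenizer`, and `model` from the usage example above; the budget split is an arbitrary assumption:

```python
# Reserve part of the 4096-token context for the generated tokens.
MAX_CONTEXT = 4096
MAX_NEW_TOKENS = 128

inputs = tokenizer(
    text,  # prompt string built with apply_chat_template above
    return_tensors="pt",
    truncation=True,
    max_length=MAX_CONTEXT - MAX_NEW_TOKENS,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
```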
## Citation
TBA