llama3.1-8b-spaetzle-v90

llama3.1-8b-spaetzle-v90 is a progressive merge of merges.

evaluation

German EQ-Bench v2_de: 69.93 (171/171). English (v2): 77.88 (171/171)

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	27.59
IFEval (0-Shot)	73.56
BBH (3-Shot)	32.76
MATH Lvl 5 (4-Shot)	13.37
GPQA (0-shot)	4.36
MuSR (0-shot)	11.15
MMLU-PRO (5-shot)	30.34

Model	AGIEval	TruthfulQA	Bigbench
llama3.1-8b-spaetzle-v90	42.05	57.2	44.75

AGIEval

Task	Version	Metric	Value		Stderr
agieval_aqua_rat	0	acc	24.02	±	2.69
		acc_norm	23.62	±	2.67
agieval_logiqa_en	0	acc	40.09	±	1.92
		acc_norm	39.78	±	1.92
agieval_lsat_ar	0	acc	22.17	±	2.75
		acc_norm	21.74	±	2.73
agieval_lsat_lr	0	acc	50.39	±	2.22
		acc_norm	45.29	±	2.21
agieval_lsat_rc	0	acc	64.31	±	2.93
		acc_norm	58.36	±	3.01
agieval_sat_en	0	acc	81.07	±	2.74
		acc_norm	73.79	±	3.07
agieval_sat_en_without_passage	0	acc	45.15	±	3.48
		acc_norm	38.83	±	3.40
agieval_sat_math	0	acc	40.91	±	3.32
		acc_norm	35.00	±	3.22

Average: 42.05%
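
The AGIEval average is consistent with an unweighted mean of the eight acc_norm values. This is inferred from the numbers in the table, not stated by the card. A minimal check:

```python
# acc_norm values copied from the AGIEval table, in row order.
acc_norm = [23.62, 39.78, 21.74, 45.29, 58.36, 73.79, 38.83, 35.00]

# Unweighted mean over the eight subtasks.
agieval_average = sum(acc_norm) / len(acc_norm)
print(f"{agieval_average:.2f}")  # prints 42.05
```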

TruthfulQA

Task	Version	Metric	Value		Stderr
truthfulqa_mc	1	mc1	39.66	±	1.71
		mc2	57.20	±	1.51

Average: 57.2%

Bigbench

Task	Version	Metric	Value		Stderr
bigbench_causal_judgement	0	multiple_choice_grade	58.42	±	3.59
bigbench_date_understanding	0	multiple_choice_grade	70.46	±	2.38
bigbench_disambiguation_qa	0	multiple_choice_grade	31.40	±	2.89
bigbench_geometric_shapes	0	multiple_choice_grade	33.43	±	2.49
		exact_str_match	0.00	±	0.00
bigbench_logical_deduction_five_objects	0	multiple_choice_grade	30.00	±	2.05
bigbench_logical_deduction_seven_objects	0	multiple_choice_grade	24.29	±	1.62
bigbench_logical_deduction_three_objects	0	multiple_choice_grade	56.00	±	2.87
bigbench_movie_recommendation	0	multiple_choice_grade	38.20	±	2.18
bigbench_navigate	0	multiple_choice_grade	50.20	±	1.58
bigbench_reasoning_about_colored_objects	0	multiple_choice_grade	69.50	±	1.03
bigbench_ruin_names	0	multiple_choice_grade	54.46	±	2.36
bigbench_salient_translation_error_detection	0	multiple_choice_grade	32.77	±	1.49
bigbench_snarks	0	multiple_choice_grade	65.19	±	3.55
bigbench_sports_understanding	0	multiple_choice_grade	50.30	±	1.59
bigbench_temporal_sequences	0	multiple_choice_grade	45.70	±	1.58
bigbench_tracking_shuffled_objects_five_objects	0	multiple_choice_grade	22.08	±	1.17
bigbench_tracking_shuffled_objects_seven_objects	0	multiple_choice_grade	17.03	±	0.90
bigbench_tracking_shuffled_objects_three_objects	0	multiple_choice_grade	56.00	±	2.87

Average: 44.75%
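
The Bigbench average matches the unweighted mean of the 18 multiple_choice_grade values, with the exact_str_match row for geometric_shapes left out. Again, this is an inference from the table rather than something the card states:

```python
# multiple_choice_grade values copied from the Bigbench table, in row order.
# The exact_str_match row (0.00) is deliberately excluded.
mc_grade = [
    58.42, 70.46, 31.40, 33.43, 30.00, 24.29,
    56.00, 38.20, 50.20, 69.50, 54.46, 32.77,
    65.19, 50.30, 45.70, 22.08, 17.03, 56.00,
]

bigbench_average = sum(mc_grade) / len(mc_grade)
print(f"{bigbench_average:.2f}")  # prints 44.75
```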

merge tree

The merge tree involves the following models:

NousResearch/Hermes-3-Llama-3.1-8B
Undi95/Meta-Llama-3.1-8B-Claude
Dampfinchen/Llama-3.1-8B-Ultra-Instruct
VAGOsolutions/Llama-3.1-SauerkrautLM-8b-Instruct
akjindal53244/Llama-3.1-Storm-8B
nbeerbower/llama3.1-gutenberg-8B
Undi95/Meta-Llama-3.1-8B-Claude
DiscoResearch/Llama3-DiscoLeo-Instruct-8B-v0.1
nbeerbower/llama-3-wissenschaft-8B-v2
Azure99/blossom-v5-llama3-8b
VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
princeton-nlp/Llama-3-Instruct-8B-SimPO
Locutusque/llama-3-neural-chat-v1-8b
Locutusque/Llama-3-Orca-1.0-8B
DiscoResearch/Llama3_DiscoLM_German_8b_v0.1_experimental
seedboxai/Llama-3-Kafka-8B-v0.2
VAGOsolutions/Llama-3-SauerkrautLM-8b-Instruct
nbeerbower/llama-3-wissenschaft-8B-v2
mlabonne/Daredevil-8B-abliterated-dpomix

The merge involved a number of steps, among them slerp merging of only the middle layers to compensate for tokenizer / chat template differences. An illustration is given below.

🧩 Configuration

The final merge step for this model was:

models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
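
To give a rough idea of what density: 0.65 and weight: 0.3 mean in the configuration above, here is a toy sketch of the drop-and-rescale step that DARE-style merging is built on. The weight values and the helper name dare_sparsify are hypothetical, and the sketch omits the TIES sign election and any normalization, so it should not be read as mergekit's actual implementation.

```python
import random

def dare_sparsify(delta, density, rng):
    """Keep each delta entry with probability `density` and rescale the
    kept entries by 1/density, so the expected value stays the same."""
    return [d / density if rng.random() < density else 0.0 for d in delta]

rng = random.Random(0)  # fixed seed, mirroring random_seed: 0 in spirit only

# Hypothetical toy weights; real tensors have millions of entries.
base = [0.10, -0.20, 0.30, 0.05]
finetuned = [0.12, -0.25, 0.31, 0.00]

# Task vector: what the fine-tuned model changed relative to the base.
delta = [f - b for f, b in zip(finetuned, base)]

# Drop roughly 35% of the delta entries, rescale the rest.
sparse = dare_sparsify(delta, density=0.65, rng=rng)

# Add the sparsified delta back onto the base, scaled by the model weight.
merged = [b + 0.3 * s for b, s in zip(base, sparse)]
print(merged)
```

In the real merge, three such deltas (from v85, v86 and v74) are combined on top of the v59 base.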

Among the previous steps:

models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
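
The t list in this step is what restricts the slerp to the middle layers: with t = 0 the base model's weights are kept, and the list is 0 at both ends. The sketch below assumes the 16 anchor values are spread linearly over the 32 transformer layers of an 8B Llama model, and uses a textbook slerp on plain Python lists. The function names layer_t and slerp are illustrative, and the exact interpolation used by the merge tool may differ.

```python
import math

# Anchor values copied from the slerp configuration above.
T_GRADIENT = [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,
              0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]

def layer_t(layer_idx, num_layers, gradient=T_GRADIENT):
    """Linearly interpolate the anchor list at the layer's relative depth
    (assumed mapping, see the note above)."""
    if num_layers == 1:
        return float(gradient[0])
    pos = layer_idx / (num_layers - 1) * (len(gradient) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(gradient) - 1)
    frac = pos - lo
    return (1 - frac) * gradient[lo] + frac * gradient[hi]

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two vectors.
    t = 0 returns v0 (the base model), t = 1 returns v1."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    if n0 < eps or n1 < eps:
        # Degenerate input: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    theta = math.acos(cos_theta)
    if math.sin(theta) < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Per-layer interpolation factors for a 32-layer model.
schedule = [layer_t(i, 32) for i in range(32)]
print([round(t, 2) for t in schedule])

# Toy example: halfway between two orthogonal unit vectors.
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Under this assumed mapping the first and last layers get t = 0, so they stay identical to cstr/llama3.1-8b-spaetzle-v74, while the middle layers move at most 70% of the way toward Hermes-3.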