Spaetzle
					Collection
				
German-English models, mostly merged, some sft/dpo
					• 
				118 items
				• 
				Updated
					
				•
					
					1
llama3.1-8b-spaetzle-v90 is a progressive merge of merges.
German EQ-Bench v2_de: 69.93 (171/171). English (v2): 77.88 (171/171)
Open LLM Leaderboard Evaluation Results Detailed results can be found here
| Metric | Value | 
|---|---|
| Avg. | 27.59 | 
| IFEval (0-Shot) | 73.56 | 
| BBH (3-Shot) | 32.76 | 
| MATH Lvl 5 (4-Shot) | 13.37 | 
| GPQA (0-shot) | 4.36 | 
| MuSR (0-shot) | 11.15 | 
| MMLU-PRO (5-shot) | 30.34 | 
| Model | AGIEval | TruthfulQA | Bigbench | 
|---|---|---|---|
| llama3.1-8b-spaetzle-v90 | 42.05 | 57.2 | 44.75 | 
| Task | Version | Metric | Value | Stderr | |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 24.02 | ± | 2.69 | 
| acc_norm | 23.62 | ± | 2.67 | ||
| agieval_logiqa_en | 0 | acc | 40.09 | ± | 1.92 | 
| acc_norm | 39.78 | ± | 1.92 | ||
| agieval_lsat_ar | 0 | acc | 22.17 | ± | 2.75 | 
| acc_norm | 21.74 | ± | 2.73 | ||
| agieval_lsat_lr | 0 | acc | 50.39 | ± | 2.22 | 
| acc_norm | 45.29 | ± | 2.21 | ||
| agieval_lsat_rc | 0 | acc | 64.31 | ± | 2.93 | 
| acc_norm | 58.36 | ± | 3.01 | ||
| agieval_sat_en | 0 | acc | 81.07 | ± | 2.74 | 
| acc_norm | 73.79 | ± | 3.07 | ||
| agieval_sat_en_without_passage | 0 | acc | 45.15 | ± | 3.48 | 
| acc_norm | 38.83 | ± | 3.40 | ||
| agieval_sat_math | 0 | acc | 40.91 | ± | 3.32 | 
| acc_norm | 35.00 | ± | 3.22 | 
Average: 42.05%
| Task | Version | Metric | Value | Stderr | |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 39.66 | ± | 1.71 | 
| mc2 | 57.20 | ± | 1.51 | 
Average: 57.2%
| Task | Version | Metric | Value | Stderr | |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 58.42 | ± | 3.59 | 
| bigbench_date_understanding | 0 | multiple_choice_grade | 70.46 | ± | 2.38 | 
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 31.40 | ± | 2.89 | 
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 33.43 | ± | 2.49 | 
| exact_str_match | 0.00 | ± | 0.00 | ||
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 30.00 | ± | 2.05 | 
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 24.29 | ± | 1.62 | 
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 | 
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 38.20 | ± | 2.18 | 
| bigbench_navigate | 0 | multiple_choice_grade | 50.20 | ± | 1.58 | 
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.50 | ± | 1.03 | 
| bigbench_ruin_names | 0 | multiple_choice_grade | 54.46 | ± | 2.36 | 
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 32.77 | ± | 1.49 | 
| bigbench_snarks | 0 | multiple_choice_grade | 65.19 | ± | 3.55 | 
| bigbench_sports_understanding | 0 | multiple_choice_grade | 50.30 | ± | 1.59 | 
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 45.70 | ± | 1.58 | 
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.08 | ± | 1.17 | 
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.03 | ± | 0.90 | 
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 56.00 | ± | 2.87 | 
Average: 44.75%
The merge tree involves the following models:
There have been a number of steps involved, among which, slep merging of only middle layers compensating for tokenizer / chat template differences. An illustration below.
The final merge for this was:
models:
  - model: cstr/llama3.1-8b-spaetzle-v59
    # no parameters necessary for base model
  - model: cstr/llama3.1-8b-spaetzle-v85
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v86
    parameters:
      density: 0.65
      weight: 0.3
  - model: cstr/llama3.1-8b-spaetzle-v74
    parameters:
      density: 0.65
      weight: 0.3
merge_method: dare_ties
base_model: cstr/llama3.1-8b-spaetzle-v59
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
Among the previous steps:
models:
  - model: NousResearch/Hermes-3-Llama-3.1-8B
merge_method: slerp
base_model: cstr/llama3.1-8b-spaetzle-v74
parameters:
  t:
    - value: [0, 0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0, 0]
dtype: float16
Use with llama3 chat template as common. Here are GGUF quants for use with llama.cpp & wrappers as e.g. ollama: cstr/llama3.1-8b-spaetzle-v90-GGUF