eaddario committed · verified
Commit db27882 · Parent(s): 50c9a1e

Update README.md

Files changed (1): README.md (+50 −51)
README.md CHANGED
@@ -14,9 +14,9 @@ tags:
  - experimental
  ---

- # Experimental GGUF quantized versions of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

- Using [LLaMA C++](<https://github.com/ggerganov/llama.cpp>) release [b4998](<https://github.com/ggerganov/llama.cpp/releases/tag/b4998>) for quantization.

  Original model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](<https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B>)

@@ -30,11 +30,9 @@ From the original model creators:

  An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

- The method that I'm using to produce these experimental versions is explained in [Squeezing Tensor Bits: the quest for smaller LLMs](https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca), but at a high level it involves using a custom version of the `llama-quantize` tool to selectively quantize different tensors at different levels.

- There are two pull requests ([#12511](https://github.com/ggml-org/llama.cpp/pull/12511) & [#12512](https://github.com/ggml-org/llama.cpp/pull/12512)) to merge these changes back into the core llama.cpp project. This may or may not ever happen, but until then the modified version will be available on my [GitHub](https://github.com/EAddario/llama.cpp).
-
- In addition to [llama-quantize](https://github.com/EAddario/llama.cpp/tree/quantize), there's a version of [llama-perplexity](https://github.com/EAddario/llama.cpp/tree/perplexity) that allows you to continue generating test scores even if there's a context window overflow (the original behaviour is to stop).

  For testing and comparison I use models produced by [Unsloth](<https://huggingface.co/unsloth>) ([Daniel and Michael Han](<https://unsloth.ai/>) do some really advanced level stuff!) and [Bartowski](<https://huggingface.co/bartowski>) (see credits below).

@@ -43,12 +41,13 @@ All experimental versions were generated using an appropriate imatrix created fr
  The process to generate these models is roughly as follows:

  1. Convert the original model's tensors to [GGUF](<https://huggingface.co/docs/hub/en/gguf>) F16*
- 2. Estimate the Perplexity score for the F16 model (baseline) using the [wikitext-2-raw-v1](<https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1>) dataset, and save the [logits](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/logits>)
  3. Generate an [imatrix](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/imatrix>) from selected calibration datasets
- 4. Select an appropriate quant level for each tensor using a modified version of `llama-quantize`
- 5. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model
- 6. Keep versions with the best scores
- 7. Repeat until all desired quants are created. I find that quantizations below Q3/IQ3 are not fit for my purposes and therefore do not usually generate them, but I'm happy to provide other quants on request.

  *[BF16](<https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>) would be preferred, but Apple's GPUs don't support it yet, so any operations are executed on the CPU, making it unacceptably slow. This is expected to change in the near term but, until then, if you are using Apple kit avoid any models tagged BF16

@@ -57,34 +56,34 @@ The process to generate these models is roughly as follows:
  ### Sizes (in GB)
  | Model | Bartowski | Unsloth | Repo | Shrinkage |
  | ------------------------------------------------------------------------------- | --------: | ------: | ---: | --------: |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 3.57 | N/A | 3.29 | 7.8% |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | N/A | N/A | 3.09 | N/A |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 4.44 | N/A | 4.09 | 7.9% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 4.09 | N/A | 3.23 | 21.0% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 3.81 | 3.81 | 3.18 | 16.5% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 3.49 | N/A | 3.12 | 10.6% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 4.68 | 4.68 | 4.22 | 9.8% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 4.46 | N/A | 4.10 | 8.1% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 5.44 | 5.44 | 5.06 | 7.0% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 5.32 | N/A | 4.92 | 7.5% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 6.25 | 6.25 | 5.81 | 7.0% |
- | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 8.10 | 8.10 | 7.36 | 9.1% |

  ### Perplexity and KL Divergence scores
  | Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
  | ------------------------------------------------------------------------------- | ------------------: | -----: | -----------------: | ------------: |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 29.649250 ±0.294595 | 96.03% | 0.327503 ±0.001046 | 14.505 ±0.057 |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 29.629428 ±0.297710 | 95.87% | 0.342098 ±0.001134 | 14.049 ±0.058 |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 27.257341 ±0.278686 | 98.29% | 0.141175 ±0.000478 | 9.225 ±0.042 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 24.231749 ±0.227987 | 97.26% | 0.226427 ±0.000710 | 11.830 ±0.050 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 25.241862 ±0.238190 | 96.17% | 0.310675 ±0.000978 | 13.592 ±0.056 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 25.258309 ±0.236501 | 95.63% | 0.352969 ±0.001095 | 14.453 ±0.058 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 23.557663 ±0.224589 | 99.09% | 0.070963 ±0.000232 | 6.641 ±0.032 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 23.530143 ±0.224303 | 99.09% | 0.071101 ±0.000236 | 6.637 ±0.032 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 23.038797 ±0.218976 | 99.47% | 0.041530 ±0.000125 | 5.010 ±0.023 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 23.109739 ±0.219906 | 99.46% | 0.041738 ±0.000126 | 5.034 ±0.023 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 23.115197 ±0.220246 | 99.55% | 0.034778 ±0.000104 | 4.579 ±0.021 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 23.093601 ±0.219879 | 99.58% | 0.032917 ±0.000099 | 4.449 ±0.020 |
  | [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 22.656280 ±0.216110 | 100% | N/A | N/A |

  ### ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores
@@ -94,30 +93,30 @@ For the test data used in the generation of these scores, follow the appropriate

  | Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
  | ----------------------------------------------------------------------------------------------------------------- | --------------: | --------: | --------------: | --------------: | --------------: | --------: |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 51.8072 ±1.8294 | 57.33 | 32.6667 ±1.7137 | 31.4985 ±2.5727 | 55.6000 ±1.8155 | 45.78 |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 46.7202 ±1.8267 | 57.47 | 32.2667 ±1.7082 | 28.7500 ±2.5341 | 57.3333 ±1.8072 | 44.51 |
- | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 51.8717 ±1.8281 | 57.86 | 32.4000 ±1.7100 | 28.2609 ±2.5131 | 60.8000 ±1.7838 | 46.24 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 52.2727 ±1.8275 | 57.87 | 31.4667 ±1.6968 | 31.4985 ±2.5727 | 59.3333 ±1.7948 | 46.49 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 48.8621 ±1.8302 | 55.87 | 32.1333 ±1.7063 | 33.3333 ±2.6603 | 61.2000 ±1.7805 | 46.28 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 48.3266 ±1.8296 | 56.80 | 31.7333 ±1.7007 | 34.3750 ±2.6593 | 58.9333 ±1.7976 | 46.03 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 49.0642 ±1.8291 | 56.00 | 33.7333 ±1.7276 | 28.0000 ±2.4944 | 60.1333 ±1.7890 | 45.39 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.9358 ±1.8291 | 60.67 | 35.8667 ±1.7525 | 28.7037 ±2.5171 | 61.0667 ±1.7816 | 47.45 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.5348 ±1.8293 | 57.07 | 31.0667 ±1.6909 | 29.5031 ±2.5455 | 61.2000 ±1.7805 | 45.87 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 50.8701 ±1.8304 | 61.07 | 32.2667 ±1.7082 | 28.9231 ±2.5189 | 61.0667 ±1.7816 | 46.84 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 52.0750 ±1.8291 | 60.93 | 31.3333 ±1.6949 | 26.7081 ±2.4694 | 60.9333 ±1.7827 | 46.40 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 52.6104 ±1.8281 | 59.47 | 31.7333 ±1.7007 | 26.6871 ±2.4536 | 59.6000 ±1.7930 | 46.02 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 52.2088 ±1.8288 | 60.53 | 36.2667 ±1.7567 | 28.0864 ±2.5006 | 62.1333 ±1.7724 | 47.85 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 51.5395 ±1.8298 | 58.27 | 31.7333 ±1.7007 | 27.9635 ±2.4782 | 56.9333 ±1.8093 | 45.29 |
- | [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 53.2798 ±1.8267 | 57.73 | 35.2000 ±1.7451 | 29.1793 ±2.5100 | 58.6667 ±1.7993 | 46.81 |

  ### Tokens per Second - Benchmarks
  Scores generated using [llama-bench](https://github.com/ggml-org/llama.cpp/tree/master/examples/llama-bench). Q4_K_M quantizations from [Bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) and [Unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) included for comparison.

  | model | size | params | backend | threads | test | t/s |
  | ----------------------------------------------------------------------------------------------------------------- | -------: | -----: | ---------- | ------: | ------------: | ------------: |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.92 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 360.61 ± 0.59 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.92 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 29.26 ± 0.14 |
- | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.92 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 48.72 ± 0.14 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 354.27 ± 0.64 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 27.95 ± 0.04 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 46.60 ± 0.16 |

  - experimental
  ---

+ # Experimental layer-wise quantization of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

+ Using [LLaMA C++](<https://github.com/ggerganov/llama.cpp>) release [b5120](<https://github.com/ggerganov/llama.cpp/releases/tag/b5120>) for quantization.

  Original model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](<https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B>)


  An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

+ The method used to produce these experimental versions is covered in [Squeezing Tensor Bits: the quest for smaller LLMs](https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca), but at a high level it involves using custom versions of `llama-imatrix` and `llama-quantize` to identify the influential tensors, and then quantizing the most important layers to higher bit precision and the less important ones to lower bits. This process was partly inspired by Dumitru et al.'s [Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels](https://arxiv.org/abs/2406.17415).

+ There are two pull requests ([imatrix](https://github.com/ggml-org/llama.cpp/pull/12718) & [quantize](https://github.com/ggml-org/llama.cpp/pull/12511)) to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then the modified versions will be available on [GitHub](https://github.com/EAddario/llama.cpp).
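
At a high level, the selection step can be pictured with the sketch below: rank tensors by an importance score and give the most influential ones more bits. This is a minimal illustration only; the scores, fractions and quant-type names are hypothetical, not the actual logic inside the modified `llama-quantize`.

```python
# Hypothetical sketch of importance-guided quant assignment. The scores,
# thresholds and quant-type names below are illustrative only.

def assign_quant_types(importance: dict[str, float],
                       high: str = "Q6_K", base: str = "Q4_K", low: str = "Q3_K",
                       top_frac: float = 0.25, bottom_frac: float = 0.25) -> dict[str, str]:
    """Map each tensor name to a quant type based on its relative importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    n_top = max(1, int(len(ranked) * top_frac))     # most influential tensors
    n_bot = max(1, int(len(ranked) * bottom_frac))  # least influential tensors
    top, bottom = set(ranked[:n_top]), set(ranked[-n_bot:])
    return {name: high if name in top else low if name in bottom else base
            for name in ranked}

# Example: the attention value projection scores highest, so it keeps more bits.
scores = {"blk.0.attn_v": 0.92, "blk.0.ffn_down": 0.65, "blk.0.ffn_up": 0.31}
print(assign_quant_types(scores))
# {'blk.0.attn_v': 'Q6_K', 'blk.0.ffn_down': 'Q4_K', 'blk.0.ffn_up': 'Q3_K'}
```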
 
 

  For testing and comparison I use models produced by [Unsloth](<https://huggingface.co/unsloth>) ([Daniel and Michael Han](<https://unsloth.ai/>) do some really advanced level stuff!) and [Bartowski](<https://huggingface.co/bartowski>) (see credits below).

  The process to generate these models is roughly as follows:

  1. Convert the original model's tensors to [GGUF](<https://huggingface.co/docs/hub/en/gguf>) F16*
+ 2. Estimate the Perplexity score for the F16 model (baseline) using the [wikitext-2-raw-v1](<https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1>) dataset, and save the [logits](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/logits>)
  3. Generate an [imatrix](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/imatrix>) from selected calibration datasets
+ 4. Determine tensor and layer Importance Score contribution using a modified version of `llama-imatrix`
+ 5. Select an appropriate quant level for each tensor using a modified version of `llama-quantize`
+ 6. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model (see the sketch after this list)
+ 7. Keep versions with the best scores
+ 8. Repeat until all desired quants are created. I find that quantizations below Q3/IQ3 are not fit for my purposes and therefore do not usually generate them, but I'm happy to provide other quants on request.

  *[BF16](<https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>) would be preferred, but Apple's GPUs don't support it yet, so any operations are executed on the CPU, making it unacceptably slow. This is expected to change in the near term but, until then, if you are using Apple kit avoid any models tagged BF16
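
For reference, the two headline metrics in the tables below can be pictured with a small NumPy sketch: perplexity is the exponential of the mean negative log-likelihood of the target tokens, and μKLD is the mean KL divergence of a quantized model's token distribution against the F16 baseline. This is a toy illustration of the definitions, not the actual `llama-perplexity` implementation.

```python
# Toy versions of the metrics reported below, computed from raw logits.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """exp of the mean negative log-likelihood of the target tokens."""
    probs = softmax(logits)
    nll = -np.log(probs[np.arange(len(targets)), targets])
    return float(np.exp(nll.mean()))

def mean_kld(baseline_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """Mean KL(baseline || quantized) over token positions."""
    p, q = softmax(baseline_logits), softmax(quant_logits)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

# Sanity check: identical logits give a KL divergence of zero.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 32))      # 8 positions, 32-token vocabulary
targets = rng.integers(0, 32, size=8)
assert mean_kld(logits, logits) == 0.0
print(round(perplexity(logits, targets), 3))
```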

  ### Sizes (in GB)
  | Model | Bartowski | Unsloth | Repo | Shrinkage |
  | ------------------------------------------------------------------------------- | --------: | ------: | ---: | --------: |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 3.57 | N/A | 3.48 | 2.5% |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | N/A | N/A | 3.26 | N/A |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 4.44 | N/A | 4.18 | 5.9% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 4.09 | N/A | 3.54 | 13.4% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 3.81 | 3.81 | 3.37 | 11.5% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 3.49 | N/A | 3.14 | 10.0% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 4.68 | 4.68 | 4.18 | 10.7% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 4.46 | N/A | 4.06 | 9.0% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 5.44 | 5.44 | 5.09 | 6.4% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 5.32 | N/A | 4.97 | 6.6% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 6.25 | 6.25 | 6.23 | 0.3% |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 8.10 | 8.10 | 7.27 | 10.2% |

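Shrinkage is the relative size reduction against the equivalent Bartowski quant, as a quick check of any row confirms:

```python
# Shrinkage = 1 - (this repo's size / Bartowski's size), sizes in GB.
bartowski, repo = 3.57, 3.48  # DeepSeek-R1-Distill-Qwen-7B-IQ3_M
print(f"{1 - repo / bartowski:.1%}")  # -> 2.5%
```
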
  ### Perplexity and KL Divergence scores
  | Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
  | ------------------------------------------------------------------------------- | ------------------: | -----: | -----------------: | ------------: |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 28.740047 ±0.291290 | 97.19% | 0.229742 ±0.000770 | 11.793 ±0.050 |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 30.290800 ±0.307742 | 96.32% | 0.310982 ±0.001014 | 13.415 ±0.057 |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 23.570503 ±0.226124 | 98.59% | 0.102854 ±0.000465 | 8.080 ±0.046 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 24.160705 ±0.229989 | 97.75% | 0.173336 ±0.000603 | 10.337 ±0.048 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 24.967196 ±0.239198 | 97.50% | 0.194299 ±0.000681 | 10.877 ±0.050 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 25.661098 ±0.246635 | 96.84% | 0.243850 ±0.000852 | 12.143 ±0.054 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 23.125382 ±0.221860 | 99.24% | 0.053997 ±0.000215 | 5.795 ±0.032 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 23.156199 ±0.222000 | 99.18% | 0.058337 ±0.000233 | 6.026 ±0.034 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 22.726887 ±0.217691 | 99.75% | 0.013562 ±0.000062 | 2.924 ±0.020 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 22.766826 ±0.218244 | 99.74% | 0.014589 ±0.000070 | 3.024 ±0.020 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 22.859294 ±0.219461 | 99.87% | 0.004317 ±0.000022 | 1.682 ±0.016 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 22.840693 ±0.219408 | 99.90% | 0.001614 ±0.000011 | 1.050 ±0.010 |
  | [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 22.656280 ±0.216110 | 100% | N/A | N/A |

  ### ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores
 
  | Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
  | ----------------------------------------------------------------------------------------------------------------- | --------------: | --------: | --------------: | --------------: | --------------: | --------: |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 50.4000 ±1.8269 | 55.73 | 31.3333 ±1.6949 | 26.4000 ±1.6106 | 59.7333 ±1.7920 | 44.72 |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 43.8667 ±1.8132 | 53.87 | 30.5333 ±1.6828 | 26.2667 ±1.6080 | 57.8667 ±1.8042 | 42.48 |
+ | [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 50.6667 ±1.8268 | 57.60 | 32.8000 ±1.7155 | 28.2667 ±1.6453 | 60.5333 ±1.7860 | 45.97 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 48.5333 ±1.8262 | 56.93 | 30.4000 ±1.6807 | 27.0667 ±1.6235 | 59.8667 ±1.7910 | 44.56 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 48.1333 ±1.8257 | 56.27 | 29.8667 ±1.6723 | 27.3333 ±1.6284 | 59.8667 ±1.7910 | 44.29 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 48.0000 ±1.8255 | 55.47 | 28.5333 ±1.6500 | 26.0000 ±1.6027 | 59.2000 ±1.7958 | 43.44 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 52.5333 ±1.8246 | 58.53 | 32.1333 ±1.7063 | 26.2667 ±1.6080 | 60.9333 ±1.7827 | 46.08 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.9358 ±1.8291 | 60.67 | 35.8667 ±1.7525 | 28.7037 ±2.5171 | 61.0667 ±1.7816 | 47.45 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.5348 ±1.8293 | 57.07 | 31.0667 ±1.6909 | 29.5031 ±2.5455 | 61.2000 ±1.7805 | 45.87 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 52.4000 ±1.8249 | 58.40 | 31.8667 ±1.7026 | 25.8667 ±1.6001 | 61.4667 ±1.7783 | 46.00 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 52.8000 ±1.8241 | 59.33 | 32.6667 ±1.7137 | 27.2000 ±1.6260 | 60.1333 ±1.7890 | 46.43 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 52.9333 ±1.8238 | 59.60 | 32.8000 ±1.7155 | 27.0667 ±1.6235 | 61.2000 ±1.7805 | 46.72 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 53.0667 ±1.8235 | 59.07 | 31.3333 ±1.6949 | 26.9333 ±1.6209 | 60.5333 ±1.7860 | 46.19 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 53.0667 ±1.8235 | 58.67 | 32.2667 ±1.7082 | 26.9333 ±1.6209 | 60.2667 ±1.7880 | 46.24 |
+ | [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 52.9333 ±1.8238 | 58.67 | 32.1333 ±1.7063 | 26.6667 ±1.6158 | 60.0000 ±1.7900 | 46.08 |
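
Avg Score is the unweighted mean of the five benchmark results, e.g. for the Q4_K_M row:

```python
# Average of the ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores.
q4_k_m = [52.5333, 58.53, 32.1333, 26.2667, 60.9333]
print(f"{sum(q4_k_m) / len(q4_k_m):.2f}")  # -> 46.08
```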
 
  ### Tokens per Second - Benchmarks
  Scores generated using [llama-bench](https://github.com/ggml-org/llama.cpp/tree/master/examples/llama-bench). Q4_K_M quantizations from [Bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) and [Unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) included for comparison.

  | model | size | params | backend | threads | test | t/s |
  | ----------------------------------------------------------------------------------------------------------------- | -------: | -----: | ---------- | ------: | ------------: | ------------: |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 333.33 ± 3.02 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 29.14 ± 0.16 |
+ | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 48.07 ± 0.09 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 354.27 ± 0.64 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 27.95 ± 0.04 |
  | [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 46.60 ± 0.16 |