Update README.md

README.md

@@ -14,9 +14,9 @@ tags:
- experimental
---

# Experimental layer-wise quantization of deepseek-ai/DeepSeek-R1-Distill-Qwen-7B

Using [LLaMA C++](<https://github.com/ggerganov/llama.cpp>) release [b5120](<https://github.com/ggerganov/llama.cpp/releases/tag/b5120>) for quantization.
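
For anyone reproducing the setup, a release tag like b5120 can be checked out and built in the usual way (a minimal sketch of the standard CMake build; directory layout is illustrative):

```bash
# Pin llama.cpp to the b5120 release tag and do a standard release build;
# the tools used below (llama-quantize, llama-imatrix, llama-perplexity,
# llama-bench) end up under build/bin
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b5120
cmake -B build
cmake --build build --config Release
```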

Original model: [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](<https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B>)

@@ -30,11 +30,9 @@ From the original model creators:

An area of personal interest is finding ways to optimize the inference performance of LLMs when deployed in resource-constrained environments like commodity hardware, desktops, laptops, mobiles, edge devices, etc. There are many approaches to accomplish this, including architecture simplification and knowledge distillation, but my focus has been primarily on quantization and pruning.

The method used to produce these experimental versions is covered in [Squeezing Tensor Bits: the quest for smaller LLMs](https://medium.com/@eaddario/squeezing-tensor-bits-the-quest-for-smaller-llms-86b23bd052ca), but at a high level it involves using custom versions of `llama-imatrix` and `llama-quantize` to identify the influential tensors, and quantize the most important layers to higher bit precision and the less important ones to lower bits. This process was partly inspired by Dumitru et al.'s [Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels](https://arxiv.org/abs/2406.17415).
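
To make the idea concrete, here is a hypothetical invocation in the spirit of the modified `llama-quantize`: per-tensor type overrides pin selected tensor groups to a different bit width than the base quant. The tensor patterns, types and file names below are illustrative only, not the recipe used for these files; see the pull requests below for the exact syntax.

```bash
# Layer-wise quantization sketch: Q3_K_M base, with (hypothetically) more
# influential tensor groups forced to higher-precision types. The choice
# of tensors and types here is made up for illustration.
./build/bin/llama-quantize \
    --imatrix DeepSeek-R1-Distill-Qwen-7B.imatrix \
    --tensor-type attn_v=q5_k \
    --tensor-type ffn_down=q4_k \
    DeepSeek-R1-Distill-Qwen-7B-F16.gguf \
    DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf \
    Q3_K_M
```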

There are two pull requests ([imatrix](https://github.com/ggml-org/llama.cpp/pull/12718) & [quantize](https://github.com/ggml-org/llama.cpp/pull/12511)) to merge these changes back into the core llama.cpp project. This may or may not ever happen, so until then the modified versions will be available on [GitHub](https://github.com/EAddario/llama.cpp).

For testing and comparison I use models produced by [Unsloth](<https://huggingface.co/unsloth>) ([Daniel and Michael Han](<https://unsloth.ai/>) do some really advanced level stuff!) and [Bartowski](<https://huggingface.co/bartowski>) (see credits below).

@@ -43,12 +41,13 @@ All experimental versions were generated using an appropriate imatrix created fr
The process to generate these models is roughly as follows (a condensed command-line sketch follows the list):

1. Convert the original model's tensors to [GGUF](<https://huggingface.co/docs/hub/en/gguf>) F16*
2. Estimate the Perplexity score for the F16 model (baseline) using the [wikitext-2-raw-v1](<https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1>) dataset, and save the [logits](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/logits>)
3. Generate an [imatrix](<https://huggingface.co/eaddario/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main/imatrix>) from selected calibration datasets
4. Determine tensor and layer Importance Score contribution using a modified version of `llama-imatrix`
5. Select an appropriate quant level for each tensor using a modified version of `llama-quantize`
6. Calculate Perplexity, KL Divergence, ARC (Easy+Challenge), HellaSwag, MMLU, Truthful QA and WinoGrande scores for each quantized model
7. Keep versions with the best scores
8. Repeat until all desired quants are created. I find that quantizations below Q3/IQ3 are not fit for my purposes and therefore do not usually generate them, but I'm happy to provide other quants on request.

*[BF16](<https://en.wikipedia.org/wiki/Bfloat16_floating-point_format>) would be preferred, but Apple's GPUs don't support it yet, and therefore any operations are executed on the CPU, making it unacceptably slow. This is expected to change in the near term but, until then, if you are using Apple kit avoid any models tagged BF16.
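
The steps above map roughly to the commands below. This is a hedged sketch: file and dataset names are illustrative, and the Importance Score and per-tensor selection steps rely on the modified tools from the pull requests linked earlier, whose exact flags are documented there.

```bash
# 1. Convert the original safetensors model to GGUF F16
python convert_hf_to_gguf.py ./DeepSeek-R1-Distill-Qwen-7B \
    --outtype f16 --outfile DeepSeek-R1-Distill-Qwen-7B-F16.gguf

# 2. Baseline perplexity on wikitext-2-raw-v1, saving the F16 logits so
#    that quantized models can later be compared via KL divergence
./build/bin/llama-perplexity -m DeepSeek-R1-Distill-Qwen-7B-F16.gguf \
    -f wikitext-2-raw/wiki.test.raw --kl-divergence-base logits/f16.kld

# 3. Generate the importance matrix from a calibration dataset
./build/bin/llama-imatrix -m DeepSeek-R1-Distill-Qwen-7B-F16.gguf \
    -f calibration.txt -o DeepSeek-R1-Distill-Qwen-7B.imatrix

# 4-5. Score tensor/layer importance (modified llama-imatrix) and quantize
#      with per-tensor type overrides (modified llama-quantize)
./build/bin/llama-quantize --imatrix DeepSeek-R1-Distill-Qwen-7B.imatrix \
    DeepSeek-R1-Distill-Qwen-7B-F16.gguf \
    DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf Q4_K_M
```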

@@ -57,34 +56,34 @@ The process to generate these models is roughly as follows:
### Sizes (in GB)
| Model | Bartowski | Unsloth | Repo | Shrinkage |
| ------------------------------------------------------------------------------- | --------: | ------: | ---: | --------: |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 3.57 | N/A | 3.48 | 2.5% |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | N/A | N/A | 3.26 | N/A |
| [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 4.44 | N/A | 4.18 | 5.9% |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 4.09 | N/A | 3.54 | 13.4% |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 3.81 | 3.81 | 3.37 | 11.5% |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 3.49 | N/A | 3.14 | 10.0% |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 4.68 | 4.68 | 4.18 | 10.7% |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 4.46 | N/A | 4.06 | 9.0% |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 5.44 | 5.44 | 5.09 | 6.4% |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 5.32 | N/A | 4.97 | 6.6% |
| [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 6.25 | 6.25 | 6.23 | 0.3% |
| [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 8.10 | 8.10 | 7.27 | 10.2% |

### Perplexity and KL Divergence scores
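For reference, μKLD is the mean, over the N evaluated token positions, of the KL divergence from the F16 model's next-token distribution to the quantized model's, and RMS Δp is the root-mean-square change in the probability assigned to the actual next token. This matches my reading of what `llama-perplexity` reports, so treat the notation below as informal:

$$
\mu\mathrm{KLD} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t \in V} P_i^{F16}(t)\,\log\frac{P_i^{F16}(t)}{P_i^{quant}(t)},
\qquad
\mathrm{RMS}\,\Delta p = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i^{quant}(t_i^{*}) - P_i^{F16}(t_i^{*})\right)^{2}}
$$

where V is the vocabulary and t<sub>i</sub>* is the token that actually occurs at position i.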
| Model | μPPL | 𝜌PPL | μKLD | RMS Δp |
| ------------------------------------------------------------------------------- | ------------------: | -----: | -----------------: | ------------: |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 28.740047 ±0.291290 | 97.19% | 0.229742 ±0.000770 | 11.793 ±0.050 |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 30.290800 ±0.307742 | 96.32% | 0.310982 ±0.001014 | 13.415 ±0.057 |
| [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 23.570503 ±0.226124 | 98.59% | 0.102854 ±0.000465 | 8.080 ±0.046 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 24.160705 ±0.229989 | 97.75% | 0.173336 ±0.000603 | 10.337 ±0.048 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 24.967196 ±0.239198 | 97.50% | 0.194299 ±0.000681 | 10.877 ±0.050 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 25.661098 ±0.246635 | 96.84% | 0.243850 ±0.000852 | 12.143 ±0.054 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 23.125382 ±0.221860 | 99.24% | 0.053997 ±0.000215 | 5.795 ±0.032 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 23.156199 ±0.222000 | 99.18% | 0.058337 ±0.000233 | 6.026 ±0.034 |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 22.726887 ±0.217691 | 99.75% | 0.013562 ±0.000062 | 2.924 ±0.020 |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 22.766826 ±0.218244 | 99.74% | 0.014589 ±0.000070 | 3.024 ±0.020 |
| [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 22.859294 ±0.219461 | 99.87% | 0.004317 ±0.000022 | 1.682 ±0.016 |
| [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 22.840693 ±0.219408 | 99.90% | 0.001614 ±0.000011 | 1.050 ±0.010 |
| [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 22.656280 ±0.216110 | 100% | N/A | N/A |
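
These figures can be reproduced with the stock `llama-perplexity` once the baseline F16 logits have been saved (a sketch; the logits file name is the one assumed in the pipeline sketch above):

```bash
# Compare a quantized model against the saved F16 logits; this prints the
# mean PPL, PPL correlation, mean KLD and RMS Δp figures tabulated above
./build/bin/llama-perplexity -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
    --kl-divergence-base logits/f16.kld --kl-divergence
```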

### ARC, HellaSwag, MMLU, Truthful QA and WinoGrande scores

@@ -94,30 +93,30 @@ For the test data used in the generation of these scores, follow the appropiate

| Model | ARC | HellaSwag | MMLU | Truthful QA | WinoGrande | Avg Score |
| ----------------------------------------------------------------------------------------------------------------- | --------------: | --------: | --------------: | --------------: | --------------: | --------: |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_M](./DeepSeek-R1-Distill-Qwen-7B-IQ3_M.gguf) | 50.4000 ±1.8269 | 55.73 | 31.3333 ±1.6949 | 26.4000 ±1.6106 | 59.7333 ±1.7920 | 45.78 |
| [DeepSeek-R1-Distill-Qwen-7B-IQ3_S](./DeepSeek-R1-Distill-Qwen-7B-IQ3_S.gguf) | 43.8667 ±1.8132 | 53.87 | 30.5333 ±1.6828 | 26.2667 ±1.6080 | 57.8667 ±1.8042 | 44.51 |
| [DeepSeek-R1-Distill-Qwen-7B-IQ4_NL](./DeepSeek-R1-Distill-Qwen-7B-IQ4_NL.gguf) | 50.6667 ±1.8268 | 57.60 | 32.8000 ±1.7155 | 28.2667 ±1.6453 | 60.5333 ±1.7860 | 46.24 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_L](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_L.gguf) | 48.5333 ±1.8262 | 56.93 | 30.4000 ±1.6807 | 27.0667 ±1.6235 | 59.8667 ±1.7910 | 46.49 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_M.gguf) | 48.1333 ±1.8257 | 56.27 | 29.8667 ±1.6723 | 27.3333 ±1.6284 | 59.8667 ±1.7910 | 46.28 |
| [DeepSeek-R1-Distill-Qwen-7B-Q3_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q3_K_S.gguf) | 48.0000 ±1.8255 | 55.47 | 28.5333 ±1.6500 | 26.0000 ±1.6027 | 59.2000 ±1.7958 | 46.03 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 52.5333 ±1.8246 | 58.53 | 32.1333 ±1.7063 | 26.2667 ±1.6080 | 60.9333 ±1.7827 | 45.39 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.9358 ±1.8291 | 60.67 | 35.8667 ±1.7525 | 28.7037 ±2.5171 | 61.0667 ±1.7816 | 47.45 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 50.5348 ±1.8293 | 57.07 | 31.0667 ±1.6909 | 29.5031 ±2.5455 | 61.2000 ±1.7805 | 45.87 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_S.gguf) | 52.4000 ±1.8249 | 58.40 | 31.8667 ±1.7026 | 25.8667 ±1.6001 | 61.4667 ±1.7783 | 46.84 |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_M.gguf) | 52.8000 ±1.8241 | 59.33 | 32.6667 ±1.7137 | 27.2000 ±1.6260 | 60.1333 ±1.7890 | 46.40 |
| [DeepSeek-R1-Distill-Qwen-7B-Q5_K_S](./DeepSeek-R1-Distill-Qwen-7B-Q5_K_S.gguf) | 52.9333 ±1.8238 | 59.60 | 32.8000 ±1.7155 | 27.0667 ±1.6235 | 61.2000 ±1.7805 | 46.02 |
| [DeepSeek-R1-Distill-Qwen-7B-Q6_K](./DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf) | 53.0667 ±1.8235 | 59.07 | 31.3333 ±1.6949 | 26.9333 ±1.6209 | 60.5333 ±1.7860 | 47.85 |
| [DeepSeek-R1-Distill-Qwen-7B-Q8_0](./DeepSeek-R1-Distill-Qwen-7B-Q8_0.gguf) | 53.0667 ±1.8235 | 58.67 | 32.2667 ±1.7082 | 26.9333 ±1.6209 | 60.2667 ±1.7880 | 45.29 |
| [DeepSeek-R1-Distill-Qwen-7B-F16](./DeepSeek-R1-Distill-Qwen-7B-F16.gguf) | 52.9333 ±1.8238 | 58.67 | 32.1333 ±1.7063 | 26.6667 ±1.6158 | 60.0000 ±1.7900 | 46.81 |

### Tokens per Second - Benchmarks
Scores generated using [llama-bench](https://github.com/ggml-org/llama.cpp/tree/master/examples/llama-bench). Q4_K_M quantizations from [Bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) and [Unsloth](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main) included for comparison.
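
An invocation along these lines should produce comparable rows (a sketch; `-t 6` matches the thread count in the table, and actual numbers will vary with hardware):

```bash
# pp512: prompt processing, tg128: token generation,
# -pg 1024,1024 drives the combined test reported as pp1024+tg1024
./build/bin/llama-bench -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf \
    -t 6 -p 512 -n 128 -pg 1024,1024
```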

| model | size | params | backend | threads | test | t/s |
| ----------------------------------------------------------------------------------------------------------------- | -------: | -----: | ---------- | ------: | ------------: | ------------: |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 333.33 ± 3.02 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 29.14 ± 0.16 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M](./DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf) | 3.89 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 48.07 ± 0.09 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp512 | 354.27 ± 0.64 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | tg128 | 27.95 ± 0.04 |
| [DeepSeek-R1-Distill-Qwen-7B-Q4_K_M-bartowski](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-7B-GGUF) | 4.36 GiB | 7.62 B | Metal,BLAS | 6 | pp1024+tg1024 | 46.60 ± 0.16 |