geoffmunn committed
Commit d5e67a8 · verified · 1 parent: 0098522

Add Q2–Q8_0 quantized models with per-model cards, MODELFILE, CLI examples, and auto-upload

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q3_K_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q4_K_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q5_K_S.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
+ Qwen3-8B-f16:Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
MODELFILE ADDED
@@ -0,0 +1,25 @@
+ # MODELFILE for Qwen3-8B-GGUF
+ # Used by LM Studio, OpenWebUI, GPT4All, etc.
+
+ context_length: 32768
+ embedding: false
+ f16: cpu
+
+ # Chat template using ChatML (used by Qwen)
+ prompt_template: >-
+   <|im_start|>system
+   You are a helpful assistant.<|im_end|>
+   <|im_start|>user
+   {prompt}<|im_end|>
+   <|im_start|>assistant
+
+ # Stop sequences help end generation cleanly
+ stop: "<|im_end|>"
+ stop: "<|im_start|>"
+
+ # Default sampling (optimized for thinking mode)
+ temperature: 0.6
+ top_p: 0.95
+ top_k: 20
+ min_p: 0.0
+ repeat_penalty: 1.1
Qwen3-8B-Q2_K/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q2_K
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q2_K** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 3.1 GB
+ - **Precision**: Q2_K
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Very Low |
+ | **Speed** | ⚡ Fast |
+ | **RAM Required** | ~3.0 GB |
+ | **Recommendation** | Only on very weak hardware; poor reasoning. Avoid if possible. Not suitable for thinking mode. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q2_K",
+   "prompt": "Summarize what a neural network is in one sentence.",
+   "temperature": 0.5,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
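+
+ A streaming variant of the same request (local Ollama endpoint assumed; the prompt is only an example) processes each JSON chunk as it arrives:
+
+ ```bash
+ # With "stream": true, Ollama emits one JSON object per line; -j joins the "response" fields.
+ curl -s -N http://localhost:11434/api/generate -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q2_K",
+   "prompt": "Summarize what a neural network is in one sentence.",
+   "stream": true
+ }' | jq -rj '.response'
+ echo
+ ```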
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q3_K_M/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q3_K_M
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q3_K_M** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 3.9 GB
+ - **Precision**: Q3_K_M
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Low-Medium |
+ | **Speed** | ⚡ Fast |
+ | **RAM Required** | ~3.6 GB |
+ | **Recommendation** | Acceptable for basic chat on older CPUs. Do not expect coherent logic. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_M",
+   "prompt": "Summarize what a neural network is in one sentence.",
+   "temperature": 0.5,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q3_K_S/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q3_K_S
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q3_K_S** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 3.6 GB
+ - **Precision**: Q3_K_S
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Low |
+ | **Speed** | ⚡ Fast |
+ | **RAM Required** | ~3.4 GB |
+ | **Recommendation** | Minimally viable for simple tasks. Avoid for reasoning or multilingual use. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q3_K_S",
+   "prompt": "Summarize what a neural network is in one sentence.",
+   "temperature": 0.5,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q4_K_M/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q4_K_M
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q4_K_M** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 4.7 GB
+ - **Precision**: Q4_K_M
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Balanced |
+ | **Speed** | 🚀 Fast |
+ | **RAM Required** | ~4.3 GB |
+ | **Recommendation** | Best speed/quality balance for most users. Ideal for laptops & general use. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt (see the example after this section)
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
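+
+ A quick way to compare the two modes from the command line (local Ollama endpoint assumed; the questions are only examples) is to append the directive to the prompt itself:
+
+ ```bash
+ # Thinking mode: /think encourages step-by-step reasoning before the final answer.
+ curl -s http://localhost:11434/api/generate -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q4_K_M",
+   "prompt": "A train leaves at 09:15 and arrives at 11:40. How long is the trip? /think",
+   "stream": false
+ }' | jq -r '.response'
+
+ # Non-thinking mode: /no_think favours a short, direct reply.
+ curl -s http://localhost:11434/api/generate -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q4_K_M",
+   "prompt": "What is the capital of France? /no_think",
+   "stream": false
+ }' | jq -r '.response'
+ ```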
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q4_K_M",
+   "prompt": "Write a short haiku about autumn leaves falling gently in a quiet forest.",
+   "temperature": 0.7,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q4_K_S/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q4_K_S
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q4_K_S** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 4.5 GB
+ - **Precision**: Q4_K_S
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Medium |
+ | **Speed** | 🚀 Fast |
+ | **RAM Required** | ~4.1 GB |
+ | **Recommendation** | Good for low-end devices; decent performance. Suitable for mobile/embedded. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q4_K_S",
+   "prompt": "Summarize what a neural network is in one sentence.",
+   "temperature": 0.5,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q5_K_M/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q5_K_M
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q5_K_M** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 5.5 GB
+ - **Precision**: Q5_K_M
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | High+ |
+ | **Speed** | 🐢 Medium |
+ | **RAM Required** | ~4.9 GB |
+ | **Recommendation** | Top choice for reasoning & coding. Recommended for desktops & strong laptops. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q5_K_M",
+   "prompt": "Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
+   "temperature": 0.4,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
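+
+ If you fetch a single file rather than cloning the whole repo, you can download and check it like this (`huggingface-cli` ships with the `huggingface_hub` package; the target directory is up to you):
+
+ ```bash
+ # Download only the Q5_K_M file, then compare its hash with the published checksum.
+ huggingface-cli download geoffmunn/Qwen3-8B "Qwen3-8B-f16:Q5_K_M.gguf" --local-dir .
+ sha256sum "Qwen3-8B-f16:Q5_K_M.gguf"
+ # Expected: 8cabfa609126bf7250dc89b4d7c637b35a28aa894873d98a4c9e1decde589bc5
+ ```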
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q5_K_S/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q5_K_S
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q5_K_S** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 5.4 GB
+ - **Precision**: Q5_K_S
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | High |
+ | **Speed** | 🐢 Medium |
+ | **RAM Required** | ~4.8 GB |
+ | **Recommendation** | Great for reasoning; slightly faster than Q5_K_M. Recommended for coding. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q5_K_S",
+   "prompt": "Write a short haiku about autumn leaves falling gently in a quiet forest.",
+   "temperature": 0.7,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q6_K/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q6_K
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q6_K** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 6.3 GB
+ - **Precision**: Q6_K
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Near-FP16 |
+ | **Speed** | 🐌 Slow |
+ | **RAM Required** | ~5.5 GB |
+ | **Recommendation** | Excellent fidelity; ideal for RAG, complex logic. Use if RAM allows. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q6_K",
+   "prompt": "Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
+   "temperature": 0.4,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp`
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-Q8_0/README.md ADDED
@@ -0,0 +1,163 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ ---
+
+ # Qwen3-8B-Q8_0
+
+ Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at the **Q8_0** level, derived from **f16** base weights.
+
+ ## Model Info
+
+ - **Format**: GGUF (for llama.cpp and compatible runtimes)
+ - **Size**: 8.2 GB
+ - **Precision**: Q8_0
+ - **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
+ - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
+
+ ## Quality & Performance
+
+ | Metric | Value |
+ |--------|-------|
+ | **Quality** | Lossless* |
+ | **Speed** | 🐌 Slow |
+ | **RAM Required** | ~7.1 GB |
+ | **Recommendation** | Highest quality without FP16; perfect for accuracy-critical tasks, benchmarks. |
+
+ ## Prompt Template (ChatML)
+
+ This model uses Qwen's **ChatML** prompt format:
+
+ ```text
+ <|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
+
+ Set this template in your app (LM Studio, OpenWebUI, etc.) for best results.
+
+ ## Generation Parameters
+
+ ### Thinking Mode (Recommended for Logic)
+ Use when solving math, coding, or logical problems.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.6 |
+ | Top-P | 0.95 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ > ❗ DO NOT use greedy decoding — it causes infinite loops.
+
+ Enable via:
+ - `enable_thinking=True` in the tokenizer
+ - Or add `/think` to the user input during conversation
+
+ ### Non-Thinking Mode (Fast Dialogue)
+ For casual chat and quick replies.
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Temperature | 0.7 |
+ | Top-P | 0.8 |
+ | Top-K | 20 |
+ | Min-P | 0.0 |
+ | Repeat Penalty | 1.1 |
+
+ Enable via:
+ - `enable_thinking=False`
+ - Or add `/no_think` to the prompt
+
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+ ## 💡 Usage Tips
+
+ > This model supports two operational modes:
+ >
+ > ### 🔍 Thinking Mode (Recommended for Logic)
+ > Activate with `enable_thinking=True` or append `/think` to the prompt.
+ >
+ > - Ideal for: math, coding, planning, analysis
+ > - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
+ > - Avoid greedy decoding
+ >
+ > ### ⚡ Non-Thinking Mode (Fast Chat)
+ > Use `enable_thinking=False` or `/no_think`.
+ >
+ > - Best for: casual conversation, quick answers
+ > - Sampling: `temp=0.7`, `top_p=0.8`
+ >
+ > ---
+ >
+ > 🔄 **Switch Dynamically**
+ > In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
+ >
+ > 🔁 **Avoid Repetition**
+ > Set `presence_penalty=1.5` if the model gets stuck in loops.
+ >
+ > 📏 **Use Full Context**
+ > Allow up to 32,768 output tokens for complex tasks.
+ >
+ > 🧰 **Agent Ready**
+ > Works with Qwen-Agent, MCP servers, and custom tools.
+
+ ## 🖥️ CLI Example Using Ollama or TGI Server
+
+ Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; adjust the URL and payload for other servers (e.g., Text Generation Inference).
+
+ ```bash
+ curl http://localhost:11434/api/generate -s -N -d '{
+   "model": "hf.co/geoffmunn/Qwen3-8B:Q8_0",
+   "prompt": "Explain why the sky appears blue during the day but red at sunrise and sunset, using physics principles like Rayleigh scattering.",
+   "temperature": 0.4,
+   "top_p": 0.95,
+   "top_k": 20,
+   "min_p": 0.0,
+   "repeat_penalty": 1.1,
+   "stream": false
+ }' | jq -r '.response'
+ ```
+
+ 🎯 **Why this works well**:
+ - The prompt is simple but meaningful, so the response shows the quant's **reasoning**, **creativity**, or **clarity**.
+ - Temperature is tuned to the task: lower (around `0.4`–`0.5`) for factual answers, higher (around `0.7`) for creative ones.
+ - `jq` extracts clean output from the JSON response.
+
+ > 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.
+
+ ## Verification
+
+ Check integrity:
+
+ ```bash
+ sha256sum -c ../SHA256SUMS.txt
+ ```
+
+ ## Usage
+
+ Compatible with:
+ - [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
+ - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
+ - Directly via `llama.cpp` (see the server sketch below)
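+
+ For benchmarking or other accuracy-critical work, one option is to serve this quant with `llama-server` and point OpenAI-compatible clients at it (binary name and port reflect recent llama.cpp builds and are illustrative):
+
+ ```bash
+ # Serve the Q8_0 file on an OpenAI-compatible endpoint at http://localhost:8080/v1
+ ./llama-server -m ./Qwen3-8B-f16:Q8_0.gguf -c 32768 --port 8080
+ ```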
+
+ Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.
+
+ ## License
+
+ Apache 2.0 – see the base model for full terms.
Qwen3-8B-f16:Q2_K.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e576995c60d8cad3daf851eb52e40d4d35fa3f472470d7b0a6898f183005d69c
+ size 3281732896
Qwen3-8B-f16:Q3_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fc24666cc6638401bbb703f52adc340bbfb07b9521675d43cb23e7f923890b06
+ size 4124161312
Qwen3-8B-f16:Q3_K_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:57e354f6bf6ab5ba4ce246a29d69236ba1e45e542b90eefd080c12389491b569
+ size 3769611552
Qwen3-8B-f16:Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8c167f66114c97c126c432976b9a081e12baf932126cb90476cac2932deec8b3
+ size 5027783968
Qwen3-8B-f16:Q4_K_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7647177f7a10f06642a6c62285b8547648da4149fcd019f6a99e8e945856e5bf
+ size 4802012448
Qwen3-8B-f16:Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8cabfa609126bf7250dc89b4d7c637b35a28aa894873d98a4c9e1decde589bc5
+ size 5851112736
Qwen3-8B-f16:Q5_K_S.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8f6637d53d5bccaa0b9c938fd3df78661db6e54a9d4a0d3fb57c99410dc230bb
+ size 5720761632
Qwen3-8B-f16:Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:863da90f35b7c22b1cd184a20d04cc0eaf0df67f8f52ab0a6d4f68d192600898
+ size 6725899552
Qwen3-8B-f16:Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:21962d706e584d0058a3f078ba42a40b7c3c82a1aa1a25588d372d41c99a8b6e
+ size 8709518624
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ license: apache-2.0
+ tags:
+ - gguf
+ - qwen
+ - llama.cpp
+ - quantized
+ - text-generation
+ - reasoning
+ - agent
+ - chat
+ - multilingual
+ base_model: Qwen/Qwen3-8B
+ author: geoffmunn
+ pipeline_tag: text-generation
+ language:
+ - en
+ - zh
+ - es
+ - fr
+ - de
+ - ru
+ - ar
+ - ja
+ - ko
+ - hi
+ ---
+
+ # Qwen3-8B-GGUF
+
+ This is a **GGUF-quantized version** of the **[Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)** language model — an **8-billion-parameter** LLM from Alibaba's Qwen series, designed for **advanced reasoning, agentic behavior, and multilingual tasks**.
+
+ Converted for use with `llama.cpp` and compatible tools like OpenWebUI, LM Studio, GPT4All, and more.
+
+ > 💡 **Key Features of Qwen3-8B**:
+ > - 🤔 **Thinking Mode**: Use `enable_thinking=True` or `/think` for step-by-step logic, math, and code.
+ > - ⚡ **Non-Thinking Mode**: Use `/no_think` for fast, lightweight dialogue.
+ > - 🧰 **Agent Capable**: Integrates with tools via MCP, APIs, and plugins.
+ > - 🌍 **Multilingual Support**: Fluent in 100+ languages including Chinese, English, Spanish, Arabic, Japanese, etc.
+
+ ## Available Quantizations (from f16)
+
+ These variants were built from an **f16** base model to ensure consistency across quant levels. Sizes are approximate and taken from the GGUF files in this repository.
+
+ | Level | Quality | Speed | Size | Recommendation |
+ |-------|---------|-------|------|----------------|
+ | Q2_K | Very Low | ⚡ Fastest | 3.1 GB | Only on severely memory-constrained systems (<6 GB RAM). Avoid for reasoning. |
+ | Q3_K_S | Low | ⚡ Fast | 3.6 GB | Minimal viability; basic completion only. Not recommended. |
+ | Q3_K_M | Low-Medium | ⚡ Fast | 3.9 GB | Acceptable for simple chat on older systems. No complex logic. |
+ | Q4_K_S | Medium | 🚀 Fast | 4.5 GB | Good balance for low-end laptops or embedded platforms. |
+ | Q4_K_M | ✅ Balanced | 🚀 Fast | 4.7 GB | Best overall for general use on average hardware. Great speed/quality trade-off. |
+ | Q5_K_S | High | 🐢 Medium | 5.4 GB | Better reasoning; slightly faster than Q5_K_M. Ideal for coding. |
+ | Q5_K_M | ✅✅ High | 🐢 Medium | 5.5 GB | Top pick for deep interactions, logic, and tool use. Recommended for desktops. |
+ | Q6_K | 🔥 Near-FP16 | 🐌 Slow | 6.3 GB | Excellent fidelity; ideal for RAG, retrieval, and accuracy-critical tasks. |
+ | Q8_0 | 🏆 Lossless* | 🐌 Slow | 8.2 GB | Maximum accuracy; best for research, benchmarking, or archival. |
+
+ > 💡 **Recommendations by Use Case**
+ >
+ > - 💻 **Low-end CPU / Old Laptop**: `Q4_K_M` (best balance under pressure)
+ > - 🖥️ **Standard/Mid-tier Laptop (i5/i7/M1/M2)**: `Q5_K_M` (optimal quality)
+ > - 🧠 **Reasoning, Coding, Math**: `Q5_K_M` or `Q6_K` (use thinking mode!)
+ > - 🤖 **Agent & Tool Integration**: `Q5_K_M` — handles JSON, function calls well
+ > - 🔍 **RAG, Retrieval, Precision Tasks**: `Q6_K` or `Q8_0`
+ > - 📦 **Storage-Constrained Devices**: `Q4_K_S` or `Q4_K_M`
+ > - 🛠️ **Development & Testing**: Test from `Q4_K_M` up to `Q8_0` to assess trade-offs
+
+ ## Usage
+
+ Load this model using:
+ - [OpenWebUI](https://openwebui.com) – self-hosted AI interface with RAG & tools
+ - [LM Studio](https://lmstudio.ai) – desktop app with GPU support and chat templates
+ - [GPT4All](https://gpt4all.io) – private, local AI chatbot (offline-first)
+ - Or directly via `llama.cpp` (see the sketch below)
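+
+ A minimal `llama.cpp` invocation might look like the sketch below (binary name, paths, and the chosen quant are illustrative; recent builds ship the CLI as `llama-cli`):
+
+ ```bash
+ # Run the Q4_K_M quant interactively with the sampling defaults recommended above.
+ ./llama-cli -m ./Qwen3-8B-f16:Q4_K_M.gguf \
+   -c 32768 --temp 0.6 --top-p 0.95 --top-k 20 --repeat-penalty 1.1 \
+   -p "Explain the difference between Q4_K_M and Q8_0 quantization. /think"
+ ```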
+
+ Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration.
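+
+ The `MODELFILE` here is a generic key/value settings file for apps such as LM Studio and GPT4All; it is not in Ollama's `Modelfile` syntax. If you want the same defaults under Ollama, a rough translation (local file name and model name are illustrative) is:
+
+ ```bash
+ # Hypothetical Ollama Modelfile mirroring the repo's recommended settings.
+ cat > Modelfile <<'EOF'
+ FROM ./Qwen3-8B-f16:Q4_K_M.gguf
+ PARAMETER temperature 0.6
+ PARAMETER top_p 0.95
+ PARAMETER top_k 20
+ PARAMETER repeat_penalty 1.1
+ PARAMETER num_ctx 32768
+ PARAMETER stop "<|im_end|>"
+ PARAMETER stop "<|im_start|>"
+ EOF
+
+ ollama create qwen3-8b-local -f Modelfile
+ ollama run qwen3-8b-local "Give me one sentence about GGUF."
+ ```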
+
+ ## Author
+
+ 👤 Geoff Munn (@geoffmunn)
+ 🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)
+
+ ## Disclaimer
+
+ This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.
SHA256SUMS.txt ADDED
@@ -0,0 +1,9 @@
+ e576995c60d8cad3daf851eb52e40d4d35fa3f472470d7b0a6898f183005d69c Qwen3-8B-f16:Q2_K.gguf
+ fc24666cc6638401bbb703f52adc340bbfb07b9521675d43cb23e7f923890b06 Qwen3-8B-f16:Q3_K_M.gguf
+ 57e354f6bf6ab5ba4ce246a29d69236ba1e45e542b90eefd080c12389491b569 Qwen3-8B-f16:Q3_K_S.gguf
+ 8c167f66114c97c126c432976b9a081e12baf932126cb90476cac2932deec8b3 Qwen3-8B-f16:Q4_K_M.gguf
+ 7647177f7a10f06642a6c62285b8547648da4149fcd019f6a99e8e945856e5bf Qwen3-8B-f16:Q4_K_S.gguf
+ 8cabfa609126bf7250dc89b4d7c637b35a28aa894873d98a4c9e1decde589bc5 Qwen3-8B-f16:Q5_K_M.gguf
+ 8f6637d53d5bccaa0b9c938fd3df78661db6e54a9d4a0d3fb57c99410dc230bb Qwen3-8B-f16:Q5_K_S.gguf
+ 863da90f35b7c22b1cd184a20d04cc0eaf0df67f8f52ab0a6d4f68d192600898 Qwen3-8B-f16:Q6_K.gguf
+ 21962d706e584d0058a3f078ba42a40b7c3c82a1aa1a25588d372d41c99a8b6e Qwen3-8B-f16:Q8_0.gguf