---
license: apache-2.0
tags:
- gguf
- qwen
- qwen3-8b
- qwen3-8b-q5
- qwen3-8b-q5_k_s
- qwen3-8b-q5_k_s-gguf
- llama.cpp
- quantized
- text-generation
- reasoning
- agent
- chat
- multilingual
base_model: Qwen/Qwen3-8B
author: geoffmunn
---

# Qwen3-8B:Q5_K_S

Quantized version of [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B) at **Q5_K_S** level, derived from **f16** base weights.

## Model Info

- **Format**: GGUF (for llama.cpp and compatible runtimes)
- **Size**: 5.72 GB
- **Precision**: Q5_K_S
- **Base Model**: [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)
- **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)

## Quality & Performance

| Metric             | Value                                                |
|--------------------|------------------------------------------------------|
| **Speed**          | 🐒 Medium                                            |
| **RAM Required**   | ~4.8 GB                                              |
| **Recommendation** | 🥈 A strong second choice; good for all query types. |

## Prompt Template (ChatML)

This model uses the **ChatML** prompt format adopted by Qwen:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```

Set this in your app (LM Studio, OpenWebUI, etc.) for best results.

## Generation Parameters

### Thinking Mode (Recommended for Logic)

Use when solving math, coding, or logical problems.

| Parameter      | Value |
|----------------|-------|
| Temperature    | 0.6   |
| Top-P          | 0.95  |
| Top-K          | 20    |
| Min-P          | 0.0   |
| Repeat Penalty | 1.1   |

> ❗ DO NOT use greedy decoding; it can cause infinite repetition loops.

Enable via:

- `enable_thinking=True` in the tokenizer
- Or add `/think` to user input during conversation

### Non-Thinking Mode (Fast Dialogue)

For casual chat and quick replies.

| Parameter      | Value |
|----------------|-------|
| Temperature    | 0.7   |
| Top-P          | 0.8   |
| Top-K          | 20    |
| Min-P          | 0.0   |
| Repeat Penalty | 1.1   |

Enable via:

- `enable_thinking=False`
- Or add `/no_think` to the prompt

Stop sequences: `<|im_end|>`, `<|im_start|>`

## 💡 Usage Tips

> This model supports two operational modes:
>
> ### 🔍 Thinking Mode (Recommended for Logic)
> Activate with `enable_thinking=True` or append `/think` to the prompt.
>
> - Ideal for: math, coding, planning, analysis
> - Use sampling: `temp=0.6`, `top_p=0.95`, `top_k=20`
> - Avoid greedy decoding
>
> ### ⚡ Non-Thinking Mode (Fast Chat)
> Use `enable_thinking=False` or `/no_think`.
>
> - Best for: casual conversation, quick answers
> - Sampling: `temp=0.7`, `top_p=0.8`
>
> ---
>
> 🔄 **Switch Dynamically**
> In multi-turn chats, the last `/think` or `/no_think` directive takes precedence.
>
> 🔁 **Avoid Repetition**
> Set `presence_penalty=1.5` if the model gets stuck in loops.
>
> 📏 **Use Full Context**
> Allow up to 32,768 output tokens for complex tasks.
>
> 🧰 **Agent Ready**
> Works with Qwen-Agent, MCP servers, and custom tools.
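## Quick Test with llama.cpp

The sampling settings above can be exercised directly against this GGUF with llama.cpp's command-line tool. The sketch below is a minimal example rather than the canonical invocation: it assumes a recent llama.cpp build where the chat binary is named `llama-cli`, and that the file `Qwen3-8B-f16:Q5_K_S.gguf` (the name used in the Ollama steps below) is in the current directory. Adjust the path, context size (`-c`), and output budget (`-n`) to suit your hardware.

```bash
# Interactive ChatML conversation using the recommended thinking-mode sampling.
# -cnv = conversation mode (uses the chat template bundled in the GGUF metadata).
./llama-cli \
  -m ./Qwen3-8B-f16:Q5_K_S.gguf \
  -cnv \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.1 \
  -c 8192 \
  -n 2048
```

For fast, non-thinking chat, switch to `--temp 0.7 --top-p 0.8` or start your message with `/no_think`, as described in the tips above.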
## Customisation & Troubleshooting

Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`. In this case, try these steps:

1. `wget https://huggingface.co/geoffmunn/Qwen3-8B/resolve/main/Qwen3-8B-f16%3AQ5_K_S.gguf`

2. `nano Modelfile` and enter these details:

   ```text
   FROM ./Qwen3-8B-f16:Q5_K_S.gguf

   # Chat template using ChatML (used by Qwen)
   SYSTEM You are a helpful assistant

   TEMPLATE "{{ if .System }}<|im_start|>system
   {{ .System }}<|im_end|>{{ end }}<|im_start|>user
   {{ .Prompt }}<|im_end|>
   <|im_start|>assistant
   "

   PARAMETER stop <|im_start|>
   PARAMETER stop <|im_end|>

   # Default sampling
   PARAMETER temperature 0.6
   PARAMETER top_p 0.95
   PARAMETER top_k 20
   PARAMETER min_p 0.0
   PARAMETER repeat_penalty 1.1
   PARAMETER num_ctx 4096
   ```

   The `num_ctx` value has been lowered to 4096 here, which increases speed significantly; raise it if you need a longer context window.

3. Then run this command: `ollama create Qwen3-8B-f16:Q5_K_S -f Modelfile`

You will now see "Qwen3-8B-f16:Q5_K_S" in your Ollama model list.

These import steps are also useful if you want to customise the default parameters or system prompt.

## 🖥️ CLI Example Using Ollama or TGI Server

Here's how to query this model via API using `curl` and `jq`. The example targets Ollama's `/api/generate` endpoint; other servers such as Text Generation Inference use different endpoints and payloads, so adjust accordingly.

```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-8B:Q5_K_S",
  "prompt": "Repeat the following instruction exactly as given: Write a short haiku about autumn leaves falling gently in a quiet forest.",
  "options": {
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "repeat_penalty": 1.1
  },
  "stream": false
}' | jq -r '.response'
```

Note that Ollama expects sampling parameters inside the `options` object rather than at the top level of the request.

🎯 **Why this works well**:

- The prompt is meaningful and demonstrates **reasoning**, **creativity**, or **clarity** depending on the quant level.
- Temperature is tuned appropriately: lower (e.g., `0.4`) for factual responses, higher (e.g., `0.7`) for creative ones.
- `jq` extracts clean output from the JSON response.

> 💬 Tip: For interactive streaming, set `"stream": true` and process the output line by line.

## Verification

Check the download's integrity:

```bash
sha256sum -c ../SHA256SUMS.txt
```

## Usage

Compatible with:

- [LM Studio](https://lmstudio.ai) – local AI model runner with GPU acceleration
- [OpenWebUI](https://openwebui.com) – self-hosted AI platform with RAG and tools
- [GPT4All](https://gpt4all.io) – private, offline AI chatbot
- Directly via `llama.cpp`

Supports dynamic switching between thinking modes via `/think` and `/no_think` in multi-turn conversations.

## License

Apache 2.0 – see the base model for full terms.