---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---

# 🧠 ZeroGPU LLM Inference

A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across a wide variety of Transformer models, powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.

## ✨ Key Features

### 🎨 Modern UI/UX

- **Clean, intuitive interface** with an organized layout and clear visual hierarchy
- **Collapsible advanced settings** for both casual and power users
- **Smooth animations and transitions** for a better user experience
- **Responsive design** that works on all screen sizes
- **Copy-to-clipboard** functionality for easy sharing of responses

### 🔍 Web Search Integration

- **Real-time DuckDuckGo search** with background threading
- **Configurable timeout** and result limits
- **Automatic context injection** into system prompts
- **Smart toggle**: search settings auto-hide when disabled

### 💡 Smart Features

- **Thought vs. answer streaming**: `<think>…</think>` blocks are shown separately as "💭 Thought"
- **Working cancel button**: immediately stops generation without errors
- **Debug panel** for prompt-engineering insights
- **Duration estimates** based on model size and settings
- **Example prompts** to help users get started
- **Dynamic system prompts** with automatic date insertion

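The thought/answer split can be sketched in a few lines. `split_thought_and_answer` is a hypothetical helper for illustration; the real parser in `app.py` works on partial streamed text rather than a complete string and may differ.

```python
import re

def split_thought_and_answer(text: str) -> tuple[str, str]:
    """Separate <think>...</think> reasoning from the visible answer.

    Illustrative only -- the app streams tokens incrementally, so its
    actual logic handles incomplete tags as they arrive.
    """
    thoughts = re.findall(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return "\n".join(t.strip() for t in thoughts), answer
```
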
### 🎯 Model Variety

- **30+ LLM options** from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
- Models ranging from **135M to 32B+** parameters
- Specialized models for **reasoning, coding, and general chat**
- **Efficient model loading**: one model at a time, with automatic cache clearing

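One-at-a-time loading boils down to dropping the previous pipeline before creating the next. A minimal, dependency-free sketch (hypothetical helper; `app.py` would additionally call `torch.cuda.empty_cache()` when a GPU is present):

```python
import gc

def unload_current_model(state: dict) -> None:
    """Release the active pipeline so only one model occupies memory.

    Sketch only -- the real cleanup also clears the CUDA cache, which is
    omitted here to keep the example free of a torch dependency.
    """
    state.pop("pipeline", None)  # drop the last reference to the model
    gc.collect()                 # force collection of the freed objects
```
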
### ⚙️ Advanced Controls

- **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
- **Web search settings**: max results, characters per result, timeout
- **Custom system prompts** with dynamic date insertion
- **Organized into collapsible sections** to keep the interface clean

## 📋 Supported Models

### Compact Models (< 2B)

- **SmolLM2-135M-Instruct** - tiny but capable
- **SmolLM2-360M-Instruct** - lightweight conversation
- **Taiwan-ELM-270M/1.1B** - multilingual support
- **Qwen3-0.6B/1.7B** - fast inference

### Mid-Size Models (2B-8B)

- **Qwen3-4B/8B** - balanced performance
- **Phi-4-mini** (4.3B) - Reasoning & Instruct variants
- **MiniCPM3-4B** - efficient mid-size
- **Gemma-3-4B-IT** - instruction-tuned
- **Llama-3.2-Taiwan-3B** - regional optimization
- **Mistral-7B-Instruct** - classic performer
- **DeepSeek-R1-Distill-Llama-8B** - reasoning specialist

### Large Models (14B+)

- **Qwen3-14B** - strong general purpose
- **Apriel-1.5-15b-Thinker** - multimodal reasoning
- **gpt-oss-20b** - open GPT-style
- **Qwen3-32B** - top-tier performance

## 🚀 How It Works

1. **Select a model** - choose from 30+ pre-configured models
2. **Configure settings** - adjust generation parameters or use the defaults
3. **Enable web search** (optional) - get real-time information
4. **Start chatting** - type your message or use an example prompt
5. **Stream the response** - watch tokens appear in real time
6. **Cancel anytime** - stop generation mid-stream if needed

### Technical Flow

1. The user message enters the chat history
2. If search is enabled, a background thread fetches DuckDuckGo results
3. Search snippets are merged into the system prompt (within the timeout limit)
4. The selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback)
5. The prompt is formatted with thinking-mode detection
6. Tokens stream to the UI with thought/answer separation
7. A cancel button allows immediate interruption
8. Memory is cleared after generation, ready for the next request

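Steps 2-3 can be sketched with a plain `threading.Thread` and a bounded `join`. The `search` callable below stands in for the actual DuckDuckGo client in `app.py`; the function names are illustrative, not the app's real API.

```python
import threading
from typing import Callable, List

def fetch_search_context(search: Callable[[str], List[str]], query: str,
                         timeout: float = 5.0) -> str:
    """Run `search` in a background thread and give up after `timeout`
    seconds, so a slow search can never block token streaming."""
    snippets: List[str] = []
    worker = threading.Thread(target=lambda: snippets.extend(search(query)),
                              daemon=True)
    worker.start()
    worker.join(timeout)          # wait at most `timeout` seconds
    return "\n".join(snippets)    # empty string if the search timed out

def build_system_prompt(base: str, context: str) -> str:
    """Merge search snippets into the system prompt when any were fetched."""
    return f"{base}\n\nWeb context:\n{context}" if context else base
```
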
## ⚙️ Generation Parameters

| Parameter | Range | Default | Description |
|-----------|-------|---------|-------------|
| Max Tokens | 64-16384 | 1024 | Maximum response length |
| Temperature | 0.1-2.0 | 0.7 | Creativity vs. focus |
| Top-K | 1-100 | 40 | Token sampling pool size |
| Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0 | 1.2 | Reduces repetition |

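Using the defaults above, the settings map onto the standard `transformers` sampling keywords roughly as follows (a sketch; the exact wiring in `app.py` may differ):

```python
# Defaults from the table, expressed as transformers generate()/pipeline kwargs.
generation_kwargs = {
    "max_new_tokens": 1024,     # Max Tokens
    "temperature": 0.7,         # Temperature
    "top_k": 40,                # Top-K
    "top_p": 0.9,               # Top-P
    "repetition_penalty": 1.2,  # Repetition Penalty
    "do_sample": True,          # sampling must be on for the three above to apply
}
```
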
## 🔍 Web Search Settings

| Setting | Range | Default | Description |
|---------|-------|---------|-------------|
| Max Results | Integer | 4 | Number of search results |
| Max Chars/Result | Integer | 50 | Character limit per result |
| Search Timeout | 0-30 s | 5 s | Maximum wait time |

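The result and character limits can be applied with a small helper before the snippets are injected into the system prompt (hypothetical helper, shown with the defaults above):

```python
from typing import List

def trim_snippets(results: List[str], max_results: int = 4,
                  max_chars: int = 50) -> List[str]:
    """Keep at most `max_results` snippets, each truncated to `max_chars`
    characters, before merging them into the system prompt."""
    return [r[:max_chars] for r in results[:max_results]]
```
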
## 💻 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```

## 🎨 UI Design Philosophy

The interface follows these principles:

1. **Simplicity first** - core features are immediately visible
2. **Progressive disclosure** - advanced options are hidden but accessible
3. **Visual hierarchy** - clear organization with groups and sections
4. **Feedback** - status indicators and helpful messages
5. **Accessibility** - responsive, keyboard-friendly, with tooltips

## 🔧 Customization

### Adding New Models

Edit the `MODELS` dictionary in `app.py`:

```python
"Your-Model-Name": {
    "repo_id": "org/model-name",
    "description": "Model description",
    "params_b": 7.0,  # size in billions of parameters
},
```

### Modifying the UI Theme

Adjust the theme parameters in `gr.Blocks()`:

```python
theme=gr.themes.Soft(
    primary_hue="indigo",
    secondary_hue="purple",
    # ... more options
)
```

## 📊 Performance

- **Token streaming** for a responsive feel
- **Background search** that doesn't block the UI
- **Efficient memory management** with cache clearing
- **ZeroGPU acceleration** for fast inference
- **Optimized loading** with dtype fallbacks

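The dtype fallback chain (bf16 → f16 → f32) can be sketched generically. `load_fn` stands in for the actual `transformers` pipeline call in `app.py`; this is an assumed shape of the logic, not the app's exact code.

```python
from typing import Callable, Optional, Sequence

def load_with_dtype_fallback(load_fn: Callable[[str], object],
                             dtypes: Sequence[str] = ("bfloat16", "float16",
                                                      "float32")) -> object:
    """Try each precision in turn and return the first pipeline that loads."""
    last_error: Optional[Exception] = None
    for dtype in dtypes:
        try:
            return load_fn(dtype)    # e.g. pipeline(..., torch_dtype=dtype)
        except RuntimeError as err:  # dtype unsupported on this hardware
            last_error = err
    raise RuntimeError(f"no supported dtype: {last_error}")
```
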
## 🤝 Contributing

Contributions are welcome! Areas for improvement:

- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements

## 📄 License

Apache 2.0 - see the LICENSE file for details.

## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app)
- Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
- Search via [DuckDuckGo](https://duckduckgo.com)

---

**Made with ❤️ for the open source community**