---
title: VTT with Diarization
emoji: πŸŽ™οΈ
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# Voice-to-Text with Speaker Diarization
Powered by **faster-whisper** and **pyannote.audio** running locally on this Space.
## Features
- 🎯 **High-quality transcription** using faster-whisper (2-4x faster than OpenAI Whisper)
- πŸ‘₯ **Speaker diarization** with pyannote.audio 3.1
- 🌍 **Multi-language support** (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
- βš™οΈ **Configurable parameters** (beam size, best_of, model size)
- πŸ”§ **Optimized for Arabic customer service calls** with specialized prompts
## Usage
1. Upload an audio file (mp3, wav, m4a, flac, etc.)
2. Select language (or leave blank for auto-detect)
3. Enable speaker diarization if needed (requires HF_TOKEN)
4. Adjust quality parameters if desired
5. Click "Transcribe"
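The transcript returned in step 5 pairs each segment's timestamps with its text, plus a speaker label when diarization is enabled. A minimal sketch of that line format (the helper name and exact field layout are illustrative, not the app's actual code):

```python
def format_segment(start, end, text, speaker=None):
    """Render one transcript line as [HH:MM:SS - HH:MM:SS] SPEAKER: text."""
    def hms(t):
        t = int(t)
        return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"
    label = f"{speaker}: " if speaker else ""
    return f"[{hms(start)} - {hms(end)}] {label}{text}"

print(format_segment(1.2, 4.8, "Hello, how can I help you?", "SPEAKER_00"))
# [00:00:01 - 00:00:04] SPEAKER_00: Hello, how can I help you?
```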
## Configuration
Set these in Space Settings β†’ Variables:
- `WHISPER_MODEL_SIZE`: Model size (`tiny`, `base`, `small`, `medium`, `large-v3`) - default: `small`
- `WHISPER_DEVICE`: Device (`cpu` or `cuda`) - default: `cpu`
- `WHISPER_COMPUTE_TYPE`: Compute type (`int8`, `int16`, `float32`) - default: `int8`
- `DEFAULT_LANGUAGE`: Default language code - default: `ar` (Arabic)
- `WHISPER_BEAM_SIZE`: Beam search size (1-10) - default: `5`
- `WHISPER_BEST_OF`: Number of candidates considered when sampling (1-10) - default: `5`
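A minimal sketch of how `app.py` might read these variables, falling back to the defaults listed above (the `CONFIG` dict is illustrative; only the variable names and defaults come from this README):

```python
import os

# Space variables from Settings -> Variables, with the documented defaults.
CONFIG = {
    "model_size":   os.environ.get("WHISPER_MODEL_SIZE", "small"),
    "device":       os.environ.get("WHISPER_DEVICE", "cpu"),
    "compute_type": os.environ.get("WHISPER_COMPUTE_TYPE", "int8"),
    "language":     os.environ.get("DEFAULT_LANGUAGE", "ar"),
    "beam_size":    int(os.environ.get("WHISPER_BEAM_SIZE", "5")),
    "best_of":      int(os.environ.get("WHISPER_BEST_OF", "5")),
}
```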
### Secrets (required for diarization)
- `HF_TOKEN`: Your Hugging Face token with access to `pyannote/speaker-diarization-3.1`
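Because `pyannote/speaker-diarization-3.1` is a gated model, diarization can only run when this token is present. A hedged sketch of how the pipeline might be loaded, deferring the heavy import until a token is found (the function name is illustrative):

```python
import os

def load_diarization_pipeline(token=None):
    """Load the pyannote speaker-diarization pipeline; requires a token
    with access to the gated pyannote/speaker-diarization-3.1 model."""
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is required for speaker diarization")
    # Deferred import: pyannote.audio is heavy and only needed here.
    from pyannote.audio import Pipeline
    return Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1",
        use_auth_token=token,
    )
```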
## Model Information
### Whisper Models
| Model | Size | RAM | Quality | Speed |
|-------|------|-----|---------|-------|
| tiny | 75MB | 1GB | ⭐⭐ | Very Fast |
| base | 150MB | 1GB | ⭐⭐⭐ | Fast |
| small | 500MB | 2GB | ⭐⭐⭐⭐ | Moderate |
| medium | 1.5GB | 5GB | ⭐⭐⭐⭐⭐ | Slow |
| large-v3 | 3GB | 10GB | ⭐⭐⭐⭐⭐⭐ | Very Slow |
### First Run
- The first transcription automatically downloads the selected Whisper model
- Diarization downloads ~700MB on first use (cached afterward)
- Models are stored in the Space's persistent storage
## Technical Details
- Uses the same model loading approach as the Django backend
- faster-whisper automatically downloads models from Hugging Face
- Diarization pipeline is downloaded locally to avoid repeated API calls
- All processing happens on this Space (no external inference APIs)
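A common way to combine the two model outputs, and a plausible sketch of what happens internally here, is to label each transcription segment with the diarization speaker whose turn overlaps it most (the helper is illustrative, not the app's actual code):

```python
def assign_speakers(segments, turns):
    """Label each transcription segment with the diarization speaker
    whose turn overlaps it the most.

    segments: list of (start, end, text) from the transcriber
    turns:    list of (start, end, speaker) from the diarizer
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best_speaker, text))
    return labeled

print(assign_speakers(
    [(0.0, 2.0, "Hello"), (2.5, 5.0, "Hi, thanks for calling")],
    [(0.0, 2.2, "SPEAKER_00"), (2.2, 5.0, "SPEAKER_01")],
))
# [(0.0, 2.0, 'SPEAKER_00', 'Hello'), (2.5, 5.0, 'SPEAKER_01', 'Hi, thanks for calling')]
```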
## Credits
- [faster-whisper](https://github.com/guillaumekln/faster-whisper) by Guillaume Klein
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) by HervΓ© Bredin
- Original Django backend by IZI Techs
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference