---
title: VTT with Diarization
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---
# Voice-to-Text with Speaker Diarization

Powered by faster-whisper and pyannote.audio running locally on this Space.
## Features
- 🎯 High-quality transcription using faster-whisper (2-4x faster than OpenAI Whisper)
- 👥 Speaker diarization with pyannote.audio 3.1
- 🌍 Multi-language support (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
- ⚙️ Configurable parameters (beam size, best_of, model size)
- 🔧 Optimized for Arabic customer service calls with specialized prompts (see the sketch after this list)
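
The prompt and decoding options above map onto faster-whisper's `transcribe()` arguments. A minimal sketch, assuming the defaults documented below; the file name and Arabic prompt text are placeholders, not what this Space uses:

```python
from faster_whisper import WhisperModel

# Load a model size from the table further below; int8 on CPU matches the
# defaults documented in the Configuration section.
model = WhisperModel("small", device="cpu", compute_type="int8")

# beam_size / best_of correspond to the configurable quality parameters;
# initial_prompt is one way to bias decoding toward call-center vocabulary.
segments, info = model.transcribe(
    "call.mp3",
    language="ar",
    beam_size=5,
    best_of=5,
    initial_prompt="مكالمة خدمة عملاء",  # illustrative domain prompt
)
for seg in segments:
    print(f"[{seg.start:.1f}-{seg.end:.1f}] {seg.text}")
```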
## Usage
1. Upload an audio file (mp3, wav, m4a, flac, etc.)
2. Select a language (or leave blank for auto-detect)
3. Enable speaker diarization if needed (requires `HF_TOKEN`)
4. Adjust quality parameters if desired
5. Click "Transcribe"
## Configuration
Set these in Space Settings → Variables:
- `WHISPER_MODEL_SIZE`: Model size (`tiny`, `base`, `small`, `medium`, `large-v3`) - default: `small`
- `WHISPER_DEVICE`: Device (`cpu` or `cuda`) - default: `cpu`
- `WHISPER_COMPUTE_TYPE`: Compute type (`int8`, `int16`, `float32`) - default: `int8`
- `DEFAULT_LANGUAGE`: Default language code - default: `ar` (Arabic)
- `WHISPER_BEAM_SIZE`: Beam search size (1-10) - default: `5`
- `WHISPER_BEST_OF`: Best-of candidates (1-10) - default: `5`
Secrets (required for diarization):
- `HF_TOKEN`: Your Hugging Face token with access to `pyannote/speaker-diarization-3.1`
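
A sketch of how `app.py` presumably picks these up; only the names and defaults above are documented, the exact handling in the app may differ:

```python
import os

MODEL_SIZE   = os.getenv("WHISPER_MODEL_SIZE", "small")
DEVICE       = os.getenv("WHISPER_DEVICE", "cpu")
COMPUTE_TYPE = os.getenv("WHISPER_COMPUTE_TYPE", "int8")
LANGUAGE     = os.getenv("DEFAULT_LANGUAGE", "ar")
BEAM_SIZE    = int(os.getenv("WHISPER_BEAM_SIZE", "5"))
BEST_OF      = int(os.getenv("WHISPER_BEST_OF", "5"))
HF_TOKEN     = os.getenv("HF_TOKEN")  # secret; required only for diarization
```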
## Model Information

### Whisper Models
| Model | Size | RAM | Quality | Speed |
|---|---|---|---|---|
| tiny | 75MB | 1GB | ⭐⭐ | Very Fast |
| base | 150MB | 1GB | ⭐⭐⭐ | Fast |
| small | 500MB | 2GB | ⭐⭐⭐⭐ | Moderate |
| medium | 1.5GB | 5GB | ⭐⭐⭐⭐⭐ | Slow |
| large-v3 | 3GB | 10GB | ⭐⭐⭐⭐⭐⭐ | Very Slow |
## First Run
- First transcription will download the selected Whisper model automatically
- Diarization downloads ~700MB on first use (cached afterward)
- Models are stored in the Space's persistent storage
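
If you want those downloads to land in a specific directory (for example a persistent `/data` mount), both libraries can be pointed there. The paths below are assumptions for illustration, not necessarily what `app.py` does:

```python
import os

# Send the Hugging Face hub cache (used by pyannote) to persistent storage
# before importing libraries that read it.
os.environ.setdefault("HF_HOME", "/data/hf-cache")

from faster_whisper import WhisperModel

# faster-whisper keeps its converted CTranslate2 weights under download_root.
model = WhisperModel("small", device="cpu", compute_type="int8",
                     download_root="/data/whisper-models")
```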
## Technical Details
- Uses the same model loading approach as the Django backend
- faster-whisper automatically downloads models from Hugging Face
- Diarization pipeline is downloaded locally to avoid repeated API calls
- All processing happens on this Space (no external inference APIs)
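
Putting the two pieces together, a rough sketch of the local pipeline looks like this; the speaker-assignment step is illustrative, and `app.py` may merge segments differently:

```python
import os

from faster_whisper import WhisperModel
from pyannote.audio import Pipeline

model = WhisperModel("small", device="cpu", compute_type="int8")
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)

audio = "call.wav"
segments, info = model.transcribe(audio, language="ar", beam_size=5, best_of=5)
diarization = diarizer(audio)

# Label each transcript segment with the speaker turn it overlaps the most.
turns = [(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]
for seg in segments:
    overlap = [(min(seg.end, end) - max(seg.start, start), spk) for start, end, spk in turns]
    speaker = max(overlap, default=(0.0, "SPEAKER_00"))[1]
    print(f"[{speaker}] {seg.text.strip()}")
```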
## Credits
- faster-whisper by Guillaume Klein
- pyannote.audio by Hervé Bredin
- Original Django backend by IZI Techs
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference