---
title: VTT with Diarization
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Voice-to-Text with Speaker Diarization

Powered by **faster-whisper** and **pyannote.audio**, running locally on this Space.

## Features

- 🎯 **High-quality transcription** using faster-whisper (2-4x faster than OpenAI Whisper)
- 👥 **Speaker diarization** with pyannote.audio 3.1
- 🌍 **Multi-language support** (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
- ⚙️ **Configurable parameters** (beam size, best_of, model size)
- 🔧 **Optimized for Arabic customer service calls** with specialized prompts

## Usage

1. Upload an audio file (mp3, wav, m4a, flac, etc.)
2. Select a language (or leave blank for auto-detection)
3. Enable speaker diarization if needed (requires `HF_TOKEN`)
4. Adjust quality parameters if desired
5. Click "Transcribe"

## Configuration

Set these in Space Settings → Variables:

- `WHISPER_MODEL_SIZE`: Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v3`) - default: `small`
- `WHISPER_DEVICE`: Inference device (`cpu` or `cuda`) - default: `cpu`
- `WHISPER_COMPUTE_TYPE`: Compute type (`int8`, `int16`, `float32`) - default: `int8`
- `DEFAULT_LANGUAGE`: Default language code - default: `ar` (Arabic)
- `WHISPER_BEAM_SIZE`: Beam search size (1-10) - default: `5`
- `WHISPER_BEST_OF`: Number of candidates to sample from (1-10) - default: `5`
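
As an illustration of how these variables might be consumed, here is a minimal sketch that reads each setting with the documented default. The `load_whisper_config` helper is hypothetical, not the actual code in `app.py`:

```python
import os

def load_whisper_config(env=os.environ):
    """Read the Space variables above, falling back to the documented defaults.

    Hypothetical helper for illustration; app.py may structure this differently.
    """
    return {
        "model_size": env.get("WHISPER_MODEL_SIZE", "small"),
        "device": env.get("WHISPER_DEVICE", "cpu"),
        "compute_type": env.get("WHISPER_COMPUTE_TYPE", "int8"),
        "language": env.get("DEFAULT_LANGUAGE", "ar"),
        "beam_size": int(env.get("WHISPER_BEAM_SIZE", "5")),
        "best_of": int(env.get("WHISPER_BEST_OF", "5")),
    }

# With no variables set, every documented default applies
print(load_whisper_config({}))
```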
### Secrets (required for diarization)

- `HF_TOKEN`: Your Hugging Face token with access to `pyannote/speaker-diarization-3.1`

## Model Information

### Whisper Models

| Model    | Size  | RAM  | Quality | Speed     |
|----------|-------|------|---------|-----------|
| tiny     | 75MB  | 1GB  | ⭐⭐ | Very Fast |
| base     | 150MB | 1GB  | ⭐⭐⭐ | Fast      |
| small    | 500MB | 2GB  | ⭐⭐⭐⭐ | Moderate  |
| medium   | 1.5GB | 5GB  | ⭐⭐⭐⭐⭐ | Slow      |
| large-v3 | 3GB   | 10GB | ⭐⭐⭐⭐⭐⭐ | Very Slow |

### First Run

- The first transcription automatically downloads the selected Whisper model
- Diarization downloads ~700MB of models on first use (cached afterward)
- Models are stored in the Space's persistent storage

## Technical Details

- Uses the same model-loading approach as the Django backend
- faster-whisper downloads models from Hugging Face automatically
- The diarization pipeline is downloaded locally to avoid repeated API calls
- All processing happens on this Space (no external inference APIs)
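
For illustration, combining diarization with transcription usually comes down to labeling each transcribed segment with the speaker whose diarization turn overlaps it most. Below is a simplified, self-contained sketch of that merge step; the data shapes and the `assign_speakers` function are assumptions for illustration, not this app's actual implementation:

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start_sec, end_sec, text) from the transcriber.
    turns: list of (start_sec, end_sec, speaker_label) from the diarizer.
    Illustrative sketch only; the real app may merge differently.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [(0.5, 3.5, "Hello, how can I help?"), (4.2, 8.0, "My order is late.")]
print(assign_speakers(segments, turns))
```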
## Credits

- [faster-whisper](https://github.com/guillaumekln/faster-whisper) by Guillaume Klein
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) by Hervé Bredin
- Original Django backend by IZI Techs

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference