---
title: VTT with Diarization
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

Voice-to-Text with Speaker Diarization

Powered by faster-whisper and pyannote.audio running locally on this Space.

Features

  • 🎯 High-quality transcription using faster-whisper (2-4x faster than OpenAI Whisper)
  • 👥 Speaker diarization with pyannote.audio 3.1
  • 🌍 Multi-language support (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
  • ⚙️ Configurable parameters (beam size, best_of, model size)
  • 🔧 Optimized for Arabic customer service calls with specialized prompts

Usage

  1. Upload an audio file (mp3, wav, m4a, flac, etc.)
  2. Select language (or leave blank for auto-detect)
  3. Enable speaker diarization if needed (requires HF_TOKEN)
  4. Adjust quality parameters if desired
  5. Click "Transcribe"

Configuration

Set these in Space Settings → Variables:

  • WHISPER_MODEL_SIZE: Model size (tiny, base, small, medium, large-v3) - default: small
  • WHISPER_DEVICE: Device (cpu or cuda) - default: cpu
  • WHISPER_COMPUTE_TYPE: Compute type (int8, int16, float32) - default: int8
  • DEFAULT_LANGUAGE: Default language code - default: ar (Arabic)
  • WHISPER_BEAM_SIZE: Beam search size (1-10) - default: 5
  • WHISPER_BEST_OF: Best of candidates (1-10) - default: 5
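The variables above can be read with plain environment lookups that fall back to the documented defaults. A sketch (the `get_setting` helper is illustrative):

```python
import os

# Defaults mirror the table above; override in Space Settings → Variables.
DEFAULTS = {
    "WHISPER_MODEL_SIZE": "small",
    "WHISPER_DEVICE": "cpu",
    "WHISPER_COMPUTE_TYPE": "int8",
    "DEFAULT_LANGUAGE": "ar",
    "WHISPER_BEAM_SIZE": "5",
    "WHISPER_BEST_OF": "5",
}

def get_setting(name: str) -> str:
    """Read a Space variable, falling back to the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```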

Secrets (required for diarization):

  • HF_TOKEN: Your Hugging Face token with access to pyannote/speaker-diarization-3.1

Model Information

Whisper Models

Model      Size     RAM    Quality   Speed
tiny       75MB     1GB    ⭐⭐        Very Fast
base       150MB    1GB    ⭐⭐⭐       Fast
small      500MB    2GB    ⭐⭐⭐⭐      Moderate
medium     1.5GB    5GB    ⭐⭐⭐⭐⭐     Slow
large-v3   3GB      10GB   ⭐⭐⭐⭐⭐⭐    Very Slow

First Run

  • First transcription will download the selected Whisper model automatically
  • Diarization downloads ~700MB on first use (cached afterward)
  • Models are stored in the Space's persistent storage

Technical Details

  • Uses the same model loading approach as the Django backend
  • faster-whisper automatically downloads models from Hugging Face
  • Diarization pipeline is downloaded locally to avoid repeated API calls
  • All processing happens on this Space (no external inference APIs)
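Combining the two models comes down to labeling each transcript segment with the diarization turn it overlaps most, then emitting WebVTT cues. A self-contained sketch of that merge step (data shapes and function names are illustrative, not the app's actual code):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose diarization
    turn overlaps it the most.

    segments: list of (start, end, text) from transcription
    turns:    list of (start, end, speaker) from diarization
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(seg_end, t_end) - max(seg_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((seg_start, seg_end, best, text))
    return labeled

def to_vtt(labeled):
    """Render labeled segments as WebVTT cues with speaker voice tags."""
    def ts(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

    cues = ["WEBVTT", ""]
    for start, end, speaker, text in labeled:
        cues.append(f"{ts(start)} --> {ts(end)}")
        cues.append(f"<v {speaker}>{text.strip()}")
        cues.append("")
    return "\n".join(cues)
```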

Credits

Built with faster-whisper and pyannote.audio. For Space configuration options, see https://huggingface.co/docs/hub/spaces-config-reference