---
title: VTT with Diarization
emoji: 🎙️
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: mit
---

# Voice-to-Text with Speaker Diarization

Powered by **faster-whisper** and **pyannote.audio**, running locally on this Space.

## Features

- 🎯 **High-quality transcription** using faster-whisper (2-4x faster than OpenAI Whisper)
- 👥 **Speaker diarization** with pyannote.audio 3.1
- 🌍 **Multi-language support** (Arabic, English, French, German, Spanish, Russian, Chinese, etc.)
- ⚙️ **Configurable parameters** (beam size, best_of, model size)
- 🔧 **Optimized for Arabic customer service calls** with specialized prompts

## Usage

1. Upload an audio file (mp3, wav, m4a, flac, etc.)
2. Select a language (or leave blank for auto-detection)
3. Enable speaker diarization if needed (requires `HF_TOKEN`)
4. Adjust quality parameters if desired
5. Click "Transcribe"

## Configuration

Set these in Space Settings → Variables:

- `WHISPER_MODEL_SIZE`: Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v3`) - default: `small`
- `WHISPER_DEVICE`: Inference device (`cpu` or `cuda`) - default: `cpu`
- `WHISPER_COMPUTE_TYPE`: Compute type (`int8`, `int16`, `float32`) - default: `int8`
- `DEFAULT_LANGUAGE`: Default language code - default: `ar` (Arabic)
- `WHISPER_BEAM_SIZE`: Beam search size (1-10) - default: `5`
- `WHISPER_BEST_OF`: Number of candidates to sample from (1-10) - default: `5`
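
As an illustration of how these variables might be consumed, here is a minimal sketch that reads each setting with the documented default. The `load_whisper_config` helper is hypothetical, not the actual code in `app.py`:

```python
import os

def load_whisper_config(env=os.environ):
    """Read the Space variables above, falling back to the documented defaults.

    Hypothetical helper for illustration; app.py may structure this differently.
    """
    return {
        "model_size": env.get("WHISPER_MODEL_SIZE", "small"),
        "device": env.get("WHISPER_DEVICE", "cpu"),
        "compute_type": env.get("WHISPER_COMPUTE_TYPE", "int8"),
        "language": env.get("DEFAULT_LANGUAGE", "ar"),
        "beam_size": int(env.get("WHISPER_BEAM_SIZE", "5")),
        "best_of": int(env.get("WHISPER_BEST_OF", "5")),
    }

# With no variables set, every documented default applies
print(load_whisper_config({}))
```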
### Secrets (required for diarization)

- `HF_TOKEN`: Your Hugging Face token with access to `pyannote/speaker-diarization-3.1`

## Model Information

### Whisper Models

| Model    | Size  | RAM  | Quality | Speed     |
|----------|-------|------|---------|-----------|
| tiny     | 75MB  | 1GB  | ⭐⭐ | Very Fast |
| base     | 150MB | 1GB  | ⭐⭐⭐ | Fast      |
| small    | 500MB | 2GB  | ⭐⭐⭐⭐ | Moderate  |
| medium   | 1.5GB | 5GB  | ⭐⭐⭐⭐⭐ | Slow      |
| large-v3 | 3GB   | 10GB | ⭐⭐⭐⭐⭐⭐ | Very Slow |

### First Run

- The first transcription automatically downloads the selected Whisper model
- Diarization downloads ~700MB of models on first use (cached afterward)
- Models are stored in the Space's persistent storage

## Technical Details

- Uses the same model-loading approach as the Django backend
- faster-whisper downloads models from Hugging Face automatically
- The diarization pipeline is downloaded locally to avoid repeated API calls
- All processing happens on this Space (no external inference APIs)
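
For illustration, combining diarization with transcription usually comes down to labeling each transcribed segment with the speaker whose diarization turn overlaps it most. Below is a simplified, self-contained sketch of that merge step; the data shapes and the `assign_speakers` function are assumptions for illustration, not this app's actual implementation:

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the speaker whose turn overlaps it most.

    segments: list of (start_sec, end_sec, text) from the transcriber.
    turns: list of (start_sec, end_sec, speaker_label) from the diarizer.
    Illustrative sketch only; the real app may merge differently.
    """
    labeled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the intersection of the two time intervals
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, text))
    return labeled

turns = [(0.0, 4.0, "SPEAKER_00"), (4.0, 9.0, "SPEAKER_01")]
segments = [(0.5, 3.5, "Hello, how can I help?"), (4.2, 8.0, "My order is late.")]
print(assign_speakers(segments, turns))
```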
## Credits

- [faster-whisper](https://github.com/guillaumekln/faster-whisper) by Guillaume Klein
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) by Hervé Bredin
- Original Django backend by IZI Techs

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference