vtt-with-diariazation / API_USAGE.md
Mahmoud Elsamadony

Using VTT with Diarization Space via API

This guide shows how to call the vtt-with-diariazation Hugging Face Space programmatically via its API.

Option 1: Using Python (Gradio Client)

Installation

pip install gradio_client

Quick Start

from gradio_client import Client

# Initialize client
client = Client("MahmoudElsamadony/vtt-with-diariazation")

# Transcribe audio
result = client.predict(
    audio_path="path/to/your/audio.mp3",
    language="ar",  # or "en", "fr", etc., or "" for auto-detect
    enable_diarization=False,
    beam_size=5,
    best_of=5,
    api_name="/predict"
)

transcript, details = result
print(f"Transcript: {transcript}")
print(f"Language: {details['language']}")
print(f"Duration: {details['duration']} seconds")

With Speaker Diarization

# Enable diarization to identify different speakers
result = client.predict(
    audio_path="path/to/your/audio.mp3",
    language="ar",
    enable_diarization=True,  # Enable speaker diarization
    beam_size=5,
    best_of=5,
    api_name="/predict"
)

transcript, details = result

# Access speaker information
for segment in details['segments']:
    speaker = segment.get('speaker', 'Unknown')
    text = segment['text']
    start = segment['start']
    print(f"[{start:.2f}s] {speaker}: {text}")
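When diarization is enabled, consecutive segments often share a speaker. A small helper (hypothetical, not part of the Space's API) can merge them into conversational turns; the sample `segments` list below mimics the shape of the API response:

```python
def group_by_speaker(segments):
    """Collapse consecutive same-speaker segments into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "Unknown")
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker continues: extend the current turn
            turns[-1]["text"] += " " + seg["text"].strip()
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({
                "speaker": speaker,
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            })
    return turns

# Sample data shaped like the API's `details['segments']`
segments = [
    {"start": 0.0, "end": 2.1, "text": "Hello there.", "speaker": "SPEAKER_00"},
    {"start": 2.1, "end": 4.0, "text": "How are you?", "speaker": "SPEAKER_00"},
    {"start": 4.0, "end": 6.5, "text": "Fine, thanks.", "speaker": "SPEAKER_01"},
]
for turn in group_by_speaker(segments):
    print(f"[{turn['start']:.2f}-{turn['end']:.2f}] {turn['speaker']}: {turn['text']}")
```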

Full Example Script

See api_client.py for a complete example with multiple use cases.

python api_client.py

Option 2: Using JavaScript/TypeScript

Installation

npm install @gradio/client

Usage

import { client } from "@gradio/client";

const app = await client("MahmoudElsamadony/vtt-with-diariazation");

const result = await app.predict("/predict", [
    "path/to/audio.mp3",  // audio_path
    "ar",                  // language
    false,                 // enable_diarization
    5,                     // beam_size
    5                      // best_of
]);

const [transcript, details] = result.data;
console.log("Transcript:", transcript);
console.log("Language:", details.language);
console.log("Duration:", details.duration);

Option 3: Using cURL (REST API)

First, get your Space's API endpoint:

curl https://mahmoudelsamadony-vtt-with-diariazation.hf.space/info

Then make a prediction (you'll need to upload the file first):

# This is more complex with cURL as you need to handle file uploads
# It's recommended to use the Python or JavaScript clients instead

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| audio_path | string | required | Path to the audio file (mp3, wav, m4a, etc.) |
| language | string | "ar" | Language code ("ar", "en", "fr", "de", "es", "ru", "zh"), or "" for auto-detect |
| enable_diarization | boolean | false | Enable speaker diarization (identifies different speakers) |
| beam_size | integer | 5 | Beam size for Whisper (1-10; higher = more accurate but slower) |
| best_of | integer | 5 | Best-of parameter for Whisper (1-10) |

Response Format

The API returns a tuple (transcript, details):

transcript (string)

The complete transcribed text.

details (object)

{
  "text": "Complete transcript text",
  "language": "ar",
  "language_probability": 0.98,
  "duration": 123.45,
  "segments": [
    {
      "start": 0.0,
      "end": 5.2,
      "text": "Segment text",
      "speaker": "SPEAKER_00",  // Only if diarization is enabled
      "words": [
        {
          "start": 0.0,
          "end": 0.5,
          "word": "word",
          "probability": 0.95
        }
      ]
    }
  ],
  "speakers": [  // Only if diarization is enabled
    {
      "start": 0.0,
      "end": 10.5,
      "speaker": "SPEAKER_00"
    }
  ]
}
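Since the Space produces VTT-style transcripts, a common next step is turning the `details` object into a WebVTT file, prefixing each cue with the speaker label when diarization was enabled. The helper below is a sketch (not part of the Space's API); the sample `details` dict mirrors the response shape above:

```python
def fmt_ts(seconds):
    """Format seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def details_to_vtt(details):
    """Render the API's `details` object as WebVTT text."""
    lines = ["WEBVTT", ""]
    for seg in details["segments"]:
        text = seg["text"].strip()
        if "speaker" in seg:
            text = f"{seg['speaker']}: {text}"
        lines.append(f"{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)

# Sample response shaped like the JSON above
details = {
    "segments": [
        {"start": 0.0, "end": 5.2, "text": "Segment text", "speaker": "SPEAKER_00"},
    ]
}
print(details_to_vtt(details))
```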

Error Handling

from gradio_client import Client

try:
    client = Client("MahmoudElsamadony/vtt-with-diariazation")
    result = client.predict(
        audio_path="audio.mp3",
        language="ar",
        enable_diarization=False,
        beam_size=5,
        best_of=5,
        api_name="/predict"
    )
    transcript, details = result
    print(transcript)
except Exception as e:
    print(f"Error: {e}")
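Transient failures (cold starts, rate limits) can often be handled by retrying. The helper below is a generic sketch, not part of gradio_client; you would wrap the `client.predict(...)` call in a zero-argument function and pass it in. It is demonstrated here with a stand-in function that fails twice before succeeding:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, wait base_delay * 2**i and retry."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** i)

# Stand-in for a flaky API call: fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(with_retries(flaky, attempts=3, base_delay=0.01))  # prints "ok"
```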

Tips

  1. First run takes longer - the Space needs to download its models (~1.2 GB total) on a cold start
  2. Diarization requires an HF token - make sure HF_TOKEN is set in your Space secrets
  3. Choose beam_size appropriately - higher values (8-10) are more accurate but slower
  4. Language auto-detection - pass an empty string "" as the language to auto-detect
  5. Rate limits - Hugging Face Spaces enforce rate limits on free usage

Local Testing

To test the API locally before deploying:

# In your space directory
python app.py

Then access via:

client = Client("http://127.0.0.1:7860")

Advanced: Non-Blocking Usage

Use client.submit() instead of client.predict() to get a Job handle immediately and keep working while the Space processes the audio:

from gradio_client import Client

client = Client("MahmoudElsamadony/vtt-with-diariazation")

# submit() returns immediately with a Job handle instead of blocking
job = client.submit(
    audio_path="audio.mp3",
    language="ar",
    enable_diarization=False,
    beam_size=5,
    best_of=5,
    api_name="/predict"
)

# Do other work while the job runs in the background...

# result() blocks until the prediction is ready
transcript, details = job.result()
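If your application runs inside an asyncio event loop, the blocking job.result() call can be offloaded to a worker thread with asyncio.to_thread so the loop stays responsive. The sketch below uses a stand-in blocking function in place of the real gradio_client call (which needs network access):

```python
import asyncio
import time

def block_like_job_result():
    """Stand-in for gradio_client's blocking Job.result()."""
    time.sleep(0.05)  # simulate waiting on the Space
    return ("transcript text", {"language": "ar"})

async def main():
    # asyncio.to_thread runs the blocking call without freezing the loop
    transcript, details = await asyncio.to_thread(block_like_job_result)
    return transcript, details

transcript, details = asyncio.run(main())
print(transcript)
```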

Support

For issues with the API, check: