Mahmoud Elsamadony committed
Commit 2271844 · 1 parent: b5282ea
NVIDIA_NEMO_MIGRATION.md ADDED
# Migration to NVIDIA NeMo Sortformer Diarization

## Overview
The application has been updated to use **NVIDIA NeMo's Sortformer** (`nvidia/diar_streaming_sortformer_4spk-v2`) in place of the previous pyannote diarization model.

## Key Changes

### 1. Model Architecture
- **Old**: pyannote.audio pipeline (gated model requiring an HF token and license acceptance)
- **New**: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

### 2. Features
- **Streaming capability**: Real-time processing with configurable latency
- **Max 4 speakers**: Optimized for up to 4 speakers (performance degrades beyond that)
- **Better accuracy**: State-of-the-art DER (Diarization Error Rate) on benchmark datasets
- **No HF token required**: The model downloads directly, without authentication

### 3. Technical Improvements
- **Arrival-Order Speaker Cache (AOSC)**: Tracks speakers by arrival time
- **Frame-level processing**: 80 ms frames (0.08 seconds per frame)
- **Configurable streaming**: Adjustable latency/accuracy trade-off

## Installation

### Updated Dependencies
```bash
pip install -r requirements.txt
```

Key additions:
- `nemo_toolkit[asr]` - NVIDIA NeMo framework with ASR components
- `Cython` and `packaging` - required for NeMo installation

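After installing, it can be worth verifying that the toolkit is importable before the app tries to load the model. A minimal sketch (the helper name `nemo_available` is ours; the deferred import mirrors the one used in `app.py`):

```python
import importlib.util

def nemo_available() -> bool:
    """Return True if the nemo_toolkit package is importable."""
    return importlib.util.find_spec("nemo") is not None

if nemo_available():
    # Defer the heavy import until we know the package exists
    from nemo.collections.asr.models import SortformerEncLabelModel  # noqa: F401
    print("NeMo is installed; SortformerEncLabelModel is importable")
else:
    print("NeMo is missing - run: pip install nemo_toolkit[asr]")
```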
### System Requirements
```bash
# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg
```

## Configuration

### Environment Variables

#### Diarization Model
```bash
DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2
```

#### Streaming Configuration (80 ms frames)
Current preset: **High Latency** (10 seconds, better accuracy)

```bash
DIAR_CHUNK_SIZE=124      # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1     # Future frames after the chunk
DIAR_FIFO_SIZE=124       # Previous frames kept before the chunk
DIAR_UPDATE_PERIOD=124   # Speaker cache update period
DIAR_CACHE_SIZE=188      # Total speaker cache size
```

#### Available Presets

| Preset | Latency | RTF | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|--------|---------|-----|------------|---------------|-----------|---------------|------------|
| Very High Latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| **High Latency** (current) | **10.0s** | **0.005** | **124** | **1** | **124** | **124** | **188** |
| Low Latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| Ultra Low Latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

*RTF = Real-Time Factor (measured on an NVIDIA RTX 6000 Ada)*

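The latency column follows from the 80 ms frame size: a chunk cannot be emitted until CHUNK_SIZE plus RIGHT_CONTEXT frames have arrived. A small sanity check of that arithmetic (an inferred relation that happens to match the preset table, not an official NeMo formula):

```python
FRAME_SEC = 0.08  # 80 ms per frame, as stated in the model card

def approx_latency(chunk_size: int, right_context: int) -> float:
    """Seconds of audio that must arrive before a chunk can be emitted."""
    return (chunk_size + right_context) * FRAME_SEC

# (CHUNK_SIZE, RIGHT_CONTEXT) pairs from the preset table
presets = {
    "Very High Latency": (340, 40),
    "High Latency": (124, 1),
    "Low Latency": (6, 7),
    "Ultra Low Latency": (3, 1),
}
for name, (chunk, rc) in presets.items():
    print(f"{name}: ~{approx_latency(chunk, rc):.2f}s")
```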
## API Changes

### Input Format
Same as before - an audio file path:
```python
diar_model.diarize(audio=audio_path, batch_size=1)
```

### Output Format
NeMo returns: `[[start_seconds, end_seconds, speaker_index], ...]`

Example:
```python
[
    [0.0, 5.2, 0],     # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],    # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],   # SPEAKER_0 from 10.2s to 15.0s
]
```

Converted to:
```json
{
  "start": 0.0,
  "end": 5.2,
  "speaker": "SPEAKER_0"
}
```

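The conversion can be sketched as a small helper (mirroring the logic in `app.py`; the function name `to_speaker_turns` is ours):

```python
from typing import Dict, List, Union

def to_speaker_turns(
    predicted_segments: List[List[float]],
) -> List[Dict[str, Union[float, str]]]:
    """Convert NeMo's [start, end, speaker_index] triples into the
    {"start", "end", "speaker"} dicts used by the rest of the app."""
    turns = []
    for start_time, end_time, speaker_idx in predicted_segments:
        turns.append(
            {
                "start": float(start_time),
                "end": float(end_time),
                "speaker": f"SPEAKER_{int(speaker_idx)}",
            }
        )
    return turns

print(to_speaker_turns([[0.0, 5.2, 0], [5.3, 10.1, 1]]))
```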
## Model Limitations

1. **Maximum 4 speakers**: Performance degrades with 5+ speakers
2. **English-optimized**: Trained primarily on English datasets (but works with other languages)
3. **Long recordings**: May degrade on very long recordings (several hours)
4. **Audio format**: Requires single-channel (mono) 16 kHz audio

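Given limitation 4, a cheap pre-flight check on WAV input can avoid a confusing failure later. A standard-library sketch (the helper name `wav_needs_resample` is ours; other container formats would need `soundfile` or ffmpeg instead):

```python
import wave

TARGET_RATE = 16_000  # Sortformer expects mono 16 kHz input

def wav_needs_resample(path: str) -> bool:
    """Return True if the WAV file is not already mono 16 kHz."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels() != 1 or wf.getframerate() != TARGET_RATE
```

If it returns `True`, convert first, e.g. `ffmpeg -i input.wav -ac 1 -ar 16000 output.wav`.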
## Performance Benchmarks

### DIHARD III Eval (1-4 speakers)
- DER: 13.24%

### CALLHOME (2 speakers)
- DER: 6.57%

### CALLHOME (full, 2-6 speakers)
- DER: 10.70%

## Troubleshooting

### NeMo Installation Issues
```bash
# Install prerequisites first
pip install Cython packaging

# Then install NeMo
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

### Model Download Issues
The model (~700MB) downloads automatically from Hugging Face on first use. If the download fails:
- Check your internet connection
- Verify free disk space (~1GB recommended)
- Try a manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2

### Memory Issues
If you run out of memory:
- Reduce `DIAR_CACHE_SIZE` (default: 188)
- Use the "Low Latency" preset (smaller buffers)
- Process shorter audio segments

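For the last point, segmentation can be as simple as cutting the waveform into fixed-length windows by sample index. A generic sketch (not part of `app.py`; the names and the 10-minute default are ours):

```python
def split_into_segments(num_samples: int, sample_rate: int = 16_000,
                        segment_sec: float = 600.0):
    """Yield (start_sample, end_sample) windows covering the whole audio."""
    step = int(segment_sec * sample_rate)
    for start in range(0, num_samples, step):
        yield start, min(start + step, num_samples)

# A 25-minute file at 16 kHz splits into three windows
for window in split_into_segments(25 * 60 * 16_000):
    print(window)
```

Note that speaker labels are not consistent across independently diarized segments, so stitching results back together requires extra matching logic.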
## Migration Steps for Hugging Face Space

1. **Update requirements.txt**: Already done ✓
2. **Update app.py**: Already done ✓
3. **Remove HF_TOKEN requirement**: No longer needed for diarization
4. **Restart/Rebuild Space**: Click "Restart" or "Rebuild" in the Space settings
5. **First run**: The model downloads automatically (~700MB, one time)

## References

- [NVIDIA NeMo Repository](https://github.com/NVIDIA/NeMo)
- [Sortformer Paper](https://arxiv.org/abs/2409.06656)
- [Streaming Sortformer Paper](https://arxiv.org/abs/2507.18446)
- [Model Card on Hugging Face](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2)
- [NeMo Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/)

## License

NVIDIA Sortformer is licensed under **CC-BY-4.0** (Creative Commons Attribution 4.0 International).
- ✓ Commercial use allowed
- ✓ Modification allowed
- ✓ Distribution allowed
- ✓ Attribution required
__pycache__/app.cpython-311.pyc CHANGED
Binary files a/__pycache__/app.cpython-311.pyc and b/__pycache__/app.cpython-311.pyc differ
 
app.py CHANGED
@@ -7,7 +7,14 @@ import torch
 from dotenv import load_dotenv
 from faster_whisper import WhisperModel
 from huggingface_hub import snapshot_download
-from pyannote.audio import Pipeline
+
+# Import NeMo for NVIDIA Sortformer diarization
+try:
+    from nemo.collections.asr.models import SortformerEncLabelModel
+    NEMO_AVAILABLE = True
+except ImportError:
+    NEMO_AVAILABLE = False
+    SortformerEncLabelModel = None
 
 load_dotenv()
 
@@ -20,14 +27,19 @@ WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "large-v3")
 WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
 WHISPER_COMPUTE_TYPE = os.environ.get("WHISPER_COMPUTE_TYPE", "int8_float32")
 
-# Diarization: download locally to avoid repeated API calls
-DIARIZATION_REPO_ID = os.environ.get(
-    "DIARIZATION_REPO_ID", "Revai/reverb-diarization-v2"
-)
-DIARIZATION_LOCAL_DIR = os.environ.get(
-    "DIARIZATION_LOCAL_DIR", "models/Revai-reverb-diarization-v2"
+# Diarization: NVIDIA NeMo Sortformer model
+DIARIZATION_MODEL_NAME = os.environ.get(
+    "DIARIZATION_MODEL_NAME", "nvidia/diar_streaming_sortformer_4spk-v2"
 )
 
+# Diarization streaming configuration (using "high latency" preset for better accuracy)
+# See https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2 for other configs
+CHUNK_SIZE = int(os.environ.get("DIAR_CHUNK_SIZE", 124))  # high latency: 124 frames
+RIGHT_CONTEXT = int(os.environ.get("DIAR_RIGHT_CONTEXT", 1))
+FIFO_SIZE = int(os.environ.get("DIAR_FIFO_SIZE", 124))
+UPDATE_PERIOD = int(os.environ.get("DIAR_UPDATE_PERIOD", 124))
+SPEAKER_CACHE_SIZE = int(os.environ.get("DIAR_CACHE_SIZE", 188))
+
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
 
 # Preload prompts/parameters
@@ -44,7 +56,7 @@ best_of_default = int(os.environ.get("WHISPER_BEST_OF", 5))
 # Lazy singletons for the heavy models
 # ---------------------------------------------------------------------------
 _whisper_model: Optional[WhisperModel] = None
-_diarization_pipeline: Optional[Pipeline] = None
+_diarization_model: Optional[SortformerEncLabelModel] = None
 
 
 def _ensure_snapshot(repo_id: str, local_dir: str, allow_patterns: Optional[List[str]] = None) -> str:
@@ -79,61 +91,53 @@ def _load_whisper_model() -> WhisperModel:
     return _whisper_model
 
 
-def _load_diarization_pipeline() -> Optional[Pipeline]:
-    """Load speaker diarization pipeline lazily (singleton)"""
-    global _diarization_pipeline
-    if _diarization_pipeline is None:
-        if HF_TOKEN is None:
+def _load_diarization_model() -> Optional[SortformerEncLabelModel]:
+    """Load NVIDIA NeMo Sortformer diarization model lazily (singleton)"""
+    global _diarization_model
+    if _diarization_model is None:
+        if not NEMO_AVAILABLE:
             raise gr.Error(
-                "HF_TOKEN secret is missing. Add it in Space settings to enable diarization."
+                "NeMo is not installed. Please install it with: pip install nemo_toolkit[asr]"
            )
 
-        print("Loading diarization pipeline...")
-        # Download the pipeline locally to avoid repeated API calls
+        print(f"Loading NVIDIA Sortformer diarization model: {DIARIZATION_MODEL_NAME}...")
+
         try:
-            local_dir = _ensure_snapshot(DIARIZATION_REPO_ID, DIARIZATION_LOCAL_DIR)
-        except Exception as e:
-            # snapshot_download failed (likely authentication or gating)
-            raise gr.Error(
-                f"Could not download '{DIARIZATION_REPO_ID}' to '{DIARIZATION_LOCAL_DIR}': {e}\n\n"
-                "Make sure your HF token has the required 'read' scope and that you have accepted "
-                "the model's terms at https://hf.co/models/pyannote/speaker-diarization-3.1. "
-                "Add the token as the 'HF_TOKEN' Space secret and restart the Space."
+            # Load model directly from Hugging Face
+            _diarization_model = SortformerEncLabelModel.from_pretrained(
+                DIARIZATION_MODEL_NAME
            )
-
-        # Try loading the pipeline. Different pyannote versions accept either
-        # use_auth_token or token as the auth parameter.
-        try:
-            try:
-                _diarization_pipeline = Pipeline.from_pretrained(
-                    local_dir, use_auth_token=HF_TOKEN
-                )
-            except TypeError:
-                # older/newer versions may use 'token'
-                _diarization_pipeline = Pipeline.from_pretrained(
-                    local_dir, token=HF_TOKEN
-                )
+
+            # Switch to evaluation mode
+            _diarization_model.eval()
+
+            # Configure streaming parameters (high latency preset for better accuracy)
+            # See: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2#setting-up-streaming-configuration
+            _diarization_model.sortformer_modules.chunk_len = CHUNK_SIZE
+            _diarization_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
+            _diarization_model.sortformer_modules.fifo_len = FIFO_SIZE
+            _diarization_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD
+            _diarization_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
+            _diarization_model.sortformer_modules._check_streaming_parameters()
+
+            print("Sortformer model loaded successfully!")
+
         except Exception as e:
-            # Loading failed - likely due to gated model or invalid token
             raise gr.Error(
-                "Failed to load the diarization pipeline. This can happen if the model is gated or "
-                "your Hugging Face token doesn't have permission.\n\n"
+                f"Failed to load NVIDIA Sortformer diarization model.\n\n"
                 f"Error details: {e}\n\n"
                 "Solutions:\n"
-                " - Go to https://hf.co/models/pyannote/speaker-diarization-3.1 and accept the terms if required.\n"
-                " - Create a token at https://hf.co/settings/tokens with 'read' scope and add it to your Space Secrets as 'HF_TOKEN'.\n"
-                " - Restart/rebuild the Space after adding the secret."
+                " - Make sure NeMo is properly installed: pip install nemo_toolkit[asr]\n"
+                " - Check that you have internet access to download from Hugging Face.\n"
+                " - The model will be downloaded automatically on first use (~700MB)."
            )
 
-    if _diarization_pipeline is None:
+    if _diarization_model is None:
         raise gr.Error(
-            "Diarization pipeline returned None after loading. Check that the model files were downloaded "
-            "and that your token has access to the repository."
+            "Diarization model returned None after loading."
        )
-
-    # Move pipeline to CPU
-    _diarization_pipeline.to(torch.device("cpu"))
-    return _diarization_pipeline
+
+    return _diarization_model
 
 
 # ---------------------------------------------------------------------------
@@ -220,19 +224,29 @@ def transcribe(
         }
 
     if enable_diarization:
-        pipeline = _load_diarization_pipeline()
-        diarization_result = pipeline(audio_path)
-
+        diar_model = _load_diarization_model()
+
+        # Run diarization using NeMo Sortformer
+        # Returns list of tuples: (start_time, end_time, speaker_id)
+        try:
+            predicted_segments = diar_model.diarize(audio=audio_path, batch_size=1)
+        except Exception as e:
+            raise gr.Error(f"Diarization failed: {e}")
+
+        # Convert NeMo output format to our speaker_turns format
+        # NeMo returns: [[start_seconds, end_seconds, speaker_index], ...]
         speaker_turns: List[Dict[str, float]] = []
-        for turn, _, speaker in diarization_result.itertracks(yield_label=True):
+        for segment in predicted_segments:
+            start_time, end_time, speaker_idx = segment
             speaker_turns.append(
                 {
-                    "start": turn.start,
-                    "end": turn.end,
-                    "speaker": speaker,
+                    "start": float(start_time),
+                    "end": float(end_time),
+                    "speaker": f"SPEAKER_{int(speaker_idx)}",
                 }
            )
 
+        # Assign speakers to transcript segments based on temporal overlap
         for segment in response["segments"]:
             mid = (segment["start"] + segment["end"]) / 2
             segment["speaker"] = None
@@ -251,11 +265,11 @@
 # Gradio UI definition
 # ---------------------------------------------------------------------------
 def build_interface() -> gr.Blocks:
-    with gr.Blocks(title="VTT with Diarization (faster-whisper + pyannote)") as demo:
+    with gr.Blocks(title="VTT with Diarization (faster-whisper + NVIDIA NeMo)") as demo:
         gr.Markdown(
             """
             # Voice-to-Text with Optional Diarization
-            Powered by **faster-whisper** and **pyannote.audio** (runs locally on this Space).
+            Powered by **faster-whisper** and **NVIDIA NeMo Sortformer** (runs locally on this Space).
             """
        )
 
@@ -273,7 +287,7 @@ def build_interface() -> gr.Blocks:
         diarization_toggle = gr.Checkbox(
             label="Enable Speaker Diarization",
             value=False,
-            info="Requires HF token with access to pyannote/speaker-diarization-3.1.",
+            info="Uses NVIDIA Sortformer model (max 4 speakers, downloads ~700MB on first use).",
        )
         beam_slider = gr.Slider(
             label="Beam Size",
@@ -310,11 +324,12 @@ def build_interface() -> gr.Blocks:
         gr.Markdown(
             f"""
             ## Tips
-            - **Current model**: `{WHISPER_MODEL_SIZE}` (first run downloads model automatically)
+            - **Whisper model**: `{WHISPER_MODEL_SIZE}` (first run downloads model automatically)
+            - **Diarization model**: NVIDIA `{DIARIZATION_MODEL_NAME}` (streaming, max 4 speakers)
             - Diarization downloads ~700MB on first use (cached afterward)
-            - Store your Hugging Face token in Space Secrets as **HF_TOKEN** (required for diarization)
             - Change `WHISPER_MODEL_SIZE` in Space Variables to `medium` or `large-v3` for higher accuracy
             - Optimized for Arabic customer service calls with specialized initial prompt
+            - Streaming configuration: High latency preset (10s latency, better accuracy)
             """
        )
 
requirements.txt CHANGED
@@ -1,6 +1,5 @@
 gradio>=4.42.0
 faster-whisper>=1.0.0
-pyannote.audio>=3.1.1
 huggingface-hub>=0.23.0
 torch==2.3.0
 torchaudio==2.3.0
@@ -8,3 +7,9 @@ soundfile>=0.12.1
 numpy>=1.26.0
 ffmpeg-python>=0.2.0
 python-dotenv>=1.0.1
+
+# NVIDIA NeMo for Sortformer diarization
+# Install with ASR components for speaker diarization
+Cython
+packaging
+nemo_toolkit[asr]