Mahmoud Elsamadony committed
Commit 2271844 · 1 parent: b5282ea
NVIDIA_NEMO_MIGRATION.md ADDED
# Migration to NVIDIA NeMo Sortformer Diarization

## Overview
The application has been updated to use **NVIDIA NeMo's Sortformer** (`nvidia/diar_streaming_sortformer_4spk-v2`) in place of the previous pyannote diarization model.

## Key Changes

### 1. Model Architecture
- **Old**: pyannote.audio pipeline (gated model requiring an HF token and license acceptance)
- **New**: NVIDIA NeMo Sortformer (open CC-BY-4.0 license, no gating)

### 2. Features
- **Streaming capability**: Real-time processing with configurable latency
- **Max 4 speakers**: Optimized for up to 4 speakers (performance degrades beyond that)
- **Better accuracy**: State-of-the-art DER (Diarization Error Rate) on benchmark datasets
- **No HF token required**: The model downloads directly, without authentication

### 3. Technical Improvements
- **Arrival-Order Speaker Cache (AOSC)**: Tracks speakers by arrival time
- **Frame-level processing**: 80 ms frames (0.08 seconds per frame)
- **Configurable streaming**: Adjustable latency/accuracy trade-off

## Installation

### Updated Dependencies
```bash
pip install -r requirements.txt
```

Key additions:
- `nemo_toolkit[asr]` - NVIDIA NeMo framework with ASR components
- `Cython` and `packaging` - required for NeMo installation

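After installing, it can be worth verifying that the toolkit is importable before the app tries to load the model. A minimal sketch (the helper name `nemo_available` is ours; the deferred import mirrors the one used in `app.py`):

```python
import importlib.util

def nemo_available() -> bool:
    """Return True if the nemo_toolkit package is importable."""
    return importlib.util.find_spec("nemo") is not None

if nemo_available():
    # Defer the heavy import until we know the package exists
    from nemo.collections.asr.models import SortformerEncLabelModel  # noqa: F401
    print("NeMo is installed; SortformerEncLabelModel is importable")
else:
    print("NeMo is missing - run: pip install nemo_toolkit[asr]")
```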
### System Requirements
```bash
# Install system dependencies (Ubuntu/Debian)
apt-get update && apt-get install -y libsndfile1 ffmpeg
```

## Configuration

### Environment Variables

#### Diarization Model
```bash
DIARIZATION_MODEL_NAME=nvidia/diar_streaming_sortformer_4spk-v2
```

#### Streaming Configuration (80 ms frames)
Current preset: **High Latency** (10 seconds, better accuracy)

```bash
DIAR_CHUNK_SIZE=124      # Processing chunk size (frames)
DIAR_RIGHT_CONTEXT=1     # Future frames after the chunk
DIAR_FIFO_SIZE=124       # Previous frames kept before the chunk
DIAR_UPDATE_PERIOD=124   # Speaker cache update period
DIAR_CACHE_SIZE=188      # Total speaker cache size
```

#### Available Presets

| Preset | Latency | RTF | CHUNK_SIZE | RIGHT_CONTEXT | FIFO_SIZE | UPDATE_PERIOD | CACHE_SIZE |
|--------|---------|-----|------------|---------------|-----------|---------------|------------|
| Very High Latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| **High Latency** (current) | **10.0s** | **0.005** | **124** | **1** | **124** | **124** | **188** |
| Low Latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| Ultra Low Latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

*RTF = Real-Time Factor (measured on an NVIDIA RTX 6000 Ada)*

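The latency column follows from the 80 ms frame size: a chunk cannot be emitted until CHUNK_SIZE plus RIGHT_CONTEXT frames have arrived. A small sanity check of that arithmetic (an inferred relation that happens to match the preset table, not an official NeMo formula):

```python
FRAME_SEC = 0.08  # 80 ms per frame, as stated in the model card

def approx_latency(chunk_size: int, right_context: int) -> float:
    """Seconds of audio that must arrive before a chunk can be emitted."""
    return (chunk_size + right_context) * FRAME_SEC

# (CHUNK_SIZE, RIGHT_CONTEXT) pairs from the preset table
presets = {
    "Very High Latency": (340, 40),
    "High Latency": (124, 1),
    "Low Latency": (6, 7),
    "Ultra Low Latency": (3, 1),
}
for name, (chunk, rc) in presets.items():
    print(f"{name}: ~{approx_latency(chunk, rc):.2f}s")
```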
## API Changes

### Input Format
Same as before - an audio file path:
```python
diar_model.diarize(audio=audio_path, batch_size=1)
```

### Output Format
NeMo returns: `[[start_seconds, end_seconds, speaker_index], ...]`

Example:
```python
[
    [0.0, 5.2, 0],     # SPEAKER_0 from 0s to 5.2s
    [5.3, 10.1, 1],    # SPEAKER_1 from 5.3s to 10.1s
    [10.2, 15.0, 0],   # SPEAKER_0 from 10.2s to 15.0s
]
```

Converted to:
```json
{
  "start": 0.0,
  "end": 5.2,
  "speaker": "SPEAKER_0"
}
```

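The conversion can be sketched as a small helper (mirroring the logic in `app.py`; the function name `to_speaker_turns` is ours):

```python
from typing import Dict, List, Union

def to_speaker_turns(
    predicted_segments: List[List[float]],
) -> List[Dict[str, Union[float, str]]]:
    """Convert NeMo's [start, end, speaker_index] triples into the
    {"start", "end", "speaker"} dicts used by the rest of the app."""
    turns = []
    for start_time, end_time, speaker_idx in predicted_segments:
        turns.append(
            {
                "start": float(start_time),
                "end": float(end_time),
                "speaker": f"SPEAKER_{int(speaker_idx)}",
            }
        )
    return turns

print(to_speaker_turns([[0.0, 5.2, 0], [5.3, 10.1, 1]]))
```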
## Model Limitations

1. **Maximum 4 speakers**: Performance degrades with 5+ speakers
2. **English-optimized**: Trained primarily on English datasets (but works with other languages)
3. **Long recordings**: May degrade on very long recordings (several hours)
4. **Audio format**: Requires single-channel (mono) 16 kHz audio

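Given limitation 4, a cheap pre-flight check on WAV input can avoid a confusing failure later. A standard-library sketch (the helper name `wav_needs_resample` is ours; other container formats would need `soundfile` or ffmpeg instead):

```python
import wave

TARGET_RATE = 16_000  # Sortformer expects mono 16 kHz input

def wav_needs_resample(path: str) -> bool:
    """Return True if the WAV file is not already mono 16 kHz."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels() != 1 or wf.getframerate() != TARGET_RATE
```

If it returns `True`, convert first, e.g. `ffmpeg -i input.wav -ac 1 -ar 16000 output.wav`.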
## Performance Benchmarks

### DIHARD III Eval (1-4 speakers)
- DER: 13.24%

### CALLHOME (2 speakers)
- DER: 6.57%

### CALLHOME (full, 2-6 speakers)
- DER: 10.70%

## Troubleshooting

### NeMo Installation Issues
```bash
# Install prerequisites first
pip install Cython packaging

# Then install NeMo
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

### Model Download Issues
The model (~700MB) downloads automatically from Hugging Face on first use. If the download fails:
- Check your internet connection
- Verify free disk space (~1GB recommended)
- Try a manual download from: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2

### Memory Issues
If you run out of memory:
- Reduce `DIAR_CACHE_SIZE` (default: 188)
- Use the "Low Latency" preset (smaller buffers)
- Process shorter audio segments

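For the last point, segmentation can be as simple as cutting the waveform into fixed-length windows by sample index. A generic sketch (not part of `app.py`; the names and the 10-minute default are ours):

```python
def split_into_segments(num_samples: int, sample_rate: int = 16_000,
                        segment_sec: float = 600.0):
    """Yield (start_sample, end_sample) windows covering the whole audio."""
    step = int(segment_sec * sample_rate)
    for start in range(0, num_samples, step):
        yield start, min(start + step, num_samples)

# A 25-minute file at 16 kHz splits into three windows
for window in split_into_segments(25 * 60 * 16_000):
    print(window)
```

Note that speaker labels are not consistent across independently diarized segments, so stitching results back together requires extra matching logic.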
## Migration Steps for Hugging Face Space

1. **Update requirements.txt**: Already done ✓
2. **Update app.py**: Already done ✓
3. **Remove HF_TOKEN requirement**: No longer needed for diarization
4. **Restart/Rebuild Space**: Click "Restart" or "Rebuild" in the Space settings
5. **First run**: The model downloads automatically (~700MB, one time)

## References

- [NVIDIA NeMo Repository](https://github.com/NVIDIA/NeMo)
- [Sortformer Paper](https://arxiv.org/abs/2409.06656)
- [Streaming Sortformer Paper](https://arxiv.org/abs/2507.18446)
- [Model Card on Hugging Face](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2)
- [NeMo Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/)

## License

NVIDIA Sortformer is licensed under **CC-BY-4.0** (Creative Commons Attribution 4.0 International).
- ✓ Commercial use allowed
- ✓ Modification allowed
- ✓ Distribution allowed
- ✓ Attribution required
__pycache__/app.cpython-311.pyc CHANGED
Binary files a/__pycache__/app.cpython-311.pyc and b/__pycache__/app.cpython-311.pyc differ
 
app.py CHANGED
@@ -7,7 +7,14 @@ import torch
 from dotenv import load_dotenv
 from faster_whisper import WhisperModel
 from huggingface_hub import snapshot_download
-from pyannote.audio import Pipeline
+
+# Import NeMo for NVIDIA Sortformer diarization
+try:
+    from nemo.collections.asr.models import SortformerEncLabelModel
+    NEMO_AVAILABLE = True
+except ImportError:
+    NEMO_AVAILABLE = False
+    SortformerEncLabelModel = None
 
 load_dotenv()
 
@@ -20,14 +27,19 @@ WHISPER_MODEL_SIZE = os.environ.get("WHISPER_MODEL_SIZE", "large-v3")
 WHISPER_DEVICE = os.environ.get("WHISPER_DEVICE", "cpu")
 WHISPER_COMPUTE_TYPE = os.environ.get("WHISPER_COMPUTE_TYPE", "int8_float32")
 
-# Diarization: download locally to avoid repeated API calls
-DIARIZATION_REPO_ID = os.environ.get(
-    "DIARIZATION_REPO_ID", "Revai/reverb-diarization-v2"
-)
-DIARIZATION_LOCAL_DIR = os.environ.get(
-    "DIARIZATION_LOCAL_DIR", "models/Revai-reverb-diarization-v2"
+# Diarization: NVIDIA NeMo Sortformer model
+DIARIZATION_MODEL_NAME = os.environ.get(
+    "DIARIZATION_MODEL_NAME", "nvidia/diar_streaming_sortformer_4spk-v2"
 )
 
+# Diarization streaming configuration (using "high latency" preset for better accuracy)
+# See https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2 for other configs
+CHUNK_SIZE = int(os.environ.get("DIAR_CHUNK_SIZE", 124))  # high latency: 124 frames
+RIGHT_CONTEXT = int(os.environ.get("DIAR_RIGHT_CONTEXT", 1))
+FIFO_SIZE = int(os.environ.get("DIAR_FIFO_SIZE", 124))
+UPDATE_PERIOD = int(os.environ.get("DIAR_UPDATE_PERIOD", 124))
+SPEAKER_CACHE_SIZE = int(os.environ.get("DIAR_CACHE_SIZE", 188))
+
 HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACEHUB_API_TOKEN")
 
 # Preload prompts/parameters
@@ -44,7 +56,7 @@ best_of_default = int(os.environ.get("WHISPER_BEST_OF", 5))
 # Lazy singletons for the heavy models
 # ---------------------------------------------------------------------------
 _whisper_model: Optional[WhisperModel] = None
-_diarization_pipeline: Optional[Pipeline] = None
+_diarization_model: Optional[SortformerEncLabelModel] = None
 
 
 def _ensure_snapshot(repo_id: str, local_dir: str, allow_patterns: Optional[List[str]] = None) -> str:
@@ -79,61 +91,53 @@ def _load_whisper_model() -> WhisperModel:
     return _whisper_model
 
 
-def _load_diarization_pipeline() -> Optional[Pipeline]:
-    """Load speaker diarization pipeline lazily (singleton)"""
-    global _diarization_pipeline
-    if _diarization_pipeline is None:
-        if HF_TOKEN is None:
+def _load_diarization_model() -> Optional[SortformerEncLabelModel]:
+    """Load NVIDIA NeMo Sortformer diarization model lazily (singleton)"""
+    global _diarization_model
+    if _diarization_model is None:
+        if not NEMO_AVAILABLE:
             raise gr.Error(
-                "HF_TOKEN secret is missing. Add it in Space settings to enable diarization."
+                "NeMo is not installed. Please install it with: pip install nemo_toolkit[asr]"
            )
 
-        print("Loading diarization pipeline...")
-        # Download the pipeline locally to avoid repeated API calls
+        print(f"Loading NVIDIA Sortformer diarization model: {DIARIZATION_MODEL_NAME}...")
+
         try:
-            local_dir = _ensure_snapshot(DIARIZATION_REPO_ID, DIARIZATION_LOCAL_DIR)
-        except Exception as e:
-            # snapshot_download failed (likely authentication or gating)
-            raise gr.Error(
-                f"Could not download '{DIARIZATION_REPO_ID}' to '{DIARIZATION_LOCAL_DIR}': {e}\n\n"
-                "Make sure your HF token has the required 'read' scope and that you have accepted "
-                "the model's terms at https://hf.co/models/pyannote/speaker-diarization-3.1. "
-                "Add the token as the 'HF_TOKEN' Space secret and restart the Space."
+            # Load model directly from Hugging Face
+            _diarization_model = SortformerEncLabelModel.from_pretrained(
+                DIARIZATION_MODEL_NAME
            )
-
-        # Try loading the pipeline. Different pyannote versions accept either
-        # use_auth_token or token as the auth parameter.
-        try:
-            try:
-                _diarization_pipeline = Pipeline.from_pretrained(
-                    local_dir, use_auth_token=HF_TOKEN
-                )
-            except TypeError:
-                # older/newer versions may use 'token'
-                _diarization_pipeline = Pipeline.from_pretrained(
-                    local_dir, token=HF_TOKEN
-                )
+
+            # Switch to evaluation mode
+            _diarization_model.eval()
+
+            # Configure streaming parameters (high latency preset for better accuracy)
+            # See: https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2#setting-up-streaming-configuration
+            _diarization_model.sortformer_modules.chunk_len = CHUNK_SIZE
+            _diarization_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
+            _diarization_model.sortformer_modules.fifo_len = FIFO_SIZE
+            _diarization_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD
+            _diarization_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
+            _diarization_model.sortformer_modules._check_streaming_parameters()
+
+            print("Sortformer model loaded successfully!")
+
         except Exception as e:
-            # Loading failed - likely due to gated model or invalid token
             raise gr.Error(
-                "Failed to load the diarization pipeline. This can happen if the model is gated or "
-                "your Hugging Face token doesn't have permission.\n\n"
+                f"Failed to load NVIDIA Sortformer diarization model.\n\n"
                 f"Error details: {e}\n\n"
                 "Solutions:\n"
-                " - Go to https://hf.co/models/pyannote/speaker-diarization-3.1 and accept the terms if required.\n"
-                " - Create a token at https://hf.co/settings/tokens with 'read' scope and add it to your Space Secrets as 'HF_TOKEN'.\n"
-                " - Restart/rebuild the Space after adding the secret."
+                " - Make sure NeMo is properly installed: pip install nemo_toolkit[asr]\n"
+                " - Check that you have internet access to download from Hugging Face.\n"
+                " - The model will be downloaded automatically on first use (~700MB)."
            )
 
-    if _diarization_pipeline is None:
+    if _diarization_model is None:
         raise gr.Error(
-            "Diarization pipeline returned None after loading. Check that the model files were downloaded "
-            "and that your token has access to the repository."
+            "Diarization model returned None after loading."
        )
-
-    # Move pipeline to CPU
-    _diarization_pipeline.to(torch.device("cpu"))
-    return _diarization_pipeline
+
+    return _diarization_model
 
 
 # ---------------------------------------------------------------------------
@@ -220,19 +224,29 @@ def transcribe(
         }
 
     if enable_diarization:
-        pipeline = _load_diarization_pipeline()
-        diarization_result = pipeline(audio_path)
-
+        diar_model = _load_diarization_model()
+
+        # Run diarization using NeMo Sortformer
+        # Returns list of tuples: (start_time, end_time, speaker_id)
+        try:
+            predicted_segments = diar_model.diarize(audio=audio_path, batch_size=1)
+        except Exception as e:
+            raise gr.Error(f"Diarization failed: {e}")
+
+        # Convert NeMo output format to our speaker_turns format
+        # NeMo returns: [[start_seconds, end_seconds, speaker_index], ...]
         speaker_turns: List[Dict[str, float]] = []
-        for turn, _, speaker in diarization_result.itertracks(yield_label=True):
+        for segment in predicted_segments:
+            start_time, end_time, speaker_idx = segment
             speaker_turns.append(
                 {
-                    "start": turn.start,
-                    "end": turn.end,
-                    "speaker": speaker,
+                    "start": float(start_time),
+                    "end": float(end_time),
+                    "speaker": f"SPEAKER_{int(speaker_idx)}",
                 }
            )
 
+        # Assign speakers to transcript segments based on temporal overlap
         for segment in response["segments"]:
             mid = (segment["start"] + segment["end"]) / 2
             segment["speaker"] = None
@@ -251,11 +265,11 @@
 # Gradio UI definition
 # ---------------------------------------------------------------------------
 def build_interface() -> gr.Blocks:
-    with gr.Blocks(title="VTT with Diarization (faster-whisper + pyannote)") as demo:
+    with gr.Blocks(title="VTT with Diarization (faster-whisper + NVIDIA NeMo)") as demo:
         gr.Markdown(
             """
             # Voice-to-Text with Optional Diarization
-            Powered by **faster-whisper** and **pyannote.audio** (runs locally on this Space).
+            Powered by **faster-whisper** and **NVIDIA NeMo Sortformer** (runs locally on this Space).
             """
        )
 
@@ -273,7 +287,7 @@ def build_interface() -> gr.Blocks:
         diarization_toggle = gr.Checkbox(
             label="Enable Speaker Diarization",
             value=False,
-            info="Requires HF token with access to pyannote/speaker-diarization-3.1.",
+            info="Uses NVIDIA Sortformer model (max 4 speakers, downloads ~700MB on first use).",
        )
         beam_slider = gr.Slider(
             label="Beam Size",
@@ -310,11 +324,12 @@ def build_interface() -> gr.Blocks:
         gr.Markdown(
             f"""
             ## Tips
-            - **Current model**: `{WHISPER_MODEL_SIZE}` (first run downloads model automatically)
+            - **Whisper model**: `{WHISPER_MODEL_SIZE}` (first run downloads model automatically)
+            - **Diarization model**: NVIDIA `{DIARIZATION_MODEL_NAME}` (streaming, max 4 speakers)
             - Diarization downloads ~700MB on first use (cached afterward)
-            - Store your Hugging Face token in Space Secrets as **HF_TOKEN** (required for diarization)
             - Change `WHISPER_MODEL_SIZE` in Space Variables to `medium` or `large-v3` for higher accuracy
             - Optimized for Arabic customer service calls with specialized initial prompt
+            - Streaming configuration: High latency preset (10s latency, better accuracy)
             """
        )
 
requirements.txt CHANGED
@@ -1,6 +1,5 @@
 gradio>=4.42.0
 faster-whisper>=1.0.0
-pyannote.audio>=3.1.1
 huggingface-hub>=0.23.0
 torch==2.3.0
 torchaudio==2.3.0
@@ -8,3 +7,9 @@ soundfile>=0.12.1
 numpy>=1.26.0
 ffmpeg-python>=0.2.0
 python-dotenv>=1.0.1
+
+# NVIDIA NeMo for Sortformer diarization
+# Install with ASR components for speaker diarization
+Cython
+packaging
+nemo_toolkit[asr]