Taejin committed on
Commit f1e1ad7 · 2 Parent(s): 1dd84ea a036d24

Merge branch 'main' of hf.co:nvidia/diar_sortformer_4spk-v1 into main

Files changed (1)
  1. README.md +57 -24
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: cc-by-nc-sa-4.0
  library_name: nemo
  datasets:
  - fisher_english
@@ -95,6 +95,7 @@ metrics:
  - der
  pipeline_tag: audio-classification
  ---


  # Sortformer Diarizer 4spk v1
@@ -109,6 +110,7 @@ img {
  | [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
  <!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

  [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

  <div align="center">
@@ -130,7 +132,10 @@ and two feedforward layers with 4 sigmoid outputs for each frame input at the to
  ## NVIDIA NeMo

  To train, fine-tune or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you've installed Cython and the latest PyTorch version.
  ```
  pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
  ```

@@ -143,44 +148,54 @@ The model is available for use in the NeMo Framework[5], and can be used as a pr
  ```python
  from nemo.collections.asr.models import SortformerEncLabelModel

- # load model
- diar_model = SortformerEncLabelModel.restore_from(restore_path="diar_sortformer_4spk-v1", map_location=torch.device('cuda'), strict=False)
  ```

  ### Input Format
- Input to Sortformer can be either a list of paths to audio files or a jsonl manifest file.
-
  ```python
- pred_outputs = diar_model.diarize(audio=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"], batch_size=1)
  ```
-
- Individual audio file can be fed into Sortformer model as follows:
  ```python
- pred_output1 = diar_model.diarize(audio="/path/to/multispeaker_audio1.wav", batch_size=1)
  ```
-
-
- To use Sortformer for performing diarization on a multi-speaker audio recording, specify the input as jsonl manifest file, where each line in the file is a dictionary containing the following fields:
-
  ```yaml
  # Example of a line in `multispeaker_manifest.json`
  {
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
- "offset": 0 # offset (start) time of the input audio
  "duration": 600, # duration of the audio, can be set to `null` if using NeMo main branch
  }
  {
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
- "offset": 0,
  "duration": 580,
  }
  ```

- and then use:
- ```python
- pred_outputs = diar_model.diarize(audio="/path/to/multispeaker_manifest.json", batch_size=1)
  ```
-

  ### Input

@@ -190,10 +205,10 @@ This model accepts single-channel (mono) audio sampled at 16,000 Hz.

  ### Output

- The output of the model is an T x S matrix, where:
  - S is the maximum number of speakers (in this model, S = 4).
  - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
- Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.


  ## Train and evaluate Sortformer diarizer using NeMo
@@ -202,9 +217,27 @@ Each element of the T x S matrix represents the speaker activity probability in
  Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4.
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

- ### Inference

- Sortformer diarizer models can be performed with post-processing algorithms using inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). If you provide the post-processing YAML configs in [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.

  ### Technical Limitations

@@ -301,4 +334,4 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).

  ## License

- License to use this model is covered by the [CC-BY-NC-SA-4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-SA-4.0 license.
  ---
+ license: cc-by-nc-4.0
  library_name: nemo
  datasets:
  - fisher_english

  - der
  pipeline_tag: audio-classification
  ---
+ A newer streaming Sortformer is available at [huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2).


  # Sortformer Diarizer 4spk v1

  | [![Model size](https://img.shields.io/badge/Params-123M-lightgrey#model-badge)](#model-architecture)
  <!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

+
  [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

  <div align="center">

  ## NVIDIA NeMo

  To train, fine-tune or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you've installed Cython and the latest PyTorch version.
+
  ```
+ apt-get update && apt-get install -y libsndfile1 ffmpeg
+ pip install Cython packaging
  pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
  ```
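As a quick sanity check that the installation succeeded (an optional snippet, not part of the original instructions), verify that the toolkit and the model class import cleanly:

```python
# The import succeeds only if nemo_toolkit[asr] and its dependencies installed correctly.
import nemo
from nemo.collections.asr.models import SortformerEncLabelModel

print("NeMo version:", nemo.__version__)
```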
  ```python
  from nemo.collections.asr.models import SortformerEncLabelModel

+ # load model from Hugging Face model card directly (you need a Hugging Face token)
+ diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
+
+ # if you have downloaded the model to "/path/to/diar_sortformer_4spk-v1.nemo", load it from the file instead
+ diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location='cuda', strict=False)
+
+ # switch to inference mode
+ diar_model.eval()
  ```
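Depending on how the model was loaded, you may want to place it on a specific device explicitly; a small optional sketch using standard PyTorch calls:

```python
import torch

# Move the model to a GPU when one is available; CPU inference works but is slower.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
diar_model = diar_model.to(device)
```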

  ### Input Format
+ Input to Sortformer can be an individual audio file:
  ```python
+ audio_input="/path/to/multispeaker_audio1.wav"
  ```
+ or a list of paths to audio files:
  ```python
+ audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
  ```
+ or a jsonl manifest file:
+ ```python
+ audio_input="/path/to/multispeaker_manifest.json"
+ ```
+ where each line is a dictionary containing the following fields:
  ```yaml
  # Example of a line in `multispeaker_manifest.json`
  {
  "audio_filepath": "/path/to/multispeaker_audio1.wav", # path to the input audio file
+ "offset": 0, # offset (start) time of the input audio
  "duration": 600, # duration of the audio, can be set to `null` if using NeMo main branch
  }
  {
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
+ "offset": 900,
  "duration": 580,
  }
  ```
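A manifest like the one above can be generated programmatically; a minimal sketch using Python's standard `json` module (the paths, offsets, and durations are placeholders for your own recordings):

```python
import json

# Placeholder entries; substitute your own audio paths, offsets, and durations.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# A jsonl manifest holds one JSON object per line.
with open("multispeaker_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```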

+ ### Getting Diarization Results
+ To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
+ ```python
+ predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
+ ```
+ To obtain tensors of speaker activity probabilities, use:
+ ```python
+ predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
  ```
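A small usage sketch for inspecting the results; it assumes each returned segment is a 'begin_seconds end_seconds speaker_index' string as described above (adjust the parsing if your NeMo version returns a different type):

```python
# predicted_segments holds one list of segments per input audio file.
for file_segments in predicted_segments:
    for segment in file_segments:
        # Assumed format: "begin_seconds end_seconds speaker_index"
        begin, end, speaker = segment.split()
        print(f"{float(begin):7.2f}s - {float(end):7.2f}s  {speaker}")
```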
 
  ### Input


  ### Output

+ The output of the model is a T x S matrix, where:
  - S is the maximum number of speakers (in this model, S = 4).
  - T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
+ - Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
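To make the frame-to-time mapping concrete, here is an illustrative helper (not a NeMo API; `probs` stands for one T x S activity matrix, e.g. obtained via `include_tensor_outputs=True`) that thresholds the probabilities into speaker segments:

```python
import torch

FRAME_SECONDS = 0.08  # each frame covers 0.08 seconds of audio

def probs_to_segments(probs: torch.Tensor, threshold: float = 0.5):
    """Convert a T x S speaker-activity matrix into (begin_seconds, end_seconds, speaker_index) tuples."""
    segments = []
    num_frames, num_speakers = probs.shape
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            if probs[t, spk] > threshold and start is None:
                start = t  # speaker became active
            elif probs[t, spk] <= threshold and start is not None:
                segments.append((start * FRAME_SECONDS, t * FRAME_SECONDS, spk))
                start = None
        if start is not None:  # speaker active through the last frame
            segments.append((start * FRAME_SECONDS, num_frames * FRAME_SECONDS, spk))
    return sorted(segments)
```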


  ## Train and evaluate Sortformer diarizer using NeMo

  Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second training samples and a batch size of 4.
  The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).
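An illustrative training invocation (a sketch only: the Hydra override keys below follow common NeMo conventions and are assumptions; consult the base config above for the exact parameter names):

```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py \
    trainer.num_nodes=8 \
    trainer.devices=8 \
    model.train_ds.manifest_filepath="/path/to/train_manifest.json" \
    model.validation_ds.manifest_filepath="/path/to/validation_manifest.json"
```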
+ ### Evaluation
+
+ To evaluate the Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
+ ```bash
+ python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
+ model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
+ manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
+ collar=COLLAR \
+ out_rttm_dir="/path/to/output_rttms"
+ ```

+ You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset:
+ ```bash
+ python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
+ model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
+ manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
+ collar=COLLAR \
+ bypass_postprocessing=False \
+ postprocessing_yaml="/path/to/postprocessing_config.yaml" \
+ out_rttm_dir="/path/to/output_rttms"
+ ```
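In these commands, `COLLAR` is a placeholder for the scoring collar in seconds, i.e., the no-score margin applied around reference segment boundaries when computing DER; 0.25 is a common choice for telephone-style benchmarks, while 0.0 scores everything.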
  ### Technical Limitations


  ## License

+ License to use this model is covered by the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.