Merge branch 'main' of hf.co:nvidia/diar_sortformer_4spk-v1 into main
---
license: cc-by-nc-4.0
library_name: nemo
datasets:
- fisher_english
metrics:
- der
pipeline_tag: audio-classification
---

A newer streaming Sortformer model is available at [huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2).


# Sortformer Diarizer 4spk v1
| [](#model-architecture)
<!-- | [](#datasets) -->

[Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you have installed Cython and the latest PyTorch version.

```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# load the model directly from the Hugging Face model card (you need a Hugging Face token)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")

# or, if you have downloaded the model to "/path/to/diar_sortformer_4spk-v1.nemo", load it from the file
diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location='cuda', strict=False)

# switch to inference mode
diar_model.eval()
```

### Input Format

Input to Sortformer can be an individual audio file:
```python
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a jsonl manifest file:
```python
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```yaml
# Example of a line in `multispeaker_manifest.json`
{
  "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
  "offset": 0,  # offset (start) time of the input audio
  "duration": 600,  # duration of the audio; can be set to `null` if using the NeMo main branch
}
{
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
  "offset": 900,
  "duration": 580,
}
```
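If you prefer to build such a manifest programmatically, a minimal sketch (the file paths, offsets, and durations below are placeholders, not real data):

```python
import json

# hypothetical entries; replace the paths, offsets, and durations with your own
recordings = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# a jsonl manifest is one JSON dictionary per line
with open("multispeaker_manifest.json", "w") as f:
    for entry in recordings:
        f.write(json.dumps(entry) + "\n")
```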

### Getting Diarization Results

To perform speaker diarization and get a list of speaker-marked speech segments in the format `begin_seconds, end_seconds, speaker_index`, simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities as well, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
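If you want the segments as numeric tuples, the sketch below assumes each predicted segment is a string of the form `begin_seconds end_seconds speaker_index`, as described above; the exact return format may vary between NeMo versions, so treat `parse_segments` as an illustrative helper, not part of the NeMo API:

```python
# hypothetical helper: convert segment strings such as "12.00 15.32 speaker_1"
# into (begin_seconds, end_seconds, speaker_index) tuples
def parse_segments(segments):
    parsed = []
    for seg in segments:
        begin, end, speaker = seg.split()
        # "speaker_1" -> 1; assumes the "speaker_<index>" naming shown above
        parsed.append((float(begin), float(end), int(speaker.split("_")[-1])))
    return parsed

print(parse_segments(["0.08 5.44 speaker_0", "5.60 9.12 speaker_1"]))
```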

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.

### Output

The output of the model is a T x S matrix, where:
- S is the maximum number of speakers (in this model, S = 4).
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
- Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.

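To illustrate how this matrix maps back to the time axis, here is a sketch that binarizes the per-frame probabilities and merges consecutive active frames into segments. The 0.5 threshold is an assumption for illustration only; NeMo's own post-processing is more elaborate:

```python
FRAME_LEN = 0.08  # seconds per frame, as described above

def probs_to_segments(probs, threshold=0.5):
    """Convert a T x S matrix of speaker activity probabilities into
    (begin_seconds, end_seconds, speaker_index) segments."""
    num_frames = len(probs)
    num_speakers = len(probs[0]) if num_frames else 0
    segments = []
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            active = probs[t][spk] > threshold
            if active and start is None:
                start = t  # segment opens at this frame
            elif not active and start is not None:
                segments.append((start * FRAME_LEN, t * FRAME_LEN, spk))
                start = None
        if start is not None:  # segment still open at the end
            segments.append((start * FRAME_LEN, num_frames * FRAME_LEN, spk))
    return segments

# toy example: 4 frames, 2 speakers
toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.1, 0.9]]
print(probs_to_segments(toy))
```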
## Train and evaluate Sortformer diarizer using NeMo
Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

### Evaluation

To evaluate the Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    out_rttm_dir="/path/to/output_rttms"
```
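The resulting RTTM files follow the standard NIST RTTM convention: one `SPEAKER` line per segment, with onset and duration in seconds. A small illustrative reader, assuming only that convention (not any script-specific format):

```python
def read_rttm(lines):
    """Parse SPEAKER lines of an RTTM file into (begin, end, speaker) tuples.

    Standard RTTM fields: type, file_id, channel, onset, duration,
    <NA>, <NA>, speaker_name, <NA>, <NA>.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip non-segment lines
        onset, duration = float(fields[3]), float(fields[4])
        segments.append((onset, onset + duration, fields[7]))
    return segments

example = ["SPEAKER session1 1 0.08 5.36 <NA> <NA> speaker_0 <NA> <NA>"]
print(read_rttm(example))
```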

You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset:
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    bypass_postprocessing=False \
    postprocessing_yaml="/path/to/postprocessing_config.yaml" \
    out_rttm_dir="/path/to/output_rttms"
```

### Technical Limitations

## Licence

License to use this model is covered by the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.