Merge branch 'main' of hf.co:nvidia/diar_sortformer_4spk-v1 into main
---
license: cc-by-nc-4.0
library_name: nemo
datasets:
- fisher_english
metrics:
- der
pipeline_tag: audio-classification
---

A newer streaming Sortformer model is available at [huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2).


# Sortformer Diarizer 4spk v1
| [](#model-architecture)
<!-- | [](#datasets) -->

[Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with unconventional objectives compared to existing end-to-end diarization models.

<div align="center">
## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[5]. We recommend you install it after you have installed Cython and the latest PyTorch version.

```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

```python
from nemo.collections.asr.models import SortformerEncLabelModel

# load the model directly from the Hugging Face model card (you need a Hugging Face token)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")

# or, if you have downloaded the model to "/path/to/diar_sortformer_4spk-v1.nemo", load it from the file
diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_sortformer_4spk-v1.nemo", map_location='cuda', strict=False)

# switch to inference mode
diar_model.eval()
```

### Input Format

Input to Sortformer can be an individual audio file:
```python
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a jsonl manifest file:
```python
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```yaml
# Example of a line in `multispeaker_manifest.json`
{
  "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
  "offset": 0,  # offset (start) time of the input audio
  "duration": 600,  # duration of the audio; can be set to `null` if using the NeMo main branch
}
{
  "audio_filepath": "/path/to/multispeaker_audio2.wav",
  "offset": 900,
  "duration": 580,
}
```
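If you prefer to build such a manifest programmatically, a minimal sketch (the file paths, offsets, and durations below are placeholders, not real data):

```python
import json

# hypothetical entries; replace the paths, offsets, and durations with your own
recordings = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# a jsonl manifest is one JSON dictionary per line
with open("multispeaker_manifest.json", "w") as f:
    for entry in recordings:
        f.write(json.dumps(entry) + "\n")
```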

### Getting Diarization Results

To perform speaker diarization and get a list of speaker-marked speech segments in the format `begin_seconds, end_seconds, speaker_index`, simply use:
```python
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities as well, use:
```python
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
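If you want the segments as numeric tuples, the sketch below assumes each predicted segment is a string of the form `begin_seconds end_seconds speaker_index`, as described above; the exact return format may vary between NeMo versions, so treat `parse_segments` as an illustrative helper, not part of the NeMo API:

```python
# hypothetical helper: convert segment strings such as "12.00 15.32 speaker_1"
# into (begin_seconds, end_seconds, speaker_index) tuples
def parse_segments(segments):
    parsed = []
    for seg in segments:
        begin, end, speaker = seg.split()
        # "speaker_1" -> 1; assumes the "speaker_<index>" naming shown above
        parsed.append((float(begin), float(end), int(speaker.split("_")[-1])))
    return parsed

print(parse_segments(["0.08 5.44 speaker_0", "5.60 9.12 speaker_1"]))
```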

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.

### Output

The output of the model is a T x S matrix, where:
- S is the maximum number of speakers (in this model, S = 4).
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
- Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.

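To illustrate how this matrix maps back to the time axis, here is a sketch that binarizes the per-frame probabilities and merges consecutive active frames into segments. The 0.5 threshold is an assumption for illustration only; NeMo's own post-processing is more elaborate:

```python
FRAME_LEN = 0.08  # seconds per frame, as described above

def probs_to_segments(probs, threshold=0.5):
    """Convert a T x S matrix of speaker activity probabilities into
    (begin_seconds, end_seconds, speaker_index) segments."""
    num_frames = len(probs)
    num_speakers = len(probs[0]) if num_frames else 0
    segments = []
    for spk in range(num_speakers):
        start = None
        for t in range(num_frames):
            active = probs[t][spk] > threshold
            if active and start is None:
                start = t  # segment opens at this frame
            elif not active and start is not None:
                segments.append((start * FRAME_LEN, t * FRAME_LEN, spk))
                start = None
        if start is not None:  # segment still open at the end
            segments.append((start * FRAME_LEN, num_frames * FRAME_LEN, spk))
    return segments

# toy example: 4 frames, 2 speakers
toy = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.1, 0.9]]
print(probs_to_segments(toy))
```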
## Train and evaluate Sortformer diarizer using NeMo
Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs. We use 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

### Evaluation

To evaluate the Sortformer diarizer and save diarization results in RTTM format, use the inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py):
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    out_rttm_dir="/path/to/output_rttms"
```
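The resulting RTTM files follow the standard NIST RTTM convention: one `SPEAKER` line per segment, with onset and duration in seconds. A small illustrative reader, assuming only that convention (not any script-specific format):

```python
def read_rttm(lines):
    """Parse SPEAKER lines of an RTTM file into (begin, end, speaker) tuples.

    Standard RTTM fields: type, file_id, channel, onset, duration,
    <NA>, <NA>, speaker_name, <NA>, <NA>.
    """
    segments = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] != "SPEAKER":
            continue  # skip non-segment lines
        onset, duration = float(fields[3]), float(fields[4])
        segments.append((onset, onset + duration, fields[7]))
    return segments

example = ["SPEAKER session1 1 0.08 5.36 <NA> <NA> speaker_0 <NA> <NA>"]
print(read_rttm(example))
```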

You can provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset:
```bash
python ${NEMO_GIT_FOLDER}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
    model_path="/path/to/diar_sortformer_4spk-v1.nemo" \
    manifest_filepath="/path/to/multispeaker_manifest_with_reference_rttms.json" \
    collar=COLLAR \
    bypass_postprocessing=False \
    postprocessing_yaml="/path/to/postprocessing_config.yaml" \
    out_rttm_dir="/path/to/output_rttms"
```

### Technical Limitations

## Licence

License to use this model is covered by the [CC-BY-NC-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode) license. By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-NC-4.0 license.