Update README.md

README.md (CHANGED)
````diff
@@ -11,13 +11,17 @@ metrics:
 - roc_auc
 pipeline_tag: voice-activity-detection
 library_name: nemo
+tags:
+- multilingual
+- marblenet
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
 ## Description
 
-Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. <br>
+Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors — cases where the model incorrectly detects speech when none is present — the model was trained with white noise and real-world noise perturbations. During training, the volume of the audio was also varied. Additionally, the training data includes non-speech audio samples to help the model distinguish between speech and non-speech sounds (such as coughing, laughter, and breathing). <br>
+The model supports multiple languages, including Chinese, German, Russian, English, Spanish, and French.
 
 This model is ready for commercial use. <br>
 
@@ -65,7 +69,7 @@ The model is available for use in the NeMo toolkit [2], and can be used as a pre
 
 ```python
 import nemo.collections.asr as nemo_asr
-
+vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(model_name="nvidia/frame_vad_multilingual_marblenet_v2.0")
 ```
 
 ### Perform VAD Inference
````
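The snippet added in the second hunk only loads the checkpoint. As a rough sketch of the post-processing that typically follows, the example below turns a sequence of per-frame speech probabilities (one per 20 ms frame, as the description states) into frame-index speech runs using a simple threshold. The 0.5 threshold and the `probs_to_segments` helper are illustrative assumptions for this sketch, not part of the model card or the NeMo API.

```python
# Illustrative post-processing for frame-level VAD output: threshold each
# per-frame speech probability and merge consecutive speech frames into
# half-open (start_frame, end_frame) runs. With 20 ms frames, multiply an
# index by 0.02 to get seconds. The 0.5 threshold and this helper are
# assumptions for illustration, not NeMo's API.

def probs_to_segments(probs, threshold=0.5):
    """Return half-open (start_frame, end_frame) runs where prob >= threshold."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i                    # a speech run begins at this frame
        elif p < threshold and start is not None:
            segments.append((start, i))  # run ended at the previous frame
            start = None
    if start is not None:                # run extends to the end of the audio
        segments.append((start, len(probs)))
    return segments

# Frames 1-2 and 4-5 are above threshold, giving two runs.
print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7, 0.6]))
# -> [(1, 3), (4, 6)]
```

Multiplying the frame indices by 0.02 converts the runs above to (0.02 s, 0.06 s) and (0.08 s, 0.12 s) speech segments.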