Update README.md

README.md (CHANGED)

@@ -14,10 +14,14 @@ library_name: nemo
 tags:
 - Multilingual
 - MarbleNet
+- pytorch
+- speech
+- audio
+- VAD
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
-## Description
+## Description:
 
 Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations, and the volume of the training audio was varied. Additionally, the training data includes non-speech audio samples to help the model distinguish speech from non-speech sounds such as coughing, laughter, and breathing. <br>
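
Since each output probability in the description above covers one 20 ms frame, turning the frame stream into speech segments is a simple threshold-and-merge pass. The sketch below is illustrative only: the `probs_to_segments` helper, the 0.5 threshold, and the toy input are assumptions rather than part of the model card, and NeMo ships its own, more complete postprocessing.

```python
import numpy as np

FRAME_SEC = 0.02  # the model emits one speech probability per 20 ms frame

def probs_to_segments(probs, threshold=0.5):
    """Merge above-threshold frames into (start_sec, end_sec) speech segments.

    Hypothetical helper for illustration; real pipelines usually add
    onset/offset hysteresis and minimum-duration filtering.
    """
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * FRAME_SEC                      # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * FRAME_SEC))    # speech offset
            start = None
    if start is not None:                              # audio ends while speaking
        segments.append((start, len(probs) * FRAME_SEC))
    return segments

# 0.3 s of noise followed by 0.2 s of speech -> one segment, ~0.3 s to ~0.5 s
print(probs_to_segments(np.array([0.1] * 15 + [0.9] * 10)))
```
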
@@ -25,14 +29,17 @@ The model supports multiple languages, including Chinese, German, Russian, Engli
 
 This model is ready for commercial use. <br>
 
-### License/Terms of Use
+### License/Terms of Use:
 GOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
 
 
-Deployment Geography:
+### Deployment Geography:
 
-
+Global <br>
 
+### Use Case:
+
+Developers, speech processing engineers, and AI researchers will use it as the first step for other speech processing models. <br>
 
 ## References:
 [1] [Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.](https://arxiv.org/abs/2010.13886) <br>
@@ -62,7 +69,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 
 
 
-## How to Use the Model
+## How to Use the Model:
 The model is available in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference.
 
 ### Automatically load the model
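
As a sketch of what the "Automatically load the model" step typically looks like for a NeMo checkpoint: the class and the model-name string below are assumptions inferred from this card, so check the model page for the exact published snippet.

```python
# Illustrative only: loading the checkpoint through the NeMo toolkit [2].
# The identifier passed to from_pretrained() is inferred from this card's
# title; verify the published name on the model page before relying on it.
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(
    model_name="nvidia/frame_vad_multilingual_marblenet_v2.0"
)
vad_model.eval()  # inference mode; outputs are per-frame speech probabilities
```
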
@@ -256,4 +263,4 @@ Model Application(s): | Automatic Speech Recognit
 List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] <br>
 Describe the life-critical impact (if present): | Not Applicable
 Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
-Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to.
+Model and dataset restrictions: | The principle of least privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.