Update README.md

README.md (CHANGED)

@@ -14,10 +14,14 @@ library_name: nemo
 tags:
 - Multilingual
 - MarbleNet
+- pytorch
+- speech
+- audio
+- VAD
 ---
 # Frame-VAD Multilingual MarbleNet v2.0
 
-## Description
+## Description:
 
 Frame-VAD Multilingual MarbleNet v2.0 is a convolutional neural network for voice activity detection (VAD) that serves as the first step for Speech Recognition and Speaker Diarization. It is a frame-based model that outputs a speech probability for each 20 millisecond frame of the input audio. The model has 91.5K parameters, making it lightweight and efficient for real-time applications. <br>
 To reduce false positive errors (cases where the model incorrectly detects speech when none is present), the model was trained with white noise and real-world noise perturbations, and the volume of the training audio was varied. Additionally, the training data includes non-speech audio samples to help the model distinguish speech from non-speech sounds such as coughing, laughter, and breathing. <br>
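
Since each output probability in the description above covers one 20 ms frame, turning the frame stream into speech segments is a simple threshold-and-merge pass. The sketch below is illustrative only: the `probs_to_segments` helper, the 0.5 threshold, and the toy input are assumptions rather than part of the model card, and NeMo ships its own, more complete postprocessing.

```python
import numpy as np

FRAME_SEC = 0.02  # the model emits one speech probability per 20 ms frame

def probs_to_segments(probs, threshold=0.5):
    """Merge above-threshold frames into (start_sec, end_sec) speech segments.

    Hypothetical helper for illustration; real pipelines usually add
    onset/offset hysteresis and minimum-duration filtering.
    """
    segments, start = [], None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i * FRAME_SEC                      # speech onset
        elif p < threshold and start is not None:
            segments.append((start, i * FRAME_SEC))    # speech offset
            start = None
    if start is not None:                              # audio ends while speaking
        segments.append((start, len(probs) * FRAME_SEC))
    return segments

# 0.3 s of noise followed by 0.2 s of speech -> one segment, ~0.3 s to ~0.5 s
print(probs_to_segments(np.array([0.1] * 15 + [0.9] * 10)))
```
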
@@ -25,14 +29,17 @@ The model supports multiple languages, including Chinese, German, Russian, Engli
 
 This model is ready for commercial use. <br>
 
-### License/Terms of Use
+### License/Terms of Use:
 GOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license).
 
 
-Deployment Geography:
+### Deployment Geography:
 
-
+Global <br>
 
+### Use Case:
+
+Developers, speech processing engineers, and AI researchers will use it as the first step for other speech processing models. <br>
 
 ## References:
 [1] [Jia, Fei, Somshubra Majumdar, and Boris Ginsburg. "MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection." ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021.](https://arxiv.org/abs/2010.13886) <br>
@@ -62,7 +69,7 @@ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated sys
 
 
 
-## How to Use the Model
+## How to Use the Model:
 The model is available in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference.
 
 ### Automatically load the model
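
As a sketch of what the "Automatically load the model" step typically looks like for a NeMo checkpoint: the class and the model-name string below are assumptions inferred from this card, so check the model page for the exact published snippet.

```python
# Illustrative only: loading the checkpoint through the NeMo toolkit [2].
# The identifier passed to from_pretrained() is inferred from this card's
# title; verify the published name on the model page before relying on it.
import nemo.collections.asr as nemo_asr

vad_model = nemo_asr.models.EncDecFrameClassificationModel.from_pretrained(
    model_name="nvidia/frame_vad_multilingual_marblenet_v2.0"
)
vad_model.eval()  # inference mode; outputs are per-frame speech probabilities
```
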
@@ -256,4 +263,4 @@ Model Application(s): | Automatic Speech Recognit
 List types of specific high-risk AI systems, if any, in which the model can be integrated: Select from the following: [Biometrics] OR [Critical infrastructure] OR [Machinery and Robotics] OR [Medical Devices] OR [Vehicles] OR [Aviation] OR [Education and vocational training] OR [Employment and Workers Management] <br>
 Describe the life-critical impact (if present): | Not Applicable
 Use Case Restrictions: | Abide by [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license)
-Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to.
+Model and dataset restrictions: | The principle of least privilege (PoLP) is applied to limit access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to.