# Chilean Spanish ASR Leaderboard Configuration
# Your leaderboard name
TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on Chilean Spanish speech data from the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
\nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas.
🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""
# About section content
ABOUT_TEXT = """
## About This Leaderboard
This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.
### What is the Open ASR Leaderboard?
The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.
### How This Repository Differs
This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:
| Aspect | Original Open ASR Leaderboard | This Repository |
|--------|-------------------------------|-----------------|
| **Target Language** | English (primarily) | Chilean Spanish |
| **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
| **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
| **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
| **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |
---
## Models Evaluated
This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:
| Model | Type | Framework | Parameters | Notes |
|-------|------|-----------|------------|-------|
| **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
| **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
| **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish |
| **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
| **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish |
| **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
| **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
| **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 5.6B | Microsoft's multimodal LLM with audio |
| **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
| **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
| **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |
## Dataset
This evaluation uses a comprehensive Chilean Spanish test dataset that combines three different sources of Chilean Spanish speech data:
### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)
This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles:
1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.
2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)):
- 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
- Recorded by 31 volunteers.
- Intended for speech technologies.
- Restructured from the original OpenSLR archives for easier streaming.
3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
- Audio fragments from comedy routines at the Festival de Viña del Mar.
- Represents spontaneous, colloquial Chilean Spanish.
- Captures humor and cultural expressions specific to Chile.
**Combined Dataset Properties:**
- **Language**: Spanish (Chilean variant)
- **Split**: `test`
- **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
- **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish
## Metrics
Following the Open ASR Leaderboard standard, we report:
- **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
- **RTFx (Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
Take the following example:
| Reference:  | el | gato | se | sentó | en | la | alfombra |
|-------------|----|------|----|-------|----|----|----------|
| Prediction: | el | gato | se | sentó | en | la |          |
| Label:      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D |
Here, we have:
* 0 substitutions
* 0 insertions
* 1 deletion ("alfombra" is missing)
This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 7:
```
WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
```
This gives a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing are removed from both references and predictions.
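The calculation above can be sketched in a few lines of Python. This is a minimal illustration of the metric, not the leaderboard's actual evaluation code (which uses the standard Open ASR Leaderboard tooling):

```python
# Minimal WER sketch: Levenshtein distance over word sequences,
# divided by the number of reference words.
def wer(reference: str, prediction: str) -> float:
    ref, hyp = reference.split(), prediction.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(round(wer("el gato se sentó en la alfombra",
                "el gato se sentó en la"), 3))  # 0.143
```

Running it on the example above reproduces the 14% figure: one deletion over seven reference words.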
### Inverse Real Time Factor (RTFx)
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It is defined as:
```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
```
Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it processes it in half the time.
Thus, **a higher RTFx value indicates lower latency**.
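The definition translates directly to code. The timing numbers below are purely illustrative, not measurements from this leaderboard:

```python
# RTFx: seconds of audio transcribed per second of compute.
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    return audio_seconds / compute_seconds

# 60 s of audio transcribed in 30 s of compute → the system
# runs twice as fast as real time.
print(rtfx(60.0, 30.0))  # 2.0
```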
## Text Normalization for Spanish
This repository uses a **multilingual normalizer** configured to preserve Spanish characters:
```python
normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
```
**What it does:**
- ✅ Preserves: `á, é, í, ó, ú, ñ, ü, ¿, ¡`
- ✅ Removes: Brackets `[...]`, parentheses `(...)`, special symbols
- ✅ Normalizes: Whitespace, capitalization (converts to lowercase)
- ❌ Does NOT remove: Accents or Spanish-specific characters
**Example:**
```python
Input: "¿Cómo estás? [ruido] (suspiro)"
Output: "cómo estás"
```
This is critical for Spanish evaluation, as diacritics change word meaning:
- `esta` (this) vs. `está` (is)
- `si` (if) vs. `sí` (yes)
- `el` (the) vs. `él` (he)
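For illustration, the normalization behaviour shown in the example above can be approximated in plain Python. This is a rough sketch reproducing that example, not the actual `BasicMultilingualTextNormalizer` implementation:

```python
# Sketch of accent-preserving normalization: drop bracketed/parenthesised
# annotations, lowercase, strip punctuation, collapse whitespace,
# while keeping accented Spanish letters intact.
def normalize_es(text: str) -> str:
    out, depth = [], 0
    for ch in text:
        if ch in "[(":               # enter an annotation like [ruido]
            depth += 1
        elif ch in "])":             # leave the annotation
            depth = max(0, depth - 1)
        elif depth == 0:
            out.append(ch)
    text = "".join(out).lower()
    # keep letters/digits (including á, ñ, ü via isalnum) and spaces only
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # cómo estás
```

Note that `isalnum()` is Unicode-aware, so accented characters survive the punctuation-stripping step without any special-casing.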
## How to reproduce our results
The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation.
For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
author={Instituto de Data Science UDD},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
}
@misc{astroza2025chilean-dataset,
title={Chilean Spanish ASR Test Dataset},
author={Alonso Astroza},
year={2025},
howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
}
@misc{open-asr-leaderboard,
title={Open Automatic Speech Recognition Leaderboard},
author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
year={2023},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
}
"""