# Chilean Spanish ASR Leaderboard Configuration

# Your leaderboard name
TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
    on Chilean Spanish speech data from the Hugging Face Hub. \
    \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
    \nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas.
    
🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""

# About section content
ABOUT_TEXT = """
## About This Leaderboard

This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.

### What is the Open ASR Leaderboard?

The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.

### How This Repository Differs

This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:

| Aspect | Original Open ASR Leaderboard | This Repository |
|--------|-------------------------------|-----------------|
| **Target Language** | English (primarily) | Chilean Spanish |
| **Dataset** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
| **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
| **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
| **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |

---

## Models Evaluated

This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:

| Model | Type | Framework | Parameters | Notes |
|-------|------|-----------|------------|-------|
| **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
| **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
| **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish |
| **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
| **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish |
| **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
| **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
| **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 5.6B | Microsoft's multimodal LLM with audio |
| **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
| **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
| **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |

## Dataset

This evaluation uses a single combined test set that draws on three sources of Chilean Spanish speech data:

### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)

This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles:

1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.

2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)): 
   - 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
   - Recorded by 31 volunteers.
   - Intended for speech technologies.
   - Restructured from the original OpenSLR archives for easier streaming.

3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
   - Audio fragments from comedy routines at the Festival de Viña del Mar.
   - Represents spontaneous, colloquial Chilean Spanish.
   - Captures humor and cultural expressions specific to Chile.

**Combined Dataset Properties:**
- **Language**: Spanish (Chilean variant)
- **Split**: `test`
- **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
- **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish

## Metrics

Following the Open ASR Leaderboard standard, we report:

- **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
- **RTFx (Inverse Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)

### Word Error Rate (WER)
Word Error Rate measures the **accuracy** of automatic speech recognition systems. It counts the word-level errors (substitutions, insertions, and deletions) in the system's output relative to the number of words in the reference (correct) transcript. **A lower WER value indicates higher accuracy**.

Take the following example:
| Reference:  | el | gato | se | sentó | en | la | alfombra |
|-------------|----|------|----|-------|----|----|----------|
| Prediction: | el | gato | se | sentó | en | la |          |
| Label:      | ✅  | ✅   | ✅  | ✅    | ✅  | ✅  | D        |

Here, we have:
* 0 substitutions
* 0 insertions
* 1 deletion ("alfombra" is missing)

This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
reference (N), which for this example is 7:

```
WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
```

This gives a WER of 0.14, or 14%. For a fair comparison, we compute the **zero-shot** (i.e. pre-trained models only) *normalised WER* for all model checkpoints, meaning punctuation and casing are removed from both the references and the predictions.
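
The calculation above can be sketched as a small word-level edit-distance function (a minimal illustration, not the leaderboard's actual evaluation code; the `wer` helper below is ours):

```python
def wer(reference: str, prediction: str) -> float:
    """Word error rate: (S + I + D) / N, via word-level edit distance."""
    ref, hyp = reference.split(), prediction.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j predicted words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

print(wer("el gato se sentó en la alfombra",
          "el gato se sentó en la"))  # 1 deletion / 7 words ≈ 0.143
```

In practice, libraries such as `jiwer` provide the same computation.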

### Inverse Real Time Factor (RTFx)
Inverse Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a model to process a given amount of speech. It is defined as:

```
RTFx = (number of seconds of audio inferred) / (compute time in seconds)
``` 

Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it takes half that time. 
Thus, **a higher RTFx value indicates lower latency**.
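
As a quick sketch, the definition translates directly into code (the `rtfx` helper name is our own, not part of the evaluation code):

```python
def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    # Inverse real-time factor: seconds of audio transcribed per
    # second of compute. Higher means faster than real time.
    return audio_seconds / compute_seconds

print(rtfx(60.0, 30.0))  # 2.0 -> twice as fast as real time
```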

## Text Normalization for Spanish

This repository uses a **multilingual normalizer** configured to preserve Spanish characters:

```python
normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
```

**What it does:**
- ✅ Preserves: `á, é, í, ó, ú, ñ, ü`
- ✅ Removes: brackets `[...]`, parentheses `(...)`, punctuation, and special symbols (including `¿` and `¡`, as in the example below)
- ✅ Normalizes: whitespace and casing (converts to lowercase)
- ❌ Does NOT remove: accents or other Spanish-specific characters

**Example:**
```python
Input:  "¿Cómo estás? [ruido] (suspiro)"
Output: "cómo estás"
```

This is critical for Spanish evaluation, as diacritics change word meaning:
- `esta` (this) vs. `está` (is)
- `si` (if) vs. `sí` (yes)
- `el` (the) vs. `él` (he)
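
A minimal, regex-based sketch that approximates this behavior (an illustration only, not the actual `BasicMultilingualTextNormalizer` implementation):

```python
import re

def normalize_es(text: str) -> str:
    # Drop annotations in brackets/parentheses, e.g. [ruido], (suspiro)
    text = re.sub(r"\[[^\]]*\]", " ", text)
    text = re.sub(r"\([^)]*\)", " ", text)
    # Lowercase; keep Unicode word characters (á, é, í, ó, ú, ñ, ü survive)
    # and drop punctuation such as ¿ ? ¡ ! . ,
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # cómo estás
```

Note that Python's `\w` is Unicode-aware for `str` patterns, which is what keeps the accented characters intact.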

## How to reproduce our results
The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation. 
For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl 
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
  title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
  author={Instituto de Data Science UDD},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
}

@misc{astroza2025chilean-dataset,
  title={Chilean Spanish ASR Test Dataset},
  author={Alonso Astroza},
  year={2025},
  howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
}

@misc{open-asr-leaderboard,
  title={Open Automatic Speech Recognition Leaderboard},
  author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
  year={2023},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
}
"""