astroza committed
Commit 13a06cd · 1 Parent(s): 309f2b5

Update leaderboard configuration and results processing for Chilean Spanish ASR evaluation

Files changed (6)
  1. .gitignore +1 -0
  2. README.md +71 -31
  3. app.py +85 -173
  4. requirements.txt +2 -16
  5. results.csv +34 -0
  6. src/about.py +155 -49
.gitignore CHANGED
@@ -11,3 +11,4 @@ eval-results/
  eval-queue-bk/
  eval-results-bk/
  logs/
+ .github/copilot-instructions.md
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Open Asr Leaderboard Cl
  emoji: 🥇
  colorFrom: green
  colorTo: indigo
@@ -7,42 +7,82 @@ sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
- short_description: Duplicate this leaderboard to initialize your own!
- sdk_version: 5.43.1
  tags:
  - leaderboard
  ---

- # Start the configuration
-
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-
- Results files should have the following format and be stored as json files:
- ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
-     },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
- }
  ```

- Request files are created automatically by this tool.

- If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.

- # Code logic for more complex edits

- You'll find
- - the main table' columns names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
  ---
+ title: Open Asr Leaderboard CL
  emoji: 🥇
  colorFrom: green
  colorTo: indigo

  app_file: app.py
  pinned: true
  license: apache-2.0
+ short_description: Open ASR Leaderboard for Chilean Spanish
+ sdk_version: 4.44.0
  tags:
  - leaderboard
  ---

+ # Chilean Spanish ASR Leaderboard
+
+ > **Simple Gradio-based leaderboard displaying ASR evaluation results for Chilean Spanish models.**
+
+ ## Quick Start
+
+ This is a simplified version that displays results from a CSV file with two tabs:
+ - **🏅 Chilean Spanish ASR Leaderboard**: Shows model rankings based on WER and RTFx metrics
+ - **📝 About**: Detailed information about the evaluation methodology and datasets
+
+ ### Running the Leaderboard
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/aastroza/open_asr_leaderboard_cl.git
+ cd open_asr_leaderboard_cl
+
+ # Install dependencies
+ pip install gradio pandas
+
+ # Run the application
+ python app.py
  ```

+ The application will load results from `results.csv` and display them in a simple, clean interface.
+
+ ### Results Format

+ The `results.csv` file should contain the following columns:
+ - `model_id`: The model identifier (e.g., "openai/whisper-large-v3")
+ - `wer`: Word Error Rate (lower is better)
+ - `rtfx`: Inverse Real-Time Factor (higher is better)
+ - Additional metadata columns (dataset, num_samples, etc.)
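
As an illustrative aside (not part of the README being added in this commit), the schema above can be sanity-checked with pandas before the app consumes it; the column names come from the header of `results.csv`, everything else here is hypothetical:

```python
import pandas as pd

# Hypothetical check for the columns described above (illustrative only).
REQUIRED = {"model_id", "dataset", "wer", "rtfx", "total_audio_length", "total_time"}

df = pd.read_csv("results.csv")
missing = REQUIRED - set(df.columns)
if missing:
    raise ValueError(f"results.csv is missing expected columns: {sorted(missing)}")

print(df[["model_id", "dataset", "wer", "rtfx"]].head())
```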

+ ### Configuration

+ - **Title and Content**: Edit `src/about.py` to modify the title, introduction text, and about section
+ - **Styling**: Customize appearance in `src/display/css_html_js.py`
+ - **Data Processing**: Modify the `load_results()` function in `app.py` to change how results are aggregated and displayed
+
+ ## About the Evaluation
+
+ This leaderboard evaluates ASR models on Chilean Spanish using three datasets:
+ - **Common Voice** (Chilean Spanish subset)
+ - **Google Chilean Spanish**
+ - **Datarisas**
+
+ Models are ranked by average Word Error Rate (WER) across all datasets, with inverse Real-Time Factor (RTFx) as a secondary metric for inference speed.
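
For illustration only, this ranking can be reproduced from `results.csv` with pandas; the aggregation mirrors what `load_results()` in `app.py` does, and the variable names here are hypothetical:

```python
import pandas as pd

df = pd.read_csv("results.csv")

# Average WER per model across the three datasets; lower is better.
ranking = (
    df.groupby("model_id")["wer"]
      .mean()
      .round(2)
      .sort_values()
      .reset_index(name="average_wer")
)
print(ranking.to_string(index=False))
```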
+
+ ## Models Evaluated
+
+ - openai/whisper-large-v3
+ - openai/whisper-large-v3-turbo
+ - openai/whisper-small
+ - rcastrovexler/whisper-small-es-cl (Chilean Spanish fine-tuned)
+ - surus-lat/whisper-large-v3-turbo-latam (Latin American Spanish fine-tuned)
+ - nvidia/canary-1b-v2
+ - nvidia/parakeet-tdt-0.6b-v3
+ - microsoft/Phi-4-multimodal-instruct
+ - mistralai/Voxtral-Mini-3B-2507
+ - facebookresearch/omniASR_LLM_7B
+ - elevenlabs/scribe_v1
+
+ For detailed methodology and the complete evaluation framework, see the Modal-based evaluation code in the original repository.
+
+ ## Citation
+
+ ```bibtex
+ @misc{astroza2025chilean,
+   title={Chilean Spanish ASR Test Dataset},
+   author={Alonso Astroza},
+   year={2025},
+   howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
+ }
+ ```
app.py CHANGED
@@ -1,93 +1,90 @@
  import gradio as gr
- from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
  import pandas as pd
- from apscheduler.schedulers.background import BackgroundScheduler
- from huggingface_hub import snapshot_download

  from src.about import (
      CITATION_BUTTON_LABEL,
      CITATION_BUTTON_TEXT,
-     EVALUATION_QUEUE_TEXT,
      INTRODUCTION_TEXT,
-     LLM_BENCHMARKS_TEXT,
      TITLE,
  )
  from src.display.css_html_js import custom_css
- from src.display.utils import (
-     BENCHMARK_COLS,
-     COLS,
-     EVAL_COLS,
-     EVAL_TYPES,
-     AutoEvalColumn,
-     ModelType,
-     fields,
-     WeightType,
-     Precision
- )
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
- from src.submission.submit import add_new_eval
-
-
- def restart_space():
-     API.restart_space(repo_id=REPO_ID)
-
- ### Space initialisation
- try:
-     print(EVAL_REQUESTS_PATH)
-     snapshot_download(
-         repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-     )
- except Exception:
-     restart_space()
- try:
-     print(EVAL_RESULTS_PATH)
-     snapshot_download(
-         repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-     )
- except Exception:
-     restart_space()


- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
-
- (
-     finished_eval_queue_df,
-     running_eval_queue_df,
-     pending_eval_queue_df,
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
-
- def init_leaderboard(dataframe):
-     if dataframe is None or dataframe.empty:
-         raise ValueError("Leaderboard DataFrame is empty or None.")
-     return Leaderboard(
-         value=dataframe,
-         datatype=[c.type for c in fields(AutoEvalColumn)],
-         select_columns=SelectColumns(
-             default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-             cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-             label="Select Columns to Display:",
-         ),
-         search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-         hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-         filter_columns=[
-             ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-             ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-             ColumnFilter(
-                 AutoEvalColumn.params.name,
-                 type="slider",
-                 min=0.01,
-                 max=150,
-                 label="Select the number of parameters (B)",
-             ),
-             ColumnFilter(
-                 AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-             ),
-         ],
-         bool_checkboxgroup_label="Hide models",
-         interactive=False,
-     )
-

  demo = gr.Blocks(css=custom_css)
  with demo:
@@ -95,99 +92,17 @@ with demo:
      gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

      with gr.Tabs(elem_classes="tab-buttons") as tabs:
-         with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
-             leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-         with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
-             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-         with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
-             with gr.Column():
-                 with gr.Row():
-                     gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
-
-                 with gr.Column():
-                     with gr.Accordion(
-                         f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             finished_eval_table = gr.components.Dataframe(
-                                 value=finished_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-                     with gr.Accordion(
-                         f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             running_eval_table = gr.components.Dataframe(
-                                 value=running_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-
-                     with gr.Accordion(
-                         f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             pending_eval_table = gr.components.Dataframe(
-                                 value=pending_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-             with gr.Row():
-                 gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
-
-             with gr.Row():
-                 with gr.Column():
-                     model_name_textbox = gr.Textbox(label="Model name")
-                     revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                     model_type = gr.Dropdown(
-                         choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                         label="Model type",
-                         multiselect=False,
-                         value=None,
-                         interactive=True,
-                     )
-
-                 with gr.Column():
-                     precision = gr.Dropdown(
-                         choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                         label="Precision",
-                         multiselect=False,
-                         value="float16",
-                         interactive=True,
-                     )
-                     weight_type = gr.Dropdown(
-                         choices=[i.value.name for i in WeightType],
-                         label="Weights type",
-                         multiselect=False,
-                         value="Original",
-                         interactive=True,
-                     )
-                     base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-
-             submit_button = gr.Button("Submit Eval")
-             submission_result = gr.Markdown()
-             submit_button.click(
-                 add_new_eval,
-                 [
-                     model_name_textbox,
-                     base_model_name_textbox,
-                     revision_name_textbox,
-                     precision,
-                     weight_type,
-                     model_type,
-                 ],
-                 submission_result,
              )

      with gr.Row():
          with gr.Accordion("📙 Citation", open=False):
              citation_button = gr.Textbox(
@@ -198,7 +113,4 @@ with demo:
                  show_copy_button=True,
              )

- scheduler = BackgroundScheduler()
- scheduler.add_job(restart_space, "interval", seconds=1800)
- scheduler.start()
- demo.queue(default_concurrency_limit=40).launch()
  import gradio as gr
  import pandas as pd

  from src.about import (
      CITATION_BUTTON_LABEL,
      CITATION_BUTTON_TEXT,
      INTRODUCTION_TEXT,
+     ABOUT_TEXT,
      TITLE,
  )
  from src.display.css_html_js import custom_css


+ def load_results():
+     """Load and process results from CSV file"""
+     try:
+         df = pd.read_csv("results.csv")
+
+         # Get WER by dataset for each model
+         wer_by_dataset = df.pivot_table(
+             index='model_id',
+             columns='dataset',
+             values='wer',
+             aggfunc='mean'
+         ).round(2)
+
+         # Calculate overall average WER
+         wer_by_dataset['Average WER'] = df.groupby('model_id')['wer'].mean().round(2)
+
+         # Calculate RTFx properly: sum(total_audio_length) / sum(total_time)
+         audio_time_sums = df.groupby('model_id').agg({
+             'total_audio_length': 'sum',
+             'total_time': 'sum'
+         })
+         rtfx_calculated = (audio_time_sums['total_audio_length'] / audio_time_sums['total_time']).round(2)
+
+         # Combine all metrics
+         model_stats = wer_by_dataset.copy()
+         model_stats['RTFx'] = rtfx_calculated
+
+         # Set RTFx to N/A for ElevenLabs (API-based, not a local model)
+         elevenlabs_mask = model_stats.index.str.contains('elevenlabs', case=False, na=False)
+         model_stats.loc[elevenlabs_mask, 'RTFx'] = 'N/A'
+
+         # Sort by average WER (lower is better)
+         model_stats = model_stats.sort_values('Average WER')
+
+         # Reset index to make model_id a column
+         model_stats = model_stats.reset_index()
+
+         # Reorder columns: Model, Average WER first, then Datarisas, then other datasets, then RTFx
+         dataset_columns = [col for col in model_stats.columns if col not in ['model_id', 'Average WER', 'RTFx']]
+
+         # Put datarisas first, then other datasets
+         datarisas_col = [col for col in dataset_columns if 'datarisas' in col.lower()]
+         other_dataset_cols = [col for col in dataset_columns if 'datarisas' not in col.lower()]
+         ordered_dataset_cols = datarisas_col + other_dataset_cols
+
+         new_column_order = ['model_id', 'Average WER'] + ordered_dataset_cols + ['RTFx']
+         model_stats = model_stats[new_column_order]
+
+         # Convert model names to appropriate links
+         def create_model_link(model_name):
+             if 'elevenlabs' in model_name.lower():
+                 return f'<a href="https://elevenlabs.io/speech-to-text" target="_blank">{model_name}</a>'
+             else:
+                 return f'<a href="https://huggingface.co/{model_name}" target="_blank">{model_name}</a>'
+
+         model_stats['model_id'] = model_stats['model_id'].apply(create_model_link)
+
+         # Rename columns for better display
+         column_mapping = {'model_id': 'Model', 'Average WER': 'Average WER ⬇️', 'RTFx': 'RTFx ⬆️'}
+         # Add arrows to dataset WER columns
+         for col in dataset_columns:
+             column_mapping[col] = f'{col.replace("_", " ").title()} WER ⬇️'
+
+         model_stats = model_stats.rename(columns=column_mapping)
+
+         return model_stats
+
+     except FileNotFoundError:
+         # Return empty dataframe if CSV doesn't exist
+         return pd.DataFrame(columns=['Model', 'Average WER ⬇️', 'RTFx ⬆️'])
+
+
+ # Load results
+ leaderboard_df = load_results()

  demo = gr.Blocks(css=custom_css)
  with demo:

      gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

      with gr.Tabs(elem_classes="tab-buttons") as tabs:
+         with gr.TabItem("🏅 Chilean Spanish ASR Leaderboard", elem_id="leaderboard-tab", id=0):
+             gr.Dataframe(
+                 value=leaderboard_df,
+                 interactive=False,
+                 wrap=True,
+                 datatype=["markdown"] + ["number"] * (len(leaderboard_df.columns) - 1)
              )

+         with gr.TabItem("📝 About", elem_id="about-tab", id=1):
+             gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")
+
      with gr.Row():
          with gr.Accordion("📙 Citation", open=False):
              citation_button = gr.Textbox(

                  show_copy_button=True,
              )

+ demo.launch()
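
Editor's note on the `app.py` change above: `load_results()` pools total audio length and total transcription time per model before dividing, rather than averaging per-dataset RTFx values. A minimal sketch on made-up toy data; only the formula mirrors the function above:

```python
import pandas as pd

# Toy rows in the shape of results.csv; all values are made up.
toy = pd.DataFrame({
    "model_id": ["demo/model-a", "demo/model-a", "demo/model-b"],
    "total_audio_length": [3600.0, 1800.0, 3600.0],  # seconds of audio processed
    "total_time": [36.0, 20.0, 360.0],               # seconds spent transcribing
})

sums = toy.groupby("model_id")[["total_audio_length", "total_time"]].sum()
rtfx = (sums["total_audio_length"] / sums["total_time"]).round(2)
print(rtfx)  # demo/model-a -> 96.43, demo/model-b -> 10.0
```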
 
 
 
requirements.txt CHANGED
@@ -1,16 +1,2 @@
- APScheduler
- black
- datasets
- gradio
- gradio[oauth]
- gradio_leaderboard==0.0.13
- gradio_client
- huggingface-hub>=0.18.0
- matplotlib
- numpy
- pandas
- python-dateutil
- tqdm
- transformers
- tokenizers>=0.15.0
- sentencepiece
+ gradio==4.44.0
+ pandas==2.0.3
results.csv ADDED
@@ -0,0 +1,34 @@
+ dataset,num_samples,total_time,total_runtime,job_id,model_id,wer,rtfx,total_audio_length
+ google_chilean_spanish,4374,169.16428009035442,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,2.86,152.15,25737.9899375
+ datarisas,50,1.9580847612551107,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,17.07,190.83,373.662875
+ common_voice,152,5.849061822395057,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,4.94,151.65,887.004
+ datarisas,50,29.12175529364519,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,16.4,12.83,373.662875
+ google_chilean_spanish,4374,2460.3554988628057,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,3.3,10.46,25737.9899375
+ common_voice,152,84.19953364344755,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,2.21,10.53,887.004
+ datarisas,50,2.8973398348924038,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,16.53,128.97,373.662875
+ google_chilean_spanish,4374,252.6714496900961,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,4.6,101.86,25737.9899375
+ common_voice,152,8.742226805006748,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,3.64,101.46,887.004
+ datarisas,50,0.24104375075209022,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,16.4,1550.19,373.662875
+ google_chilean_spanish,4374,21.062983242975022,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,4.44,1221.95,25737.9899375
+ common_voice,152,0.721991695274016,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,2.86,1228.55,887.004
+ google_chilean_spanish,4374,38.13183395634415,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,4.95,674.97,25737.9899375
+ datarisas,50,0.4448383559679799,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,20.93,840.0,373.662875
+ common_voice,152,1.3294066856862687,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,3.58,667.22,887.004
+ google_chilean_spanish,4374,1046.1813324201853,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,4.65,24.6,25737.9899375
+ datarisas,50,12.073279585432461,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,16.8,30.95,373.662875
+ common_voice,152,36.39141853038421,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,3.58,24.37,887.004
+ datarisas,50,37.41451234264182,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,20.67,9.99,373.662875
+ google_chilean_spanish,4374,3188.2717377215713,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,4.44,8.07,25737.9899375
+ common_voice,152,110.95904769604698,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,3.25,7.99,887.004
+ datarisas,50,1.4263290456974054,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,30.8,261.98,373.662875
+ google_chilean_spanish,4374,125.01975872613868,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,7.99,205.87,25737.9899375
+ common_voice,152,4.290577019156153,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,10.34,206.73,887.004
+ google_chilean_spanish,4374,113.85375795331139,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,2.37,226.06,25737.9899375
+ datarisas,50,1.3030618462718029,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,30.13,286.76,373.662875
+ common_voice,152,3.9525142864162195,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,13.4,224.42,887.004
+ google_chilean_spanish,4374,177.93435856938717,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,4.64,144.65,25737.9899375
+ common_voice,152,6.106875077747236,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,2.86,145.25,887.004
+ datarisas,50,2.0549551418640357,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,20.93,181.84,373.662875
+ datarisas,50,13.944635539861737,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,35.07,26.8,373.662875
+ google_chilean_spanish,4374,1306.5544163221957,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,5.09,19.7,25737.9899375
+ common_voice,152,44.908032676899815,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,4.16,19.75,887.004
src/about.py CHANGED
@@ -1,72 +1,178 @@
- from dataclasses import dataclass
- from enum import Enum

- @dataclass
- class Task:
-     benchmark: str
-     metric: str
-     col_name: str


- # Select your tasks here
- # ---------------------------------------------------
- class Tasks(Enum):
-     # task_key in the json file, metric_key in the json file, name to display in the leaderboard
-     task0 = Task("anli_r1", "acc", "ANLI")
-     task1 = Task("logiqa", "acc_norm", "LogiQA")

- NUM_FEWSHOT = 0 # Change with your few shot
- # ---------------------------------------------------


- # Your leaderboard name
- TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""

- # What does your leaderboard evaluate?
- INTRODUCTION_TEXT = """
- Intro text
- """

- # Which evaluations are you running? how can people reproduce what you have?
- LLM_BENCHMARKS_TEXT = f"""
- ## How it works

- ## Reproducibility
- To reproduce our results, here is the commands you can run:

- """

- EVALUATION_QUEUE_TEXT = """
- ## Some good practices before submitting a model

- ### 1) Make sure you can load your model and tokenizer using AutoClasses:
- ```python
- from transformers import AutoConfig, AutoModel, AutoTokenizer
- config = AutoConfig.from_pretrained("your model name", revision=revision)
- model = AutoModel.from_pretrained("your model name", revision=revision)
- tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
  ```
- If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

- Note: make sure your model is public!
- Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!

- ### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
- It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

- ### 3) Make sure your model has an open license!
- This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

- ### 4) Fill up your model card
- When we add extra information about models to the leaderboard, it will be automatically taken from the model card

- ## In case of model failure
- If your model is displayed in the `FAILED` category, its execution stopped.
- Make sure you have followed the above steps first.
- If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
  """

  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
- CITATION_BUTTON_TEXT = r"""
  """
+ # Chilean Spanish ASR Leaderboard Configuration

+ # Your leaderboard name
+ TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""

+ # What does your leaderboard evaluate?
+ INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
+ on Chilean Spanish speech data from the Hugging Face Hub. \
+ \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
+ \nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas.
+
+ 🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""

+ # About section content
+ ABOUT_TEXT = """
+ ## About This Leaderboard

+ This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.

+ ### What is the Open ASR Leaderboard?

+ The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.

+ ### How This Repository Differs

+ This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:

+ | Aspect | Original Open ASR Leaderboard | This Repository |
+ |--------|-------------------------------|-----------------|
+ | **Target Language** | English (primarily) | Chilean Spanish |
+ | **Datasets** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
+ | **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
+ | **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
+ | **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |

+ ---

+ ## Models Evaluated

+ This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:
+
+ | Model | Type | Framework | Parameters | Notes |
+ |-------|------|-----------|------------|-------|
+ | **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
+ | **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
+ | **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish |
+ | **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
+ | **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish |
+ | **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
+ | **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
+ | **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 14B | Microsoft's multimodal LLM with audio |
+ | **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
+ | **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
+ | **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |
+
+ ## Dataset
+
+ This evaluation uses a comprehensive Chilean Spanish test dataset that combines three different sources of Chilean Spanish speech data:
+
+ ### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)
+
+ This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles:
+
+ 1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.
+
+ 2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)):
+    - 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
+    - Recorded by 31 volunteers.
+    - Intended for speech technologies.
+    - Restructured from the original OpenSLR archives for easier streaming.
+
+ 3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
+    - Audio fragments from comedy routines at the Festival de Viña del Mar.
+    - Represents spontaneous, colloquial Chilean Spanish.
+    - Captures humor and cultural expressions specific to Chile.
+
+ **Combined Dataset Properties:**
+ - **Language**: Spanish (Chilean variant)
+ - **Split**: `test`
+ - **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
+ - **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish
+
+ ## Metrics
+
+ Following the Open ASR Leaderboard standard, we report:
+
+ - **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
+ - **RTFx (Inverse Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)
+
+ ### Word Error Rate (WER)
+ Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
+ of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
+
+ Take the following example:
+ | Reference:  | el | gato | se | sentó | en | la | alfombra |
+ |-------------|----|------|----|-------|----|----|----------|
+ | Prediction: | el | gato | se | sentó | en | la |          |
+ | Label:      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D |
+
+ Here, we have:
+ * 0 substitutions
+ * 0 insertions
+ * 1 deletion ("alfombra" is missing)
+
+ This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
+ reference (N), which for this example is 7:

  ```
+ WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
+ ```
+
+ Giving a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing is removed from the references and predictions.
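
As a hedged illustration (the leaderboard's own pipeline may compute WER differently), the toy example above can be reproduced with the `jiwer` package:

```python
import jiwer  # pip install jiwer

reference = "el gato se sentó en la alfombra"
prediction = "el gato se sentó en la"

wer = jiwer.wer(reference, prediction)
print(f"WER = {wer:.3f}")  # 1 deletion / 7 reference words ≈ 0.143
```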

+ ### Inverse Real-Time Factor (RTFx)
+ Inverse Real-Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
+ model to process a given amount of speech. It is defined as:
+
+ ```
+ RTFx = (number of seconds of audio inferred) / (compute time in seconds)
+ ```

+ Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it processes the audio in half that time.
+ Thus, **a higher RTFx value indicates lower latency**.

+ ## Text Normalization for Spanish
+
+ This repository uses a **multilingual normalizer** configured to preserve Spanish characters:
+
+ ```python
+ normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
+ ```
+
+ **What it does:**
+ - ✅ Preserves: `á, é, í, ó, ú, ñ, ü, ¿, ¡`
+ - ✅ Removes: Brackets `[...]`, parentheses `(...)`, special symbols
+ - ✅ Normalizes: Whitespace, capitalization (converts to lowercase)
+ - ❌ Does NOT remove: Accents or Spanish-specific characters
+
+ **Example:**
+ ```python
+ Input: "¿Cómo estás? [ruido] (suspiro)"
+ Output: "cómo estás"
+ ```

+ This is critical for Spanish evaluation, as diacritics change word meaning:
+ - `esta` (this) vs. `está` (is)
+ - `si` (if) vs. `sí` (yes)
+ - `el` (the) vs. `él` (he)
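
Editor's note: the exact behaviour of `BasicMultilingualTextNormalizer` is defined in the evaluation code; the sketch below is only a rough approximation of what is described above (lowercase, drop bracketed/parenthesised annotations, strip punctuation, keep diacritics), following the worked example, which also strips the inverted question mark:

```python
import re

def normalize_es(text: str) -> str:
    """Rough, illustrative approximation of the normalization described above."""
    text = text.lower()
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)  # drop [ruido] / (suspiro) style annotations
    text = re.sub(r"[^\w\s]", " ", text)               # strip punctuation; \w keeps á, é, í, ó, ú, ü, ñ
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace and trim

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # -> "cómo estás"
```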

+ ## How to reproduce our results
+ The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation.
+ For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl
  """

  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
+ CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
+   title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
+   author={Instituto de Data Science UDD},
+   year={2025},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
+ }
+
+ @misc{astroza2025chilean-dataset,
+   title={Chilean Spanish ASR Test Dataset},
+   author={Alonso Astroza},
+   year={2025},
+   howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
+ }
+
+ @misc{open-asr-leaderboard,
+   title={Open Automatic Speech Recognition Leaderboard},
+   author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
+   year={2023},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
+ }
  """