astroza committed
Commit 13a06cd · 1 Parent(s): 309f2b5

Update leaderboard configuration and results processing for Chilean Spanish ASR evaluation

Files changed (6)
  1. .gitignore +1 -0
  2. README.md +71 -31
  3. app.py +85 -173
  4. requirements.txt +2 -16
  5. results.csv +34 -0
  6. src/about.py +155 -49
.gitignore CHANGED
@@ -11,3 +11,4 @@ eval-results/
  eval-queue-bk/
  eval-results-bk/
  logs/
+ .github/copilot-instructions.md
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: Open Asr Leaderboard Cl
  emoji: 🥇
  colorFrom: green
  colorTo: indigo
@@ -7,42 +7,82 @@ sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
- short_description: Duplicate this leaderboard to initialize your own!
- sdk_version: 5.43.1
  tags:
  - leaderboard
  ---

- # Start the configuration
-
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-
- Results files should have the following format and be stored as json files:
- ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
-     },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
- }
  ```

- Request files are created automatically by this tool.

- If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.

- # Code logic for more complex edits

- You'll find
- - the main table' columns names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
  ---
+ title: Open Asr Leaderboard CL
  emoji: 🥇
  colorFrom: green
  colorTo: indigo

  app_file: app.py
  pinned: true
  license: apache-2.0
+ short_description: Open ASR Leaderboard for Chilean Spanish
+ sdk_version: 4.44.0
  tags:
  - leaderboard
  ---

+ # Chilean Spanish ASR Leaderboard
+
+ > **Simple Gradio-based leaderboard displaying ASR evaluation results for Chilean Spanish models.**
+
+ ## Quick Start
+
+ This is a simplified version that displays results from a CSV file with two tabs:
+ - **🏅 Chilean Spanish ASR Leaderboard**: Shows model rankings based on WER and RTFx metrics
+ - **📝 About**: Detailed information about the evaluation methodology and datasets
+
+ ### Running the Leaderboard
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/aastroza/open_asr_leaderboard_cl.git
+ cd open_asr_leaderboard_cl
+
+ # Install dependencies
+ pip install gradio pandas
+
+ # Run the application
+ python app.py
  ```

+ The application will load results from `results.csv` and display them in a simple, clean interface.
+
+ ### Results Format

+ The `results.csv` file should contain the following columns:
+ - `model_id`: The model identifier (e.g., "openai/whisper-large-v3")
+ - `wer`: Word Error Rate (lower is better)
+ - `rtfx`: Inverse Real-Time Factor (higher is better)
+ - Additional metadata columns (dataset, num_samples, etc.)
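
As an illustrative aside (not part of the README being added in this commit), the schema above can be sanity-checked with pandas before the app consumes it; the column names come from the header of `results.csv`, everything else here is hypothetical:

```python
import pandas as pd

# Hypothetical check for the columns described above (illustrative only).
REQUIRED = {"model_id", "dataset", "wer", "rtfx", "total_audio_length", "total_time"}

df = pd.read_csv("results.csv")
missing = REQUIRED - set(df.columns)
if missing:
    raise ValueError(f"results.csv is missing expected columns: {sorted(missing)}")

print(df[["model_id", "dataset", "wer", "rtfx"]].head())
```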

+ ### Configuration

+ - **Title and Content**: Edit `src/about.py` to modify the title, introduction text, and about section
+ - **Styling**: Customize appearance in `src/display/css_html_js.py`
+ - **Data Processing**: Modify the `load_results()` function in `app.py` to change how results are aggregated and displayed
+
+ ## About the Evaluation
+
+ This leaderboard evaluates ASR models on Chilean Spanish using three datasets:
+ - **Common Voice** (Chilean Spanish subset)
+ - **Google Chilean Spanish**
+ - **Datarisas**
+
+ Models are ranked by average Word Error Rate (WER) across all datasets, with inverse Real-Time Factor (RTFx) as a secondary metric for inference speed.
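
For illustration only, this ranking can be reproduced from `results.csv` with pandas; the aggregation mirrors what `load_results()` in `app.py` does, and the variable names here are hypothetical:

```python
import pandas as pd

df = pd.read_csv("results.csv")

# Average WER per model across the three datasets; lower is better.
ranking = (
    df.groupby("model_id")["wer"]
      .mean()
      .round(2)
      .sort_values()
      .reset_index(name="average_wer")
)
print(ranking.to_string(index=False))
```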
+
+ ## Models Evaluated
+
+ - openai/whisper-large-v3
+ - openai/whisper-large-v3-turbo
+ - openai/whisper-small
+ - rcastrovexler/whisper-small-es-cl (Chilean Spanish fine-tuned)
+ - surus-lat/whisper-large-v3-turbo-latam (Latin American Spanish fine-tuned)
+ - nvidia/canary-1b-v2
+ - nvidia/parakeet-tdt-0.6b-v3
+ - microsoft/Phi-4-multimodal-instruct
+ - mistralai/Voxtral-Mini-3B-2507
+ - facebookresearch/omniASR_LLM_7B
+ - elevenlabs/scribe_v1
+
+ For detailed methodology and the complete evaluation framework, see the Modal-based evaluation code in the original repository.
+
+ ## Citation
+
+ ```bibtex
+ @misc{astroza2025chilean,
+   title={Chilean Spanish ASR Test Dataset},
+   author={Alonso Astroza},
+   year={2025},
+   howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
+ }
+ ```
app.py CHANGED
@@ -1,93 +1,90 @@
  import gradio as gr
- from gradio_leaderboard import Leaderboard, ColumnFilter, SelectColumns
  import pandas as pd
- from apscheduler.schedulers.background import BackgroundScheduler
- from huggingface_hub import snapshot_download

  from src.about import (
      CITATION_BUTTON_LABEL,
      CITATION_BUTTON_TEXT,
-     EVALUATION_QUEUE_TEXT,
      INTRODUCTION_TEXT,
-     LLM_BENCHMARKS_TEXT,
      TITLE,
  )
  from src.display.css_html_js import custom_css
- from src.display.utils import (
-     BENCHMARK_COLS,
-     COLS,
-     EVAL_COLS,
-     EVAL_TYPES,
-     AutoEvalColumn,
-     ModelType,
-     fields,
-     WeightType,
-     Precision
- )
- from src.envs import API, EVAL_REQUESTS_PATH, EVAL_RESULTS_PATH, QUEUE_REPO, REPO_ID, RESULTS_REPO, TOKEN
- from src.populate import get_evaluation_queue_df, get_leaderboard_df
- from src.submission.submit import add_new_eval
-
-
- def restart_space():
-     API.restart_space(repo_id=REPO_ID)
-
- ### Space initialisation
- try:
-     print(EVAL_REQUESTS_PATH)
-     snapshot_download(
-         repo_id=QUEUE_REPO, local_dir=EVAL_REQUESTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-     )
- except Exception:
-     restart_space()
- try:
-     print(EVAL_RESULTS_PATH)
-     snapshot_download(
-         repo_id=RESULTS_REPO, local_dir=EVAL_RESULTS_PATH, repo_type="dataset", tqdm_class=None, etag_timeout=30, token=TOKEN
-     )
- except Exception:
-     restart_space()


- LEADERBOARD_DF = get_leaderboard_df(EVAL_RESULTS_PATH, EVAL_REQUESTS_PATH, COLS, BENCHMARK_COLS)
-
- (
-     finished_eval_queue_df,
-     running_eval_queue_df,
-     pending_eval_queue_df,
- ) = get_evaluation_queue_df(EVAL_REQUESTS_PATH, EVAL_COLS)
-
- def init_leaderboard(dataframe):
-     if dataframe is None or dataframe.empty:
-         raise ValueError("Leaderboard DataFrame is empty or None.")
-     return Leaderboard(
-         value=dataframe,
-         datatype=[c.type for c in fields(AutoEvalColumn)],
-         select_columns=SelectColumns(
-             default_selection=[c.name for c in fields(AutoEvalColumn) if c.displayed_by_default],
-             cant_deselect=[c.name for c in fields(AutoEvalColumn) if c.never_hidden],
-             label="Select Columns to Display:",
-         ),
-         search_columns=[AutoEvalColumn.model.name, AutoEvalColumn.license.name],
-         hide_columns=[c.name for c in fields(AutoEvalColumn) if c.hidden],
-         filter_columns=[
-             ColumnFilter(AutoEvalColumn.model_type.name, type="checkboxgroup", label="Model types"),
-             ColumnFilter(AutoEvalColumn.precision.name, type="checkboxgroup", label="Precision"),
-             ColumnFilter(
-                 AutoEvalColumn.params.name,
-                 type="slider",
-                 min=0.01,
-                 max=150,
-                 label="Select the number of parameters (B)",
-             ),
-             ColumnFilter(
-                 AutoEvalColumn.still_on_hub.name, type="boolean", label="Deleted/incomplete", default=True
-             ),
-         ],
-         bool_checkboxgroup_label="Hide models",
-         interactive=False,
-     )
-

  demo = gr.Blocks(css=custom_css)
  with demo:
@@ -95,99 +92,17 @@ with demo:
      gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

      with gr.Tabs(elem_classes="tab-buttons") as tabs:
-         with gr.TabItem("🏅 LLM Benchmark", elem_id="llm-benchmark-tab-table", id=0):
-             leaderboard = init_leaderboard(LEADERBOARD_DF)
-
-         with gr.TabItem("📝 About", elem_id="llm-benchmark-tab-table", id=2):
-             gr.Markdown(LLM_BENCHMARKS_TEXT, elem_classes="markdown-text")
-
-         with gr.TabItem("🚀 Submit here! ", elem_id="llm-benchmark-tab-table", id=3):
-             with gr.Column():
-                 with gr.Row():
-                     gr.Markdown(EVALUATION_QUEUE_TEXT, elem_classes="markdown-text")
-
-                 with gr.Column():
-                     with gr.Accordion(
-                         f"✅ Finished Evaluations ({len(finished_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             finished_eval_table = gr.components.Dataframe(
-                                 value=finished_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-                     with gr.Accordion(
-                         f"🔄 Running Evaluation Queue ({len(running_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             running_eval_table = gr.components.Dataframe(
-                                 value=running_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-
-                     with gr.Accordion(
-                         f"⏳ Pending Evaluation Queue ({len(pending_eval_queue_df)})",
-                         open=False,
-                     ):
-                         with gr.Row():
-                             pending_eval_table = gr.components.Dataframe(
-                                 value=pending_eval_queue_df,
-                                 headers=EVAL_COLS,
-                                 datatype=EVAL_TYPES,
-                                 row_count=5,
-                             )
-             with gr.Row():
-                 gr.Markdown("# ✉️✨ Submit your model here!", elem_classes="markdown-text")
-
-             with gr.Row():
-                 with gr.Column():
-                     model_name_textbox = gr.Textbox(label="Model name")
-                     revision_name_textbox = gr.Textbox(label="Revision commit", placeholder="main")
-                     model_type = gr.Dropdown(
-                         choices=[t.to_str(" : ") for t in ModelType if t != ModelType.Unknown],
-                         label="Model type",
-                         multiselect=False,
-                         value=None,
-                         interactive=True,
-                     )
-
-                 with gr.Column():
-                     precision = gr.Dropdown(
-                         choices=[i.value.name for i in Precision if i != Precision.Unknown],
-                         label="Precision",
-                         multiselect=False,
-                         value="float16",
-                         interactive=True,
-                     )
-                     weight_type = gr.Dropdown(
-                         choices=[i.value.name for i in WeightType],
-                         label="Weights type",
-                         multiselect=False,
-                         value="Original",
-                         interactive=True,
-                     )
-                     base_model_name_textbox = gr.Textbox(label="Base model (for delta or adapter weights)")
-
-             submit_button = gr.Button("Submit Eval")
-             submission_result = gr.Markdown()
-             submit_button.click(
-                 add_new_eval,
-                 [
-                     model_name_textbox,
-                     base_model_name_textbox,
-                     revision_name_textbox,
-                     precision,
-                     weight_type,
-                     model_type,
-                 ],
-                 submission_result,
              )

      with gr.Row():
          with gr.Accordion("📙 Citation", open=False):
              citation_button = gr.Textbox(
@@ -198,7 +113,4 @@ with demo:
                  show_copy_button=True,
              )

- scheduler = BackgroundScheduler()
- scheduler.add_job(restart_space, "interval", seconds=1800)
- scheduler.start()
- demo.queue(default_concurrency_limit=40).launch()
  import gradio as gr
  import pandas as pd

  from src.about import (
      CITATION_BUTTON_LABEL,
      CITATION_BUTTON_TEXT,
      INTRODUCTION_TEXT,
+     ABOUT_TEXT,
      TITLE,
  )
  from src.display.css_html_js import custom_css


+ def load_results():
+     """Load and process results from CSV file"""
+     try:
+         df = pd.read_csv("results.csv")
+
+         # Get WER by dataset for each model
+         wer_by_dataset = df.pivot_table(
+             index='model_id',
+             columns='dataset',
+             values='wer',
+             aggfunc='mean'
+         ).round(2)
+
+         # Calculate overall average WER
+         wer_by_dataset['Average WER'] = df.groupby('model_id')['wer'].mean().round(2)
+
+         # Calculate RTFx properly: sum(total_audio_length) / sum(total_time)
+         audio_time_sums = df.groupby('model_id').agg({
+             'total_audio_length': 'sum',
+             'total_time': 'sum'
+         })
+         rtfx_calculated = (audio_time_sums['total_audio_length'] / audio_time_sums['total_time']).round(2)
+
+         # Combine all metrics
+         model_stats = wer_by_dataset.copy()
+         model_stats['RTFx'] = rtfx_calculated
+
+         # Set RTFx to N/A for ElevenLabs (API-based, not a local model)
+         elevenlabs_mask = model_stats.index.str.contains('elevenlabs', case=False, na=False)
+         model_stats.loc[elevenlabs_mask, 'RTFx'] = 'N/A'
+
+         # Sort by average WER (lower is better)
+         model_stats = model_stats.sort_values('Average WER')
+
+         # Reset index to make model_id a column
+         model_stats = model_stats.reset_index()
+
+         # Reorder columns: Model, Average WER first, then Datarisas, then other datasets, then RTFx
+         dataset_columns = [col for col in model_stats.columns if col not in ['model_id', 'Average WER', 'RTFx']]
+
+         # Put datarisas first, then other datasets
+         datarisas_col = [col for col in dataset_columns if 'datarisas' in col.lower()]
+         other_dataset_cols = [col for col in dataset_columns if 'datarisas' not in col.lower()]
+         ordered_dataset_cols = datarisas_col + other_dataset_cols
+
+         new_column_order = ['model_id', 'Average WER'] + ordered_dataset_cols + ['RTFx']
+         model_stats = model_stats[new_column_order]
+
+         # Convert model names to appropriate links
+         def create_model_link(model_name):
+             if 'elevenlabs' in model_name.lower():
+                 return f'<a href="https://elevenlabs.io/speech-to-text" target="_blank">{model_name}</a>'
+             else:
+                 return f'<a href="https://huggingface.co/{model_name}" target="_blank">{model_name}</a>'
+
+         model_stats['model_id'] = model_stats['model_id'].apply(create_model_link)
+
+         # Rename columns for better display
+         column_mapping = {'model_id': 'Model', 'Average WER': 'Average WER ⬇️', 'RTFx': 'RTFx ⬆️'}
+         # Add arrows to dataset WER columns
+         for col in dataset_columns:
+             column_mapping[col] = f'{col.replace("_", " ").title()} WER ⬇️'
+
+         model_stats = model_stats.rename(columns=column_mapping)
+
+         return model_stats
+
+     except FileNotFoundError:
+         # Return empty dataframe if CSV doesn't exist
+         return pd.DataFrame(columns=['Model', 'Average WER ⬇️', 'RTFx ⬆️'])
+
+
+ # Load results
+ leaderboard_df = load_results()

  demo = gr.Blocks(css=custom_css)
  with demo:

      gr.Markdown(INTRODUCTION_TEXT, elem_classes="markdown-text")

      with gr.Tabs(elem_classes="tab-buttons") as tabs:
+         with gr.TabItem("🏅 Chilean Spanish ASR Leaderboard", elem_id="leaderboard-tab", id=0):
+             gr.Dataframe(
+                 value=leaderboard_df,
+                 interactive=False,
+                 wrap=True,
+                 datatype=["markdown"] + ["number"] * (len(leaderboard_df.columns) - 1)
              )

+         with gr.TabItem("📝 About", elem_id="about-tab", id=1):
+             gr.Markdown(ABOUT_TEXT, elem_classes="markdown-text")
+
      with gr.Row():
          with gr.Accordion("📙 Citation", open=False):
              citation_button = gr.Textbox(

                  show_copy_button=True,
              )

+ demo.launch()
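
Editor's note on the `app.py` change above: `load_results()` pools total audio length and total transcription time per model before dividing, rather than averaging per-dataset RTFx values. A minimal sketch on made-up toy data; only the formula mirrors the function above:

```python
import pandas as pd

# Toy rows in the shape of results.csv; all values are made up.
toy = pd.DataFrame({
    "model_id": ["demo/model-a", "demo/model-a", "demo/model-b"],
    "total_audio_length": [3600.0, 1800.0, 3600.0],  # seconds of audio processed
    "total_time": [36.0, 20.0, 360.0],               # seconds spent transcribing
})

sums = toy.groupby("model_id")[["total_audio_length", "total_time"]].sum()
rtfx = (sums["total_audio_length"] / sums["total_time"]).round(2)
print(rtfx)  # demo/model-a -> 96.43, demo/model-b -> 10.0
```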
 
 
 
requirements.txt CHANGED
@@ -1,16 +1,2 @@
- APScheduler
- black
- datasets
- gradio
- gradio[oauth]
- gradio_leaderboard==0.0.13
- gradio_client
- huggingface-hub>=0.18.0
- matplotlib
- numpy
- pandas
- python-dateutil
- tqdm
- transformers
- tokenizers>=0.15.0
- sentencepiece
+ gradio==4.44.0
+ pandas==2.0.3
results.csv ADDED
@@ -0,0 +1,34 @@
+ dataset,num_samples,total_time,total_runtime,job_id,model_id,wer,rtfx,total_audio_length
+ google_chilean_spanish,4374,169.16428009035442,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,2.86,152.15,25737.9899375
+ datarisas,50,1.9580847612551107,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,17.07,190.83,373.662875
+ common_voice,152,5.849061822395057,72.870062367,Transformers_2025-10-26_23-22-40,openai/whisper-large-v3-turbo,4.94,151.65,887.004
+ datarisas,50,29.12175529364519,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,16.4,12.83,373.662875
+ google_chilean_spanish,4374,2460.3554988628057,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,3.3,10.46,25737.9899375
+ common_voice,152,84.19953364344755,376.658109154,ElevenLabs_2025-10-26_23-47-14,elevenlabs/scribe_v1,2.21,10.53,887.004
+ datarisas,50,2.8973398348924038,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,16.53,128.97,373.662875
+ google_chilean_spanish,4374,252.6714496900961,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,4.6,101.86,25737.9899375
+ common_voice,152,8.742226805006748,71.440938334,Transformers_2025-10-27_00-23-30,openai/whisper-large-v3,3.64,101.46,887.004
+ datarisas,50,0.24104375075209022,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,16.4,1550.19,373.662875
+ google_chilean_spanish,4374,21.062983242975022,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,4.44,1221.95,25737.9899375
+ common_voice,152,0.721991695274016,39.020745296,NeMo_2025-10-27_00-26-07,nvidia/parakeet-tdt-0.6b-v3,2.86,1228.55,887.004
+ google_chilean_spanish,4374,38.13183395634415,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,4.95,674.97,25737.9899375
+ datarisas,50,0.4448383559679799,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,20.93,840.0,373.662875
+ common_voice,152,1.3294066856862687,66.793817482,NeMo_2025-10-27_00-27-29,nvidia/canary-1b-v2,3.58,667.22,887.004
+ google_chilean_spanish,4374,1046.1813324201853,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,4.65,24.6,25737.9899375
+ datarisas,50,12.073279585432461,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,16.8,30.95,373.662875
+ common_voice,152,36.39141853038421,136.537279006,Voxtral_2025-10-27_00-29-28,mistralai/Voxtral-Mini-3B-2507,3.58,24.37,887.004
+ datarisas,50,37.41451234264182,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,20.67,9.99,373.662875
+ google_chilean_spanish,4374,3188.2717377215713,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,4.44,8.07,25737.9899375
+ common_voice,152,110.95904769604698,487.718295923,Phi4Multimodal_2025-10-27_00-32-26,microsoft/Phi-4-multimodal-instruct,3.25,7.99,887.004
+ datarisas,50,1.4263290456974054,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,30.8,261.98,373.662875
+ google_chilean_spanish,4374,125.01975872613868,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,7.99,205.87,25737.9899375
+ common_voice,152,4.290577019156153,44.071024773999994,Transformers_2025-10-27_00-59-35,openai/whisper-small,10.34,206.73,887.004
+ google_chilean_spanish,4374,113.85375795331139,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,2.37,226.06,25737.9899375
+ datarisas,50,1.3030618462718029,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,30.13,286.76,373.662875
+ common_voice,152,3.9525142864162195,36.504430927,Transformers_2025-10-27_01-00-59,rcastrovexler/whisper-small-es-cl,13.4,224.42,887.004
+ google_chilean_spanish,4374,177.93435856938717,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,4.64,144.65,25737.9899375
+ common_voice,152,6.106875077747236,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,2.86,145.25,887.004
+ datarisas,50,2.0549551418640357,59.272752274,Transformers_2025-10-27_16-19-28,surus-lat/whisper-large-v3-turbo-latam,20.93,181.84,373.662875
+ datarisas,50,13.944635539861737,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,35.07,26.8,373.662875
+ google_chilean_spanish,4374,1306.5544163221957,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,5.09,19.7,25737.9899375
+ common_voice,152,44.908032676899815,498.902513566,Omnilingual_2025-11-10_23-37-32,facebookresearch/omniASR_LLM_7B,4.16,19.75,887.004
src/about.py CHANGED
@@ -1,72 +1,178 @@
- from dataclasses import dataclass
- from enum import Enum

- @dataclass
- class Task:
-     benchmark: str
-     metric: str
-     col_name: str


- # Select your tasks here
- # ---------------------------------------------------
- class Tasks(Enum):
-     # task_key in the json file, metric_key in the json file, name to display in the leaderboard
-     task0 = Task("anli_r1", "acc", "ANLI")
-     task1 = Task("logiqa", "acc_norm", "LogiQA")

- NUM_FEWSHOT = 0 # Change with your few shot
- # ---------------------------------------------------


- # Your leaderboard name
- TITLE = """<h1 align="center" id="space-title">Demo leaderboard</h1>"""

- # What does your leaderboard evaluate?
- INTRODUCTION_TEXT = """
- Intro text
- """

- # Which evaluations are you running? how can people reproduce what you have?
- LLM_BENCHMARKS_TEXT = f"""
- ## How it works

- ## Reproducibility
- To reproduce our results, here is the commands you can run:

- """

- EVALUATION_QUEUE_TEXT = """
- ## Some good practices before submitting a model

- ### 1) Make sure you can load your model and tokenizer using AutoClasses:
- ```python
- from transformers import AutoConfig, AutoModel, AutoTokenizer
- config = AutoConfig.from_pretrained("your model name", revision=revision)
- model = AutoModel.from_pretrained("your model name", revision=revision)
- tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
  ```
- If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

- Note: make sure your model is public!
- Note: if your model needs `use_remote_code=True`, we do not support this option yet but we are working on adding it, stay posted!

- ### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
- It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!

- ### 3) Make sure your model has an open license!
- This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

- ### 4) Fill up your model card
- When we add extra information about models to the leaderboard, it will be automatically taken from the model card

- ## In case of model failure
- If your model is displayed in the `FAILED` category, its execution stopped.
- Make sure you have followed the above steps first.
- If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
  """

  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
- CITATION_BUTTON_TEXT = r"""
  """
+ # Chilean Spanish ASR Leaderboard Configuration

+ # Your leaderboard name
+ TITLE = """<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard - Chilean Spanish </h1> </body> </html>"""

+ # What does your leaderboard evaluate?
+ INTRODUCTION_TEXT = """📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
+ on Chilean Spanish speech data from the Hugging Face Hub. \
+ \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️ lower is better) and [RTFx](https://github.com/NVIDIA/DeepLearningExamples/blob/master/Kaldi/SpeechRecognition/README.md#metrics) (⬆️ higher is better). Models are ranked based on their Average WER, from lowest to highest. \
+ \nThis leaderboard focuses specifically on **Chilean Spanish dialect** evaluation using three datasets: Common Voice (Chilean Spanish), Google Chilean Spanish, and Datarisas.
+
+ 🙏 **Special thanks to [Modal](https://modal.com/) for providing compute credits that made this evaluation possible!**"""

+ # About section content
+ ABOUT_TEXT = """
+ ## About This Leaderboard

+ This repository is a **streamlined, task-specific version** of the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) evaluation framework, specifically adapted for benchmarking ASR models on the **Chilean Spanish dialect**.

+ ### What is the Open ASR Leaderboard?

+ The [Open ASR Leaderboard](https://github.com/huggingface/open_asr_leaderboard) is a comprehensive benchmarking framework developed by Hugging Face, NVIDIA NeMo, and the community to evaluate ASR models across multiple English datasets (LibriSpeech, AMI, VoxPopuli, Earnings-22, GigaSpeech, SPGISpeech, TED-LIUM). It supports various ASR frameworks including Transformers, NeMo, SpeechBrain, and more, providing standardized WER and RTFx metrics.

+ ### How This Repository Differs

+ This Chilean Spanish adaptation makes the following key modifications to focus exclusively on Chilean Spanish ASR evaluation:

+ | Aspect | Original Open ASR Leaderboard | This Repository |
+ |--------|-------------------------------|-----------------|
+ | **Target Language** | English (primarily) | Chilean Spanish |
+ | **Datasets** | 7 English datasets (LibriSpeech, AMI, etc.) | 3 Chilean Spanish datasets (Common Voice, Google Chilean Spanish, Datarisas) |
+ | **Text Normalization** | English text normalizer | **Multilingual normalizer** preserving Spanish accents (á, é, í, ó, ú, ñ) |
+ | **Model Focus** | Broad coverage (~50+ models) | **11 selected models** optimized for multilingual/Spanish ASR |
+ | **Execution** | Local GPU execution | **Cloud-based** parallel execution via Modal |

+ ---

+ ## Models Evaluated

+ This repository evaluates **11 state-of-the-art ASR models** selected for their multilingual or Spanish language support:
+
+ | Model | Type | Framework | Parameters | Notes |
+ |-------|------|-----------|------------|-------|
+ | **openai/whisper-large-v3** | Multilingual | Transformers | 1.5B | OpenAI's flagship ASR model |
+ | **openai/whisper-large-v3-turbo** | Multilingual | Transformers | 809M | Faster Whisper variant |
+ | **surus-lat/whisper-large-v3-turbo-latam** | Multilingual | Transformers | 809M | Fine-tuned model for Latam Spanish |
+ | **openai/whisper-small** | Multilingual | Transformers | 244M | Reference baseline model |
+ | **rcastrovexler/whisper-small-es-cl** | Chilean Spanish | Transformers | 244M | Only fine-tuned model found for Chilean Spanish |
+ | **nvidia/canary-1b-v2** | Multilingual | NeMo | 1B | NVIDIA's multilingual ASR |
+ | **nvidia/parakeet-tdt-0.6b-v3** | Multilingual | NeMo | 0.6B | Lightweight, fast inference |
+ | **microsoft/Phi-4-multimodal-instruct** | Multimodal | Phi | 14B | Microsoft's multimodal LLM with audio |
+ | **mistralai/Voxtral-Mini-3B-2507** | Speech-to-text | Transformers | 3B | Mistral's ASR model |
+ | **elevenlabs/scribe_v1** | API-based | API | N/A | ElevenLabs' commercial ASR API |
+ | **facebookresearch/omniASR_LLM_7B** | Multilingual | OmniLingual ASR | 7B | FAIR's OmniLingual ASR with spa_Latn target |
+
+ ## Dataset
+
+ This evaluation uses a comprehensive Chilean Spanish test dataset that combines three different sources of Chilean Spanish speech data:
+
+ ### [`astroza/es-cl-asr-test-only`](https://huggingface.co/datasets/astroza/es-cl-asr-test-only)
+
+ This dataset aggregates three distinct Chilean Spanish speech datasets to provide comprehensive coverage of different domains and speaking styles:
+
+ 1. **Common Voice (Chilean Spanish filtered)**: Community-contributed recordings specifically filtered for Chilean Spanish dialects.
+
+ 2. **Google Chilean Spanish** ([`ylacombe/google-chilean-spanish`](https://huggingface.co/datasets/ylacombe/google-chilean-spanish)):
+    - 7 hours of transcribed high-quality audio of Chilean Spanish sentences.
+    - Recorded by 31 volunteers.
+    - Intended for speech technologies.
+    - Restructured from the original OpenSLR archives for easier streaming.
+
+ 3. **Datarisas** ([`astroza/chilean-jokes-festival-de-vina`](https://huggingface.co/datasets/astroza/chilean-jokes-festival-de-vina)):
+    - Audio fragments from comedy routines at the Festival de Viña del Mar.
+    - Represents spontaneous, colloquial Chilean Spanish.
+    - Captures humor and cultural expressions specific to Chile.
+
+ **Combined Dataset Properties:**
+ - **Language**: Spanish (Chilean variant)
+ - **Split**: `test`
+ - **Domain**: Mixed (formal recordings, volunteer speech, comedy performances)
+ - **Total Coverage**: Multiple speaking styles and contexts of Chilean Spanish
+
+ ## Metrics
+
+ Following the Open ASR Leaderboard standard, we report:
+
+ - **WER (Word Error Rate)**: ⬇️ Lower is better - Measures transcription accuracy
+ - **RTFx (Inverse Real-Time Factor)**: ⬆️ Higher is better - Measures inference speed (audio_duration / transcription_time)
+
+ ### Word Error Rate (WER)
+ Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage
+ of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.
+
+ Take the following example:
+ | Reference:  | el | gato | se | sentó | en | la | alfombra |
+ |-------------|----|------|----|-------|----|----|----------|
+ | Prediction: | el | gato | se | sentó | en | la |          |
+ | Label:      | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | D |
+
+ Here, we have:
+ * 0 substitutions
+ * 0 insertions
+ * 1 deletion ("alfombra" is missing)
+
+ This gives 1 error in total. To get our word error rate, we divide the total number of errors (substitutions + insertions + deletions) by the total number of words in our
+ reference (N), which for this example is 7:

  ```
+ WER = (S + I + D) / N = (0 + 0 + 1) / 7 = 0.143
+ ```
+
+ Giving a WER of 0.14, or 14%. For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints, meaning punctuation and casing is removed from the references and predictions.
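
As a hedged illustration (the leaderboard's own pipeline may compute WER differently), the toy example above can be reproduced with the `jiwer` package:

```python
import jiwer  # pip install jiwer

reference = "el gato se sentó en la alfombra"
prediction = "el gato se sentó en la"

wer = jiwer.wer(reference, prediction)
print(f"WER = {wer:.3f}")  # 1 deletion / 7 reference words ≈ 0.143
```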

+ ### Inverse Real-Time Factor (RTFx)
+ Inverse Real-Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
+ model to process a given amount of speech. It is defined as:
+
+ ```
+ RTFx = (number of seconds of audio inferred) / (compute time in seconds)
+ ```

+ Therefore, an RTFx of 1 means a system processes speech exactly as fast as it is spoken, while an RTFx of 2 means it processes the audio in half that time.
+ Thus, **a higher RTFx value indicates lower latency**.

+ ## Text Normalization for Spanish
+
+ This repository uses a **multilingual normalizer** configured to preserve Spanish characters:
+
+ ```python
+ normalizer = BasicMultilingualTextNormalizer(remove_diacritics=False)
+ ```
+
+ **What it does:**
+ - ✅ Preserves: `á, é, í, ó, ú, ñ, ü, ¿, ¡`
+ - ✅ Removes: Brackets `[...]`, parentheses `(...)`, special symbols
+ - ✅ Normalizes: Whitespace, capitalization (converts to lowercase)
+ - ❌ Does NOT remove: Accents or Spanish-specific characters
+
+ **Example:**
+ ```python
+ Input: "¿Cómo estás? [ruido] (suspiro)"
+ Output: "cómo estás"
+ ```

+ This is critical for Spanish evaluation, as diacritics change word meaning:
+ - `esta` (this) vs. `está` (is)
+ - `si` (if) vs. `sí` (yes)
+ - `el` (the) vs. `él` (he)
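
Editor's note: the exact behaviour of `BasicMultilingualTextNormalizer` is defined in the evaluation code; the sketch below is only a rough approximation of what is described above (lowercase, drop bracketed/parenthesised annotations, strip punctuation, keep diacritics), following the worked example, which also strips the inverted question mark:

```python
import re

def normalize_es(text: str) -> str:
    """Rough, illustrative approximation of the normalization described above."""
    text = text.lower()
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", " ", text)  # drop [ruido] / (suspiro) style annotations
    text = re.sub(r"[^\w\s]", " ", text)               # strip punctuation; \w keeps á, é, í, ó, ú, ü, ñ
    return re.sub(r"\s+", " ", text).strip()           # collapse whitespace and trim

print(normalize_es("¿Cómo estás? [ruido] (suspiro)"))  # -> "cómo estás"
```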

+ ## How to reproduce our results
+ The ASR Leaderboard evaluation was conducted using [Modal](https://modal.com) for cloud-based distributed GPU evaluation.
+ For more details head over to our repo at: https://github.com/aastroza/open_asr_leaderboard_cl
  """

  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
+ CITATION_BUTTON_TEXT = r"""@misc{chilean-open-asr-leaderboard,
+   title={Open Automatic Speech Recognition Leaderboard - Chilean Spanish},
+   author={Instituto de Data Science UDD},
+   year={2025},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/spaces/idsudd/open_asr_leaderboard_cl}}
+ }
+
+ @misc{astroza2025chilean-dataset,
+   title={Chilean Spanish ASR Test Dataset},
+   author={Alonso Astroza},
+   year={2025},
+   howpublished={\url{https://huggingface.co/datasets/astroza/es-cl-asr-test-only}}
+ }
+
+ @misc{open-asr-leaderboard,
+   title={Open Automatic Speech Recognition Leaderboard},
+   author={Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team},
+   year={2023},
+   publisher={Hugging Face},
+   howpublished={\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}}
+ }
  """