Commit 0a5d23d (1 parent: a65282b): Metadata and Methodology
Committed by David Pomerenke
README.md CHANGED

```diff
@@ -4,8 +4,15 @@ emoji: π
 colorFrom: purple
 colorTo: pink
 sdk: gradio
-license:
+license: cc-by-sa-4.0
 short_description: Evaluating LLM performance across all human languages.
+datasets:
+- openlanguagedata/flores_plus
+models:
+- meta-llama/Llama-3.3-70B-Instruct
+- mistralai/Mistral-Small-24B-Instruct-2501
+- deepseek-ai/DeepSeek-V3
+- microsoft/phi-4
 tags:
 - leaderboard
 - submission:manual
@@ -23,6 +30,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 For tag meaning, see https://huggingface.co/spaces/leaderboards/LeaderboardsExplorer
 -->
 
+[](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)
 
 # AI Language Monitor π
 
```
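The new `datasets:` entry declares FLORES+ as the Space's data dependency. As a minimal sketch of pulling one language's evaluation sentences from it (the per-language config names like `deu_Latn`, the `dev` split, and the `text` column are assumptions based on the dataset card, not on this repo; the dataset is gated, so a logged-in Hub token is needed):

```python
# Minimal sketch: load the FLORES+ sentences for one language.
# Assumptions: per-language configs named like "deu_Latn", a "dev"
# split, and a "text" column, per the dataset card; access is gated,
# so run `huggingface-cli login` first.
from datasets import load_dataset

flores_de = load_dataset("openlanguagedata/flores_plus", "deu_Latn", split="dev")
sentences = flores_de["text"][:100]  # the same fixed sentences for every language
print(sentences[0])
```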
app.py CHANGED

```diff
@@ -190,4 +190,24 @@ with gr.Blocks(title="AI Language Translation Benchmark") as demo:
     gr.DataFrame(value=df, label="Language Results", show_search="search")
     gr.Plot(value=scatter_plot, label="Language Coverage")
 
+
+    gr.Markdown("""
+## Methodology
+### Dataset
+- Uses the [FLORES+](https://huggingface.co/datasets/openlanguagedata/flores_plus) evaluation set, a high-quality human-translated benchmark covering 200 languages
+- Each language is tested with the same 100 sentences
+- All translations go from the evaluated language into a fixed set of representative target languages, sampled by number of speakers
+- Language statistics are sourced from Ethnologue and Wikidata
+
+### Models & Evaluation
+- Models are accessed through [OpenRouter](https://openrouter.ai/), covering fast models from all major labs, open and closed
+- **BLEU Score**: Translations are scored with the BLEU metric, which measures how closely the model's translation matches a human reference translation; higher is better
+
+### Language Categories
+Languages are divided into three tiers by translation difficulty:
+- High-Resource: Top 25% of languages by BLEU score (easiest to translate)
+- Mid-Resource: Middle 50% of languages
+- Low-Resource: Bottom 25% of languages (hardest to translate)
+""", container=True)
+
 demo.launch()
```
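The methodology's model access goes through OpenRouter, which exposes an OpenAI-compatible API. A hedged sketch of a single translation request (the model slug, prompt wording, and key placeholder are illustrative; the actual prompts in app.py are not part of this diff):

```python
# Hedged sketch: one translation request via OpenRouter's
# OpenAI-compatible endpoint. Prompt and model slug are illustrative;
# this is not the prompt used by app.py.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_KEY",  # hypothetical placeholder
)
response = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user",
               "content": "Translate into German: The weather is nice today."}],
)
print(response.choices[0].message.content)
```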
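The BLEU step is easy to reproduce in isolation. A small sketch using `sacrebleu` (the methodology names the metric, not the library, so the library choice is an assumption; the sentences are made up):

```python
# Sketch: score a model translation against a human reference with BLEU.
# sacrebleu is an assumed library choice; sentences are illustrative.
import sacrebleu

hypotheses = ["The cat sits on the mat."]           # model outputs
references = [["The cat is sitting on the mat."]]   # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # 0-100 scale; higher is better
```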
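And the three resource tiers are quartile cuts over per-language BLEU. A sketch with hypothetical scores:

```python
# Sketch: assign the three tiers described above by cutting per-language
# BLEU at the 25th and 75th percentiles. Languages and scores are
# hypothetical.
import numpy as np

bleu_by_language = {"eng": 55.0, "deu": 41.2, "swh": 18.7, "nya": 7.3}
q25, q75 = np.percentile(list(bleu_by_language.values()), [25, 75])

def tier(score: float) -> str:
    if score >= q75:
        return "High-Resource"  # top 25% (easiest to translate)
    if score >= q25:
        return "Mid-Resource"   # middle 50%
    return "Low-Resource"       # bottom 25% (hardest to translate)

for lang, score in sorted(bleu_by_language.items(), key=lambda kv: -kv[1]):
    print(f"{lang}: {tier(score)}")
```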