davidpomerenke committed on
Commit 2cdada4 · verified · 1 Parent(s): 68a93b5

Upload from GitHub Actions: Merge pull request #22 from datenlabor-bmz/dev

.github/workflows/nightly-evals.yml CHANGED
@@ -1,13 +1,15 @@
 name: Nightly Evaluation Run
 
 on:
-  schedule:
-    - cron: '0 3 * * *' # Run at 3am UTC every day
+  # schedule:
+  #   - cron: '0 3 * * *' # Run at 3am UTC every day
   workflow_dispatch: # Allow manual triggering
 
 jobs:
   run-evals:
     runs-on: ubuntu-latest
+    # checking if this is working in case eval runs take longer than 6h github actions allowance
+    timeout-minutes: 1440 # 24 hours timeout
     steps:
       - uses: actions/checkout@v3
 
@@ -25,38 +27,16 @@ jobs:
         env:
           OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
           HUGGINGFACE_ACCESS_TOKEN: ${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}
+          N_SENTENCES: 20
+          MAX_LANGUAGES: 150
         run: |
           uv run huggingface-cli login --token ${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}
           uv run evals/download_data.py
           uv run evals/main.py
 
-      - name: Commit changes
-        env:
-          GH_PAT: ${{ secrets.GH_PAT }}
-        run: |
-          git config --local user.email "github-actions[bot]@users.noreply.github.com"
-          git config --local user.name "github-actions[bot]"
-          git config --local --unset-all http.https://github.com/.extraheader
-          git remote set-url origin https://${GH_PAT}@github.com/datenlabor-bmz/ai-language-monitor.git
-          git add results.json models.json languages.json
-          git commit -m "Update evaluation results" || echo "No changes to commit"
-          git push origin HEAD:main
-
-      - name: Upload to Hugging Face
+      - name: Restart HuggingFace Space
         env:
           HUGGINGFACE_ACCESS_TOKEN: ${{ secrets.HUGGINGFACE_ACCESS_TOKEN }}
         run: |
-          uv run python -c '
-          from huggingface_hub import upload_folder
-          import os
-
-          upload_folder(
-              folder_path=".",
-              path_in_repo="/",
-              allow_patterns=["results.json", "models.json", "languages.json"],
-              repo_id="fair-forward/evals-for-every-language",
-              repo_type="space",
-              token=os.environ["HUGGINGFACE_ACCESS_TOKEN"],
-              commit_message="Upload from nightly evaluation run",
-          )
-          '
+          curl -X POST "https://huggingface.co/api/spaces/fair-forward/evals-for-every-language/restart" \
+            -H "Authorization: Bearer $HUGGINGFACE_ACCESS_TOKEN"
.gitignore CHANGED
@@ -1,3 +1,4 @@
+results/
 data/translations/
 floresp-*
 fleurs
@@ -5,6 +6,8 @@ spbleu
 .cache
 .env
 *_credentials.json
+models_unfiltered.json
+**/*.DS_Store
 
 # Python-generated files
 __pycache__/
@@ -20,3 +23,5 @@ wheels/
 # folders and files to be ignored
 .specstory/
 .cursorindexingignore
+
+
Dockerfile CHANGED
@@ -14,7 +14,7 @@ ENV HOME=/home/user \
 RUN mkdir -p ${UV_CACHE_DIR} && chown -R user:user ${HOME}
 USER user
 WORKDIR $HOME/app
-COPY --chown=user pyproject.toml uv.lock ./
+COPY --chown=user pyproject.toml uv.lock README.md ./
 RUN uv sync --frozen --no-dev
 COPY --chown=user evals/ evals/
 COPY --chown=user --from=build /frontend/build /home/user/app/frontend/build
README.md CHANGED
@@ -45,6 +45,7 @@ _Tracking language proficiency of AI models for every language_
 
 ## Evaluate
 
+### Local Development
 ```bash
 uv run --extra dev evals/main.py
 ```
@@ -55,3 +56,7 @@ uv run --extra dev evals/main.py
 uv run evals/backend.py
 cd frontend && npm i && npm start
 ```
+
+## System Architecture
+
+See [notes/system-architecture-diagram.md](notes/system-architecture-diagram.md) for the complete system architecture diagram and component descriptions.
data/datasets.json ADDED
@@ -0,0 +1,783 @@
 
 
1
+ [
2
+ {
3
+ "name": "FLORES+",
4
+ "author": "Meta",
5
+ "author_url": "https://ai.meta.com",
6
+ "url": "https://huggingface.co/datasets/openlanguagedata/flores_plus",
7
+ "n_languages": 200,
8
+ "tasks": [
9
+ "translation"
10
+ ],
11
+ "parallel": true,
12
+ "translation": "human",
13
+ "base": "FLORES",
14
+ "implemented": true,
15
+ "group": "Translation"
16
+ },
17
+ {
18
+ "name": "SIB-200",
19
+ "author": "Academic",
20
+ "author_url": null,
21
+ "url": "https://huggingface.co/datasets/Davlan/sib200",
22
+ "n_languages": 200,
23
+ "tasks": [
24
+ "classification"
25
+ ],
26
+ "parallel": true,
27
+ "translation": "human",
28
+ "base": "FLORES",
29
+ "implemented": true,
30
+ "group": "Translation"
31
+ },
32
+ {
33
+ "name": "CCAligned",
34
+ "author": "Meta",
35
+ "author_url": "https://ai.meta.com",
36
+ "url": "https://huggingface.co/datasets/ahelk/ccaligned_multilingual",
37
+ "n_languages": 137,
38
+ "tasks": [
39
+ "translation"
40
+ ],
41
+ "parallel": false,
42
+ "group": "Translation"
43
+ },
44
+ {
45
+ "name": "OPUS Collection",
46
+ "author": "Helsinki NLP",
47
+ "author_url": null,
48
+ "url": "https://opus.nlpl.eu",
49
+ "n_languages": 747,
50
+ "tasks": [
51
+ "translation"
52
+ ],
53
+ "parallel": false,
54
+ "group": "Translation"
55
+ },
56
+ {
57
+ "name": "Global MMLU",
58
+ "author": "Cohere",
59
+ "author_url": "https://cohere.com",
60
+ "url": "https://huggingface.co/datasets/CohereForAI/Global-MMLU",
61
+ "n_languages": 42,
62
+ "languages": [
63
+ "am",
64
+ "ar",
65
+ "bn",
66
+ "cs",
67
+ "de",
68
+ "el",
69
+ "en",
70
+ "es",
71
+ "fa",
72
+ "fil",
73
+ "fr",
74
+ "ha",
75
+ "he",
76
+ "hi",
77
+ "id",
78
+ "ig",
79
+ "it",
80
+ "ja",
81
+ "ko",
82
+ "ky",
83
+ "lt",
84
+ "mg",
85
+ "ms",
86
+ "ne",
87
+ "nl",
88
+ "ny",
89
+ "pl",
90
+ "pt",
91
+ "ro",
92
+ "ru",
93
+ "si",
94
+ "sn",
95
+ "so",
96
+ "sr",
97
+ "sv",
98
+ "sw",
99
+ "te",
100
+ "tr",
101
+ "uk",
102
+ "vi",
103
+ "yo",
104
+ "zh"
105
+ ],
106
+ "tasks": [
107
+ "question_answering"
108
+ ],
109
+ "parallel": true,
110
+ "translation": "mixed",
111
+ "base": "MMLU",
112
+ "implemented": true,
113
+ "group": "Multitask Language Understanding"
114
+ },
115
+ {
116
+ "name": "MMMLU",
117
+ "author": "OpenAI",
118
+ "author_url": "https://openai.com",
119
+ "url": "https://huggingface.co/datasets/openai/MMMLU",
120
+ "n_languages": "14",
121
+ "languages": [
122
+ "ar",
123
+ "bn",
124
+ "de",
125
+ "es",
126
+ "fr",
127
+ "hi",
128
+ "id",
129
+ "it",
130
+ "ja",
131
+ "ko",
132
+ "pt",
133
+ "sw",
134
+ "yo",
135
+ "zh"
136
+ ],
137
+ "tasks": [
138
+ "question_answering"
139
+ ],
140
+ "parallel": true,
141
+ "translation": "human",
142
+ "base": "MMLU",
143
+ "implemented": true,
144
+ "group": "Multitask Language Understanding"
145
+ },
146
+ {
147
+ "name": "AfriMMLU",
148
+ "author": "Masakhane",
149
+ "author_url": "https://www.masakhane.io",
150
+ "url": "https://huggingface.co/datasets/masakhane/afrimmlu",
151
+ "n_languages": "17",
152
+ "languages": [
153
+ "am",
154
+ "en",
155
+ "ee",
156
+ "fr",
157
+ "ha",
158
+ "ig",
159
+ "rw",
160
+ "ln",
161
+ "lg",
162
+ "om",
163
+ "sn",
164
+ "st",
165
+ "sw",
166
+ "tw",
167
+ "wo",
168
+ "xh",
169
+ "yo",
170
+ "zu"
171
+ ],
172
+ "tasks": [
173
+ "question_answering"
174
+ ],
175
+ "parallel": true,
176
+ "translation": "human",
177
+ "base": "MMLU",
178
+ "implemented": true,
179
+ "group": "Multitask Language Understanding"
180
+ },
181
+ {
182
+ "name": "Okapi MMLU",
183
+ "author": "Academic",
184
+ "author_url": null,
185
+ "url": "https://huggingface.co/datasets/jon-tow/okapi_mmlu",
186
+ "n_languages": 26,
187
+ "languages": [
188
+ "ar",
189
+ "bn",
190
+ "ca",
191
+ "da",
192
+ "de",
193
+ "es",
194
+ "eu",
195
+ "fr",
196
+ "gu",
197
+ "hi",
198
+ "hr",
199
+ "hu",
200
+ "hy",
201
+ "id",
202
+ "it",
203
+ "kn",
204
+ "ml",
205
+ "mr",
206
+ "ne",
207
+ "nl",
208
+ "pt",
209
+ "ro",
210
+ "ru",
211
+ "sk",
212
+ "sr",
213
+ "sv",
214
+ "ta",
215
+ "te",
216
+ "uk",
217
+ "vi",
218
+ "zh"
219
+ ],
220
+ "tasks": [
221
+ "question_answering"
222
+ ],
223
+ "parallel": true,
224
+ "translation": "machine",
225
+ "base": "MMLU",
226
+ "implemented": false,
227
+ "group": "Multitask Language Understanding"
228
+ },
229
+ {
230
+ "name": "MMLU-X",
231
+ "author": "OpenGPT-X",
232
+ "author_url": "https://opengpt-x.de",
233
+ "url": "https://huggingface.co/datasets/openGPT-X/mmlux",
234
+ "n_languages": 20,
235
+ "languages": [
236
+ "bg",
237
+ "cs",
238
+ "da",
239
+ "de",
240
+ "el",
241
+ "es",
242
+ "et",
243
+ "fi",
244
+ "fr",
245
+ "hu",
246
+ "it",
247
+ "lt",
248
+ "lv",
249
+ "nl",
250
+ "pl",
251
+ "pt",
252
+ "ro",
253
+ "sk",
254
+ "sl",
255
+ "sv"
256
+ ],
257
+ "tasks": [
258
+ "question_answering"
259
+ ],
260
+ "parallel": true,
261
+ "translation": "machine",
262
+ "base": "MMLU",
263
+ "implemented": false,
264
+ "group": "Multitask Language Understanding"
265
+ },
266
+ {
267
+ "name": "MMLU Auto-Translated",
268
+ "author": null,
269
+ "author_url": null,
270
+ "url": null,
271
+ "n_languages": null,
272
+ "tasks": [
273
+ "question_answering"
274
+ ],
275
+ "parallel": true,
276
+ "translation": "machine",
277
+ "base": "MMLU",
278
+ "implemented": true,
279
+ "group": "Multitask Language Understanding"
280
+ },
281
+ {
282
+ "name": "MGSM",
283
+ "author": "Google",
284
+ "author_url": "https://google.com",
285
+ "url": "https://huggingface.co/datasets/juletxara/mgsm",
286
+ "n_languages": 10,
287
+ "tasks": [
288
+ "math"
289
+ ],
290
+ "parallel": true,
291
+ "base": "MGSM",
292
+ "implemented": true,
293
+ "group": "Grade School Math"
294
+ },
295
+ {
296
+ "name": "AfriMGSM",
297
+ "author": "Masakhane",
298
+ "author_url": "https://www.masakhane.io",
299
+ "url": "https://huggingface.co/datasets/masakhane/afrimgsm",
300
+ "n_languages": 18,
301
+ "tasks": [
302
+ "math"
303
+ ],
304
+ "parallel": true,
305
+ "translation": "human",
306
+ "base": "MGSM",
307
+ "implemented": true,
308
+ "group": "Grade School Math"
309
+ },
310
+ {
311
+ "name": "GSM8K-X",
312
+ "author": "OpenGPT-X",
313
+ "author_url": "https://opengpt-x.de",
314
+ "url": "https://huggingface.co/datasets/openGPT-X/gsm8kx",
315
+ "n_languages": 20,
316
+ "tasks": [
317
+ "math"
318
+ ],
319
+ "parallel": true,
320
+ "translation": "machine",
321
+ "base": "MGSM",
322
+ "implemented": true,
323
+ "group": "Grade School Math"
324
+ },
325
+ {
326
+ "name": "GSM Auto-Translated",
327
+ "author": null,
328
+ "author_url": null,
329
+ "url": null,
330
+ "n_languages": 52,
331
+ "tasks": [
332
+ "math"
333
+ ],
334
+ "parallel": true,
335
+ "translation": "machine",
336
+ "base": "MGSM",
337
+ "implemented": true,
338
+ "group": "Grade School Math"
339
+ },
340
+ {
341
+ "name": "Uhuru ARC Easy",
342
+ "author": "Masakhane",
343
+ "author_url": "https://www.masakhane.io",
344
+ "url": "https://huggingface.co/datasets/masakhane/uhura-arc-easy",
345
+ "n_languages": 6,
346
+ "tasks": [
347
+ "question_answering"
348
+ ],
349
+ "parallel": true,
350
+ "translation": "human",
351
+ "base": "AI2 ARC",
352
+ "implemented": true,
353
+ "group": "ARC Question Answering"
354
+ },
355
+ {
356
+ "name": "Okapi ARC Challenge",
357
+ "author": "Academic",
358
+ "author_url": null,
359
+ "url": "https://huggingface.co/datasets/jon-tow/okapi_arc_challenge",
360
+ "n_languages": 31,
361
+ "tasks": [
362
+ "question_answering"
363
+ ],
364
+ "parallel": true,
365
+ "translation": "machine",
366
+ "base": "AI2 ARC",
367
+ "implemented": false,
368
+ "group": "ARC Question Answering"
369
+ },
370
+ {
371
+ "name": "Arc-X",
372
+ "author": "OpenGPT-X",
373
+ "author_url": "https://opengpt-x.de",
374
+ "url": "https://huggingface.co/datasets/openGPT-X/arcx",
375
+ "n_languages": 20,
376
+ "tasks": [
377
+ "question_answering"
378
+ ],
379
+ "parallel": true,
380
+ "translation": "machine",
381
+ "base": "AI2 ARC",
382
+ "implemented": false,
383
+ "group": "ARC Question Answering"
384
+ },
385
+ {
386
+ "name": "ARC-Easy Auto-Translated",
387
+ "author": null,
388
+ "author_url": null,
389
+ "url": null,
390
+ "n_languages": null,
391
+ "tasks": [
392
+ "question_answering"
393
+ ],
394
+ "parallel": true,
395
+ "translation": "machine",
396
+ "base": "AI2 ARC",
397
+ "implemented": true,
398
+ "group": "ARC Question Answering"
399
+ },
400
+ {
401
+ "name": "Uhura TruthfulQA",
402
+ "author": "Masakhane",
403
+ "author_url": "https://www.masakhane.io",
404
+ "url": "https://huggingface.co/datasets/masakhane/uhura-truthfulqa",
405
+ "n_languages": 6,
406
+ "tasks": [
407
+ "question_answering"
408
+ ],
409
+ "parallel": true,
410
+ "translation": "human",
411
+ "base": "TruthfulQA",
412
+ "implemented": true,
413
+ "group": "Truthfulness"
414
+ },
415
+ {
416
+ "name": "Okapi TruthfulQA",
417
+ "author": "Academic",
418
+ "author_url": null,
419
+ "url": "https://huggingface.co/datasets/jon-tow/okapi_truthfulqa/tree/main/data",
420
+ "n_languages": 31,
421
+ "tasks": [
422
+ "question_answering"
423
+ ],
424
+ "parallel": true,
425
+ "translation": "machine",
426
+ "base": "TruthfulQA",
427
+ "implemented": false,
428
+ "group": "Truthfulness"
429
+ },
430
+ {
431
+ "name": "TruthfulQA-X",
432
+ "author": "OpenGPT-X",
433
+ "author_url": "https://opengpt-x.de",
434
+ "url": "https://huggingface.co/datasets/openGPT-X/truthfulqax",
435
+ "n_languages": 20,
436
+ "tasks": [
437
+ "question_answering"
438
+ ],
439
+ "parallel": true,
440
+ "translation": "machine",
441
+ "base": "TruthfulQA",
442
+ "implemented": false,
443
+ "group": "Truthfulness"
444
+ },
445
+ {
446
+ "name": "TruthfulQA Auto-Translated",
447
+ "author": null,
448
+ "author_url": null,
449
+ "url": null,
450
+ "n_languages": null,
451
+ "tasks": [
452
+ "question_answering"
453
+ ],
454
+ "parallel": true,
455
+ "translation": "machine",
456
+ "base": "TruthfulQA",
457
+ "implemented": true,
458
+ "group": "Truthfulness"
459
+ },
460
+ {
461
+ "name": "FLEURS",
462
+ "author": "Meta",
463
+ "author_url": "https://ai.meta.com",
464
+ "url": "https://huggingface.co/datasets/google/fleurs",
465
+ "n_languages": 102,
466
+ "tasks": [
467
+ "speech_recognition"
468
+ ],
469
+ "parallel": true,
470
+ "translation": "human",
471
+ "base": "FLORES",
472
+ "implemented": false,
473
+ "group": "Speech Recognition"
474
+ },
475
+ {
476
+ "name": "CommonVoice",
477
+ "author": "Mozilla",
478
+ "author_url": "https://blog.mozilla.ai",
479
+ "url": "https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0",
480
+ "n_languages": 124,
481
+ "tasks": [
482
+ "speech_recognition"
483
+ ],
484
+ "parallel": null,
485
+ "translation": "human",
486
+ "group": "Speech Recognition"
487
+ },
488
+ {
489
+ "name": "WorldCuisines",
490
+ "author": "Academic",
491
+ "author_url": "https://worldcuisines.github.io",
492
+ "url": "https://huggingface.co/datasets/worldcuisines/vqa",
493
+ "n_languages": 30,
494
+ "tasks": [
495
+ "visual_question_answering"
496
+ ],
497
+ "parallel": null,
498
+ "group": "Visual Question Answering"
499
+ },
500
+ {
501
+ "name": "CVQA",
502
+ "author": "Academic",
503
+ "author_url": null,
504
+ "url": "https://huggingface.co/datasets/afaji/cvqa",
505
+ "n_languages": 39,
506
+ "tasks": [
507
+ "visual_question_answering"
508
+ ],
509
+ "parallel": null,
510
+ "group": "Visual Question Answering"
511
+ },
512
+ {
513
+ "name": "XNLI",
514
+ "author": "Meta",
515
+ "author_url": "https://ai.meta.com",
516
+ "url": "https://huggingface.co/datasets/facebook/xnli",
517
+ "n_languages": 14,
518
+ "tasks": [
519
+ "classification",
520
+ "logic"
521
+ ],
522
+ "parallel": true,
523
+ "base": "MNLI",
524
+ "group": "Natural Language Inference"
525
+ },
526
+ {
527
+ "name": "AfriXNLI",
528
+ "author": "Masakhane",
529
+ "author_url": "https://www.masakhane.io",
530
+ "url": "https://huggingface.co/datasets/masakhane/afrixnli",
531
+ "n_languages": 18,
532
+ "tasks": [
533
+ "classification",
534
+ "logic"
535
+ ],
536
+ "parallel": true,
537
+ "translation": "human",
538
+ "base": "MNLI",
539
+ "implemented": false,
540
+ "group": "Natural Language Inference"
541
+ },
542
+ {
543
+ "name": "XGLUE",
544
+ "author": "Microsoft",
545
+ "author_url": "https://microsoft.ai",
546
+ "url": "https://huggingface.co/datasets/microsoft/xglue",
547
+ "n_languages": 18,
548
+ "tasks": [
549
+ "pos"
550
+ ],
551
+ "parallel": null,
552
+ "base": "GLUE",
553
+ "group": "General Language Understanding"
554
+ },
555
+ {
556
+ "name": "IndicGLUE",
557
+ "author": "AI4Bharat",
558
+ "author_url": "https://models.ai4bharat.org",
559
+ "url": "https://huggingface.co/datasets/ai4bharat/indic_glue",
560
+ "n_languages": 11,
561
+ "tasks": [
562
+ "question_answering"
563
+ ],
564
+ "parallel": null,
565
+ "base": "GLUE",
566
+ "group": "General Language Understanding"
567
+ },
568
+ {
569
+ "name": "Okapi HellaSwag",
570
+ "author": "Academic",
571
+ "author_url": null,
572
+ "url": "https://huggingface.co/datasets/jon-tow/okapi_hellaswag",
573
+ "n_languages": 31,
574
+ "tasks": [
575
+ "question_answering"
576
+ ],
577
+ "parallel": true,
578
+ "translation": "machine",
579
+ "base": "HellaSwag",
580
+ "implemented": false,
581
+ "group": "Adversarial Language Modelling"
582
+ },
583
+ {
584
+ "name": "HellaSwag-X",
585
+ "author": "OpenGPT-X",
586
+ "author_url": "https://opengpt-x.de",
587
+ "url": "https://huggingface.co/datasets/openGPT-X/hellaswagx",
588
+ "n_languages": 20,
589
+ "tasks": [
590
+ "question_answering"
591
+ ],
592
+ "parallel": true,
593
+ "translation": "machine",
594
+ "base": "HellaSwag",
595
+ "implemented": false,
596
+ "group": "Adversarial Language Modelling"
597
+ },
598
+ {
599
+ "name": "WikiANN / PAN-X",
600
+ "author": "Academic",
601
+ "author_url": null,
602
+ "url": "https://huggingface.co/datasets/unimelb-nlp/wikiann",
603
+ "n_languages": 176,
604
+ "tasks": [
605
+ "ner"
606
+ ],
607
+ "parallel": false,
608
+ "group": "Named Entity Recognition"
609
+ },
610
+ {
611
+ "name": "MasakhaNER",
612
+ "author": "Masakhane",
613
+ "author_url": "https://www.masakhane.io",
614
+ "url": "https://huggingface.co/datasets/masakhane/masakhaner",
615
+ "n_languages": 10,
616
+ "tasks": [
617
+ "ner"
618
+ ],
619
+ "parallel": null,
620
+ "group": "Named Entity Recognition"
621
+ },
622
+ {
623
+ "name": "TΓΌlu 3 SFT Mixture",
624
+ "author": "AllenAI",
625
+ "author_url": "https://allenai.org",
626
+ "url": "https://huggingface.co/datasets/allenai/tulu-3-sft-mixture",
627
+ "n_languages": 70,
628
+ "tasks": [
629
+ "instruction_following"
630
+ ],
631
+ "parallel": false,
632
+ "group": "Instruction Following"
633
+ },
634
+ {
635
+ "name": "xP3",
636
+ "author": "BigScience",
637
+ "author_url": "https://bigscience.huggingface.co",
638
+ "url": "https://huggingface.co/datasets/bigscience/xP3",
639
+ "n_languages": 46,
640
+ "tasks": [
641
+ "instruction_following"
642
+ ],
643
+ "parallel": false,
644
+ "group": "Instruction Following"
645
+ },
646
+ {
647
+ "name": "Aya",
648
+ "author": "Cohere",
649
+ "author_url": "https://cohere.com",
650
+ "url": "https://huggingface.co/datasets/CohereForAI/aya_dataset",
651
+ "n_languages": 65,
652
+ "tasks": [
653
+ "instruction_following"
654
+ ],
655
+ "parallel": null,
656
+ "group": "Instruction Following"
657
+ },
658
+ {
659
+ "name": "SEA-IFEVAL",
660
+ "author": "AI Singapore",
661
+ "author_url": "https://aisingapore.org",
662
+ "url": "https://huggingface.co/datasets/aisingapore/instruction_following-ifeval",
663
+ "n_languages": 7,
664
+ "tasks": [
665
+ "instruction_following"
666
+ ],
667
+ "parallel": true,
668
+ "base": "IFEVAL",
669
+ "group": "Instruction Following"
670
+ },
671
+ {
672
+ "name": "Babel-670",
673
+ "author": "Academic",
674
+ "author_url": null,
675
+ "url": "https://github.com/UBC-NLP/Babel-670-Language-Identification",
676
+ "n_languages": 670,
677
+ "tasks": [
678
+ "language_identification"
679
+ ],
680
+ "parallel": false,
681
+ "group": "Other Tasks"
682
+ },
683
+ {
684
+ "name": "CulturaX",
685
+ "author": "Academic",
686
+ "author_url": null,
687
+ "url": "https://huggingface.co/datasets/uonlp/CulturaX",
688
+ "n_languages": 167,
689
+ "tasks": [
690
+ "language_modeling"
691
+ ],
692
+ "parallel": false,
693
+ "group": "Other Tasks"
694
+ },
695
+ {
696
+ "name": "XTREME",
697
+ "author": "Google",
698
+ "author_url": "https://google.com",
699
+ "url": "https://huggingface.co/datasets/google/xtreme",
700
+ "n_languages": 40,
701
+ "tasks": [
702
+ "translation",
703
+ "classification",
704
+ "question_answering",
705
+ "ner"
706
+ ],
707
+ "parallel": null,
708
+ "group": "Other Tasks"
709
+ },
710
+ {
711
+ "name": "XLSUM",
712
+ "author": "Academic",
713
+ "author_url": null,
714
+ "url": "https://huggingface.co/datasets/csebuetnlp/xlsum",
715
+ "n_languages": 45,
716
+ "tasks": [
717
+ "summarization"
718
+ ],
719
+ "parallel": true,
720
+ "group": "Other Tasks"
721
+ },
722
+ {
723
+ "name": "MSVAMP",
724
+ "author": "Microsoft",
725
+ "author_url": "https://microsoft.ai",
726
+ "url": "https://huggingface.co/datasets/Mathoctopus/MSVAMP",
727
+ "n_languages": 10,
728
+ "tasks": [
729
+ "math"
730
+ ],
731
+ "parallel": true,
732
+ "group": "Other Tasks"
733
+ },
734
+ {
735
+ "name": "Multilingual Sentiments",
736
+ "author": "Academic",
737
+ "author_url": null,
738
+ "url": "https://huggingface.co/datasets/tyqiangz/multilingual-sentiments",
739
+ "n_languages": 12,
740
+ "tasks": [
741
+ "sentiment_analysis"
742
+ ],
743
+ "parallel": null,
744
+ "group": "Other Tasks"
745
+ },
746
+ {
747
+ "name": "Lanfrica",
748
+ "author": "Lanfrica",
749
+ "author_url": "https://lanfrica.com",
750
+ "url": "https://lanfrica.com/records?language=yor&task=machine%20translation",
751
+ "n_languages": 2200,
752
+ "tasks": [
753
+ "datasets"
754
+ ],
755
+ "parallel": null,
756
+ "group": "Dataset Collections"
757
+ },
758
+ {
759
+ "name": "HuggingFace Languages",
760
+ "author": "HuggingFace",
761
+ "author_url": "https://huggingface.co",
762
+ "url": "https://huggingface.co/languages",
763
+ "n_languages": 4680,
764
+ "tasks": [
765
+ "datasets",
766
+ "models"
767
+ ],
768
+ "parallel": null,
769
+ "group": "Dataset Collections"
770
+ },
771
+ {
772
+ "name": "HuggingFace Multilingual Datasets",
773
+ "author": "HuggingFace",
774
+ "author_url": "https://huggingface.co",
775
+ "url": "https://huggingface.co/datasets?other=multilinguality:multilingual",
776
+ "n_languages": 2012,
777
+ "tasks": [
778
+ "datasets"
779
+ ],
780
+ "parallel": false,
781
+ "group": "Dataset Collections"
782
+ }
783
+ ]
evals/__init__.py CHANGED
@@ -1 +0,0 @@
1
-
 
 
evals/backend.py CHANGED
@@ -4,16 +4,18 @@ import os
4
  import numpy as np
5
  import pandas as pd
6
  import uvicorn
 
7
  from countries import make_country_table
 
8
  from fastapi import FastAPI, Request
9
  from fastapi.middleware.cors import CORSMiddleware
10
  from fastapi.middleware.gzip import GZipMiddleware
11
  from fastapi.responses import JSONResponse
12
  from fastapi.staticfiles import StaticFiles
13
 
14
- scores = pd.read_json("results.json")
15
- languages = pd.read_json("languages.json")
16
- models = pd.read_json("models.json")
17
 
18
 
19
  def mean(lst):
@@ -26,7 +28,7 @@ task_metrics = [
26
  "classification_accuracy",
27
  "mmlu_accuracy",
28
  "arc_accuracy",
29
- # "truthfulqa_accuracy",
30
  "mgsm_accuracy",
31
  ]
32
 
@@ -39,28 +41,58 @@ def compute_normalized_average(df, metrics):
39
  col_min = normalized_df[col].min()
40
  col_max = normalized_df[col].max()
41
  if col_max > col_min: # Avoid division by zero
42
- normalized_df[col] = (normalized_df[col] - col_min) / (col_max - col_min)
 
 
43
  else:
44
  normalized_df[col] = 0 # If all values are the same, set to 0
45
  return normalized_df.mean(axis=1, skipna=False)
46
 
47
 
48
- def make_model_table(df, models):
49
- df = (
50
- df.groupby(["model", "task", "metric"])
51
- .agg({"score": "mean", "bcp_47": "nunique"})
52
- .reset_index()
 
 
 
 
 
 
 
 
 
 
 
 
 
 
53
  )
54
- df["task_metric"] = df["task"] + "_" + df["metric"]
55
- df = df.drop(columns=["task", "metric"])
56
- df = df.pivot(index="model", columns="task_metric", values="score")
 
57
  for metric in task_metrics:
58
  if metric not in df.columns:
59
  df[metric] = np.nan
 
60
  df["average"] = compute_normalized_average(df, task_metrics)
 
 
 
 
 
61
  df = df.sort_values(by="average", ascending=False).reset_index()
62
  df = pd.merge(df, models, left_on="model", right_on="id", how="left")
63
  df["rank"] = df.index + 1
 
 
 
 
 
 
 
64
  df = df[
65
  [
66
  "rank",
@@ -74,27 +106,41 @@ def make_model_table(df, models):
74
  "license",
75
  "cost",
76
  "average",
77
- *task_metrics,
78
  ]
79
  ]
80
  return df
81
 
82
 
83
- def make_language_table(df, languages):
84
- df = (
85
- df.groupby(["bcp_47", "task", "metric"])
86
- .agg({"score": "mean", "model": "nunique"})
87
- .reset_index()
 
 
88
  )
89
- df["task_metric"] = df["task"] + "_" + df["metric"]
90
- df = df.drop(columns=["task", "metric"])
91
- df = df.pivot(index="bcp_47", columns="task_metric", values="score").reset_index()
 
 
 
 
 
 
92
  for metric in task_metrics:
93
  if metric not in df.columns:
94
  df[metric] = np.nan
 
95
  df["average"] = compute_normalized_average(df, task_metrics)
96
  df = pd.merge(languages, df, on="bcp_47", how="outer")
97
  df = df.sort_values(by="speakers", ascending=False)
 
 
 
 
 
98
  df = df[
99
  [
100
  "bcp_47",
@@ -104,7 +150,7 @@ def make_language_table(df, languages):
104
  "family",
105
  "average",
106
  "in_benchmark",
107
- *task_metrics,
108
  ]
109
  ]
110
  return df
@@ -125,35 +171,39 @@ async def data(request: Request):
125
  body = await request.body()
126
  data = json.loads(body)
127
  selected_languages = data.get("selectedLanguages", {})
128
- df = scores.groupby(["model", "bcp_47", "task", "metric"]).mean().reset_index()
129
- # lang_results = pd.merge(languages, lang_results, on="bcp_47", how="outer")
130
- language_table = make_language_table(df, languages)
131
- datasets_df = pd.read_json("datasets.json")
132
- if selected_languages:
133
- # the filtering is only applied for the model table and the country data
134
- df = df[df["bcp_47"].isin(lang["bcp_47"] for lang in selected_languages)]
 
 
 
 
135
  if len(df) == 0:
136
  model_table = pd.DataFrame()
137
  countries = pd.DataFrame()
138
  else:
139
  model_table = make_model_table(df, models)
140
  countries = make_country_table(make_language_table(df, languages))
141
- all_tables = {
 
 
 
 
142
  "model_table": serialize(model_table),
143
  "language_table": serialize(language_table),
144
  "dataset_table": serialize(datasets_df),
145
  "countries": serialize(countries),
146
- }
147
- return JSONResponse(content=all_tables)
148
 
149
 
150
- # Only serve static files if build directory exists (production mode)
151
  if os.path.exists("frontend/build"):
152
  app.mount("/", StaticFiles(directory="frontend/build", html=True), name="frontend")
153
- else:
154
- print("πŸ§ͺ Development mode: frontend/build directory not found")
155
- print("🌐 Frontend should be running on http://localhost:3000")
156
- print("πŸ“‘ API available at http://localhost:8000/api/data")
157
 
158
  if __name__ == "__main__":
159
  uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8000)))
 
4
  import numpy as np
5
  import pandas as pd
6
  import uvicorn
7
+
8
  from countries import make_country_table
9
+ from datasets_.util import load
10
  from fastapi import FastAPI, Request
11
  from fastapi.middleware.cors import CORSMiddleware
12
  from fastapi.middleware.gzip import GZipMiddleware
13
  from fastapi.responses import JSONResponse
14
  from fastapi.staticfiles import StaticFiles
15
 
16
+ scores = load("results")
17
+ languages = load("languages")
18
+ models = load("models")
19
 
20
 
21
  def mean(lst):
 
28
  "classification_accuracy",
29
  "mmlu_accuracy",
30
  "arc_accuracy",
31
+ "truthfulqa_accuracy",
32
  "mgsm_accuracy",
33
  ]
34
 
 
41
  col_min = normalized_df[col].min()
42
  col_max = normalized_df[col].max()
43
  if col_max > col_min: # Avoid division by zero
44
+ normalized_df[col] = (normalized_df[col] - col_min) / (
45
+ col_max - col_min
46
+ )
47
  else:
48
  normalized_df[col] = 0 # If all values are the same, set to 0
49
  return normalized_df.mean(axis=1, skipna=False)
50
 
51
 
52
+ def make_model_table(scores_df, models):
53
+ scores_df = scores_df.copy()
54
+ # Create a combined task_metric for origin
55
+ scores_df["task_metric_origin"] = (
56
+ scores_df["task"] + "_" + scores_df["metric"] + "_" + scores_df["origin"]
57
+ )
58
+
59
+ # Pivot to get scores for each origin-specific metric
60
+ scores_pivot = scores_df.pivot_table(
61
+ index="model",
62
+ columns="task_metric_origin",
63
+ values="score",
64
+ aggfunc="mean",
65
+ )
66
+
67
+ # Create the regular task_metric for the main average calculation
68
+ scores_df["task_metric"] = scores_df["task"] + "_" + scores_df["metric"]
69
+ main_pivot = scores_df.pivot_table(
70
+ index="model", columns="task_metric", values="score", aggfunc="mean"
71
  )
72
+
73
+ # Merge the two pivots
74
+ df = pd.merge(main_pivot, scores_pivot, on="model", how="outer")
75
+
76
  for metric in task_metrics:
77
  if metric not in df.columns:
78
  df[metric] = np.nan
79
+
80
  df["average"] = compute_normalized_average(df, task_metrics)
81
+
82
+ # Add flag if any machine-origin data was used
83
+ machine_presence = scores_df[scores_df["origin"] == "machine"].groupby(["model", "task_metric"]).size()
84
+ for metric in task_metrics:
85
+ df[f"{metric}_contains_machine"] = df.index.map(lambda m: (m, metric) in machine_presence.index)
86
  df = df.sort_values(by="average", ascending=False).reset_index()
87
  df = pd.merge(df, models, left_on="model", right_on="id", how="left")
88
  df["rank"] = df.index + 1
89
+
90
+ # Dynamically find all metric columns to include
91
+ final_cols = df.columns
92
+ metric_cols = [m for m in final_cols if any(tm in m for tm in task_metrics)]
93
+
94
+ df["creation_date"] = df["creation_date"].apply(lambda x: x.isoformat() if x else None)
95
+
96
  df = df[
97
  [
98
  "rank",
 
106
  "license",
107
  "cost",
108
  "average",
109
+ *sorted(list(set(metric_cols))),
110
  ]
111
  ]
112
  return df
113
 
114
 
115
+ def make_language_table(scores_df, languages):
116
+ scores_df = scores_df.copy()
117
+ scores_df["task_metric"] = scores_df["task"] + "_" + scores_df["metric"]
118
+
119
+ # Pivot scores
120
+ score_pivot = scores_df.pivot_table(
121
+ index="bcp_47", columns="task_metric", values="score", aggfunc="mean"
122
  )
123
+
124
+ # Pivot origins (first origin since each task+lang combo has only one)
125
+ origin_pivot = scores_df.pivot_table(
126
+ index="bcp_47", columns="task_metric", values="origin", aggfunc="first"
127
+ )
128
+ origin_pivot = origin_pivot.add_suffix("_origin")
129
+
130
+ df = pd.merge(score_pivot, origin_pivot, on="bcp_47", how="outer")
131
+
132
  for metric in task_metrics:
133
  if metric not in df.columns:
134
  df[metric] = np.nan
135
+
136
  df["average"] = compute_normalized_average(df, task_metrics)
137
  df = pd.merge(languages, df, on="bcp_47", how="outer")
138
  df = df.sort_values(by="speakers", ascending=False)
139
+
140
+ # Dynamically find all metric columns to include
141
+ final_cols = df.columns
142
+ metric_cols = [m for m in final_cols if any(tm in m for tm in task_metrics)]
143
+
144
  df = df[
145
  [
146
  "bcp_47",
 
150
  "family",
151
  "average",
152
  "in_benchmark",
153
+ *sorted(list(set(metric_cols))),
154
  ]
155
  ]
156
  return df
 
171
  body = await request.body()
172
  data = json.loads(body)
173
  selected_languages = data.get("selectedLanguages", {})
174
+
175
+ # Identify which metrics have machine translations available
176
+ machine_translated_metrics = {
177
+ f"{row['task']}_{row['metric']}"
178
+ for _, row in scores.iterrows()
179
+ if row["origin"] == "machine"
180
+ }
181
+
182
+ # Filter by selected languages if provided
183
+ df = scores[scores["bcp_47"].isin(lang["bcp_47"] for lang in selected_languages)] if selected_languages else scores
184
+
185
  if len(df) == 0:
186
  model_table = pd.DataFrame()
187
  countries = pd.DataFrame()
188
  else:
189
  model_table = make_model_table(df, models)
190
  countries = make_country_table(make_language_table(df, languages))
191
+
192
+ language_table = make_language_table(scores, languages)
193
+ datasets_df = pd.read_json("data/datasets.json")
194
+
195
+ return JSONResponse(content={
196
  "model_table": serialize(model_table),
197
  "language_table": serialize(language_table),
198
  "dataset_table": serialize(datasets_df),
199
  "countries": serialize(countries),
200
+ "machine_translated_metrics": list(machine_translated_metrics),
201
+ })
202
 
203
 
204
+ # Only serve static files if build directory exists
205
  if os.path.exists("frontend/build"):
206
  app.mount("/", StaticFiles(directory="frontend/build", html=True), name="frontend")
 
 
 
 
207
 
208
  if __name__ == "__main__":
209
  uvicorn.run(app, host="0.0.0.0", port=int(os.environ.get("PORT", 8000)))
evals/countries.py CHANGED
@@ -15,6 +15,7 @@ def population(bcp_47):
15
  }
16
  return items
17
 
 
18
  @cache
19
  def make_country_table(language_table):
20
  countries = defaultdict(list)
@@ -30,10 +31,15 @@ def make_country_table(language_table):
30
  )
31
  for country, languages in countries.items():
32
  speaker_pop = sum(entry["population"] for entry in languages)
33
- score = (
34
- sum(entry["score"] * entry["population"] for entry in languages)
35
- / speaker_pop
36
- )
 
 
 
 
 
37
  countries[country] = {
38
  "score": score,
39
  "languages": languages,
 
15
  }
16
  return items
17
 
18
+
19
  @cache
20
  def make_country_table(language_table):
21
  countries = defaultdict(list)
 
31
  )
32
  for country, languages in countries.items():
33
  speaker_pop = sum(entry["population"] for entry in languages)
34
+
35
+ if speaker_pop < 1000: # Grey out low-population countries
36
+ score = None # This will make them appear grey on the map
37
+ else:
38
+ score = (
39
+ sum(entry["score"] * entry["population"] for entry in languages)
40
+ / speaker_pop
41
+ )
42
+
43
  countries[country] = {
44
  "score": score,
45
  "languages": languages,
evals/datasets_/arc.py CHANGED
@@ -1,11 +1,10 @@
1
  import random
2
- from collections import Counter, defaultdict
3
 
4
- from langcodes import Language, standardize_tag
5
  from rich import print
6
- from models import translate_google, google_supported_languages
7
  from tqdm import tqdm
8
- from datasets import Dataset, load_dataset
9
  import asyncio
10
  from tqdm.asyncio import tqdm_asyncio
11
  import os
@@ -14,27 +13,33 @@ from datasets_.util import _get_dataset_config_names, _load_dataset
14
 
15
  slug_uhura_arc_easy = "masakhane/uhura-arc-easy"
16
  tags_uhura_arc_easy = {
17
- standardize_tag(a.split("_")[0], macro=True): a for a in _get_dataset_config_names(slug_uhura_arc_easy)
 
18
  if not a.endswith("unmatched")
19
  }
20
 
21
 
22
  random.seed(42)
23
- id_sets_train = [set(_load_dataset(slug_uhura_arc_easy, tag, split="train")["id"]) for tag in tags_uhura_arc_easy.values()]
 
 
 
24
  common_ids_train = list(sorted(set.intersection(*id_sets_train)))
25
  random.shuffle(common_ids_train)
26
- id_sets_test = [set(_load_dataset(slug_uhura_arc_easy, tag, split="test")["id"]) for tag in tags_uhura_arc_easy.values()]
 
 
 
27
  common_ids_test = list(sorted(set.intersection(*id_sets_test)))
28
  random.shuffle(common_ids_test)
29
 
30
  slug_uhura_arc_easy_translated = "fair-forward/arc-easy-autotranslated"
31
  tags_uhura_arc_easy_translated = {
32
- standardize_tag(a.split("_")[0], macro=True): a for a in _get_dataset_config_names(slug_uhura_arc_easy_translated)
 
33
  }
34
 
35
 
36
-
37
-
38
  def add_choices(row):
39
  row["choices"] = row["choices"]["text"]
40
  return row
@@ -45,37 +50,40 @@ def load_uhura_arc_easy(language_bcp_47, nr):
45
  ds = _load_dataset(slug_uhura_arc_easy, tags_uhura_arc_easy[language_bcp_47])
46
  ds = ds.map(add_choices)
47
  ds = ds.rename_column("answerKey", "answer")
48
- train_ids = common_ids_train[nr:nr+3]
49
- examples = ds["train"].filter(lambda x: x["id"] in train_ids)
50
  task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
51
- return "masakhane/uhura-arc-easy", examples, task
52
  if language_bcp_47 in tags_uhura_arc_easy_translated.keys():
53
- ds = _load_dataset(slug_uhura_arc_easy_translated, tags_uhura_arc_easy_translated[language_bcp_47])
 
 
 
54
  ds = ds.rename_column("answerKey", "answer")
55
- train_ids = common_ids_train[nr:nr+3]
56
- examples = ds["train"].filter(lambda x: x["id"] in train_ids)
57
- # raise Exception(language_bcp_47)
58
  task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
59
- return "fair-forward/arc-easy-autotranslated", examples, task
60
  else:
61
  return None, None, None
62
 
 
63
  def translate_arc(languages):
64
  human_translated = tags_uhura_arc_easy.keys()
65
  untranslated = [
66
  lang
67
- for lang in languages["bcp_47"].values[:100]
68
- if lang not in human_translated and lang in google_supported_languages
69
  ]
70
  n_samples = 10
71
- train_ids = common_ids_train[:n_samples+3]
72
- en_train = _load_dataset(slug_uhura_arc_easy, subset=tags_uhura_arc_easy["en"], split="train")
 
 
73
  en_train = en_train.filter(lambda x: x["id"] in train_ids)
74
  test_ids = common_ids_test[:n_samples]
75
- en_test = _load_dataset(slug_uhura_arc_easy, subset=tags_uhura_arc_easy["en"], split="test")
 
 
76
  en_test = en_test.filter(lambda x: x["id"] in test_ids)
77
  data = {"train": en_train, "test": en_test}
78
-
79
  slug = "fair-forward/arc-easy-autotranslated"
80
  for lang in tqdm(untranslated):
81
  # check if already exists on hub
@@ -84,16 +92,22 @@ def translate_arc(languages):
84
  except (ValueError, Exception):
85
  print(f"Translating {lang}...")
86
  for split, data_en in data.items():
87
- questions_tr = [translate_google(q, "en", lang) for q in data_en["question"]]
 
 
88
  questions_tr = asyncio.run(tqdm_asyncio.gather(*questions_tr))
89
  choices_texts_concatenated = []
90
  for choice in data_en["choices"]:
91
  for option in choice["text"]:
92
  choices_texts_concatenated.append(option)
93
- choices_tr = [translate_google(c, "en", lang) for c in choices_texts_concatenated]
 
 
94
  choices_tr = asyncio.run(tqdm_asyncio.gather(*choices_tr))
95
  # group into chunks of 4
96
- choices_tr = [choices_tr[i:i+4] for i in range(0, len(choices_tr), 4)]
 
 
97
 
98
  ds_lang = Dataset.from_dict(
99
  {
@@ -110,5 +124,8 @@ def translate_arc(languages):
110
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
111
  )
112
  ds_lang.to_json(
113
- f"data/translations/arc/{lang}_{split}.json", lines=False, force_ascii=False, indent=2
 
 
 
114
  )
 
1
  import random
 
2
 
3
+ from langcodes import standardize_tag
4
  from rich import print
5
+ from models import translate_google, get_google_supported_languages
6
  from tqdm import tqdm
7
+ from datasets import load_dataset, Dataset
8
  import asyncio
9
  from tqdm.asyncio import tqdm_asyncio
10
  import os
 
13
 
14
  slug_uhura_arc_easy = "masakhane/uhura-arc-easy"
15
  tags_uhura_arc_easy = {
16
+ standardize_tag(a.split("_")[0], macro=True): a
17
+ for a in _get_dataset_config_names(slug_uhura_arc_easy)
18
  if not a.endswith("unmatched")
19
  }
20
 
21
 
22
  random.seed(42)
23
+ id_sets_train = [
24
+ set(_load_dataset(slug_uhura_arc_easy, tag, split="train")["id"])
25
+ for tag in tags_uhura_arc_easy.values()
26
+ ]
27
  common_ids_train = list(sorted(set.intersection(*id_sets_train)))
28
  random.shuffle(common_ids_train)
29
+ id_sets_test = [
30
+ set(_load_dataset(slug_uhura_arc_easy, tag, split="test")["id"])
31
+ for tag in tags_uhura_arc_easy.values()
32
+ ]
33
  common_ids_test = list(sorted(set.intersection(*id_sets_test)))
34
  random.shuffle(common_ids_test)
35
 
36
  slug_uhura_arc_easy_translated = "fair-forward/arc-easy-autotranslated"
37
  tags_uhura_arc_easy_translated = {
38
+ standardize_tag(a.split("_")[0], macro=True): a
39
+ for a in _get_dataset_config_names(slug_uhura_arc_easy_translated)
40
  }
41
 
42
 
 
 
43
  def add_choices(row):
44
  row["choices"] = row["choices"]["text"]
45
  return row
 
50
  ds = _load_dataset(slug_uhura_arc_easy, tags_uhura_arc_easy[language_bcp_47])
51
  ds = ds.map(add_choices)
52
  ds = ds.rename_column("answerKey", "answer")
 
 
53
  task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
54
+ return "masakhane/uhura-arc-easy", task, "human"
55
  if language_bcp_47 in tags_uhura_arc_easy_translated.keys():
56
+ ds = _load_dataset(
57
+ slug_uhura_arc_easy_translated,
58
+ tags_uhura_arc_easy_translated[language_bcp_47],
59
+ )
60
  ds = ds.rename_column("answerKey", "answer")
 
 
 
61
  task = ds["test"].filter(lambda x: x["id"] == common_ids_test[nr])[0]
62
+ return "fair-forward/arc-easy-autotranslated", task, "machine"
63
  else:
64
  return None, None, None
65
 
66
+
67
  def translate_arc(languages):
68
  human_translated = tags_uhura_arc_easy.keys()
69
  untranslated = [
70
  lang
71
+ for lang in languages["bcp_47"].values
72
+ if lang not in human_translated and lang in get_google_supported_languages()
73
  ]
74
  n_samples = 10
75
+ train_ids = common_ids_train[: n_samples + 3]
76
+ en_train = _load_dataset(
77
+ slug_uhura_arc_easy, subset=tags_uhura_arc_easy["en"], split="train"
78
+ )
79
  en_train = en_train.filter(lambda x: x["id"] in train_ids)
80
  test_ids = common_ids_test[:n_samples]
81
+ en_test = _load_dataset(
82
+ slug_uhura_arc_easy, subset=tags_uhura_arc_easy["en"], split="test"
83
+ )
84
  en_test = en_test.filter(lambda x: x["id"] in test_ids)
85
  data = {"train": en_train, "test": en_test}
86
+
87
  slug = "fair-forward/arc-easy-autotranslated"
88
  for lang in tqdm(untranslated):
89
  # check if already exists on hub
 
92
  except (ValueError, Exception):
93
  print(f"Translating {lang}...")
94
  for split, data_en in data.items():
95
+ questions_tr = [
96
+ translate_google(q, "en", lang) for q in data_en["question"]
97
+ ]
98
  questions_tr = asyncio.run(tqdm_asyncio.gather(*questions_tr))
99
  choices_texts_concatenated = []
100
  for choice in data_en["choices"]:
101
  for option in choice["text"]:
102
  choices_texts_concatenated.append(option)
103
+ choices_tr = [
104
+ translate_google(c, "en", lang) for c in choices_texts_concatenated
105
+ ]
106
  choices_tr = asyncio.run(tqdm_asyncio.gather(*choices_tr))
107
  # group into chunks of 4
108
+ choices_tr = [
109
+ choices_tr[i : i + 4] for i in range(0, len(choices_tr), 4)
110
+ ]
111
 
112
  ds_lang = Dataset.from_dict(
113
  {
 
124
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
125
  )
126
  ds_lang.to_json(
127
+ f"data/translations/arc/{lang}_{split}.json",
128
+ lines=False,
129
+ force_ascii=False,
130
+ indent=2,
131
  )
evals/datasets_/fleurs.py CHANGED
@@ -11,6 +11,7 @@ fleurs["bcp_47"] = fleurs["fleurs_tag"].apply(
11
  lambda x: standardize_tag(x.rsplit("_")[0], macro=True)
12
  )
13
 
 
14
  def download_file(url, path):
15
  response = requests.get(url)
16
  with open(path, "wb") as f:
@@ -34,4 +35,4 @@ def download_fleurs(transcription_langs_eval):
34
  if not tsv_path.exists():
35
  print(f"Downloading {tsv_url} to {tsv_path}")
36
  tsv_path.parent.mkdir(parents=True, exist_ok=True)
37
- download_file(tsv_url, tsv_path)
 
11
  lambda x: standardize_tag(x.rsplit("_")[0], macro=True)
12
  )
13
 
14
+
15
  def download_file(url, path):
16
  response = requests.get(url)
17
  with open(path, "wb") as f:
 
35
  if not tsv_path.exists():
36
  print(f"Downloading {tsv_url} to {tsv_path}")
37
  tsv_path.parent.mkdir(parents=True, exist_ok=True)
38
+ download_file(tsv_url, tsv_path)
evals/datasets_/mgsm.py CHANGED
@@ -1,10 +1,12 @@
1
  import asyncio
2
  import os
 
3
 
4
  from datasets import Dataset, load_dataset
5
- from datasets_.util import _get_dataset_config_names, _load_dataset
6
- from langcodes import standardize_tag
7
- from models import google_supported_languages, translate_google
 
8
  from tqdm import tqdm
9
  from tqdm.asyncio import tqdm_asyncio
10
 
@@ -37,39 +39,58 @@ def parse_number(i):
37
  return None
38
 
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  def load_mgsm(language_bcp_47, nr):
41
  if language_bcp_47 in tags_mgsm.keys():
42
- ds = _load_dataset(slug_mgsm, subset=tags_mgsm[language_bcp_47], split="test")
43
- return slug_mgsm, ds[nr]
44
  elif language_bcp_47 in tags_afrimgsm.keys():
45
- ds = _load_dataset(
46
- slug_afrimgsm, subset=tags_afrimgsm[language_bcp_47], split="test"
 
 
 
47
  )
48
- return slug_afrimgsm, ds[nr]
49
  elif language_bcp_47 in tags_gsm_autotranslated.keys():
50
- ds = _load_dataset(
51
- slug_gsm_autotranslated, subset=tags_gsm_autotranslated[language_bcp_47], split="test"
52
  )
53
- return slug_gsm_autotranslated, ds[nr]
54
- elif language_bcp_47 in tags_gsm8kx.keys():
55
- row = _load_dataset(
56
- slug_gsm8kx,
57
- subset=tags_gsm8kx[language_bcp_47],
58
- split="test",
59
- trust_remote_code=True,
60
- )[nr]
61
- row["answer_number"] = row["answer"].split("####")[1].strip()
62
- return slug_gsm8kx, row
63
  else:
64
- return None, None
65
 
66
 
67
  def translate_mgsm(languages):
68
  human_translated = [*tags_mgsm.keys(), *tags_afrimgsm.keys()]
69
  untranslated = [
70
  lang
71
- for lang in languages["bcp_47"].values[:100]
72
- if lang not in human_translated and lang in google_supported_languages
73
  ]
74
  en = _load_dataset(slug_mgsm, subset=tags_mgsm["en"], split="test")
75
  slug = "fair-forward/gsm-autotranslated"
@@ -96,5 +117,8 @@ def translate_mgsm(languages):
96
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
97
  )
98
  ds_lang.to_json(
99
- f"data/translations/mgsm/{lang}.json", lines=False, force_ascii=False, indent=2
 
 
 
100
  )
 
1
  import asyncio
2
  import os
3
+ import random
4
 
5
  from datasets import Dataset, load_dataset
6
+ from datasets_.util import _get_dataset_config_names, _load_dataset, cache
7
+ from langcodes import Language, standardize_tag
8
+ from models import get_google_supported_languages, translate_google
9
+ from rich import print
10
  from tqdm import tqdm
11
  from tqdm.asyncio import tqdm_asyncio
12
 
 
39
  return None
40
 
41
 
42
+ @cache
43
+ def _get_mgsm_item(dataset_slug, subset_tag, nr, trust_remote_code=False):
44
+ """Cache individual MGSM items efficiently"""
45
+ try:
46
+ ds = _load_dataset(
47
+ dataset_slug,
48
+ subset=subset_tag,
49
+ split="test",
50
+ trust_remote_code=trust_remote_code,
51
+ )
52
+ if nr >= len(ds):
53
+ return None
54
+
55
+ row = ds[nr]
56
+
57
+ # Post-process based on dataset type
58
+ if dataset_slug == slug_gsm8kx:
59
+ row["answer_number"] = row["answer"].split("####")[1].strip()
60
+
61
+ return row
62
+ except Exception:
63
+ # Dataset doesn't exist or doesn't have test split
64
+ return None
65
+
66
+
67
  def load_mgsm(language_bcp_47, nr):
68
  if language_bcp_47 in tags_mgsm.keys():
69
+ item = _get_mgsm_item(slug_mgsm, tags_mgsm[language_bcp_47], nr)
70
+ return slug_mgsm, item, "human" if item else (None, None, None)
71
  elif language_bcp_47 in tags_afrimgsm.keys():
72
+ item = _get_mgsm_item(slug_afrimgsm, tags_afrimgsm[language_bcp_47], nr)
73
+ return slug_afrimgsm, item, "human" if item else (None, None, None)
74
+ elif language_bcp_47 in tags_gsm8kx.keys():
75
+ item = _get_mgsm_item(
76
+ slug_gsm8kx, tags_gsm8kx[language_bcp_47], nr, trust_remote_code=True
77
  )
78
+ return slug_gsm8kx, item, "machine" if item else (None, None, None)
79
  elif language_bcp_47 in tags_gsm_autotranslated.keys():
80
+ item = _get_mgsm_item(
81
+ slug_gsm_autotranslated, tags_gsm_autotranslated[language_bcp_47], nr
82
  )
83
+ return slug_gsm_autotranslated, item, "machine" if item else (None, None, None)
 
 
 
 
 
 
 
 
 
84
  else:
85
+ return None, None, None
86
 
87
 
88
  def translate_mgsm(languages):
89
  human_translated = [*tags_mgsm.keys(), *tags_afrimgsm.keys()]
90
  untranslated = [
91
  lang
92
+ for lang in languages["bcp_47"].values
93
+ if lang not in human_translated and lang in get_google_supported_languages()
94
  ]
95
  en = _load_dataset(slug_mgsm, subset=tags_mgsm["en"], split="test")
96
  slug = "fair-forward/gsm-autotranslated"
 
117
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
118
  )
119
  ds_lang.to_json(
120
+ f"data/translations/mgsm/{lang}.json",
121
+ lines=False,
122
+ force_ascii=False,
123
+ indent=2,
124
  )
evals/datasets_/mmlu.py CHANGED
@@ -4,9 +4,9 @@ import random
4
  from collections import Counter, defaultdict
5
 
6
  from datasets import Dataset, load_dataset
7
- from datasets_.util import _get_dataset_config_names, _load_dataset
8
  from langcodes import Language, standardize_tag
9
- from models import google_supported_languages, translate_google
10
  from rich import print
11
  from tqdm import tqdm
12
  from tqdm.asyncio import tqdm_asyncio
@@ -111,6 +111,7 @@ def print_datasets_analysis():
111
  # MMLUX is translated using DeepL
112
  # Therefore, the priority is: AfriMMLU, Global-MMLU, MMLUX, Okapi-MMLU
113
 
 
114
  # print_datasets_analysis()
115
 
116
 
@@ -143,32 +144,61 @@ tags_mmlux = set(
143
  a.rsplit("_", 1)[1].split("-")[0].lower()
144
  for a in _get_dataset_config_names("Eurolingua/mmlux", trust_remote_code=True)
145
  )
146
- tags_mmlu_autotranslated = _get_dataset_config_names("fair-forward/mmlu-autotranslated")
 
 
 
147
 
148
  categories = sorted(
149
- list(set(_load_dataset("masakhane/afrimmlu", "eng")["dev"]["subject"]))
150
- )
 
 
 
 
 
 
 
 
 
 
 
151
 
152
 
153
- def load_mmlu(language_bcp_47, nr):
 
 
 
 
 
 
 
 
 
 
 
 
154
  category = categories[nr % len(categories)]
155
  if language_bcp_47 in tags_afrimmlu.keys():
156
- ds = _load_dataset("masakhane/afrimmlu", tags_afrimmlu[language_bcp_47])
157
- ds = ds.map(parse_choices)
158
- examples = ds["dev"].filter(lambda x: x["subject"] == category)
159
- task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
160
- return "masakhane/afrimmlu", examples, task
161
  elif language_bcp_47 in tags_global_mmlu.keys():
162
- ds = _load_dataset("CohereForAI/Global-MMLU", tags_global_mmlu[language_bcp_47])
163
- ds = ds.map(add_choices)
164
- examples = ds["dev"].filter(lambda x: x["subject"] == category)
165
- task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
166
- return "CohereForAI/Global-MMLU", examples, task
167
  elif language_bcp_47 in tags_mmlu_autotranslated:
168
- ds = _load_dataset("fair-forward/mmlu-autotranslated", language_bcp_47)
169
- examples = ds["dev"].filter(lambda x: x["subject"] == category)
170
- task = ds["test"].filter(lambda x: x["subject"] == category)[nr]
171
- return "fair-forward/mmlu-autotranslated", examples, task
 
 
 
 
172
  else:
173
  return None, None, None
174
 
@@ -177,10 +207,10 @@ def translate_mmlu(languages):
177
  human_translated = [*tags_afrimmlu.keys(), *tags_global_mmlu.keys()]
178
  untranslated = [
179
  lang
180
- for lang in languages["bcp_47"].values[:100]
181
- if lang not in human_translated and lang in google_supported_languages
182
  ]
183
- n_samples = 10
184
 
185
  slug = "fair-forward/mmlu-autotranslated"
186
  for lang in tqdm(untranslated):
@@ -196,8 +226,10 @@ def translate_mmlu(languages):
196
  if split == "dev":
197
  samples.extend(ds.filter(lambda x: x["subject"] == category))
198
  else:
199
- for i in range(n_samples):
200
- task = ds.filter(lambda x: x["subject"] == category)[i]
 
 
201
  samples.append(task)
202
  questions_tr = [
203
  translate_google(s["question"], "en", lang) for s in samples
 
4
  from collections import Counter, defaultdict
5
 
6
  from datasets import Dataset, load_dataset
7
+ from datasets_.util import _get_dataset_config_names, _load_dataset, cache
8
  from langcodes import Language, standardize_tag
9
+ from models import get_google_supported_languages, translate_google
10
  from rich import print
11
  from tqdm import tqdm
12
  from tqdm.asyncio import tqdm_asyncio
 
111
  # MMLUX is translated using DeepL
112
  # Therefore, the priority is: AfriMMLU, Global-MMLU, MMLUX, Okapi-MMLU
113
 
114
+
115
  # print_datasets_analysis()
116
 
117
 
 
144
  a.rsplit("_", 1)[1].split("-")[0].lower()
145
  for a in _get_dataset_config_names("Eurolingua/mmlux", trust_remote_code=True)
146
  )
147
+ tags_mmlu_autotranslated = {
148
+ standardize_tag(a, macro=True): a
149
+ for a in _get_dataset_config_names("fair-forward/mmlu-autotranslated")
150
+ }
151
 
152
  categories = sorted(
153
+ list(set(_load_dataset("masakhane/afrimmlu", "eng")["dev"]["subject"]))
154
+ )
155
+
156
+
157
+ @cache
158
+ def _get_processed_mmlu_dataset(dataset_name, subset_tag):
159
+ """Cache processed datasets to avoid reprocessing"""
160
+ ds = _load_dataset(dataset_name, subset_tag)
161
+ if dataset_name == "masakhane/afrimmlu":
162
+ ds = ds.map(parse_choices)
163
+ elif dataset_name == "CohereForAI/Global-MMLU":
164
+ ds = ds.map(add_choices)
165
+ return ds
166
 
167
 
168
+ @cache
169
+ def _get_mmlu_item(dataset_name, subset_tag, category, nr):
170
+ """Cache individual MMLU items efficiently"""
171
+ ds = _get_processed_mmlu_dataset(dataset_name, subset_tag)
172
+ if dataset_name in ["masakhane/afrimmlu", "CohereForAI/Global-MMLU"]:
173
+ filtered = ds["test"].filter(lambda x: x["subject"] == category)
174
+ return filtered[nr] if nr < len(filtered) else None
175
+ else: # fair-forward/mmlu-autotranslated
176
+ filtered = ds["test"].filter(lambda x: x["subject"] == category)
177
+ return filtered[nr] if nr < len(filtered) else None
178
+
179
+
180
+ async def load_mmlu(language_bcp_47, nr):
181
  category = categories[nr % len(categories)]
182
  if language_bcp_47 in tags_afrimmlu.keys():
183
+ task = _get_mmlu_item(
184
+ "masakhane/afrimmlu", tags_afrimmlu[language_bcp_47], category, nr
185
+ )
186
+ return "masakhane/afrimmlu", task, "human" if task else (None, None, None)
 
187
  elif language_bcp_47 in tags_global_mmlu.keys():
188
+ task = _get_mmlu_item(
189
+ "CohereForAI/Global-MMLU", tags_global_mmlu[language_bcp_47], category, nr
190
+ )
191
+ return "CohereForAI/Global-MMLU", task, "human" if task else (None, None, None)
192
+ # TODO: add in Okapi, MMLUX @Jonas
193
  elif language_bcp_47 in tags_mmlu_autotranslated:
194
+ task = _get_mmlu_item(
195
+ "fair-forward/mmlu-autotranslated", language_bcp_47, category, nr
196
+ )
197
+ return (
198
+ ("fair-forward/mmlu-autotranslated", task, "machine")
199
+ if task
200
+ else (None, None, None)
201
+ )
202
  else:
203
  return None, None, None
204
 
 
207
  human_translated = [*tags_afrimmlu.keys(), *tags_global_mmlu.keys()]
208
  untranslated = [
209
  lang
210
+ for lang in languages["bcp_47"].values
211
+ if lang not in human_translated and lang in get_google_supported_languages()
212
  ]
213
+ n_samples = 20
214
 
215
  slug = "fair-forward/mmlu-autotranslated"
216
  for lang in tqdm(untranslated):
 
226
  if split == "dev":
227
  samples.extend(ds.filter(lambda x: x["subject"] == category))
228
  else:
229
+ # Use the same 20 samples that the evaluation pipeline uses (indices 0-19)
230
+ filtered = ds.filter(lambda x: x["subject"] == category)
231
+ for i in range(min(n_samples, len(filtered))):
232
+ task = filtered[i]
233
  samples.append(task)
234
  questions_tr = [
235
  translate_google(s["question"], "en", lang) for s in samples
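
For reviewers skimming the caching changes: the `@cache` decorators above are joblib `Memory` caches (defined in `datasets_/util.py` below), so filtered test items are computed once and then served from disk. A minimal sketch of the pattern, using hypothetical names that are not part of this diff:

    from datasets import load_dataset
    from joblib import Memory

    cache = Memory(location=".cache", verbose=0).cache

    @cache
    def get_item(dataset: str, subset: str, category: str, nr: int):
        # First call loads and filters the split and persists the result under .cache/;
        # later calls with identical arguments are read back from disk.
        ds = load_dataset(dataset, subset, split="test")
        filtered = ds.filter(lambda x: x["subject"] == category)
        return filtered[nr] if nr < len(filtered) else None
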
evals/datasets_/truthfulqa.py CHANGED
@@ -9,16 +9,25 @@ from tqdm.asyncio import tqdm_asyncio
9
  import os
10
 
11
  from datasets import Dataset, load_dataset
12
- from models import translate_google, google_supported_languages
13
 
14
  from datasets_.util import _get_dataset_config_names, _load_dataset
15
 
16
  slug_uhura_truthfulqa = "masakhane/uhura-truthfulqa"
 
 
17
  tags_uhura_truthfulqa = {
18
- standardize_tag(a.split("_")[0], macro=True): a for a in _get_dataset_config_names(slug_uhura_truthfulqa)
 
19
  if a.endswith("multiple_choice")
20
  }
21
 
 
 
 
 
 
 
22
 
23
  def add_choices(row):
24
  row["choices"] = row["mc1_targets"]["choices"]
@@ -26,26 +35,43 @@ def add_choices(row):
26
  return row
27
 
28
 
29
- def load_truthfulqa(language_bcp_47, nr):
30
  if language_bcp_47 in tags_uhura_truthfulqa.keys():
31
- ds = _load_dataset(slug_uhura_truthfulqa, tags_uhura_truthfulqa[language_bcp_47])
 
 
32
  ds = ds.map(add_choices)
33
- examples = ds["train"]
34
  task = ds["test"][nr]
35
- return "masakhane/uhura-truthfulqa", examples, task
 
 
 
 
 
 
 
 
 
 
 
 
 
 
36
  else:
37
  return None, None, None
38
 
39
 
40
-
41
  def translate_truthfulqa(languages):
42
  human_translated = [*tags_uhura_truthfulqa.keys()]
43
  untranslated = [
44
  lang
45
- for lang in languages["bcp_47"].values[:100]
46
- if lang not in human_translated and lang in google_supported_languages
47
  ]
48
- n_samples = 10
 
 
 
49
 
50
  slug = "fair-forward/truthfulqa-autotranslated"
51
  for lang in tqdm(untranslated):
@@ -55,37 +81,47 @@ def translate_truthfulqa(languages):
55
  except (ValueError, Exception):
56
  print(f"Translating {lang}...")
57
  for split in ["train", "test"]:
58
- ds = _load_dataset(slug_uhura_truthfulqa, tags_uhura_truthfulqa["en"], split=split)
 
 
59
  samples = []
60
  if split == "train":
61
  samples.extend(ds)
62
  else:
63
- for i in range(n_samples):
 
64
  task = ds[i]
65
  samples.append(task)
 
 
66
  questions_tr = [
67
  translate_google(s["question"], "en", lang) for s in samples
68
  ]
69
  questions_tr = asyncio.run(tqdm_asyncio.gather(*questions_tr))
70
- choices_texts_concatenated = []
 
 
 
 
71
  for s in samples:
72
- for choice in eval(s["choices"]):
73
- choices_texts_concatenated.append(choice)
74
- choices_tr = [
75
- translate_google(c, "en", lang) for c in choices_texts_concatenated
76
- ]
77
- choices_tr = asyncio.run(tqdm_asyncio.gather(*choices_tr))
78
- # group into chunks of 4
79
- choices_tr = [
80
- choices_tr[i : i + 4] for i in range(0, len(choices_tr), 4)
81
- ]
 
 
82
 
83
  ds_lang = Dataset.from_dict(
84
  {
85
- "subject": [s["subject"] for s in samples],
86
  "question": questions_tr,
87
- "choices": choices_tr,
88
- "answer": [s["answer"] for s in samples],
89
  }
90
  )
91
  ds_lang.push_to_hub(
@@ -95,7 +131,7 @@ def translate_truthfulqa(languages):
95
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
96
  )
97
  ds_lang.to_json(
98
- f"data/translations/mmlu/{lang}_{split}.json",
99
  lines=False,
100
  force_ascii=False,
101
  indent=2,
 
9
  import os
10
 
11
  from datasets import Dataset, load_dataset
12
+ from models import translate_google, get_google_supported_languages
13
 
14
  from datasets_.util import _get_dataset_config_names, _load_dataset
15
 
16
  slug_uhura_truthfulqa = "masakhane/uhura-truthfulqa"
17
+ slug_truthfulqa_autotranslated = "fair-forward/truthfulqa-autotranslated"
18
+
19
  tags_uhura_truthfulqa = {
20
+ standardize_tag(a.split("_")[0], macro=True): a
21
+ for a in _get_dataset_config_names(slug_uhura_truthfulqa)
22
  if a.endswith("multiple_choice")
23
  }
24
 
25
+ tags_truthfulqa_autotranslated = {
26
+ standardize_tag(a, macro=True): a
27
+ for a in _get_dataset_config_names(slug_truthfulqa_autotranslated)
28
+ }
29
+ tags_truthfulqa_autotranslated = {}  # deliberately cleared: the auto-translated set is not used until its quality is checked (see TODO below)
30
+
31
 
32
  def add_choices(row):
33
  row["choices"] = row["mc1_targets"]["choices"]
 
35
  return row
36
 
37
 
38
+ async def load_truthfulqa(language_bcp_47, nr):
39
  if language_bcp_47 in tags_uhura_truthfulqa.keys():
40
+ ds = _load_dataset(
41
+ slug_uhura_truthfulqa, tags_uhura_truthfulqa[language_bcp_47]
42
+ )
43
  ds = ds.map(add_choices)
 
44
  task = ds["test"][nr]
45
+ # Ensure there is a correct answer before returning the task
46
+ if 1 not in task["labels"]:
47
+ return None, None, None
48
+ return "masakhane/uhura-truthfulqa", task, "human"
49
+ # TODO check quality/completeness of autotranslated dataset
50
+ # elif language_bcp_47 in tags_truthfulqa_autotranslated.keys():
51
+ # # Load from auto-translated dataset (same samples as translation)
52
+ # ds = _load_dataset(slug_truthfulqa_autotranslated, language_bcp_47)
53
+ # test_split = ds["test"] if "test" in ds else ds
54
+ # task = test_split[nr]
55
+ # # Ensure there is a correct answer before returning the task
56
+ # if 1 not in task.get("labels", []):
57
+ # return None, None, None
58
+ # return slug_truthfulqa_autotranslated, task, "machine"
59
+ # TODO: add Okapi, TruthfulQA-X @Jonas
60
  else:
61
  return None, None, None
62
 
63
 
 
64
  def translate_truthfulqa(languages):
65
  human_translated = [*tags_uhura_truthfulqa.keys()]
66
  untranslated = [
67
  lang
68
+ for lang in languages["bcp_47"].values[:150]
69
+ if lang not in human_translated and lang in get_google_supported_languages()
70
  ]
71
+ n_samples = 20
72
+
73
+ # Set fixed seed for consistent sample selection across all languages
74
+ random.seed(42)
75
 
76
  slug = "fair-forward/truthfulqa-autotranslated"
77
  for lang in tqdm(untranslated):
 
81
  except (ValueError, Exception):
82
  print(f"Translating {lang}...")
83
  for split in ["train", "test"]:
84
+ ds = _load_dataset(
85
+ slug_uhura_truthfulqa, tags_uhura_truthfulqa["en"], split=split
86
+ )
87
  samples = []
88
  if split == "train":
89
  samples.extend(ds)
90
  else:
91
+ # Use the same 20 samples that the evaluation pipeline uses (indices 0-19)
92
+ for i in range(min(n_samples, len(ds))):
93
  task = ds[i]
94
  samples.append(task)
95
+
96
+ # Translate questions
97
  questions_tr = [
98
  translate_google(s["question"], "en", lang) for s in samples
99
  ]
100
  questions_tr = asyncio.run(tqdm_asyncio.gather(*questions_tr))
101
+
102
+ # Translate choices for each sample
103
+ all_choices_tr = []
104
+ all_labels = []
105
+
106
  for s in samples:
107
+ # Get choices from mc1_targets
108
+ choices = s["mc1_targets"]["choices"]
109
+ labels = s["mc1_targets"]["labels"]
110
+
111
+ # Translate choices
112
+ choices_tr = [
113
+ translate_google(choice, "en", lang) for choice in choices
114
+ ]
115
+ choices_tr = asyncio.run(tqdm_asyncio.gather(*choices_tr))
116
+
117
+ all_choices_tr.append(choices_tr)
118
+ all_labels.append(labels)
119
 
120
  ds_lang = Dataset.from_dict(
121
  {
 
122
  "question": questions_tr,
123
+ "choices": all_choices_tr,
124
+ "labels": all_labels,
125
  }
126
  )
127
  ds_lang.push_to_hub(
 
131
  token=os.getenv("HUGGINGFACE_ACCESS_TOKEN"),
132
  )
133
  ds_lang.to_json(
134
+ f"data/translations/truthfulqa/{lang}_{split}.json",
135
  lines=False,
136
  force_ascii=False,
137
  indent=2,
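
On the choice-translation loop above: `translate_google` is an async coroutine, so each sample's choices are translated concurrently and gathered in order, which keeps them aligned with the labels list. A stripped-down sketch of that pattern (the helper name is hypothetical, not from this diff):

    import asyncio

    async def translate_choices(choices, lang, translate):
        # One coroutine per choice; gather() preserves input order,
        # so translated choices stay aligned with their labels.
        return await asyncio.gather(*(translate(c, "en", lang) for c in choices))
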
evals/datasets_/util.py CHANGED
@@ -1,7 +1,14 @@
1
- from datasets import get_dataset_config_names, load_dataset
 
 
 
 
 
 
2
  from joblib.memory import Memory
3
 
4
  cache = Memory(location=".cache", verbose=0).cache
 
5
 
6
 
7
  @cache
@@ -12,3 +19,27 @@ def _get_dataset_config_names(dataset, **kwargs):
12
  @cache
13
  def _load_dataset(dataset, subset, **kwargs):
14
  return load_dataset(dataset, subset, **kwargs)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ from pathlib import Path
3
+
4
+ import pandas as pd
5
+ from datasets import Dataset, get_dataset_config_names, load_dataset
6
+ from datasets.exceptions import DatasetNotFoundError
7
+ from huggingface_hub.errors import RepositoryNotFoundError
8
  from joblib.memory import Memory
9
 
10
  cache = Memory(location=".cache", verbose=0).cache
11
+ TOKEN = os.getenv("HUGGINGFACE_ACCESS_TOKEN")
12
 
13
 
14
  @cache
 
19
  @cache
20
  def _load_dataset(dataset, subset, **kwargs):
21
  return load_dataset(dataset, subset, **kwargs)
22
+
23
+
24
+ # Cache individual dataset items to avoid reloading entire datasets
25
+ @cache
26
+ def _get_dataset_item(dataset, subset, split, index, **kwargs):
27
+ """Load a single item from a dataset efficiently"""
28
+ ds = load_dataset(dataset, subset, split=split, **kwargs)
29
+ return ds[index] if index < len(ds) else None
30
+
31
+
32
+ def load(fname: str):
33
+ try:
34
+ ds = load_dataset(f"fair-forward/evals-for-every-language-{fname}", token=TOKEN)
35
+ return ds["train"].to_pandas()
36
+ except (DatasetNotFoundError, RepositoryNotFoundError, KeyError):
37
+ return pd.DataFrame()
38
+
39
+
40
+ def save(df: pd.DataFrame, fname: str):
41
+ df = df.drop(columns=["__index_level_0__"], errors="ignore")
42
+ ds = Dataset.from_pandas(df)
43
+ ds.push_to_hub(f"fair-forward/evals-for-every-language-{fname}", token=TOKEN)
44
+ Path("results").mkdir(exist_ok=True)
45
+ df.to_json(f"results/{fname}.json", orient="records", force_ascii=False, indent=2)
evals/download_data.py CHANGED
@@ -8,6 +8,7 @@ from pathlib import Path
8
  import sys
9
  import huggingface_hub
10
  from datasets import load_dataset, DatasetDict
 
11
  # Import fleurs DataFrame directly from its source module
12
  from datasets_.fleurs import fleurs
13
 
@@ -24,22 +25,25 @@ DATA_DIR = project_root / "data"
24
  FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"
25
  FLEURS_TARGET_DIR = DATA_DIR / "fleurs"
26
 
27
- GLOTTOLOG_URL = "https://cdstar.shh.mpg.de/bitstreams/EAEA0-B44E-8CEC-EA65-0/glottolog_languoid.zip" # Assumed direct link from https://glottolog.org/meta/downloads
28
  GLOTTOLOG_TARGET_DIR = DATA_DIR / "glottolog_languoid.csv"
29
  GLOTTOLOG_CSV_NAME = "languoid.csv"
30
 
31
- SCRIPTCODES_URL = "https://www.unicode.org/iso15924/iso15924-codes.html" # This is HTML, need manual download or parsing
32
  SCRIPTCODES_TARGET_FILE = DATA_DIR / "ScriptCodes.csv"
33
 
34
- SPBLEU_SPM_URL = "https://tinyurl.com/flores200sacrebleuspm" # Assumed direct link
35
  SPBLEU_TARGET_DIR = DATA_DIR / "spbleu"
36
  SPBLEU_SPM_NAME = "flores200_sacrebleu_tokenizer_spm.model"
37
- SPBLEU_DICT_URL = "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt"
 
 
38
  SPBLEU_DICT_NAME = "dictionary.txt"
39
 
40
 
41
  # --- Helper Functions ---
42
 
 
43
  def download_file(url, path: Path):
44
  """Downloads a file from a URL to a local path."""
45
  print(f"Downloading {url} to {path}...")
@@ -84,11 +88,16 @@ def extract_zip(zip_content: bytes, extract_path: Path, target_filename: str):
84
  break
85
 
86
  if target_zip_path:
87
- with z.open(target_zip_path) as source, open(extract_path / target_filename, "wb") as target:
 
 
 
88
  target.write(source.read())
89
  print(f"Successfully extracted {target_filename}.")
90
  else:
91
- print(f"Error: Could not find {target_filename} within the zip archive.")
 
 
92
 
93
  except zipfile.BadZipFile:
94
  print("Error: Downloaded file is not a valid zip archive.")
@@ -98,13 +107,14 @@ def extract_zip(zip_content: bytes, extract_path: Path, target_filename: str):
98
 
99
  # --- Download Functions ---
100
 
 
101
  def download_fleurs_data():
102
  """Downloads Fleurs audio and text data."""
103
  print("\n--- Downloading Fleurs Data ---")
104
  FLEURS_TARGET_DIR.mkdir(parents=True, exist_ok=True)
105
 
106
  # Use the fleurs_tag column from the imported DataFrame
107
- fleurs_tags_list = fleurs['fleurs_tag'].tolist()
108
 
109
  if not fleurs_tags_list:
110
  print("No Fleurs tags found in imported fleurs DataFrame. Skipping Fleurs.")
@@ -117,7 +127,9 @@ def download_fleurs_data():
117
  audio_dir = lang_dir / "audio"
118
  dev_tsv_path = lang_dir / "dev.tsv"
119
  dev_audio_archive_path = audio_dir / "dev.tar.gz"
120
- audio_extracted_marker = audio_dir / "dev" # Check if extraction likely happened
 
 
121
 
122
  # Download TSV
123
  if not dev_tsv_path.exists():
@@ -129,15 +141,15 @@ def download_fleurs_data():
129
  # Download and Extract Audio
130
  if not audio_extracted_marker.exists():
131
  if not dev_audio_archive_path.exists():
132
- tar_url = f"{FLEURS_BASE_URL}/{lang_tag}/audio/dev.tar.gz"
133
- download_file(tar_url, dev_audio_archive_path)
134
 
135
  if dev_audio_archive_path.exists():
136
- extract_tar_gz(dev_audio_archive_path, audio_dir)
137
  else:
138
  print(f"Audio archive missing, cannot extract for {lang_tag}")
139
  else:
140
- print(f"Found extracted audio: {audio_extracted_marker}")
141
 
142
 
143
  def download_glottolog_data():
@@ -165,7 +177,9 @@ def download_scriptcodes_data():
165
  # The URL points to an HTML page, not a direct CSV link.
166
  # Manual download is likely required for ScriptCodes.csv.
167
  print(f"Cannot automatically download from {SCRIPTCODES_URL}")
168
- print(f"Please manually download the ISO 15924 codes list (often available as a .txt file)")
 
 
169
  print("from the Unicode website or related sources and save it as:")
170
  print(f"{SCRIPTCODES_TARGET_FILE}")
171
  if SCRIPTCODES_TARGET_FILE.exists():
@@ -196,21 +210,24 @@ def download_spbleu_data():
196
 
197
  # --- Main Execution ---
198
 
 
199
  def main():
200
  """Runs all download functions and the conversion step."""
201
  print("Starting data download process...")
202
  DATA_DIR.mkdir(exist_ok=True)
203
 
204
- #download_fleurs_data()
205
  download_glottolog_data()
206
  download_scriptcodes_data()
207
  download_spbleu_data()
208
 
209
  print("\nData download process finished.")
210
  print("Please verify downloads and manually obtain ScriptCodes.csv if needed.")
211
- print("Note: Flores+ was downloaded as parquet, which might require changes but has been processed as well")
 
 
212
  print("in 'evals/datasets_/flores.py' to be read correctly.")
213
 
214
 
215
  if __name__ == "__main__":
216
- main()
 
8
  import sys
9
  import huggingface_hub
10
  from datasets import load_dataset, DatasetDict
11
+
12
  # Import fleurs DataFrame directly from its source module
13
  from datasets_.fleurs import fleurs
14
 
 
25
  FLEURS_BASE_URL = "https://huggingface.co/datasets/google/fleurs/resolve/main/data"
26
  FLEURS_TARGET_DIR = DATA_DIR / "fleurs"
27
 
28
+ GLOTTOLOG_URL = "https://cdstar.shh.mpg.de/bitstreams/EAEA0-B44E-8CEC-EA65-0/glottolog_languoid.zip" # Assumed direct link from https://glottolog.org/meta/downloads
29
  GLOTTOLOG_TARGET_DIR = DATA_DIR / "glottolog_languoid.csv"
30
  GLOTTOLOG_CSV_NAME = "languoid.csv"
31
 
32
+ SCRIPTCODES_URL = "https://www.unicode.org/iso15924/iso15924-codes.html" # This is HTML, need manual download or parsing
33
  SCRIPTCODES_TARGET_FILE = DATA_DIR / "ScriptCodes.csv"
34
 
35
+ SPBLEU_SPM_URL = "https://tinyurl.com/flores200sacrebleuspm" # Assumed direct link
36
  SPBLEU_TARGET_DIR = DATA_DIR / "spbleu"
37
  SPBLEU_SPM_NAME = "flores200_sacrebleu_tokenizer_spm.model"
38
+ SPBLEU_DICT_URL = (
39
+ "https://dl.fbaipublicfiles.com/large_objects/nllb/models/spm_200/dictionary.txt"
40
+ )
41
  SPBLEU_DICT_NAME = "dictionary.txt"
42
 
43
 
44
  # --- Helper Functions ---
45
 
46
+
47
  def download_file(url, path: Path):
48
  """Downloads a file from a URL to a local path."""
49
  print(f"Downloading {url} to {path}...")
 
88
  break
89
 
90
  if target_zip_path:
91
+ with (
92
+ z.open(target_zip_path) as source,
93
+ open(extract_path / target_filename, "wb") as target,
94
+ ):
95
  target.write(source.read())
96
  print(f"Successfully extracted {target_filename}.")
97
  else:
98
+ print(
99
+ f"Error: Could not find {target_filename} within the zip archive."
100
+ )
101
 
102
  except zipfile.BadZipFile:
103
  print("Error: Downloaded file is not a valid zip archive.")
 
107
 
108
  # --- Download Functions ---
109
 
110
+
111
  def download_fleurs_data():
112
  """Downloads Fleurs audio and text data."""
113
  print("\n--- Downloading Fleurs Data ---")
114
  FLEURS_TARGET_DIR.mkdir(parents=True, exist_ok=True)
115
 
116
  # Use the fleurs_tag column from the imported DataFrame
117
+ fleurs_tags_list = fleurs["fleurs_tag"].tolist()
118
 
119
  if not fleurs_tags_list:
120
  print("No Fleurs tags found in imported fleurs DataFrame. Skipping Fleurs.")
 
127
  audio_dir = lang_dir / "audio"
128
  dev_tsv_path = lang_dir / "dev.tsv"
129
  dev_audio_archive_path = audio_dir / "dev.tar.gz"
130
+ audio_extracted_marker = (
131
+ audio_dir / "dev"
132
+ ) # Check if extraction likely happened
133
 
134
  # Download TSV
135
  if not dev_tsv_path.exists():
 
141
  # Download and Extract Audio
142
  if not audio_extracted_marker.exists():
143
  if not dev_audio_archive_path.exists():
144
+ tar_url = f"{FLEURS_BASE_URL}/{lang_tag}/audio/dev.tar.gz"
145
+ download_file(tar_url, dev_audio_archive_path)
146
 
147
  if dev_audio_archive_path.exists():
148
+ extract_tar_gz(dev_audio_archive_path, audio_dir)
149
  else:
150
  print(f"Audio archive missing, cannot extract for {lang_tag}")
151
  else:
152
+ print(f"Found extracted audio: {audio_extracted_marker}")
153
 
154
 
155
  def download_glottolog_data():
 
177
  # The URL points to an HTML page, not a direct CSV link.
178
  # Manual download is likely required for ScriptCodes.csv.
179
  print(f"Cannot automatically download from {SCRIPTCODES_URL}")
180
+ print(
181
+ "Please manually download the ISO 15924 codes list (often available as a .txt file)"
182
+ )
183
  print("from the Unicode website or related sources and save it as:")
184
  print(f"{SCRIPTCODES_TARGET_FILE}")
185
  if SCRIPTCODES_TARGET_FILE.exists():
 
210
 
211
  # --- Main Execution ---
212
 
213
+
214
  def main():
215
  """Runs all download functions and the conversion step."""
216
  print("Starting data download process...")
217
  DATA_DIR.mkdir(exist_ok=True)
218
 
219
+ # download_fleurs_data()
220
  download_glottolog_data()
221
  download_scriptcodes_data()
222
  download_spbleu_data()
223
 
224
  print("\nData download process finished.")
225
  print("Please verify downloads and manually obtain ScriptCodes.csv if needed.")
226
+ print(
227
+ "Note: Flores+ was downloaded as parquet, which might require changes but has been processed as well"
228
+ )
229
  print("in 'evals/datasets_/flores.py' to be read correctly.")
230
 
231
 
232
  if __name__ == "__main__":
233
+ main()
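
One portability note: the parenthesized multi-context `with` now used in `extract_zip` requires Python 3.10 or newer. Should the runner ever use an older interpreter (an assumption, not something this diff states), the nested form is equivalent:

    import zipfile
    from pathlib import Path

    def extract_member(zip_path: Path, member: str, out_path: Path) -> None:
        # Same behaviour as the parenthesized with-block, valid on older Pythons.
        with zipfile.ZipFile(zip_path) as z:
            with z.open(member) as source, open(out_path, "wb") as target:
                target.write(source.read())
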
evals/languages.py CHANGED
@@ -31,6 +31,7 @@ glottolog["bcp_47"] = glottolog["iso639P3code"].apply(
31
  lambda x: standardize_tag(x, macro=True) if not pd.isna(x) else None
32
  )
33
 
 
34
  @cache
35
  def language_family(bcp_47):
36
  languoid = glottolog[glottolog["bcp_47"] == bcp_47].iloc[0]
@@ -39,6 +40,7 @@ def language_family(bcp_47):
39
  family = glottolog[glottolog["id"] == languoid["family_id"]].iloc[0]
40
  return family["name"]
41
 
 
42
  languages["family"] = languages["bcp_47"].apply(language_family)
43
 
44
  # load script codes and names
@@ -46,6 +48,7 @@ scripts = pd.read_csv("data/ScriptCodes.csv").rename(
46
  columns={"Code": "iso15924", "English Name": "script_name"}
47
  )
48
 
 
49
  def script_name(iso15924):
50
  return scripts[scripts["iso15924"] == iso15924]["script_name"].values[0]
51
 
 
31
  lambda x: standardize_tag(x, macro=True) if not pd.isna(x) else None
32
  )
33
 
34
+
35
  @cache
36
  def language_family(bcp_47):
37
  languoid = glottolog[glottolog["bcp_47"] == bcp_47].iloc[0]
 
40
  family = glottolog[glottolog["id"] == languoid["family_id"]].iloc[0]
41
  return family["name"]
42
 
43
+
44
  languages["family"] = languages["bcp_47"].apply(language_family)
45
 
46
  # load script codes and names
 
48
  columns={"Code": "iso15924", "English Name": "script_name"}
49
  )
50
 
51
+
52
  def script_name(iso15924):
53
  return scripts[scripts["iso15924"] == iso15924]["script_name"].values[0]
54
 
evals/main.py CHANGED
@@ -1,62 +1,80 @@
1
  import asyncio
 
 
 
2
 
3
  import pandas as pd
4
  from languages import languages
5
  from models import models
 
6
  from tasks import tasks
7
  from tqdm.asyncio import tqdm_asyncio
 
 
8
 
9
- # ===== config =====
 
 
10
 
11
- n_sentences = 10
 
12
 
13
- # ===== run evaluation and aggregate results =====
 
 
 
 
 
 
 
 
 
 
 
 
14
 
 
 
 
 
 
15
 
16
- async def evaluate():
17
- # FIXME we should not need this for-loop, but it helps
18
- for n_languages in range(10, 101, 10):
19
- print(f"running evaluations for {n_languages} languages")
20
- old_results = pd.read_json("results.json")
21
- old_models = pd.read_json("models.json")
22
- # get all combinations of model, language and task
23
- combis = [
24
- (model, lang.bcp_47, task_name)
25
- for model in models["id"]
26
- for lang in languages.iloc[:n_languages].itertuples()
27
- for task_name, task in tasks.items()
28
- if task_name in models[models["id"] == model]["tasks"].iloc[0]
29
- ]
30
- # filter out combinations that have already been evaluated
31
- combis = pd.DataFrame(combis, columns=["model", "bcp_47", "task"])
32
- combis = combis.merge(old_results, on=["model", "bcp_47", "task"], how="left")
33
- combis = combis[combis["metric"].isna()][["model", "bcp_47", "task"]]
34
- # run evaluations
35
- results = [
36
- tasks[task_name](model, bcp_47, i)
37
- for i in range(n_sentences)
38
- for model, bcp_47, task_name in combis.itertuples(index=False)
39
- ]
40
- results = await tqdm_asyncio.gather(*results, miniters=1)
41
- results = [r for group in results for r in group]
42
- args = dict(orient="records", indent=2, force_ascii=False)
43
- if results:
44
- # aggregate results
45
- results = pd.DataFrame(results)
46
- results = (
47
- results.groupby(["model", "bcp_47", "task", "metric"])
48
- .agg({"score": "mean"})
49
- .reset_index()
50
- )
51
- # save results
52
- results = pd.concat([old_results, results])
53
- results = results.sort_values(by=["model", "bcp_47", "task", "metric"])
54
- results.to_json("results.json", **args)
55
- # save up-to-date info on models and languages
56
- all_models = pd.concat([pd.DataFrame(models), old_models])
57
- all_models = all_models.drop_duplicates(subset=["id"]).sort_values(by=["id"])
58
- all_models.to_json("models.json", **args)
59
- pd.DataFrame(languages).to_json("languages.json", **args)
60
 
61
 
62
  if __name__ == "__main__":
 
1
  import asyncio
2
+ import time
3
+ from datetime import timedelta
4
+ from os import environ
5
 
6
  import pandas as pd
7
  from languages import languages
8
  from models import models
9
+ from rich import print
10
  from tasks import tasks
11
  from tqdm.asyncio import tqdm_asyncio
12
+ from datasets_.util import load, save
13
+ from tqdm import tqdm
14
 
15
+ n_sentences = int(environ.get("N_SENTENCES", 10))
16
+ n_languages = int(environ.get("N_LANGUAGES", 300))
17
+ n_models = int(environ.get("N_MODELS", 35))
18
 
19
+ async def evaluate():
20
+ start_time = time.time()
21
 
22
+ # Pre-compute model tasks to avoid O(nΒ²) lookups
23
+ model_tasks = models.set_index("id")["tasks"].to_dict()
24
+
25
+ # get all combinations that need evaluation
26
+ combis = [
27
+ (task_name, model, lang.bcp_47, i)
28
+ for i in range(n_sentences)
29
+ for lang in languages.head(n_languages).itertuples()
30
+ for task_name, task in tasks.items()
31
+ for model in models.iloc[:n_models]["id"]
32
+ if task_name in model_tasks[model]
33
+ ]
34
+ combis = pd.DataFrame(combis, columns=["task", "model", "bcp_47", "sentence_nr"])
35
 
36
+ # Load cached results and filter out completed combinations
37
+ old_results = load("results-detailed")
38
+ if not old_results.empty:
39
+ completed = set(old_results[["task", "model", "bcp_47", "sentence_nr"]].apply(tuple, axis=1))
40
+ combis = combis[~combis.apply(lambda row: tuple(row) in completed, axis=1)]
41
 
42
+ print(f"Running {len(combis)} evaluation tasks...")
43
+
44
+ # batching: asyncio.gather with rate limiting could in principle run everything at once, but in practice processing in fixed-size batches is more efficient and more reliable
45
+ batch_size = 2000
46
+ batch_results = [
47
+ await tqdm_asyncio.gather(
48
+ *[tasks[task_name](model, bcp_47, sentence_nr)
49
+ for _, (task_name, model, bcp_47, sentence_nr) in batch.iterrows()]
50
+ )
51
+ for i in tqdm(range(0, len(combis), batch_size), colour='blue', desc='Batches')
52
+ for batch in [combis[i:i + batch_size]]
53
+ ]
54
+ results = [r for batch in batch_results for result in batch for r in result]
55
+ results = pd.DataFrame(results) if results else pd.DataFrame(columns=["task", "model", "bcp_47", "metric", "sentence_nr", "score", "origin"])
56
+
57
+ # Merge with cached results (immutable log)
58
+ all_results = pd.concat([old_results, results]).drop_duplicates(
59
+ subset=["task", "model", "bcp_47", "metric", "sentence_nr"]
60
+ ) if not old_results.empty else results
61
+
62
+ # Filter to current models Γ— languages and aggregate
63
+ current_models = set(models.iloc[:n_models]["id"])
64
+ current_languages = set(languages.head(n_languages)["bcp_47"])
65
+ results_agg = (
66
+ all_results[all_results["model"].isin(current_models) & all_results["bcp_47"].isin(current_languages)]
67
+ .groupby(["model", "bcp_47", "task", "metric"])
68
+ .agg({"score": "mean", "origin": "first"})
69
+ .reset_index()
70
+ )
71
+
72
+ save(all_results, "results-detailed")
73
+ save(results_agg, "results")
74
+ save(models, "models")
75
+ save(languages, "languages")
76
+ elapsed = time.time() - start_time
77
+ print(f"Evaluation completed in {str(timedelta(seconds=int(elapsed)))}")
 
 
 
 
 
 
 
 
78
 
79
 
80
  if __name__ == "__main__":
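
The batching comment in `evaluate()` marks the main design choice of the rewrite: instead of awaiting every coroutine at once, work is gathered in fixed-size slices. A stripped-down sketch of the same pattern over a hypothetical list of jobs (not part of the diff):

    import asyncio

    async def run_in_batches(jobs, batch_size=2000):
        # jobs: zero-argument coroutine factories. Awaiting all of them at once
        # would open too many concurrent requests; slicing bounds the concurrency.
        results = []
        for start in range(0, len(jobs), batch_size):
            batch = jobs[start:start + batch_size]
            results.extend(await asyncio.gather(*(job() for job in batch)))
        return results
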
evals/models.py CHANGED
@@ -1,13 +1,10 @@
1
- import json
2
  import re
3
- from collections import defaultdict
4
  from datetime import date
5
  from os import getenv
6
 
7
  import pandas as pd
8
  from aiolimiter import AsyncLimiter
9
  from dotenv import load_dotenv
10
- from elevenlabs import AsyncElevenLabs
11
  from google.cloud import translate_v2 as translate
12
  from huggingface_hub import AsyncInferenceClient, HfApi
13
  from joblib.memory import Memory
@@ -22,20 +19,30 @@ important_models = [
22
  "meta-llama/llama-3.1-70b-instruct", # 0.3$
23
  "meta-llama/llama-3-70b-instruct", # 0.4$
24
  # "meta-llama/llama-2-70b-chat", # 0.9$; not properly supported by OpenRouter
 
 
 
25
  "openai/gpt-4.1", # 8$
26
- "openai/gpt-4.1-mini", # 1.6$
27
- "openai/gpt-4.1-nano", # 0.4$
28
- "openai/gpt-4o-mini", # 0.6$
29
- # "openai/gpt-4o-2024-11-20", # 10$
30
- "openai/gpt-3.5-turbo-0613", # 2$
31
- # "openai/gpt-3.5-turbo", # 1.5$
32
- # "anthropic/claude-3.5-haiku", # 4$ -> too expensive for dev
33
- "mistralai/mistral-small-3.1-24b-instruct", # 0.3$
 
 
 
34
  "mistralai/mistral-saba", # 0.6$
35
  "mistralai/mistral-nemo", # 0.08$
 
36
  "google/gemini-2.5-flash", # 0.6$
37
- "google/gemini-2.0-flash-lite-001", # 0.3$
38
  "google/gemma-3-27b-it", # 0.2$
 
 
 
39
  "qwen/qwen3-32b",
40
  "qwen/qwen3-235b-a22b",
41
  "qwen/qwen3-30b-a3b", # 0.29$
@@ -43,15 +50,16 @@ important_models = [
43
  # "qwen/qwq-32b", # 0.2$
44
  # "qwen/qwen-2.5-72b-instruct", # 0.39$
45
  # "qwen/qwen-2-72b-instruct", # 0.9$
46
- "deepseek/deepseek-chat-v3-0324", # 1.1$
47
- "deepseek/deepseek-chat", # 0.89$
48
  "microsoft/phi-4", # 0.07$
49
- "microsoft/phi-4-multimodal-instruct", # 0.1$
50
- "amazon/nova-micro-v1", # 0.09$
 
51
  ]
52
 
53
  blocklist = [
54
  "google/gemini-2.5-pro-preview",
 
55
  "google/gemini-2.5-flash-preview",
56
  "google/gemini-2.5-flash-lite-preview",
57
  "google/gemini-2.5-flash-preview-04-17",
@@ -59,6 +67,11 @@ blocklist = [
59
  "google/gemini-2.5-flash-lite-preview-06-17",
60
  "google/gemini-2.5-pro-preview-06-05",
61
  "google/gemini-2.5-pro-preview-05-06",
 
 
 
 
 
62
  ]
63
 
64
  transcription_models = [
@@ -72,49 +85,104 @@ cache = Memory(location=".cache", verbose=0).cache
72
 
73
 
74
  @cache
75
- def get_models(date: date):
76
  return get("https://openrouter.ai/api/frontend/models").json()["data"]
77
 
78
 
79
- def get_model(permaslug):
80
- models = get_models(date.today())
81
  slugs = [
82
  m
83
  for m in models
84
- if m["permaslug"] == permaslug
 
85
  and m["endpoint"]
 
 
86
  and not m["endpoint"]["is_free"]
 
 
 
 
87
  ]
88
  if len(slugs) == 0:
89
- # the problem is that free models typically have very high rate-limiting
90
- print(f"no non-free model found for {permaslug}")
91
  return slugs[0] if len(slugs) >= 1 else None
92
 
93
 
94
  @cache
95
  def get_historical_popular_models(date: date):
96
- raw = get("https://openrouter.ai/rankings").text
97
- data = re.search(r'{\\"data\\":(.*),\\"isPercentage\\"', raw).group(1)
98
- data = json.loads(data.replace("\\", ""))
99
- counts = defaultdict(int)
100
- for day in data:
101
- for model, count in day["ys"].items():
102
- if model.startswith("openrouter") or model == "Others":
103
- continue
104
- counts[model.split(":")[0]] += count
105
- counts = sorted(counts.items(), key=lambda x: x[1], reverse=True)
106
- models = [get_model(model) for model, _ in counts]
107
- return [m for m in models if m]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
108
 
109
 
110
  @cache
111
  def get_current_popular_models(date: date):
112
- raw = get("https://openrouter.ai/rankings?view=day").text.replace("\\", "")
113
- data = re.search(r'"rankingData":(.*),"rankingType":"day"', raw).group(1)
114
- data = json.loads(data)
115
- data = sorted(data, key=lambda x: x["total_prompt_tokens"], reverse=True)
116
- models = [get_model(model["model_permaslug"]) for model in data]
117
- return [m for m in models if m]
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
118
 
119
 
120
  def get_translation_models():
@@ -125,6 +193,7 @@ def get_translation_models():
125
  "name": "Google Translate",
126
  "provider_name": "Google",
127
  "cost": 20.0,
 
128
  "size": None,
129
  "type": "closed-source",
130
  "license": None,
@@ -161,7 +230,10 @@ async def complete(**kwargs) -> str | None:
161
 
162
 
163
  translate_client = translate.Client()
164
- google_supported_languages = [l["language"] for l in translate_client.get_languages()]
 
 
 
165
 
166
 
167
  @cache
@@ -173,42 +245,35 @@ async def translate_google(text, source_language, target_language):
173
  return response["translatedText"]
174
 
175
 
176
- @cache
177
- async def transcribe_elevenlabs(path, model):
178
- modelname = model.split("/")[-1]
179
- client = AsyncElevenLabs(api_key=getenv("ELEVENLABS_API_KEY"))
180
- async with elevenlabs_rate_limit:
181
- with open(path, "rb") as file:
182
- response = await client.speech_to_text.convert(
183
- model_id=modelname, file=file
184
- )
185
- return response.text
186
 
187
 
188
- @cache
189
- async def transcribe_huggingface(path, model):
190
- client = AsyncInferenceClient(api_key=getenv("HUGGINGFACE_ACCESS_TOKEN"))
191
- async with huggingface_rate_limit:
192
- output = await client.automatic_speech_recognition(model=model, audio=path)
193
- return output.text
194
 
195
 
196
- async def transcribe(path, model="elevenlabs/scribe_v1"):
197
- provider, modelname = model.split("/")
198
- match provider:
199
- case "elevenlabs":
200
- return await transcribe_elevenlabs(path, modelname)
201
- case "openai" | "facebook":
202
- return await transcribe_huggingface(path, model)
203
- case _:
204
- raise ValueError(f"Model {model} not supported")
205
-
206
-
207
- def get_or_metadata(id):
208
- # get metadata from OpenRouter
209
- models = get_models(date.today())
210
- metadata = next((m for m in models if m["slug"] == id), None)
211
- return metadata
212
 
213
 
214
  api = HfApi()
@@ -231,12 +296,15 @@ def get_hf_metadata(row):
231
  return empty
232
  try:
233
  info = api.model_info(id)
234
- license = (
235
- (info.card_data.license or "")
236
- .replace("-", " ")
237
- .replace("mit", "MIT")
238
- .title()
239
- )
 
 
 
240
  return {
241
  "hf_id": info.id,
242
  "creation_date": info.created_at,
@@ -249,20 +317,39 @@ def get_hf_metadata(row):
249
 
250
 
251
  def get_cost(row):
252
- cost = float(row["endpoint"]["pricing"]["completion"])
253
- return round(cost * 1_000_000, 2)
 
 
 
 
 
 
 
 
 
254
 
255
 
256
  @cache
257
- def load_models(date: date):
258
- popular_models = (
259
- get_historical_popular_models(date.today())[:20]
260
- + get_current_popular_models(date.today())[:10]
261
- )
 
262
  popular_models = [m["slug"] for m in popular_models]
263
- models = set(important_models + popular_models) - set(blocklist)
264
- models = pd.DataFrame(sorted(list(models)), columns=["id"])
265
- or_metadata = models["id"].apply(get_or_metadata)
 
 
 
 
 
 
 
 
 
266
  hf_metadata = or_metadata.apply(get_hf_metadata)
267
  creation_date_hf = pd.to_datetime(hf_metadata.str["creation_date"]).dt.date
268
  creation_date_or = pd.to_datetime(
@@ -274,16 +361,30 @@ def load_models(date: date):
274
  .str.replace(" (free)", "")
275
  .str.replace(" (self-moderated)", ""),
276
  provider_name=or_metadata.str["name"].str.split(": ").str[0],
 
277
  cost=or_metadata.apply(get_cost),
 
278
  hf_id=hf_metadata.str["hf_id"],
279
  size=hf_metadata.str["size"],
280
  type=hf_metadata.str["type"],
281
  license=hf_metadata.str["license"],
282
  creation_date=creation_date_hf.combine_first(creation_date_or),
283
  )
284
- # models = models[models["cost"] <= 2.0].reset_index(drop=True)
 
 
 
 
285
  models["tasks"] = [
286
- ["translation_from", "translation_to", "classification", "mmlu", "arc", "truthfulqa", "mgsm"]
 
 
 
 
 
 
 
 
287
  ] * len(models)
288
  models = pd.concat([models, get_translation_models()])
289
  return models
 
 
1
  import re
 
2
  from datetime import date
3
  from os import getenv
4
 
5
  import pandas as pd
6
  from aiolimiter import AsyncLimiter
7
  from dotenv import load_dotenv
 
8
  from google.cloud import translate_v2 as translate
9
  from huggingface_hub import AsyncInferenceClient, HfApi
10
  from joblib.memory import Memory
 
19
  "meta-llama/llama-3.1-70b-instruct", # 0.3$
20
  "meta-llama/llama-3-70b-instruct", # 0.4$
21
  # "meta-llama/llama-2-70b-chat", # 0.9$; not properly supported by OpenRouter
22
+ "openai/gpt-5",
23
+ "openai/gpt-5-mini",
24
+ "openai/gpt-5-nano",
25
  "openai/gpt-4.1", # 8$
26
+ "openai/gpt-4o", # 10$
27
+ "openai/gpt-3.5-turbo", # $1.50
28
+ "openai/gpt-oss-120b",
29
+ "anthropic/claude-4.5-sonnet",
30
+ "anthropic/claude-4.5-haiku",
31
+ "anthropic/claude-opus-4.1", # 15$
32
+ "anthropic/claude-4-sonnet",
33
+ "anthropic/claude-3.7-sonnet", # 15$
34
+ "anthropic/claude-3.5-sonnet",
35
+ "mistralai/mistral-small-3.2-24b-instruct", # 0.3$
36
+ "mistralai/mistral-medium-3.1",
37
  "mistralai/mistral-saba", # 0.6$
38
  "mistralai/mistral-nemo", # 0.08$
39
+ "google/gemini-2.5-pro", # $10
40
  "google/gemini-2.5-flash", # 0.6$
41
+ "google/gemini-2.5-flash-lite", # 0.3$
42
  "google/gemma-3-27b-it", # 0.2$
43
+ # "x-ai/grok-4", # $15
44
+ # "x-ai/grok-3", # $15
45
+ "cohere/command-a",
46
  "qwen/qwen3-32b",
47
  "qwen/qwen3-235b-a22b",
48
  "qwen/qwen3-30b-a3b", # 0.29$
 
50
  # "qwen/qwq-32b", # 0.2$
51
  # "qwen/qwen-2.5-72b-instruct", # 0.39$
52
  # "qwen/qwen-2-72b-instruct", # 0.9$
53
+ "deepseek/deepseek-v3.2-exp",
 
54
  "microsoft/phi-4", # 0.07$
55
+ "amazon/nova-pro-v1", # 0.09$
56
+ "moonshotai/kimi-k2", # 0.6$
57
+ "baidu/ernie-4.5-300b-a47b",
58
  ]
59
 
60
  blocklist = [
61
  "google/gemini-2.5-pro-preview",
62
+ # "google/gemini-2.5-pro",
63
  "google/gemini-2.5-flash-preview",
64
  "google/gemini-2.5-flash-lite-preview",
65
  "google/gemini-2.5-flash-preview-04-17",
 
67
  "google/gemini-2.5-flash-lite-preview-06-17",
68
  "google/gemini-2.5-pro-preview-06-05",
69
  "google/gemini-2.5-pro-preview-05-06",
70
+ "perplexity/sonar-deep-research",
71
+ "perplexity/sonar-reasoning",
72
+ "perplexity/sonar-reasoning-pro",
73
+ "qwen/qwen3-vl-30b-a3b-thinking",
74
+ "alpindale/goliath-120b"
75
  ]
76
 
77
  transcription_models = [
 
85
 
86
 
87
  @cache
88
+ def load_or_metadata(date: date):
89
  return get("https://openrouter.ai/api/frontend/models").json()["data"]
90
 
91
 
92
+ def get_or_metadata(permaslug):
93
+ models = load_or_metadata(date.today())
94
  slugs = [
95
  m
96
  for m in models
97
+ if (m["permaslug"] == permaslug or m["slug"] == permaslug)
98
+ # ensure that a provider endpoint is available
99
  and m["endpoint"]
100
+ # exclude free models
101
+ # the problem is that free models typically have very high rate-limiting
102
  and not m["endpoint"]["is_free"]
103
+ # exclude providers that train on user data
104
+ # this is crucial since we are submitting benchmark data
105
+ # make sure to additionally configure this in OpenRouter settings to avoid mistakes!
106
+ and m["endpoint"]["provider_info"]["dataPolicy"]["training"] is False
107
  ]
108
  if len(slugs) == 0:
109
+ print(f"no appropriate model (not free and no user data training) found for {permaslug}")
 
110
  return slugs[0] if len(slugs) >= 1 else None
111
 
112
 
113
  @cache
114
  def get_historical_popular_models(date: date):
115
+ # date parameter is used for daily caching
116
+ try:
117
+ raw = get("https://openrouter.ai/rankings").text
118
+
119
+ # Extract model data from rankingData using regex
120
+ # Find all count and model_permaslug pairs in the data
121
+ # Format: "count":number,"model_permaslug":"model/name"
122
+ pattern = r"\\\"count\\\":([\d.]+).*?\\\"model_permaslug\\\":\\\"([^\\\"]+)\\\""
123
+ matches = re.findall(pattern, raw)
124
+
125
+ if matches:
126
+ # Aggregate model counts
127
+ model_counts = {}
128
+ for count_str, model_slug in matches:
129
+ count = float(count_str)
130
+ if not model_slug.startswith("openrouter") and model_slug != "Others":
131
+ # Remove variant suffixes for aggregation
132
+ base_model = model_slug.split(":")[0]
133
+ model_counts[base_model] = model_counts.get(base_model, 0) + count
134
+
135
+ # Sort by popularity and return top models
136
+ sorted_models = sorted(
137
+ model_counts.items(), key=lambda x: x[1], reverse=True
138
+ )
139
+ result = []
140
+ for model_slug, count in sorted_models:
141
+ result.append({"slug": model_slug, "count": int(count)})
142
+
143
+ return result
144
+ else:
145
+ return []
146
+
147
+ except Exception as e:
148
+ return []
149
 
150
 
151
  @cache
152
  def get_current_popular_models(date: date):
153
+ # date parameter is used for daily caching
154
+ try:
155
+ raw = get("https://openrouter.ai/rankings?view=day").text
156
+
157
+ # Extract model data from daily rankings
158
+ # Find all count and model_permaslug pairs in the daily data
159
+ pattern = r"\\\"count\\\":([\d.]+).*?\\\"model_permaslug\\\":\\\"([^\\\"]+)\\\""
160
+ matches = re.findall(pattern, raw)
161
+
162
+ if matches:
163
+ # Aggregate model counts
164
+ model_counts = {}
165
+ for count_str, model_slug in matches:
166
+ count = float(count_str)
167
+ if not model_slug.startswith("openrouter") and model_slug != "Others":
168
+ # Remove variant suffixes for aggregation
169
+ base_model = model_slug.split(":")[0]
170
+ model_counts[base_model] = model_counts.get(base_model, 0) + count
171
+
172
+ # Sort by popularity and return top models
173
+ sorted_models = sorted(
174
+ model_counts.items(), key=lambda x: x[1], reverse=True
175
+ )
176
+ result = []
177
+ for model_slug, count in sorted_models:
178
+ result.append({"slug": model_slug, "count": int(count)})
179
+
180
+ return result
181
+ else:
182
+ return []
183
+
184
+ except Exception as e:
185
+ return []
186
 
187
 
188
  def get_translation_models():
 
193
  "name": "Google Translate",
194
  "provider_name": "Google",
195
  "cost": 20.0,
196
+ "train_on_prompts": False, # they don't do it in the API
197
  "size": None,
198
  "type": "closed-source",
199
  "license": None,
 
230
 
231
 
232
  translate_client = translate.Client()
233
+
234
+
235
+ def get_google_supported_languages():
236
+ return [l["language"] for l in translate_client.get_languages()]
237
 
238
 
239
  @cache
 
245
  return response["translatedText"]
246
 
247
 
248
+ # @cache
249
+ # async def transcribe_elevenlabs(path, model):
250
+ # modelname = model.split("/")[-1]
251
+ # client = AsyncElevenLabs(api_key=getenv("ELEVENLABS_API_KEY"))
252
+ # async with elevenlabs_rate_limit:
253
+ # with open(path, "rb") as file:
254
+ # response = await client.speech_to_text.convert(
255
+ # model_id=modelname, file=file
256
+ # )
257
+ # return response.text
258
 
259
 
260
+ # @cache
261
+ # async def transcribe_huggingface(path, model):
262
+ # client = AsyncInferenceClient(api_key=getenv("HUGGINGFACE_ACCESS_TOKEN"))
263
+ # async with huggingface_rate_limit:
264
+ # output = await client.automatic_speech_recognition(model=model, audio=path)
265
+ # return output.text
266
 
267
 
268
+ # async def transcribe(path, model="elevenlabs/scribe_v1"):
269
+ # provider, modelname = model.split("/")
270
+ # match provider:
271
+ # case "elevenlabs":
272
+ # return await transcribe_elevenlabs(path, modelname)
273
+ # case "openai" | "facebook":
274
+ # return await transcribe_huggingface(path, model)
275
+ # case _:
276
+ # raise ValueError(f"Model {model} not supported")
 
 
 
 
 
 
 
277
 
278
 
279
  api = HfApi()
 
296
  return empty
297
  try:
298
  info = api.model_info(id)
299
+ license = ""
300
+ if (
301
+ info.card_data
302
+ and hasattr(info.card_data, "license")
303
+ and info.card_data.license
304
+ ):
305
+ license = (
306
+ info.card_data.license.replace("-", " ").replace("mit", "MIT").title()
307
+ )
308
  return {
309
  "hf_id": info.id,
310
  "creation_date": info.created_at,
 
317
 
318
 
319
  def get_cost(row):
320
+ try:
321
+ cost = float(row["endpoint"]["pricing"]["completion"])
322
+ return round(cost * 1_000_000, 2)
323
+ except (TypeError, KeyError):
324
+ return None
325
+
326
+
327
+ def get_training_policy(row):
328
+ # get openrouter info whether the provider may train on prompts
329
+ # (this needs to be thoroughly avoided for our benchmark prompts!)
330
+ return row["endpoint"]["provider_info"]["dataPolicy"]["training"]
331
 
332
 
333
  @cache
334
+ def load_models(date: date) -> pd.DataFrame:
335
+ # popular_models = (
336
+ # get_historical_popular_models(date.today())[:20]
337
+ # + get_current_popular_models(date.today())[:10]
338
+ # )
339
+ popular_models = []
340
  popular_models = [m["slug"] for m in popular_models]
341
+ all_model_candidates = set(important_models + popular_models) - set(blocklist)
342
+
343
+ # Validate models exist on OpenRouter before including them
344
+ valid_models = []
345
+
346
+ for model_id in all_model_candidates:
347
+ metadata = get_or_metadata(model_id)
348
+ if metadata is not None:
349
+ valid_models.append(model_id)
350
+
351
+ models = pd.DataFrame(sorted(valid_models), columns=["id"])
352
+ or_metadata = models["id"].apply(get_or_metadata) # TODO this is double-doubled
353
  hf_metadata = or_metadata.apply(get_hf_metadata)
354
  creation_date_hf = pd.to_datetime(hf_metadata.str["creation_date"]).dt.date
355
  creation_date_or = pd.to_datetime(
 
361
  .str.replace(" (free)", "")
362
  .str.replace(" (self-moderated)", ""),
363
  provider_name=or_metadata.str["name"].str.split(": ").str[0],
364
+ # openrouter_metadata=or_metadata.astype(str),
365
  cost=or_metadata.apply(get_cost),
366
+ train_on_prompts=or_metadata.apply(get_training_policy),
367
  hf_id=hf_metadata.str["hf_id"],
368
  size=hf_metadata.str["size"],
369
  type=hf_metadata.str["type"],
370
  license=hf_metadata.str["license"],
371
  creation_date=creation_date_hf.combine_first(creation_date_or),
372
  )
373
+ models.to_json(
374
+ "models_unfiltered.json", orient="records", indent=2, force_ascii=False
375
+ )
376
+ # Filter out expensive models to keep costs reasonable
377
+ models = models[models["cost"] <= 15.0].reset_index(drop=True)
378
  models["tasks"] = [
379
+ [
380
+ "translation_from",
381
+ "translation_to",
382
+ "classification",
383
+ "mmlu",
384
+ "arc",
385
+ "truthfulqa",
386
+ "mgsm",
387
+ ]
388
  ] * len(models)
389
  models = pd.concat([models, get_translation_models()])
390
  return models
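
The most consequential new filter in `get_or_metadata` is the data-policy check, since benchmark prompts must never end up in provider training data. A compact sketch of the same guard applied to a raw endpoint record (field names follow the lookups above; the values are invented):

    def endpoint_is_acceptable(endpoint: dict) -> bool:
        # Reject free endpoints (aggressive rate limits) and any provider
        # that may train on submitted prompts (benchmark-leakage risk).
        return (
            bool(endpoint)
            and not endpoint["is_free"]
            and endpoint["provider_info"]["dataPolicy"]["training"] is False
        )

    assert endpoint_is_acceptable(
        {"is_free": False, "provider_info": {"dataPolicy": {"training": False}}}
    )
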
evals/plots.py CHANGED
@@ -9,34 +9,33 @@ df = pd.read_json("../results.json")
9
  df = df[df["metric"] != "chrf"]
10
  df = df.groupby(["task", "metric", "bcp_47"]).agg({"score": "mean"}).reset_index()
11
 
 
12
  # Apply logit transformation to classification scores to reduce skewness
13
  def transform_classification_scores(row):
14
- if row['task'] == 'classification':
15
  # Avoid division by zero and infinite values by clipping
16
- score = np.clip(row['score'], 0.001, 0.999)
17
  # Apply logit transformation (log(p/(1-p)))
18
  return logit(score)
19
  else:
20
- return row['score']
 
21
 
22
- df['score'] = df.apply(transform_classification_scores, axis=1)
23
 
24
  # Create a pivot table with tasks as columns and languages as rows
25
  pivot_df = df.pivot_table(
26
- values='score',
27
- index='bcp_47',
28
- columns='task',
29
- aggfunc='mean'
30
  )
31
 
32
  # Sort and filter tasks
33
  ordered_tasks = [
34
- 'translation_from',
35
- 'translation_to',
36
- 'classification',
37
- 'mmlu',
38
- 'arc',
39
- 'mgsm',
40
  ]
41
  # Drop 'truthfulqa' if present and reindex columns
42
  pivot_df = pivot_df[[task for task in ordered_tasks if task in pivot_df.columns]]
@@ -46,29 +45,29 @@ correlation_matrix = pivot_df.corr()
46
 
47
  # Create the correlation plot
48
  plt.figure(figsize=(8, 6))
49
- # Create mask for upper triangle including diagonal to show only lower triangle
50
  mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
51
 
52
  # Create a heatmap
53
  sns.heatmap(
54
- correlation_matrix,
55
- annot=True,
56
- cmap='Blues',
57
  center=0,
58
  square=True,
59
  mask=mask,
60
- cbar_kws={"shrink": .8},
61
- fmt='.3f'
62
  )
63
 
64
- plt.xlabel('Tasks', fontsize=12)
65
- plt.ylabel('Tasks', fontsize=12)
66
- plt.xticks(rotation=45, ha='right')
67
  plt.yticks(rotation=0)
68
  plt.tight_layout()
69
 
70
  # Save the plot
71
- plt.savefig('task_correlation_matrix.png', dpi=300, bbox_inches='tight')
72
  plt.show()
73
 
74
  # Print correlation values for reference
@@ -77,56 +76,91 @@ print("Note: Classification scores have been logit-transformed to reduce skewnes
77
  print(correlation_matrix.round(3))
78
 
79
  # Also create a scatter plot matrix for pairwise relationships with highlighted languages
80
- highlighted_languages = ['en', 'zh', 'hi', 'es', 'ar']
 
81
 
82
  # Create color mapping
83
  def get_color_and_label(lang_code):
84
  if lang_code in highlighted_languages:
85
- color_map = {'en': 'red', 'zh': 'blue', 'hi': 'green', 'es': 'orange', 'ar': 'purple'}
 
 
 
 
 
 
86
  return color_map[lang_code], lang_code
87
  else:
88
- return 'lightgray', 'Other'
 
89
 
90
  # Create custom scatter plot matrix
91
  tasks = pivot_df.columns.tolist()
92
  n_tasks = len(tasks)
93
 
94
  fig, axes = plt.subplots(n_tasks, n_tasks, figsize=(15, 12))
95
- fig.suptitle('Pairwise Task Performance', fontsize=16, fontweight='bold')
96
 
97
  # Create legend elements
98
  legend_elements = []
99
  for lang in highlighted_languages:
100
  color, _ = get_color_and_label(lang)
101
- legend_elements.append(plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=color, markersize=8, label=lang))
102
- legend_elements.append(plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='lightgray', markersize=8, label='Other'))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
103
 
104
  for i, task_y in enumerate(tasks):
105
  for j, task_x in enumerate(tasks):
106
  ax = axes[i, j]
107
-
108
  if i == j:
109
  # Diagonal: histogram
110
  task_data = pivot_df[task_y].dropna()
111
  colors = [get_color_and_label(lang)[0] for lang in task_data.index]
112
- ax.hist(task_data, bins=20, alpha=0.7, color='skyblue', edgecolor='black')
113
- ax.set_title(f'{task_y}', fontsize=10)
114
  else:
115
  # Off-diagonal: scatter plot
116
  for lang_code in pivot_df.index:
117
- if pd.notna(pivot_df.loc[lang_code, task_x]) and pd.notna(pivot_df.loc[lang_code, task_y]):
 
 
118
  color, _ = get_color_and_label(lang_code)
119
  alpha = 0.8 if lang_code in highlighted_languages else 0.3
120
  size = 50 if lang_code in highlighted_languages else 20
121
- ax.scatter(pivot_df.loc[lang_code, task_x], pivot_df.loc[lang_code, task_y],
122
- c=color, alpha=alpha, s=size)
123
-
 
 
 
 
 
124
  # Set labels
125
  if i == n_tasks - 1:
126
  ax.set_xlabel(task_x, fontsize=10)
127
  if j == 0:
128
  ax.set_ylabel(task_y, fontsize=10)
129
-
130
  # Remove tick labels except for edges
131
  if i != n_tasks - 1:
132
  ax.set_xticklabels([])
@@ -136,15 +170,15 @@ for i, task_y in enumerate(tasks):
136
  # Add legend
137
  fig.legend(
138
  handles=legend_elements,
139
- loc='lower center',
140
  bbox_to_anchor=(0.5, -0.05),
141
  ncol=len(legend_elements),
142
  frameon=False,
143
  fontsize=10,
144
  handletextpad=0.5,
145
- columnspacing=1.0
146
  )
147
 
148
  plt.tight_layout()
149
- plt.savefig('task_scatter_matrix.png', dpi=300, bbox_inches='tight')
150
  plt.show()
 
9
  df = df[df["metric"] != "chrf"]
10
  df = df.groupby(["task", "metric", "bcp_47"]).agg({"score": "mean"}).reset_index()
11
 
12
+
13
  # Apply logit transformation to classification scores to reduce skewness
14
  def transform_classification_scores(row):
15
+ if row["task"] == "classification":
16
  # Avoid division by zero and infinite values by clipping
17
+ score = np.clip(row["score"], 0.001, 0.999)
18
  # Apply logit transformation (log(p/(1-p)))
19
  return logit(score)
20
  else:
21
+ return row["score"]
22
+
23
 
24
+ df["score"] = df.apply(transform_classification_scores, axis=1)
25
 
26
  # Create a pivot table with tasks as columns and languages as rows
27
  pivot_df = df.pivot_table(
28
+ values="score", index="bcp_47", columns="task", aggfunc="mean"
 
 
 
29
  )
30
 
31
  # Sort and filter tasks
32
  ordered_tasks = [
33
+ "translation_from",
34
+ "translation_to",
35
+ "classification",
36
+ "mmlu",
37
+ "arc",
38
+ "mgsm",
39
  ]
40
  # Drop 'truthfulqa' if present and reindex columns
41
  pivot_df = pivot_df[[task for task in ordered_tasks if task in pivot_df.columns]]
 
45
 
46
  # Create the correlation plot
47
  plt.figure(figsize=(8, 6))
48
+ # Create mask for upper triangle including diagonal to show only lower triangle
49
  mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
50
 
51
  # Create a heatmap
52
  sns.heatmap(
53
+ correlation_matrix,
54
+ annot=True,
55
+ cmap="Blues",
56
  center=0,
57
  square=True,
58
  mask=mask,
59
+ cbar_kws={"shrink": 0.8},
60
+ fmt=".3f",
61
  )
62
 
63
+ plt.xlabel("Tasks", fontsize=12)
64
+ plt.ylabel("Tasks", fontsize=12)
65
+ plt.xticks(rotation=45, ha="right")
66
  plt.yticks(rotation=0)
67
  plt.tight_layout()
68
 
69
  # Save the plot
70
+ plt.savefig("task_correlation_matrix.png", dpi=300, bbox_inches="tight")
71
  plt.show()
72
 
73
  # Print correlation values for reference
 
76
  print(correlation_matrix.round(3))
77
 
78
  # Also create a scatter plot matrix for pairwise relationships with highlighted languages
79
+ highlighted_languages = ["en", "zh", "hi", "es", "ar"]
80
+
81
 
82
  # Create color mapping
83
  def get_color_and_label(lang_code):
84
  if lang_code in highlighted_languages:
85
+ color_map = {
86
+ "en": "red",
87
+ "zh": "blue",
88
+ "hi": "green",
89
+ "es": "orange",
90
+ "ar": "purple",
91
+ }
92
  return color_map[lang_code], lang_code
93
  else:
94
+ return "lightgray", "Other"
95
+
96
 
97
  # Create custom scatter plot matrix
98
  tasks = pivot_df.columns.tolist()
99
  n_tasks = len(tasks)
100
 
101
  fig, axes = plt.subplots(n_tasks, n_tasks, figsize=(15, 12))
102
+ fig.suptitle("Pairwise Task Performance", fontsize=16, fontweight="bold")
103
 
104
  # Create legend elements
105
  legend_elements = []
106
  for lang in highlighted_languages:
107
  color, _ = get_color_and_label(lang)
108
+ legend_elements.append(
109
+ plt.Line2D(
110
+ [0],
111
+ [0],
112
+ marker="o",
113
+ color="w",
114
+ markerfacecolor=color,
115
+ markersize=8,
116
+ label=lang,
117
+ )
118
+ )
119
+ legend_elements.append(
120
+ plt.Line2D(
121
+ [0],
122
+ [0],
123
+ marker="o",
124
+ color="w",
125
+ markerfacecolor="lightgray",
126
+ markersize=8,
127
+ label="Other",
128
+ )
129
+ )
130
 
131
  for i, task_y in enumerate(tasks):
132
  for j, task_x in enumerate(tasks):
133
  ax = axes[i, j]
134
+
135
  if i == j:
136
  # Diagonal: histogram
137
  task_data = pivot_df[task_y].dropna()
138
  colors = [get_color_and_label(lang)[0] for lang in task_data.index]
139
+ ax.hist(task_data, bins=20, alpha=0.7, color="skyblue", edgecolor="black")
140
+ ax.set_title(f"{task_y}", fontsize=10)
141
  else:
142
  # Off-diagonal: scatter plot
143
  for lang_code in pivot_df.index:
144
+ if pd.notna(pivot_df.loc[lang_code, task_x]) and pd.notna(
145
+ pivot_df.loc[lang_code, task_y]
146
+ ):
147
  color, _ = get_color_and_label(lang_code)
148
  alpha = 0.8 if lang_code in highlighted_languages else 0.3
149
  size = 50 if lang_code in highlighted_languages else 20
150
+ ax.scatter(
151
+ pivot_df.loc[lang_code, task_x],
152
+ pivot_df.loc[lang_code, task_y],
153
+ c=color,
154
+ alpha=alpha,
155
+ s=size,
156
+ )
157
+
158
  # Set labels
159
  if i == n_tasks - 1:
160
  ax.set_xlabel(task_x, fontsize=10)
161
  if j == 0:
162
  ax.set_ylabel(task_y, fontsize=10)
163
+
164
  # Remove tick labels except for edges
165
  if i != n_tasks - 1:
166
  ax.set_xticklabels([])
 
170
  # Add legend
171
  fig.legend(
172
  handles=legend_elements,
173
+ loc="lower center",
174
  bbox_to_anchor=(0.5, -0.05),
175
  ncol=len(legend_elements),
176
  frameon=False,
177
  fontsize=10,
178
  handletextpad=0.5,
179
+ columnspacing=1.0,
180
  )
181
 
182
  plt.tight_layout()
183
+ plt.savefig("task_scatter_matrix.png", dpi=300, bbox_inches="tight")
184
  plt.show()
evals/tasks.py CHANGED
@@ -1,19 +1,19 @@
1
  import random
 
2
  from functools import partial
3
  from textwrap import dedent
4
 
5
  import evaluate
6
- import pandas as pd
7
  import sentencepiece as spm
 
8
  from datasets_.flores import flores_sentences
9
  from datasets_.mgsm import load_mgsm, parse_number
10
  from datasets_.mmlu import load_mmlu
11
- from datasets_.arc import load_uhura_arc_easy
12
  from datasets_.truthfulqa import load_truthfulqa
13
  from google.cloud import translate_v2 as translate
14
  from langcodes import closest_supported_match
15
  from languages import languages, script_name
16
- from models import complete, transcribe, translate_google
17
 
18
  bleu = evaluate.load("bleu")
19
  chrf = evaluate.load("chrf")
@@ -30,6 +30,58 @@ target_languages = languages[languages["in_benchmark"]].sample(
30
  translate_client = translate.Client()
31
  supported_languages = [l["language"] for l in translate_client.get_languages()]
32
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
33
 
34
  async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
35
  original_language = languages[languages["bcp_47"] == bcp_47].iloc[0]
@@ -47,31 +99,24 @@ async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
47
  original_sentence = flores_sentences(original_language)["text"][sentence_nr].strip()
48
  target_sentence = flores_sentences(target_language)["text"][sentence_nr].strip()
49
  script = script_name(target_language.flores_path.split("_")[1])
 
50
  if model == "google/translate-v2":
51
  original_language = closest_supported_match(
52
- original_language, supported_languages
 
 
 
53
  )
54
- target_language = closest_supported_match(target_language, supported_languages)
55
  if original_language == target_language:
56
  prediction = original_sentence
57
  elif original_language is None or target_language is None:
58
  prediction = None
59
  else:
60
  prediction = await translate_google(
61
- original_sentence, original_language.bcp_47, target_language.bcp_47
62
  )
63
  else:
64
- prediction = await complete(
65
- model=model,
66
- messages=[
67
- {
68
- "role": "user",
69
- "content": f"Translate the following text to the {target_language.language_name} language; use the {script} script; reply only with the translation:\n\n{original_sentence}",
70
- }
71
- ],
72
- temperature=0,
73
- max_tokens=1024,
74
- )
75
  if prediction:
76
  bleu_score = bleu.compute(
77
  predictions=[prediction],
@@ -84,6 +129,7 @@ async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
84
  else:
85
  bleu_score = {"bleu": 0}
86
  chrf_score = {"score": 0}
 
87
  return [
88
  {
89
  "model": model,
@@ -91,7 +137,10 @@ async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
91
  "task": f"translation_{mode}",
92
  "metric": metric,
93
  "score": score,
 
94
  "sentence_nr": sentence_nr,
 
 
95
  }
96
  for metric, score in (
97
  ("bleu", bleu_score["bleu"]),
@@ -112,57 +161,27 @@ async def classify_and_evaluate(model, bcp_47, nr):
112
  )
113
  top_topics = paragraphs.value_counts("topic").head(5).index
114
  paragraphs = paragraphs[paragraphs["topic"].isin(top_topics)]
115
- examples = pd.concat(
116
- [
117
- paragraphs[paragraphs["topic"] == t].sample(n=1, random_state=42)
118
- for t in top_topics
119
- ]
120
- ).sample(frac=1, random_state=nr)
121
- test_paragraphs = paragraphs[~paragraphs["url"].isin(examples["url"])].sample(
122
- frac=1, random_state=42
123
- )
124
- test_paragraph = test_paragraphs.iloc[nr]
125
-
126
- def format_prompt(text):
127
- return f"{text}\n\nTopic: {'|'.join(top_topics)}?"
128
-
129
- messages = []
130
- for example in examples.itertuples():
131
- messages += [
132
- {"role": "user", "content": format_prompt(example.text)},
133
- {"role": "assistant", "content": example.topic},
134
- ]
135
- # some models have poor tokenization for some languages, and the prompt for this task is relatively long, so it sometimes exceeds the context window
136
- # this is not just to blame on the context window but mostly on the model's tokenization, so we assign 0 accuracy in this case
137
- try:
138
- pred = await complete(
139
- model=model,
140
- messages=[
141
- *messages,
142
- {
143
- "role": "user",
144
- "content": format_prompt(test_paragraph.text),
145
- },
146
- ],
147
- temperature=0,
148
- max_tokens=30,
149
- )
150
- true = test_paragraph.topic
151
- others = [t for t in top_topics if t != true]
152
- acc = (
153
- int(
154
- pred.startswith(true)
155
- or (true in pred and not any(o in pred for o in others))
156
- )
157
- if pred
158
- else 0
159
  )
160
- except Exception as e:
161
- if "`inputs` tokens + `max_new_tokens` must be <= 4097" in str(e):
162
- print(f"Max tokens exceeded for {model} in {bcp_47}")
163
- acc = 0
164
- else:
165
- raise e
166
  return [
167
  {
168
  "model": model,
@@ -170,101 +189,74 @@ async def classify_and_evaluate(model, bcp_47, nr):
170
  "task": "classification",
171
  "metric": "accuracy",
172
  "score": acc,
 
173
  "sentence_nr": nr,
 
 
174
  }
175
  ]
176
 
177
 
178
- def corrupt_sentence(sentence):
179
- # replace 5% of the sentence with <mask>
180
- mask_length = round(len(sentence) * 0.05)
181
- start = random.randint(0, len(sentence) - mask_length)
182
- end = start + mask_length
183
- return sentence[:start] + "<mask>" + sentence[end:]
184
-
185
-
186
- async def mlm_and_evaluate(model, language_bcp_47, nr):
187
- language = languages[languages["bcp_47"] == language_bcp_47].iloc[0]
188
- sentences = flores_sentences(language)
189
- if sentences is None:
190
- return []
191
- sentences = pd.DataFrame(sentences, columns=["text"])
192
- sentences["corrupt_text"] = sentences["text"].apply(corrupt_sentence)
193
- examples = sentences.sample(n=10, random_state=42)
194
- test_sentences = sentences[~sentences["text"].isin(examples["text"])].sample(
195
- frac=1, random_state=42
196
- )
197
- test_sentence = test_sentences.iloc[nr]
198
- messages = []
199
- for example in examples.itertuples():
200
- messages += [
201
- {"role": "user", "content": example.corrupt_text},
202
- {"role": "assistant", "content": example.text},
203
- ]
204
- prediction = await complete(
205
- model=model,
206
- messages=[
207
- *messages,
208
- {
209
- "role": "user",
210
- "content": test_sentence.corrupt_text,
211
- },
212
- ],
213
- temperature=0,
214
- max_tokens=1024,
215
- )
216
- chrf_score = chrf.compute(predictions=[prediction], references=[test_sentence.text])
217
- return [
218
- {
219
- "model": model,
220
- "bcp_47": language["bcp_47"],
221
- "task": "language_modeling",
222
- "metric": "chrf",
223
- "score": chrf_score["score"] / 100,
224
- "sentence_nr": nr,
225
- }
226
- ]
227
-
228
-
229
- def format_multiple_choice(item):
230
- return f"""{item["question"]}
231
-
232
- A: {item["choices"][0]}
233
- B: {item["choices"][1]}
234
- C: {item["choices"][2]}
235
- D: {item["choices"][3]}
236
-
237
- A|B|C|D?"""
238
 
239
 
240
  async def mmlu_and_evaluate(model, language_bcp_47, nr):
241
- ds_name, examples, task = load_mmlu(language_bcp_47, nr)
242
  if not task:
243
  return []
 
 
 
 
244
 
245
- messages = []
246
- for example in examples:
247
- messages += [
248
- {"role": "user", "content": format_multiple_choice(example)},
249
- {"role": "assistant", "content": example["answer"]},
250
- ]
251
- messages += [{"role": "user", "content": format_multiple_choice(task)}]
252
- try:
253
- response = await complete(
254
- model=model,
255
- messages=messages,
256
- temperature=0,
257
- max_tokens=1,
258
- )
259
- if response:
260
- acc = int(response[:1].strip() == task["answer"])
261
- else:
262
- acc = 0
263
- except Exception as e:
264
- if "ResponsibleAIPolicyViolation" in str(e):
265
- acc = 0
266
- else:
267
- raise e
268
  return [
269
  {
270
  "model": model,
@@ -272,39 +264,22 @@ async def mmlu_and_evaluate(model, language_bcp_47, nr):
272
  "task": "mmlu",
273
  "metric": "accuracy",
274
  "score": acc,
 
275
  "sentence_nr": nr,
 
 
276
  }
277
  ]
278
 
279
 
280
  async def arc_and_evaluate(model, language_bcp_47, nr):
281
- ds_name, examples, task = load_uhura_arc_easy(language_bcp_47, nr)
282
  if not task:
283
  return []
284
-
285
- messages = []
286
- for example in examples:
287
- messages += [
288
- {"role": "user", "content": format_multiple_choice(example)},
289
- {"role": "assistant", "content": example["answer"]},
290
- ]
291
- messages += [{"role": "user", "content": format_multiple_choice(task)}]
292
- try:
293
- response = await complete(
294
- model=model,
295
- messages=messages,
296
- temperature=0,
297
- max_tokens=1,
298
- )
299
- if response:
300
- acc = int(response[:1].strip() == task["answer"])
301
- else:
302
- acc = 0
303
- except Exception as e:
304
- if "ResponsibleAIPolicyViolation" in str(e):
305
- acc = 0
306
- else:
307
- raise e
308
  return [
309
  {
310
  "model": model,
@@ -312,7 +287,10 @@ async def arc_and_evaluate(model, language_bcp_47, nr):
312
  "task": "arc",
313
  "metric": "accuracy",
314
  "score": acc,
 
315
  "sentence_nr": nr,
 
 
316
  }
317
  ]
318
 
@@ -332,40 +310,19 @@ def format_multiple_choice_truthfulqa(item):
332
  text = item["question"] + "\n\n"
333
  for i, choice in enumerate(item["choices"]):
334
  text += f"{letters[i]}: {choice}\n"
335
- text += "|".join(letters[: len(item["choices"])]) + "?"
336
  return text
337
 
338
 
339
  async def truthfulqa_and_evaluate(model, language_bcp_47, nr):
340
- ds_name, examples, task = load_truthfulqa(language_bcp_47, nr)
341
  if not task:
342
  return []
343
- task = shuffle_choices_and_labels(task)
344
- answer = letters[task["labels"].index(1)]
345
- messages = []
346
- for example in examples:
347
- example = shuffle_choices_and_labels(example)
348
- messages += [
349
- {"role": "user", "content": format_multiple_choice_truthfulqa(example)},
350
- {"role": "assistant", "content": letters[example["labels"].index(1)]},
351
- ]
352
- messages += [{"role": "user", "content": format_multiple_choice_truthfulqa(task)}]
353
- try:
354
- response = await complete(
355
- model=model,
356
- messages=messages,
357
- temperature=0,
358
- max_tokens=1,
359
- )
360
- if response:
361
- acc = int(response[:1].strip() == answer)
362
- else:
363
- acc = 0
364
- except Exception as e:
365
- if "ResponsibleAIPolicyViolation" in str(e):
366
- acc = 0
367
- else:
368
- raise e
369
  return [
370
  {
371
  "model": model,
@@ -373,86 +330,86 @@ async def truthfulqa_and_evaluate(model, language_bcp_47, nr):
373
  "task": "truthfulqa",
374
  "metric": "accuracy",
375
  "score": acc,
 
376
  "sentence_nr": nr,
 
 
377
  }
378
  ]
379
 
380
 
381
  async def mgsm_and_evaluate(model, language_bcp_47, nr):
382
- system_prompt = """
383
- Solve the math problem. Use reasoning, and finally give the answer as a number.
384
- Response format: <reasoning> #### <number>
385
- """
386
- system_prompt = dedent(system_prompt).strip()
387
- ds_slug, question = load_mgsm(language_bcp_47, nr)
388
  if not question:
389
  return []
390
- response = await complete(
391
- model=model,
392
- messages=[
393
- {"role": "system", "content": system_prompt},
394
- {"role": "user", "content": question["question"]},
395
- ],
396
- temperature=0,
397
- max_tokens=1024,
398
- )
399
- if response and len(response.split("####")) == 2:
400
- number = response.split("####")[1].strip()
401
- accuracy = int(parse_number(number) == parse_number(question["answer_number"]))
402
- else:
403
- accuracy = 0
404
 
405
  return [
406
  {
407
  "model": model,
408
  "bcp_47": language_bcp_47,
409
  "task": "mgsm",
410
  "metric": "accuracy",
411
- "score": accuracy,
 
412
  "sentence_nr": nr,
 
 
413
  }
414
  ]
415
 
416
 
417
- async def transcribe_and_evaluate(model, language_bcp_47, nr):
418
- language = languages[languages["bcp_47"] == language_bcp_47].iloc[0]
419
- fleurs = pd.read_csv(
420
- f"data/fleurs/{language.fleurs_tag}/dev.tsv",
421
- sep="\t",
422
- names=[
423
- "id",
424
- "fname",
425
- "raw_transcription",
426
- "transcription",
427
- "words",
428
- "id2",
429
- "gender",
430
- ],
431
- )
432
- item = fleurs.iloc[nr]
433
- path = f"data/fleurs/{language.fleurs_tag}/audio/dev/{item.fname}"
434
- pred = await transcribe(path, model=model)
435
- wer_score = wer.compute(predictions=[pred], references=[item.transcription])
436
- return [
437
- {
438
- "model": model,
439
- "bcp_47": language["bcp_47"],
440
- "task": "asr",
441
- "metric": "wer",
442
- "score": wer_score,
443
- "sentence_nr": nr,
444
- }
445
- ]
446
 
447
 
448
  tasks = {
449
  "translation_from": partial(translate_and_evaluate, mode="from"),
450
  "translation_to": partial(translate_and_evaluate, mode="to"),
451
  "classification": classify_and_evaluate,
452
- # "mlm": mlm_and_evaluate,
453
  "mmlu": mmlu_and_evaluate,
454
  "arc": arc_and_evaluate,
455
  "truthfulqa": truthfulqa_and_evaluate,
456
  "mgsm": mgsm_and_evaluate,
457
- # "asr": transcribe_and_evaluate,
458
  }
 
1
  import random
2
+ import re
3
  from functools import partial
4
  from textwrap import dedent
5
 
6
  import evaluate
 
7
  import sentencepiece as spm
8
+ from datasets_.arc import load_uhura_arc_easy
9
  from datasets_.flores import flores_sentences
10
  from datasets_.mgsm import load_mgsm, parse_number
11
  from datasets_.mmlu import load_mmlu
 
12
  from datasets_.truthfulqa import load_truthfulqa
13
  from google.cloud import translate_v2 as translate
14
  from langcodes import closest_supported_match
15
  from languages import languages, script_name
16
+ from models import complete, translate_google
17
 
18
  bleu = evaluate.load("bleu")
19
  chrf = evaluate.load("chrf")
 
30
  translate_client = translate.Client()
31
  supported_languages = [l["language"] for l in translate_client.get_languages()]
32
 
33
+ async def query(model, prompt):
34
+ # this is just for sharing config across tasks
35
+ try:
36
+ response = await complete(
37
+ model=model,
38
+ messages=[{"role": "user", "content": prompt}],
39
+ temperature=0,
40
+ max_tokens=1024,
41
+ extra_body=dict(
42
+ reasoning=dict(
43
+ effort="low", # Can be "high", "medium", or "low" (OpenAI-style)
44
+ # max_tokens=1024, # Specific token limit (Anthropic-style)
45
+ # Optional: Default is false. All models support this.
46
+ exclude=True, # Set to true to exclude reasoning tokens from response
47
+ )
48
+ ),
49
+ )
50
+ except Exception as e:
51
+ print(f"exception for model {model}: {e}")
52
+ return None
53
+ # remove <think>...</think> sections (it's probably an OpenRouter bug that they are included)
54
+ response = re.sub(r"<think>.*</think>", "", response).strip()
55
+ # sometimes there's also a lone <think> at the start for some reason
56
+ response = re.sub(r"<think>", "", response).strip()
57
+ return response
58
+
59
+
60
+ reasoning_template = (
61
+ "Response format:<reasoning>...</reasoning><final_answer>...</final_answer>"
62
+ )
63
+
64
+
65
+ def format_multiple_choice(item):
66
+ return dedent(f"""
67
+ {reasoning_template}
68
+
69
+ ---
70
+
71
+ {item["question"]}
72
+
73
+ A: {item["choices"][0]}
74
+ B: {item["choices"][1]}
75
+ C: {item["choices"][2]}
76
+ D: {item["choices"][3]}""")
77
+
78
+
79
+ def extract_mc_response(response):
80
+ if not response:
81
+ return None
82
+ final_answer = re.search(r"\<final_answer\>(.*)\<\/final_answer\>", response)
83
+ return final_answer[1].strip() if final_answer else None
84
+
85
 
86
  async def translate_and_evaluate(model, bcp_47, sentence_nr, mode="from"):
87
  original_language = languages[languages["bcp_47"] == bcp_47].iloc[0]
 
99
  original_sentence = flores_sentences(original_language)["text"][sentence_nr].strip()
100
  target_sentence = flores_sentences(target_language)["text"][sentence_nr].strip()
101
  script = script_name(target_language.flores_path.split("_")[1])
102
+ translation_prompt = f"Translate the following text to the {target_language.language_name} language; use the {script} script; reply only with the translation:\n\n{original_sentence}"
103
  if model == "google/translate-v2":
104
  original_language = closest_supported_match(
105
+ original_language.bcp_47, supported_languages
106
+ )
107
+ target_language = closest_supported_match(
108
+ target_language.bcp_47, supported_languages
109
  )
 
110
  if original_language == target_language:
111
  prediction = original_sentence
112
  elif original_language is None or target_language is None:
113
  prediction = None
114
  else:
115
  prediction = await translate_google(
116
+ original_sentence, original_language, target_language
117
  )
118
  else:
119
+ prediction = await query(model, translation_prompt)
120
  if prediction:
121
  bleu_score = bleu.compute(
122
  predictions=[prediction],
 
129
  else:
130
  bleu_score = {"bleu": 0}
131
  chrf_score = {"score": 0}
132
+
133
  return [
134
  {
135
  "model": model,
 
137
  "task": f"translation_{mode}",
138
  "metric": metric,
139
  "score": score,
140
+ "origin": "human", # FLORES+ is human-translated
141
  "sentence_nr": sentence_nr,
142
+ "prompt": translation_prompt,
143
+ "response": prediction,
144
  }
145
  for metric, score in (
146
  ("bleu", bleu_score["bleu"]),
 
161
  )
162
  top_topics = paragraphs.value_counts("topic").head(5).index
163
  paragraphs = paragraphs[paragraphs["topic"].isin(top_topics)]
164
+ test_paragraph = paragraphs.sample(n=1, random_state=nr).iloc[0]
165
+
166
+ prompt = f"""Classify the following text into one of these topics: {", ".join(top_topics)}.
167
+ Reply with only the topic name.
168
+
169
+ Text:
170
+ {test_paragraph.text}
171
+ """
172
+ response = await query(model, prompt)
173
+ pred = response.lower().strip() if response else ""
174
+ true = test_paragraph.topic.lower().strip()
175
+ others = [t for t in top_topics if t != true]
176
+ acc = (
177
+ int(
178
+ pred.startswith(true)
179
+ or (true in pred and not any(o in pred for o in others))
180
  )
181
+ if pred
182
+ else 0
183
+ )
184
+
 
 
185
  return [
186
  {
187
  "model": model,
 
189
  "task": "classification",
190
  "metric": "accuracy",
191
  "score": acc,
192
+ "origin": "human", # FLORES+ is human-translated
193
  "sentence_nr": nr,
194
+ "prompt": prompt,
195
+ "response": pred,
196
  }
197
  ]
198
 
199
 
200
+ # def corrupt_sentence(sentence):
201
+ # # replace 5% of the sentence with <mask>
202
+ # mask_length = round(len(sentence) * 0.05)
203
+ # start = random.randint(0, len(sentence) - mask_length)
204
+ # end = start + mask_length
205
+ # return sentence[:start] + "<mask>" + sentence[end:]
206
+
207
+
208
+ # async def mlm_and_evaluate(model, language_bcp_47, nr):
209
+ # language = languages[languages["bcp_47"] == language_bcp_47].iloc[0]
210
+ # sentences = flores_sentences(language)
211
+ # if sentences is None:
212
+ # return []
213
+ # sentences = pd.DataFrame(sentences, columns=["text"])
214
+ # sentences["corrupt_text"] = sentences["text"].apply(corrupt_sentence)
215
+ # examples = sentences.sample(n=10, random_state=42)
216
+ # test_sentences = sentences[~sentences["text"].isin(examples["text"])].sample(
217
+ # frac=1, random_state=42
218
+ # )
219
+ # test_sentence = test_sentences.iloc[nr]
220
+ # messages = []
221
+ # for example in examples.itertuples():
222
+ # messages += [
223
+ # {"role": "user", "content": example.corrupt_text},
224
+ # {"role": "assistant", "content": example.text},
225
+ # ]
226
+ # prediction = await complete(
227
+ # model=model,
228
+ # messages=[
229
+ # *messages,
230
+ # {
231
+ # "role": "user",
232
+ # "content": test_sentence.corrupt_text,
233
+ # },
234
+ # ],
235
+ # temperature=0,
236
+ # max_tokens=1024,
237
+ # )
238
+ # chrf_score = chrf.compute(predictions=[prediction], references=[test_sentence.text])
239
+ # return [
240
+ # {
241
+ # "model": model,
242
+ # "bcp_47": language["bcp_47"],
243
+ # "task": "language_modeling",
244
+ # "metric": "chrf",
245
+ # "score": chrf_score["score"] / 100,
246
+ # "sentence_nr": nr,
247
+ # }
248
+ # ]
249
 
250
 
251
  async def mmlu_and_evaluate(model, language_bcp_47, nr):
252
+ ds_name, task, origin = await load_mmlu(language_bcp_47, nr)
253
  if not task:
254
  return []
255
+ prompt = f"""Solve the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.\n\n{format_multiple_choice(task)}"""
256
+ response = await query(model, prompt)
257
+ final_response = extract_mc_response(response)
258
+ acc = int(final_response == task["answer"]) if final_response else 0
259
260
  return [
261
  {
262
  "model": model,
 
264
  "task": "mmlu",
265
  "metric": "accuracy",
266
  "score": acc,
267
+ "origin": origin,
268
  "sentence_nr": nr,
269
+ "prompt": prompt,
270
+ "response": response,
271
  }
272
  ]
273
 
274
 
275
  async def arc_and_evaluate(model, language_bcp_47, nr):
276
+ ds_name, task, origin = load_uhura_arc_easy(language_bcp_47, nr)
277
  if not task:
278
  return []
279
+ prompt = f"Solve the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.\n\n{format_multiple_choice(task)}"
280
+ response = await query(model, prompt)
281
+ final_response = extract_mc_response(response)
282
+ acc = int(final_response == task["answer"]) if final_response else 0
283
  return [
284
  {
285
  "model": model,
 
287
  "task": "arc",
288
  "metric": "accuracy",
289
  "score": acc,
290
+ "origin": origin,
291
  "sentence_nr": nr,
292
+ "prompt": prompt,
293
+ "response": response,
294
  }
295
  ]
296
 
 
310
  text = item["question"] + "\n\n"
311
  for i, choice in enumerate(item["choices"]):
312
  text += f"{letters[i]}: {choice}\n"
 
313
  return text
314
 
315
 
316
  async def truthfulqa_and_evaluate(model, language_bcp_47, nr):
317
+ ds_name, task, origin = await load_truthfulqa(language_bcp_47, nr)
318
  if not task:
319
  return []
320
+ correct_choice_index = task["labels"].index(1)
321
+ answer = letters[correct_choice_index]
322
+ prompt = f"""Answer the following multiple choice question. Reason step-by-step and then write the final answer as a single letter.\n\n{format_multiple_choice_truthfulqa(task)}"""
323
+ response = await query(model, prompt)
324
+ final_response = extract_mc_response(response)
325
+ acc = int(final_response.upper() == answer) if final_response else 0
326
  return [
327
  {
328
  "model": model,
 
330
  "task": "truthfulqa",
331
  "metric": "accuracy",
332
  "score": acc,
333
+ "origin": origin,
334
  "sentence_nr": nr,
335
+ "prompt": prompt,
336
+ "response": response,
337
  }
338
  ]
339
 
340
 
341
  async def mgsm_and_evaluate(model, language_bcp_47, nr):
342
+ ds_slug, question, origin = load_mgsm(language_bcp_47, nr)
343
  if not question:
344
  return []
345
 
346
+ prompt = dedent(f"""
347
+ Solve the following math problem. Reason step-by-step and then write the final answer as a single number.
348
+
349
+ {reasoning_template}
350
+
351
+ ---
352
+
353
+ {question["question"]}""").strip()
354
+ response = await query(model, prompt)
355
+ number = extract_mc_response(response)
356
+ acc = (
357
+ int(parse_number(number) == parse_number(question["answer_number"]))
358
+ if number
359
+ else 0
360
+ )
361
  return [
362
  {
363
  "model": model,
364
  "bcp_47": language_bcp_47,
365
  "task": "mgsm",
366
  "metric": "accuracy",
367
+ "score": acc,
368
+ "origin": origin,
369
  "sentence_nr": nr,
370
+ "prompt": prompt,
371
+ "response": response,
372
  }
373
  ]
374
 
375
 
376
+ # async def transcribe_and_evaluate(model, language_bcp_47, nr):
377
+ # language = languages[languages["bcp_47"] == language_bcp_47].iloc[0]
378
+ # fleurs = pd.read_csv(
379
+ # f"data/fleurs/{language.fleurs_tag}/dev.tsv",
380
+ # sep="\t",
381
+ # names=[
382
+ # "id",
383
+ # "fname",
384
+ # "raw_transcription",
385
+ # "transcription",
386
+ # "words",
387
+ # "id2",
388
+ # "gender",
389
+ # ],
390
+ # )
391
+ # item = fleurs.iloc[nr]
392
+ # path = f"data/fleurs/{language.fleurs_tag}/audio/dev/{item.fname}"
393
+ # pred = await transcribe(path, model=model)
394
+ # wer_score = wer.compute(predictions=[pred], references=[item.transcription])
395
+ # return [
396
+ # {
397
+ # "model": model,
398
+ # "bcp_47": language["bcp_47"],
399
+ # "task": "asr",
400
+ # "metric": "wer",
401
+ # "score": wer_score,
402
+ # "sentence_nr": nr,
403
+ # }
404
+ # ]
405
 
406
 
407
  tasks = {
408
  "translation_from": partial(translate_and_evaluate, mode="from"),
409
  "translation_to": partial(translate_and_evaluate, mode="to"),
410
  "classification": classify_and_evaluate,
 
411
  "mmlu": mmlu_and_evaluate,
412
  "arc": arc_and_evaluate,
413
  "truthfulqa": truthfulqa_and_evaluate,
414
  "mgsm": mgsm_and_evaluate,
 
415
  }
evals/translate.py CHANGED
@@ -6,4 +6,4 @@ from datasets_.mmlu import translate_mmlu
6
  if __name__ == "__main__":
7
  translate_mmlu(languages)
8
  translate_mgsm(languages)
9
- translate_arc(languages)
 
6
  if __name__ == "__main__":
7
  translate_mmlu(languages)
8
  translate_mgsm(languages)
9
+ translate_arc(languages)
frontend/package-lock.json CHANGED
The diff for this file is too large to render. See raw diff
 
frontend/package.json CHANGED
@@ -6,13 +6,12 @@
6
  "@observablehq/plot": "^0.6.17",
7
  "@testing-library/dom": "^10.4.0",
8
  "@testing-library/jest-dom": "^6.6.3",
9
- "@testing-library/react": "^16.2.0",
10
  "@testing-library/user-event": "^13.5.0",
11
  "primeicons": "^7.0.0",
12
  "primereact": "^10.9.3",
13
- "react": "^19.0.0",
14
- "react-dom": "^19.0.0",
15
- "react-scripts": "5.0.1",
16
  "topojson-simplify": "^3.0.3",
17
  "web-vitals": "^2.1.4"
18
  },
@@ -41,5 +40,8 @@
41
  "last 1 safari version"
42
  ]
43
  },
44
- "proxy": "http://localhost:8000"
 
 
 
45
  }
 
6
  "@observablehq/plot": "^0.6.17",
7
  "@testing-library/dom": "^10.4.0",
8
  "@testing-library/jest-dom": "^6.6.3",
9
+ "@testing-library/react": "^15.0.0",
10
  "@testing-library/user-event": "^13.5.0",
11
  "primeicons": "^7.0.0",
12
  "primereact": "^10.9.3",
13
+ "react": "^18.2.0",
14
+ "react-dom": "^18.2.0",
 
15
  "topojson-simplify": "^3.0.3",
16
  "web-vitals": "^2.1.4"
17
  },
 
40
  "last 1 safari version"
41
  ]
42
  },
43
+ "proxy": "http://localhost:8000",
44
+ "devDependencies": {
45
+ "react-scripts": "^5.0.1"
46
+ }
47
  }
frontend/public/sw.js ADDED
@@ -0,0 +1,9 @@
1
+ // Unregister service worker
2
+ self.addEventListener('install', () => {
3
+ self.skipWaiting();
4
+ });
5
+
6
+ self.addEventListener('activate', () => {
7
+ self.registration.unregister();
8
+ });
9
+
frontend/src/App.js CHANGED
@@ -16,12 +16,18 @@ import { Button } from 'primereact/button'
16
 
17
  function App () {
18
  const [data, setData] = useState(null)
 
19
  const [loading, setLoading] = useState(true)
20
  const [error, setError] = useState(null)
21
  const [selectedLanguages, setSelectedLanguages] = useState([])
 
22
  const [dialogVisible, setDialogVisible] = useState(false)
23
  const [aboutVisible, setAboutVisible] = useState(false)
24
  const [contributeVisible, setContributeVisible] = useState(false)
 
 
 
 
25
 
26
  useEffect(() => {
27
  fetch('/api/data', {
@@ -36,6 +42,8 @@ function App () {
36
  })
37
  .then(jsonData => {
38
  setData(jsonData)
 
 
39
  setLoading(false)
40
  })
41
  .catch(err => {
@@ -44,8 +52,27 @@ function App () {
44
  })
45
  }, [selectedLanguages])
46
 
47
  const [windowWidth, setWindowWidth] = useState(window.innerWidth)
48
  const [windowHeight, setWindowHeight] = useState(window.innerHeight)
 
49
  useEffect(() => {
50
  const handleResize = () => {
51
  setWindowWidth(window.innerWidth)
@@ -55,6 +82,44 @@ function App () {
55
  return () => window.removeEventListener('resize', handleResize)
56
  }, [])
57
 
58
  return (
59
  <PrimeReactProvider>
60
  <div
@@ -69,35 +134,50 @@ function App () {
69
  style={{
70
  backgroundColor: '#fff3cd',
71
  color: '#856404',
72
- padding: '0.75rem 1.25rem',
73
  marginBottom: '1rem',
74
  border: '1px solid #ffeeba',
75
  borderRadius: '0.25rem',
76
- textAlign: 'center'
 
 
77
  }}
78
  >
79
  <strong>Work in Progress:</strong> This dashboard is currently under
80
- active development. Evaluation results are not yet final.
  <a
82
  href='https://github.com/datenlabor-bmz/ai-language-monitor'
83
  target='_blank'
84
  rel='noopener noreferrer'
85
  style={{
86
  textDecoration: 'none',
87
- color: '#856404',
88
- float: 'right',
89
- fontSize: '1.2rem',
90
- fontWeight: 'bold',
91
- padding: '0 0.5rem',
92
- borderRadius: '3px',
93
- backgroundColor: 'rgba(255,255,255,0.3)'
  }}
95
  >
96
- <i
97
- className='pi pi-github'
98
- title='View on GitHub'
99
- style={{ marginRight: '0.3rem' }}
100
- />
101
  GitHub
102
  </a>
103
  </div>
@@ -149,39 +229,88 @@ function App () {
149
  <div
150
  style={{
151
  display: 'flex',
152
- gap: '1rem',
153
- marginBottom: '1.5rem',
154
  flexWrap: 'wrap',
155
  justifyContent: 'center'
156
  }}
157
  >
158
- <Button
159
- label='📚 About this tool'
160
- className='p-button-text'
161
  onClick={() => setAboutVisible(true)}
162
  style={{
163
- color: '#666',
164
- border: '1px solid #ddd',
165
- padding: '0.5rem 1rem',
166
- borderRadius: '4px',
167
- fontSize: '0.9rem'
  }}
169
- />
 
 
171
- <Button
172
- label='πŸš€ Add your model (soon)'
173
- className='p-button-text'
174
  onClick={() => setContributeVisible(true)}
175
- tooltip='This feature is on our roadmap and will be available soon.'
176
- tooltipOptions={{ position: 'bottom' }}
177
  style={{
178
- color: '#666',
179
- border: '1px solid #ddd',
180
- padding: '0.5rem 1rem',
181
- borderRadius: '4px',
182
- fontSize: '0.9rem'
 
  }}
184
- />
 
  </div>
186
 
187
  {data && (
@@ -220,6 +349,7 @@ function App () {
220
  data={data.model_table}
221
  selectedLanguages={selectedLanguages}
222
  allLanguages={data.language_table || []}
 
223
  />
224
  <LanguageTable
225
  data={data.language_table}
@@ -248,20 +378,18 @@ function App () {
248
  color: '#666'
249
  }}
250
  />
251
- <Carousel
252
- value={[
253
- <WorldMap data={data.countries} />,
254
- <LanguagePlot data={data} />,
255
- <SpeakerPlot data={data} />,
256
- <HistoryPlot data={data} />,
257
- <CostPlot data={data} />
258
- ]}
259
- numScroll={1}
260
- numVisible={1}
261
- itemTemplate={item => item}
262
- circular
263
- style={{ width: '100%', minHeight: '650px' }}
264
- />
265
  </div>
266
  </>
267
  )}
@@ -409,36 +537,16 @@ function App () {
409
  modal
410
  header={null}
411
  >
412
- {data && (
413
  <div style={{ width: '100%', height: '100%' }}>
414
  <Carousel
415
- value={[
416
- <WorldMap
417
- data={data.countries}
418
- width={windowWidth * 0.7}
419
- height={windowHeight * 0.6}
420
- />,
421
- <LanguagePlot
422
- data={data}
423
- width={windowWidth * 0.7}
424
- height={windowHeight * 0.6}
425
- />,
426
- <SpeakerPlot
427
- data={data}
428
- width={windowWidth * 0.7}
429
- height={windowHeight * 0.6}
430
- />,
431
- <HistoryPlot
432
- data={data}
433
- width={windowWidth * 0.7}
434
- height={windowHeight * 0.6}
435
- />,
436
- <CostPlot data={data} />
437
- ]}
438
  numScroll={1}
439
  numVisible={1}
440
  itemTemplate={item => item}
441
- circular
 
442
  style={{ width: '100%', height: 'calc(90vh - 120px)' }}
443
  />
444
  </div>
@@ -449,4 +557,4 @@ function App () {
449
  )
450
  }
451
 
452
- export default App
 
16
 
17
  function App () {
18
  const [data, setData] = useState(null)
19
+ const [baseData, setBaseData] = useState(null)
20
  const [loading, setLoading] = useState(true)
21
  const [error, setError] = useState(null)
22
  const [selectedLanguages, setSelectedLanguages] = useState([])
23
+ const [machineTranslatedMetrics, setMachineTranslatedMetrics] = useState([])
24
  const [dialogVisible, setDialogVisible] = useState(false)
25
  const [aboutVisible, setAboutVisible] = useState(false)
26
  const [contributeVisible, setContributeVisible] = useState(false)
27
+
28
+ // Add state for carousel items
29
+ const [carouselItems, setCarouselItems] = useState([])
30
+ const [fullScreenCarouselItems, setFullScreenCarouselItems] = useState([])
31
 
32
  useEffect(() => {
33
  fetch('/api/data', {
 
42
  })
43
  .then(jsonData => {
44
  setData(jsonData)
45
+ setMachineTranslatedMetrics(jsonData.machine_translated_metrics || [])
46
+ if (!baseData) setBaseData(jsonData)
47
  setLoading(false)
48
  })
49
  .catch(err => {
 
52
  })
53
  }, [selectedLanguages])
54
 
55
+ // Create carousel items when data is loaded
56
+ useEffect(() => {
57
+ if (data) {
58
+ // Add a small delay to ensure components are ready
59
+ const timer = setTimeout(() => {
60
+ setCarouselItems([
61
+ <WorldMap key="worldmap-0" data={(baseData || data).countries} allLanguages={(baseData || data).language_table} width={750} height={500} />,
62
+ <LanguagePlot key="langplot-1" data={data} width={750} height={500} />,
63
+ <SpeakerPlot key="speakerplot-2" data={data} width={750} height={500} />,
64
+ <HistoryPlot key="histplot-3" data={data} width={750} height={500} />,
65
+ <CostPlot key="costplot-4" data={data} width={750} height={500} />
66
+ ]);
67
+ }, 100);
68
+
69
+ return () => clearTimeout(timer);
70
+ }
71
+ }, [data, baseData])
72
+
73
  const [windowWidth, setWindowWidth] = useState(window.innerWidth)
74
  const [windowHeight, setWindowHeight] = useState(window.innerHeight)
75
+
76
  useEffect(() => {
77
  const handleResize = () => {
78
  setWindowWidth(window.innerWidth)
 
82
  return () => window.removeEventListener('resize', handleResize)
83
  }, [])
84
 
85
+ // Create full-screen carousel items when data or window size changes
86
+ useEffect(() => {
87
+ if (data) {
88
+ const timer = setTimeout(() => {
89
+ setFullScreenCarouselItems([
90
+ <WorldMap
91
+ key="fs-worldmap-0"
92
+ data={(baseData || data).countries}
93
+ allLanguages={(baseData || data).language_table}
94
+ width={windowWidth * 0.7}
95
+ height={windowHeight * 0.6}
96
+ />,
97
+ <LanguagePlot
98
+ key="fs-langplot-1"
99
+ data={data}
100
+ width={windowWidth * 0.7}
101
+ height={windowHeight * 0.6}
102
+ />,
103
+ <SpeakerPlot
104
+ key="fs-speakerplot-2"
105
+ data={data}
106
+ width={windowWidth * 0.7}
107
+ height={windowHeight * 0.6}
108
+ />,
109
+ <HistoryPlot
110
+ key="fs-histplot-3"
111
+ data={data}
112
+ width={windowWidth * 0.7}
113
+ height={windowHeight * 0.6}
114
+ />,
115
+ <CostPlot key="fs-costplot-4" data={data} width={windowWidth * 0.7} height={windowHeight * 0.6} />
116
+ ]);
117
+ }, 100);
118
+
119
+ return () => clearTimeout(timer);
120
+ }
121
+ }, [data, baseData, windowWidth, windowHeight])
122
+
123
  return (
124
  <PrimeReactProvider>
125
  <div
 
134
  style={{
135
  backgroundColor: '#fff3cd',
136
  color: '#856404',
137
+ padding: '1rem 1.5rem',
138
  marginBottom: '1rem',
139
  border: '1px solid #ffeeba',
140
  borderRadius: '0.25rem',
141
+ textAlign: 'center',
142
+ lineHeight: '1.5',
143
+ position: 'relative'
144
  }}
145
  >
146
  <strong>Work in Progress:</strong> This dashboard is currently under
147
+ active development. Evaluation results are not yet final. More extensive evaluation runs will be released later this year.
148
+ </div>
149
+ <div
150
+ style={{
151
+ display: 'flex',
152
+ justifyContent: 'flex-end',
153
+ padding: '0 1.5rem',
154
+ marginBottom: '1rem'
155
+ }}
156
+ >
157
  <a
158
  href='https://github.com/datenlabor-bmz/ai-language-monitor'
159
  target='_blank'
160
  rel='noopener noreferrer'
161
  style={{
162
  textDecoration: 'none',
163
+ color: '#6c757d',
164
+ fontSize: '1rem',
165
+ fontWeight: '500',
166
+ padding: '0.5rem 1rem',
167
+ borderRadius: '0.375rem',
168
+ backgroundColor: '#f8f9fa',
169
+ border: '1px solid #e9ecef',
170
+ display: 'flex',
171
+ alignItems: 'center',
172
+ gap: '0.5rem',
173
+ transition: 'all 0.2s ease',
174
+ ':hover': {
175
+ backgroundColor: '#e9ecef',
176
+ color: '#495057'
177
+ }
178
  }}
179
  >
180
+ <i className='pi pi-github' title='View on GitHub' />
 
 
 
 
181
  GitHub
182
  </a>
183
  </div>
 
229
  <div
230
  style={{
231
  display: 'flex',
232
+ gap: '0.75rem',
233
+ marginBottom: '2rem',
234
  flexWrap: 'wrap',
235
  justifyContent: 'center'
236
  }}
237
  >
238
+ <button
 
 
239
  onClick={() => setAboutVisible(true)}
240
  style={{
241
+ background: 'linear-gradient(135deg, #667eea 0%, #764ba2 100%)',
242
+ color: 'white',
243
+ border: 'none',
244
+ padding: '0.75rem 1.5rem',
245
+ borderRadius: '12px',
246
+ fontSize: '0.95rem',
247
+ fontWeight: '500',
248
+ cursor: 'pointer',
249
+ display: 'flex',
250
+ alignItems: 'center',
251
+ gap: '0.5rem',
252
+ boxShadow: '0 4px 15px rgba(102, 126, 234, 0.25)',
253
+ transition: 'all 0.3s ease',
254
+ ':hover': {
255
+ transform: 'translateY(-2px)',
256
+ boxShadow: '0 8px 25px rgba(102, 126, 234, 0.35)'
257
+ }
258
  }}
259
+ onMouseEnter={(e) => {
260
+ e.target.style.transform = 'translateY(-2px)';
261
+ e.target.style.boxShadow = '0 8px 25px rgba(102, 126, 234, 0.35)';
262
+ }}
263
+ onMouseLeave={(e) => {
264
+ e.target.style.transform = 'translateY(0)';
265
+ e.target.style.boxShadow = '0 4px 15px rgba(102, 126, 234, 0.25)';
266
+ }}
267
+ >
268
+ <span style={{ fontSize: '1.1rem' }}>📚</span>
269
+ About this tool
270
+ </button>
271
 
272
+ <button
 
 
273
  onClick={() => setContributeVisible(true)}
274
+ title='This feature is on our roadmap and will be available soon.'
 
275
  style={{
276
+ background: 'linear-gradient(135deg, #ff9a9e 0%, #fecfef 50%, #fecfef 100%)',
277
+ color: '#6b46c1',
278
+ border: 'none',
279
+ padding: '0.75rem 1.5rem',
280
+ borderRadius: '12px',
281
+ fontSize: '0.95rem',
282
+ fontWeight: '500',
283
+ cursor: 'pointer',
284
+ display: 'flex',
285
+ alignItems: 'center',
286
+ gap: '0.5rem',
287
+ boxShadow: '0 4px 15px rgba(255, 154, 158, 0.25)',
288
+ transition: 'all 0.3s ease',
289
+ position: 'relative',
290
+ overflow: 'hidden'
291
  }}
292
+ onMouseEnter={(e) => {
293
+ e.target.style.transform = 'translateY(-2px)';
294
+ e.target.style.boxShadow = '0 8px 25px rgba(255, 154, 158, 0.35)';
295
+ }}
296
+ onMouseLeave={(e) => {
297
+ e.target.style.transform = 'translateY(0)';
298
+ e.target.style.boxShadow = '0 4px 15px rgba(255, 154, 158, 0.25)';
299
+ }}
300
+ >
301
+ <span style={{ fontSize: '1.1rem' }}>🚀</span>
302
+ Add your model
303
+ <span style={{
304
+ fontSize: '0.75rem',
305
+ backgroundColor: 'rgba(107, 70, 193, 0.15)',
306
+ padding: '0.2rem 0.5rem',
307
+ borderRadius: '6px',
308
+ marginLeft: '0.5rem',
309
+ fontWeight: '600'
310
+ }}>
311
+ soon
312
+ </span>
313
+ </button>
314
  </div>
315
 
316
  {data && (
 
349
  data={data.model_table}
350
  selectedLanguages={selectedLanguages}
351
  allLanguages={data.language_table || []}
352
+ machineTranslatedMetrics={machineTranslatedMetrics}
353
  />
354
  <LanguageTable
355
  data={data.language_table}
 
378
  color: '#666'
379
  }}
380
  />
381
+ {carouselItems.length > 0 && (
382
+ <Carousel
383
+ key={`main-carousel-${carouselItems.length}-${Date.now()}`}
384
+ value={carouselItems}
385
+ numScroll={1}
386
+ numVisible={1}
387
+ itemTemplate={item => item}
388
+ circular={false}
389
+ activeIndex={0}
390
+ style={{ width: '100%', minHeight: '650px' }}
391
+ />
392
+ )}
 
 
393
  </div>
394
  </>
395
  )}
 
537
  modal
538
  header={null}
539
  >
540
+ {fullScreenCarouselItems.length > 0 && (
541
  <div style={{ width: '100%', height: '100%' }}>
542
  <Carousel
543
+ key={`fs-carousel-${fullScreenCarouselItems.length}-${Date.now()}`}
544
+ value={fullScreenCarouselItems}
545
  numScroll={1}
546
  numVisible={1}
547
  itemTemplate={item => item}
548
+ circular={false}
549
+ activeIndex={0}
550
  style={{ width: '100%', height: 'calc(90vh - 120px)' }}
551
  />
552
  </div>
 
557
  )
558
  }
559
 
560
+ export default App
frontend/src/components/HistoryPlot.js CHANGED
@@ -50,12 +50,12 @@ const HistoryPlot = ({ data, width = 750, height = 500 }) => {
50
  ...models.filter(d => d.newRecord),
51
  {
52
  creation_date: new Date(),
53
- maxAverage: models[models.length - 1].maxAverage
54
  }
55
  ],
56
  {
57
  x: d => d.creation_date,
58
- y: d => d.maxAverage,
59
  curve: 'step-after',
60
  strokeOpacity: 0.3
61
  }
 
50
  ...models.filter(d => d.newRecord),
51
  {
52
  creation_date: new Date(),
53
+ maxAverage: models[models.length - 1]?.maxAverage || 0
54
  }
55
  ],
56
  {
57
  x: d => d.creation_date,
58
+ y: d => d.maxAverage || 0,
59
  curve: 'step-after',
60
  strokeOpacity: 0.3
61
  }
frontend/src/components/LanguageTable.js CHANGED
@@ -172,7 +172,7 @@ const LanguageTable = ({ data, selectedLanguages, setSelectedLanguages, totalMod
172
  filterElement={familyRowFilterTemplate}
173
  style={{ minWidth: '10rem' }}
174
  />
175
- {ScoreColumns}
176
  </DataTable>
177
  )
178
  }
 
172
  filterElement={familyRowFilterTemplate}
173
  style={{ minWidth: '10rem' }}
174
  />
175
+ {ScoreColumns()}
176
  </DataTable>
177
  )
178
  }
frontend/src/components/ModelTable.js CHANGED
@@ -6,7 +6,7 @@ import { useState, useEffect } from 'react'
6
  import Medal from './Medal'
7
  import { Slider } from 'primereact/slider'
8
  import ScoreColumns from './ScoreColumns'
9
- const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
10
  const [filters, setFilters] = useState({
11
  type: { value: null, matchMode: FilterMatchMode.IN },
12
  size: { value: null, matchMode: FilterMatchMode.BETWEEN },
@@ -50,10 +50,10 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
50
  }
51
 
52
  const SliderWithLabel = ({ value, onChange, min, max }) => {
53
- const p = 10
54
- const start = value === null ? min : Math.log(value[0]) / Math.log(p)
55
- const stop = value === null ? max : Math.log(value[1]) / Math.log(p)
56
- const [_value, _setValue] = useState([start, stop])
57
  useEffect(() => {
58
  const timer = setTimeout(() => {
59
  onChange({
@@ -61,11 +61,11 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
61
  // set to "no filter" when (almost) the whole range is selected
62
  _value[0] <= min + 0.1 && _value[1] >= max - 0.1
63
  ? null
64
- : [p ** _value[0], p ** _value[1]]
65
- })
66
- }, 1000)
67
- return () => clearTimeout(timer)
68
- }, [_value, onChange, min, max])
69
  return (
70
  <div style={{ minWidth: '20rem' }}>
71
  <div>{formatSize(p ** _value[0])}</div>
@@ -147,21 +147,35 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
147
  }
148
 
149
  const costBodyTemplate = rowData => {
150
- return <div style={{ textAlign: 'center' }}>${rowData.cost?.toFixed(2)}</div>
 
 
 
 
151
  }
152
 
153
  const getHeaderText = () => {
154
- // Count languages that have evaluation data (average score available)
155
- const evaluatedLanguagesCount = allLanguages.filter(lang =>
156
- lang.average !== null && lang.average !== undefined
157
- ).length
 
159
  if (selectedLanguages.length === 0) {
160
  return (
161
  <span>
162
  <span style={{ fontWeight: 'bold', fontSize: '1.1em' }}>AI Models</span>
163
  <span style={{ fontSize: '0.85em', marginLeft: '0.5rem' }}>
164
- Average performance across {evaluatedLanguagesCount} evaluated languages
165
  </span>
166
  </span>
167
  )
@@ -245,7 +259,7 @@ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [] }) => {
245
  body={costBodyTemplate}
246
  style={{ minWidth: '5rem' }}
247
  />
248
- {ScoreColumns}
249
  </DataTable>
250
  )
251
  }
 
6
  import Medal from './Medal'
7
  import { Slider } from 'primereact/slider'
8
  import ScoreColumns from './ScoreColumns'
9
+ const ModelTable = ({ data, selectedLanguages = [], allLanguages = [], machineTranslatedMetrics = [] }) => {
10
  const [filters, setFilters] = useState({
11
  type: { value: null, matchMode: FilterMatchMode.IN },
12
  size: { value: null, matchMode: FilterMatchMode.BETWEEN },
 
50
  }
51
 
52
  const SliderWithLabel = ({ value, onChange, min, max }) => {
53
+ const p = 10;
54
+ const start = value === null || value[0] === null ? min : Math.log(value[0]) / Math.log(p);
55
+ const stop = value === null || value[1] === null ? max : Math.log(value[1]) / Math.log(p);
56
+ const [_value, _setValue] = useState([start, stop]);
57
  useEffect(() => {
58
  const timer = setTimeout(() => {
59
  onChange({
 
61
  // set to "no filter" when (almost) the whole range is selected
62
  _value[0] <= min + 0.1 && _value[1] >= max - 0.1
63
  ? null
64
+ : [p ** _value[0], p ** _value[1]],
65
+ });
66
+ }, 1000);
67
+ return () => clearTimeout(timer);
68
+ }, [_value, onChange, min, max]);
69
  return (
70
  <div style={{ minWidth: '20rem' }}>
71
  <div>{formatSize(p ** _value[0])}</div>
 
147
  }
148
 
149
  const costBodyTemplate = rowData => {
150
+ return (
151
+ <div style={{ textAlign: 'center' }}>
152
+ {rowData.cost === null ? 'n/a' : `$${rowData.cost.toFixed(2)}`}
153
+ </div>
154
+ )
155
  }
156
 
157
  const getHeaderText = () => {
158
+ // Count languages that have any evaluation data (any task scores available)
159
+ const evaluatedLanguagesCount = allLanguages.filter(lang => {
160
+ // Check if language has any task scores (not just average)
161
+ const hasAnyScores = [
162
+ 'translation_from_bleu',
163
+ 'translation_to_bleu',
164
+ 'classification_accuracy',
165
+ 'mmlu_accuracy',
166
+ 'arc_accuracy',
167
+ 'truthfulqa_accuracy',
168
+ 'mgsm_accuracy'
169
+ ].some(metric => lang[metric] !== null && lang[metric] !== undefined)
170
+ return hasAnyScores
171
+ }).length
172
 
173
  if (selectedLanguages.length === 0) {
174
  return (
175
  <span>
176
  <span style={{ fontWeight: 'bold', fontSize: '1.1em' }}>AI Models</span>
177
  <span style={{ fontSize: '0.85em', marginLeft: '0.5rem' }}>
178
+ Performance across {evaluatedLanguagesCount} evaluated languages
179
  </span>
180
  </span>
181
  )
 
259
  body={costBodyTemplate}
260
  style={{ minWidth: '5rem' }}
261
  />
262
+ {ScoreColumns(machineTranslatedMetrics)}
263
  </DataTable>
264
  )
265
  }
frontend/src/components/ScoreColumns.js CHANGED
@@ -2,21 +2,28 @@ import { Column } from 'primereact/column'
2
  import ScoreField from './ScoreField'
3
 
4
  const scoreBodyTemplate = (field, options = {}) => {
5
- const { minScore = 0, maxScore = 1 } = options
6
 
7
  return rowData => {
8
  const score = rowData[field]
9
- return ScoreField(score, minScore, maxScore)
10
  }
11
  }
12
 
13
- const ScoreColumns = [
14
  <Column
15
  field='average'
16
  header='Proficiency'
17
  headerTooltip='Language Proficiency Score (average of the scores for each task, after min-max normalization)'
18
  sortable
19
- body={scoreBodyTemplate('average', { minScore: 0.2, maxScore: 0.5 })}
20
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
21
  />,
22
  <Column
@@ -26,7 +33,8 @@ const ScoreColumns = [
26
  sortable
27
  body={scoreBodyTemplate('translation_from_bleu', {
28
  minScore: 0,
29
- maxScore: 0.5
 
30
  })}
31
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
32
  />,
@@ -37,7 +45,8 @@ const ScoreColumns = [
37
  sortable
38
  body={scoreBodyTemplate('translation_to_bleu', {
39
  minScore: 0,
40
- maxScore: 0.5
 
41
  })}
42
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
43
  />,
@@ -48,7 +57,8 @@ const ScoreColumns = [
48
  sortable
49
  body={scoreBodyTemplate('classification_accuracy', {
50
  minScore: 0,
51
- maxScore: 0.5
 
52
  })}
53
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
54
  />,
@@ -69,7 +79,8 @@ const ScoreColumns = [
69
  sortable
70
  body={scoreBodyTemplate('mmlu_accuracy', {
71
  minScore: 0,
72
- maxScore: 1
 
73
  })}
74
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
75
  />,
@@ -80,7 +91,8 @@ const ScoreColumns = [
80
  sortable
81
  body={scoreBodyTemplate('arc_accuracy', {
82
  minScore: 0,
83
- maxScore: 1
 
84
  })}
85
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
86
  />,
@@ -91,7 +103,8 @@ const ScoreColumns = [
91
  sortable
92
  body={scoreBodyTemplate('mgsm_accuracy', {
93
  minScore: 0,
94
- maxScore: 1
 
95
  })}
96
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
97
  />,
 
2
  import ScoreField from './ScoreField'
3
 
4
  const scoreBodyTemplate = (field, options = {}) => {
5
+ const { minScore = 0, maxScore = 1, machineTranslatedMetrics = [] } = options
6
 
7
  return rowData => {
8
  const score = rowData[field]
9
+ // Prefer per-row flag if present (backend sets `<metric>_is_machine`),
10
+ // otherwise fall back to global list
11
+ const rowFlagKey = `${field}_is_machine`
12
+ const hasRowFlag = Object.prototype.hasOwnProperty.call(rowData, rowFlagKey)
13
+ const isMachineTranslated = hasRowFlag
14
+ ? !!rowData[rowFlagKey]
15
+ : machineTranslatedMetrics.includes(field)
16
+ return ScoreField(score, minScore, maxScore, isMachineTranslated)
17
  }
18
  }
19
 
20
+ const ScoreColumns = (machineTranslatedMetrics = []) => [
21
  <Column
22
  field='average'
23
  header='Proficiency'
24
  headerTooltip='Language Proficiency Score (average of the scores for each task, after min-max normalization)'
25
  sortable
26
+ body={scoreBodyTemplate('average', { minScore: 0.2, maxScore: 0.5, machineTranslatedMetrics })}
27
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
28
  />,
29
  <Column
 
33
  sortable
34
  body={scoreBodyTemplate('translation_from_bleu', {
35
  minScore: 0,
36
+ maxScore: 0.5,
37
+ machineTranslatedMetrics
38
  })}
39
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
40
  />,
 
45
  sortable
46
  body={scoreBodyTemplate('translation_to_bleu', {
47
  minScore: 0,
48
+ maxScore: 0.5,
49
+ machineTranslatedMetrics
50
  })}
51
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
52
  />,
 
57
  sortable
58
  body={scoreBodyTemplate('classification_accuracy', {
59
  minScore: 0,
60
+ maxScore: 0.5,
61
+ machineTranslatedMetrics
62
  })}
63
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
64
  />,
 
79
  sortable
80
  body={scoreBodyTemplate('mmlu_accuracy', {
81
  minScore: 0,
82
+ maxScore: 1,
83
+ machineTranslatedMetrics
84
  })}
85
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
86
  />,
 
91
  sortable
92
  body={scoreBodyTemplate('arc_accuracy', {
93
  minScore: 0,
94
+ maxScore: 1,
95
+ machineTranslatedMetrics
96
  })}
97
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
98
  />,
 
103
  sortable
104
  body={scoreBodyTemplate('mgsm_accuracy', {
105
  minScore: 0,
106
+ maxScore: 1,
107
+ machineTranslatedMetrics
108
  })}
109
  style={{ minWidth: '5rem', maxWidth: '10rem' }}
110
  />,
frontend/src/components/ScoreField.js CHANGED
@@ -1,4 +1,4 @@
1
- const ScoreField = (score, minScore, maxScore) => {
2
  let percentage = 100
3
  let barColor = "rgba(210, 106, 255, 0.1)" // light violet for missing data
4
  if (score !== null) {
@@ -50,6 +50,7 @@ const ScoreField = (score, minScore, maxScore) => {
50
  }}
51
  >
52
  {score !== null ? (score * 100).toFixed(1)+"%" : '–'}
 
53
  </span>
54
  </div>
55
  )
 
1
+ const ScoreField = (score, minScore, maxScore, isMachineTranslated = false) => {
2
  let percentage = 100
3
  let barColor = "rgba(210, 106, 255, 0.1)" // light violet for missing data
4
  if (score !== null) {
 
50
  }}
51
  >
52
  {score !== null ? (score * 100).toFixed(1)+"%" : '–'}
53
+ {isMachineTranslated && score !== null && <span style={{color: '#666', fontSize: '0.8em'}}>*</span>}
54
  </span>
55
  </div>
56
  )
frontend/src/components/SpeakerPlot.js CHANGED
@@ -73,10 +73,10 @@ const SpeakerPlot = ({ data, width = 750, height = 500 }) => {
73
  textStrokeOpacity: 0,
74
  textFillOpacity: 0
75
  }),
76
- Plot.tip(['The 40 most spoken languages cover 80% of all speakers.'], {
77
  x: 40,
78
  y: languages[39].cumSpeakers / 1e6
79
- })
80
  ]
81
  })
82
  containerRef.current.append(plot)
 
73
  textStrokeOpacity: 0,
74
  textFillOpacity: 0
75
  }),
76
+ ...(languages.length >= 40 ? [Plot.tip(['The 40 most spoken languages cover 80% of all speakers.'], {
77
  x: 40,
78
  y: languages[39].cumSpeakers / 1e6
79
+ })] : [])
80
  ]
81
  })
82
  containerRef.current.append(plot)
frontend/src/components/WorldMap.js CHANGED
@@ -26,13 +26,13 @@ const makeTitle = data => d => {
26
  a =>
27
  `${smoothProgressBar(a.population / pop)} ${
28
  a.name
29
- } – ${a.score.toFixed(2)}`
30
  )
31
  .join('\n\n') + (languages?.length > 10 ? `\n\n...` : '')
32
- return `${d.properties.ADMIN} – ${cData?.score.toFixed(2)}\n\n${langstring}`
33
  }
34
 
35
- const WorldMap = ({ data, width = 750, height = 500 }) => {
36
  const containerRef = useRef()
37
  const [mapData, setMapData] = useState()
38
 
@@ -48,8 +48,22 @@ const WorldMap = ({ data, width = 750, height = 500 }) => {
48
  acc[country.iso2] = country
49
  return acc
50
  }, {})
 
51
  const plot = Plot.plot({
52
- subtitle: 'Language Proficiency Score by Country',
53
  width: width,
54
  height: height,
55
  projection: 'equal-earth',
@@ -61,11 +75,12 @@ const WorldMap = ({ data, width = 750, height = 500 }) => {
61
  })
62
  ],
63
  color: {
64
- scheme: 'Greens',
65
- unknown: 'gray',
66
  label: 'Score',
67
  legend: true,
68
- domain: [0, 1]
 
69
  },
70
  style: {
71
  fontFamily: 'monospace'
 
26
  a =>
27
  `${smoothProgressBar(a.population / pop)} ${
28
  a.name
29
+ } – ${a.score === null || a.score === undefined ? "n/a" : a.score.toFixed(2)}`
30
  )
31
  .join('\n\n') + (languages?.length > 10 ? `\n\n...` : '')
32
+ return `${d.properties.ADMIN} – ${cData?.score === null || cData?.score === undefined ? "n/a" : cData.score.toFixed(2)}\n\n${langstring}`
33
  }
34
 
35
+ const WorldMap = ({ data, width = 750, height = 500, allLanguages = [] }) => {
36
  const containerRef = useRef()
37
  const [mapData, setMapData] = useState()
38
 
 
48
  acc[country.iso2] = country
49
  return acc
50
  }, {})
51
+ // Count languages that have any evaluation data
52
+ const evaluatedLanguagesCount = allLanguages.filter(lang => {
53
+ const hasAnyScores = [
54
+ 'translation_from_bleu',
55
+ 'translation_to_bleu',
56
+ 'classification_accuracy',
57
+ 'mmlu_accuracy',
58
+ 'arc_accuracy',
59
+ 'truthfulqa_accuracy',
60
+ 'mgsm_accuracy'
61
+ ].some(metric => lang[metric] !== null && lang[metric] !== undefined)
62
+ return hasAnyScores
63
+ }).length
64
+
65
  const plot = Plot.plot({
66
+ subtitle: `Language Proficiency Score by Country (Coverage: ~${evaluatedLanguagesCount} languages evaluated)`,
67
  width: width,
68
  height: height,
69
  projection: 'equal-earth',
 
75
  })
76
  ],
77
  color: {
78
+ scheme: 'RdYlGn',
79
+ unknown: '#d0d0d0',
80
  label: 'Score',
81
  legend: true,
82
+ domain: [0, 1],
83
+ pivot: 0.5
84
  },
85
  style: {
86
  fontFamily: 'monospace'
notes/system-architecture-diagram.md ADDED
@@ -0,0 +1,177 @@
1
+ # AI Language Monitor - System Architecture
2
+
3
+ \[AI-generated, not 100% up-to-date\]
4
+
5
+ This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
6
+
7
+ ```mermaid
8
+ flowchart TD
9
+ %% Model Sources
10
+ A1["important_models<br/>Static Curated List"] --> D[load_models]
11
+ A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
12
+ A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
13
+ A4["blocklist<br/>Exclusions"] --> D
14
+
15
+ %% Model Processing
16
+ D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
17
+ E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
18
+ F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
19
+ G --> H["Enriched Model DataFrame"]
20
+ H --> |Save| I[models.json]
21
+
22
+ %% Model Validation & Cost Filtering
23
+ H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost ≀ $20/1M tokens"]
24
+ H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
25
+
26
+ %% Language Data
27
+ J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
28
+
29
+ %% Task Registry with Unified Prompting
30
+ L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
31
+ M --> M1["translation_from/to<br/>BLEU + ChrF"]
32
+ M --> M2["classification<br/>Accuracy"]
33
+ M --> M3["mmlu<br/>Accuracy"]
34
+ M --> M4["arc<br/>Accuracy"]
35
+ M --> M5["truthfulqa<br/>Accuracy"]
36
+ M --> M6["mgsm<br/>Accuracy"]
37
+
38
+ %% On-the-fly Translation with Origin Tagging
39
+ subgraph OTF [On-the-fly Dataset Translation]
40
+ direction LR
41
+ DS_raw["Raw English Dataset<br/>"] --> Google_Translate["Google Translate API"]
42
+ Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., MGSM/ARC)<br/>Origin: 'machine'"]
43
+ DS_native["Native Dataset<br/>(e.g., AfriMMLU/Global-MMLU)<br/>Origin: 'human'"]
44
+ end
45
+
46
+ %% Evaluation Pipeline
47
+ H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
48
+ K --> |"languages bcp_47"| N
49
+ L --> |"tasks.items"| N
50
+ N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model Γ— Language Γ— Task"]
51
+ O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
52
+
53
+ %% Task Execution with Origin Tracking
54
+ P --> Q1[translate_and_evaluate<br/>Origin: 'human']
55
+ P --> Q2[classify_and_evaluate<br/>Origin: 'human']
56
+ P --> Q3["mmlu_and_evaluate<br/>Origin: 'human' (no on-the-fly for missing; uses auto-translated dataset if available)"]
57
+ P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
58
+ P --> Q5["truthfulqa_and_evaluate<br/>Origin: 'human' (no on-the-fly for missing; relies on available datasets)"]
59
+ P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']
60
+
61
+ %% API Calls with Error Handling
62
+ Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
63
+ Q2 --> |"complete() API<br/>Rate Limiting"| R
64
+ Q3 --> |"complete() API<br/>Rate Limiting"| R
65
+ Q4 --> |"complete() API<br/>Rate Limiting"| R
66
+ Q5 --> |"complete() API<br/>Rate Limiting"| R
67
+ Q6 --> |"complete() API<br/>Rate Limiting"| R
68
+
69
+ %% Results Processing with Origin Aggregation
70
+ R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
71
+ S --> |Save| T[results.json]
72
+
73
+ %% Backend & Frontend with Origin-Specific Metrics
74
+ T --> |Read| U[backend.py]
75
+ I --> |Read| U
76
+ U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
77
+ U --> |make_country_table| W["Country Aggregation"]
78
+ U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
79
+ X --> |"JSON Response"| Y["Frontend React App"]
80
+
81
+ %% UI Components
82
+ Y --> Z1["WorldMap.js<br/>Country Visualization"]
83
+ Y --> Z2["ModelTable.js<br/>Model Rankings"]
84
+ Y --> Z3["LanguageTable.js<br/>Language Coverage"]
85
+ Y --> Z4["DatasetTable.js<br/>Task Performance"]
86
+
87
+ %% Data Sources with Origin Information
88
+ subgraph DS ["Data Sources"]
89
+ DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
90
+ DS2["MMLU/AfriMMLU/Global-MMLU<br/>Knowledge QA<br/>Origin: 'human' or 'machine' (HF auto-translated only)"]
91
+ DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
92
+ DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
93
+ DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
94
+ end
95
+
96
+ DS1 --> Q1
97
+ DS2 --> Q3
98
+ DS3 --> Q4
99
+ DS4 --> Q5
100
+ DS5 --> Q6
101
+
102
+ %% No on-the-fly DS_translated for MMLU anymore; only HF auto-translated used
103
+ DS_translated --> Q4
104
+ DS_translated --> Q5
105
+
106
+ DS_native --> Q3
107
+ DS_native --> Q4
108
+ DS_native --> Q5
109
+
110
+ %% Styling - Neutral colors that work in both dark and light modes
111
+ classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
112
+ classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
113
+ classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
114
+ classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
115
+ classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
116
+ classDef translation fill:#d4edda,stroke:#155724,color:#155724
117
+
118
+ class A1,A2,A3,A4 modelSource
119
+ class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
120
+ class R,F,G,X api
121
+ class T,I storage
122
+ class Y,Z1,Z2,Z3,Z4 frontend
123
+ class Google_Translate,DS_translated,DS_native translation
124
+ ```
125
+
126
+ ## Architecture Components
127
+
128
+ ### πŸ”΅ Model Discovery (Light Gray)
129
+ - **Static Curated Models**: Handpicked important models for comprehensive evaluation
130
+ - **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
131
+ - **Quality Control**: Blocklist for problematic or incompatible models
132
+ - **Model Validation**: API availability checks and cost filtering (≀$20/1M tokens)
133
+ - **Timeout Protection**: 120s timeout for large/reasoning models, 60s for others (see the sketch after this list)
134
+ - **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs
135
+
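+ A minimal sketch (not taken from the repository) of how the cost ceiling and timeout rules above could look; the pricing schema, example model IDs, and size markers are assumptions for illustration:
+
+ ```python
+ # Hypothetical sketch of the cost ceiling and timeout rules; the pricing fields
+ # and model IDs below are illustrative, not the project's real schema.
+ MAX_COST_PER_MILLION = 20.0  # USD per 1M completion tokens
+
+ models = [
+     {"id": "meta-llama/llama-3.1-405b-instruct", "pricing": {"completion": "0.000003"}},
+     {"id": "openai/gpt-4o-mini", "pricing": {"completion": "0.0000006"}},
+ ]
+
+ def is_affordable(model: dict) -> bool:
+     """Keep only models whose completion price is at most $20 per 1M tokens."""
+     per_token = float(model["pricing"].get("completion") or 0)
+     return per_token * 1_000_000 <= MAX_COST_PER_MILLION
+
+ def timeout_seconds(model_id: str) -> int:
+     """Large or reasoning-oriented models get the longer 120-second timeout."""
+     markers = ("405b", "70b", "reasoning", "o1", "r1")
+     return 120 if any(m in model_id.lower() for m in markers) else 60
+
+ affordable = [m for m in models if is_affordable(m)]
+ ```
+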
136
+ ### 🟣 Evaluation Pipeline (Medium Gray)
137
+ - **7 Active Tasks**: Translation from and to English (two tasks), Classification, MMLU, ARC, TruthfulQA, MGSM
138
+ - **Unified English Zero-Shot Prompting**: All tasks use English instructions with target language content
139
+ - **Origin Tagging**: Distinguishes between human-translated ('human') and machine-translated ('machine') data
140
+ - **Combinatorial Approach**: Systematic evaluation across Model Γ— Language Γ— Task combinations
141
+ - **Sample-based**: 10 evaluations per combination for statistical reliability
142
+ - **Batch Processing**: 50 tasks per batch with rate limiting and error resilience (see the batching sketch after this list)
143
+ - **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage
144
+
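+ A rough sketch of the combinatorial batching described above, assuming asyncio plus an aiolimiter-style rate limiter (both appear in the project's dependency list); the function names, limiter rate, and task signature are assumptions:
+
+ ```python
+ # Illustrative sketch of Model x Language x Task batching; not the repository's actual code.
+ import asyncio
+ from itertools import product
+
+ from aiolimiter import AsyncLimiter
+
+ N_SAMPLES = 10    # evaluations per combination
+ BATCH_SIZE = 50   # tasks per batch
+ limiter = AsyncLimiter(max_rate=20, time_period=1)  # assumed: 20 requests per second
+
+ async def evaluate_one(model_id, bcp_47, task_fn, sample_idx):
+     async with limiter:  # throttle calls to the inference API
+         return await task_fn(model_id, bcp_47, sample_idx)
+
+ async def run(models, languages, tasks):
+     combos = [
+         (m, l, t, i)
+         for m, l, t in product(models, languages, tasks)
+         for i in range(N_SAMPLES)
+     ]
+     results = []
+     for start in range(0, len(combos), BATCH_SIZE):
+         batch = combos[start:start + BATCH_SIZE]
+         results += await asyncio.gather(
+             *(evaluate_one(m, l, t, i) for m, l, t, i in batch),
+             return_exceptions=True,  # a single failed call does not abort the batch
+         )
+     return results
+ ```
+
+ Passing `return_exceptions=True` matches the error-resilience goal above: one failed combination is recorded instead of cancelling the whole batch.
+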
145
+ ### 🟠 API Integration (Light Gray)
146
+ - **OpenRouter**: Primary model inference API for all language model tasks (see the request sketch after this list)
147
+ - **Rate Limiting**: Intelligent batching and delays to prevent API overload
148
+ - **Error Handling**: Graceful handling of timeouts, rate limits, and model unavailability
149
+ - **HuggingFace**: Model metadata and open-source model information
150
+ - **Google Translate**: Specialized translation API for on-the-fly dataset translation
151
+
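+ A hedged sketch of a `complete()`-style helper against OpenRouter's OpenAI-compatible endpoint, using the openai client from the dependency list; the retry/backoff policy and the helper's exact signature are assumptions, not the project's implementation:
+
+ ```python
+ # Sketch only: calls OpenRouter through the OpenAI-compatible API with timeout and retries.
+ import asyncio
+ import os
+
+ from openai import APITimeoutError, AsyncOpenAI, RateLimitError
+
+ client = AsyncOpenAI(
+     base_url="https://openrouter.ai/api/v1",
+     api_key=os.environ["OPENROUTER_API_KEY"],
+ )
+
+ async def complete(model: str, prompt: str, timeout: float = 60.0, retries: int = 3) -> str:
+     for attempt in range(retries):
+         try:
+             response = await client.chat.completions.create(
+                 model=model,
+                 messages=[{"role": "user", "content": prompt}],
+                 timeout=timeout,  # per-request timeout (60s default, 120s for large models)
+             )
+             return response.choices[0].message.content
+         except (APITimeoutError, RateLimitError):
+             await asyncio.sleep(2 ** attempt)  # simple exponential backoff
+     raise RuntimeError(f"{model}: no response after {retries} attempts")
+ ```
+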
152
+ ### 🟒 Data Storage (Cyan)
153
+ - **results.json**: Aggregated evaluation scores with origin-specific metrics (see the aggregation sketch after this list)
154
+ - **models.json**: Dynamic model list with metadata and validation status
155
+ - **languages.json**: Language information with population data
156
+
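+ An illustrative aggregation by model + language + task + origin, assuming pandas; the exact column names in results.json are an assumption based on the description above:
+
+ ```python
+ # One row per individual evaluation, averaged per model/language/task/origin.
+ import pandas as pd
+
+ rows = [
+     {"model": "openai/gpt-4o-mini", "bcp_47": "sw", "task": "arc", "origin": "machine", "score": 0.7},
+     {"model": "openai/gpt-4o-mini", "bcp_47": "sw", "task": "arc", "origin": "machine", "score": 0.9},
+ ]
+
+ df = pd.DataFrame(rows)
+ aggregated = (
+     df.groupby(["model", "bcp_47", "task", "origin"], as_index=False)["score"].mean()
+ )
+ aggregated.to_json("results.json", orient="records", indent=2)
+ ```
+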
157
+ ### 🟑 Frontend Visualization (Light Red)
158
+ - **WorldMap**: Interactive country-level language proficiency visualization
159
+ - **ModelTable**: Ranked model performance leaderboard with origin-specific columns
160
+ - **LanguageTable**: Language coverage and speaker statistics
161
+ - **DatasetTable**: Task-specific performance breakdowns with human/machine distinction
162
+
163
+ ### πŸ”΅ Translation & Origin Tracking (Light Green)
164
+ - **On-the-fly Translation**: Google Translate API for languages without native benchmarks (see the sketch after this list)
165
+ - **Origin Tagging**: Automatic classification of data sources (human vs. machine translated)
166
+ - **Separate Metrics**: Frontend displays distinct scores for human and machine-translated data
167
+
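+ A sketch of on-the-fly dataset translation with origin tagging, assuming the google-cloud-translate (v2) client library; the helper and field names are illustrative rather than the project's own:
+
+ ```python
+ # Translate English samples into the target language and tag them as machine-translated.
+ from google.cloud import translate_v2 as translate
+
+ client = translate.Client()
+
+ def translate_samples(samples: list[dict], target_bcp_47: str) -> list[dict]:
+     translated = []
+     for sample in samples:
+         result = client.translate(
+             sample["text"], source_language="en", target_language=target_bcp_47
+         )
+         translated.append({**sample, "text": result["translatedText"], "origin": "machine"})
+     return translated
+
+ # Samples that already exist natively in the target language keep origin="human".
+ ```
+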
168
+ ## Data Flow Summary
169
+
170
+ 1. **Model Discovery**: Combine curated + trending models β†’ validate API availability β†’ enrich with metadata
171
+ 2. **Evaluation Setup**: Generate all valid Model Γ— Language Γ— Task combinations with origin tracking
172
+ 3. **Task Execution**: Run evaluations using unified English prompting and appropriate datasets
173
+ 4. **Result Processing**: Aggregate scores by model+language+task+origin and save to JSON files
174
+ 5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics via REST API (see the endpoint sketch below)
175
+ 6. **Frontend Display**: React app visualizes data through interactive components with transparency indicators
176
+
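+ A minimal sketch of an `/api/data` endpoint exposing origin-specific metrics (step 5); the response shape is inferred from the description above, not read from backend.py:
+
+ ```python
+ # Serve the aggregated JSON artifacts through a single FastAPI endpoint.
+ import json
+
+ from fastapi import FastAPI
+
+ app = FastAPI()
+
+ @app.get("/api/data")
+ async def get_data():
+     with open("results.json") as f:
+         results = json.load(f)
+     with open("models.json") as f:
+         models = json.load(f)
+     # Model rows may carry origin-specific columns such as arc_accuracy_human / arc_accuracy_machine.
+     return {"models": models, "results": results}
+ ```
+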
177
+ This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency.
pyproject.toml CHANGED
@@ -13,7 +13,7 @@ dependencies = [
13
  "uvicorn>=0.34.2",
14
  ]
15
 
16
- [project.optional-dependencies]
17
  dev = [
18
  "aiolimiter>=1.2.1",
19
  "bert-score>=0.3.13",
@@ -26,7 +26,7 @@ dev = [
26
  "joblib>=1.5.0",
27
  "langcodes>=3.5.0",
28
  "language-data>=1.3.0",
29
- "openai>=1.78.1",
30
  "protobuf>=6.30.2",
31
  "python-dotenv>=1.1.0",
32
  "rich>=14.0.0",
@@ -36,11 +36,3 @@ dev = [
36
  "tqdm>=4.67.1",
37
  "transformers>=4.51.3",
38
  ]
39
-
40
- [dependency-groups]
41
- dev = [
42
- "ipython>=9.3.0",
43
- "jupyter>=1.1.1",
44
- "scipy>=1.16.0",
45
- "seaborn>=0.13.2",
46
- ]
 
13
  "uvicorn>=0.34.2",
14
  ]
15
 
16
+ [dependency-groups]
17
  dev = [
18
  "aiolimiter>=1.2.1",
19
  "bert-score>=0.3.13",
 
26
  "joblib>=1.5.0",
27
  "langcodes>=3.5.0",
28
  "language-data>=1.3.0",
29
+ "openai>=2.3.0",
30
  "protobuf>=6.30.2",
31
  "python-dotenv>=1.1.0",
32
  "rich>=14.0.0",
 
36
  "tqdm>=4.67.1",
37
  "transformers>=4.51.3",
38
  ]

uv.lock CHANGED
The diff for this file is too large to render. See raw diff