Update README.md
README.md CHANGED
@@ -140,49 +140,38 @@ Results
 | Ukrainian | 78.0% | 77.0% | 83.9% | **85.1%** |
 | **Average** | 79.5% | 76.8% | 82.5% | **84.7%** |
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-| Norwegian | **2.1284** | 2.2862 | 2.3506 | 2.2253 |
-| Polish | **2.0241** | 2.1294 | 2.0803 | 2.0803 |
-| Portuguese | **1.9899** | 2.0597 | 2.0272 | 2.0187 |
-| Romanian | **2.0196** | 2.1606 | 2.1641 | 2.1114 |
-| Russian | **2.0424** | 2.09 | 2.1095 | 2.0871 |
-| Slovak | **2.1192** | 2.338 | 2.3029 | 2.2609 |
-| Slovenian | **2.1556** | 2.4443 | 2.3398 | 2.2589 |
-| Serbian | **2.2469** | 2.6351 | 4.2471 | 2.3743 |
-| Swedish | **2.041** | 2.1809 | 2.1464 | 2.1211 |
-| Turkish | **2.0997** | 2.247 | 2.2202 | 2.232 |
-| Ukrainian | **2.1376** | 2.2665 | 2.2691 | 2.2086 |
+## MultiBLiMP Benchmark: Grammar Test
+**What is MultiBLiMP?** [MultiBLiMP](https://arxiv.org/pdf/2504.02768) is a massively multilingual test of core grammar. It gives models pairs of almost-identical sentences, one grammatical and one ungrammatical, and checks whether the model assigns a higher probability to the correct one. Version 1.0 covers 101 languages.
+
+**Why does this matter?** MultiBLiMP tests a model's ability to distinguish grammatical from ungrammatical language. As with human writers, getting the language mostly right is not much of an achievement; even occasional grammatical mistakes count as a serious flaw.
+
+**What did we do?**
+We used the standard implementation of the [MultiBLiMP](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/lm_eval/tasks/multiblimp) task from the LM Evaluation Harness. We set tokenisers to `use_fast=False`. We report **0-shot** accuracy.
+
+| Language | Gemma 2 27b | ALIA 40b | EuroLLM Prev. 22b | TildeOpen 1.1 30b |
+|----------|-------------|----------|---------------------|-------------|
+| Bulgarian | 95.4% | 98.8% | 97.7% | **99.6%** |
+| Czech | 98.6% | **98.9%** | 98.5% | 98.5% |
+| German | 98.8% | 98.7% | 98.0% | **99.4%** |
+| English | 98.4% | 98.7% | 98.7% | **99.4%** |
+| Estonian | 92.0% | 95.6% | 95.8% | **98.3%** |
+| Finnish | 93.0% | 96.3% | 95.2% | **98.5%** |
+| French | 98.2% | 98.8% | 98.7% | **99.3%** |
+| Serbo-Croatian | 94.6% | 98.5% | 96.4% | **99.6%** |
+| Hungarian | 95.9% | 98.8% | 97.8% | **100.0%** |
+| Icelandic | 88.5% | 80.3% | 74.4% | **98.8%** |
+| Italian | 96.0% | 96.7% | 96.6% | **98.2%** |
+| Latvian | 91.6% | 95.2% | 96.9% | **99.1%** |
+| Lithuanian | 95.3% | 99.0% | 99.0% | **99.7%** |
+| Dutch | 94.0% | 96.6% | 96.5% | **99.2%** |
+| Polish | 97.0% | 97.5% | 97.6% | **99.3%** |
+| Portuguese | 96.1% | 97.6% | 97.1% | **98.2%** |
+| Romanian | 97.7% | **98.9%** | 98.5% | **98.9%** |
+| Russian | 94.7% | 96.6% | 97.3% | **99.4%** |
+| Slovak | 97.7% | 98.8% | 97.7% | **99.3%** |
+| Slovenian | 99.0% | **100.0%** | **100.0%** | 98.8% |
+| Spanish | 95.6% | 98.0% | 97.3% | **98.7%** |
+| Swedish | 95.8% | 85.1% | 93.8% | **100.0%** |
+| Turkish | 97.6% | **98.7%** | 97.9% | 96.4% |
+| Ukrainian | 95.6% | 98.0% | 97.3% | **99.2%** |
+| **Average** | 95.7% | 96.7% | 96.4% | **99.0%** |
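
Note on the added section: the accuracy it reports comes from a minimal-pair comparison, where a pair counts as correct when the model gives the grammatical sentence the higher probability. The sketch below illustrates that scoring rule only. It is not the LM Evaluation Harness code; the function name `minimal_pair_accuracy`, the choice to count ties as errors, and the log-probability values are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def minimal_pair_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the grammatical sentence scores higher.

    Each pair is (logprob_grammatical, logprob_ungrammatical).
    Assumption: a tie counts as an error.
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Hypothetical sentence log-probabilities, invented for illustration only.
example_pairs = [(-12.3, -15.8), (-20.1, -19.7), (-8.4, -9.0), (-30.2, -33.5)]

accuracy = minimal_pair_accuracy(example_pairs)
print(f"{accuracy:.1%}")  # 3 of the 4 pairs are ranked correctly: 75.0%
```

Under this rule, a per-language figure in the table would be the share of that language's pairs ranked correctly.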