update readme doc
constants.py (+13 -3)
@@ -28,10 +28,20 @@ We aim to provide cost-effective and accurate evaluation for multimodal models,
 
 ## Results & Takeaways from Evaluating Top Models
 
-
-
+
+### January 2025
+
+- **Gemini 2.0 Experimental (1206)** and **Gemini 2.0 Flash Experimental** outperform **GPT-4o** and **Claude 3.5 Sonnet**.
+- We add **Grok-2-vision-1212** to the single-image leaderboard. The model seems to use a lot of tokens per image, and cannot run many of our multi-image and video tasks.
+- We will evaluate o1 series models when there is budget.
+
+
+### November 2024
+
+- **GPT-4o (0513)** and **Claude 3.5 Sonnet (1022)** lead the benchmark. **Claude 3.5 Sonnet (1022)** improves over **Claude 3.5 Sonnet (0620)** obviously in planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
+- **Qwen2-VL** stands out among open-source models, and its flagship model gets close to some proprietary flagship models
 - Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
-- Gemini 1.5 Flash performs the best among all the evaluated efficiency models, but struggles with UI and document tasks
+- **Gemini 1.5 Flash** performs the best among all the evaluated efficiency models, but struggles with UI and document tasks
 - Many open-source models face challenges in adhering to output format instructions
 
 ## 🎯 Interactive Visualization