davidpomerenke committed
Commit 4bfbb64 · verified · 1 Parent(s): 34b05c6

Upload from GitHub Actions: update and fixed rendering issues

Files changed (1)
  1. notes/system-architecture-diagram.md +68 -55
notes/system-architecture-diagram.md CHANGED
@@ -7,27 +7,25 @@ This diagram shows the complete data flow from model discovery through evaluatio
 ```mermaid
 flowchart TD
     %% Model Sources
-    A1["important_models<br/>Static Curated List"] --> D[load_models]
-    A2["get_historical_popular_models<br/>Web Scraping - Top 20"] --> D
-    A3["get_current_popular_models<br/>Web Scraping - Top 10"] --> D
+    A1["important_models<br/>Static Curated List<br/>~34 models"] --> D[load_models]
     A4["blocklist<br/>Exclusions"] --> D

     %% Model Processing
-    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>~40-50 models"]
+    D --> |"Combine & Dedupe"| E["Dynamic Model List<br/>Validated Models"]
     E --> |get_or_metadata| F["OpenRouter API<br/>Model Metadata"]
     F --> |get_hf_metadata| G["HuggingFace API<br/>Model Details"]
     G --> H["Enriched Model DataFrame"]
     H --> |Save| I[models.json]

     %% Model Validation & Cost Filtering
-    H --> |"Validate Models<br/>Check API Availability"| H1["Valid Models Only<br/>Cost ≤ $20/1M tokens"]
-    H1 --> |"Timeout Protection<br/>120s for Large Models"| H2["Robust Model List"]
+    H --> |"Validate Models<br/>Check API Availability<br/>No User Data Training"| H1["Valid Models Only<br/>Cost ≤ $15/1M tokens"]
+    H1 --> H2["Robust Model List<br/>Default: Top 40 models"]

     %% Language Data
-    J["languages.py<br/>BCP-47 + Population"] --> K["Top 100 Languages"]
+    J["languages.py<br/>BCP-47 + Population<br/>Glottolog Families"] --> K["Languages Sorted by Speakers<br/>Default: Up to 1000 languages"]

     %% Task Registry with Unified Prompting
-    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot"]
+    L["tasks.py<br/>7 Evaluation Tasks"] --> M["Task Functions<br/>Unified English Zero-Shot<br/>Reasoning Template"]
     M --> M1["translation_from/to<br/>BLEU + ChrF"]
     M --> M2["classification<br/>Accuracy"]
     M --> M3["mmlu<br/>Accuracy"]
@@ -39,43 +37,45 @@ flowchart TD
     subgraph OTF [On-the-fly Dataset Translation]
     direction LR
     DS_raw["Raw English Dataset<br/>"] --> Google_Translate["Google Translate API"]
-    Google_Translate --> DS_translated["Translated Dataset<br/>(e.g., MGSM/ARC)<br/>Origin: 'machine'"]
-    DS_native["Native Dataset<br/>(e.g., AfriMMLU/Global-MMLU)<br/>Origin: 'human'"]
+    Google_Translate --> DS_translated["Translated Dataset<br/>e.g., MGSM/ARC<br/>Origin: 'machine'"]
+    DS_native["Native Dataset<br/>e.g., AfriMMLU/Global-MMLU<br/>Origin: 'human'"]
     end

     %% Evaluation Pipeline
-    H2 --> |"models ID"| N["main.py / main_gcs.py<br/>evaluate"]
-    K --> |"languages bcp_47"| N
+    H2 --> |"models ID<br/>Default: 40 models"| N["main.py / main_gcs.py<br/>evaluate"]
+    K --> |"languages bcp_47<br/>Default: 1000 languages"| N
     L --> |"tasks.items"| N
-    N --> |"Filter by model.tasks"| O["Valid Combinations<br/>Model × Language × Task"]
-    O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing"]
+    N --> |"Filter by model.tasks<br/>Filter by valid task languages"| O["Valid Combinations<br/>Model × Language × Task"]
+    O --> |"10 samples each"| P["Evaluation Execution<br/>Batch Processing<br/>Batch Size: 2000"]

     %% Task Execution with Origin Tracking
     P --> Q1[translate_and_evaluate<br/>Origin: 'human']
     P --> Q2[classify_and_evaluate<br/>Origin: 'human']
-    P --> Q3[mmlu_and_evaluate<br/>Origin: 'human' (no on-the-fly for missing; uses auto-translated dataset if available)]
+    P --> Q3[mmlu_and_evaluate<br/>Origin: 'human'<br/>no on-the-fly; uses auto-translated if available]
     P --> Q4[arc_and_evaluate<br/>Origin: 'human'/'machine']
-    P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human' (no on-the-fly for missing; relies on available datasets)]
+    P --> Q5[truthfulqa_and_evaluate<br/>Origin: 'human'<br/>no on-the-fly; relies on available datasets]
     P --> Q6[mgsm_and_evaluate<br/>Origin: 'human'/'machine']

     %% API Calls with Error Handling
-    Q1 --> |"complete() API<br/>Rate Limiting"| R["OpenRouter<br/>Model Inference"]
-    Q2 --> |"complete() API<br/>Rate Limiting"| R
-    Q3 --> |"complete() API<br/>Rate Limiting"| R
-    Q4 --> |"complete() API<br/>Rate Limiting"| R
-    Q5 --> |"complete() API<br/>Rate Limiting"| R
-    Q6 --> |"complete() API<br/>Rate Limiting"| R
+    Q1 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R["OpenRouter<br/>Model Inference"]
+    Q2 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
+    Q3 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
+    Q4 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
+    Q5 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R
+    Q6 --> |"complete() API<br/>Rate Limiting<br/>Reasoning: Low Effort"| R

     %% Results Processing with Origin Aggregation
-    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin"]
-    S --> |Save| T[results.json]
+    R --> |Scores| S["Result Aggregation<br/>Mean by model+lang+task+origin<br/>Bootstrap Confidence Intervals"]
+    S --> |Save| T["results.json<br/>results-detailed.json"]

     %% Backend & Frontend with Origin-Specific Metrics
     T --> |Read| U[backend.py]
     I --> |Read| U
-    U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics"]
+    U --> |make_model_table| V["Model Rankings<br/>Origin-Specific Metrics<br/>Confidence Intervals"]
     U --> |make_country_table| W["Country Aggregation"]
-    U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine"]
+    U --> |make_language_tier_history| V2["Language Tier History<br/>Top 1, 2-20, 20-200"]
+    U --> |make_license_history| V3["License History<br/>Open-source vs Commercial"]
+    U --> |"API Endpoint"| X["FastAPI /api/data<br/>arc_accuracy_human<br/>arc_accuracy_machine<br/>language_tier_history<br/>license_history"]
     X --> |"JSON Response"| Y["Frontend React App"]

     %% UI Components
@@ -83,14 +83,16 @@ flowchart TD
     Y --> Z2["ModelTable.js<br/>Model Rankings"]
     Y --> Z3["LanguageTable.js<br/>Language Coverage"]
     Y --> Z4["DatasetTable.js<br/>Task Performance"]
+    Y --> Z5["LanguageTierHistoryPlot.js<br/>Tier-based Trends"]
+    Y --> Z6["LicenseHistoryPlot.js<br/>License-based Trends"]

     %% Data Sources with Origin Information
     subgraph DS ["Data Sources"]
-    DS1["Flores-200<br/>Translation Sentences<br/>Origin: 'human'"]
-    DS2["MMLU/AfriMMLU/Global-MMLU<br/>Knowledge QA<br/>Origin: 'human' or 'machine' (HF auto-translated only)"]
-    DS3["ARC<br/>Science Reasoning<br/>Origin: 'human'"]
-    DS4["TruthfulQA<br/>Truthfulness<br/>Origin: 'human'"]
-    DS5["MGSM<br/>Math Problems<br/>Origin: 'human'"]
+    DS1["FLORES+<br/>Translation Sentences<br/>Origin: 'human'"]
+    DS2["MMLU Variants<br/>AfriMMLU/Global-MMLU/MMMLU<br/>HF Auto-translated MMLU<br/>Origin: 'human' or 'machine'"]
+    DS3["Uhura ARC Easy<br/>Auto-translated ARC<br/>Origin: 'human' or 'machine'"]
+    DS4["Uhura TruthfulQA<br/>Auto-translated TruthfulQA<br/>Origin: 'human' or 'machine'"]
+    DS5["MGSM Variants<br/>MGSM/AfriMGSM/GSM8K-X<br/>Auto-translated GSM<br/>Origin: 'human' or 'machine'"]
     end

     DS1 --> Q1
@@ -115,63 +117,74 @@ flowchart TD
     classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
     classDef translation fill:#d4edda,stroke:#155724,color:#155724

-    class A1,A2,A3,A4 modelSource
+    class A1,A4 modelSource
     class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
     class R,F,G,X api
     class T,I storage
-    class Y,Z1,Z2,Z3,Z4 frontend
+    class Y,Z1,Z2,Z3,Z4,Z5,Z6 frontend
     class Google_Translate,DS_translated,DS_native translation
 ```

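To make the diagram's `complete()` edges concrete, here is a minimal sketch of a low-effort-reasoning call against OpenRouter via the OpenAI-compatible client. The model slug, API-key handling, and the exact signature of the project's `complete()` wrapper are assumptions for illustration, not the codebase's actual implementation.

```python
# Minimal sketch of the diagram's complete() call (assumption: the real
# wrapper differs in signature, retries, and error handling).
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="https://openrouter.ai/api/v1", api_key="<OPENROUTER_API_KEY>")

async def complete(model: str, prompt: str) -> str:
    response = await client.chat.completions.create(
        model=model,  # hypothetical slug, e.g. "meta-llama/llama-3.3-70b-instruct"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        # OpenRouter's unified reasoning parameter, matching the
        # "Reasoning: Low Effort" edges in the flowchart.
        extra_body={"reasoning": {"effort": "low"}},
    )
    return response.choices[0].message.content
```
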
 ## Architecture Components

 ### 🔵 Model Discovery (Light Gray)
-- **Static Curated Models**: Handpicked important models for comprehensive evaluation
-- **Dynamic Popular Models**: Real-time discovery of trending models via web scraping
+- **Static Curated Models**: Handpicked important models (~34 models) for comprehensive evaluation
+- **Dynamic Popular Models**: Web scraping capability available but currently disabled
 - **Quality Control**: Blocklist for problematic or incompatible models
-- **Model Validation**: API availability checks and cost filtering (≤$20/1M tokens)
-- **Timeout Protection**: 120s timeout for large/reasoning models, 60s for others
+- **Model Validation**: API availability checks, cost filtering (≤$15/1M tokens), and exclusion of providers that train on user data
+- **Default Selection**: Top 40 models by default (configurable via N_MODELS)
 - **Metadata Enrichment**: Rich model information from OpenRouter and HuggingFace APIs

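A minimal sketch of the validation step described above, assuming a pandas DataFrame of candidate models with a `cost` column in dollars per million tokens; the column names and environment-variable handling are illustrative, not the repository's exact code.

```python
import os
import pandas as pd

def select_models(candidates: pd.DataFrame) -> pd.DataFrame:
    """Cost filter plus top-N cut, as in the Model Validation node above."""
    n_models = int(os.environ.get("N_MODELS", 40))       # default: top 40 models
    affordable = candidates[candidates["cost"] <= 15.0]  # <= $15 per 1M tokens
    return affordable.head(n_models)  # assumes rows are already ranked by importance
```
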
 ### 🟣 Evaluation Pipeline (Medium Gray)
 - **7 Active Tasks**: Translation (bidirectional), Classification, MMLU, ARC, TruthfulQA, MGSM
 - **Unified English Zero-Shot Prompting**: All tasks use English instructions with target language content
+- **Reasoning Template**: Tasks use a structured reasoning format with `<reasoning>...</reasoning><final_answer>...</final_answer>` tags
 - **Origin Tagging**: Distinguishes between human-translated ('human') and machine-translated ('machine') data
 - **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations
-- **Sample-based**: 10 evaluations per combination for statistical reliability
-- **Batch Processing**: 50 tasks per batch with rate limiting and error resilience
+- **Sample-based**: 10 evaluations per combination for statistical reliability (configurable via N_SENTENCES)
+- **Batch Processing**: 2000 tasks per batch with rate limiting and error resilience
+- **Language Filtering**: Pre-computed valid languages per task to filter invalid combinations
+- **Default Scale**: 40 models × 1000 languages × 7 tasks × 10 samples (configurable via environment variables)
 - **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage

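The combinatorial setup and batching can be sketched as below; the registry structures (`models`, `languages`, `task_languages`) are illustrative stand-ins for what `models.py`, `languages.py`, and `tasks.py` actually provide.

```python
from itertools import product

N_SENTENCES = 10   # samples per combination
BATCH_SIZE = 2000  # tasks per batch

# Illustrative stand-ins for the real registries:
models = [{"id": "org/model-a", "tasks": {"mmlu", "translation_from"}}]
languages = ["am", "sw", "yo"]
task_languages = {"mmlu": {"am", "sw"}, "translation_from": {"am", "sw", "yo"}}

combinations = [
    (model["id"], lang, task, i)
    for model, lang, (task, valid) in product(models, languages, task_languages.items())
    if task in model["tasks"] and lang in valid  # drop invalid combinations
    for i in range(N_SENTENCES)
]
batches = [combinations[i : i + BATCH_SIZE] for i in range(0, len(combinations), BATCH_SIZE)]
```
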
 ### 🟠 API Integration (Light Gray)
 - **OpenRouter**: Primary model inference API for all language model tasks
-- **Rate Limiting**: Intelligent batching and delays to prevent API overload
-- **Error Handling**: Graceful handling of timeouts, rate limits, and model unavailability
-- **HuggingFace**: Model metadata and open-source model information
-- **Google Translate**: Specialized translation API for on-the-fly dataset translation
+- **Rate Limiting**: Async rate limiters (20 req/s OpenRouter, 10 req/s Google Translate, 5 req/s HuggingFace)
+- **Reasoning Configuration**: Low-effort reasoning mode enabled for efficiency
+- **Error Handling**: Graceful handling of timeouts, rate limits, filtered content, and model unavailability
+- **HuggingFace**: Model metadata and open-source model information via HfApi
+- **Google Translate**: Specialized translation API for on-the-fly dataset translation (when needed)

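One way to express these per-API limits is with the `aiolimiter` package; whether the project uses this package or a hand-rolled limiter is an assumption here.

```python
from aiolimiter import AsyncLimiter

openrouter_limiter = AsyncLimiter(20, 1)        # 20 requests per second
google_translate_limiter = AsyncLimiter(10, 1)  # 10 requests per second
huggingface_limiter = AsyncLimiter(5, 1)        # 5 requests per second

async def limited_complete(model: str, prompt: str) -> str:
    # Blocks until the OpenRouter budget has capacity, then calls the
    # (hypothetical) complete() wrapper sketched earlier.
    async with openrouter_limiter:
        return await complete(model, prompt)
```
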
 ### 🟢 Data Storage (Cyan)
 - **results.json**: Aggregated evaluation scores with origin-specific metrics
+- **results-detailed.json**: Detailed results with individual sample scores for bootstrap CI calculation
 - **models.json**: Dynamic model list with metadata and validation status
-- **languages.json**: Language information with population data
+- **languages.json**: Language information with population data, Glottolog families, and script information
+- **Immutable Log**: Results are cached and merged to avoid re-computation

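The bootstrap confidence intervals computed from the per-sample scores in results-detailed.json can be sketched as follows; the resample count and percentile method are assumptions.

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, seed=0):
    """95% CI of the mean via resampling the per-sample scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return np.percentile(means, 2.5), np.percentile(means, 97.5)

low, high = bootstrap_ci([0.8, 0.6, 0.9, 0.7, 1.0, 0.6, 0.8, 0.9, 0.7, 0.8])
```
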
 ### 🟡 Frontend Visualization (Light Red)
-- **WorldMap**: Interactive country-level visualization
-- **ModelTable**: Ranked model performance leaderboard with origin-specific columns
-- **LanguageTable**: Language coverage and speaker statistics
+- **WorldMap**: Interactive country-level visualization with language selection
+- **ModelTable**: Ranked model performance leaderboard with origin-specific columns and confidence intervals
+- **LanguageTable**: Language coverage and speaker statistics with confidence intervals
 - **DatasetTable**: Task-specific performance breakdowns with human/machine distinction
+- **LanguageTierHistoryPlot**: Historical trends for language tiers (Top 1, Top 2-20, Top 20-200)
+- **LicenseHistoryPlot**: Historical trends comparing open-source vs commercial models
+- **Confidence Intervals**: Bootstrap-based 95% confidence intervals for all metrics

 ### 🔵 Translation & Origin Tracking (Light Green)
-- **On-the-fly Translation**: Google Translate API for languages without native benchmarks
+- **Dataset-Based Translation**: Uses HuggingFace auto-translated datasets (MMLU, ARC, TruthfulQA, MGSM) when available
+- **On-the-fly Translation**: Google Translate API is available but primarily used for the translation tasks
 - **Origin Tagging**: Automatic classification of data sources (human vs. machine translated)
 - **Separate Metrics**: Frontend displays distinct scores for human and machine-translated data
+- **Dataset Variants**: Supports multiple dataset variants (e.g., AfriMMLU, Global-MMLU, MMMLU for MMLU)

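The selection logic behind origin tagging can be sketched like this; the dataset identifiers and the mapping are illustrative, not the exact registry used by the task functions.

```python
def pick_mmlu_variant(bcp_47: str) -> tuple[str, str]:
    """Prefer a human-translated variant; fall back to machine translation."""
    native = {"am": "masakhane/afrimmlu", "sw": "CohereForAI/Global-MMLU"}  # illustrative
    if bcp_47 in native:
        return native[bcp_47], "human"
    return "auto-translated-mmlu", "machine"  # hypothetical HF auto-translated set

dataset, origin = pick_mmlu_variant("yo")  # -> ("auto-translated-mmlu", "machine")
```
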
 ## Data Flow Summary

-1. **Model Discovery**: Combine curated + trending models → validate API availability → enrich with metadata
-2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations with origin tracking
-3. **Task Execution**: Run evaluations using unified English prompting and appropriate datasets
-4. **Result Processing**: Aggregate scores by model+language+task+origin and save to JSON files
-5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics via REST API
-6. **Frontend Display**: React app visualizes data through interactive components with transparency indicators
+1. **Model Discovery**: Load curated models (~34) → validate API availability and cost (≤$15/1M tokens) → exclude providers training on user data → enrich with metadata from OpenRouter and HuggingFace
+2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations (default: 40 models × 1000 languages) with pre-computed language filtering and origin tracking
+3. **Task Execution**: Run evaluations using unified English prompting with reasoning templates, batch processing (2000 per batch), and rate limiting
+4. **Result Processing**: Aggregate scores by model+language+task+origin, compute bootstrap confidence intervals, and save to JSON files (results.json and results-detailed.json)
+5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics, confidence intervals, language tier history, and license history via REST API
+6. **Frontend Display**: React app visualizes data through interactive components (WorldMap, ModelTable, LanguageTable, DatasetTable, LanguageTierHistoryPlot, LicenseHistoryPlot) with transparency indicators and confidence intervals

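Step 4's aggregation reduces per-sample rows to one mean score per model+language+task+origin; a minimal pandas sketch, with column names assumed from the diagram's labels:

```python
import pandas as pd

# Two illustrative per-sample rows; real rows come from the evaluation runs.
detailed = pd.DataFrame([
    {"model": "org/model-a", "bcp_47": "am", "task": "mmlu", "origin": "human", "score": 0.8},
    {"model": "org/model-a", "bcp_47": "am", "task": "mmlu", "origin": "human", "score": 0.6},
])

results = (
    detailed.groupby(["model", "bcp_47", "task", "origin"])["score"]
    .mean()
    .reset_index()
)
results.to_json("results.json", orient="records", indent=2)
```
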
-This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency.
+This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency and statistical rigor.