# languagebench - System Architecture
\[AI-generated, not 100% up-to-date\]
This diagram shows the complete data flow from model discovery through evaluation to frontend visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models
Static Curated List
~34 models"] --> D[load_models]
A4["blocklist
Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List
Validated Models"]
E --> |get_or_metadata| F["OpenRouter API
Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API
Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Model Validation & Cost Filtering
H --> |"Validate Models
Check API Availability
No User Data Training"| H1["Valid Models Only
Cost ≤ $15/1M tokens"]
H1 --> H2["Robust Model List
Default: Top 40 models"]
%% Language Data
J["languages.py
BCP-47 + Population
Glottolog Families"] --> K["Languages Sorted by Speakers
Default: Up to 1000 languages"]
%% Task Registry with Unified Prompting
L["tasks.py
7 Evaluation Tasks"] --> M["Task Functions
Unified English Zero-Shot
Reasoning Template"]
M --> M1["translation_from/to
BLEU + ChrF"]
M --> M2["classification
Accuracy"]
M --> M3["mmlu
Accuracy"]
M --> M4["arc
Accuracy"]
M --> M5["truthfulqa
Accuracy"]
M --> M6["mgsm
Accuracy"]
%% On-the-fly Translation with Origin Tagging
subgraph OTF [On-the-fly Dataset Translation]
direction LR
DS_raw["Raw English Dataset
"] --> Google_Translate["Google Translate API"]
Google_Translate --> DS_translated["Translated Dataset
e.g., MGSM/ARC
Origin: 'machine'"]
DS_native["Native Dataset
e.g., AfriMMLU/Global-MMLU
Origin: 'human'"]
end
%% Evaluation Pipeline
H2 --> |"models ID
Default: 40 models"| N["main.py / main_gcs.py
evaluate"]
K --> |"languages bcp_47
Default: 1000 languages"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks
Filter by valid task languages"| O["Valid Combinations
Model × Language × Task"]
O --> |"10 samples each"| P["Evaluation Execution
Batch Processing
Batch Size: 2000"]
%% Task Execution with Origin Tracking
P --> Q1["translate_and_evaluate
Origin: 'human'"]
P --> Q2["classify_and_evaluate
Origin: 'human'"]
P --> Q3["mmlu_and_evaluate
Origin: 'human'/'machine'
no on-the-fly; uses HF auto-translated if available"]
P --> Q4["arc_and_evaluate
Origin: 'human'/'machine'"]
P --> Q5["truthfulqa_and_evaluate
Origin: 'human'/'machine'"]
P --> Q6["mgsm_and_evaluate
Origin: 'human'/'machine'"]
%% API Calls with Error Handling
Q1 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R["OpenRouter
Model Inference"]
Q2 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R
Q3 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R
Q4 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R
Q5 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R
Q6 --> |"complete() API
Rate Limiting
Reasoning: Low Effort"| R
%% Results Processing with Origin Aggregation
R --> |Scores| S["Result Aggregation
Mean by model+lang+task+origin
Bootstrap Confidence Intervals"]
S --> |Save| T["results.json
results-detailed.json"]
%% Backend & Frontend with Origin-Specific Metrics
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings
Origin-Specific Metrics
Confidence Intervals"]
U --> |make_country_table| W["Country Aggregation"]
U --> |make_language_tier_history| V2["Language Tier History
Top 1, 2-20, 20-200"]
U --> |make_license_history| V3["License History
Open-source vs Commercial"]
U --> |"API Endpoint"| X["FastAPI /api/data
arc_accuracy_human
arc_accuracy_machine
language_tier_history
license_history"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js
Country Visualization"]
Y --> Z2["ModelTable.js
Model Rankings"]
Y --> Z3["LanguageTable.js
Language Coverage"]
Y --> Z4["DatasetTable.js
Task Performance"]
Y --> Z5["LanguageTierHistoryPlot.js
Tier-based Trends"]
Y --> Z6["LicenseHistoryPlot.js
License-based Trends"]
%% Data Sources with Origin Information
subgraph DS ["Data Sources"]
DS1["FLORES+
Translation Sentences
Origin: 'human'"]
DS2["MMLU Variants
AfriMMLU/Global-MMLU/MMMLU
HF Auto-translated MMLU
Origin: 'human' or 'machine'"]
DS3["Uhuru ARC Easy
Auto-translated ARC
Origin: 'human' or 'machine'"]
DS4["Uhura TruthfulQA
Auto-translated TruthfulQA
Origin: 'human' or 'machine'"]
DS5["MGSM Variants
MGSM/AfriMGSM/GSM8K-X
Auto-translated GSM
Origin: 'human' or 'machine'"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
%% No on-the-fly DS_translated for MMLU anymore; only HF auto-translated used
DS_translated --> Q4
DS_translated --> Q5
DS_native --> Q3
DS_native --> Q4
DS_native --> Q5
%% Styling - Neutral colors that work in both dark and light modes
classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
classDef translation fill:#d4edda,stroke:#155724,color:#155724
class A1,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4,Z5,Z6 frontend
class Google_Translate,DS_translated,DS_native translation
```
## Architecture Components
### ⚪ Model Discovery (Light Gray)
- **Static Curated Models**: ~34 handpicked models selected for comprehensive evaluation
- **Dynamic Popular Models**: Web scraping capability available but currently disabled
- **Quality Control**: Blocklist for problematic or incompatible models
- **Model Validation**: API availability checks, cost filtering (≤$15/1M tokens), and exclusion of providers that train on user data
- **Default Selection**: Top 40 models by default (configurable via N_MODELS)
- **Metadata Enrichment**: Rich model information from the OpenRouter and HuggingFace APIs; the whole discovery pass is sketched after this list
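
In code, the discovery pass boils down to combine → dedupe → validate → enrich → truncate. The sketch below is hypothetical: `get_or_metadata` and `get_hf_metadata` appear in the diagram, but their return fields and the stub bodies here are assumptions, not the project's actual implementation.

```python
# Hypothetical sketch of load_models; field names and stubs are assumptions.
import os

MAX_COST_PER_MTOK = 15.0                      # cost ceiling: $15 per 1M tokens
N_MODELS = int(os.getenv("N_MODELS", "40"))   # default: top 40 models

def get_or_metadata(model_id):
    """Stand-in for the OpenRouter metadata lookup (fields assumed)."""
    return {"id": model_id, "cost_per_mtok": 3.0, "trains_on_user_data": False}

def get_hf_metadata(model_id):
    """Stand-in for the HuggingFace details lookup (fields assumed)."""
    return {"hf_license": "apache-2.0"}

def load_models(important_models, blocklist):
    # Combine and dedupe (order-preserving), then drop blocklisted entries.
    candidates = [m for m in dict.fromkeys(important_models) if m not in blocklist]
    enriched = []
    for model_id in candidates:
        meta = get_or_metadata(model_id)
        if meta is None:                               # not available via the API
            continue
        if meta["cost_per_mtok"] > MAX_COST_PER_MTOK:  # too expensive at scale
            continue
        if meta["trains_on_user_data"]:                # provider trains on user data
            continue
        enriched.append({**meta, **(get_hf_metadata(model_id) or {})})
    return enriched[:N_MODELS]
```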
### ⚪ Evaluation Pipeline (Medium Gray)
- **7 Active Tasks**: Translation (from and to the target language, counted as two tasks), Classification, MMLU, ARC, TruthfulQA, MGSM
- **Unified English Zero-Shot Prompting**: All tasks use English instructions with target language content
- **Reasoning Template**: Tasks share a structured reasoning template with dedicated reasoning tags
- **Origin Tagging**: Distinguishes between human-translated ('human') and machine-translated ('machine') data
- **Combinatorial Approach**: Systematic evaluation across Model × Language × Task combinations (expansion sketched after this list)
- **Sample-based**: 10 evaluations per combination for statistical reliability (configurable via N_SENTENCES)
- **Batch Processing**: 2000 tasks per batch with rate limiting and error resilience
- **Language Filtering**: Pre-computed valid languages per task to filter invalid combinations
- **Default Scale**: 40 models × 1000 languages × 7 tasks × 10 samples (configurable via environment variables)
- **Dual Deployment**: `main.py` for local/GitHub, `main_gcs.py` for Google Cloud with GCS storage
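
Concretely, the setup step expands the combination grid and chunks it into batches. The data shapes assumed here (`model["tasks"]`, the `valid_languages_per_task` lookup) follow the diagram labels, not verified code:

```python
# Sketch of the Model x Language x Task expansion and batching; the data
# shapes (model["tasks"], valid_languages_per_task) are assumptions.
from itertools import product

N_SENTENCES = 10     # samples per combination (configurable via env)
BATCH_SIZE = 2000    # evaluation tasks per batch

def build_combinations(models, languages, tasks, valid_languages_per_task):
    combos = []
    for model, lang, task_name in product(models, languages, tasks):
        if task_name not in model["tasks"]:                   # unsupported task
            continue
        if lang not in valid_languages_per_task[task_name]:   # no data for language
            continue
        combos += [(model["id"], lang, task_name, i) for i in range(N_SENTENCES)]
    return combos

def batched(items, size=BATCH_SIZE):
    # Yield fixed-size chunks; the runner awaits one batch before the next.
    for start in range(0, len(items), size):
        yield items[start:start + size]
```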
### ⚪ API Integration (Light Gray)
- **OpenRouter**: Primary model inference API for all language model tasks
- **Rate Limiting**: Async rate limiters (20 req/s OpenRouter, 10 req/s Google Translate, 5 req/s HuggingFace); a call pattern is sketched after this list
- **Reasoning Configuration**: Low-effort reasoning mode enabled for efficiency
- **Error Handling**: Graceful handling of timeouts, rate limits, filtered content, and model unavailability
- **HuggingFace**: Model metadata and open-source model information via HfApi
- **Google Translate**: Specialized translation API for on-the-fly dataset translation (when needed)
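
A rate-limited call with this error handling might look as follows. `aiolimiter`, the OpenAI-compatible client, and the `OPENROUTER_API_KEY` variable are assumptions rather than confirmed project dependencies:

```python
# Sketch of a rate-limited, error-tolerant inference call via OpenRouter.
# aiolimiter, the OpenAI-compatible client, and OPENROUTER_API_KEY are
# assumptions, not confirmed project dependencies.
import os
from aiolimiter import AsyncLimiter
from openai import AsyncOpenAI, APITimeoutError, RateLimitError

openrouter_limiter = AsyncLimiter(20, 1)   # 20 requests per second

client = AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.getenv("OPENROUTER_API_KEY", ""),
)

async def complete(model_id, messages):
    async with openrouter_limiter:         # wait for a free slot before calling
        try:
            response = await client.chat.completions.create(
                model=model_id,
                messages=messages,
                extra_body={"reasoning": {"effort": "low"}},  # low-effort reasoning
            )
            return response.choices[0].message.content
        except (APITimeoutError, RateLimitError):
            return None                    # degrade to a missing sample, not a crash
```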
### 🔵 Data Storage (Cyan)
- **results.json**: Aggregated evaluation scores with origin-specific metrics
- **results-detailed.json**: Detailed results with individual sample scores for bootstrap CI calculation
- **models.json**: Dynamic model list with metadata and validation status
- **languages.json**: Language information with population data, Glottolog families, and script information
- **Immutable Log**: Results are cached and merged to avoid re-computation, as sketched below
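
The cache-and-merge behaviour can be pictured as keyed upserts on the result records. The key fields below follow the aggregation keys named above (model + language + task + origin) but are otherwise assumed:

```python
# Sketch of the cache-and-merge step; the record keys are assumed from the
# aggregation described above (model + language + task + origin).
import json
from pathlib import Path

RESULTS = Path("results.json")

def load_results():
    return json.loads(RESULTS.read_text()) if RESULTS.exists() else []

def merge_results(old, new):
    def key(r):
        return (r["model"], r["bcp_47"], r["task"], r["origin"])
    merged = {key(r): r for r in old}
    merged.update({key(r): r for r in new})    # fresh scores replace stale ones
    return list(merged.values())

def save_results(records):
    RESULTS.write_text(json.dumps(records, indent=2, ensure_ascii=False))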
### 🔴 Frontend Visualization (Light Red)
- **WorldMap**: Interactive country-level visualization with language selection
- **ModelTable**: Ranked model performance leaderboard with origin-specific columns and confidence intervals
- **LanguageTable**: Language coverage and speaker statistics with confidence intervals
- **DatasetTable**: Task-specific performance breakdowns with human/machine distinction
- **LanguageTierHistoryPlot**: Historical trends for language tiers (Top 1, Top 2-20, Top 20-200)
- **LicenseHistoryPlot**: Historical trends comparing open-source vs commercial models
- **Confidence Intervals**: Bootstrap-based 95% confidence intervals for all metrics, as sketched below
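
Because results-detailed.json keeps individual sample scores, a percentile bootstrap is enough to produce these intervals. The resampling scheme below is an assumption; only the "bootstrap 95% CI" requirement comes from the section above:

```python
# Sketch of a percentile-bootstrap 95% CI over per-sample scores; the exact
# resampling scheme used by the project is an assumption.
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05):
    means = sorted(
        sum(random.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]            # 2.5th percentile
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lower, upper

# Example: CI for the 10 samples of one model x language x task x origin cell
print(bootstrap_ci([0.8, 1.0, 0.6, 0.9, 1.0, 0.7, 0.8, 1.0, 0.9, 0.6]))
```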
### 🟢 Translation & Origin Tracking (Light Green)
- **Dataset-Based Translation**: Uses HuggingFace auto-translated datasets (MMLU, ARC, TruthfulQA, MGSM) when available
- **On-the-fly Translation**: The Google Translate API can translate raw English datasets on demand where no pre-translated variant exists
- **Origin Tagging**: Automatic classification of data sources as human- or machine-translated (sketched after this list)
- **Separate Metrics**: Frontend displays distinct scores for human and machine-translated data
- **Dataset Variants**: Supports multiple dataset variants (e.g., AfriMMLU, Global-MMLU, MMMLU for MMLU)
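
The tagging rule itself is a per-language preference: use a native (human-translated) variant if one exists, otherwise fall back to an auto-translated one. The lookup tables in this sketch are illustrative, not the project's actual registry:

```python
# Illustrative sketch of the origin-tagging rule for MMLU-style tasks; the
# lookup tables are hypothetical, only the preference rule comes from above.
NATIVE_MMLU = {"am": "AfriMMLU", "de": "Global-MMLU"}         # human-translated
MACHINE_MMLU = {"am": "auto-translated MMLU", "tr": "auto-translated MMLU"}

def pick_mmlu_dataset(bcp_47):
    if bcp_47 in NATIVE_MMLU:
        return NATIVE_MMLU[bcp_47], "human"     # prefer native data
    if bcp_47 in MACHINE_MMLU:
        return MACHINE_MMLU[bcp_47], "machine"  # fall back to auto-translated
    return None, None                           # task skipped for this language

print(pick_mmlu_dataset("am"))   # -> ('AfriMMLU', 'human')
print(pick_mmlu_dataset("tr"))   # -> ('auto-translated MMLU', 'machine')
```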
## Data Flow Summary
1. **Model Discovery**: Load curated models (~34) → validate API availability and cost (≤$15/1M tokens) → exclude providers training on user data → enrich with metadata from OpenRouter and HuggingFace
2. **Evaluation Setup**: Generate all valid Model × Language × Task combinations (default: 40 models × 1000 languages) with pre-computed language filtering and origin tracking
3. **Task Execution**: Run evaluations using unified English prompting with reasoning templates, batch processing (2000 per batch), and rate limiting
4. **Result Processing**: Aggregate scores by model+language+task+origin, compute bootstrap confidence intervals, and save to JSON files (results.json and results-detailed.json)
5. **Backend Serving**: FastAPI serves processed data with origin-specific metrics, confidence intervals, language tier history, and license history via a REST API (sketched after this list)
6. **Frontend Display**: React app visualizes data through interactive components (WorldMap, ModelTable, LanguageTable, DatasetTable, LanguageTierHistoryPlot, LicenseHistoryPlot) with transparency indicators and confidence intervals
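
Step 5 is intentionally thin: FastAPI reads the precomputed JSON files and returns the tables the frontend consumes. The `make_*` names come from the diagram, but their bodies, the HTTP method, and the response shape below are assumptions:

```python
# Sketch of the serving layer; make_* stubs and the response shape are
# assumptions. Only the /api/data route and helper names appear above.
import json
from pathlib import Path
from fastapi import FastAPI

app = FastAPI()

def make_model_table(results, models):   # stand-in for backend.py aggregation
    return []

def make_country_table(results):
    return []

@app.get("/api/data")                    # HTTP method assumed
async def data():
    results = json.loads(Path("results.json").read_text())
    models = json.loads(Path("models.json").read_text())
    # The real endpoint also returns language_tier_history and license_history.
    return {
        "model_table": make_model_table(results, models),
        "country_table": make_country_table(results),
    }
```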
This architecture enables scalable, automated evaluation of AI language models across diverse languages and tasks while providing real-time insights through an intuitive web interface with methodological transparency and statistical rigor.