---
title: AutoBench Leaderboard
emoji: π
colorFrom: green
colorTo: pink
sdk: gradio
sdk_version: 5.27.0
app_file: app.py
pinned: false
license: mit
short_description: Multi-run AutoBench leaderboard with historical navigation
---

# AutoBench LLM Leaderboard

Interactive leaderboard for AutoBench, where Large Language Models (LLMs) evaluate and rank responses from other LLMs. This application supports multiple benchmark runs with seamless navigation between different time periods.

## Features

### Multi-Run Navigation

- **Run Selector**: Switch between different AutoBench runs using the dropdown menu
- **Historical Data**: View and compare results across different time periods
- **Reactive Interface**: All tabs and visualizations update automatically when switching runs (see the wiring sketch after this list)
- **Enhanced Metrics**: Support for evaluation iterations and fail rates in newer runs
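
The reactive behavior can be wired with a dropdown change event in Gradio. The following is a minimal sketch, not the app's actual code: the `load_summary` helper, component names, and the assumption that run directories sort chronologically by name are illustrative.

```python
# Minimal sketch of a reactive run selector (illustrative, not app.py itself).
import gradio as gr
import pandas as pd
from pathlib import Path

RUNS_DIR = Path("runs")  # assumed layout: runs/run_YYYY-MM-DD/summary_data.csv


def load_summary(run_id: str) -> pd.DataFrame:
    """Load the main leaderboard table for the selected run."""
    return pd.read_csv(RUNS_DIR / run_id / "summary_data.csv")


# run_YYYY-MM-DD names sort lexicographically, so the last entry is the newest.
run_ids = sorted(p.name for p in RUNS_DIR.iterdir() if p.is_dir())

with gr.Blocks() as demo:
    selector = gr.Dropdown(choices=run_ids, value=run_ids[-1], label="Benchmark run")
    table = gr.Dataframe(value=load_summary(run_ids[-1]))
    # Every output wired to the dropdown's change event refreshes when the run changes.
    selector.change(fn=load_summary, inputs=selector, outputs=table)

if __name__ == "__main__":
    demo.launch()
```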

### Comprehensive Analysis

- **Overall Ranking**: Model performance with AutoBench scores, costs, latency, and reliability metrics
- **Benchmark Comparison**: Correlations with Chatbot Arena, AAI Index, and MMLU benchmarks
- **Performance Plots**: Interactive scatter plots showing cost vs. performance trade-offs (see the plotting sketch after this list)
- **Cost & Latency Analysis**: Detailed breakdown by domain and response time percentiles
- **Domain Performance**: Model rankings across specific knowledge areas
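
A cost-vs-performance scatter plot of this kind takes only a few lines of Plotly. The sketch below reads `summary_data.csv` using the column names documented later in this README; the example run path is hypothetical.

```python
# Sketch of a cost vs. performance scatter plot (illustrative, not the app's code).
import pandas as pd
import plotly.express as px

df = pd.read_csv("runs/run_2025-08-14/summary_data.csv")  # example run directory

fig = px.scatter(
    df,
    x="Costs (USD)",
    y="AutoBench",
    hover_name="Model",  # model name appears on hover
    log_x=True,          # per-response costs span orders of magnitude, so a log axis reads better
    labels={"Costs (USD)": "Cost per response (USD)", "AutoBench": "AutoBench score"},
)
fig.show()
```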

### Dynamic Features

- **Benchmark Correlations**: Displays correlation percentages with other popular benchmarks
- **Cost Conversion**: Automatic conversion to cents for better readability (sketched after this list)
- **Performance Metrics**: Average and P99 latency measurements
- **Fail Rate Tracking**: Model reliability metrics (for supported runs)
- **Iteration Counts**: Number of evaluations per model (for supported runs)
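
The cents conversion is plain column arithmetic on the summary table. A small sketch, assuming the `Costs (USD)` column from the schema below; the helper and output column name are illustrative.

```python
# Sketch: convert per-response cost from USD to cents for display.
import pandas as pd


def to_display(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # 1 USD = 100 cents; round so the table stays readable.
    out["Cost (cents)"] = (out["Costs (USD)"] * 100).round(3)
    return out.drop(columns=["Costs (USD)"])
```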

## How to Use

### Navigation

1. **Select a Run**: Use the dropdown menu at the top to choose between available benchmark runs
2. **Explore Tabs**: Navigate through different analysis views using the tab interface
3. **Interactive Tables**: Sort and filter data by clicking on column headers
4. **Hover for Details**: Get additional information by hovering over chart elements

### Understanding the Data

- **AutoBench Score**: Higher scores indicate better performance
- **Cost**: Lower values are better (displayed in cents per response)
- **Latency**: Lower response times are better (average and P99 percentiles)
- **Fail Rate**: Lower percentages indicate more reliable models
- **Iterations**: Number of evaluation attempts per model

## Adding New Runs

### Directory Structure

```
runs/
├── run_YYYY-MM-DD/
│   ├── metadata.json       # Run information and metadata
│   ├── correlations.json   # Benchmark correlation data
│   ├── summary_data.csv    # Main leaderboard data
│   ├── domain_ranks.csv    # Domain-specific rankings
│   ├── cost_data.csv       # Cost breakdown by domain
│   ├── avg_latency.csv     # Average latency by domain
│   └── p99_latency.csv     # P99 latency by domain
```
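
A run is picked up simply by existing as a subdirectory of `runs/` with the expected files. A discovery sketch under that assumption (the helper name and the minimal required-file set are illustrative):

```python
# Sketch of run discovery: list runs/ subdirectories that contain the expected files.
from pathlib import Path

REQUIRED = {"metadata.json", "summary_data.csv"}  # minimal set assumed here


def discover_runs(runs_dir: str = "runs"):
    runs = []
    for path in sorted(Path(runs_dir).iterdir()):
        if path.is_dir() and REQUIRED.issubset(p.name for p in path.iterdir()):
            runs.append(path.name)
    return runs


print(discover_runs())  # e.g. ['run_2025-08-14', ...]
```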

### Required Files

#### 1. metadata.json

```json
{
  "run_id": "run_2025-08-14",
  "title": "AutoBench Run 3 - August 2025",
  "date": "2025-08-14",
  "description": "Latest AutoBench run with enhanced metrics",
  "blog_url": "https://huggingface.co/blog/PeterKruger/autobench-3rd-run",
  "model_count": 34,
  "is_latest": true
}
```
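
The `is_latest` flag is what lets the app choose a default run. A sketch of reading each run's metadata and picking the default; falling back to the newest `date` when no run is flagged is an assumption, not documented app behavior.

```python
# Sketch: read each run's metadata.json and choose the default run.
import json
from pathlib import Path


def pick_default_run(runs_dir: str = "runs") -> str:
    metas = []
    for meta_path in Path(runs_dir).glob("*/metadata.json"):
        with open(meta_path, encoding="utf-8") as f:
            metas.append(json.load(f))
    latest = [m for m in metas if m.get("is_latest")]
    # Fall back to the most recent date if no run is flagged as latest.
    chosen = latest[0] if latest else max(metas, key=lambda m: m["date"])
    return chosen["run_id"]
```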

#### 2. correlations.json

```json
{
  "correlations": {
    "Chatbot Arena": 82.51,
    "Artificial Analysis Intelligence Index": 83.74,
    "MMLU": 71.51
  },
  "description": "Correlation percentages between AutoBench scores and other benchmark scores"
}
```
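
Turning this file into the correlation percentages shown in the UI is a small formatting step; a sketch follows, with the formatting choices being illustrative rather than the app's exact output.

```python
# Sketch: load correlations.json and format values as percentage strings.
import json


def format_correlations(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {name: f"{value:.1f}%" for name, value in data["correlations"].items()}


# e.g. {'Chatbot Arena': '82.5%', 'Artificial Analysis Intelligence Index': '83.7%', 'MMLU': '71.5%'}
print(format_correlations("runs/run_2025-08-14/correlations.json"))
```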

#### 3. summary_data.csv

Required columns (checked by the validation sketch after these lists):

- `Model`: Model name
- `AutoBench`: AutoBench score
- `Costs (USD)`: Cost per response in USD
- `Avg Answer Duration (sec)`: Average response time
- `P99 Answer Duration (sec)`: 99th percentile response time

Optional columns (for enhanced metrics):

- `Iterations`: Number of evaluation iterations
- `Fail Rate %`: Percentage of failed responses
- `LMArena` or `Chatbot Ar.`: Chatbot Arena scores
- `MMLU-Pro` or `MMLU Index`: MMLU benchmark scores
- `AAI Index`: Artificial Analysis Intelligence Index scores
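
Before publishing a new run, it is worth checking the CSV against the required columns above. This helper is not part of the app; it is a quick pre-flight sketch.

```python
# Sketch: verify that a summary_data.csv has the required columns.
import pandas as pd

REQUIRED_COLUMNS = [
    "Model",
    "AutoBench",
    "Costs (USD)",
    "Avg Answer Duration (sec)",
    "P99 Answer Duration (sec)",
]


def check_summary(path: str) -> None:
    df = pd.read_csv(path)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"{path} is missing required columns: {missing}")
    print(f"{path}: OK ({len(df)} models)")
```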

### Adding a New Run

1. **Create Directory**: `mkdir runs/run_YYYY-MM-DD`
2. **Add Data Files**: Copy your CSV files to the new directory
3. **Create Metadata**: Add `metadata.json` with run information
4. **Add Correlations**: Create `correlations.json` with benchmark correlations
5. **Update Previous Run**: Set `"is_latest": false` in the previous latest run's metadata
6. **Restart App**: The new run will be automatically discovered (see the scaffolding sketch after these steps)
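
Steps 1, 3, and 5 can be scripted. The sketch below scaffolds a run directory and flips the previous run's `is_latest` flag; the field names follow this README, but the script itself (and the example run id in the call) is hypothetical and not part of the repo.

```python
# Sketch: scaffold a new run directory and update the previous latest run.
import json
from pathlib import Path


def scaffold_run(run_id: str, title: str, date: str, runs_dir: str = "runs") -> Path:
    runs = Path(runs_dir)
    # 1. Clear the is_latest flag on every existing run.
    for meta_path in runs.glob("*/metadata.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        meta["is_latest"] = False
        meta_path.write_text(json.dumps(meta, indent=2), encoding="utf-8")
    # 2. Create the new run directory with its metadata; CSVs are copied in separately.
    new_dir = runs / run_id
    new_dir.mkdir(parents=True, exist_ok=True)
    metadata = {"run_id": run_id, "title": title, "date": date, "is_latest": True}
    (new_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return new_dir


# Hypothetical example run:
scaffold_run("run_2025-09-30", "AutoBench Run 4 - September 2025", "2025-09-30")
```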

### Column Compatibility

The application automatically adapts to different column structures:

- **Legacy Runs**: Support basic columns (Model, AutoBench, Cost, Latency)
- **Enhanced Runs**: Include additional metrics (Iterations, Fail Rate %)
- **Flexible Naming**: Handles variations in benchmark column names (see the normalization sketch after this list)
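
One simple way to handle the naming variations is a rename map applied when the CSV is loaded. This is a sketch; the exact aliases and canonical names the app uses may differ.

```python
# Sketch: normalize benchmark column-name variants to canonical names.
import pandas as pd

COLUMN_ALIASES = {
    "LMArena": "Chatbot Arena",
    "Chatbot Ar.": "Chatbot Arena",
    "MMLU-Pro": "MMLU Index",
}


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # rename() ignores keys that aren't present, so legacy files pass through unchanged.
    return df.rename(columns=COLUMN_ALIASES)
```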

## Development

### Requirements

- Python 3.8+
- Gradio 5.27.0+
- Pandas
- Plotly

### Installation

```bash
pip install -r requirements.txt
```

### Running Locally

```bash
python app.py
```

### Killing All Python Processes

```bash
# Windows: run this from a bash-style shell (e.g., Git Bash) so the 2>/dev/null redirect works
taskkill /F /IM python.exe 2>/dev/null || echo "No Python processes to kill"
```

The app will automatically discover available runs and launch on a local port.

## Data Sources

AutoBench evaluations are conducted using LLM-generated questions across diverse domains, with responses ranked by evaluation LLMs. For more information about the methodology, visit the [AutoBench blog posts](https://huggingface.co/blog/PeterKruger/autobench).

## License

MIT License - see LICENSE file for details.

---

Check out the [Hugging Face Spaces configuration reference](https://huggingface.co/docs/hub/spaces-config-reference) for deployment options.