# AI Model Benchmarks
Live comparisons across all tracked models. Updated as new scores are published.
## AIME 2024
| Model | Company | Version | Score |
|---|---|---|---|
| DeepSeek R1 | DeepSeek | DeepSeek-R1 | 79.8% |
## AIME 2025

No scores tracked yet.
## DocVQA
| Model | Company | Version | Score |
|---|---|---|---|
| Llama 4 Scout | Meta AI | Llama-4-Scout-17B-16E | 94.4% |
## GPQA
| Model | Company | Version | Score |
|---|---|---|---|
| o3 | OpenAI | o3-2025-04-16 | 87.7% |
| Grok 3 | xAI | grok-3-beta | 84.6% |
| Gemini 2.5 Pro | Google DeepMind | gemini-2.5-pro-preview-03-25 | 84.0% |
| o4-mini | OpenAI | o4-mini-2025-04-16 | 81.4% |
| Claude Opus 4 | Anthropic | claude-opus-4-0 | 74.9% |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4-5 | 68.0% |
| DeepSeek V3 | DeepSeek | DeepSeek-V3 | 59.1% |
| GPT-4o | OpenAI | gpt-4o-2024-11-20 | 53.6% |
## HumanEval
| Model | Company | Version | Score |
|---|---|---|---|
| DeepSeek R1 | DeepSeek | DeepSeek-R1 | 96.3% |
| Gemini 2.5 Pro | Google DeepMind | gemini-2.5-pro-preview-03-25 | 94.3% |
| Claude Opus 4 | Anthropic | claude-opus-4-0 | 92.4% |
| Mistral Large 2 | Mistral AI | mistral-large-2407 | 92.0% |
| GPT-4.1 | OpenAI | gpt-4.1-2025-04-14 | 92.0% |
| DeepSeek V3 | DeepSeek | DeepSeek-V3 | 90.2% |
| GPT-4o | OpenAI | gpt-4o-2024-11-20 | 90.2% |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4-5 | 90.1% |
| Llama 3.3 70B | Meta AI | Llama-3.3-70B-Instruct | 88.4% |
| Grok 3 | xAI | grok-3-beta | 88.3% |
| Gemini 2.0 Flash | Google DeepMind | gemini-2.0-flash-001 | 85.3% |
| Claude Haiku 4.5 | Anthropic | claude-haiku-4-5 | 79.5% |
## MATH
| Model | Company | Version | Score |
|---|---|---|---|
| o4-mini | OpenAI | o4-mini-2025-04-16 | 99.5% |
| Grok 3 | xAI | grok-3-beta | 97.6% |
| DeepSeek R1 | DeepSeek | DeepSeek-R1 | 97.3% |
| Gemini 2.5 Pro | Google DeepMind | gemini-2.5-pro-preview-03-25 | 97.0% |
| DeepSeek V3 | DeepSeek | DeepSeek-V3 | 90.2% |
| Gemini 2.0 Flash | Google DeepMind | gemini-2.0-flash-001 | 89.7% |
| Claude Opus 4 | Anthropic | claude-opus-4-0 | 89.5% |
| Llama 3.3 70B | Meta AI | Llama-3.3-70B-Instruct | 77.0% |
| GPT-4o | OpenAI | gpt-4o-2024-11-20 | 76.6% |
| Mistral Large 2 | Mistral AI | mistral-large-2407 | 72.4% |
## MMLU
| Model | Company | Version | Score |
|---|---|---|---|
| o3 | OpenAI | o3-2025-04-16 | 92.4% |
| Gemini 2.5 Pro | Google DeepMind | gemini-2.5-pro-preview-03-25 | 91.8% |
| DeepSeek R1 | DeepSeek | DeepSeek-R1 | 90.8% |
| GPT-4.1 | OpenAI | gpt-4.1-2025-04-14 | 90.2% |
| GPT-4o | OpenAI | gpt-4o-2024-11-20 | 88.7% |
| Claude Opus 4 | Anthropic | claude-opus-4-0 | 88.7% |
| DeepSeek V3 | DeepSeek | DeepSeek-V3 | 88.5% |
| Claude Sonnet 4.5 | Anthropic | claude-sonnet-4-5 | 86.9% |
| Llama 3.3 70B | Meta AI | Llama-3.3-70B-Instruct | 86.0% |
| Gemini 2.0 Flash | Google DeepMind | gemini-2.0-flash-001 | 85.9% |
| Mistral Large 2 | Mistral AI | mistral-large-2407 | 84.0% |
| Claude Haiku 4.5 | Anthropic | claude-haiku-4-5 | 80.5% |
| Llama 4 Scout | Meta AI | Llama-4-Scout-17B-16E | 79.6% |
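The tables above all share one pipe-delimited shape, so they are easy to work with programmatically. Below is a minimal sketch (not this site's actual pipeline; `parse_table` and the inlined `GPQA` subset are illustrative assumptions) of parsing one table and ranking models by score:

```python
def parse_table(md: str) -> list[dict]:
    """Parse a pipe-delimited markdown table into a list of row dicts."""
    lines = [line.strip() for line in md.strip().splitlines()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


# Subset of the GPQA table from this page (hypothetical inline copy).
GPQA = """
| Model | Company | Version | Score |
|---|---|---|---|
| o3 | OpenAI | o3-2025-04-16 | 87.7% |
| Grok 3 | xAI | grok-3-beta | 84.6% |
| GPT-4o | OpenAI | gpt-4o-2024-11-20 | 53.6% |
"""

rows = parse_table(GPQA)
# Strip the trailing "%" so scores sort numerically, highest first.
ranked = sorted(rows, key=lambda r: float(r["Score"].rstrip("%")), reverse=True)
print([r["Model"] for r in ranked])  # → ['o3', 'Grok 3', 'GPT-4o']
```

The same parser works for every table on this page, since they all use identical columns (Model, Company, Version, Score).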