AI Model Benchmarks

Live comparisons across all tracked models. Updated as new scores are published.

AIME 2024

Model              Company          Version                       Score
DeepSeek R1        DeepSeek         DeepSeek-R1                   79.8%

AIME 2025

Model              Company          Version                       Score
o3                 OpenAI           o3-2025-04-16                 96.7%
o4-mini            OpenAI           o4-mini-2025-04-16            93.4%
Grok 3             xAI              grok-3-beta                   93.3%

DocVQA

Model              Company          Version                       Score
Llama 4 Scout      Meta AI          Llama-4-Scout-17B-16E         94.4%

GPQA

Model              Company          Version                       Score
o3                 OpenAI           o3-2025-04-16                 87.7%
Grok 3             xAI              grok-3-beta                   84.6%
Gemini 2.5 Pro     Google DeepMind  gemini-2.5-pro-preview-03-25  84.0%
o4-mini            OpenAI           o4-mini-2025-04-16            81.4%
Claude Opus 4      Anthropic        claude-opus-4-0               74.9%
Claude Sonnet 4.5  Anthropic        claude-sonnet-4-5             68.0%
DeepSeek V3        DeepSeek         DeepSeek-V3                   59.1%
GPT-4o             OpenAI           gpt-4o-2024-11-20             53.6%

HumanEval

Model              Company          Version                       Score
DeepSeek R1        DeepSeek         DeepSeek-R1                   96.3%
Gemini 2.5 Pro     Google DeepMind  gemini-2.5-pro-preview-03-25  94.3%
Claude Opus 4      Anthropic        claude-opus-4-0               92.4%
Mistral Large 2    Mistral AI       mistral-large-2407            92.0%
GPT-4.1            OpenAI           gpt-4.1-2025-04-14            92.0%
DeepSeek V3        DeepSeek         DeepSeek-V3                   90.2%
GPT-4o             OpenAI           gpt-4o-2024-11-20             90.2%
Claude Sonnet 4.5  Anthropic        claude-sonnet-4-5             90.1%
Llama 3.3 70B      Meta AI          Llama-3.3-70B-Instruct        88.4%
Grok 3             xAI              grok-3-beta                   88.3%
Gemini 2.0 Flash   Google DeepMind  gemini-2.0-flash-001          85.3%
Claude Haiku 4.5   Anthropic        claude-haiku-4-5              79.5%

MATH

Model              Company          Version                       Score
o4-mini            OpenAI           o4-mini-2025-04-16            99.5%
Grok 3             xAI              grok-3-beta                   97.6%
DeepSeek R1        DeepSeek         DeepSeek-R1                   97.3%
Gemini 2.5 Pro     Google DeepMind  gemini-2.5-pro-preview-03-25  97.0%
DeepSeek V3        DeepSeek         DeepSeek-V3                   90.2%
Gemini 2.0 Flash   Google DeepMind  gemini-2.0-flash-001          89.7%
Claude Opus 4      Anthropic        claude-opus-4-0               89.5%
Llama 3.3 70B      Meta AI          Llama-3.3-70B-Instruct        77.0%
GPT-4o             OpenAI           gpt-4o-2024-11-20             76.6%
Mistral Large 2    Mistral AI       mistral-large-2407            72.4%

MMLU

Model              Company          Version                       Score
o3                 OpenAI           o3-2025-04-16                 92.4%
Gemini 2.5 Pro     Google DeepMind  gemini-2.5-pro-preview-03-25  91.8%
DeepSeek R1        DeepSeek         DeepSeek-R1                   90.8%
GPT-4.1            OpenAI           gpt-4.1-2025-04-14            90.2%
GPT-4o             OpenAI           gpt-4o-2024-11-20             88.7%
Claude Opus 4      Anthropic        claude-opus-4-0               88.7%
DeepSeek V3        DeepSeek         DeepSeek-V3                   88.5%
Claude Sonnet 4.5  Anthropic        claude-sonnet-4-5             86.9%
Llama 3.3 70B      Meta AI          Llama-3.3-70B-Instruct        86.0%
Gemini 2.0 Flash   Google DeepMind  gemini-2.0-flash-001          85.9%
Mistral Large 2    Mistral AI       mistral-large-2407            84.0%
Claude Haiku 4.5   Anthropic        claude-haiku-4-5              80.5%
Llama 4 Scout      Meta AI          Llama-4-Scout-17B-16E         79.6%

SWE-bench Verified

Model              Company          Version                       Score
o3                 OpenAI           o3-2025-04-16                 71.7%
GPT-4.1            OpenAI           gpt-4.1-2025-04-14            54.6%