opus
1 article tagged with opus
March 26, 2026
benchmark, OpenAI
ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks
The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark that requires AI agents to explore environments, form hypotheses, and execute plans without instructions. Untrained humans solved all 135 environments, yet frontier models, including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%), all scored below 1%.