opus

1 article tagged with opus

March 26, 2026
benchmarkOpenAI

ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks

The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.