ARC-AGI-3 Benchmark: Frontier AI Models Score Below 1%, Humans Solve All Tasks
The ARC Prize Foundation has released ARC-AGI-3, a new interactive benchmark that exposes a stark capability gap between frontier AI systems and untrained humans. All 135 benchmark environments were solved by humans with zero prior knowledge and no instructions. Every tested frontier model scored below 1 percent.
Frontier Model Performance
Official leaderboard results from API-based testing without custom scaffolding:
- Gemini 3.1 Pro Preview: 0.37%
- GPT 5.4: 0.26%
- Opus 4.6: 0.25%
- Grok-4.20: 0.00%
Unlike ARC-AGI-1 and ARC-AGI-2, which presented static input-output pattern-matching tasks, ARC-AGI-3 places AI agents in turn-based game environments where they must independently discover objectives, form hypotheses about game mechanics, and execute multi-step plans—exactly as untrained human players do.
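The explore-hypothesize-act loop described above can be sketched with a toy environment. Everything here is illustrative: `TurnBasedEnv`, `explore`, and the hidden win condition are invented stand-ins, not the real ARC-AGI-3 API.

```python
import random

class TurnBasedEnv:
    """Toy stand-in for an ARC-AGI-3-style environment: the agent receives
    observations and may act, but gets no goal description or instructions."""
    def __init__(self):
        self.state = 0
        self.steps = 0

    def observe(self):
        return {"grid": self.state}

    def act(self, action):
        """Apply one action; return True once the (hidden) win condition holds."""
        self.steps += 1
        if action == "advance":
            self.state += 1
        return self.state >= 3  # hidden objective the agent must discover

def explore(env, actions=("advance", "wait"), max_turns=50):
    """Minimal exploration loop: try actions and watch for state changes.
    A real agent would form hypotheses from (before, after) transitions."""
    for _ in range(max_turns):
        before = env.observe()
        done = env.act(random.choice(actions))
        if done:
            return env.steps  # action count: the quantity RHAE scores
    return None  # objective never discovered within the turn budget
```

Only actions that change game state matter for scoring, which is why the loop returns `env.steps` rather than any measure of internal computation.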
The RHAE Scoring Metric
ARC-AGI-3 uses Relative Human Action Efficiency (RHAE) to measure performance, making direct comparison with predecessor benchmarks impossible. The metric counts only actions that change game state; reasoning or internal computation doesn't factor in.
Efficiency is calculated per level as a squared ratio: (human actions / AI actions)². A model that needs 100 actions on a level the human baseline clears in 10 receives 1% for that level, not 10%. The baseline is set by the second-best of ten first-time human players; the top performer is excluded to filter outliers. Faster-than-human performance is capped at 1.0 per level, and later levels receive higher weight due to increased complexity.
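The scoring rules above can be worked through in a short sketch. The per-level formula, the cap at 1.0, and the second-best-human baseline follow the description in the text; the specific weights and the weighted-average aggregation are illustrative assumptions, since only "later levels weigh more" is stated.

```python
def human_baseline(human_action_counts):
    """Baseline = second-best (second-fewest actions) of the human players;
    the top performer is excluded to filter outliers."""
    return sorted(human_action_counts)[1]

def level_efficiency(human_actions, ai_actions):
    """Per-level score: squared ratio of human to AI action counts, capped at 1.0."""
    if ai_actions == 0:
        return 0.0  # level not completed
    return min(1.0, (human_actions / ai_actions) ** 2)

def rhae(human_counts, ai_counts, weights):
    """Assumed aggregation: weighted average of per-level efficiencies,
    with later levels carrying higher weight."""
    total = sum(weights)
    return sum(w * level_efficiency(h, a)
               for h, a, w in zip(human_counts, ai_counts, weights)) / total

# The article's example: 100 AI actions against a 10-action human baseline.
print(level_efficiency(10, 100))  # ≈ 0.01, i.e. 1% for that level, not 10%
```

The squaring is what makes the metric so punishing: a model twice as slow as the human baseline loses 75% of the per-level score, not 50%.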
Why Scaffolding Doesn't Transfer
The official leaderboard uses standardized prompting across all models to measure general intelligence rather than human engineering effort. A Duke University finding demonstrates why: Opus 4.6 achieved 97.1% on a known environment with hand-crafted scaffolding but scored 0% on unfamiliar tasks. This bimodal pattern, near-perfect on the engineered task and zero elsewhere, shows that task-specific harnesses don't transfer to novel problems.
The ARC Prize Foundation maintains a separate community leaderboard for scaffolding-driven results with explicit warnings against interpreting these as AGI progress. However, the foundation expects successful harness techniques to eventually migrate into models themselves—similar to how chain-of-thought prompting evolved from external technique to built-in feature in OpenAI's o1.
Founder François Chollet argues on X that true AGI requires no task-specific human guidance, since untrained humans solve these tasks independently. The distinction matters: general intelligence means facing any new task without preparation, not broad training across varied task categories.
Historical Predictive Power
ARC-AGI-1 and ARC-AGI-2 predicted major AI breakthroughs before other benchmarks detected them. ARC-AGI-1 precisely identified the reasoning model breakthrough—OpenAI's o3 achieved significant gains—when other benchmarks had plateaued. ARC-AGI-2 captured modern reasoning model progress and scaffolding's rise, now deployed in production systems like Claude Code. Both predecessors are now saturated.
ARC-AGI-3 targets the next frontier: agentic intelligence—navigating completely unfamiliar environments without domain-specific training.
Prize and Public Access
The ARC Prize Foundation made 25 environments publicly available and launched the ARC Prize 2026 on Kaggle with $2 million in total prize money for any AI system matching untrained human performance across the full benchmark.
The foundation plans to limit maximum attempts to five times the human attempt count due to cost constraints.
What This Means
Frontier models' sub-1% performance isn't an artifact of missing scaffolding—it reflects fundamental limits in zero-shot environment adaptation. The benchmark cleanly separates genuine generalization capability from task-specific engineering. ARC-AGI-3's predictive track record suggests it will flag the next genuine capability breakthrough when it arrives. The $2M prize remains unclaimed, indicating the gap between current AI and human-level task-agnostic problem-solving remains substantial.