
ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks

TL;DR

The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.



The ARC Prize Foundation has released ARC-AGI-3, a new interactive benchmark that exposes a stark capability gap between frontier AI systems and untrained humans. All 135 benchmark environments were solved by humans with zero prior knowledge and no instructions, while every tested frontier model scored below 1%.

Frontier Model Performance

Official leaderboard results from API-based testing without custom scaffolding:

  • Gemini 3.1 Pro Preview: 0.37%
  • GPT 5.4: 0.26%
  • Opus 4.6: 0.25%
  • Grok-4.20: 0.00%

Unlike ARC-AGI-1 and ARC-AGI-2, which presented static input-output pattern-matching tasks, ARC-AGI-3 places AI agents in turn-based game environments where they must independently discover objectives, form hypotheses about game mechanics, and execute multi-step plans—exactly as untrained human players do.

The RHAE Scoring Metric

ARC-AGI-3 uses Relative Human Action Efficiency (RHAE) to measure performance, making direct comparison with predecessor benchmarks impossible. The metric counts only actions that change game state; reasoning or internal computation doesn't factor in.

Efficiency is calculated per level using a squared formula: (human actions / AI actions)². A model requiring 100 actions versus a human's 10 receives 1% per level, not 10%. Only the second-best human performer (out of ten first-time players) sets the baseline—the top performer is excluded to filter outliers. Faster-than-human performance caps at 1.0 per level. Later levels receive higher weight due to increased complexity.

Why Scaffolding Doesn't Transfer

The official leaderboard uses standardized prompting across all models to measure general intelligence, not human engineering effort. A critical Duke University finding demonstrates why: Opus 4.6 achieved 97.1% on a known environment with hand-crafted scaffolding but scored 0% on unfamiliar tasks. This bimodal pattern proves task-specific harnesses don't transfer to novel problems.

The ARC Prize Foundation maintains a separate community leaderboard for scaffolding-driven results with explicit warnings against interpreting these as AGI progress. However, the foundation expects successful harness techniques to eventually migrate into models themselves—similar to how chain-of-thought prompting evolved from external technique to built-in feature in OpenAI's o1.

Founder François Chollet argues on X that true AGI requires no task-specific human guidance, since untrained humans solve these tasks independently. The distinction matters: general intelligence means facing any new task without preparation, not broad training across varied task categories.

Historical Predictive Power

ARC-AGI-1 and ARC-AGI-2 predicted major AI breakthroughs before other benchmarks detected them. ARC-AGI-1 precisely identified the reasoning model breakthrough—OpenAI's o3 achieved significant gains—when other benchmarks had plateaued. ARC-AGI-2 captured modern reasoning model progress and scaffolding's rise, now deployed in production systems like Claude Code. Both predecessors are now saturated.

ARC-AGI-3 targets the next frontier: agentic intelligence—navigating completely unfamiliar environments without domain-specific training.

Prize and Public Access

The ARC Prize Foundation made 25 environments publicly available and launched the ARC Prize 2026 on Kaggle with $2 million in total prize money for any AI system matching untrained human performance across the full benchmark.

The foundation plans to limit maximum attempts to five times the human attempt count due to cost constraints.

What This Means

Frontier models' sub-1% performance isn't an artifact of missing scaffolding—it reflects fundamental limits in zero-shot environment adaptation. The benchmark cleanly separates genuine generalization capability from task-specific engineering. ARC-AGI-3's predictive track record suggests it will flag the next genuine capability breakthrough when it arrives. The $2M prize remains unclaimed, indicating the gap between current AI and human-level task-agnostic problem-solving remains substantial.
