
ARC-AGI-3 benchmark: frontier AI models score below 1%, humans solve all 135 tasks

TL;DR

The ARC Prize Foundation released ARC-AGI-3, an interactive benchmark requiring AI agents to explore environments, form hypotheses, and execute plans without instructions. All 135 environments were solved by untrained humans, yet frontier models—including Gemini 3.1 Pro Preview (0.37%), GPT 5.4 (0.26%), Opus 4.6 (0.25%), and Grok-4.20 (0.00%)—scored below 1%.

The ARC Prize Foundation has released ARC-AGI-3, a new interactive benchmark that exposes a stark capability gap between frontier AI systems and untrained humans. All 135 benchmark environments were solved by humans with zero prior knowledge and no instructions. Every tested frontier model scored below 1 percent.

Frontier Model Performance

Official leaderboard results from API-based testing without custom scaffolding:

  • Gemini 3.1 Pro Preview: 0.37%
  • GPT 5.4: 0.26%
  • Opus 4.6: 0.25%
  • Grok-4.20: 0.00%

Unlike ARC-AGI-1 and ARC-AGI-2, which presented static input-output pattern-matching tasks, ARC-AGI-3 places AI agents in turn-based game environments where they must independently discover objectives, form hypotheses about game mechanics, and execute multi-step plans—exactly as untrained human players do.
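To make the interaction model concrete, here is a minimal sketch of such a turn-based agent loop in Python. The `Environment` interface and its method names (`observe`, `act`, `is_solved`, `is_over`) are illustrative assumptions, not the official ARC-AGI-3 agents API; the point is that the agent receives only raw observations and must discover the mechanics through state-changing actions.

```python
from typing import Callable, Protocol

class Environment(Protocol):
    """Hypothetical turn-based environment interface (illustrative only)."""
    def observe(self) -> list[list[int]]: ...  # current frame as a grid
    def act(self, action: int) -> None: ...    # apply one of the allowed actions
    def is_solved(self) -> bool: ...
    def is_over(self) -> bool: ...

def run_agent(env: Environment,
              choose_action: Callable[[list[list[int]]], int],
              max_steps: int = 1000) -> int:
    """Drive one episode with no instructions: only observations and actions.

    Returns the number of state-changing actions taken, which is the
    quantity RHAE counts (no-op actions and internal reasoning are free).
    """
    state_changes = 0
    prev = env.observe()
    for _ in range(max_steps):
        if env.is_solved() or env.is_over():
            break
        action = choose_action(prev)  # the agent forms and tests hypotheses here
        env.act(action)
        cur = env.observe()
        if cur != prev:               # only actions that change game state count
            state_changes += 1
        prev = cur
    return state_changes
```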

The RHAE Scoring Metric

ARC-AGI-3 uses Relative Human Action Efficiency (RHAE) to measure performance, making direct comparison with predecessor benchmarks impossible. The metric counts only actions that change game state; reasoning or internal computation doesn't factor in.

Efficiency is calculated per level using a squared formula: (human actions / AI actions)². A model requiring 100 actions versus a human's 10 receives 1% per level, not 10%. Only the second-best human performer (out of ten first-time players) sets the baseline—the top performer is excluded to filter outliers. Faster-than-human performance caps at 1.0 per level. Later levels receive higher weight due to increased complexity.
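As a rough illustration, the scoring rule described above can be sketched in Python. The exact per-level weighting scheme isn't specified here, so uniform weights are used as a placeholder; `human_runs`, `ai_actions`, and `weights` are hypothetical names for this sketch, not part of any official implementation.

```python
def rhae(human_runs: list[list[int]],
         ai_actions: list[int],
         weights: list[float] | None = None) -> float:
    """Sketch of Relative Human Action Efficiency, returned as a fraction in [0, 1].

    human_runs[i]: action counts of the ten first-time human players on level i.
    ai_actions[i]: state-changing actions the model needed on level i.
    weights: per-level weights (later levels weighted higher); the exact
    scheme is not public here, so uniform weights are the default.
    """
    n = len(ai_actions)
    weights = weights or [1.0] * n
    total = 0.0
    for i in range(n):
        baseline = sorted(human_runs[i])[1]    # second-best human; top performer excluded
        eff = (baseline / ai_actions[i]) ** 2  # squared efficiency ratio
        total += weights[i] * min(eff, 1.0)    # faster-than-human caps at 1.0 per level
    return total / sum(weights)

# Worked example from the text: human baseline of 10 actions, model needs 100
# -> (10 / 100) ** 2 = 0.01, i.e. 1% for that level rather than 10%.
```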

Why Scaffolding Doesn't Transfer

The official leaderboard uses standardized prompting across all models, so it measures general intelligence rather than human engineering effort. A finding from Duke University demonstrates why: Opus 4.6 achieved 97.1% on a known environment with hand-crafted scaffolding but scored 0% on unfamiliar tasks. This bimodal pattern indicates that task-specific harnesses don't transfer to novel problems.

The ARC Prize Foundation maintains a separate community leaderboard for scaffolding-driven results with explicit warnings against interpreting these as AGI progress. However, the foundation expects successful harness techniques to eventually migrate into models themselves—similar to how chain-of-thought prompting evolved from external technique to built-in feature in OpenAI's o1.

Founder François Chollet argues on X that true AGI requires no task-specific human guidance, since untrained humans solve these tasks independently. The distinction matters: general intelligence means facing any new task without preparation, not broad training across varied task categories.

Historical Predictive Power

ARC-AGI-1 and ARC-AGI-2 predicted major AI breakthroughs before other benchmarks detected them. ARC-AGI-1 flagged the reasoning-model breakthrough when OpenAI's o3 posted significant gains at a time when other benchmarks had plateaued. ARC-AGI-2 captured the progress of modern reasoning models and the rise of scaffolding, now deployed in production systems like Claude Code. Both predecessors are now saturated.

ARC-AGI-3 targets the next frontier: agentic intelligence—navigating completely unfamiliar environments without domain-specific training.

Prize and Public Access

The ARC Prize Foundation made 25 environments publicly available and launched the ARC Prize 2026 on Kaggle with $2 million in total prize money for any AI system matching untrained human performance across the full benchmark.

Due to cost constraints, the foundation plans to cap the number of allowed attempts at five times the number of attempts humans needed.

What This Means

Frontier models' sub-1% performance isn't an artifact of missing scaffolding—it reflects fundamental limits in zero-shot environment adaptation. The benchmark cleanly separates genuine generalization capability from task-specific engineering. ARC-AGI-3's predictive track record suggests it will flag the next genuine capability breakthrough when it arrives. The $2M prize remains unclaimed, indicating the gap between current AI and human-level task-agnostic problem-solving remains substantial.
