ARC-AGI-3 Benchmark: Frontier AI Models Score Below 1%, Humans Solve All Tasks
The ARC Prize Foundation has released ARC-AGI-3, a new interactive benchmark that exposes a stark capability gap between frontier AI systems and untrained humans. All 135 benchmark environments were solved by humans with zero prior knowledge and no instructions. Every tested frontier model scored below 1 percent.
Frontier Model Performance
Official leaderboard results from API-based testing without custom scaffolding:
- Gemini 3.1 Pro Preview: 0.37%
- GPT 5.4: 0.26%
- Opus 4.6: 0.25%
- Grok-4.20: 0.00%
Unlike ARC-AGI-1 and ARC-AGI-2, which presented static input-output pattern-matching tasks, ARC-AGI-3 places AI agents in turn-based game environments where they must independently discover objectives, form hypotheses about game mechanics, and execute multi-step plans—exactly as untrained human players do.
The RHAE Scoring Metric
ARC-AGI-3 uses Relative Human Action Efficiency (RHAE) to measure performance, making direct comparison with predecessor benchmarks impossible. The metric counts only actions that change game state; reasoning or internal computation doesn't factor in.
Efficiency is calculated per level using a squared formula: (human actions / AI actions)². A model requiring 100 actions where the human baseline needs 10 receives 1% for that level, not 10%. The baseline is set by the second-best human performer out of ten first-time players; the top performer is excluded to filter outliers. Faster-than-human performance is capped at 1.0 per level, and later levels carry higher weight because they are more complex.
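The scoring rules described above can be sketched in a few lines of Python. This is an illustrative reading of the article's description, not the foundation's reference implementation: the function names are hypothetical, and because the article does not say how later levels are weighted, the linear weights (1, 2, 3, ...) used here are purely an assumption.

```python
from typing import Optional, Sequence


def human_baseline(action_counts: Sequence[int]) -> int:
    """Pick the second-best human run (second-fewest actions).

    The best run is skipped to filter outliers, as the article describes.
    """
    if len(action_counts) < 2:
        raise ValueError("need at least two human runs to pick the second-best")
    return sorted(action_counts)[1]


def level_score(human_actions: int, ai_actions: int) -> float:
    """Squared efficiency ratio for one level, capped at 1.0."""
    if human_actions <= 0 or ai_actions <= 0:
        raise ValueError("action counts must be positive")
    ratio = min(1.0, human_actions / ai_actions)  # faster-than-human caps at 1.0
    return ratio ** 2


def environment_score(
    level_scores: Sequence[float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of per-level scores.

    ASSUMPTION: the linear default weights are a placeholder; the article
    only states that later levels count more, not by how much.
    """
    if not level_scores:
        raise ValueError("need at least one level score")
    if weights is None:
        weights = range(1, len(level_scores) + 1)
    if len(weights) != len(level_scores):
        raise ValueError("weights and level_scores must have the same length")
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, level_scores)) / total_weight


if __name__ == "__main__":
    baseline = human_baseline([8, 10, 12, 15, 15, 18, 20, 22, 25, 30])
    print(baseline)                    # second-fewest action count
    print(level_score(baseline, 100))  # roughly 0.01, i.e. 1% rather than 10%
```

Squaring the ratio penalizes inefficiency steeply: an agent that needs twice the baseline's actions keeps only a quarter of the level's credit.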
Why Scaffolding Doesn't Transfer
The official leaderboard uses standardized prompting across all models so that it measures general intelligence rather than human engineering effort. A finding from Duke University illustrates why: with hand-crafted scaffolding, Opus 4.6 achieved 97.1% on a known environment but scored 0% on unfamiliar tasks. This all-or-nothing pattern indicates that task-specific harnesses do not transfer to novel problems.
The ARC Prize Foundation maintains a separate community leaderboard for scaffolding-driven results with explicit warnings against interpreting these as AGI progress. However, the foundation expects successful harness techniques to eventually migrate into models themselves—similar to how chain-of-thought prompting evolved from external technique to built-in feature in OpenAI's o1.
Founder François Chollet argues on X that true AGI requires no task-specific human guidance, since untrained humans solve these tasks independently. The distinction matters: general intelligence means facing any new task without preparation, not broad training across varied task categories.
Historical Predictive Power
ARC-AGI-1 and ARC-AGI-2 flagged major AI breakthroughs before other benchmarks detected them. ARC-AGI-1 picked up the reasoning-model breakthrough, when OpenAI's o3 achieved significant gains, at a time when other benchmarks had plateaued. ARC-AGI-2 captured the progress of modern reasoning models and the rise of scaffolding, which is now deployed in production systems like Claude Code. Both predecessors are now saturated.
ARC-AGI-3 targets the next frontier: agentic intelligence—navigating completely unfamiliar environments without domain-specific training.
Prize and Public Access
The ARC Prize Foundation made 25 environments publicly available and launched the ARC Prize 2026 on Kaggle with $2 million in total prize money for any AI system matching untrained human performance across the full benchmark.
The foundation plans to limit maximum attempts to five times the human attempt count due to cost constraints.
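To see how such a cap interacts with the squared efficiency formula, here is a small sketch. It assumes that "attempts" refers to counted actions per level and that the cap is applied per level; the article does not spell out either point, and the function names are hypothetical.

```python
def action_budget(human_actions: int, multiplier: int = 5) -> int:
    """Maximum actions allowed for a level, given the human baseline.

    ASSUMPTION: the cap is a per-level action limit of
    `multiplier` times the human baseline count.
    """
    if human_actions <= 0 or multiplier <= 0:
        raise ValueError("human_actions and multiplier must be positive")
    return multiplier * human_actions


def score_at_budget(multiplier: int = 5) -> float:
    """Per-level score of a run that finishes on its very last allowed action."""
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    return (1 / multiplier) ** 2


if __name__ == "__main__":
    print(action_budget(10))   # 50 actions allowed against a 10-action baseline
    print(score_at_budget())   # roughly 0.04
```

Under these assumptions, a run that only just finishes within a fivefold budget would earn about 4% on that level, so the cap would cut off only runs that could contribute very little to the score.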
What This Means
Frontier models' sub-1% performance isn't an artifact of missing scaffolding; it reflects fundamental limits in zero-shot environment adaptation. The benchmark separates genuine generalization capability from task-specific engineering. The predictive track record of ARC-AGI-1 and ARC-AGI-2 suggests ARC-AGI-3 will flag the next genuine capability breakthrough when it arrives. The $2M prize is unclaimed, which indicates that the gap between current AI and human-level task-agnostic problem-solving remains substantial.