
AI agent skills fail in real-world conditions, study of 34,000 skills finds

TL;DR

A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models such as Kimi K2.5 actually performed below their no-skill baselines.



A comprehensive study by UC Santa Barbara, MIT CSAIL, and the MIT-IBM Watson AI Lab reveals that AI agent skills—specialized knowledge modules that systems like Claude Code use to handle domain-specific tasks—deliver far less benefit than benchmarks suggest.

Benchmark Gap Exposed

The core problem: existing skill benchmarks like SKILLSBENCH hand-deliver curated, task-specific skills directly to agents. In one example, agents trying to identify flood days at USGS gauging stations received three skills containing the exact API for downloading water data, the specific flood threshold URL, and ready-made code snippets. As the researchers note, "These skills combined almost directly spell out the exact solution guide for the task."

Real-world conditions work differently. Agents must search through large, noisy skill collections, recognize which ones apply to their task, and adapt general-purpose skills to specific problems without guarantees that suitable skills exist at all.
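The retrieval burden described above can be sketched as a minimal loop: search a large pool, score relevance, and keep only plausible matches. The `Skill` structure, the keyword scorer, and all names below are illustrative stand-ins for the study's actual retrieval stack, not its code.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    description: str
    body: str  # instructions / code snippets the agent may follow

def search_skills(pool: list[Skill], query: str, k: int = 5) -> list[Skill]:
    """Naive keyword overlap standing in for semantic or hybrid retrieval."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(f"{s.name} {s.description}".lower().split())), s)
              for s in pool]
    scored.sort(key=lambda t: t[0], reverse=True)
    # Keep only skills with some lexical overlap with the task description.
    return [s for score, s in scored[:k] if score > 0]

# The agent benefits only if a relevant skill both exists and is retrieved.
pool = [
    Skill("usgs-water-data", "Download streamflow data from USGS gauging stations", "..."),
    Skill("pdf-tables", "Extract tables from PDF reports", "..."),
]
hits = search_skills(pool, "identify flood days at a USGS gauging station")
print([s.name for s in hits])  # → ['usgs-water-data']
```

Even this toy version makes the failure modes visible: an empty result when no suitable skill exists, and noisy matches when the pool is large.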

The 34,000-Skill Test

The team aggregated 34,198 real skills from open-source repositories (skillhub.club and skills.sh) and tested them across six increasingly realistic scenarios:

  1. Curated skills force-loaded
  2. Skills loaded with distractors added
  3. Independent skill search required
  4. Search without curated skills in the pool
  5. Full realistic conditions
  6. No skills baseline

Three models, each paired with its own agent harness, were tested: Claude Opus 4.6 with Claude Code, Kimi K2.5 with Terminus-2, and Qwen3.5-397B-A17B with Qwen Code.

Performance Collapses Under Realism

Results showed consistent degradation:

Claude Opus 4.6:

  • Curated skills: 55.4% pass rate
  • Independent search: 40.1%
  • Realistic scenario (no curated skills in pool): 38.4%
  • No-skill baseline: 35.4%

The advantage compressed from 20 percentage points to 3 percentage points.
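The compression is simple arithmetic over the reported pass rates: subtracting the no-skill baseline from each condition gives the net benefit skills provide. A quick check using the Claude Opus 4.6 figures above:

```python
# Pass rates reported for Claude Opus 4.6 in this article.
claude_pass_rates = {
    "curated skills force-loaded": 0.554,
    "independent skill search": 0.401,
    "realistic (no curated skills in pool)": 0.384,
    "no-skill baseline": 0.354,
}

baseline = claude_pass_rates["no-skill baseline"]
for condition, rate in claude_pass_rates.items():
    advantage_pp = (rate - baseline) * 100  # advantage in percentage points
    print(f"{condition}: {rate:.1%} ({advantage_pp:+.1f} pp)")
```

The curated condition yields a +20.0 pp advantage; the realistic condition only +3.0 pp.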

Weaker models performed worse with skills:

  • Kimi K2.5: 19.8% pass rate with skills vs. 21.8% baseline
  • Qwen3.5-397B: 19.7% with skills vs. 20.5% baseline

Irrelevant skills actively harmed weaker models by consuming tokens and computational resources on loading and following useless instructions.

Three Critical Bottlenecks

Skill selection failure: Even when curated skills were directly available, Claude loaded them only 49% of the time. Adding distractors dropped this to 31%. In the hardest scenario, Claude loaded skills in just 16% of runs.

Weak retrieval: The best retrieval method tested (agentic hybrid search) achieved only 65.5% Recall@5. Simpler semantic search performed 18.7 percentage points worse at Recall@3.
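Recall@k here measures whether the curated (gold) skill for a task appears among the top k retrieved results. A minimal sketch of the metric, assuming one gold skill per task; the retrieval systems themselves (agentic hybrid vs. semantic search) are not reproduced, and the task and skill names are invented:

```python
def recall_at_k(retrieved: dict[str, list[str]],
                gold: dict[str, str], k: int) -> float:
    """Fraction of tasks whose gold skill is ranked within the top k results."""
    hits = sum(1 for task, skill in gold.items()
               if skill in retrieved.get(task, [])[:k])
    return hits / len(gold)

gold = {"flood-days": "usgs-water-data", "invoice-ocr": "pdf-tables"}
retrieved = {
    "flood-days": ["usgs-water-data", "geo-maps", "csv-utils"],
    "invoice-ocr": ["image-crop", "xlsx-export", "ocr-basics"],
}
print(recall_at_k(retrieved, gold, k=3))  # → 0.5
```

At 65.5% Recall@5, roughly one in three tasks never sees its best-matching skill at all, capping everything downstream.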

Poor adaptation: Agents cannot effectively adapt general-purpose skills to specific tasks, which caps the usefulness of skills whenever no exactly matching skill exists in the pool.

Refinement as Multiplier, Not Solution

Task-specific refinement—where agents explore tasks, evaluate retrieved skills, and build new ones—showed promise. Claude improved from 40.1% to 48.2% on SKILLSBENCH through refinement.

However, researchers found refinement works primarily as a multiplier of existing skill quality rather than a source of new knowledge. It only helps when initially retrieved skills contain relevant information.
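The multiplier finding can be made concrete: refinement retrieves candidates, filters for relevance, then rewrites the best candidate for the task at hand. If nothing relevant is retrieved, there is nothing to amplify. All helper names below are hypothetical, not the study's implementation:

```python
def refine_skills(task, retrieve, assess_relevance, rewrite_for_task):
    """Task-specific refinement: amplify retrieved skills, never invent knowledge."""
    candidates = retrieve(task)
    relevant = [s for s in candidates if assess_relevance(task, s)]
    if not relevant:
        # Nothing useful retrieved: refinement has no signal to multiply.
        return None
    # Adapt the most relevant general-purpose skill into a task-specific one.
    return rewrite_for_task(task, relevant[0])

refined = refine_skills(
    "flood-days",
    retrieve=lambda t: ["usgs-water-data"],
    assess_relevance=lambda t, s: "usgs" in s,
    rewrite_for_task=lambda t, s: f"{s} (refined for {t})",
)
print(refined)  # → usgs-water-data (refined for flood-days)
```

The early `None` return is the whole point: the loop improves on what retrieval surfaces but cannot conjure a missing skill.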

Task-independent offline refinement showed inconsistent results.

Prior Evidence

A 2024 Vercel study flagged the same core problem: agents failed to retrieve available skills in 56% of test cases, no better than the no-documentation baseline. A passively loaded Markdown file achieved a 100% pass rate versus 79% for the skill system.

What This Means

The skill abstraction, introduced by Anthropic in October 2025 and adopted by multiple platforms, appears fundamentally mismatched to how agents actually operate. The technology works in controlled settings where the right tool is handed directly to the system. In production environments requiring independent skill discovery and application, benefits shrink to marginal—or become liabilities for weaker models.

The research suggests three necessary improvements: better retrieval methods, more effective offline skill refinement, and skill ecosystems designed with varying model capabilities in mind. The underlying finding is sobering: AI agents struggle with the meta-task of skill selection as much as the domain tasks themselves.

