AI agent skills fail in real-world conditions, researchers find after testing 34,000 skills

TL;DR

A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models like Kimi K2.5 actually perform below their no-skill baseline.

A comprehensive study by researchers at UC Santa Barbara, MIT CSAIL, and the MIT-IBM Watson AI Lab finds that AI agent skills (specialized knowledge modules that systems like Claude Code use to handle domain-specific tasks) deliver far less benefit than benchmarks suggest.

Benchmark Gap Exposed

The core problem: existing skill benchmarks like SKILLSBENCH hand-deliver curated, task-specific skills directly to agents. In one example, agents trying to identify flood days at USGS gauging stations received three skills containing the exact API for downloading water data, the specific flood threshold URL, and ready-made code snippets. As the researchers note, "These skills combined almost directly spell out the exact solution guide for the task."

Real-world conditions work differently. Agents must search through large, noisy skill collections, recognize which ones apply to their task, and adapt general-purpose skills to specific problems without guarantees that suitable skills exist at all.
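To make the search problem concrete, here is a minimal Python sketch of retrieving skills from a mixed pool. It is not the study's retrieval method: it ranks skills by simple word overlap (Jaccard similarity) between the task and each skill description, and every skill name, description, and threshold in it is a hypothetical example.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_skills(task, skills, top_k=3, min_score=0.0):
    """Return up to top_k skill names ranked by Jaccard word overlap.

    Skills scoring at or below min_score are dropped, so the result
    may be empty when nothing in the pool fits the task.
    """
    task_tokens = tokens(task)
    scored = []
    for name, description in skills.items():
        desc_tokens = tokens(description)
        union = task_tokens | desc_tokens
        score = len(task_tokens & desc_tokens) / len(union) if union else 0.0
        if score > min_score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical skill pool: one relevant skill among distractors.
skills = {
    "water-data-download": "Download river gauge water level data from a hydrology API",
    "pdf-report": "Generate PDF reports from markdown templates",
    "git-cleanup": "Clean up stale git branches",
}

task = "Find flood days from river gauge water level data"
print(rank_skills(task, skills))
print(rank_skills("Translate a poem into French", skills, min_score=0.1))
```

With the default threshold, the weakly related `pdf-report` skill still surfaces behind the relevant one because it shares a single common word with the task. The second call returns an empty list, the case where no suitable skill exists in the pool.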

The 34,000-Skill Test

The team aggregated 34,198 real skills from open-source repositories (skillhub.club and skills.sh) and evaluated agents in six settings, ordered from most curated to most realistic and ending with a no-skill baseline:

  1. Curated skills force-loaded
  2. Skills loaded with distractors added
  3. Independent skill search required
  4. Search without curated skills in the pool
  5. Full realistic conditions
  6. No skills baseline

Three models were tested, each with its own agent harness: Claude Opus 4.6 with Claude Code, Kimi K2.5 with Terminus-2, and Qwen3.5-397B-A17B with Qwen Code.

Performance Collapses Under Realism

Results showed consistent degradation:

Claude Opus 4.6:

  • Curated skills: 55.4% pass rate
  • Independent search: 40.1%
  • Realistic scenario (no curated skills in pool): 38.4%
  • No-skill baseline: 35.4%

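Claude's advantage over its no-skill baseline shrank from 20.0 percentage points (55.4% vs. 35.4%) with curated skills to 3.0 percentage points (38.4% vs. 35.4%) in the realistic scenario.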

Weaker models performed worse with skills than without them:

  • Kimi K2.5: 19.8% pass rate with skills vs. 21.8% baseline
  • Qwen3.5-397B: 19.7% with skills vs. 20.5% baseline

Irrelevant skills actively harmed weaker models, which spent tokens and compute loading and following instructions that did not help with the task.
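The gaps can be checked directly from the figures quoted above. A minimal Python sketch, using only the pass rates reported in this article; the dictionary labels and the `gain` helper are illustrative names, not the study's:

```python
# Pass rates in percent, as reported in the article.
claude = {
    "curated skills": 55.4,
    "independent search": 40.1,
    "realistic (no curated skills in pool)": 38.4,
}
claude_baseline = 35.4

# (pass rate with skills, no-skill baseline) for the weaker models.
weaker = {
    "Kimi K2.5": (19.8, 21.8),
    "Qwen3.5-397B": (19.7, 20.5),
}


def gain(with_skills, baseline):
    """Percentage-point gain over the no-skill baseline, rounded to 0.1."""
    return round(with_skills - baseline, 1)


claude_gains = {name: gain(rate, claude_baseline) for name, rate in claude.items()}
weaker_gains = {model: gain(s, b) for model, (s, b) in weaker.items()}

for label, value in {**claude_gains, **weaker_gains}.items():
    print(f"{label}: {value:+.1f} pp")
```

For Claude Opus 4.6 the gain over baseline is 20.0, 4.7, and 3.0 points across the three conditions. For Kimi K2.5 and Qwen3.5-397B it is negative, at -2.0 and -0.8 points.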

Three Critical Bottlenecks

Skill selection failure: Even when curated skills were directly available, Claude loaded them only 49% of the time. Adding distractors dropped this to 31%. In the hardest scenario, Claude loaded skills in just 16% of runs.

Weak retrieval: The best retrieval method tested (agentic hybrid search) achieved only 65.5% Recall@5. Simpler semantic search performed 18.7 percentage points worse at Recall@3.
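The article does not spell out how Recall@k was computed. Assuming the standard information-retrieval definition (the share of relevant skills that appear among the top k retrieved results, averaged over tasks), a minimal Python sketch looks like this; the skill IDs and the two example runs are hypothetical:

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant skills found in the top-k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one per task."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Hypothetical retrieval results for two tasks.
runs = [
    # Both relevant skills are retrieved, at ranks 1 and 3.
    (["usgs-api", "pdf-tools", "flood-thresholds", "git-helper", "csv-utils"],
     {"usgs-api", "flood-thresholds"}),
    # The single relevant skill is never retrieved.
    (["git-helper", "pdf-tools", "csv-utils", "usgs-api", "docker"],
     {"flood-thresholds"}),
]

for k in (1, 3, 5):
    print(f"Recall@{k}: {mean_recall_at_k(runs, k):.2f}")
```

Under this definition Recall@k can only stay flat or rise as k grows, so a Recall@5 figure and a Recall@3 figure are not directly comparable.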

Poor adaptation: When no tailored skill exists, agents cannot effectively adapt general-purpose skills to the specific task, which limits the value of skills whenever an exact match is missing.

Refinement as Multiplier, Not Solution

Task-specific refinement—where agents explore tasks, evaluate retrieved skills, and build new ones—showed promise. Claude improved from 40.1% to 48.2% on SKILLSBENCH through refinement.

However, researchers found refinement works primarily as a multiplier of existing skill quality rather than a source of new knowledge. It only helps when initially retrieved skills contain relevant information.

Task-independent offline refinement showed inconsistent results.

Prior Evidence

An earlier Vercel study flagged the same core problem: agents failed to retrieve available skills in 56% of test cases, leaving results level with the no-documentation baseline. A passively loaded Markdown file achieved a 100% pass rate versus 79% for the skill system.

What This Means

The skill abstraction, introduced by Anthropic in October 2025 and adopted by multiple platforms, appears fundamentally mismatched to how agents actually operate. The technology works in controlled settings where the right skill is handed directly to the system. In production environments that require agents to discover and apply skills on their own, the benefits shrink to a marginal gain, or turn into a liability for weaker models.

The research points to three necessary improvements: better retrieval methods, more effective offline skill refinement, and skill ecosystems designed with varying model capabilities in mind. The underlying finding is sobering: AI agents struggle with the meta-task of skill selection as much as with the domain tasks themselves.
