AI agent skills fail in real-world conditions, researchers find after testing 34,000 skills

TL;DR

A large-scale study testing 34,198 real-world skills reveals that AI agent performance drops drastically when moving from curated benchmarks to realistic conditions. Claude Opus 4.6 saw pass rates fall from 55.4% with hand-selected skills to 38.4% in truly realistic scenarios, while weaker models like Kimi K2.5 actually perform below their no-skill baseline.

April 12, 2026 · 10:50 AM · 3 min read

AI Agent Skills Fail Under Realistic Conditions, Researchers Find

A comprehensive study by UC Santa Barbara, MIT CSAIL, and the MIT-IBM Watson AI Lab reveals that AI agent skills—specialized knowledge modules that systems like Claude Code use to handle domain-specific tasks—deliver far less benefit than benchmarks suggest.

Benchmark Gap Exposed

The core problem: existing skill benchmarks like SKILLSBENCH hand-deliver curated, task-specific skills directly to agents. In one example, agents trying to identify flood days at USGS gauging stations received three skills containing the exact API for downloading water data, the specific flood threshold URL, and ready-made code snippets. As the researchers note, "These skills combined almost directly spell out the exact solution guide for the task."

Real-world conditions work differently. Agents must search through large, noisy skill collections, recognize which skills apply to their task, and adapt general-purpose skills to specific problems, with no guarantee that a suitable skill exists at all.

The 34,000-Skill Test

The team aggregated 34,198 real skills from open-source repositories (skillhub.club and skills.sh) and tested them across six increasingly realistic scenarios:

Curated skills force-loaded
Skills loaded with distractors added
Independent skill search required
Search without curated skills in the pool
Full realistic conditions
No skills baseline

Three models were tested: Claude Opus 4.6 with Claude Code, Kimi K2.5 with Terminus-2, and Qwen3.5-397B-A17B with Qwen Code.

Performance Collapses Under Realism

Results showed consistent degradation:

Claude Opus 4.6:

Curated skills: 55.4% pass rate
Independent search: 40.1%
Realistic scenario (no curated skills in pool): 38.4%
No-skill baseline: 35.4%

The advantage over the no-skill baseline shrank from 20 percentage points with curated skills to 3 percentage points in the realistic scenario.
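The arithmetic behind those gaps can be checked with a short sketch. It uses only the Claude Opus 4.6 pass rates reported above; the names `PASS_RATES`, `BASELINE`, and `uplift_over_baseline` are illustrative and do not come from the study.

```python
# Claude Opus 4.6 pass rates (percent) as reported in this article.
PASS_RATES = {
    "curated skills": 55.4,
    "independent search": 40.1,
    "realistic (no curated skills in pool)": 38.4,
}
BASELINE = 35.4  # no-skill baseline, in percent


def uplift_over_baseline(pass_rate: float, baseline: float) -> float:
    """Return the gain over the baseline in percentage points (one decimal)."""
    return round(pass_rate - baseline, 1)


# Gain of each condition over running with no skills at all.
UPLIFTS = {
    name: uplift_over_baseline(rate, BASELINE)
    for name, rate in PASS_RATES.items()
}

for name, gain in UPLIFTS.items():
    print(f"{name}: +{gain} points over baseline")
```

Read this way, independent search already gives up most of the benefit that curated skills provide, and the realistic scenario keeps only a small remainder.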

Weaker models performed worse with skills:

Kimi K2.5: 19.8% pass rate with skills vs. 21.8% baseline
Qwen3.5-397B: 19.7% with skills vs. 20.5% baseline

Irrelevant skills actively harmed weaker models by consuming tokens and computational resources on loading and following useless instructions.

Three Critical Bottlenecks

Skill selection failure: Even when curated skills were directly available, Claude loaded them only 49% of the time. Adding distractors dropped this to 31%. In the hardest scenario, Claude loaded skills in just 16% of runs.

Weak retrieval: The best retrieval method tested (agentic hybrid search) achieved only 65.5% Recall@5. Simpler semantic search performed 18.7 percentage points worse at Recall@3.

Poor adaptation: When no tailored skill exists, agents cannot effectively adapt general-purpose skills to the specific task, which limits the value of skills whenever an exact match is missing.
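For readers unfamiliar with the retrieval metric cited above, Recall@k is commonly defined as the share of relevant items that appear among the top k retrieved results. The sketch below assumes that standard definition; the study's exact scoring may differ, and the skill names in the example are hypothetical.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of relevant skills that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result for a flood-detection task: two skills are
# actually relevant, and the retriever ranks one of them fifth.
ranked = ["pdf-forms", "usgs-water-data", "git-helper", "csv-cleaner", "flood-thresholds"]
relevant = {"usgs-water-data", "flood-thresholds"}

print(recall_at_k(ranked, relevant, k=3))
print(recall_at_k(ranked, relevant, k=5))
```

Because the metric depends on k, a Recall@5 figure and a Recall@3 figure are not directly comparable: widening the cutoff can only keep recall the same or raise it.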

Refinement as Multiplier, Not Solution

Task-specific refinement—where agents explore tasks, evaluate retrieved skills, and build new ones—showed promise. Claude improved from 40.1% to 48.2% on SKILLSBENCH through refinement.

However, researchers found refinement works primarily as a multiplier of existing skill quality rather than a source of new knowledge. It only helps when initially retrieved skills contain relevant information.

Task-independent offline refinement showed inconsistent results.

Prior Evidence

An earlier Vercel study flagged the same core problem: agents failed to retrieve available skills in 56% of test cases, and their results matched the no-documentation baseline. A passively loaded Markdown file achieved a 100% pass rate versus 79% for the skill system.

What This Means

The skill abstraction, introduced by Anthropic in October 2025 and adopted by multiple platforms, appears fundamentally mismatched to how agents actually operate. The technology works in controlled settings where the right tool is handed directly to the system. In production environments requiring independent skill discovery and application, benefits shrink to marginal—or become liabilities for weaker models.

The research suggests three necessary improvements: better retrieval methods, more effective offline skill refinement, and skill ecosystems designed with varying model capabilities in mind. The underlying finding is sobering: AI agents struggle with the meta-task of skill selection as much as the domain tasks themselves.

Source: the-decoder.com

