IBM Research launches Open Agent Leaderboard, showing same models achieve different results based on agent architecture
IBM Research has released the Open Agent Leaderboard, the first open benchmark designed to evaluate complete AI agent systems rather than just the models that power them. The leaderboard reports both quality and cost metrics across six diverse benchmarks. It shows that agents built on identical models can achieve significantly different success rates and costs depending on system architecture, and that failed runs cost 20-54% more than successful ones.
Testing generality, not specialization
The leaderboard evaluates agents on six established benchmarks spanning different domains: SWE-Bench Verified (code bug fixes), BrowseComp+ (web research), AppWorld (personal task completion), and three tau2-Bench variants (customer service and technical support). According to IBM Research, agents are tested as general-purpose systems without benchmark-specific tuning.
The leaderboard shows that the top three configurations all use the same underlying model but achieve different success rates and costs due to variations in agent architecture. "Same model, different agents, different results — the agent matters," the researchers write.
Failed runs cost 20-54% more
One of the most significant findings concerns failure behavior. IBM Research reports that failed agent runs cost 20-54% more than successful ones, with some agents failing fast and cheaply while others burn through expensive runs before terminating. This cost differential matters for production deployments where failure patterns directly impact operational expenses.
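To make the cost comparison concrete, the sketch below computes how much more an average failed run costs than an average successful one. The article does not say how IBM Research aggregates costs, so the ratio of mean costs used here is an assumption, and the `Run` type and the dollar figures are invented for illustration rather than taken from the leaderboard.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    """One agent run (hypothetical record, not the leaderboard's schema)."""
    success: bool
    cost_usd: float


def failure_cost_premium(runs):
    """Return how much more a failed run costs on average, as a fraction.

    Assumed definition: mean(failed cost) / mean(successful cost) - 1.
    """
    ok = [r.cost_usd for r in runs if r.success]
    failed = [r.cost_usd for r in runs if not r.success]
    if not ok or not failed:
        raise ValueError("need at least one successful and one failed run")
    return mean(failed) / mean(ok) - 1.0


# Made-up costs, chosen only to illustrate the arithmetic.
runs = [
    Run(success=True, cost_usd=1.00),
    Run(success=True, cost_usd=1.20),
    Run(success=False, cost_usd=1.50),
    Run(success=False, cost_usd=1.36),
]
premium = failure_cost_premium(runs)
print(f"Failed runs cost {premium:.0%} more than successful ones")
```

With these invented numbers the premium falls inside the 20-54% range the article reports. An agent that fails fast would push the figure down, while one that exhausts its budget before giving up would push it up.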
Tool shortlisting improves all models
The research identifies specific architectural components that improve performance. Tool shortlisting, which helps agents focus on relevant tools rather than searching through all available options, improved performance across every model tested. The technique "turned otherwise failing configurations into viable ones," according to the paper.
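The article does not describe how the shortlisting is implemented, so the following is only a minimal sketch of the general idea: score each tool by how many task keywords appear in its name and description, then expose only the top few to the agent. The function names, the stopword list, and the example tools are all hypothetical. A real system might rank tools with embeddings or a model call instead.

```python
import re

# Small illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"a", "an", "the", "to", "from", "for", "of", "and"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def shortlist_tools(task, tools, k=2):
    """Return the names of the k tools whose text overlaps most with the task.

    `tools` maps a tool name to its description. Ties break alphabetically.
    """
    task_tokens = tokenize(task)

    def sort_key(item):
        name, description = item
        overlap = len(task_tokens & tokenize(name + " " + description))
        return (-overlap, name)

    return [name for name, _ in sorted(tools.items(), key=sort_key)[:k]]


# Hypothetical tool catalog for illustration.
tools = {
    "send_email": "Send an email message to a recipient",
    "search_flights": "Search for a flight between two cities",
    "book_flight": "Book a flight for a passenger",
    "play_music": "Play a song from the music library",
}
shortlist = shortlist_tools("Book a flight from Boston to Denver", tools, k=2)
print(shortlist)
```

Here the agent would see only the two flight tools instead of all four. The point of the technique is to shrink the action space the model has to reason over.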
General agents match specialized systems
Contrary to expectations, IBM Research found that general-purpose agents are already competitive with specialized ones. "Across most benchmarks, general agents match or even outperform the best specialized systems," the researchers report. This suggests that single agents can increasingly handle diverse tasks without job-specific customization.
Unified protocol enables comparison
The technical foundation is Exgentic, an open evaluation framework that implements a unified protocol. The protocol standardizes how different agent systems interact with benchmarks by providing a consistent structure: a task description, context information, and available actions. This standardization allows fair comparison across agent architectures while preserving each benchmark's original design.
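As a rough illustration of that three-part structure, the sketch below models a task description, context information, and available actions as plain data, with a small interface that any agent could implement. None of these names come from Exgentic. `TaskInput`, `Action`, `Agent`, and `run_once` are hypothetical and stand in for whatever the real framework defines.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol, Tuple


@dataclass(frozen=True)
class Action:
    """One action the benchmark allows (hypothetical type)."""
    name: str
    description: str


@dataclass(frozen=True)
class TaskInput:
    """The three parts the article lists: task, context, available actions."""
    task: str
    context: Mapping[str, str]
    actions: Tuple[Action, ...]


class Agent(Protocol):
    """Any agent architecture only has to map a TaskInput to an action name."""
    def choose(self, task_input: TaskInput) -> str: ...


class FirstActionAgent:
    """Trivial baseline agent that always picks the first available action."""
    def choose(self, task_input: TaskInput) -> str:
        return task_input.actions[0].name


def run_once(agent: Agent, task_input: TaskInput) -> str:
    """Ask the agent for one action and check it is one the benchmark offers."""
    choice = agent.choose(task_input)
    if choice not in {action.name for action in task_input.actions}:
        raise ValueError(f"agent chose unknown action: {choice!r}")
    return choice


# Invented customer-service example, for illustration only.
example = TaskInput(
    task="Find the status of the customer's most recent order",
    context={"customer_id": "c-123"},
    actions=(
        Action("lookup_order", "Fetch an order by customer id"),
        Action("issue_refund", "Refund an order"),
    ),
)
chosen = run_once(FirstActionAgent(), example)
print(chosen)
```

Because every benchmark is presented through the same shape, different agent architectures can be swapped in behind the `Agent` interface without touching the benchmark side, which is the property that makes results comparable.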
What this means
By evaluating complete agent systems rather than isolated models, the leaderboard makes visible what drives real-world performance: planning strategies, memory management, tool selection, and error recovery. The finding that failed runs cost significantly more than successful ones highlights a critical operational consideration typically absent from model benchmarks.

Most importantly, the open release of methodology, framework, and results enables the research community to reproduce evaluations and submit new agent configurations. This transparency is essential for understanding which architectural choices generalize across tasks and which improvements come from the model versus the agent wrapper. For teams deploying AI agents, the leaderboard provides the first standardized way to compare full system costs and capabilities rather than relying solely on model performance claims.