IBM Research launches Open Agent Leaderboard, showing same models achieve different results based on agent architecture
IBM Research has launched the Open Agent Leaderboard, the first open benchmark that evaluates complete AI agent systems rather than just underlying models. The leaderboard reveals that agents using identical models can achieve significantly different success rates and costs depending on system architecture, with failed runs costing 20-54% more than successful ones.
IBM Research has released the Open Agent Leaderboard, the first open benchmark designed to evaluate full AI agent systems rather than just the models that power them. The leaderboard reports both quality and cost metrics across six diverse benchmarks, revealing that agent architecture significantly impacts performance even when using identical models.
Testing generality, not specialization
The leaderboard evaluates agents across six established benchmarks spanning different domains: SWE-Bench Verified (code bug fixes), BrowseComp+ (web research), AppWorld (personal task completion), and three tau2-Bench variants (customer service and technical support). According to IBM Research, agents are tested as general-purpose systems without benchmark-specific tuning.
The leaderboard shows that the top three configurations all use the same underlying model but achieve different success rates and costs due to variations in agent architecture. "Same model, different agents, different results — the agent matters," the researchers write.
Failed runs cost 20-54% more
One of the most significant findings concerns failure behavior. IBM Research reports that failed agent runs cost 20-54% more than successful ones, with some agents failing fast and cheaply while others burn through expensive runs before terminating. This cost differential matters for production deployments where failure patterns directly impact operational expenses.
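To make the cost comparison concrete, the following is a minimal sketch of how a failure cost premium could be computed from per-run logs. The article does not say how IBM Research derived the 20-54% range, so the formula (ratio of mean costs), the function name, the record fields, and the dollar figures are all illustrative assumptions, not leaderboard data.

```python
from statistics import mean


def failure_cost_premium(runs):
    """Return the fraction by which failed runs cost more, on average, than successful ones.

    Assumption: each run is a dict with a boolean "success" and a numeric "cost_usd".
    """
    ok = [r["cost_usd"] for r in runs if r["success"]]
    bad = [r["cost_usd"] for r in runs if not r["success"]]
    if not ok or not bad:
        raise ValueError("need at least one successful and one failed run")
    return mean(bad) / mean(ok) - 1.0


# Made-up costs for illustration only.
example_runs = [
    {"success": True, "cost_usd": 1.00},
    {"success": True, "cost_usd": 1.00},
    {"success": False, "cost_usd": 1.25},
    {"success": False, "cost_usd": 1.75},
]

premium = failure_cost_premium(example_runs)
print(f"failed runs cost {premium:.0%} more on average")
```

With these invented numbers the failed runs average $1.50 against $1.00 for successful ones, a 50% premium. An agent that fails fast would pull that figure down, while one that keeps retrying before giving up would push it up.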
Tool shortlisting improves all models
The research identifies specific architectural components that improve performance. Tool shortlisting, which helps agents focus on relevant tools rather than searching through all available options, improved performance across every model tested. The technique "turned otherwise failing configurations into viable ones," according to the paper.
General agents match specialized systems
Contrary to expectations, IBM Research found that general-purpose agents are already competitive with specialized ones. "Across most benchmarks, general agents match or even outperform the best specialized systems," the researchers report. This suggests that a single agent can increasingly handle diverse tasks without task-specific customization.
Unified protocol enables comparison
The technical foundation is Exgentic, an open evaluation framework that implements a unified protocol. The protocol standardizes how different agent systems interact with benchmarks by providing a consistent structure: a task description, context information, and available actions. This standardization allows fair comparison across agent architectures while preserving each benchmark's original design.
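As a rough illustration of that three-part structure, the sketch below models a task description, context, and available actions as plain Python data classes. It is not Exgentic's actual API: every class, field, and function name here is a hypothetical stand-in chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One action the benchmark exposes to the agent (hypothetical shape)."""
    name: str
    description: str
    run: Callable[..., str]


@dataclass
class Observation:
    """What an agent receives: a task, some context, and the available actions."""
    task: str
    context: dict = field(default_factory=dict)
    actions: list = field(default_factory=list)


def echo_agent(obs):
    """Trivial agent: call the first available action with the task text."""
    action = obs.actions[0]
    return action.run(obs.task)


obs = Observation(
    task="Look up order 1234",
    context={"user": "demo"},
    actions=[Action("lookup", "look up an order", lambda t: f"looked up: {t}")],
)

result = echo_agent(obs)
print(result)
```

Because every benchmark would hand the agent the same kind of object, a different agent function could be swapped in for `echo_agent` without changing the benchmark side, which is the property that makes cross-agent comparison possible.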
What this means
By evaluating complete agent systems rather than isolated models, the leaderboard makes visible what drives real-world performance: planning strategies, memory management, tool selection, and error recovery. The finding that failed runs cost significantly more than successful ones highlights a critical operational consideration typically absent from model benchmarks. Most importantly, the open release of methodology, framework, and results enables the research community to reproduce evaluations and submit new agent configurations. This transparency is essential for understanding which architectural choices generalize across tasks and which improvements come from the model versus the agent wrapper. For teams deploying AI agents, the leaderboard provides the first standardized way to compare full system costs and capabilities rather than relying solely on model performance claims.