
FinRetrieval benchmark reveals Claude Opus achieves 90.8% accuracy on financial data retrieval with APIs

Researchers introduced FinRetrieval, a 500-question benchmark evaluating AI agents' ability to retrieve specific financial data from structured databases. Testing 14 configurations across Anthropic, OpenAI, and Google, the benchmark shows Claude Opus achieving 90.8% accuracy with structured data APIs but only 19.8% with web search alone: a 71 percentage point gap, 3-4x larger than the performance variance between competing providers.


FinRetrieval Benchmark Reveals Critical Tool Dependency in Financial AI Agents

Researchers have released FinRetrieval, a benchmark designed to evaluate how effectively AI agents retrieve specific numeric values from financial databases—a critical capability for financial research and analysis applications.

Benchmark Scope and Methodology

The benchmark consists of 500 financial retrieval questions with verified ground truth answers. Researchers tested 14 different agent configurations across three frontier AI providers: Anthropic, OpenAI, and Google. The evaluation includes complete tool call execution traces, providing transparent visibility into agent decision-making processes.
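The released evaluation code is not described in detail, but a benchmark of this shape typically scores exact-match retrieval of numeric values against verified ground truth. The sketch below is an assumption about how such a harness might look; the dataset fields (`"question"`, `"answer"`) and the agent-as-callable interface are hypothetical, not the benchmark's actual API.

```python
def values_match(predicted: str, truth: str, rel_tol: float = 1e-3) -> bool:
    """Compare numeric answers with a small relative tolerance;
    fall back to exact string match for non-numeric answers."""
    try:
        p = float(predicted.replace(",", ""))
        t = float(truth.replace(",", ""))
        return abs(p - t) <= rel_tol * max(abs(t), 1.0)
    except ValueError:
        return predicted.strip() == truth.strip()

def score(agent, questions: list[dict]) -> float:
    """Fraction of questions where the agent's answer matches ground truth."""
    correct = sum(
        values_match(agent(q["question"]), q["answer"]) for q in questions
    )
    return correct / len(questions)
```

Under this framing, each of the 14 agent configurations would be passed as a different `agent` callable, with its tool calls logged separately to produce the execution traces.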

Key Performance Findings

Tool availability emerges as the dominant performance differentiator. Anthropic's Claude Opus achieved 90.8% accuracy when equipped with structured data APIs but only 19.8% when relying on web search alone, a 71 percentage point gap. That disparity is 3-4x larger than the performance variance between competing providers.

Reasoning mode benefited providers unevenly: it added 9.0 percentage points for OpenAI's models but only 2.8 for Claude. Researchers attribute the difference not to gaps in reasoning ability but to how the models use tools in base mode: stronger foundational tool-calling capability leaves less room for extended reasoning to add accuracy.

Geographic and Domain-Specific Insights

The benchmark identified a 5.6 percentage point accuracy advantage for US-based financial data over international data. Rather than reflecting model limitations in understanding different markets, researchers determined this gap stems from inconsistent fiscal year naming conventions across regions—a data preparation issue rather than a capability gap.
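To illustrate the kind of inconsistency involved (the label formats below are invented examples, not the benchmark's actual data), a data-preparation step might normalize region-specific fiscal-year labels to a single key, such as the calendar year in which the fiscal year ends:

```python
import re

# Hypothetical regional conventions: US filings often label a fiscal
# year with one year ("FY2023"), while some international filings use
# a split label ("FY2023/24") for a year ending in the later calendar year.
def fiscal_year_key(label: str) -> int:
    """Normalize a fiscal-year label to the calendar year it ends in."""
    m = re.fullmatch(r"FY\s*(\d{4})(?:/(\d{2}))?", label.strip())
    if not m:
        raise ValueError(f"unrecognized fiscal-year label: {label!r}")
    start = int(m.group(1))
    if m.group(2):  # split label like "FY2023/24" ends in 2024
        return start // 100 * 100 + int(m.group(2))
    return start
```

Without a normalization of this sort, an agent querying "FY2023" against a database keyed by "FY2023/24" retrieves nothing or the wrong period, which would depress international accuracy for reasons unrelated to model capability.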

Dataset and Research Contribution

Researchers are releasing three components to the research community: the complete 500-question dataset, evaluation code for reproducible benchmarking, and full tool execution traces documenting agent decision-making. This transparency enables further research on how to improve financial AI systems and understand agent behavior in structured data retrieval tasks.

What This Means

FinRetrieval provides the first systematic evaluation of a significant blind spot in AI agent capabilities: financial data retrieval. The stark performance cliff between structured APIs and web search indicates that current AI agents remain heavily dependent on tool design rather than robust understanding of financial information retrieval. For financial institutions deploying AI agents, tool architecture decisions—not just model selection—will drive real-world performance. The reasoning mode findings suggest that for capability-constrained models, extended reasoning provides meaningful gains, but frontier models may need other architectural improvements to benefit from computational scaling on this task.