LLM News

Every LLM release, update, and milestone.

research

EvoTool optimizes LLM agent tool-use policies via evolutionary algorithms without gradients

Researchers propose EvoTool, a gradient-free evolutionary framework that optimizes tool-use policies in LLM agents by decomposing them into four modules and iteratively improving each through blame attribution and targeted mutation. The approach outperforms GPT-4.1 and Qwen3-8B baselines by over 5 percentage points across four benchmarks.

March 6, 2026 · 6:07 AM · 2 min read

llm-agents tool-use policy-optimization

via arxiv.org ↗

benchmark · Anthropic

FinRetrieval benchmark reveals Claude Opus achieves 90.8% accuracy on financial data retrieval with APIs

Researchers introduced FinRetrieval, a 500-question benchmark that evaluates AI agents' ability to retrieve specific financial data from structured databases. Across 14 configurations from Anthropic, OpenAI, and Google, Claude Opus achieves 90.8% accuracy with structured data APIs but only 19.8% with web search, a gap of 71 percentage points that is 3-4x larger than the gap seen for competitors.

March 6, 2026 · 5:54 AM · 2 min read

benchmark financial-ai agent-evaluation

via arxiv.org ↗

product update · OpenAI

OpenAI Python SDK v2.25.0 adds GPT-5.4 support with new tool search and computer control features

OpenAI has released version 2.25.0 of its Python SDK, adding support for GPT-5.4 and introducing a new tool search feature alongside a computer control tool for agent-based automation. The update, released March 5, 2026, also includes API schema refinements and parameter changes to the prompt cache and message handling.

March 5, 2026 · 6:50 PM · 2 min read

openai python-sdk gpt-5-4

via github.com ↗

benchmark

WebDS benchmark reveals 75-point performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark, containing 870 tasks across 29 websites that require agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents succeed on only 15% of WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

March 5, 2026 · 5:08 AM · 2 min read

benchmark data-science web-agents

via arxiv.org ↗