LLM News

Every LLM release, update, and milestone.

benchmark · Anthropic

FinRetrieval benchmark reveals Claude Opus achieves 90.8% accuracy on financial data retrieval with APIs

Researchers introduced FinRetrieval, a 500-question benchmark that evaluates how well AI agents retrieve specific financial data from structured databases. Across 14 configurations spanning Anthropic, OpenAI, and Google models, Claude Opus reaches 90.8% accuracy with structured data APIs but only 19.8% with web search, a 71 percentage point gap that is 3-4x larger than the gap seen for competitors.

product update · OpenAI

OpenAI Python SDK v2.25.0 adds GPT-5.4 support with new tool search and computer control features

OpenAI has released version 2.25.0 of its Python SDK, adding support for GPT-5.4 and introducing a new tool search feature alongside a computer control tool for agent-based automation. The update, released March 5, 2026, also includes API schema refinements and parameter changes to the prompt cache and message handling.

2 min read · via github.com
model release · OpenAI

OpenAI launches GPT-5.4 with native computer use capabilities for autonomous agents

OpenAI has launched GPT-5.4, its latest model with native computer use capabilities that allow it to operate computers and complete tasks across applications. The release represents a step toward autonomous AI agents that can handle complex jobs independently. The model includes advancements in reasoning, coding, and professional work with spreadsheets, documents, and presentations.

1 min read · via theverge.com
benchmark · OpenAI

OpenAI says SWE-bench Verified is broken: most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming that most of its tasks are flawed enough to reject correct solutions. The company also argues that leading AI models have likely seen the answers during training, so benchmark scores measure memorization rather than genuine coding ability.

2 min read · via the-decoder.com