LLM News

Every LLM release, update, and milestone.

Filtered by: multi-step-reasoning
benchmark

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark, comprising 870 tasks across 29 websites that require agents to acquire, clean, and analyze multimodal data from the web. Current state-of-the-art LLM agents reach only 15-20% success on these multi-step acquisition and analysis tasks, despite scoring around 80% on simpler web benchmarks, while humans achieve roughly 90% accuracy, revealing significant gaps in agent capabilities.

2 min read · via arxiv.org
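
The summary above describes tasks where an agent must pull data from live websites and run multi-step analysis, scored by end-to-end success rate. As a rough illustration only, here is a minimal Python sketch of how such a task and its success-rate scoring could be represented; `WebDSTask`, its fields, and `run_agent` are hypothetical names, not the benchmark's actual schema or API.

```python
# Hypothetical sketch (not the WebDS release): one way to represent an
# end-to-end web data-science task and compute an agent's success rate.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class WebDSTask:
    website: str                   # site the agent must browse, e.g. a public data portal
    instruction: str               # natural-language acquisition/analysis goal
    check: Callable[[str], bool]   # verifier applied to the agent's final answer

def success_rate(tasks: Iterable[WebDSTask],
                 run_agent: Callable[[WebDSTask], str]) -> float:
    """Fraction of tasks whose final answer passes the task's checker."""
    tasks = list(tasks)
    if not tasks:
        return 0.0
    passed = sum(1 for task in tasks if task.check(run_agent(task)))
    return passed / len(tasks)

# On the article's numbers, this would come out near 0.15-0.20 for current
# agents and around 0.90 for humans over the 870 tasks.
```
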
research

RAPO framework improves LLM agent reasoning by combining retrieval with reinforcement learning

Researchers introduced RAPO (Retrieval-Augmented Policy Optimization), a reinforcement learning framework that improves LLM agent reasoning by incorporating off-policy retrieval signals during training. The method achieves an average 5.0% performance gain across fourteen datasets and trains roughly 1.2x faster than existing agentic RL approaches.
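
To make the idea of "off-policy retrieval signals during training" concrete, here is a minimal sketch of a REINFORCE-style loss that mixes fresh on-policy rollouts with retrieved off-policy trajectories corrected by clipped importance weights. It illustrates the general technique under those assumptions, not RAPO's actual objective, and all names are placeholders.

```python
# Minimal sketch: policy-gradient loss augmented with retrieved off-policy
# trajectories (illustrative only; not the RAPO paper's formulation).
import torch

def mixed_policy_loss(
    logp_on: torch.Tensor,      # [N] log-probs of on-policy actions under the current policy
    reward_on: torch.Tensor,    # [N] returns from fresh rollouts
    logp_ret: torch.Tensor,     # [M] log-probs of retrieved actions under the current policy
    logp_behav: torch.Tensor,   # [M] log-probs under the policy that generated the retrieved data
    reward_ret: torch.Tensor,   # [M] returns stored with the retrieved trajectories
    retrieval_weight: float = 0.5,
    clip: float = 5.0,
) -> torch.Tensor:
    # Standard policy-gradient term on on-policy rollouts.
    loss_on = -(logp_on * reward_on).mean()
    # Off-policy term on retrieved trajectories, corrected by clipped importance ratios.
    ratio = torch.exp(logp_ret - logp_behav).clamp(max=clip).detach()
    loss_ret = -(ratio * logp_ret * reward_ret).mean()
    return loss_on + retrieval_weight * loss_ret
```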

research

Researchers introduce Super Research benchmark for complex multi-step LLM reasoning

Researchers have introduced Super Research, a benchmark designed to evaluate how well large language models can handle highly complex questions requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources. The benchmark consists of 300 expert-written questions across diverse domains, with individual questions requiring as many as 100+ retrieval steps and reconciliation of conflicting evidence across 1,000+ web pages.

2 min read · via arxiv.org
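
Questions of this kind force an agent into a long loop of planning queries, gathering evidence, and reconciling contradictory claims. The sketch below shows one such loop in schematic form; `plan_queries`, `retrieve`, and the majority-vote reconciliation are illustrative assumptions, not part of the benchmark or any particular system.

```python
# Schematic long-horizon retrieval loop with simple conflict reconciliation
# (illustrative; real systems would weigh source reliability, not just counts).
from collections import Counter
from typing import Callable, Iterable

def answer_with_reconciliation(
    question: str,
    plan_queries: Callable[[str, int], list[str]],   # proposes queries for each step
    retrieve: Callable[[str], Iterable[str]],        # candidate answers extracted per query
    max_steps: int = 100,
) -> tuple[str | None, Counter]:
    support: Counter = Counter()
    for step in range(max_steps):
        for query in plan_queries(question, step):
            for candidate in retrieve(query):
                support[candidate] += 1
    if not support:
        return None, support
    # Conflicting evidence is resolved here by majority support.
    best, _ = support.most_common(1)[0]
    return best, support
```
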
benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

2 min read · via arxiv.org
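
The reported failure mode, losing track of correct intermediate states, suggests grading solutions step by step rather than only checking the final answer. Below is a small, hedged sketch of such a check under the assumption that reference solutions expose intermediate numeric values; the data format and function name are hypothetical, not CFE-Bench's actual schema.

```python
# Illustrative step-level check: find the first intermediate value where a
# model's working diverges from the reference solution (assumed numeric steps).
import math

def first_divergent_step(model_steps: list[float],
                         reference_steps: list[float],
                         rel_tol: float = 1e-3) -> int | None:
    """Index of the first mismatching intermediate value, or None if all
    compared steps agree within the relative tolerance."""
    for i, (got, want) in enumerate(zip(model_steps, reference_steps)):
        if not math.isclose(got, want, rel_tol=rel_tol):
            return i
    if len(model_steps) < len(reference_steps):
        return len(model_steps)  # the model skipped or dropped later steps
    return None
```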