LLM News

Every LLM release, update, and milestone.

benchmark

New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

Researchers have released CompMath-MCQ, a new benchmark dataset containing 1,500 originally authored graduate-level mathematics questions designed to test LLM performance on advanced topics. The dataset covers linear algebra, numerical optimization, vector calculus, probability, and Python-based scientific computing, areas largely absent from existing math benchmarks. Baseline testing with state-of-the-art LLMs indicates that advanced computational mathematical reasoning remains a significant challenge.

March 5, 2026 · 5:55 AM · 2 min read

benchmark mathematics LLM-evaluation

via arxiv.org ↗

benchmark

CareMedEval benchmark reveals LLMs struggle with biomedical critical appraisal despite reasoning improvements

Researchers introduced CareMedEval, a 534-question benchmark derived from French medical student exams, to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Testing of state-of-the-art models shows that none exceeds 50% exact-match accuracy, with particular weakness in evaluating study limitations and statistical analysis.

March 5, 2026 · 5:07 AM · 2 min read

benchmark biomedical-ai llm-evaluation

via arxiv.org ↗

research

Search Arena dataset reveals users trust citations over accuracy in search-augmented LLMs

Researchers released Search Arena, a crowd-sourced dataset of 24,000+ multi-turn interactions with search-augmented LLMs, revealing that users perceive credibility based on citation count even when sources don't support claims. The analysis uncovers a critical gap between perceived and actual credibility in search-augmented systems.

March 5, 2026 · 1:25 AM · 2 min read

search-augmented-llms credibility citations

via arxiv.org ↗

research

Researchers model human intervention patterns to build more collaborative web agents

A new research paper introduces methods for predicting when humans will intervene in autonomous web agents by analyzing distinct interaction patterns. The work, which includes a dataset of 400 real-user web navigation trajectories with over 4,200 interleaved human-agent actions, shows that intervention-aware models improved agent usefulness by 26.5% in user studies.

February 20, 2026 · 3:22 AM · 2 min read

web-agents human-ai-collaboration intervention-modeling

via arxiv.org ↗