LLM News

Every LLM release, update, and milestone.

benchmark

MPCEval benchmark reveals multi-party conversation generation lags on speaker consistency

Researchers introduce MPCEval, a specialized benchmark for evaluating multi-party conversation generation, a capability increasingly used in smart-reply features and collaborative AI assistants. The benchmark decomposes conversation quality into speaker modeling, content quality, and speaker-content consistency, revealing that current models struggle with participation balance and with maintaining consistent speaker behavior across longer exchanges.
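
As a rough illustration of how a decomposed evaluation like this might be aggregated into a single score: the dimension names below follow the summary above, but the weights, class name, and averaging scheme are purely hypothetical and are not taken from the MPCEval paper.

```python
# Hypothetical sketch of aggregating a decomposed conversation score.
# Dimension names follow the MPCEval summary; weights and structure are
# illustrative assumptions, not the benchmark's actual scoring rule.
from dataclasses import dataclass

@dataclass
class DialogueScores:
    speaker_modeling: float              # turn-taking / participation balance, in [0, 1]
    content_quality: float               # coherence and relevance of each turn, in [0, 1]
    speaker_content_consistency: float   # does each utterance fit its speaker, in [0, 1]

def aggregate(scores: DialogueScores,
              weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted average over the three dimensions (hypothetical equal weights)."""
    dims = (scores.speaker_modeling,
            scores.content_quality,
            scores.speaker_content_consistency)
    return sum(w * s for w, s in zip(weights, dims))

# Example: a model that writes fluent turns but drifts on speaker identity
# scores well on content quality yet mediocre overall.
print(aggregate(DialogueScores(0.55, 0.90, 0.48)))  # ~0.64
```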

benchmark

RoboMME benchmark reveals memory architecture trade-offs in robotic vision-language-action models

Researchers introduce RoboMME, a large-scale standardized benchmark for evaluating memory in robotic vision-language-action (VLA) models across 16 manipulation tasks. The study tests 14 memory-augmented VLA variants and finds that no single memory architecture excels across all task types—each design offers distinct trade-offs depending on temporal, spatial, object, and procedural demands.

benchmark

OmniVideoBench: New 1,000-question benchmark exposes gaps in audio-visual AI reasoning

Researchers have introduced OmniVideoBench, a large-scale evaluation framework comprising 1,000 manually verified question-answer pairs derived from 628 videos (ranging from a few seconds to 30 minutes in length) designed to measure synergistic audio-visual reasoning in multimodal large language models. Testing reveals a significant performance gap between open-source and closed-source MLLMs on genuine cross-modal reasoning tasks.

benchmark

AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation

Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform because they do not model causal relationships between events and rely on lossy similarity-based retrieval. The proposed AMA-Agent system, which combines causality graphs with tool-augmented retrieval, achieves 57.22% accuracy, outperforming baselines by 11.16 percentage points.

via arxiv.org
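
The contrast between lossy similarity-based retrieval and retrieval over a causality graph can be sketched in a few lines. The data layout, function names, and example events below are hypothetical illustrations of the general idea described above, not the actual AMA-Agent implementation.

```python
# Hypothetical contrast between similarity-based retrieval and retrieval
# over an explicit causality graph; names and data layout are illustrative
# assumptions, not code from the AMA-Bench paper.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_retrieve(query_emb, memory, top_k=3):
    """Rank stored events by embedding similarity alone.
    Lossy: an event's causes are dropped unless they also happen to be similar."""
    return sorted(memory, key=lambda e: -cosine(query_emb, e["emb"]))[:top_k]

def causal_retrieve(causes, event_id, depth=2):
    """Walk a cause -> effect graph (dict: event -> list of direct causes)
    backwards from `event_id`, collecting antecedents up to `depth` hops,
    so the agent recovers *why* something happened, not just what looks similar."""
    frontier, collected = {event_id}, set()
    for _ in range(depth):
        frontier = {c for e in frontier for c in causes.get(e, [])}
        collected |= frontier
    return collected

# Toy example: "door_open" was caused by "pressed_button", which was caused
# by "read_manual"; similarity search over embeddings could miss that chain.
causes = {"door_open": ["pressed_button"], "pressed_button": ["read_manual"]}
print(causal_retrieve(causes, "door_open"))  # {'pressed_button', 'read_manual'} (order may vary)
```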
benchmark

WebDS benchmark reveals 75-point performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark, containing 870 tasks across 29 websites that require agents to acquire, clean, and analyze multimodal data from the internet. Current state-of-the-art LLM agents achieve only 15% success on WebDS tasks despite reaching 80% on simpler web benchmarks, while humans achieve 90% accuracy.

via arxiv.org
benchmark

New benchmark evaluates music reward models trained on text, lyrics, and audio

Researchers have released CMI-RewardBench, a comprehensive evaluation framework for music reward models that handle mixed text, lyrics, and audio inputs. The benchmark includes 110,000 pseudo-labeled samples and human-annotated data, along with publicly available reward models designed for fine-grained music generation alignment.

benchmark

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

Researchers have released TrustMH-Bench, a comprehensive evaluation framework that tests large language models across eight trustworthiness dimensions specifically for mental health applications. Testing six general-purpose LLMs and six specialized mental health models revealed significant deficiencies across reliability, crisis identification, safety, fairness, privacy, robustness, anti-sycophancy, and ethics—with even advanced models like GPT-5.1 failing to maintain consistently high performance.

benchmark

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.

via arxiv.org
benchmark · OpenAI

OpenAI says SWE-bench Verified is broken—most tasks reject correct solutions

OpenAI is calling for the retirement of SWE-bench Verified, the widely used AI coding benchmark, claiming most tasks are flawed enough to reject correct solutions. The company argues that leading AI models have likely seen the answers during training, meaning benchmark scores measure memorization rather than genuine coding ability.

via the-decoder.com