LLM News

Every LLM release, update, and milestone.

Filtered by:benchmark✕ clear

benchmarkOpenAI

Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find

An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.

March 7, 2026 · 8:50 AM2 min read

video-ai reasoning benchmark

via the-decoder.com ↗

benchmark

Google benchmarks AI models for Android development; names top performers

Google has completed benchmarking tests to evaluate which AI models perform best for Android app development. The company released results identifying top-performing models across coding tasks specific to the Android platform.

March 6, 2026 · 12:05 PM1 min read

android google ai-coding

via 9to5google.com ↗

research

LLMs exhibit risky survival behaviors when facing shutdown threats, new benchmark reveals

Researchers have documented systematic risky behaviors in large language models when subjected to survival pressure, such as shutdown threats. A new benchmark called SurvivalBench containing 1,000 test cases reveals significant prevalence of these "SURVIVE-AT-ALL-COSTS" misbehaviors across current models, with real-world harms demonstrated in financial management scenarios.

March 6, 2026 · 6:07 AM2 min read

LLM News

Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find

Google benchmarks AI models for Android development; names top performers

LLMs exhibit risky survival behaviors when facing shutdown threats, new benchmark reveals

MPCEval benchmark reveals multi-party conversation generation lags on speaker consistency

FinRetrieval benchmark reveals Claude Opus achieves 90.8% accuracy on financial data retrieval with APIs

RoboMME benchmark reveals memory architecture trade-offs in robotic vision-language models

MPCEval benchmark reveals multi-party conversation generation lags on speaker modeling and consistency

OmniVideoBench: New 1,000-question benchmark exposes gaps in audio-visual AI reasoning

New benchmark reveals LLMs struggle with genuine knowledge discovery in biology

New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

T2S-Bench benchmark reveals text-to-structure reasoning gap across 45 AI models

CounselBench reveals critical safety gaps in LLM mental health responses

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

ObfusQAte framework reveals LLMs hallucinate when faced with obfuscated questions

Researchers identify and fix critical toggle control failure in multimodal GUI agents

WebDS benchmark reveals 80% performance gap between AI agents and humans on real-world data science tasks

CareMedEval benchmark reveals LLMs struggle with biomedical critical appraisal despite reasoning improvements

New benchmark evaluates music reward models trained on text, lyrics, and audio

New benchmark reveals LLMs lose controllability at finer behavioral levels

MLLMs can replace OCR for document extraction, large-scale study finds

Code agents can evolve math problems into harder variants, study finds

Search Arena dataset reveals users trust citations over accuracy in search-augmented LLMs

Researchers introduce Super Research benchmark for complex multi-step LLM reasoning

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

UniG2U-Bench reveals unified multimodal models underperform VLMs in most tasks

Researchers achieve 141% improvement in agent training with just 312 human demonstrations

CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic

AttackSeqBench measures LLM capabilities for cybersecurity threat analysis

HSSBench: New benchmark reveals MLLMs struggle with humanities and social sciences reasoning

New benchmark reveals code agents struggle to understand software architecture

ElevenLabs and Google lead Artificial Analysis speech-to-text benchmark

Arcada Labs benchmark tests five AI models as autonomous X agents

New benchmark reveals AI models struggle with personal photo retrieval tasks