LLM News

Every LLM release, update, and milestone.

Filtered by: llm-evaluation
research

T2S-Bench benchmark reveals text-to-structure reasoning gap across 45 AI models

Researchers introduced T2S-Bench, a new benchmark with 1,800 samples spanning 6 scientific domains and 32 structural types, evaluating text-to-structure reasoning in 45 mainstream models. The benchmark reveals substantial capability gaps: average accuracy on multi-hop reasoning tasks is only 52.1%, while Structure-of-Thought (SoT) prompting alone yields an average +5.7% improvement across eight text-processing tasks.

benchmark · OpenAI

CounselBench reveals critical safety gaps in LLM mental health responses

CounselBench, a new expert-evaluated benchmark, tested GPT-4, LLaMA 3, Gemini, and other LLMs on 2,000 mental health patient questions, with responses rated by 100 clinicians. The study found that LLMs frequently provide unauthorized medical advice, overgeneralize, and lack personalization, and that models systematically overrate their own performance on safety dimensions.

2 min read · via arxiv.org
benchmark

WebDS benchmark reveals wide performance gap between AI agents and humans on real-world data science tasks

Researchers introduced WebDS, the first end-to-end web-based data science benchmark, comprising 870 tasks across 29 websites. Current state-of-the-art LLM agents achieve only 15-20% success rates on these complex, multi-step data acquisition and analysis tasks, while humans reach approximately 90%, revealing significant gaps in agent capabilities.

2 min read · via arxiv.org
research

Study reveals preference leakage bias when LLMs judge synthetically-trained models

A new arXiv paper identifies preference leakage, a fundamental contamination problem in LLM-based evaluation where language models used as judges systematically favor models trained on data they synthesized. The researchers confirm the bias occurs across multiple model families and benchmarks, making it harder to detect than previously known LLM judge biases.

research

Reasoning models fail at theory of mind tasks despite math excellence

A systematic study of nine advanced language models reveals that large reasoning models—designed to excel at step-by-step math and coding—actually underperform or match non-reasoning models on theory of mind tasks. The research identifies a critical weakness: longer reasoning chains actively harm social reasoning performance, suggesting current reasoning architectures don't transfer to socio-cognitive skills.

research

Researchers expose 'preference leakage' bias in LLM judging systems

Researchers have identified a contamination problem called preference leakage in LLM-as-a-judge evaluation systems, where judges systematically favor data generated by related models. The bias occurs when the judge LLM is the same as the generator, inherits from it, or belongs to the same model family—making it harder to detect than previous LLM evaluation biases.
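The relatedness conditions described above (same model, distilled from the judge, or same model family) can be sketched as a simple guard one might run before pairing a judge with a generator. This is purely illustrative: the model names, family map, and `leakage_risk` helper are hypothetical and not from the paper.

```python
# Hypothetical maps for illustration only: which vendor family a model
# belongs to, and which teacher model (if any) synthesized its training data.
FAMILY = {
    "gpt-4o": "openai",
    "gpt-4o-mini": "openai",
    "llama-3-70b": "meta",
    "gemini-1.5-pro": "google",
}
DISTILLED_FROM = {
    "gpt-4o-mini": "gpt-4o",
}

def leakage_risk(judge: str, generator: str) -> bool:
    """Flag judge/generator pairs at risk of preference leakage:
    identical models, shared model family, or generator distilled
    from the judge's synthetic data."""
    if judge == generator:
        return True
    fam_judge, fam_gen = FAMILY.get(judge), FAMILY.get(generator)
    if fam_judge is not None and fam_judge == fam_gen:
        return True
    if DISTILLED_FROM.get(generator) == judge:
        return True
    return False

print(leakage_risk("gpt-4o", "gpt-4o-mini"))          # same family -> True
print(leakage_risk("gemini-1.5-pro", "llama-3-70b"))  # unrelated   -> False
```

A check like this cannot detect leakage after the fact; it only encodes the paper's observation that related judge/generator pairs should be treated as contaminated by default.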

benchmark

CareMedEval benchmark reveals LLMs struggle with biomedical critical appraisal despite reasoning improvements

Researchers introduced CareMedEval, a 534-question benchmark derived from French medical student exams, to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Testing state-of-the-art models reveals none exceed 50% exact match accuracy, with particular weakness in evaluating study limitations and statistical analysis.

research

Researchers introduce Super Research benchmark for complex multi-step LLM reasoning

Researchers have introduced Super Research, a benchmark designed to evaluate how well large language models handle highly complex questions requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources. The benchmark consists of 300 expert-written questions across diverse domains, each requiring 100 or more retrieval steps and reconciliation of conflicting evidence across 1,000+ web pages.

2 min read · via arxiv.org
benchmark

AttackSeqBench measures LLM capabilities for cybersecurity threat analysis

Researchers introduced AttackSeqBench, a benchmark for evaluating how well large language models understand and reason about cyber attack sequences in threat intelligence reports. The evaluation tested 7 LLMs and 5 reasoning models across multiple tasks, revealing gaps in their ability to extract actionable security insights from unstructured cybersecurity data.

research

New benchmark reveals code agents struggle to understand software architecture

A new research benchmark called Theory of Code Space (ToCS) exposes a critical limitation in AI code agents: they cannot reliably build and maintain understanding of software architecture during codebase exploration. The benchmark places agents in procedurally generated Python projects with partial observability, revealing that even frontier LLM agents score poorly at discovering module dependencies and cross-cutting invariants.