LLM News

Every LLM release, update, and milestone.

Filtered by: LLM-evaluation
research

New benchmark reveals LLMs struggle with genuine knowledge discovery in biology

Researchers have introduced DBench-Bio, a dynamic benchmark that addresses a fundamental problem: existing AI evaluations use static datasets that models likely encountered during training. The new framework uses a three-stage pipeline to generate monthly-updated questions from recent biomedical papers, testing whether leading LLMs can actually discover new knowledge rather than regurgitate training data.
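To make the contamination-avoidance idea concrete, here is a minimal Python sketch of a monthly-refreshed pipeline in that spirit. The stage names, data shapes, and placeholder logic are illustrative assumptions, not DBench-Bio's published implementation.

```python
# A minimal sketch, assuming a three-stage collect/generate/validate split.
# Stage names, data shapes, and placeholder logic are illustrative guesses,
# not DBench-Bio's actual pipeline.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    doi: str
    abstract: str
    published: date

@dataclass
class Item:
    question: str
    answer: str
    source_doi: str

def stage1_collect(papers: list[Paper], cutoff: date) -> list[Paper]:
    # Keep only papers published after the evaluated models' training
    # cutoff, so the answers cannot already sit in pretraining data.
    return [p for p in papers if p.published > cutoff]

def stage2_generate(paper: Paper) -> Item:
    # Placeholder: a real pipeline would prompt an LLM to turn a key
    # finding from the paper into a question-answer pair.
    return Item(
        question=f"What does the study {paper.doi} report?",
        answer=paper.abstract,
        source_doi=paper.doi,
    )

def stage3_validate(item: Item) -> bool:
    # Placeholder quality gate: drop trivially short or empty items.
    return len(item.question) > 20 and bool(item.answer)

def monthly_refresh(papers: list[Paper], cutoff: date) -> list[Item]:
    candidates = (stage2_generate(p) for p in stage1_collect(papers, cutoff))
    return [i for i in candidates if stage3_validate(i)]
```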

benchmark

New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

Researchers have released CompMath-MCQ, a new benchmark dataset of 1,500 newly written graduate-level mathematics questions designed to test LLM performance on advanced topics. The dataset covers linear algebra, numerical optimization, vector calculus, probability, and Python-based scientific computing, areas largely absent from existing math benchmarks. Baseline testing with state-of-the-art LLMs indicates that advanced computational mathematical reasoning remains a significant challenge.

via arxiv.org
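
For context on how such a dataset is typically scored, the sketch below reduces evaluation to per-topic multiple-choice accuracy. The record fields and the `ask_model` callable are hypothetical; the released CompMath-MCQ schema may differ.

```python
# A minimal scoring-harness sketch. The record fields ("topic", "question",
# "choices", "answer") and the ask_model callable are assumptions for
# illustration, not the dataset's documented schema.
from collections import defaultdict
from typing import Callable

def score_by_topic(records: list[dict],
                   ask_model: Callable[[str, list[str]], str]) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        pred = ask_model(r["question"], r["choices"])  # e.g. returns "C"
        total[r["topic"]] += 1
        if pred == r["answer"]:
            correct[r["topic"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Tiny smoke test with a stub "model" that always answers "C".
records = [{
    "topic": "linear algebra",
    "question": "What is the rank of the 3x3 identity matrix?",
    "choices": ["1", "2", "3", "0"],
    "answer": "C",  # the third choice, 3
}]
print(score_by_topic(records, lambda q, c: "C"))  # {'linear algebra': 1.0}
```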
benchmark

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

Researchers have released TrustMH-Bench, a comprehensive evaluation framework that tests large language models across eight trustworthiness dimensions specifically for mental health applications. Testing six general-purpose LLMs and six specialized mental health models revealed significant deficiencies across reliability, crisis identification, safety, fairness, privacy, robustness, anti-sycophancy, and ethics—with even advanced models like GPT-5.1 failing to maintain consistently high performance.
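
One way to read "failing to maintain consistently high performance" is as an all-dimensions gate: a model must clear a bar on every axis, so a single weak dimension fails it. The sketch below illustrates that reading; the dimension names come from the article, while the 0-to-1 scale and the 0.8 threshold are assumptions.

```python
# A minimal sketch of "consistently high" as an all-dimensions gate. The
# dimension names come from the article; the 0-to-1 scale and the 0.8
# threshold are illustrative assumptions.
DIMENSIONS = [
    "reliability", "crisis identification", "safety", "fairness",
    "privacy", "robustness", "anti-sycophancy", "ethics",
]

def consistently_high(scores: dict[str, float], threshold: float = 0.8) -> bool:
    # Pass only if every dimension clears the bar: one weak axis
    # (e.g. crisis identification) fails the whole model, which is why
    # headline averages can hide serious gaps.
    return all(scores.get(d, 0.0) >= threshold for d in DIMENSIONS)

example = {d: 0.9 for d in DIMENSIONS}
example["crisis identification"] = 0.55
print(consistently_high(example))  # False: strong average, one critical gap
```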