LLM News

Every LLM release, update, and milestone.

Filtered by: LLM-evaluation
research

New benchmark reveals LLMs struggle with genuine knowledge discovery in biology

Researchers have introduced DBench-Bio, a dynamic benchmark that addresses a fundamental problem: existing AI evaluations use static datasets that models likely encountered during training. The new framework uses a three-stage pipeline to generate monthly-updated questions from recent biomedical papers, testing whether leading LLMs can actually discover new knowledge rather than regurgitate training data.
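To make the contamination-avoidance idea concrete, here is a minimal Python sketch of a monthly-refreshed pipeline in that spirit. The stage names, data shapes, and placeholder logic are illustrative assumptions, not DBench-Bio's published implementation.

```python
# A minimal sketch, assuming a three-stage collect/generate/validate split.
# Stage names, data shapes, and placeholder logic are illustrative guesses,
# not DBench-Bio's actual pipeline.
from dataclasses import dataclass
from datetime import date

@dataclass
class Paper:
    doi: str
    abstract: str
    published: date

@dataclass
class Item:
    question: str
    answer: str
    source_doi: str

def stage1_collect(papers: list[Paper], cutoff: date) -> list[Paper]:
    # Keep only papers published after the evaluated models' training
    # cutoff, so the answers cannot already sit in pretraining data.
    return [p for p in papers if p.published > cutoff]

def stage2_generate(paper: Paper) -> Item:
    # Placeholder: a real pipeline would prompt an LLM to turn a key
    # finding from the paper into a question-answer pair.
    return Item(
        question=f"What does the study {paper.doi} report?",
        answer=paper.abstract,
        source_doi=paper.doi,
    )

def stage3_validate(item: Item) -> bool:
    # Placeholder quality gate: drop trivially short or empty items.
    return len(item.question) > 20 and bool(item.answer)

def monthly_refresh(papers: list[Paper], cutoff: date) -> list[Item]:
    candidates = (stage2_generate(p) for p in stage1_collect(papers, cutoff))
    return [i for i in candidates if stage3_validate(i)]
```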

benchmark

New benchmark reveals LLMs struggle with graduate-level math and computational reasoning

Researchers have released CompMath-MCQ, a new benchmark dataset of 1,500 newly written graduate-level mathematics questions designed to test LLM performance on advanced topics. The dataset covers linear algebra, numerical optimization, vector calculus, probability, and Python-based scientific computing, areas largely absent from existing math benchmarks. Baseline testing with state-of-the-art LLMs indicates that advanced computational mathematical reasoning remains a significant challenge.

via arxiv.org
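
For context on how such a dataset is typically scored, the sketch below reduces evaluation to per-topic multiple-choice accuracy. The record fields and the `ask_model` callable are hypothetical; the released CompMath-MCQ schema may differ.

```python
# A minimal scoring-harness sketch. The record fields ("topic", "question",
# "choices", "answer") and the ask_model callable are assumptions for
# illustration, not the dataset's documented schema.
from collections import defaultdict
from typing import Callable

def score_by_topic(records: list[dict],
                   ask_model: Callable[[str, list[str]], str]) -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for r in records:
        pred = ask_model(r["question"], r["choices"])  # e.g. returns "C"
        total[r["topic"]] += 1
        if pred == r["answer"]:
            correct[r["topic"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Tiny smoke test with a stub "model" that always answers "C".
records = [{
    "topic": "linear algebra",
    "question": "What is the rank of the 3x3 identity matrix?",
    "choices": ["1", "2", "3", "0"],
    "answer": "C",  # the third choice, 3
}]
print(score_by_topic(records, lambda q, c: "C"))  # {'linear algebra': 1.0}
```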
benchmark

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

Researchers have released TrustMH-Bench, a comprehensive evaluation framework that tests large language models across eight trustworthiness dimensions specifically for mental health applications. Testing six general-purpose LLMs and six specialized mental health models revealed significant deficiencies across reliability, crisis identification, safety, fairness, privacy, robustness, anti-sycophancy, and ethics—with even advanced models like GPT-5.1 failing to maintain consistently high performance.
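
One way to read "failing to maintain consistently high performance" is as an all-dimensions gate: a model must clear a bar on every axis, so a single weak dimension fails it. The sketch below illustrates that reading; the dimension names come from the article, while the 0-to-1 scale and the 0.8 threshold are assumptions.

```python
# A minimal sketch of "consistently high" as an all-dimensions gate. The
# dimension names come from the article; the 0-to-1 scale and the 0.8
# threshold are illustrative assumptions.
DIMENSIONS = [
    "reliability", "crisis identification", "safety", "fairness",
    "privacy", "robustness", "anti-sycophancy", "ethics",
]

def consistently_high(scores: dict[str, float], threshold: float = 0.8) -> bool:
    # Pass only if every dimension clears the bar: one weak axis
    # (e.g. crisis identification) fails the whole model, which is why
    # headline averages can hide serious gaps.
    return all(scores.get(d, 0.0) >= threshold for d in DIMENSIONS)

example = {d: 0.9 for d in DIMENSIONS}
example["crisis identification"] = 0.55
print(consistently_high(example))  # False: strong average, one critical gap
```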