LLM News

Every LLM release, update, and milestone.

Tags: benchmark, OpenAI

CounselBench reveals critical safety gaps in LLM mental health responses

CounselBench, a new expert-evaluated benchmark, tested GPT-4, LLaMA 3, Gemini, and other LLMs on 2,000 mental health patient questions, with responses rated by 100 clinicians. The study found that LLMs frequently give unauthorized medical advice, overgeneralize, and lack personalization, and that models systematically overrate their own performance on safety dimensions.

2 min read · via arxiv.org
Tags: benchmark

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

Researchers have released TrustMH-Bench, a comprehensive evaluation framework that tests large language models across eight trustworthiness dimensions specific to mental health applications: reliability, crisis identification, safety, fairness, privacy, robustness, anti-sycophancy, and ethics. Testing six general-purpose LLMs and six specialized mental health models revealed significant deficiencies across all eight dimensions, with even advanced models such as GPT-5.1 failing to maintain consistently high performance.