LLM News

Every LLM release, update, and milestone.


CounselBench reveals critical safety gaps in LLM mental health responses

CounselBench, a new expert-evaluated benchmark, tested GPT-4, LLaMA 3, Gemini, and other LLMs on 2,000 mental health patient questions, with responses rated by 100 clinicians. The study found that LLMs frequently provide unauthorized medical advice, overgeneralize, and lack personalization, and that models systematically overrate their own performance on safety dimensions.

2 min read · via arxiv.org
research

New benchmark reveals LLMs lose controllability at finer behavioral levels

A new arXiv paper introduces SteerEval, a hierarchical benchmark for measuring how well large language models can be controlled across language features, sentiment, and personality. The research reveals that existing steering methods degrade significantly at finer-grained behavioral specification levels, raising concerns for deployment in sensitive domains.

research

NExT-Guard enables real-time LLM safety without training or token labels

Researchers have developed NExT-Guard, a training-free framework that monitors large language models for unsafe content during streaming inference by analyzing latent features from Sparse Autoencoders. The approach outperforms supervised training methods while eliminating the need for expensive token-level annotations, making real-time safety monitoring scalable across different models.
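The core idea, monitoring a model's hidden states through a Sparse Autoencoder's latents during streaming inference, can be illustrated with a toy sketch. Everything below is a simplified stand-in, not the paper's implementation: the SAE weights, the `UNSAFE_LATENTS` indices, and the threshold are hypothetical placeholders (in practice the SAE is pre-trained on the monitored model's activations and safety-relevant latents are identified offline).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: a trained SAE encoder is a linear map followed
# by ReLU. Real weights would come from a pre-trained Sparse Autoencoder
# over the monitored model's residual stream.
D_MODEL, D_SAE = 64, 256
W_enc = rng.normal(scale=0.1, size=(D_MODEL, D_SAE))
b_enc = np.zeros(D_SAE)

# Placeholder indices of SAE latents presumed to fire on unsafe content.
UNSAFE_LATENTS = [3, 17, 42]
THRESHOLD = 1.0

def sae_encode(h):
    """Sparse latent activations: ReLU(h @ W_enc + b_enc)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def is_unsafe(h, latents=UNSAFE_LATENTS, threshold=THRESHOLD):
    """Flag a hidden state if any monitored latent exceeds the threshold."""
    z = sae_encode(h)
    return bool((z[latents] > threshold).any())

def monitor_stream(hidden_states):
    """Training-free monitor: check each token's hidden state as it streams.

    Returns the first flagged token position, or -1 if none is flagged.
    """
    for t, h in enumerate(hidden_states):
        if is_unsafe(h):
            return t
    return -1
```

The monitor needs no fine-tuning and no token-level labels: it only reads latent activations per token, which is what makes this style of check cheap enough to run during generation.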

2 min read · via arxiv.org
benchmark

New benchmark reveals major trustworthiness gaps in LLMs for mental health applications

Researchers have released TrustMH-Bench, a comprehensive evaluation framework that tests large language models across eight trustworthiness dimensions specifically for mental health applications. Testing six general-purpose LLMs and six specialized mental health models revealed significant deficiencies across reliability, crisis identification, safety, fairness, privacy, robustness, anti-sycophancy, and ethics. Even advanced models like GPT-5.1 failed to maintain consistently high performance.

research

New safety steering technique reduces unsafe T2I outputs without degrading image quality

Researchers introduce Conditioned Activation Transport (CAT), a technique that reduces unsafe content generation in text-to-image models during inference without the quality degradation seen in previous linear steering approaches. The method uses a contrastive dataset of 2,300 safe/unsafe prompt pairs and geometry-based conditioning to target only unsafe activation regions.
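The general pattern, steering activations only when they fall in an unsafe region rather than shifting every input, can be sketched in a few lines. This is a minimal illustration of *conditioned* linear steering, not CAT itself: the synthetic `safe_acts`/`unsafe_acts` clusters, the difference-of-means direction, and the projection-based boundary are all simplifying assumptions standing in for the paper's contrastive prompt dataset and geometry-based conditioning.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 32

# Hypothetical stand-in data: activations for paired safe/unsafe prompts.
# In the paper's setting these would come from a contrastive dataset of
# prompt pairs passed through the T2I model's text encoder.
safe_acts = rng.normal(size=(200, D))
unsafe_acts = rng.normal(size=(200, D)) + 2.0  # unsafe cluster is shifted

# Steering direction: difference of class means (a common linear choice).
direction = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Simplified geometric condition: a scalar boundary halfway between the
# two clusters' mean projections onto the steering direction.
boundary = 0.5 * ((safe_acts @ direction).mean()
                  + (unsafe_acts @ direction).mean())

def steer(h):
    """Project an activation back to the boundary only if it lies in the
    unsafe region; activations in the safe region pass through unchanged,
    which is what preserves quality on benign inputs."""
    proj = h @ direction
    if proj > boundary:
        return h - (proj - boundary) * direction
    return h
```

Because the intervention is a no-op for safe activations, benign generations are untouched; only activations on the unsafe side of the boundary are pulled back, which is the intuition behind avoiding the quality degradation of unconditional linear steering.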