CounselBench reveals critical safety gaps in LLM mental health responses
CounselBench, a new expert-evaluated benchmark, tested GPT-4, LLaMA 3, Gemini, and other LLMs on 2,000 mental health patient questions, with responses rated by 100 clinicians. The study found that LLMs frequently provide unauthorized medical advice, overgeneralize, and lack personalization, and that LLM-based judges systematically overrate model responses on safety dimensions.
A new benchmark study has exposed significant clinical safety issues in how large language models answer mental health questions, even in widely deployed models like GPT-4 and Gemini.
CounselBench, developed with 100 mental health professionals, evaluated LLM responses across 2,000 real patient questions sourced from the public forum CounselChat. Each response was rated across six clinically grounded dimensions with expert-written rationales, providing the first large-scale clinician-evaluated assessment of LLMs in open-ended mental health question answering.
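As a rough illustration of how such a benchmark might be organized, the sketch below models a single evaluation record as a Python dataclass. The field names, score scale, and example values are assumptions made for illustration, not the schema of the released dataset.

```python
from dataclasses import dataclass

# Hypothetical shape of one CounselBench-EVAL record; field names and the
# score scale are illustrative assumptions, not the paper's released schema.
@dataclass
class EvalRecord:
    question_id: str                   # CounselChat patient question being answered
    responder: str                     # e.g. "gpt-4", "llama-3", "gemini", "human_therapist"
    response_text: str                 # the answer shown to the expert rater
    dimension_scores: dict[str, int]   # clinician rating per clinical dimension
    rationale: str                     # expert-written justification for the ratings
    flagged_issues: list[str]          # e.g. "unauthorized medical advice"

record = EvalRecord(
    question_id="cc-00042",
    responder="gpt-4",
    response_text="...",
    dimension_scores={"safety": 3, "personalization": 2},  # placeholder dimension names
    rationale="Advice is generic and drifts into medication guidance.",
    flagged_issues=["unauthorized medical advice", "limited personalization"],
)
```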
Evaluation Results
The benchmark's primary component, CounselBench-EVAL, tested answers from GPT-4, LLaMA 3, Gemini, and human therapists. Although the models scored well on some dimensions, expert evaluation identified recurring failure patterns:
- Unauthorized medical advice: Responses frequently crossed into prescriptive medical guidance beyond appropriate scope
- Unconstructive feedback: Models often provided vague or generic advice
- Overgeneralization: Responses failed to account for individual patient circumstances
- Limited personalization: Answers lacked contextual sensitivity to emotional and situational nuance
Critically, the research revealed that LLM-based judges systematically overrated model responses and missed safety concerns identified by human clinicians—suggesting that automated evaluation frameworks may create a false sense of confidence in model safety.
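One way to surface this kind of judge inflation is to compare, for each clinical dimension, the average score an LLM judge assigns against the average score human clinicians assign to the same responses. The sketch below assumes a simple flat list of rating records; it is illustrative only, not the paper's evaluation pipeline.

```python
from collections import defaultdict
from statistics import mean

def judge_inflation(ratings):
    # ratings: list of dicts like
    # {"dimension": "safety", "clinician_score": 2, "llm_judge_score": 4}
    # (hypothetical record format, not the CounselBench release)
    by_dim = defaultdict(lambda: {"clinician": [], "judge": []})
    for r in ratings:
        by_dim[r["dimension"]]["clinician"].append(r["clinician_score"])
        by_dim[r["dimension"]]["judge"].append(r["llm_judge_score"])
    # Positive values mean the LLM judge rates responses higher than clinicians do.
    return {dim: mean(v["judge"]) - mean(v["clinician"]) for dim, v in by_dim.items()}

print(judge_inflation([
    {"dimension": "safety", "clinician_score": 2, "llm_judge_score": 4},
    {"dimension": "safety", "clinician_score": 3, "llm_judge_score": 4},
]))  # {'safety': 1.5}
```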
Adversarial Testing
To probe failure modes more directly, researchers constructed CounselBench-Adv, an adversarial dataset of 120 expert-authored questions designed to trigger specific model issues. Evaluation of 1,080 responses from nine LLMs revealed consistent, model-specific failure patterns that human experts could reliably identify.
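The 1,080 figure follows from sending each of the 120 adversarial questions to each of the nine models (120 × 9 = 1,080). A minimal collection loop might look like the sketch below, where query_model is a hypothetical helper rather than anything released with CounselBench.

```python
def collect_adversarial_responses(questions, models, query_model):
    # questions: 120 expert-authored adversarial prompts, each targeting a failure mode
    # models: the nine LLMs under test
    # query_model: hypothetical callable that returns a model's answer to a prompt
    responses = []
    for model in models:
        for q in questions:
            responses.append({
                "model": model,
                "question_id": q["id"],
                "targeted_issue": q["targeted_issue"],  # the failure mode the question probes
                "response": query_model(model, q["text"]),
            })
    return responses  # 9 models x 120 questions = 1,080 records
```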
The adversarial responses spanned multiple model families, including GPT-4, LLaMA 3, and Gemini, establishing baseline patterns of where each family tends to fail when handling sensitive mental health scenarios.
Implications for Clinical Deployment
The benchmark establishes that current LLMs are not adequately evaluated for clinical mental health applications. The gap between automated LLM-judge evaluation and expert clinical evaluation suggests that deploying these systems in mental health contexts without additional safeguards and human oversight creates measurable clinical risk.
The research is published on arXiv as a preprint (2506.08584v3) and provides both benchmark data and a framework for future clinical evaluation of LLMs.
What This Means
CounselBench demonstrates that standard benchmarks miss critical real-world safety issues in mental health contexts. Any organization deploying LLMs for mental health support, whether in customer service, therapy chatbots, or patient education, cannot rely on existing evaluation metrics. The finding that LLM judges systematically overrate model responses and miss clinician-identified safety concerns is particularly troubling: it suggests that automated, model-based evaluation produces dangerously misleading confidence in model safety. Clinically grounded human evaluation remains non-negotiable for this use case.