CareMedEval benchmark reveals LLMs struggle with biomedical critical appraisal despite reasoning improvements
Researchers introduced CareMedEval, a 534-question benchmark derived from French medical student exams, to evaluate LLMs on biomedical critical appraisal and reasoning. In tests of state-of-the-art models, none exceeds 50% exact match accuracy, with particular weakness on questions about study limitations and statistical analysis.
New Benchmark Exposes LLM Limitations in Biomedical Reasoning
A new dataset called CareMedEval reveals that even leading language models fail to reliably assess scientific literature—a critical skill in medical practice. The benchmark contains 534 questions grounded in 37 peer-reviewed scientific papers, sourced from actual exams administered to French medical students.
Dataset Design and Scale
Unlike existing biomedical benchmarks that focus on factual knowledge retrieval, CareMedEval explicitly evaluates critical reading and reasoning about scientific findings. Each question is anchored to specific papers, requiring models to understand methodology, identify limitations, and interpret statistical results—skills essential to medical practice but rarely tested in current AI evaluations.
The dataset spans authentic medical exam materials, making it more representative of real-world critical appraisal tasks than synthetically generated benchmarks.
Benchmark Results
Testing across both generalist and biomedical-specialized LLMs produced sobering results:
- No tested model exceeds 50% exact match accuracy
- Performance plateaus even for the largest and most capable systems
- Intermediate reasoning tokens (chain-of-thought outputs) improve results but do not overcome fundamental limitations
- Worst performance: questions about study limitations and statistical analysis
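Exact match is a strict criterion: for multi-select exam questions, a model earns credit only when its predicted answer set matches the gold answer set exactly, with no credit for partial overlap. A minimal sketch of this scoring, assuming multi-select questions whose answers are sets of option letters (the field names `gold` and `pred` are illustrative, not from CareMedEval):

```python
# Exact-match scoring sketch for multi-select questions.
# Assumption: each answer is a set of option letters; partial overlap scores zero.

def exact_match(gold: set[str], pred: set[str]) -> bool:
    """A prediction counts only if it equals the gold answer set exactly."""
    return gold == pred

def exact_match_accuracy(examples: list[dict]) -> float:
    """Fraction of questions answered with the exact gold set."""
    hits = sum(exact_match(set(ex["gold"]), set(ex["pred"])) for ex in examples)
    return hits / len(examples)

examples = [
    {"gold": {"A", "C"}, "pred": {"A", "C"}},  # exact match: credit
    {"gold": {"B"},      "pred": {"B", "D"}},  # partial overlap: no credit
]
print(exact_match_accuracy(examples))  # 0.5
```

This strictness is part of why scores cap out low: a model that identifies most, but not all, correct options for a question still scores zero on it.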
These findings indicate that current LLMs cannot reliably support clinical decision-making based on literature review—a concerning limitation given proposed applications in clinical support systems.
Technical Insights
The researchers benchmarked both open-source and commercial models under various context configurations. Models that generate intermediate reasoning steps showed measurable improvements, suggesting that explicit reasoning processes help but remain insufficient for reliable performance on this task.
The particular weakness in statistical analysis questions indicates LLMs struggle with quantitative reasoning grounded in specific paper details, a core competency for evaluating biomedical research.
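As an illustrative example (not drawn from the dataset) of the quantitative reasoning such questions demand, consider judging statistical significance from a reported 95% confidence interval for a ratio measure such as a risk ratio, where the null value is 1.0:

```python
# Illustrative critical-appraisal check (not from CareMedEval): a ratio
# measure (RR, OR) is statistically significant at the interval's level
# only if its confidence interval excludes the null value of 1.0.

def ci_excludes_null(lower: float, upper: float, null_value: float = 1.0) -> bool:
    """True if the confidence interval excludes the null value."""
    return not (lower <= null_value <= upper)

print(ci_excludes_null(1.12, 1.85))  # True: CI excludes 1.0, significant
print(ci_excludes_null(0.94, 1.40))  # False: CI includes 1.0, not significant
```

Answering correctly requires extracting the right numbers from the paper and applying the rule to them, a combination the benchmark results suggest current models handle unreliably.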
Implications
CareMedEval provides a challenging evaluation framework that exposes genuine limitations in current LLM capabilities, rather than an easily saturated target that inflates apparent performance. The benchmark's grounding in authentic medical education materials makes it a credible signal of real-world readiness.
These results suggest that before LLMs can be deployed in clinical or research workflows requiring critical appraisal, significant architectural improvements or specialized fine-tuning will be necessary. The dataset provides a quantifiable way to measure progress toward that capability.
What This Means
The biomedical field cannot yet rely on general-purpose or specialized LLMs for critical literature appraisal without human oversight. CareMedEval provides a concrete benchmark for tracking when—and if—this gap closes. For AI developers targeting medical applications, this benchmark represents a genuinely hard problem: models need not just domain knowledge but robust reasoning about research design and statistical validity.