CFE-Bench: New STEM reasoning benchmark reveals frontier models struggle with multi-step logic
Researchers introduced CFE-Bench (Classroom Final Exam), a multimodal benchmark using authentic university homework and exam problems across 20+ STEM domains to evaluate LLM reasoning capabilities. Gemini 3.1 Pro Preview achieved the highest score at 59.69% accuracy, while analysis revealed frontier models frequently fail to maintain correct intermediate states in multi-step solutions.
A new benchmark called CFE-Bench (Classroom Final Exam) reveals fundamental limitations in how leading language models handle complex, multi-step reasoning across STEM domains.
The benchmark, introduced in a new arXiv paper (2602.19517), uses authentic university homework and exam problems curated from real coursework across more than 20 STEM fields. Each problem includes reference solutions provided by actual course instructors, creating a grounded evaluation framework distinct from synthetic benchmarks.
Performance Results
Google's Gemini 3.1 Pro Preview leads the field at 59.69% overall accuracy, followed by Gemini 3 Flash Preview at 55.46%. The scores indicate substantial room for improvement even among frontier models.
The benchmark's construction from real, repeatedly-used course materials makes it resistant to data contamination—a common concern with LLM evaluations—since these problems existed in actual curricula before being incorporated into the benchmark.
Diagnostic Findings: Where Models Fail
Beyond overall accuracy, researchers conducted detailed analysis by decomposing instructor solutions into structured reasoning flows. The findings reveal a specific failure mode: while frontier models often answer individual intermediate sub-questions correctly, they frequently fail to maintain correct intermediate states throughout full multi-step solutions.
This suggests that reasoning breakdowns don't stem from inability to handle individual steps, but rather from accumulation of small errors or loss of context across reasoning chains.
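This failure mode can be made concrete with a minimal checker that compares a model's intermediate values against a reference chain and reports the first point of divergence. The step schema, values, and function names below are illustrative assumptions, not CFE-Bench's actual reasoning-flow format:

```python
# Illustrative sketch only: the step schema and tolerance are assumptions,
# not the benchmark's actual reasoning-flow representation.

def first_divergence(model_steps, reference_steps, tol=1e-6):
    """Return the index of the first intermediate state where the model's
    value drifts from the reference chain, or None if all states match."""
    for i, (got, expected) in enumerate(zip(model_steps, reference_steps)):
        if abs(got - expected) > tol:
            return i
    return None

# A model can answer each sub-question correctly in isolation yet drift
# mid-chain, e.g. mishandling a squaring step it would get right alone:
reference = [2.0, 4.0, 16.0, 4.0]   # hypothetical chain: x, 2x, (2x)^2, sqrt
model     = [2.0, 4.0, 8.0, 2.828]  # diverges at index 2

print(first_divergence(model, reference))
```

Locating the first divergent state, rather than only grading the final answer, is what distinguishes this kind of diagnostic analysis from plain accuracy scoring.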
Step Efficiency Problem
Another key finding: model-generated solutions typically contain more reasoning steps than the instructor-written references. This lower step efficiency directly raises the risk of error accumulation: models take longer, more convoluted paths to an answer than expert instructors do, when they reach a correct answer at all.
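A back-of-envelope model shows why extra steps hurt. If each step succeeds independently with probability p (a simplifying assumption, and the numbers below are illustrative rather than measurements from the paper), the chance of completing an n-step chain without error is p**n, so a longer path lowers end-to-end accuracy even when every individual step is easy:

```python
# Toy model of error accumulation: independent per-step success probability p.
# The 95% figure and step counts are illustrative, not from the paper.

def chain_success(p: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independence."""
    return p ** n_steps

for n in (8, 12, 16):
    print(f"{n:2d} steps @ 95% per step -> {chain_success(0.95, n):.1%}")
```

Under this toy model, padding a solution from 8 steps to 16 drops end-to-end success from roughly two thirds to under half, which is the intuition behind treating step efficiency as a first-class metric.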
The researchers provide code and data publicly at https://github.com/Analogy-AI/CFE_Bench, enabling further analysis and development of improved reasoning approaches.
What This Means
CFE-Bench identifies a precise diagnostic gap: frontier models can solve individual reasoning steps but remain weak at chaining them reliably. The benchmark offers a grounded, realistic alternative to synthetic evaluations and provides structural insight into why models fail, not just that they do. For developers building reasoning-dependent applications, maintaining correct state across multi-step solutions is likely the next critical frontier in LLM capabilities.