Reasoning models fail at theory of mind tasks despite math excellence
A systematic study of nine advanced language models reveals that large reasoning models—designed to excel at step-by-step math and coding—merely match, and sometimes underperform, non-reasoning models on theory of mind tasks. The research identifies a critical weakness: longer reasoning chains actively harm social reasoning performance, suggesting current reasoning architectures don't transfer to socio-cognitive skills.
A new arXiv study challenges the assumption that advanced reasoning models' capabilities in mathematics and coding transfer to social reasoning. Researchers systematically evaluated nine large language models on theory of mind (ToM) benchmarks—tests that measure whether models can infer hidden mental states like beliefs, desires, and intentions.
Key Findings
The core finding is stark: reasoning models do not consistently outperform non-reasoning models on theory of mind tasks and sometimes perform significantly worse.
The researchers identified three specific failure modes:
1. Slow thinking collapses. As response length increases, accuracy drops substantially. Larger reasoning budgets actively hurt performance—the opposite of what we see in math and coding tasks. This suggests reasoning models are working against themselves when applied to social reasoning.
2. Moderate and adaptive reasoning helps. When reasoning length is constrained, performance improves. This indicates that dynamic, task-aware reasoning adaptation matters more than simply enabling extended thinking.
3. Option matching shortcuts. When multiple-choice options are removed, reasoning models markedly improve. This reveals that models aren't performing genuine deduction—they're pattern-matching against provided options rather than reasoning through the problem independently.
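The option-removal probe behind finding 3 can be illustrated with a toy sketch. Everything here is hypothetical: `build_prompts` and `query_model` are not from the paper, and the stub simulates a shallow option-matching model rather than calling a real API.

```python
# Hypothetical sketch of an option-removal probe: pose the same ToM question
# twice, once with multiple-choice options and once free-form, to see whether
# answers change when there is nothing to pattern-match against.

def build_prompts(question: str, options: list[str]) -> tuple[str, str]:
    """Return (multiple-choice prompt, free-form prompt) for one item."""
    mc = question + "\nOptions:\n" + "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
    )
    free = question + "\nAnswer in your own words."
    return mc, free

def query_model(prompt: str) -> str:
    # Stub standing in for a real model call. It simulates a model that
    # uses a shallow heuristic (pick the longest option) when options are
    # shown, but reasons correctly when forced to answer free-form.
    if "Options:" in prompt:
        options = [line[3:] for line in prompt.splitlines() if line[1:3] == ". "]
        return max(options, key=len)
    return "She will look in the basket, where she believes the ball still is."

question = ("Sally put a ball in the basket and left. "
            "Anne moved it to the box. Where will Sally look?")
options = ["in the basket", "in the box where Anne hid it from her"]
mc_prompt, free_prompt = build_prompts(question, options)
print(query_model(mc_prompt))    # answer with options present
print(query_model(free_prompt))  # answer with options removed
```

A disagreement between the two answers on items where the free-form answer is correct is the signature the paper describes: the options themselves, not the scenario, are driving the response.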
Intervention Results
The team designed two intervention approaches to verify and mitigate these problems:
- Slow-to-Fast (S2F) adaptive reasoning: Dynamically constrains reasoning length based on task requirements.
- Think-to-Match (T2M) shortcut prevention: Prevents models from relying on option matching patterns.
Both interventions improved performance, confirming that the failures stem from how these models approach reasoning, not from fundamental capability gaps.
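The paper's actual S2F implementation is not reproduced here, but the core idea, a task-aware cap on reasoning length, can be sketched minimally. The cue list, budgets, and `generate` stub below are all illustrative assumptions, not the authors' method.

```python
# Minimal sketch of a Slow-to-Fast-style adaptive budget: a cheap classifier
# assigns each prompt a reasoning-token budget, constraining "slow thinking"
# on social-reasoning prompts while leaving formal tasks unconstrained.
# The cue list and budget values are arbitrary placeholders.

SOCIAL_CUES = ("believes", "thinks", "intends", "feels", "knows that")

def reasoning_budget(prompt: str) -> int:
    """Heuristic task-aware budget: short for ToM-style prompts, long otherwise."""
    if any(cue in prompt.lower() for cue in SOCIAL_CUES):
        return 256    # tight budget: extended chains hurt on social tasks
    return 4096       # generous budget for math/coding-style prompts

def generate(prompt: str, max_reasoning_tokens: int) -> str:
    # Stub: a real implementation would pass the budget to a model API
    # that supports limiting the length of its reasoning trace.
    return f"[response produced within {max_reasoning_tokens} reasoning tokens]"

for p in ("Prove that the sum of two even numbers is even.",
          "Sally believes the ball is in the basket. Where will she look?"):
    print(generate(p, reasoning_budget(p)))
```

The design point is that the budget is decided per task rather than globally: the same model keeps its long chains where they help and loses them where, per the study, they hurt.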
Why This Matters
The study exposes a critical limitation in current reasoning model architectures. While models from Anthropic, OpenAI, and others have demonstrated genuine advances in formal reasoning (mathematics, coding, formal logic), those capabilities do not carry over to social reasoning tasks.
Theory of mind is fundamental to natural human interaction—understanding others' beliefs, intentions, and perspectives. The inability of reasoning models to handle these tasks suggests that social reasoning requires fundamentally different capabilities than formal step-by-step deduction.
The research indicates the problem is not a shortage of computation or tokens; extended thinking actively hurts performance. This suggests reasoning models may need architectural designs tailored to socio-cognitive tasks rather than generalizations of their math and coding recipes.
What This Means
Current reasoning models have hit a clear capability ceiling on social reasoning. The advancement that made them exceptional at formal tasks—extended chain-of-thought reasoning—actively undermines their ability to model human mental states. Building systems that handle both formal and social reasoning will require either task-specific adaptation layers or a rethinking of how reasoning itself is implemented. For applications requiring both mathematical rigor and social understanding, current reasoning models are not a complete solution.