New method uses structural graphs to fix LLM reasoning collapse in multi-step theorem prediction
Researchers have identified and solved a critical scaling problem in LLM-based theorem prediction called Structural Drift, where in-context learning performance collapses as reasoning depth increases. Using Theorem Precedence Graphs to encode topological dependencies, they achieved 89.29% accuracy on the FormalGeo7k benchmark—matching state-of-the-art supervised approaches without any gradient-based training.
Structural Drift Cripples In-Context Learning for Theorem Prediction
Researchers have identified a fundamental scaling failure in LLM-based automated reasoning: as reasoning depth increases, vanilla in-context learning (ICL) performance degrades sharply, often approaching zero accuracy. The root cause is the model's inability to recover latent topological dependencies between theorem steps, leading to unstructured exploration that compounds error across multi-step proofs.
The finding, presented in a new arXiv paper, exposes why existing neural-symbolic approaches relying on supervised parametric models struggle to generalize to evolving theorem libraries—they lack explicit structural constraints that organize the reasoning space.
Theorem Precedence Graphs Impose Topological Order
The proposed solution introduces Theorem Precedence Graphs, a non-parametric method that encodes temporal dependencies from historical solution traces as directed graphs. These graphs impose explicit topological constraints during inference, effectively pruning the search space without requiring any gradient-based optimization.
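The paper does not publish reference code, but the graph-construction idea can be sketched under simple assumptions: represent each historical solution trace as an ordered list of theorem names, and record a directed edge (a, b) whenever theorem a was applied before theorem b in some past proof. The trace data and theorem names below are illustrative, not from the benchmark.

```python
from collections import defaultdict

def build_precedence_graph(traces):
    """Build a directed Theorem Precedence Graph from solution traces.

    Each trace is an ordered list of theorem names. An edge (a, b)
    records that theorem a preceded theorem b in at least one past
    proof, encoding the temporal dependency as a topological constraint.
    """
    graph = defaultdict(set)
    for trace in traces:
        for i, earlier in enumerate(trace):
            for later in trace[i + 1:]:
                if earlier != later:
                    graph[earlier].add(later)
    return graph

# Hypothetical traces from two previously solved geometry problems.
traces = [
    ["parallel_property", "angle_equality", "triangle_congruence"],
    ["parallel_property", "triangle_congruence"],
]
graph = build_precedence_graph(traces)
```

During inference, a candidate theorem whose recorded predecessors have not yet been applied can be pruned from the search space, which is how the graph constrains exploration.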
The approach couples three components:
- Graph construction: a retrieval-augmented procedure builds precedence graphs from past solutions
- Structured planning: the LLM acts as a planner constrained by the graph's topological order
- Symbolic execution: each predicted step is verified before it is committed
This is a training-free method: no fine-tuning or parameter updates are needed.
89.29% Accuracy Matches Supervised State-of-the-Art
On the FormalGeo7k benchmark, the method achieved 89.29% accuracy, substantially outperforming in-context learning baselines while matching supervised models trained end-to-end. The results demonstrate that explicit structural priors can compensate for the knowledge gaps that plague purely parametric approaches.
The key insight: LLMs perform better when given explicit constraints that reflect the mathematical structure of the problem, rather than being asked to infer dependencies implicitly from examples.
What This Means
This work addresses a practical bottleneck in automated reasoning systems. As theorem libraries evolve and grow, retraining supervised models becomes expensive; a training-free, structurally grounded approach generalizes without that cost. The method suggests that hybrid neural-symbolic systems need not learn structure parametrically: encoding it explicitly in the reasoning process itself is more effective. Future work may explore how Theorem Precedence Graphs scale to larger proof spaces and whether similar structural priors transfer to other multi-step reasoning tasks beyond theorem proving.