NeuroProlog framework combines neural networks with symbolic reasoning to fix LLM math errors

Researchers introduce NeuroProlog, a neurosymbolic framework that compiles math word problems into executable Prolog programs with formal verification guarantees. A multi-task "Cocktail" training strategy achieves significant accuracy improvements on GSM8K: +5.23% on Qwen-32B, +3.43% on GPT-OSS-20B, and +5.54% on Llama-3B compared to single-task baselines.

NeuroProlog Framework Achieves 5%+ Accuracy Gains on Mathematical Reasoning

Large language models struggle with mathematical reasoning despite strong natural language performance, frequently producing fluent but logically inconsistent solutions. Researchers now present NeuroProlog, a neurosymbolic framework that addresses this fundamental weakness by compiling math word problems into executable Prolog programs with formal verification guarantees.

Architecture and Training Strategy

NeuroProlog's core innovation is a multi-task "Cocktail" training strategy that jointly optimizes three synergistic objectives within a unified symbolic representation space:

  1. Mathematical formula-to-rule translation (KB): Converting mathematical expressions into symbolic rules
  2. Natural language-to-program synthesis (SOLVE): Translating word problems into executable code
  3. Program-answer alignment: Ensuring outputs match expected solutions

This joint supervision creates positive transfer: symbolic grounding learned during formula translation directly improves compositional reasoning, a phenomenon the researchers term the "Cocktail effect."
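The three-objective mixture can be sketched as a tagged multi-task dataset sampled into mixed batches. This is a hypothetical sketch only: the field names, task tags, sampling scheme, and Prolog snippets below are illustrative, not the paper's actual data format.

```python
import random

# Illustrative examples for the three "Cocktail" objectives.
# One model is jointly supervised on all of them, so each example
# carries a task tag. (Hypothetical schema, not the paper's.)
COCKTAIL_EXAMPLES = [
    {   # KB: mathematical formula -> symbolic rule
        "task": "KB",
        "input": "area of a rectangle = width * height",
        "target": "area(rect(W, H), A) :- A is W * H.",
    },
    {   # SOLVE: word problem -> executable Prolog program
        "task": "SOLVE",
        "input": "A garden is 4 m wide and 6 m long. What is its area?",
        "target": "solve(A) :- area(rect(4, 6), A).",
    },
    {   # ALIGN: program output must match the gold answer
        "task": "ALIGN",
        "input": "solve(A) :- area(rect(4, 6), A).",
        "target": "24",
    },
]

def sample_batch(examples, weights, k, seed=0):
    """Draw a mixed batch so every training step sees all three
    objectives in the given proportions."""
    rng = random.Random(seed)
    return rng.choices(examples, weights=weights, k=k)
```

Mixing at the batch level (rather than training the tasks sequentially) is what lets gradients from formula translation and program synthesis shape the same symbolic representation space.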

Evaluation Results

Comprehensive testing across four model scales (3B to 32B parameters) on the GSM8K benchmark demonstrates consistent improvements:

  • Qwen-32B: +5.23% accuracy (p < 0.01)
  • GPT-OSS-20B: +3.43% accuracy (p < 0.01)
  • Llama-3B: +5.54% accuracy (p < 0.05)

At the 32B scale, NeuroProlog's execution-guided decoding pipeline transforms unfixable type errors (12% baseline repair rate) into correctable domain errors (96% repair rate), achieving 92.7% overall error correction.

Scale-Dependent Learning Dynamics

The research reveals critical differences in how models at different scales learn symbolic reasoning:

32B models show optimal performance: Cocktail training eliminates type errors and improves semantic understanding, enabling sophisticated error repair.

8B models expose capacity constraints: The same training eliminates syntactic errors but introduces new semantic failures, suggesting a threshold beyond which models lack sufficient capacity for type-safe symbolic reasoning.

Inference Pipeline

At inference, NeuroProlog uses an execution-guided decoding approach with a fine-grained error taxonomy that enables iterative program repair. This allows the system to quantify model self-debugging capacity and identify which error categories are recoverable versus structural.
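A minimal sketch of what such an execution-guided repair loop could look like, assuming a coarse syntax/type/domain taxonomy. The category names here and the `run_prolog` / `llm_repair` hooks are placeholders standing in for a Prolog interpreter call and a model repair prompt; they are not the paper's actual interfaces.

```python
def classify_error(stderr: str) -> str:
    """Map interpreter output onto a coarse error taxonomy.
    (Illustrative categories; the paper's taxonomy is finer-grained.)"""
    if "syntax error" in stderr:
        return "syntax"   # malformed program text
    if "type_error" in stderr:
        return "type"     # ill-typed term, often structural
    return "domain"       # wrong values/predicates, usually repairable

def repair_loop(program: str, run_prolog, llm_repair, max_rounds: int = 3):
    """Execute the program, classify any failure, and ask the model
    for a targeted fix, up to a fixed repair budget."""
    for _ in range(max_rounds):
        ok, output = run_prolog(program)
        if ok:
            return output
        category = classify_error(output)
        # Feeding the category back makes the repair prompt targeted,
        # and logging it per round is what quantifies which error
        # classes are recoverable versus structural.
        program = llm_repair(program, category, output)
    return None  # unrecoverable within the repair budget
```

Counting successes per error category across a benchmark run is how repair rates like the 12% versus 96% figures reported above would be measured.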

What This Means

NeuroProlog demonstrates that neurosymbolic approaches, which pair neural networks with formal symbolic systems, can substantially improve LLM reliability on mathematical tasks. The accuracy gains are statistically significant but modest in absolute terms, suggesting this is not a complete solution to LLM mathematical reasoning. The discovery of scale-dependent learning dynamics points toward a fundamental insight: models need sufficient capacity to learn type-safe symbolic reasoning, and training strategy alone cannot overcome hard capacity limits. The framework's ability to repair errors during inference offers practical value for applications requiring verified mathematical reasoning.

NeuroProlog: Neurosymbolic Math Reasoning for LLMs | TPS