New RL framework CORE helps LLMs bridge gap between solving math problems and understanding concepts

Researchers have identified a critical gap in how large language models learn mathematics: they can solve problems but often don't understand the underlying concepts. A new reinforcement learning framework called CORE addresses this by using explicit concept definitions as training signals, rather than just reinforcing correct final answers.

LLMs Solve Math but Don't Understand Concepts

Large language models frequently produce correct answers to mathematical problems, yet fail when asked to apply the same concepts in unfamiliar contexts. This mismatch reveals a fundamental weakness: models are pattern-matching rather than reasoning conceptually.

Researchers have now introduced CORE (Concept-Oriented REinforcement), an RL training framework designed to close this gap by making mathematical concepts an explicit supervision signal during model training.

How CORE Works

The framework operates in three stages:

  1. Sanity probe: The researchers first demonstrated the problem empirically: LLMs can restate mathematical definitions yet fail on concept-linked quizzes, which quantifies the conceptual reasoning gap.

  2. Concept-aligned training: CORE synthesizes quizzes explicitly linked to concepts and injects brief concept snippets during rollouts. This "concept-priming" guides models toward concept-driven reasoning trajectories rather than surface-level pattern matching.

  3. Reinforcement via alignment: The framework reinforces conceptual reasoning through trajectory replacement after group failures, paired with a lightweight forward-KL constraint that aligns the unguided policy with the concept-primed policy. Alternatively, it applies standard GRPO directly to the concept-aligned quizzes.
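The concept-priming and trajectory-replacement ideas in the stages above can be sketched roughly as follows. Everything here is illustrative: `sample_rollout` and `sample_primed_rollout` are stand-ins for actual model generation and reward verification, and the function names are not from the paper.

```python
import random

def sample_rollout(problem: str) -> dict:
    """Stand-in for an unguided model rollout plus verifier reward."""
    return {"text": f"attempt: {problem}", "reward": 0.0}

def sample_primed_rollout(problem: str, concept: str) -> dict:
    """Stand-in for a rollout whose prompt was prepended with a concept
    snippet ("concept-priming"); assumed here to succeed for illustration."""
    primed_prompt = f"Concept: {concept}\n\nProblem: {problem}"
    return {"text": primed_prompt, "reward": 1.0}

def build_group(problem: str, concept: str, group_size: int = 4) -> list:
    """Collect a GRPO-style group of rollouts; on a group failure
    (all rewards zero), replace one trajectory with a concept-primed one
    so the group still carries a learning signal."""
    group = [sample_rollout(problem) for _ in range(group_size)]
    if all(r["reward"] == 0.0 for r in group):
        group[random.randrange(group_size)] = sample_primed_rollout(problem, concept)
    return group

group = build_group("Compute gcd(12, 18).", "gcd(a, b) is the largest integer dividing both a and b.")
```

The point of the replacement step is that a group in which every rollout fails produces no useful advantage signal; swapping in a concept-primed trajectory restores one.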

Training Data and Evaluation

The approach uses high-quality, low-contamination textbook resources that explicitly link verifiable exercises to concise concept descriptions. This foundational pairing is critical: it allows the framework to provide fine-grained conceptual supervision rather than the coarse signal of right/wrong answers.
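A minimal sketch of what such an exercise-concept pairing might look like as a training record; the field names are assumptions for illustration, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ConceptExercise:
    """One training record pairing a verifiable exercise with the
    concise concept description it is linked to."""
    concept_name: str
    concept_snippet: str  # short, textbook-style definition
    problem: str          # verifiable exercise statement
    answer: str           # final answer a verifier can check

ex = ConceptExercise(
    concept_name="gcd",
    concept_snippet="gcd(a, b) is the largest integer dividing both a and b.",
    problem="Compute gcd(12, 18).",
    answer="6",
)
```

The snippet is what gets injected during concept-primed rollouts, while the answer supplies the verifiable outcome reward.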

Across multiple models tested, CORE delivered consistent improvements over vanilla baselines and supervised fine-tuning (SFT) approaches. Crucially, gains appeared not only on in-domain concept-exercise suites but also on diverse out-of-domain math benchmarks, suggesting genuine conceptual transfer rather than memorization.

Key Technical Properties

CORE remains algorithm-agnostic and verifier-agnostic, meaning it can work with different RL algorithms and reward-verification systems. The framework unifies two training modes, direct training on concept-aligned quizzes and concept-injected rollouts, under a single outcome-regularization view, giving the method a theoretical grounding.

The explicit injection of concepts during rollouts is the core innovation. Rather than hoping models infer concepts from outcome signals, CORE makes conceptual reasoning part of the trajectory itself, then reinforces those trajectories.

What This Means

This research addresses a genuine failure mode in current LLM training: the ability to pass benchmarks without understanding. For mathematical reasoning specifically, CORE demonstrates that RL frameworks can be tuned to target conceptual competence directly, not just answer correctness. The approach is particularly relevant as organizations deploy LLMs for technical education and professional reasoning tasks where conceptual understanding matters more than pattern-matching. The algorithm-agnostic design means this technique could integrate into existing RL pipelines used by labs training reasoning models.
