New RLVR method reformulates reward-based LLM training as classification problem
A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than one of scalar reward weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.
A new arXiv paper (2602.05630) proposes a fundamental rethinking of how reinforcement learning with verifiable rewards (RLVR) trains large language models on complex reasoning tasks.
The paper identifies two core inefficiencies in current RLVR methods like GRPO: "Gradient Misassignment in Positives" (where correct solutions receive suboptimal gradient signals) and "Gradient Domination in Negatives" (where incorrect solutions disproportionately influence training). These issues lead to unstable and suboptimal policy updates.
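To see how "Gradient Domination in Negatives" can arise, consider the group-normalized advantages that GRPO-style methods use as scalar gradient weights. The sketch below (illustrative only, not the paper's code) standardizes each rollout's reward against its group; when most rollouts in a group are correct, the lone incorrect one receives a much larger advantage magnitude than any individual correct one:

```python
import math

def grpo_advantages(rewards):
    """Group-normalized advantages as used in GRPO-style methods:
    each rollout's reward is standardized against its group's
    mean and standard deviation."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = max(math.sqrt(var), 1e-6)  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# A group of 8 rollouts where only one is incorrect (reward 0).
rewards = [1.0] * 7 + [0.0]
adv = grpo_advantages(rewards)
# The lone negative rollout gets |advantage| of about 2.65, while each
# positive gets about 0.38: the single failure dominates the update.
```

Because the advantages are z-scores, their magnitudes grow with group imbalance rather than staying bounded, which is the imbalance REAL's classification framing is designed to remove.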
The REAL Framework
Instead of treating rewards as scalar weights, the authors propose Rewards as Labels (REAL), which recasts the problem as categorical classification. The framework introduces anchor logits to enhance policy learning and mathematically ensures "monotonic and bounded gradient weighting" that distributes gradients evenly across rollouts.
This conceptual shift, from scalar reward weighting to classification, produces more balanced gradient allocation and directly mitigates the identified gradient pathologies.
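A minimal sketch of the rewards-as-labels idea is below. The function name, the use of the sequence log-probability as a score, and the way the anchor logit enters are all assumptions for illustration, not the paper's exact objective; the point is only that a binary verifiable reward can serve directly as a classification label under binary cross-entropy:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def real_style_loss(seq_logprob, anchor_logit, reward):
    """Illustrative rewards-as-labels loss (hypothetical form):
    the sequence score is measured relative to an anchor logit,
    and the verifiable reward (0 or 1) is the classification label
    for a standard binary cross-entropy."""
    logit = seq_logprob - anchor_logit   # score relative to the anchor
    p = sigmoid(logit)
    return -(reward * math.log(p) + (1.0 - reward) * math.log(1.0 - p))

loss_pos = real_style_loss(-1.0, -2.0, 1.0)  # correct rollout
loss_neg = real_style_loss(-1.0, -2.0, 0.0)  # incorrect rollout
```

One attraction of this form: the gradient of binary cross-entropy with respect to the logit is simply `sigmoid(logit) - reward`, which lies in (-1, 1) for every rollout, so no single example can dominate the batch the way an unbounded scalar advantage can.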
Benchmark Results
Experiments on mathematical reasoning benchmarks show consistent gains:
- 1.5B parameter model: REAL improves average Pass@1 by 6.7% over DAPO (a strong GRPO variant)
- 7B parameter model: 6.2% improvement over DAPO, 1.7% over GSPO
- Stability: even with a vanilla binary cross-entropy loss, REAL trains stably and exceeds DAPO by 4.5% on average
The results suggest the framework is robust across model scales and loss function choices.
Technical Contribution
The core insight is conceptual: verifiable rewards naturally fit a classification framing (correct/incorrect) rather than continuous scalar optimization. By formalizing this and adding anchor logits (learned reference points), the method achieves better gradient flow without requiring specialized loss functions.
The paper provides theoretical analysis showing why REAL's gradient weighting avoids the extreme values that plague current methods, leading to more stable training dynamics.
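The "monotonic and bounded" property is easy to verify for a cross-entropy-style objective, independent of the paper's specific analysis. The per-example gradient weight with respect to the logit is `sigmoid(logit) - label`: monotone in the logit and confined to (-1, 1) no matter how extreme the logit gets, in contrast to a z-scored scalar advantage, which has no such bound:

```python
import math

def bce_grad_weight(logit, label):
    """Gradient of binary cross-entropy w.r.t. the logit:
    sigmoid(logit) - label. Monotone in the logit, bounded in (-1, 1)."""
    return 1.0 / (1.0 + math.exp(-logit)) - label

# Sweep increasingly extreme logits for a positive (label = 1) example.
weights = [bce_grad_weight(x, 1.0) for x in (-10.0, -1.0, 0.0, 1.0, 10.0)]
# Every weight stays strictly inside (-1, 0) and increases with the logit,
# so no rollout can contribute an arbitrarily large gradient.
```

This boundedness is the mechanism by which a classification loss avoids the extreme per-rollout weights described above.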
What This Means
This work addresses a real pain point in scaling reasoning-focused LLMs: training instability and suboptimal convergence when using rule-based reward signals. The 6-7% Pass@1 gains are meaningful in the context of mathematical benchmarks where incremental improvements compound.
If the method generalizes beyond mathematical reasoning (an open question the paper doesn't fully explore), it could become a standard approach for any RLVR application. The fact that it works with simple binary cross-entropy suggests practitioners won't need complex loss engineering to adopt it.
The framework is particularly relevant as more labs pursue verifiable reward systems for code, mathematics, and formal reasoning tasks where ground truth can be mechanically checked.