New RLVR method reformulates reward-based LLM training as classification problem
A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.