LLM News

Every LLM release, update, and milestone.

Filtered by: reinforcement-learning
research

Researchers Identify 'Contextual Inertia' Bug in LLMs During Multi-Turn Conversations

Researchers have identified a critical failure mode in large language models called 'contextual inertia'—where models ignore new information in multi-turn conversations and rigidly stick to previous reasoning. A new training method called RLSTA uses single-turn performance as an anchor to stabilize multi-turn reasoning and recover performance lost to this phenomenon.
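The single-turn anchor can be illustrated with a toy reward shaping. This is a hypothetical sketch of the general idea, not RLSTA's actual objective: `anchored_multi_turn_reward` and its `beta` penalty weight are illustrative names.

```python
def anchored_multi_turn_reward(multi_turn_score, single_turn_score, beta=1.0):
    # Hypothetical single-turn anchor: penalize the multi-turn reward
    # whenever it falls below what the model scores on the same question
    # asked in a single turn, discouraging "contextual inertia" where the
    # model clings to earlier reasoning instead of using new information.
    gap = max(0.0, single_turn_score - multi_turn_score)
    return multi_turn_score - beta * gap

# No anchor penalty when multi-turn performance matches single-turn...
r_ok = anchored_multi_turn_reward(0.9, 0.9)
# ...but a large penalty when the conversation drags performance down.
r_inertia = anchored_multi_turn_reward(0.4, 0.9)
```

The anchor only ever subtracts reward, so it cannot inflate scores on questions the model already handles well in a single turn.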

research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach addresses a critical limitation: canonical clipping disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. Experiments show consistent improvements over standard clipping methods.
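The contrast with canonical PPO clipping can be sketched numerically. This is a toy illustration of the general idea, assuming the clipping band widens for low-probability tokens; `bandpo_like_loss` and the band formula are my own simplification, not the paper's rule.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Canonical PPO: a fixed clipping interval [1 - eps, 1 + eps]
    # caps how much any single token can move the objective.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def bandpo_like_loss(ratio, advantage, p_old, eps=0.2):
    # Hypothetical probability-aware variant: rare tokens (small p_old),
    # which often carry the high-advantage "tail" strategies, get a wider
    # interval, so fixed clipping suppresses their gradients less.
    band = eps * (1.0 + (1.0 - p_old))  # wider band as p_old shrinks
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - band, 1 + band) * advantage)

# A rare token (p_old = 0.05) whose ratio jumped to 1.5 with advantage 1:
# fixed clipping caps its contribution at 1.2, the wider band allows 1.39.
fixed = -float(ppo_clip_loss(1.5, 1.0))
aware = -float(bandpo_like_loss(1.5, 1.0, p_old=0.05))
```

Because common tokens keep a near-standard band, the relaxation is targeted at exactly the tail strategies the summary says fixed clipping suppresses.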

research

Reinforcement fine-tuning preserves model knowledge better than supervised fine-tuning, study finds

A new study on Qwen2.5-VL reveals reinforcement fine-tuning (RFT) significantly outperforms supervised fine-tuning (SFT) at preserving a model's existing knowledge during post-training adaptation. While SFT enables faster task learning, it causes catastrophic forgetting; RFT learns more slowly but maintains prior knowledge by reinforcing samples naturally aligned with the base model's probability landscape.

research

Self-confidence signals enable unsupervised reward training for text-to-image models

Researchers introduce SOLACE, a post-training framework that replaces external reward models with an internal self-confidence signal derived from how accurately a text-to-image model recovers injected noise. The method enables fully unsupervised optimization and shows measurable improvements in compositional generation, text rendering, and text-image alignment.
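The self-confidence signal can be sketched as a noise-recovery score. This is a minimal stand-in, assuming the reward is the negative error between injected and predicted noise; `self_confidence_reward` is an illustrative name, and real diffusion models predict noise per timestep rather than in one shot.

```python
import numpy as np

def self_confidence_reward(predicted_noise, injected_noise):
    # SOLACE-style internal signal (sketch): reward the model by how
    # accurately it recovers the noise injected into its own sample.
    # Lower recovery error = higher self-confidence = higher reward.
    err = np.mean((np.asarray(predicted_noise, dtype=float)
                   - np.asarray(injected_noise, dtype=float)) ** 2)
    return -float(err)

rng = np.random.default_rng(0)
noise = rng.standard_normal(64)
# A model that nearly recovers the injected noise earns a high reward...
good = self_confidence_reward(noise + 0.05 * rng.standard_normal(64), noise)
# ...while an unrelated prediction earns a much lower one.
bad = self_confidence_reward(rng.standard_normal(64), noise)
```

The point of the construction is that no external reward model or labeled preference data appears anywhere; the signal comes entirely from the generator's own denoising behavior.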

research

Study shows RL training enables LLMs to abstain on unanswerable temporal questions, outperforming GPT-4o

A new arXiv study presents the first systematic evaluation of training large language models to abstain—refuse to answer—on temporal questions they cannot reliably answer. Using reinforcement learning with abstention-aware rewards, researchers achieved 3.46-5.80% higher accuracy on temporal QA benchmarks than GPT-4o, while improving true positive rates on unanswerable questions by 20%.
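An abstention-aware reward can be sketched as a small payoff table. The exact reward values and the `"ABSTAIN"` token below are hypothetical; the paper's shaping will differ, but the structure is the same: reward refusal on unanswerable questions, penalize it on answerable ones.

```python
def abstention_reward(answer, gold, answerable):
    # Hypothetical abstention-aware reward shaping for temporal QA:
    # correct answers earn full reward, abstaining on unanswerable
    # questions is rewarded, and confident wrong answers are penalized.
    abstained = answer == "ABSTAIN"
    if not answerable:
        return 1.0 if abstained else -1.0   # refusing is the right move
    if abstained:
        return -0.5                          # over-cautious on answerable input
    return 1.0 if answer == gold else -1.0

r_refuse = abstention_reward("ABSTAIN", None, answerable=False)
r_wrong = abstention_reward("1950", "1947", answerable=True)
r_timid = abstention_reward("ABSTAIN", "1947", answerable=True)
```

Making the wrong-answer penalty harsher than the over-abstention penalty is what pushes the policy toward refusing when it cannot reliably answer, which is the behavior the study trains for.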

2 min read · via arxiv.org
research

Knowledge graphs enable smaller models to outperform GPT-5.2 on complex reasoning

A new training approach using knowledge graphs as implicit reward models enables a 14-billion-parameter model to outperform much larger systems like GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks. Researchers combined supervised fine-tuning and reinforcement learning with knowledge graph path signals to ground models in verifiable domain facts.

2 min read · via arxiv.org
research

New RLVR method reformulates reward-based LLM training as classification problem

A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.
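The reframing can be contrasted in a few lines. This is a toy sketch under my own simplifications, not REAL's actual loss: `scalar_weighted_loss` stands in for GRPO-style advantage weighting, and `rewards_as_labels_loss` treats each binary verifiable reward as a classification target for the sequence probability.

```python
import math

def scalar_weighted_loss(logps, rewards):
    # GRPO-style scalar weighting (sketch): group-centered rewards
    # scale each sampled response's log-probability.
    mean_r = sum(rewards) / len(rewards)
    return -sum((r - mean_r) * lp for lp, r in zip(logps, rewards))

def rewards_as_labels_loss(logps, rewards):
    # Classification view (sketch): the verifiable reward (0 or 1)
    # becomes the label, and cross-entropy is applied to the model's
    # probability of the response, instead of weighting the gradient.
    loss = 0.0
    for lp, r in zip(logps, rewards):
        p = math.exp(lp)  # model probability of the whole response
        loss += -(r * math.log(p) + (1 - r) * math.log(1 - p))
    return loss / len(logps)

logps = [-0.5, -2.0]   # sequence log-probs of two sampled responses
rewards = [1, 0]       # verifiable rewards: first correct, second not
grpo_style = scalar_weighted_loss(logps, rewards)
real_style = rewards_as_labels_loss(logps, rewards)
```

Under the classification view, an incorrect response contributes a gradient that actively pushes its probability down toward the 0 label, rather than merely receiving a negative scalar weight.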

research

ELMUR extends RL memory horizons 100,000x with structured external memory architecture

Researchers introduce ELMUR, a transformer variant that adds structured external memory to handle long-horizon reinforcement learning problems under partial observability. The system extends effective decision-making horizons beyond standard attention windows by up to 100,000x and achieves 100% success on synthetic tasks with corridors spanning one million steps.

research

RAPO framework improves LLM agent reasoning by combining retrieval with reinforcement learning

Researchers introduce RAPO (Retrieval-Augmented Policy Optimization), a reinforcement learning framework that improves LLM agent reasoning by incorporating off-policy retrieval signals during training. The method achieves an average 5.0% performance gain across fourteen datasets and trains 1.2x faster than existing agentic RL approaches.

research

New RL framework CORE helps LLMs bridge gap between solving math problems and understanding concepts

Researchers have identified a critical gap in how large language models learn mathematics: they can solve problems but often don't understand the underlying concepts. A new reinforcement learning framework called CORE addresses this by using explicit concept definitions as training signals, rather than just reinforcing correct final answers.

research

Perception-R1 uses visual reward signals to improve multimodal AI reasoning

Researchers propose Perception-R1, a method that adds visual perception reward signals to reinforcement learning training for multimodal AI models. The approach achieves state-of-the-art results on multiple reasoning benchmarks using just 1,442 training examples by explicitly teaching models to accurately perceive visual content before reasoning about it.

research

Researchers identify divergence term selection as key to preventing LLM performance collapse in RL fine-tuning

A new paper identifies a fundamental flaw in standard reinforcement learning fine-tuning approaches for large language models: the choice of divergence term directly causes the degradation of multi-attempt performance (Pass@k) despite single-attempt improvements. Researchers propose Diversity-Preserving Hybrid RL (DPH-RL), which uses mass-covering f-divergences to maintain broad solution coverage and prevent catastrophic forgetting.
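The divergence-choice argument can be made concrete with forward vs. reverse KL, the simplest mass-covering vs. mode-seeking pair (the paper considers f-divergences more broadly; the distributions below are made up for illustration).

```python
import numpy as np

def kl(p, q):
    # KL(p || q) over a discrete support; terms where p is zero vanish.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# The reference policy spreads mass over three valid solution modes;
# the fine-tuned policy has nearly collapsed onto one of them.
ref = np.array([0.4, 0.3, 0.3])
tuned = np.array([0.98, 0.01, 0.01])

reverse_kl = kl(tuned, ref)   # mode-seeking: mild penalty for the collapse
forward_kl = kl(ref, tuned)   # mass-covering: large penalty for dropped modes
```

A reverse-KL penalty (the usual RLHF choice) barely objects to the collapse, which is consistent with single-attempt gains masking a Pass@k drop; a mass-covering term makes abandoning solution modes expensive, preserving the diversity that multi-attempt metrics measure.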

research

VideoTemp-o3 combines temporal grounding with video QA in single agentic framework

Researchers have introduced VideoTemp-o3, a unified framework that addresses limitations in long-video understanding by combining temporal grounding and question-answering in a single agentic system. The approach uses a unified masking mechanism during training and reinforcement learning with dedicated reward signals to improve video segment localization and reduce hallucinations.

research

Researchers propose VCPO to stabilize asynchronous RL training for LLMs, cutting training time 2.5x

A new technique called Variance Controlled Policy Optimization (VCPO) addresses a fundamental problem in asynchronous reinforcement learning for LLMs: high variance in policy-gradient estimates from stale rollouts. The method scales learning rates based on effective sample size and applies a minimum-variance baseline, reducing long-context training time by 2.5x while maintaining synchronous performance.
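The effective-sample-size idea can be sketched with the standard Kish estimator. The `scaled_lr` damping rule below is my own illustrative stand-in for VCPO's scaling, not the paper's formula.

```python
import numpy as np

def effective_sample_size(weights):
    # Kish effective sample size of importance weights: equals n for
    # uniform weights and shrinks as stale rollouts skew the weights,
    # signaling higher variance in the policy-gradient estimate.
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def scaled_lr(base_lr, weights):
    # Hypothetical VCPO-style rule: damp the learning rate by ESS / n
    # so high-variance (stale) batches take proportionally smaller steps.
    return base_lr * effective_sample_size(weights) / len(weights)

# Fresh rollouts (uniform importance weights) keep the full learning rate...
lr_fresh = scaled_lr(1e-4, [1.0, 1.0, 1.0, 1.0])
# ...while stale, skewed weights shrink the step to match their lower
# effective sample count.
lr_stale = scaled_lr(1e-4, [4.0, 0.1, 0.1, 0.1])
```

Tying the step size to a variance proxy like ESS is what lets asynchronous training tolerate stale rollouts without diverging, which is how a 2.5x wall-clock saving can coexist with synchronous-level performance.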