LLM News

Every LLM release, update, and milestone.

Filtered by: reinforcement-learning
research

Researchers Identify 'Contextual Inertia' Bug in LLMs During Multi-Turn Conversations

Researchers have identified a critical failure mode in large language models called 'contextual inertia'—where models ignore new information in multi-turn conversations and rigidly stick to previous reasoning. A new training method called RLSTA uses single-turn performance as an anchor to stabilize multi-turn reasoning and recover performance lost to this phenomenon.
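The single-turn anchor can be illustrated with a toy reward shaping. This is a hypothetical sketch of the general idea, not RLSTA's actual objective: `anchored_multi_turn_reward` and its `beta` penalty weight are illustrative names.

```python
def anchored_multi_turn_reward(multi_turn_score, single_turn_score, beta=1.0):
    # Hypothetical single-turn anchor: penalize the multi-turn reward
    # whenever it falls below what the model scores on the same question
    # asked in a single turn, discouraging "contextual inertia" where the
    # model clings to earlier reasoning instead of using new information.
    gap = max(0.0, single_turn_score - multi_turn_score)
    return multi_turn_score - beta * gap

# No anchor penalty when multi-turn performance matches single-turn...
r_ok = anchored_multi_turn_reward(0.9, 0.9)
# ...but a large penalty when the conversation drags performance down.
r_inertia = anchored_multi_turn_reward(0.4, 0.9)
```

The anchor only ever subtracts reward, so it cannot inflate scores on questions the model already handles well in a single turn.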

research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach addresses a critical limitation: canonical clipping disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. Experiments show consistent improvements over standard clipping methods.
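The contrast with canonical PPO clipping can be sketched numerically. This is a toy illustration of the general idea, assuming the clipping band widens for low-probability tokens; `bandpo_like_loss` and the band formula are my own simplification, not the paper's rule.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Canonical PPO: a fixed clipping interval [1 - eps, 1 + eps]
    # caps how much any single token can move the objective.
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def bandpo_like_loss(ratio, advantage, p_old, eps=0.2):
    # Hypothetical probability-aware variant: rare tokens (small p_old),
    # which often carry the high-advantage "tail" strategies, get a wider
    # interval, so fixed clipping suppresses their gradients less.
    band = eps * (1.0 + (1.0 - p_old))  # wider band as p_old shrinks
    return -np.minimum(ratio * advantage,
                       np.clip(ratio, 1 - band, 1 + band) * advantage)

# A rare token (p_old = 0.05) whose ratio jumped to 1.5 with advantage 1:
# fixed clipping caps its contribution at 1.2, the wider band allows 1.39.
fixed = -float(ppo_clip_loss(1.5, 1.0))
aware = -float(bandpo_like_loss(1.5, 1.0, p_old=0.05))
```

Because common tokens keep a near-standard band, the relaxation is targeted at exactly the tail strategies the summary says fixed clipping suppresses.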

research

Reinforcement fine-tuning preserves model knowledge better than supervised fine-tuning, study finds

A new study on Qwen2.5-VL reveals reinforcement fine-tuning (RFT) significantly outperforms supervised fine-tuning (SFT) at preserving a model's existing knowledge during post-training adaptation. While SFT enables faster task learning, it causes catastrophic forgetting; RFT learns more slowly but maintains prior knowledge by reinforcing samples naturally aligned with the base model's probability landscape.

research

Self-confidence signals enable unsupervised reward training for text-to-image models

Researchers introduce SOLACE, a post-training framework that replaces external reward models with an internal self-confidence signal derived from how accurately a text-to-image model recovers injected noise. The method enables fully unsupervised optimization and shows measurable improvements in compositional generation, text rendering, and text-image alignment.
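The self-confidence signal can be sketched as a noise-recovery score. This is a minimal stand-in, assuming the reward is the negative error between injected and predicted noise; `self_confidence_reward` is an illustrative name, and real diffusion models predict noise per timestep rather than in one shot.

```python
import numpy as np

def self_confidence_reward(predicted_noise, injected_noise):
    # SOLACE-style internal signal (sketch): reward the model by how
    # accurately it recovers the noise injected into its own sample.
    # Lower recovery error = higher self-confidence = higher reward.
    err = np.mean((np.asarray(predicted_noise, dtype=float)
                   - np.asarray(injected_noise, dtype=float)) ** 2)
    return -float(err)

rng = np.random.default_rng(0)
noise = rng.standard_normal(64)
# A model that nearly recovers the injected noise earns a high reward...
good = self_confidence_reward(noise + 0.05 * rng.standard_normal(64), noise)
# ...while an unrelated prediction earns a much lower one.
bad = self_confidence_reward(rng.standard_normal(64), noise)
```

The point of the construction is that no external reward model or labeled preference data appears anywhere; the signal comes entirely from the generator's own denoising behavior.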

research

Study shows RL training enables LLMs to abstain on unanswerable temporal questions, outperforming GPT-4o

A new arXiv study presents the first systematic evaluation of training large language models to abstain—refuse to answer—on temporal questions they cannot reliably answer. Using reinforcement learning with abstention-aware rewards, researchers achieved 3.46-5.80% higher accuracy on temporal QA benchmarks than GPT-4o, while improving true positive rates on unanswerable questions by 20%.
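An abstention-aware reward can be sketched as a small payoff table. The exact reward values and the `"ABSTAIN"` token below are hypothetical; the paper's shaping will differ, but the structure is the same: reward refusal on unanswerable questions, penalize it on answerable ones.

```python
def abstention_reward(answer, gold, answerable):
    # Hypothetical abstention-aware reward shaping for temporal QA:
    # correct answers earn full reward, abstaining on unanswerable
    # questions is rewarded, and confident wrong answers are penalized.
    abstained = answer == "ABSTAIN"
    if not answerable:
        return 1.0 if abstained else -1.0   # refusing is the right move
    if abstained:
        return -0.5                          # over-cautious on answerable input
    return 1.0 if answer == gold else -1.0

r_refuse = abstention_reward("ABSTAIN", None, answerable=False)
r_wrong = abstention_reward("1950", "1947", answerable=True)
r_timid = abstention_reward("ABSTAIN", "1947", answerable=True)
```

Making the wrong-answer penalty harsher than the over-abstention penalty is what pushes the policy toward refusing when it cannot reliably answer, which is the behavior the study trains for.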

2 min read · via arxiv.org
research

Knowledge graphs enable smaller models to outperform GPT-5.2 on complex reasoning

A new training approach using knowledge graphs as implicit reward models enables a 14-billion-parameter model to outperform much larger systems like GPT-5.2 and Gemini 3 Pro on complex multi-hop reasoning tasks. Researchers combined supervised fine-tuning and reinforcement learning with knowledge graph path signals to ground models in verifiable domain facts.

2 min read · via arxiv.org
research

New RLVR method reformulates reward-based LLM training as classification problem

A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.
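The reframing can be contrasted in a few lines. This is a toy sketch under my own simplifications, not REAL's actual loss: `scalar_weighted_loss` stands in for GRPO-style advantage weighting, and `rewards_as_labels_loss` treats each binary verifiable reward as a classification target for the sequence probability.

```python
import math

def scalar_weighted_loss(logps, rewards):
    # GRPO-style scalar weighting (sketch): group-centered rewards
    # scale each sampled response's log-probability.
    mean_r = sum(rewards) / len(rewards)
    return -sum((r - mean_r) * lp for lp, r in zip(logps, rewards))

def rewards_as_labels_loss(logps, rewards):
    # Classification view (sketch): the verifiable reward (0 or 1)
    # becomes the label, and cross-entropy is applied to the model's
    # probability of the response, instead of weighting the gradient.
    loss = 0.0
    for lp, r in zip(logps, rewards):
        p = math.exp(lp)  # model probability of the whole response
        loss += -(r * math.log(p) + (1 - r) * math.log(1 - p))
    return loss / len(logps)

logps = [-0.5, -2.0]   # sequence log-probs of two sampled responses
rewards = [1, 0]       # verifiable rewards: first correct, second not
grpo_style = scalar_weighted_loss(logps, rewards)
real_style = rewards_as_labels_loss(logps, rewards)
```

Under the classification view, an incorrect response contributes a gradient that actively pushes its probability down toward the 0 label, rather than merely receiving a negative scalar weight.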

research

ELMUR extends RL memory horizons 100,000x with structured external memory architecture

Researchers introduce ELMUR, a transformer variant that adds structured external memory to handle long-horizon reinforcement learning problems under partial observability. The system extends effective decision-making horizons beyond standard attention windows by up to 100,000x and achieves 100% success on synthetic tasks with corridors spanning one million steps.

research

RAPO framework improves LLM agent reasoning by combining retrieval with reinforcement learning

Researchers introduce RAPO (Retrieval-Augmented Policy Optimization), a reinforcement learning framework that improves LLM agent reasoning by incorporating off-policy retrieval signals during training. The method achieves an average 5.0% performance gain across fourteen datasets and trains 1.2x faster than existing agentic RL approaches.

research

New RL framework CORE helps LLMs bridge gap between solving math problems and understanding concepts

Researchers have identified a critical gap in how large language models learn mathematics: they can solve problems but often don't understand the underlying concepts. A new reinforcement learning framework called CORE addresses this by using explicit concept definitions as training signals, rather than just reinforcing correct final answers.

research

Perception-R1 uses visual reward signals to improve multimodal AI reasoning

Researchers propose Perception-R1, a method that adds visual perception reward signals to reinforcement learning training for multimodal AI models. The approach achieves state-of-the-art results on multiple reasoning benchmarks using just 1,442 training examples by explicitly teaching models to accurately perceive visual content before reasoning about it.

research

Researchers identify divergence term selection as key to preventing LLM performance collapse in RL fine-tuning

A new paper identifies a fundamental flaw in standard reinforcement learning fine-tuning approaches for large language models: the choice of divergence term directly causes the degradation of multi-attempt performance (Pass@k) despite single-attempt improvements. Researchers propose Diversity-Preserving Hybrid RL (DPH-RL), which uses mass-covering f-divergences to maintain broad solution coverage and prevent catastrophic forgetting.
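The divergence-choice argument can be made concrete with forward vs. reverse KL, the simplest mass-covering vs. mode-seeking pair (the paper considers f-divergences more broadly; the distributions below are made up for illustration).

```python
import numpy as np

def kl(p, q):
    # KL(p || q) over a discrete support; terms where p is zero vanish.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# The reference policy spreads mass over three valid solution modes;
# the fine-tuned policy has nearly collapsed onto one of them.
ref = np.array([0.4, 0.3, 0.3])
tuned = np.array([0.98, 0.01, 0.01])

reverse_kl = kl(tuned, ref)   # mode-seeking: mild penalty for the collapse
forward_kl = kl(ref, tuned)   # mass-covering: large penalty for dropped modes
```

A reverse-KL penalty (the usual RLHF choice) barely objects to the collapse, which is consistent with single-attempt gains masking a Pass@k drop; a mass-covering term makes abandoning solution modes expensive, preserving the diversity that multi-attempt metrics measure.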

research

VideoTemp-o3 combines temporal grounding with video QA in single agentic framework

Researchers have introduced VideoTemp-o3, a unified framework that addresses limitations in long-video understanding by combining temporal grounding and question-answering in a single agentic system. The approach uses a unified masking mechanism during training and reinforcement learning with dedicated reward signals to improve video segment localization and reduce hallucinations.

research

Researchers propose VCPO to stabilize asynchronous RL training for LLMs, cutting training time 2.5x

A new technique called Variance Controlled Policy Optimization (VCPO) addresses a fundamental problem in asynchronous reinforcement learning for LLMs: high variance in policy-gradient estimates from stale rollouts. The method scales learning rates based on effective sample size and applies a minimum-variance baseline, reducing long-context training time by 2.5x while maintaining synchronous performance.
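The effective-sample-size idea can be sketched with the standard Kish estimator. The `scaled_lr` damping rule below is my own illustrative stand-in for VCPO's scaling, not the paper's formula.

```python
import numpy as np

def effective_sample_size(weights):
    # Kish effective sample size of importance weights: equals n for
    # uniform weights and shrinks as stale rollouts skew the weights,
    # signaling higher variance in the policy-gradient estimate.
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def scaled_lr(base_lr, weights):
    # Hypothetical VCPO-style rule: damp the learning rate by ESS / n
    # so high-variance (stale) batches take proportionally smaller steps.
    return base_lr * effective_sample_size(weights) / len(weights)

# Fresh rollouts (uniform importance weights) keep the full learning rate...
lr_fresh = scaled_lr(1e-4, [1.0, 1.0, 1.0, 1.0])
# ...while stale, skewed weights shrink the step to match their lower
# effective sample count.
lr_stale = scaled_lr(1e-4, [4.0, 0.1, 0.1, 0.1])
```

Tying the step size to a variance proxy like ESS is what lets asynchronous training tolerate stale rollouts without diverging, which is how a 2.5x wall-clock saving can coexist with synchronous-level performance.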