LLM News

Every LLM release, update, and milestone.

research

Researchers propose WIM rating system to replace subjective numerical scores in LLM training

A new research paper introduces the What Is Missing (WIM) rating system, which generates model output rankings from natural-language feedback rather than subjective numerical scores. The approach integrates into existing LLM training pipelines and claims to reduce ties and increase training signal clarity compared to discrete ratings.
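The summary only names the idea, so here is a minimal illustrative sketch (the function and feedback format are hypothetical, not from the paper) of deriving a ranking from "what is missing" feedback by counting the distinct missing elements a judge lists for each output:

```python
def rank_from_missing(feedback):
    """Rank output ids by how few distinct 'missing' items a judge listed.

    feedback maps output_id -> list of missing-element strings extracted
    from natural-language judge feedback. Fewer missing items ranks higher;
    ties break lexicographically so the ordering is deterministic.
    """
    return sorted(feedback, key=lambda oid: (len(set(feedback[oid])), oid))
```

Because rankings come from counts of concrete omissions rather than a 1-to-10 score, two outputs tie only when they miss the same number of elements, which is the claimed source of reduced ties.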

2 min read · via arxiv.org
research

Researchers Identify 'Contextual Inertia' Bug in LLMs During Multi-Turn Conversations

Researchers have identified a critical failure mode in large language models called 'contextual inertia'—where models ignore new information in multi-turn conversations and rigidly stick to previous reasoning. A new training method called RLSTA uses single-turn performance as an anchor to stabilize multi-turn reasoning and recover performance lost to this phenomenon.
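RLSTA's exact formulation isn't given in the summary; as a hedged sketch (the function and the hinge-penalty form are assumptions, not the paper's loss), a single-turn anchor can be expressed as a penalty whenever single-turn performance falls below a frozen reference:

```python
def anchored_objective(multi_turn_reward, single_turn_reward,
                       single_turn_anchor, beta=0.5):
    """Hypothetical single-turn-anchored objective (not RLSTA's exact form).

    The multi-turn reward is penalized whenever single-turn performance drops
    below a frozen anchor, discouraging updates that trade away single-turn
    reasoning for multi-turn gains.
    """
    regression = max(0.0, single_turn_anchor - single_turn_reward)
    return multi_turn_reward - beta * regression
```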

research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach addresses a critical limitation: canonical clipping disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. Experiments show consistent improvements over standard clipping methods.
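To make the contrast concrete, here is canonical PPO clipping next to a hypothetical probability-aware band (the widening rule below is illustrative, not BandPO's published formula): the interval expands for low-probability tokens, so a high-advantage tail action is cut off later than under the fixed interval.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Canonical PPO: fixed symmetric clipping interval [1 - eps, 1 + eps].
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantage, clipped * advantage)

def band_clip_objective(ratio, advantage, p_old, eps=0.2):
    # Hypothetical probability-aware band (illustrative, not BandPO's exact
    # rule): widen the interval for low-probability tokens so high-advantage
    # tail strategies are suppressed less than under fixed clipping.
    width = eps * (2.0 - p_old)          # p_old in (0, 1]
    clipped = np.clip(ratio, 1.0 - width, 1.0 + width)
    return np.minimum(ratio * advantage, clipped * advantage)
```

For a rare token (p_old = 0.1) with ratio 1.5 and advantage 1, the fixed rule caps the objective at 1.2 while the band allows 1.38, letting more gradient through on exactly the tail strategies the paper says fixed clipping suppresses.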

research · NVIDIA

POET-X reduces LLM training memory by 40%, enables billion-parameter models on single H100

Researchers introduce POET-X, a memory-efficient variant of the Reparameterized Orthogonal Equivalence Training framework that reduces computational overhead in LLM training. The method enables pretraining of billion-parameter models on a single NVIDIA H100 GPU, where standard optimizers like AdamW exhaust memory.
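Orthogonal equivalence training keeps a frozen weight matrix and learns orthogonal factors around it. POET-X's memory-saving tricks aren't described here; the sketch below only shows one standard way (the Cayley transform, an assumption on my part) to parameterize such an orthogonal factor from a small number of free parameters:

```python
import numpy as np

def cayley_orthogonal(free_params, n):
    """Build an n x n orthogonal matrix from n*(n-1)/2 free parameters.

    Orthogonal-equivalence training keeps a frozen weight W0 and learns
    orthogonal factors R so W = R @ W0 stays spectrum-preserving. The
    Cayley transform Q = (I - A)(I + A)^(-1) with skew-symmetric A is one
    common parameterization; POET-X's specific memory reductions are omitted.
    """
    A = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)
    A[iu] = free_params          # fill upper triangle with free parameters
    A = A - A.T                  # make A skew-symmetric
    I = np.eye(n)
    return (I - A) @ np.linalg.inv(I + A)
```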

research

New Method Reduces AI Over-Refusal Without Sacrificing Safety Alignment

A new alignment technique called Discernment via Contrastive Refinement (DCR) addresses a persistent problem in safety-aligned LLMs: over-refusal, where models reject benign requests as toxic. The method uses contrastive refinement to help models better distinguish genuinely harmful prompts from superficially toxic ones, reducing refusals while preserving safety.
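As a rough illustration of the contrastive idea (the loss below is a generic hinge formulation I am assuming, not DCR's published objective), one can push the model's refusal scores for genuinely harmful prompts above those for superficially toxic but benign ones:

```python
import numpy as np

def contrastive_refusal_loss(harmful_scores, benign_scores, margin=1.0):
    """Hypothetical contrastive objective (illustrative, not DCR's exact loss).

    harmful_scores: refusal scores for genuinely harmful prompts.
    benign_scores: refusal scores for superficially toxic but benign prompts.
    The hinge term pushes every harmful score at least `margin` above every
    benign score, separating what to refuse from what merely sounds toxic.
    """
    h = np.asarray(harmful_scores, float)[:, None]
    b = np.asarray(benign_scores, float)[None, :]
    return float(np.mean(np.maximum(0.0, margin - (h - b))))
```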

research

Researchers develop data synthesis method to improve multimodal AI reasoning on charts and documents

A new research paper proposes COGS (COmposition-Grounded data Synthesis), a framework that decomposes questions into primitive perception and reasoning factors to generate synthetic training data. The method substantially improves multimodal model performance on chart reasoning and document understanding tasks with minimal human annotation.
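A toy sketch of the compositional idea (the function and data shapes are my assumptions, not COGS itself): crossing perception factors (what must be read off a chart) with reasoning factors (what must be computed) yields synthetic questions whose answers are derivable without human annotation.

```python
import itertools

def synthesize_qa(perception_facts, reasoning_ops):
    """Cross perception factors with reasoning factors to get grounded QA.

    perception_facts: readings off a chart, e.g. {"revenue": [1, 2, 3]}.
    reasoning_ops: named computations applied to those readings.
    Each (fact, op) pair yields a synthetic question with a verifiable
    answer. (Illustrative sketch only, not the COGS framework.)
    """
    return [
        (f"What is the {op_name} of {fact_name}?", op(values))
        for (fact_name, values), (op_name, op) in itertools.product(
            perception_facts.items(), reasoning_ops.items())
    ]
```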

research

New RLVR method reformulates reward-based LLM training as classification problem

A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.
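The core reframing can be sketched in a few lines (notation assumed, not the paper's exact loss): the verifier's 0/1 outcome becomes a class label and a per-response score becomes a logit, so training minimizes cross-entropy instead of weighting a policy gradient by a scalar reward.

```python
import math

def rewards_as_labels_loss(score, reward):
    """Binary cross-entropy with the verifiable reward as the label.

    Instead of scaling a policy gradient by a scalar reward (as GRPO-style
    methods do), the 0/1 verifier outcome becomes a class label and `score`
    a logit for the response, turning RLVR into classification.
    (Sketch under assumed notation; not the paper's exact formulation.)
    """
    p = 1.0 / (1.0 + math.exp(-score))
    return -(reward * math.log(p) + (1 - reward) * math.log(1 - p))
```

Classification losses keep gradients well-scaled even when all sampled responses share the same reward, a regime where scalar-weighted group-relative updates degenerate.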

research

NeuroProlog framework combines neural networks with symbolic reasoning to fix LLM math errors

Researchers introduce NeuroProlog, a neurosymbolic framework that compiles math word problems into executable Prolog programs with formal verification guarantees. A multi-task "Cocktail" training strategy achieves significant accuracy improvements on GSM8K: +5.23% on Qwen-32B, +3.43% on GPT-OSS-20B, and +5.54% on Llama-3B compared to single-task baselines.
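A toy version of the compile-then-verify pipeline (the predicates and problem shape are my invention; the real system emits much richer programs and runs them under a Prolog engine with formal checks):

```python
def compile_to_prolog(quantities):
    """Compile a toy counting word problem into a Prolog program string.

    quantities: e.g. {"apples": 3, "oranges": 5}. Emits count/2 facts plus a
    total/1 rule, mirroring the compile step; execution and verification
    would happen in a Prolog engine in the real framework.
    """
    facts = [f"count({name}, {n})." for name, n in quantities.items()]
    rule = "total(T) :- findall(V, count(_, V), Vs), sum_list(Vs, T)."
    return "\n".join(facts + [rule])

def verify_total(quantities, claimed_total):
    # Stand-in for formal verification: re-derive the answer independently.
    return sum(quantities.values()) == claimed_total
```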

research

Code agents can evolve math problems into harder variants, study finds

A new study demonstrates that code agents can autonomously evolve existing math problems into more complex, solvable variations through systematic exploration. The multi-agent framework addresses a critical bottleneck in training advanced LLMs toward IMO-level mathematical reasoning by providing a scalable mechanism for synthesizing high-difficulty problems.
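The evolve-and-verify loop can be caricatured in a few lines (a toy stand-in for the multi-agent framework, with made-up mutation rules): each generation wraps the expression in one more operation, deepening the reasoning required while a solver confirms the variant stays solvable.

```python
import random

def evolve(expr, rng):
    """Mutate a math expression into a harder, still-solvable variant.

    Each generation wraps the expression in one more arithmetic operation,
    so reasoning depth grows while the verifier below can always confirm
    solvability. (Toy sketch, not the paper's multi-agent system.)
    """
    op = rng.choice(["+", "*"])
    k = rng.randint(2, 9)
    return f"({expr} {op} {k})"

def solve(expr):
    # Verifier: every evolved problem must remain exactly solvable.
    return eval(expr)
```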

2 min read · via arxiv.org
research

Researchers propose VCPO to stabilize asynchronous RL training for LLMs, cutting training time 2.5x

A new technique called Variance Controlled Policy Optimization (VCPO) addresses a fundamental problem in asynchronous reinforcement learning for LLMs: high variance in policy-gradient estimates from stale rollouts. The method scales learning rates based on effective sample size and applies a minimum-variance baseline, reducing long-context training time by 2.5x while maintaining synchronous performance.
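The two mechanisms named above can be sketched directly (the scaling rule is illustrative and the baseline uses a common score-free approximation; neither is claimed to be VCPO's exact formula): the effective sample size of the importance weights from stale rollouts shrinks the learning rate, and a variance-minimizing baseline is computed from those same weights.

```python
import numpy as np

def effective_sample_size(weights):
    # ESS of importance weights computed against stale off-policy rollouts.
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

def scaled_learning_rate(base_lr, weights):
    # Shrink the step in proportion to ESS/n: degenerate weights (very stale
    # data) mean fewer effective samples, so take a smaller step.
    return base_lr * effective_sample_size(weights) / len(weights)

def min_variance_baseline(weights, rewards):
    # Score-free approximation of the variance-minimizing baseline for an
    # importance-weighted policy gradient: b* = E[w^2 r] / E[w^2].
    w = np.asarray(weights, dtype=float)
    r = np.asarray(rewards, dtype=float)
    return float((w ** 2 * r).sum() / (w ** 2).sum())
```

With fresh on-policy data (all weights 1) the ESS equals n and the learning rate is untouched; as staleness concentrates the weights, the step shrinks toward zero, which is the stabilization the paper attributes to the method.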