LLM News

Every LLM release, update, and milestone.


research

Researchers propose WIM rating system to replace subjective numerical scores in LLM training

A new research paper introduces the What Is Missing (WIM) rating system, which derives rankings of model outputs from natural-language feedback rather than subjective numerical scores. The approach is designed to integrate into existing LLM training pipelines, and the authors claim it produces fewer ties and a clearer training signal than discrete ratings.

March 6, 2026 · 5:53 AM · 2 min read

llm-training preference-learning dpo

via arxiv.org ↗
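
To see why ties matter for preference learning, here is a minimal sketch of turning per-output ratings into (winner, loser) pairs of the kind a method such as DPO consumes. The scores below are made up for illustration, and the continuous values are only a stand-in for any finer-grained rating. This does not show how WIM computes its ratings, which the summary above does not specify.

```python
from itertools import combinations


def preference_pairs(scores):
    """Return (winner, loser) index pairs; tied pairs carry no preference and are dropped."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] > scores[j]:
            pairs.append((i, j))
        elif scores[j] > scores[i]:
            pairs.append((j, i))
    return pairs


# Hypothetical ratings for four outputs to the same prompt.
discrete = [4, 4, 3, 4]                 # 1-5 scale: three outputs tie
continuous = [0.81, 0.77, 0.52, 0.79]   # illustrative finer-grained scores

discrete_pairs = preference_pairs(discrete)
continuous_pairs = preference_pairs(continuous)
print(len(discrete_pairs), len(continuous_pairs))
```

With the discrete ratings, half of the six possible pairs are ties and yield no training example; the finer-grained scores order every pair.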

research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach targets a limitation of canonical clipping, which disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. The authors report consistent improvements over standard clipping methods in their experiments.

March 6, 2026 · 5:37 AM · 2 min read

reinforcement-learning ppo llm-training

via arxiv.org ↗
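
For context on the fixed clipping that BandPO replaces, the sketch below implements the standard PPO clipped surrogate for a single action and shows how a fixed ratio bound leaves low-probability actions very little absolute room to grow. The helper `max_prob_after_update` and the value `eps=0.2` are illustrative assumptions. BandPO's own probability-aware bounds are not reproduced here, since the summary above does not specify them.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Standard PPO clipped objective for one action.

    ratio is new_prob / old_prob; the clip keeps it within [1 - eps, 1 + eps].
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def max_prob_after_update(old_prob, eps=0.2):
    """Hypothetical helper: the largest new probability that still earns
    gradient when the advantage is positive, under a fixed ratio clip."""
    return min(1.0, old_prob * (1.0 + eps))


# The same ratio bound gives very different absolute headroom.
for old_prob in (0.01, 0.5):
    headroom = max_prob_after_update(old_prob) - old_prob
    print(f"old_prob={old_prob}: headroom={headroom:.3f}")
```

Under a fixed bound, an action at probability 0.01 can gain only about 0.002 before its gradient is cut off, while one at 0.5 can gain about 0.1. That asymmetry is the kind of tail suppression the summary describes.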