LLM News

Every LLM release, update, and milestone.

research

Researchers propose WIM rating system to replace subjective numerical scores in LLM training

A new research paper introduces the What Is Missing (WIM) rating system, which generates model output rankings from natural-language feedback rather than subjective numerical scores. The approach integrates into existing LLM training pipelines, and the authors claim it produces fewer ties and a clearer training signal than discrete ratings.

2 min read, via arxiv.org
research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach addresses a critical limitation: canonical clipping disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. Experiments show consistent improvements over standard clipping methods.
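
For context, the fixed clipping that BandPO is said to replace can be sketched as follows. This is a minimal illustration of the standard PPO clipped surrogate only, not BandPO's method, which the summary does not specify. The function names and the epsilon value of 0.2 are assumptions chosen for illustration.

```python
def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate from standard PPO (not BandPO).

    ratio is new_prob / old_prob for the sampled action; epsilon is the
    fixed clipping half-width (0.2 is an assumed, commonly used value).
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum removes any incentive to push the ratio
    # further outside the interval [1 - epsilon, 1 + epsilon].
    return min(ratio * advantage, clipped_ratio * advantage)


def max_unclipped_prob(old_prob, epsilon=0.2):
    """Largest new probability still inside the fixed clipping interval."""
    return min(1.0, old_prob * (1.0 + epsilon))


if __name__ == "__main__":
    # The same ratio bound allows very different absolute changes:
    # a rare action has far less room to grow than a common one.
    for p in (0.01, 0.5):
        print(f"old prob {p}: clipped above {max_unclipped_prob(p):.3f}")
```

Because the bound is on the ratio, an action with old probability 0.01 is clipped once it rises past about 0.012, while one at 0.5 can rise to about 0.6. This is one way to read the summary's point that fixed clipping weighs more heavily on low-probability, high-advantage actions.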