LLM News

Every LLM release, update, and milestone.

research

Researchers propose VCPO to stabilize asynchronous RL training for LLMs, cutting training time 2.5x

A new technique called Variance Controlled Policy Optimization (VCPO) addresses a fundamental problem in asynchronous reinforcement learning for LLMs: policy-gradient estimates computed from stale rollouts have high variance. The method scales the learning rate according to the effective sample size and applies a minimum-variance baseline, reducing long-context training time by 2.5x while matching the performance of synchronous training.
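To give a rough sense of what scaling the learning rate by effective sample size could look like, here is a minimal sketch. It is not the VCPO authors' implementation. It assumes the standard Kish effective sample size over importance weights (the ratio of current-policy to rollout-policy probabilities) and a square-root scaling rule. The function names, the `log_ratios` input, and the square-root rule are all illustrative assumptions not stated in the summary, and the minimum-variance baseline is not covered.

```python
import math


def effective_sample_size(log_ratios):
    """Kish effective sample size: (sum w)^2 / sum(w^2), with w_i = exp(log_ratio_i).

    log_ratios holds hypothetical per-sample values of
    log pi_current(a|s) - log pi_rollout(a|s).
    """
    # Subtracting the max keeps exp() stable; the ESS is unchanged by rescaling w.
    shift = max(log_ratios)
    weights = [math.exp(x - shift) for x in log_ratios]
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


def scaled_learning_rate(base_lr, log_ratios):
    """Shrink the learning rate as rollouts get staler (assumed sqrt rule)."""
    n = len(log_ratios)
    ess = effective_sample_size(log_ratios)
    # Assumption: scale by sqrt(ESS / N); the actual VCPO rule may differ.
    return base_lr * math.sqrt(ess / n)


if __name__ == "__main__":
    fresh = [0.0] * 8                 # on-policy batch: all weights equal
    stale = [0.0, 0.0, 0.0, 5.0]      # one sample dominates the batch
    print(effective_sample_size(fresh), scaled_learning_rate(1e-3, fresh))
    print(effective_sample_size(stale), scaled_learning_rate(1e-3, stale))
```

With fresh rollouts every weight is equal, so the effective sample size equals the batch size and the learning rate is unchanged. When a few stale samples dominate, the effective sample size collapses toward 1 and the step size shrinks with it.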