LLM News

Every LLM release, update, and milestone.

Filtered by: policy-optimization
research

EvoTool optimizes LLM agent tool-use policies via evolutionary algorithms without gradients

Researchers propose EvoTool, a gradient-free evolutionary framework that optimizes tool-use policies in LLM agents by decomposing them into four modules and iteratively improving each through blame attribution and targeted mutation. The approach outperforms GPT-4.1 and Qwen3-8B baselines by over 5 percentage points across four benchmarks.
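The paper's exact module decomposition and mutation operators are not reproduced here, but the core loop can be sketched as a gradient-free evolutionary search with blame-targeted mutation. Everything below (the module names, the toy fitness function, the blame heuristic) is an illustrative assumption, not EvoTool's actual implementation:

```python
import random

# Toy stand-ins for the four policy modules; names are hypothetical.
MODULES = ["tool_selection", "argument_filling", "result_parsing", "error_recovery"]

def evaluate(policy):
    """Stand-in fitness: reward policies whose module 'weights' are near 1.0."""
    return -sum((policy[m] - 1.0) ** 2 for m in MODULES)

def blame(policy):
    """Blame attribution (simplified): the module contributing most to the loss."""
    return max(MODULES, key=lambda m: (policy[m] - 1.0) ** 2)

def mutate(policy, blamed_module, scale=0.3):
    """Targeted mutation: perturb only the blamed module, leave the rest intact."""
    child = dict(policy)
    child[blamed_module] += random.gauss(0.0, scale)
    return child

def evolve(generations=200, pop_size=8, seed=0):
    random.seed(seed)
    population = [{m: random.uniform(0.0, 2.0) for m in MODULES}
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Elitist selection: keep the better half, no gradients anywhere.
        population.sort(key=evaluate, reverse=True)
        survivors = population[: pop_size // 2]
        children = [mutate(p, blame(p)) for p in survivors]
        population = survivors + children
    return max(population, key=evaluate)

best = evolve()
```

The point of the sketch is the control flow: evaluate, attribute blame to one module, mutate only that module, select. The real system would evaluate candidates on tool-use benchmarks rather than a synthetic fitness function.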

research

BandPO improves LLM reinforcement learning by replacing fixed clipping with probability-aware bounds

Researchers introduce BandPO, a method that replaces the fixed clipping mechanism in PPO with dynamic, probability-aware clipping intervals. The approach addresses a critical limitation: canonical clipping disproportionately suppresses high-advantage tail strategies and causes rapid entropy collapse. Experiments show consistent improvements over standard clipping methods.
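The contrast with standard PPO can be illustrated in a few lines. The bound formula below (widening the clip interval by the action's negative log-probability) is a hypothetical stand-in, not BandPO's actual rule; it only shows the mechanism of making the interval depend on the old policy's probability:

```python
import math

def fixed_clip(ratio, eps=0.2):
    """Canonical PPO clipping: fixed interval [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, ratio))

def band_clip(ratio, old_prob, eps=0.2):
    """Illustrative probability-aware band: widen eps by -log(old_prob),
    so rare (tail) actions get a larger trust region than common ones."""
    width = eps * (1.0 - math.log(max(old_prob, 1e-8)))
    return max(1.0 - width, min(1.0 + width, ratio))

def clipped_objective(ratio, advantage, clip_fn, **kw):
    """PPO surrogate: min of unclipped and clipped terms."""
    return min(ratio * advantage, clip_fn(ratio, **kw) * advantage)
```

For a rare action (`old_prob=0.01`) whose importance ratio has grown to 2.0 with a positive advantage, `fixed_clip` caps the ratio at 1.2, while `band_clip` leaves it at 2.0, so the high-advantage tail strategy is not suppressed.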

research

New RLVR method reformulates reward-based LLM training as classification problem

A new research paper proposes Rewards as Labels (REAL), a framework that reframes reinforcement learning with verifiable rewards as a classification problem rather than scalar weighting. The method addresses fundamental gradient optimization issues in current GRPO variants and demonstrates measurable improvements on mathematical reasoning benchmarks.
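The reframing can be sketched by contrasting the two loss shapes. The functions below are a simplified illustration, not REAL's actual objective: `p` is the policy's probability of a sampled completion and `r` is a binary verifiable reward (1 = passes the verifier):

```python
import math

def scalar_weighted_loss(p, advantage):
    """GRPO-style scalar weighting: advantage-scaled negative log-likelihood."""
    return -advantage * math.log(p)

def reward_as_label_loss(p, r):
    """Rewards as labels: binary cross-entropy with the reward as the target.
    Correct samples are pushed toward p = 1, incorrect ones toward p = 0."""
    return -(r * math.log(p) + (1 - r) * math.log(1 - p))
```

The qualitative difference: in the scalar-weighted form the reward rescales the gradient magnitude, whereas in the classification form it selects the target class, which changes the gradient's shape rather than just its size.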

research

RAPO framework improves LLM agent reasoning by combining retrieval with reinforcement learning

Researchers introduce RAPO (Retrieval-Augmented Policy Optimization), a reinforcement learning framework that improves LLM agent reasoning by incorporating off-policy retrieval signals during training. The method achieves an average 5.0% performance gain across fourteen datasets and trains 1.2x faster than existing agentic RL approaches.
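One way to picture mixing on-policy rollouts with retrieved off-policy signals is a batch builder like the sketch below. The replay store, the token-overlap retrieval score, and the batch composition are all hypothetical simplifications; RAPO's actual retrieval mechanism and off-policy correction are not specified here:

```python
def retrieve(store, query, k=2):
    """Return the k stored trajectories most similar to the query task.
    Similarity here is a toy token-overlap score, not a learned retriever."""
    def score(traj):
        return len(set(query.split()) & set(traj["task"].split()))
    return sorted(store, key=score, reverse=True)[:k]

def build_batch(task, rollout_fn, store, n_onpolicy=4, k_retrieved=2):
    """Training batch = fresh on-policy rollouts + retrieved off-policy
    trajectories. A real trainer would also apply importance weighting
    to the off-policy samples before computing policy gradients."""
    onpolicy = [rollout_fn(task) for _ in range(n_onpolicy)]
    offpolicy = retrieve(store, task, k=k_retrieved)
    return onpolicy + offpolicy
```

Reusing retrieved trajectories instead of generating every sample fresh is also a plausible source of the reported training speedup, since rollouts dominate agentic RL cost.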