
Researchers identify divergence term selection as key to preventing LLM performance collapse in RL fine-tuning

A new paper identifies a fundamental flaw in standard reinforcement learning fine-tuning of large language models: the choice of divergence term directly causes degradation of multi-attempt performance (Pass@k) even as single-attempt accuracy (Pass@1) improves. The researchers propose Diversity-Preserving Hybrid RL (DPH-RL), which uses mass-covering f-divergences to maintain broad solution coverage and prevent catastrophic forgetting.


The Pass@k Paradox in LLM Fine-Tuning

A central problem in reinforcement learning with verifiable rewards (RLVR) has gone largely unexamined: models fine-tuned to improve single-attempt accuracy (Pass@1) frequently lose performance on multi-attempt benchmarks (Pass@k). The degradation is accompanied by catastrophic forgetting, in which models discard previously learned skills entirely.

Researchers have proposed various mitigation strategies, but according to a new arXiv paper (2509.07430), the actual culprit has been overlooked: the divergence term itself.

Why Standard Approaches Fail

Standard RLVR objectives use one of two approaches:

  1. Reverse KL-divergence (mode-seeking): Actively narrows the policy, accelerating knowledge decay
  2. No divergence term: Provides no safeguard against drift from the model's diverse knowledge base

Both approaches fail to preserve the broad solution coverage that enables strong Pass@k performance. As models become specialized for high Pass@1 scores, they lose the diversity needed to succeed across multiple attempts.
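The asymmetry between the two KL directions is easy to see numerically. In the toy sketch below (an illustration, not code from the paper), `pi_ref` is a diverse reference policy over four hypothetical solution modes and `pi` collapses onto one of them: the reverse KL penalty stays bounded as the abandoned modes' probability shrinks, while the forward KL penalty grows without bound.

```python
import math

# Toy distributions over four candidate "solution modes" (illustrative only).
pi_ref = [0.25, 0.25, 0.25, 0.25]   # diverse pre-trained reference policy

def kl(p, q):
    """KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

# Collapse pi onto one mode, shrinking the probability of the other three.
for eps in (1e-1, 1e-2, 1e-4, 1e-6):
    pi = [1 - 3 * eps, eps, eps, eps]
    reverse_kl = kl(pi, pi_ref)   # mode-seeking penalty: stays bounded
    forward_kl = kl(pi_ref, pi)   # mass-covering penalty: diverges
    print(f"eps={eps:.0e}  reverse KL={reverse_kl:6.3f}  forward KL={forward_kl:6.3f}")
```

Because the reverse KL penalty plateaus (here near log 4), it barely discourages the collapse that hurts Pass@k; the forward KL penalty blows up exactly when a mode is abandoned.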

DPH-RL: Shifting to Mass-Covering Divergences

The proposed solution inverts conventional thinking: rather than treating the divergence term as a mere proximity constraint, DPH-RL uses it as the primary mechanism for preserving diversity.

Diversity-Preserving Hybrid RL (DPH-RL) employs mass-covering f-divergences—specifically forward-KL and Jensen-Shannon (JS) divergence—to function as a rehearsal mechanism. By continuously referencing the initial pre-trained policy, the framework forces models to maintain broad solution coverage even as they optimize for task performance.

The approach is computationally efficient: f-divergence computation uses generator functions and requires only sampling from the initial policy, eliminating the need for an expensive online reference model.
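A minimal sketch of what such a generator-function estimator can look like, under assumed conventions rather than the paper's exact formulation: writing the penalty as an expectation E over samples from `pi_ref` of a scalar function f of the likelihood ratio r(x) = pi(x)/pi_ref(x), both forward-KL and JS reduce to averaging f(r) over samples drawn once from the initial policy.

```python
import math
import random

random.seed(0)

# Generator-style penalties for D(pi_ref || pi), written as
# E_{x ~ pi_ref}[ f(r(x)) ] with likelihood ratio r(x) = pi(x) / pi_ref(x).
# (Assumed convention for illustration; only samples from pi_ref are needed.)
def f_forward_kl(r):
    return -math.log(r)              # E_ref[-log r] = KL(pi_ref || pi)

def f_js(r):
    # Jensen-Shannon divergence as an expectation under pi_ref.
    return 0.5 * (math.log(2 / (1 + r)) + r * math.log(2 * r / (1 + r)))

# Toy discrete "policies" over four tokens (hypothetical, for illustration).
vocab = [0, 1, 2, 3]
pi_ref = [0.25, 0.25, 0.25, 0.25]    # diverse initial policy
pi     = [0.70, 0.10, 0.10, 0.10]    # partially collapsed fine-tuned policy

def estimate(f, n=200_000):
    """Monte Carlo estimate using samples from pi_ref only."""
    xs = random.choices(vocab, weights=pi_ref, k=n)
    return sum(f(pi[x] / pi_ref[x]) for x in xs) / n

exact_fkl = sum(p * math.log(p / q) for p, q in zip(pi_ref, pi))
print(f"forward KL: estimate={estimate(f_forward_kl):.4f}  exact={exact_fkl:.4f}")
print(f"JS:         estimate={estimate(f_js):.4f}")
```

In an actual RLVR loop, the ratio r(x) would come from stored log-probabilities of pre-generated reference samples, so no second model needs to be kept in memory during training.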

Experimental Results

Extensive testing on math and SQL generation tasks demonstrates that DPH-RL:

  • Resolves Pass@k degradation that occurs in standard RLVR approaches
  • Improves both Pass@1 and Pass@k on in-domain and out-of-domain benchmarks
  • Reduces training overhead compared to methods requiring reference models
  • Maintains previously acquired skills without catastrophic forgetting

The paper establishes a straightforward principle: divergence term selection is not a minor hyperparameter—it fundamentally determines whether fine-tuned models become specialized dead-ends or remain general reasoners.

What This Means

This work addresses a practical problem that affects how production LLMs are optimized for complex tasks. The finding that mass-covering divergences preserve diversity while mode-seeking divergences destroy it suggests that many current fine-tuning pipelines are unnecessarily sacrificing multi-attempt performance. Organizations fine-tuning models for code generation, mathematics, and other reasoning tasks should evaluate whether their divergence choices align with actual performance goals. The framework's efficiency advantage—requiring no reference model—also makes it practical for adoption in existing training pipelines.