Reinforcement fine-tuning preserves model knowledge better than supervised fine-tuning, study finds
A new study on Qwen2.5-VL reveals reinforcement fine-tuning (RFT) significantly outperforms supervised fine-tuning (SFT) at preserving a model's existing knowledge during post-training adaptation. While SFT enables faster task learning, it causes catastrophic forgetting; RFT learns more slowly but maintains prior knowledge by reinforcing samples naturally aligned with the base model's probability landscape.
The study addresses a central challenge in continual model adaptation: teaching a model new skills without erasing what it already knows.
Researchers from multiple institutions systematically evaluated both approaches on the open-source Qwen2.5-VL multimodal model series. They introduced jigsaw puzzles—a novel task absent from pretraining corpora—to isolate and measure the knowledge preservation trade-off.
Key Findings
The experiments reveal a sharp contrast in behavior:
SFT: Enables rapid task acquisition but causes catastrophic forgetting of prior knowledge
RFT: Learns new tasks more slowly but maintains existing knowledge substantially better
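The mechanics behind this contrast can be made concrete with a toy sketch (all names here are hypothetical illustrations, not the study's code): a three-answer softmax "model" updated with a supervised cross-entropy gradient versus a baseline-free REINFORCE gradient. When the correct answer sits in a low-probability region, the SFT gradient yanks the distribution hard toward the label, while the RFT estimate barely moves because rollouts rarely sample the rare answer.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sft_gradient(logits, target_idx):
    """Cross-entropy gradient w.r.t. logits: pushes toward the expert
    label however unlikely the model currently finds it."""
    probs = softmax(logits)
    return [p - (1.0 if i == target_idx else 0.0) for i, p in enumerate(probs)]

def rft_gradient(logits, reward_fn, n_samples=1000):
    """REINFORCE-style estimate (no baseline, a simplification): samples
    come from the model's own distribution, so only answers the model
    already assigns probability to can be reinforced."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = random.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(a)
        for i in range(len(logits)):
            grad[i] += r * ((1.0 if i == a else 0.0) - probs[i]) / n_samples
    return grad

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# The base model strongly prefers answer 0; the "new" correct answer is 2.
logits = [2.0, 0.0, -2.0]
g_sft = sft_gradient(logits, target_idx=2)
g_rft = rft_gradient(logits, reward_fn=lambda a: 1.0 if a == 2 else 0.0)

print("SFT update magnitude:", round(norm(g_sft), 3))
print("RFT update magnitude:", round(norm(g_rft), 3))
```

The SFT update is an order of magnitude larger here, which illustrates why it learns faster but also perturbs the base distribution more aggressively.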
The researchers analyzed this phenomenon through learning dynamics, examining both the magnitude and direction of how training data influences prior knowledge. Their core finding: RFT primarily reinforces correct samples that are naturally aligned with the base model's probability landscape, resulting in weaker interference with existing capabilities.
Critically, the study suggests this difference stems primarily from the distribution of the post-training data rather than from the algorithms alone: RFT's simulated rollouts exert a smaller-magnitude influence on the model and stay better aligned, directionally, with prior knowledge than SFT's fixed expert-labeled targets do.
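One simple way to operationalize "magnitude and direction of influence" (an illustrative diagnostic, not necessarily the paper's exact formulation) is to compare the norm of the new-task gradient with its cosine similarity to the gradient on held-out prior-knowledge data: a large norm means a strong update, and a negative cosine means the new-task step actively increases the prior-knowledge loss. A minimal sketch on a two-parameter logistic model:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def logistic_grad(w, batch):
    """Average binary cross-entropy gradient for p(y=1|x) = sigmoid(w.x)."""
    g = [0.0] * len(w)
    for x, y in batch:
        p = 1.0 / (1.0 + math.exp(-dot(w, x)))
        for i in range(len(w)):
            g[i] += (p - y) * x[i] / len(batch)
    return g

w = [0.5, -0.3]
# Toy stand-ins: one batch representing prior knowledge, one the new task.
prior_batch = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
new_batch = [([1.0, 1.0], 1), ([0.5, -0.5], 1)]

g_prior = logistic_grad(w, prior_batch)
g_new = logistic_grad(w, new_batch)

magnitude = norm(g_new)                  # how strongly the new data pulls
alignment = cosine(g_new, g_prior)       # negative => interference

print("new-task update magnitude:", round(magnitude, 4))
print("alignment with prior-knowledge gradient:", round(alignment, 4))
```

Under this lens, the study's claim is that RFT-style rollouts tend to produce updates with both smaller magnitude and friendlier alignment than SFT's labeled targets.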
Validation Beyond Vision
The researchers validated their framework on Qwen2.5 post-training using math and scientific QA tasks, observing consistent forgetting and learning-dynamics trends across domains. This breadth suggests the findings generalize beyond initial vision experiments.
The study proposes a practical hybrid approach: training SFT on RFT-simulated rollouts allows models to rapidly learn new tasks while preserving prior knowledge better than standard SFT.
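In outline, this hybrid resembles rejection-sampling fine-tuning: sample rollouts from the current model, keep only the verified-correct ones, and use those as SFT targets. The sketch below is a hedged illustration with stub functions (`sample_rollout`, `verify`, and the arithmetic task are hypothetical placeholders, not the study's setup); the key property is that the kept targets come from the model's own distribution.

```python
import random

random.seed(0)

def sample_rollout(prompt):
    """Stub standing in for model generation: answers an addition
    prompt, correct about 70% of the time (hypothetical)."""
    a, b = prompt
    return str(a + b if random.random() < 0.7 else a + b + random.choice([-1, 1]))

def build_hybrid_sft_dataset(prompts, verify, k=8):
    """SFT-on-rollouts: for each prompt, sample up to k responses from
    the current model and keep the first verified-correct one. Because
    targets stay close to the model's own distribution, the subsequent
    SFT update interferes less with prior knowledge."""
    dataset = []
    for prompt in prompts:
        for _ in range(k):
            response = sample_rollout(prompt)
            if verify(prompt, response):
                dataset.append((prompt, response))
                break  # one correct rollout per prompt suffices here
    return dataset

prompts = [(2, 3), (10, 7), (4, 9)]
verify = lambda p, r: int(r) == p[0] + p[1]
sft_data = build_hybrid_sft_dataset(prompts, verify)
print(sft_data)
```

The collected pairs would then feed a standard SFT loop, giving the fast convergence of supervised training on data that behaves, distributionally, like RFT rollouts.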
What This Means
As models scale and deployment requires continual adaptation to new tasks, catastrophic forgetting becomes an increasingly costly problem. This research suggests RFT—already adopted by leading labs for preference optimization—offers not just performance benefits but structural advantages for knowledge preservation. For practitioners conducting post-training on multimodal models, the trade-off between learning speed (SFT) and stability (RFT) is now quantifiable rather than anecdotal. The finding that the data distribution matters more than the algorithm itself opens pathways for hybrid approaches that could deliver both rapid adaptation and knowledge retention.