Alibaba's Qwen team develops algorithm that doubles reasoning chain length in math problems
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains double from ~4,000 to 10,000+ tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming Deepseek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.
Alibaba's Qwen Team Doubles Reasoning Chain Length With Token-Weighted Training Algorithm
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that fundamentally changes how reinforcement learning assigns credit to individual tokens during reasoning model training. The breakthrough addresses a critical limitation: standard RL approaches reward all tokens equally, even though some steps are far more influential than others in determining reasoning quality.
The Problem With Current Credit Assignment
When language models learn to reason through reinforcement learning, they typically receive a simple pass/fail judgment at the sequence end, with that reward spread evenly across every token. A comma gets the same credit signal as a pivotal logical step. This flat reward structure causes reasoning chains to plateau—models learn to extend thoughts to a certain length and then stagnate.
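The flat reward structure described above can be made concrete with a minimal sketch. This is an illustration of standard outcome-reward credit assignment, not code from the paper: the single pass/fail score for the whole sequence is simply broadcast to every token position.

```python
def flat_token_advantages(sequence_reward: float, num_tokens: int) -> list[float]:
    """Standard outcome-reward RL: one scalar judgment for the whole
    sequence is copied to every token, so a comma receives exactly the
    same learning signal as a pivotal logical step."""
    return [sequence_reward] * num_tokens

# A 6-token reasoning chain that earned a pass (+1.0):
print(flat_token_advantages(1.0, 6))  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Because every position carries identical credit, gradient updates cannot distinguish influential reasoning steps from filler, which is the plateau FIPO is designed to break.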
Previous attempts to solve this relied on PPO-based methods using auxiliary value models pre-trained on synthetic chain-of-thought data. The Qwen team argues this makes it impossible to determine whether performance gains come from the algorithm or from leaked outside knowledge.
How FIPO Works
FIPO calculates each token's cumulative probability shift across all downstream tokens. Instead of judging a token in isolation, the algorithm asks: how does model behavior change after this token appears? Tokens that initiate productive reasoning chains receive higher rewards; tokens that steer the model toward dead ends receive lower ones.

Critically, FIPO requires no auxiliary model, eliminating knowledge leakage while matching PPO-based performance. The algorithm includes stability guardrails: a discount factor ensures nearby tokens carry more weight (their influence is easier to predict), and filters remove tokens where model drift between training steps exceeds thresholds. Without filtering, training crashed around step 70.
Benchmark Results
Testing on Qwen2.5-32B-Base (no prior long-CoT exposure) using only the public DAPO dataset:
- Reasoning chain length: DAPO stalls at ~4,000 tokens; FIPO reaches 10,000+
- AIME 2024 accuracy: 50% → 58% (outperforms Deepseek-R1-Zero-Math-32B at 47% and o1-mini at 56%)
- AIME 2025 accuracy: 38% → 43%
- Distribution shift: Entire length distribution shifted upward, not just outliers
Emergent Self-Verification Behavior
The model naturally developed four distinct reasoning phases during training without explicit instruction. Early phases produced shallow templates and linear chains. By phase three, the model spontaneously double-checked intermediate results using different approaches (switching from algebraic to geometric interpretation, for example). Phase four showed systematic multi-pass verification with step-by-step recalculation.
The researchers note this mirrors inference-time scaling strategies in OpenAI's o-series and Deepseek-R1, but emerges purely through reinforcement learning without synthetic long-CoT pre-training.
Significant Limitations
FIPO has only been validated on mathematical tasks. Testing scope was limited to:
- Single dataset (DAPO)
- Base models without long-CoT pre-training
- Mathematical problems only
Generalization to code, symbolic logic, or other domains remains unproven. Extended reasoning sequences also increase compute costs. Additionally, a performance gap persists relative to distillation from larger teacher models: pure RL extracts less capability than direct supervision from a stronger model.
What This Means
FIPO addresses a genuine bottleneck in reasoning model training by fixing how credit flows during RL. The algorithm's ability to achieve results without auxiliary models strengthens the case for pure RL approaches and may influence how other teams structure reasoning training. However, the math-only validation significantly limits claims about broader applicability. The planned open-source release could accelerate testing on other domains, but teams will need substantial additional work to determine whether FIPO's benefits transfer beyond mathematics.