
Alibaba's Qwen team develops algorithm that doubles reasoning chain length in math problems

TL;DR

Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains more than doubling, from ~4,000 to 10,000+ tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming DeepSeek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.


Alibaba's Qwen Team Doubles Reasoning Chain Length With Token-Weighted Training Algorithm

Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that fundamentally changes how reinforcement learning assigns credit to individual tokens during reasoning model training. The breakthrough addresses a critical limitation: standard RL approaches reward all tokens equally, even though some steps are far more influential than others in determining reasoning quality.

The Problem With Current Credit Assignment

When language models learn to reason through reinforcement learning, they typically receive a single pass/fail judgment at the end of the sequence, and that reward is spread evenly across every token. A comma gets the same credit signal as a pivotal logical step. This flat reward structure causes reasoning chains to plateau: models learn to extend their reasoning to a certain length and then stagnate.
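A minimal Python sketch of this flat credit assignment follows. The function name `uniform_token_credit` and the group normalization step (in the style of GRPO-like methods) are assumptions for illustration; the article itself only states that the sequence-level reward is spread evenly across tokens.

```python
def uniform_token_credit(rewards, lengths):
    """Give every token in a response the same sequence-level advantage.

    rewards: 1.0 for a correct final answer, 0.0 otherwise (one per response).
    lengths: number of tokens in each sampled response.
    Assumption: advantages are normalized within the sampled group.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards agree
    # Every token of a response receives the identical value,
    # whether it is a comma or a key deduction.
    return [[(r - mean) / std] * n for r, n in zip(rewards, lengths)]


# Four sampled answers to one problem: two correct, two wrong.
credit = uniform_token_credit([1.0, 0.0, 0.0, 1.0], [3, 2, 4, 3])
```

In this toy group, each token of a correct response gets the same positive value and each token of a wrong response gets the same negative value, which is the lack of per-token resolution that FIPO targets.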

Previous attempts to solve this relied on PPO-based methods using auxiliary value models pre-trained on synthetic chain-of-thought data. The Qwen team argues this makes it impossible to determine whether performance gains come from the algorithm or from leaked outside knowledge.

How FIPO Works

FIPO calculates each token's cumulative probability shift across all downstream tokens. Instead of judging a token in isolation, the algorithm asks: How does model behavior change after this token appears? Tokens initiating productive reasoning chains receive higher rewards. Tokens sending the model toward dead ends receive less.
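The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Qwen team's implementation: the function names are hypothetical, the "probability shift" is modeled as the per-token difference in log-probability between the updated and previous policy, the sum starts at the token itself, and the exponential-and-clip mapping from influence to weight is a placeholder choice.

```python
import math


def downstream_shift(logp_new, logp_old):
    """For each position t, sum the log-probability shifts of tokens t..end.

    Assumption: the shift of a token is logp_new - logp_old.
    """
    shifts = [n - o for n, o in zip(logp_new, logp_old)]
    totals, running = [0.0] * len(shifts), 0.0
    for t in range(len(shifts) - 1, -1, -1):  # walk backward to build suffix sums
        running += shifts[t]
        totals[t] = running
    return totals


def token_weights(influence, low=0.5, high=2.0):
    """Map influence to a bounded multiplicative weight (hypothetical mapping)."""
    return [min(max(math.exp(v), low), high) for v in influence]


def weighted_advantages(seq_advantage, logp_new, logp_old):
    """Scale one sequence-level advantage by each token's weight."""
    weights = token_weights(downstream_shift(logp_new, logp_old))
    return [seq_advantage * w for w in weights]


# Toy sequence of four tokens; log-probabilities are made up for the example.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-0.9, -0.7, -1.0, -1.2]
adv = weighted_advantages(1.0, logp_new, logp_old)
```

Here the first two tokens precede a net increase in downstream probability and end up with weights above 1, while the last two precede a net decrease and end up below 1, so the same sequence-level reward is no longer shared equally.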

Critically, FIPO requires no auxiliary model, eliminating knowledge leakage while matching PPO-based performance. The algorithm includes two stability guardrails: a discount factor gives nearby tokens more weight (their influence is easier to predict), and a filter removes tokens where model drift between training steps exceeds a threshold. Without filtering, training crashed around step 70.
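The two guardrails can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the function name `discounted_influence`, the values of `gamma` and `drift_limit`, and the choice to measure drift as the absolute per-token log-probability shift. The article does not give the exact formula or thresholds.

```python
def discounted_influence(shifts, gamma=0.9, drift_limit=1.0):
    """Discounted sum of per-token shifts from each position to the end.

    shifts: per-token change in log-probability between training steps.
    gamma: discount factor; a token k steps ahead is scaled by gamma**k.
    drift_limit: tokens whose absolute shift exceeds this are dropped.
    Both default values are illustrative, not taken from the article.
    """
    influence, running = [0.0] * len(shifts), 0.0
    for t in range(len(shifts) - 1, -1, -1):
        kept = shifts[t] if abs(shifts[t]) <= drift_limit else 0.0  # filter
        running = kept + gamma * running  # nearby tokens count more
        influence[t] = running
    return influence


# The last token drifted far between training steps.
shifts = [0.5, 0.0, 0.2, 5.0]
filtered = discounted_influence(shifts, gamma=0.5, drift_limit=1.0)
unfiltered = discounted_influence(shifts, gamma=0.5, drift_limit=float("inf"))
```

With the filter on, the outlier contributes nothing and the first token's influence stays near 0.55. With the filter off, the same outlier more than doubles it to about 1.175, which gives a rough sense of how a few high-drift tokens could destabilize training.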

Benchmark Results

Testing on Qwen2.5-32B-Base (no prior long-CoT exposure) using only the public DAPO dataset:

  • Reasoning chain length: DAPO stalls at ~4,000 tokens; FIPO reaches 10,000+
  • AIME 2024 accuracy: 50% → 58% (outperforms DeepSeek-R1-Zero-Math-32B at 47% and o1-mini at 56%)
  • AIME 2025 accuracy: 38% → 43%
  • Distribution shift: Entire length distribution shifted upward, not just outliers
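To put these figures side by side, the short Python snippet below computes the absolute and relative change for each metric. It uses only the numbers reported above; because the length figures are approximate (~4,000 and 10,000+), the length ratio should be read as a rough estimate.

```python
# (baseline, FIPO) pairs taken from the reported results.
results = {
    "AIME 2024 accuracy (%)": (50.0, 58.0),
    "AIME 2025 accuracy (%)": (38.0, 43.0),
    "Response length (tokens)": (4000.0, 10000.0),
}


def gains(before, after):
    """Return (absolute change, ratio of after to before)."""
    return after - before, after / before


summary = {name: gains(*pair) for name, pair in results.items()}

for name, (delta, ratio) in summary.items():
    print(f"{name}: {delta:+.0f} ({ratio:.2f}x)")
```

On these numbers, the AIME 2024 gain is 8 percentage points (a 16% relative improvement), the AIME 2025 gain is 5 points, and response length grows by roughly 2.5x.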

Emergent Self-Verification Behavior

The model naturally developed four distinct reasoning phases during training without explicit instruction. Early phases produced shallow templates and linear chains. By phase three, the model spontaneously double-checked intermediate results using different approaches (switching from algebraic to geometric interpretation, for example). Phase four showed systematic multi-pass verification with step-by-step recalculation.

The researchers note this mirrors inference-time scaling strategies in OpenAI's o-series and DeepSeek-R1, but emerges purely through reinforcement learning without synthetic long-CoT pre-training.

Significant Limitations

FIPO has only been validated on mathematical tasks. Testing scope was limited to:

  • Single dataset (DAPO)
  • Base models without long-CoT pre-training
  • Mathematical problems only

Generalization to code, symbolic logic, or other domains remains unproven. Longer reasoning sequences also increase compute costs. Additionally, a performance gap persists compared with distillation from larger teacher models: pure RL teaches a model less than direct instruction from a stronger one.

What This Means

FIPO addresses a genuine bottleneck in reasoning model training by fixing how credit flows during RL. The algorithm's ability to achieve results without auxiliary models strengthens the case for pure RL approaches and may influence how other teams structure reasoning training. However, the math-only validation significantly limits claims about broader applicability. The planned open-source release could accelerate testing on other domains, but teams will need substantial additional work to determine whether FIPO's benefits transfer beyond mathematics.
