Alibaba's Qwen team develops algorithm that doubles reasoning chain length in math problems
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains double from ~4,000 to 10,000+ tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming Deepseek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.
Alibaba's Qwen Team Doubles Reasoning Chain Length With Token-Weighted Training Algorithm
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that fundamentally changes how reinforcement learning assigns credit to individual tokens during reasoning model training. The breakthrough addresses a critical limitation: standard RL approaches reward all tokens equally, even though some steps are far more influential than others in determining reasoning quality.
The Problem With Current Credit Assignment
When language models learn to reason through reinforcement learning, they typically receive a simple pass/fail judgment at the sequence end, with that reward spread evenly across every token. A comma gets the same credit signal as a pivotal logical step. This flat reward structure causes reasoning chains to plateau—models learn to extend thoughts to a certain length and then stagnate.
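The flat reward structure described above can be made concrete with a minimal sketch. This is an illustration of standard outcome-reward credit assignment, not code from the paper: the single pass/fail score for the whole sequence is simply broadcast to every token position.

```python
def flat_token_advantages(sequence_reward: float, num_tokens: int) -> list[float]:
    """Standard outcome-reward RL: one scalar judgment for the whole
    sequence is copied to every token, so a comma receives exactly the
    same learning signal as a pivotal logical step."""
    return [sequence_reward] * num_tokens

# A 6-token reasoning chain that earned a pass (+1.0):
print(flat_token_advantages(1.0, 6))  # [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Because every position carries identical credit, gradient updates cannot distinguish influential reasoning steps from filler, which is the plateau FIPO is designed to break.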
Previous attempts to solve this relied on PPO-based methods using auxiliary value models pre-trained on synthetic chain-of-thought data. The Qwen team argues this makes it impossible to determine whether performance gains come from the algorithm or from leaked outside knowledge.
How FIPO Works
FIPO calculates each token's cumulative probability shift across all downstream tokens. Instead of judging a token in isolation, the algorithm asks: how does model behavior change after this token appears? Tokens that initiate productive reasoning chains receive higher rewards; tokens that steer the model toward dead ends receive lower ones.

Critically, FIPO requires no auxiliary model, eliminating knowledge leakage while matching PPO-based performance. The algorithm includes stability guardrails: a discount factor ensures nearby tokens carry more weight (their influence is easier to predict), and filters remove tokens where model drift between training steps exceeds thresholds. Without filtering, training crashed around step 70.
Benchmark Results
Testing on Qwen2.5-32B-Base (no prior long-CoT exposure) using only the public DAPO dataset:
- Reasoning chain length: DAPO stalls at ~4,000 tokens; FIPO reaches 10,000+
- AIME 2024 accuracy: 50% → 58% (outperforms Deepseek-R1-Zero-Math-32B at 47% and o1-mini at 56%)
- AIME 2025 accuracy: 38% → 43%
- Distribution shift: Entire length distribution shifted upward, not just outliers
Emergent Self-Verification Behavior
The model naturally developed four distinct reasoning phases during training without explicit instruction. Early phases produced shallow templates and linear chains. By phase three, the model spontaneously double-checked intermediate results using different approaches (switching from algebraic to geometric interpretation, for example). Phase four showed systematic multi-pass verification with step-by-step recalculation.
The researchers note this mirrors inference-time scaling strategies in OpenAI's o-series and Deepseek-R1, but emerges purely through reinforcement learning without synthetic long-CoT pre-training.
Significant Limitations
FIPO has only been validated on mathematical tasks. Testing scope was limited to:
- Single dataset (DAPO)
- Base models without long-CoT pre-training
- Mathematical problems only
Generalization to code, symbolic logic, or other domains remains unproven. Extended reasoning sequences also increase compute costs. Additionally, a performance gap persists relative to distillation from larger teacher models: pure RL extracts less capability than direct supervision from a stronger model.
What This Means
FIPO addresses a genuine bottleneck in reasoning model training by fixing how credit flows during RL. The algorithm's ability to achieve results without auxiliary models strengthens the case for pure RL approaches and may influence how other teams structure reasoning training. However, the math-only validation significantly limits claims about broader applicability. The planned open-source release could accelerate testing on other domains, but teams will need substantial additional work to determine whether FIPO's benefits transfer beyond mathematics.