Alibaba's HopChain framework fixes vision model failures in multi-step reasoning tasks
Researchers from Alibaba's Qwen team and Tsinghua University developed HopChain, a framework that automatically generates multi-step image questions to address the way vision-language models fail on complex reasoning tasks. Training on this data improved scores on 20 of 24 tested benchmarks by forcing models to re-examine the image at each reasoning step, which keeps early perceptual errors from cascading through later steps.
Alibaba's Qwen team and Tsinghua University researchers identified and addressed a fundamental weakness in vision-language models (VLMs): their inability to maintain accuracy across multiple consecutive reasoning steps about images.
The Problem: Cascading Errors
VLMs consistently fail on tasks requiring extended chain-of-thought reasoning about images. A single perceptual error early in the reasoning chain—miscounting objects, confusing spatial relationships, misreading text, or hallucinating details—cascades through all subsequent steps, producing entirely incorrect final answers.
In documented examples:
- A model miscounted the dots on each of five ladybugs by one, and the per-insect errors added up to a significantly wrong total
- A model correctly identified a car's position but misread its movement direction in a parking sequence
- A model pointed to the wrong arc in an astronomical diagram, leading to an incorrect season identification
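A back-of-the-envelope model shows why a single early slip matters so much. Assume, purely for illustration (the source reports no such figures), that each of $n$ reasoning steps is perceived correctly with independent probability $p$, and that the final answer is right only if every step is right:

```latex
P(\text{final answer correct}) = p^{n}, \qquad \text{e.g. } 0.9^{6} \approx 0.53
```

Under that assumption, a model that is right 90% of the time per step gets a six-step chain right only about half the time.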
Existing training data for Reinforcement Learning with Verifiable Rewards (RLVR) rarely includes tasks demanding sustained visual attention across multiple steps, leaving this vulnerability unaddressed.
HopChain's Four-Stage Approach
The framework automatically generates multi-step image questions where each step forces models to re-examine the image closely. The data generation pipeline operates in four stages:
- Object identification: Qwen3-VL-235B-A22B-Thinking identifies object categories in images
- Instance localization: Meta's SAM3 segmentation model locates individual object instances
- Question generation: The language model builds multi-level questions around combinations of three to six objects
- Human verification: Four independent human annotators solve each question; only questions with unanimous agreement advance to training
This process generates 60,000 to 80,000 training examples per model. Each question resolves to a single, unambiguous number, which lets the answer be verified automatically. Some chains link up to six reasoning steps that combine arithmetic, counting, text recognition, and spatial reasoning.
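The last two ideas above, the unanimous-agreement filter and the numeric answer check, can be sketched in a few lines. This is a minimal illustration, not the authors' code: the names `Candidate`, `unanimous_answer`, and `numeric_reward` are hypothetical, and the sketch assumes answers are integers and that the model states its final answer as the last integer in its output.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A generated question plus the answers from independent annotators (hypothetical type)."""
    question: str
    annotator_answers: List[int]


def unanimous_answer(candidate: Candidate, required: int = 4) -> Optional[int]:
    """Return the agreed answer if exactly `required` annotators all match, else None."""
    answers = candidate.annotator_answers
    if len(answers) == required and len(set(answers)) == 1:
        return answers[0]
    return None


def numeric_reward(model_output: str, reference: int) -> float:
    """Return 1.0 if the last integer in the model output equals the reference, else 0.0.

    Assumption: the final answer is the last integer that appears in the output.
    """
    numbers = re.findall(r"-?\d+", model_output)
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == reference else 0.0


if __name__ == "__main__":
    kept = unanimous_answer(Candidate("How many red objects are left of the sign?", [17, 17, 17, 17]))
    print(kept, numeric_reward("Step 1 gives 5, so the total is 17.", 17))
```

A single numeric target keeps the reward binary and cheap to compute, which is the property that makes such questions usable for RLVR.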
Benchmark Results
Researchers trained two models using HopChain:
- Qwen3.5-35B-A3B (smaller model)
- Qwen3.5-397B-A17B (larger model)
HopChain improved 20 out of 24 tested benchmarks across four categories: STEM and puzzles, general image comprehension, text recognition/document comprehension, and video comprehension.
Specific gains on the smaller model:
- EMMA: 53 → 58
- CharXiv: 69 → 73.1
Gains on the larger model:
- BabyVision: 28.61 → 32.22
- ZeroBench: 4 → 8
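To compare these gains on a common footing, the short sketch below computes the absolute and relative improvement for each reported pair. The scores are copied from the lists above; the function name `score_gains` is only illustrative.

```python
# Scores as reported above: (benchmark, model, before, after).
REPORTED_SCORES = [
    ("EMMA", "Qwen3.5-35B-A3B", 53.0, 58.0),
    ("CharXiv", "Qwen3.5-35B-A3B", 69.0, 73.1),
    ("BabyVision", "Qwen3.5-397B-A17B", 28.61, 32.22),
    ("ZeroBench", "Qwen3.5-397B-A17B", 4.0, 8.0),
]


def score_gains(rows):
    """Return {benchmark: (absolute gain in points, relative gain in percent)}."""
    return {
        name: (round(after - before, 2), round((after - before) / before * 100, 1))
        for name, _model, before, after in rows
    }


if __name__ == "__main__":
    for name, (points, percent) in score_gains(REPORTED_SCORES).items():
        print(f"{name}: +{points} points ({percent}% relative)")
```

The relative view matters for low-scoring benchmarks: ZeroBench's 4-point gain is a doubling of the baseline, while EMMA's larger 5-point gain is a much smaller relative change.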
For particularly long reasoning chains, the larger model showed accuracy improvements exceeding 50 percentage points. Notably, despite training exclusively on images, both models improved on five out of six video benchmarks, suggesting the skills transfer beyond static images.
Ablation Study: Full Chains Required
An ablation study across five representative benchmarks demonstrated that complete question chains are essential:
- Full chains: 70.4 average score
- Halved chains: 66.7 average score
- Single-step questions: 64.3 average score
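Relative to full chains, halving the chains cost 3.7 points on average, and reducing them to single-step questions cost 6.1 points. The error breakdown shows that HopChain reduces all error categories roughly in proportion: perception, logic, knowledge, and hallucination errors all improved comparably, so the distribution of corrected errors closely tracks the original error profile.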
Known Limitations
The pipeline depends on SAM3 to recognize and segment objects, so images without clearly segmentable objects are excluded from data generation. This limitation reflects a broader weakness: recent research shows that even frontier models struggle with visual perception. Moonshot AI's WorldVQA benchmark found that top-scoring models correctly identified fewer than 50% of depicted objects while systematically overestimating their own accuracy. A Stanford analysis found that frontier models achieve 70-80% of their image benchmark scores without seeing the images at all, confidently hallucinating visual details.
What This Means
HopChain demonstrates that vision model reasoning failures aren't inherent but rather stem from inadequate training data targeting multi-step visual reasoning. The framework's broad improvements across unrelated benchmarks—without task-specific optimization—suggest the approach addresses fundamental visual understanding gaps. However, the work also surfaces deeper vulnerabilities: models' inability to reliably segment and recognize objects, and their tendency to hallucinate visual details while appearing confident. These limitations suggest current VLMs require more fundamental improvements in visual perception before complex reasoning becomes truly reliable.