Alibaba's HopChain framework fixes vision model failures in multi-step reasoning tasks
Researchers from Alibaba's Qwen team and Tsinghua University developed HopChain, a framework that automatically generates multi-step image questions to address a core failure mode of vision-language models in complex reasoning tasks. By forcing models to re-examine the image at every reasoning step, the method prevents early perceptual errors from cascading through subsequent steps and improved scores on 20 of 24 tested benchmarks.
Alibaba's Qwen team and Tsinghua University researchers identified and addressed a fundamental weakness in vision-language models (VLMs): their inability to maintain accuracy across multiple consecutive reasoning steps about images.
The Problem: Cascading Errors
VLMs consistently fail on tasks requiring extended chain-of-thought reasoning about images. A single perceptual error early in the reasoning chain—miscounting objects, confusing spatial relationships, misreading text, or hallucinating details—cascades through all subsequent steps, producing entirely incorrect final answers.
In documented examples:
- A model miscounted dots on ladybugs by one dot each across five beetles, compounding into a significantly wrong total
- A model correctly identified a car's position but misread its movement direction in a parking sequence
- A model pointed to the wrong arc in an astronomical diagram, leading to an incorrect season identification
Existing training data for Reinforcement Learning with Verifiable Rewards (RLVR) rarely includes tasks demanding sustained visual attention across multiple steps, leaving this vulnerability unaddressed.
HopChain's Four-Stage Approach
The framework automatically generates multi-step image questions where each step forces models to re-examine the image closely. The data generation pipeline operates in four stages:
- Object identification: Qwen3-VL-235B-A22B-Thinking identifies object categories in images
- Instance localization: Meta's SAM3 segmentation model locates individual object instances
- Question generation: The language model builds multi-level questions around combinations of three to six objects
- Human verification: Four independent human annotators solve each question; only questions with unanimous agreement advance to training
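The four stages above can be sketched as a simple pipeline. This is a minimal illustration, not the team's code: the model calls are stubbed out, and all function names, the example counts, and the unanimity check are assumptions based on the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    question: str
    answer: int                     # unique numeric answer for auto-verification
    votes: list = field(default_factory=list)  # answers from human annotators

def identify_objects(image_id):
    # Stage 1: a VLM (Qwen3-VL-235B-A22B-Thinking in the article) lists
    # object categories present in the image. Stubbed here.
    return ["ladybug", "leaf"]

def localize_instances(image_id, categories):
    # Stage 2: a segmentation model (Meta's SAM3 in the article) locates
    # individual instances of each category. Stubbed instance counts.
    return {"ladybug": 5, "leaf": 3}

def generate_question(instances):
    # Stage 3: the language model builds a multi-level question around
    # combinations of three to six objects, ending in a single number.
    question = ("Count the dots on each ladybug, sum them, then multiply "
                "by the number of leaves.")
    return Candidate(question=question, answer=15)

def passes_verification(cand):
    # Stage 4: four independent annotators solve the question; only
    # unanimous agreement with the reference answer advances to training.
    return len(cand.votes) == 4 and all(v == cand.answer for v in cand.votes)

cand = generate_question(localize_instances("img_001", identify_objects("img_001")))
cand.votes = [15, 15, 15, 15]
print(passes_verification(cand))  # True: all four annotators agree
```

The key design point this sketch captures is that every stage feeds the next, and the unanimity filter at the end trades data volume for label reliability.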
This process generates 60,000 to 80,000 training examples per model. Each question ends with a unique number serving as an automatic answer verification mechanism, with some chains involving up to six linked reasoning steps through arithmetic, counting, text recognition, and spatial reasoning.
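Because each chain resolves to a single number, checking an answer during RLVR training reduces to comparing the final number in the model's output against the reference. A minimal sketch of such a binary verifiable reward, with an illustrative regex and reward scheme that are assumptions rather than the paper's implementation:

```python
import re

def extract_final_number(text: str):
    # Pull the last integer or decimal from the model's output;
    # HopChain questions are designed to end in a unique number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def verifiable_reward(model_output: str, reference: float) -> float:
    # Binary reward: 1.0 for an exact match on the final number, else 0.0.
    predicted = extract_final_number(model_output)
    return 1.0 if predicted is not None and predicted == reference else 0.0

print(verifiable_reward("5 ladybugs with 3 dots each gives 15", 15))  # 1.0
print(verifiable_reward("I count 14 dots in total.", 15))             # 0.0
```

This kind of check is what makes the generated questions usable as RLVR training data: the reward requires no human in the loop at training time.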
Benchmark Results
Researchers trained two models using HopChain:
- Qwen3.5-35B-A3B (smaller model)
- Qwen3.5-397B-A17B (larger model)
HopChain improved scores on 20 of the 24 tested benchmarks across four categories: STEM and puzzles, general image comprehension, text recognition/document comprehension, and video comprehension.
Specific gains on the smaller model:
- EMMA: 53 → 58
- CharXiv: 69 → 73.1
Gains on the larger model:
- BabyVision: 28.61 → 32.22
- ZeroBench: 4 → 8
For particularly long reasoning chains, the larger model showed accuracy improvements exceeding 50 percentage points. Notably, despite training exclusively on images, both models improved on five out of six video benchmarks, suggesting the skills transfer beyond static images.
Ablation Study: Full Chains Required
An ablation study across five representative benchmarks demonstrated that complete question chains are essential:
- Full chains: 70.4 average score
- Halved chains: 66.7 average score
- Single-step questions: 64.3 average score
The error breakdown shows HopChain reduces all error categories proportionally: perception, logic, knowledge, and hallucination errors improved comparably, with the distribution of fixed errors closely tracking the original error profile.
Known Limitations
The pipeline requires SAM3 to recognize and segment objects, excluding images without clear segmentable objects from data generation. This limitation reflects a broader weakness: recent research shows even frontier models struggle with visual perception. Moonshot AI's WorldVQA benchmark found top-scoring models correctly identified less than 50% of depicted objects, while systematically overestimating accuracy. A Stanford analysis found frontier models achieve 70-80% of their image benchmark scores without seeing images at all, confidently hallucinating visual details.
What This Means
HopChain demonstrates that vision model reasoning failures aren't inherent but rather stem from inadequate training data targeting multi-step visual reasoning. The framework's broad improvements across unrelated benchmarks—without task-specific optimization—suggest the approach addresses fundamental visual understanding gaps. However, the work also surfaces deeper vulnerabilities: models' inability to reliably segment and recognize objects, and their tendency to hallucinate visual details while appearing confident. These limitations suggest current VLMs require more fundamental improvements in visual perception before complex reasoning becomes truly reliable.
Related Articles
Alibaba's Qwen team develops algorithm that doubles reasoning chain length in math problems
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains double from ~4,000 to 10,000+ tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming Deepseek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.
Apple's RubiCap model generates better image captions with 3-7B parameters than 72B competitors
Apple researchers developed RubiCap, a framework for training dense image captioning models that achieve state-of-the-art results at 2B, 3B, and 7B parameter scales. The 7B model outperforms models up to 72 billion parameters on multiple benchmarks including CapArena and CaptionQA, while the 3B variant matches larger 32B models, suggesting efficient dense captioning doesn't require massive scale.
AI offensive cyber capabilities doubling every 5.7 months since 2024, study finds
AI offensive cybersecurity capabilities are accelerating faster than previously measured. Lyptus Research's new study finds the doubling time has compressed from 9.8 months (since 2019) to 5.7 months (since 2024), with GPT-5.3 Codex and Opus 4.6 now solving tasks at 50% success rates that would take human security experts three hours.
Google study: AI benchmarks need 10+ human raters per example, not standard 3-5
A Google Research and Rochester Institute of Technology study reveals that standard AI benchmarking practices using three to five human evaluators per test example systematically underestimate human disagreement and produce unreliable model comparisons. The researchers found that at least ten raters per example are needed for statistically reliable results, and that budget allocation between test examples and raters matters as much as total budget size.