Alibaba's HopChain framework fixes vision model failures in multi-step reasoning tasks
Researchers from Alibaba's Qwen team and Tsinghua University developed HopChain, a framework that automatically generates multi-step image questions to address a core failure mode of vision-language models in complex reasoning tasks. By forcing models to re-examine the image at every reasoning step, the method prevents early perceptual errors from cascading through subsequent steps and improved scores on 20 of 24 tested benchmarks.
Alibaba's Qwen team and Tsinghua University researchers identified and addressed a fundamental weakness in vision-language models (VLMs): their inability to maintain accuracy across multiple consecutive reasoning steps about images.
The Problem: Cascading Errors
VLMs consistently fail on tasks requiring extended chain-of-thought reasoning about images. A single perceptual error early in the reasoning chain—miscounting objects, confusing spatial relationships, misreading text, or hallucinating details—cascades through all subsequent steps, producing entirely incorrect final answers.
In documented examples:
- A model miscounted dots on ladybugs by one dot each across five beetles, compounding into a significantly wrong total
- A model correctly identified a car's position but misread its movement direction in a parking sequence
- A model pointed to the wrong arc in an astronomical diagram, leading to an incorrect season identification
Existing training data for Reinforcement Learning with Verifiable Rewards (RLVR) rarely includes tasks demanding sustained visual attention across multiple steps, leaving this vulnerability unaddressed.
HopChain's Four-Stage Approach
The framework automatically generates multi-step image questions where each step forces models to re-examine the image closely. The data generation pipeline operates in four stages:
- Object identification: Qwen3-VL-235B-A22B-Thinking identifies object categories in images
- Instance localization: Meta's SAM3 segmentation model locates individual object instances
- Question generation: The language model builds multi-level questions around combinations of three to six objects
- Human verification: Four independent human annotators solve each question; only questions with unanimous agreement advance to training
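The four stages above can be sketched as a simple pipeline. This is a minimal illustration, not the team's code: the model calls are stubbed out, and all function names, the example counts, and the unanimity check are assumptions based on the description above.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    question: str
    answer: int                     # unique numeric answer for auto-verification
    votes: list = field(default_factory=list)  # answers from human annotators

def identify_objects(image_id):
    # Stage 1: a VLM (Qwen3-VL-235B-A22B-Thinking in the article) lists
    # object categories present in the image. Stubbed here.
    return ["ladybug", "leaf"]

def localize_instances(image_id, categories):
    # Stage 2: a segmentation model (Meta's SAM3 in the article) locates
    # individual instances of each category. Stubbed instance counts.
    return {"ladybug": 5, "leaf": 3}

def generate_question(instances):
    # Stage 3: the language model builds a multi-level question around
    # combinations of three to six objects, ending in a single number.
    question = ("Count the dots on each ladybug, sum them, then multiply "
                "by the number of leaves.")
    return Candidate(question=question, answer=15)

def passes_verification(cand):
    # Stage 4: four independent annotators solve the question; only
    # unanimous agreement with the reference answer advances to training.
    return len(cand.votes) == 4 and all(v == cand.answer for v in cand.votes)

cand = generate_question(localize_instances("img_001", identify_objects("img_001")))
cand.votes = [15, 15, 15, 15]
print(passes_verification(cand))  # True: all four annotators agree
```

The key design point this sketch captures is that every stage feeds the next, and the unanimity filter at the end trades data volume for label reliability.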
This process generates 60,000 to 80,000 training examples per model. Each question ends with a unique number serving as an automatic answer verification mechanism, with some chains involving up to six linked reasoning steps through arithmetic, counting, text recognition, and spatial reasoning.
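Because each chain resolves to a single number, checking an answer during RLVR training reduces to comparing the final number in the model's output against the reference. A minimal sketch of such a binary verifiable reward, with an illustrative regex and reward scheme that are assumptions rather than the paper's implementation:

```python
import re

def extract_final_number(text: str):
    # Pull the last integer or decimal from the model's output;
    # HopChain questions are designed to end in a unique number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None

def verifiable_reward(model_output: str, reference: float) -> float:
    # Binary reward: 1.0 for an exact match on the final number, else 0.0.
    predicted = extract_final_number(model_output)
    return 1.0 if predicted is not None and predicted == reference else 0.0

print(verifiable_reward("5 ladybugs with 3 dots each gives 15", 15))  # 1.0
print(verifiable_reward("I count 14 dots in total.", 15))             # 0.0
```

This kind of check is what makes the generated questions usable as RLVR training data: the reward requires no human in the loop at training time.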
Benchmark Results
Researchers trained two models using HopChain:
- Qwen3.5-35B-A3B (smaller model)
- Qwen3.5-397B-A17B (larger model)
HopChain improved scores on 20 of the 24 tested benchmarks across four categories: STEM and puzzles, general image comprehension, text recognition/document comprehension, and video comprehension.
Specific gains on the smaller model:
- EMMA: 53 → 58
- CharXiv: 69 → 73.1
Gains on the larger model:
- BabyVision: 28.61 → 32.22
- ZeroBench: 4 → 8
For particularly long reasoning chains, the larger model showed accuracy improvements exceeding 50 percentage points. Notably, despite training exclusively on images, both models improved on five out of six video benchmarks, suggesting the skills transfer beyond static images.
Ablation Study: Full Chains Required
An ablation study across five representative benchmarks demonstrated that complete question chains are essential:
- Full chains: 70.4 average score
- Halved chains: 66.7 average score
- Single-step questions: 64.3 average score
The error breakdown shows HopChain reduces all error categories proportionally: perception, logic, knowledge, and hallucination errors improved comparably, with the distribution of fixed errors closely tracking the original error profile.
Known Limitations
The pipeline requires SAM3 to recognize and segment objects, excluding images without clear segmentable objects from data generation. This limitation reflects a broader weakness: recent research shows even frontier models struggle with visual perception. Moonshot AI's WorldVQA benchmark found top-scoring models correctly identified less than 50% of depicted objects, while systematically overestimating accuracy. A Stanford analysis found frontier models achieve 70-80% of their image benchmark scores without seeing images at all, confidently hallucinating visual details.
What This Means
HopChain demonstrates that vision model reasoning failures aren't inherent but rather stem from inadequate training data targeting multi-step visual reasoning. The framework's broad improvements across unrelated benchmarks—without task-specific optimization—suggest the approach addresses fundamental visual understanding gaps. However, the work also surfaces deeper vulnerabilities: models' inability to reliably segment and recognize objects, and their tendency to hallucinate visual details while appearing confident. These limitations suggest current VLMs require more fundamental improvements in visual perception before complex reasoning becomes truly reliable.
Related Articles
Alibaba's Qwen team develops algorithm that doubles reasoning chain length in math problems
Alibaba's Qwen team has developed Future-KL Influenced Policy Optimization (FIPO), a training algorithm that assigns different weights to tokens based on their influence on subsequent reasoning steps, rather than treating all tokens equally. Testing on Qwen2.5-32B-Base showed reasoning chains double from ~4,000 to 10,000+ tokens, with AIME 2024 accuracy improving from 50% to 58%, outperforming Deepseek-R1-Zero-Math-32B (47%) and OpenAI's o1-mini (56%). The team plans to open-source the system.
Apple's RubiCap model generates better image captions with 3-7B parameters than 72B competitors
Apple researchers developed RubiCap, a framework for training dense image captioning models that achieve state-of-the-art results at 2B, 3B, and 7B parameter scales. The 7B model outperforms models up to 72 billion parameters on multiple benchmarks including CapArena and CaptionQA, while the 3B variant matches larger 32B models, suggesting efficient dense captioning doesn't require massive scale.
AI offensive cyber capabilities doubling every 5.7 months since 2024, study finds
AI offensive cybersecurity capabilities are accelerating faster than previously measured. Lyptus Research's new study finds the doubling time has compressed from 9.8 months (since 2019) to 5.7 months (since 2024), with GPT-5.3 Codex and Opus 4.6 now solving tasks at 50% success rates that would take human security experts three hours.
Google study: AI benchmarks need 10+ human raters per example, not standard 3-5
A Google Research and Rochester Institute of Technology study reveals that standard AI benchmarking practices using three to five human evaluators per test example systematically underestimate human disagreement and produce unreliable model comparisons. The researchers found that at least ten raters per example are needed for statistically reliable results, and that budget allocation between test examples and raters matters as much as total budget size.