
New framework improves VLM spatial reasoning through minimal information selection

A new research paper introduces MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent framework that improves Vision-Language Models' ability to reason about 3D spatial relationships. The method addresses two key bottlenecks: inadequate 3D understanding from 2D-centric training and reasoning failures from redundant information.


Vision-Language Models Struggle With 3D Spatial Reasoning

Vision-Language Models continue to struggle with grounding language in 3D spatial understanding, according to research published on arXiv (2510.16688). Researchers identify two fundamental problems: VLMs' 3D understanding capabilities remain limited due to 2D-centric pre-training approaches, and models fail at reasoning tasks when given redundant or excessive 3D information.

Dual-Agent Framework Pursues Information Minimality

The researchers introduce MSSR (Minimal Sufficient Spatial Reasoner), a dual-agent system that constructs a "Minimal Sufficient Set" (MSS) of information before answering spatial reasoning questions. The framework operates through two agents working in tandem:

Perception Agent: Programmatically queries 3D scenes through a perception toolbox to extract the information needed to answer a question. This agent includes a novel SOG (Situated Orientation Grounding) module, designed to extract language-grounded directional information from 3D scenes robustly.

Reasoning Agent: Iteratively refines the extracted information through a closed-loop process, pruning redundant details and requesting missing information until the MSS is curated.

The framework prioritizes both sufficiency (having enough information to answer correctly) and minimality (avoiding excessive or redundant details that degrade reasoning performance).
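The closed loop between the two agents can be illustrated with a short sketch. This is not the paper's actual implementation; the function names (`query_scene`, `curate_mss`), the dictionary-based scene, and the fact names are all illustrative assumptions standing in for the perception toolbox and the reasoning agent's refinement step:

```python
def query_scene(scene, fact_name):
    """Perception-agent stand-in: look up one spatial fact in the 3D scene.

    `scene` is a toy dict standing in for the perception toolbox."""
    return scene.get(fact_name)


def curate_mss(scene, required_facts, candidate_facts):
    """Reasoning-agent stand-in: iterate until the fact set is both
    minimal (no redundant details) and sufficient (nothing missing)."""
    # Start from whatever the perception agent initially extracted.
    mss = {}
    for name in candidate_facts:
        value = query_scene(scene, name)
        if value is not None:
            mss[name] = value
    # Minimality: prune facts not needed to answer the question.
    mss = {k: v for k, v in mss.items() if k in required_facts}
    # Sufficiency: request anything still missing from the perception agent.
    for name in required_facts:
        if name not in mss:
            mss[name] = query_scene(scene, name)
    return mss


# Toy scene with one redundant, non-spatial fact mixed in.
scene = {
    "chair_to_table_distance": 1.2,
    "lamp_color": "white",            # redundant for a spatial question
    "chair_facing_direction": "north",
}
mss = curate_mss(
    scene,
    required_facts={"chair_to_table_distance", "chair_facing_direction"},
    candidate_facts=["chair_to_table_distance", "lamp_color"],
)
print(mss)
# The redundant lamp_color fact is pruned; the missing direction is fetched.
```

In the real system both the sufficiency check and the pruning decisions are made by the reasoning agent itself, not by fixed set membership as in this sketch.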

State-of-the-Art Results Across Benchmarks

Extensive experiments demonstrate that MSSR achieves state-of-the-art performance on two challenging spatial reasoning benchmarks. The method significantly improves accuracy over baseline approaches, with the dual-agent architecture enabling explicit optimization for information minimality.

Beyond performance metrics, the framework produces interpretable reasoning paths that make its decision-making transparent. Researchers note this interpretability offers a valuable source of high-quality training data for developing future models.

Source code is available on GitHub at https://github.com/gyj155/mssr.

What This Means

The research reveals a fundamental insight: VLMs perform better on spatial reasoning when given carefully curated information rather than all available data. This challenges the assumption that more perceptual information automatically improves reasoning. The interpretable reasoning paths MSSR generates could accelerate development of spatial understanding in next-generation models by providing supervised training examples. The approach may influence how researchers design perception systems for multimodal AI going forward.
