Researchers develop data synthesis method to improve multimodal AI reasoning on charts and documents

A new research paper proposes COGS (COmposition-Grounded data Synthesis), a framework that decomposes questions into primitive perception and reasoning factors to generate synthetic training data. The method substantially improves multimodal model performance on chart reasoning and document understanding tasks with minimal human annotation.

A new arXiv paper (2510.15040) introduces COGS (COmposition-Grounded data Synthesis), a framework designed to equip multimodal large language models (MLLMs) with stronger reasoning capabilities for visual domains where human-annotated data is scarce.

The Problem: Limited Reasoning Data for Visual Domains

Multimodal models perform strongly across a wide range of tasks, but they struggle with complex reasoning in specialized visual domains such as charts, rendered documents, and webpages. Although images from these domains are abundant in practice, large-scale human-annotated reasoning datasets for them are scarce, creating a bottleneck for improving model capabilities.

How COGS Works

The framework decomposes each seed question into primitive perception and reasoning factors—the basic building blocks of understanding. These factors can then be systematically recombined with new images to generate large collections of synthetic question-answer pairs. Critically, each generated question includes subquestions and intermediate answers, enabling training via reinforcement learning with factor-level process rewards.
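The decompose-and-recombine idea can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the `Factor`, `SyntheticQA`, `decompose`, and `recombine` names, the template slots, and the `oracle` callback are all assumptions made for the example. The key point it shows is that once a seed question is split into perception and reasoning factors, each factor yields a subquestion whose intermediate answer can be computed automatically when the chart is rendered from known data.

```python
from dataclasses import dataclass

@dataclass
class Factor:
    kind: str       # "perception" (read a value) or "reasoning" (combine values)
    template: str   # subquestion template with named slots

@dataclass
class SyntheticQA:
    image_id: str
    subquestions: list   # list of (subquestion, intermediate answer) pairs
    final_answer: str

def decompose(seed_question: str) -> list:
    """Stand-in for the decomposition step; in practice the factoring is
    derived from the seed question, here it is hard-coded for one example."""
    return [
        Factor("perception", "What is the value of {series} in {y1}?"),
        Factor("perception", "What is the value of {series} in {y2}?"),
        Factor("reasoning", "By how much did {series} change from {y1} to {y2}?"),
    ]

def recombine(factors, image_id, slots, oracle):
    """Instantiate the factors on a new image. `oracle` answers each
    subquestion from the image's known ground truth, which is available
    when the chart is rendered from underlying data."""
    subqas = [(f.template.format(**slots), None) for f in factors]
    subqas = [(q, oracle(image_id, q)) for q, _ in subqas]
    # The last (reasoning) factor's answer is the final answer; every
    # earlier pair becomes a supervised intermediate step.
    return SyntheticQA(image_id, subqas, subqas[-1][1])
```

Swapping in a different image and slot assignment yields a new question-answer pair from the same factors, which is what lets a small seed set fan out into a large synthetic corpus.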

The approach is data-efficient: it generates large synthetic datasets from just a small set of seed questions, reducing reliance on costly human annotation.
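Because each synthetic question carries intermediate answers, a factor-level process reward can score a model's chain of subanswers rather than only its final answer. The following sketch is an assumption about how such a reward might be shaped; the equal weights and exact-match scoring are illustrative choices, not the paper's specification.

```python
def factor_level_reward(pred_subanswers, gold_subanswers,
                        pred_final, gold_final,
                        w_process=0.5, w_final=0.5):
    """Illustrative factor-level process reward: partial credit for each
    correct intermediate (factor) answer, plus credit for the final answer."""
    matches = sum(p.strip() == g.strip()
                  for p, g in zip(pred_subanswers, gold_subanswers))
    process = matches / max(len(gold_subanswers), 1)
    final = float(pred_final.strip() == gold_final.strip())
    return w_process * process + w_final * final
```

A reward of this shape gives the policy gradient signal at the granularity of individual perception and reasoning steps, so a model that reads the chart correctly but slips on the arithmetic is still rewarded for the perception factors it got right.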

Experimental Results

Experiments on chart reasoning demonstrated substantial performance improvements on unseen questions. The largest gains appeared on reasoning-heavy and compositional questions—exactly the types that current models struggle with most. A key finding: training with a factor-level mixture of different seed data improved transfer across multiple datasets, suggesting COGS develops generalizable reasoning capabilities rather than overfitting to specific datasets.

The framework extends beyond charts to other visual domains including webpages, indicating broader applicability.

What This Means

COGS addresses a critical limitation in multimodal AI: the scarcity of annotated reasoning data for specialized visual domains. By systematically generating synthetic training data while maintaining compositional structure, the approach could accelerate progress on visual reasoning tasks that are common in real-world applications but underrepresented in existing datasets. The method's emphasis on factor-level process rewards suggests a path toward more compositional, generalizable reasoning abilities rather than memorization. This could be particularly valuable for domains like document understanding and data visualization interpretation where reasoning complexity increases rapidly with question difficulty.