
Video AI models hit reasoning ceiling despite 1000x larger dataset, researchers find

An international research team released the largest video reasoning dataset to date—roughly 1,000 times larger than previous alternatives. Testing reveals that state-of-the-art models including Sora 2 and Veo 3.1 substantially underperform humans on reasoning tasks, suggesting the limitation isn't data scarcity but architectural constraints.



An international research team has released a new video reasoning benchmark—the largest of its kind—that exposes a fundamental limitation in current video AI models: they cannot match human reasoning capabilities, and simply training on more data won't close the gap.

The Dataset and Test Results

The new dataset is approximately 1,000 times larger than existing video reasoning benchmarks. Yet testing on this benchmark shows that even the most advanced models fall significantly short of human performance.

Specific models tested include OpenAI's Sora 2 and Google's Veo 3.1, both among the most capable video AI systems available. According to the researchers, both models demonstrate substantially lower reasoning accuracy compared to humans across tasks including maze navigation, 3D rotation understanding, tile puzzles, object counting, and physical predictions.
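The paper's evaluation boils down to per-task accuracy compared against a human baseline. A minimal sketch of that kind of scoring, with entirely hypothetical record data, category names, and baseline numbers (the researchers' actual harness and figures are not public in this article):

```python
from collections import defaultdict

def accuracy_by_task(records):
    """Compute per-category accuracy from (category, predicted, expected) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in records:
        total[category] += 1
        if predicted == expected:
            correct[category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}

# Hypothetical model outputs on tasks like those named in the benchmark.
records = [
    ("maze_navigation", "left", "left"),
    ("maze_navigation", "up", "down"),
    ("object_counting", "4", "5"),
    ("physical_prediction", "falls", "falls"),
]

model_acc = accuracy_by_task(records)

# Hypothetical human baselines for the same categories.
human_acc = {"maze_navigation": 0.95, "object_counting": 0.98,
             "physical_prediction": 0.90}

# Positive gap = humans outperform the model on that task category.
gaps = {cat: human_acc[cat] - acc for cat, acc in model_acc.items()}
```

With these toy numbers the model scores 0.5 on maze navigation against a 0.95 human baseline, the kind of per-category gap the researchers report across all five task types.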

Beyond Data Scaling

The research indicates that the performance gap cannot be addressed by scaling training data alone. This represents a critical finding: video reasoning requires architectural or methodological innovations beyond simply feeding larger datasets to existing models.

The tasks tested—spatial reasoning, physical simulation understanding, and logical sequence completion—represent core reasoning abilities that humans perform readily but that current video models struggle with even when exposed to vastly more training examples.

Implications for Video AI Development

This finding shifts the conversation around video AI capabilities. Rather than pursuing data collection as the primary path to improvement, researchers and companies developing video models will need to focus on:

  • Architectural changes that better encode spatial and temporal reasoning
  • New training objectives that explicitly optimize for logical inference
  • Integration of reasoning-specific mechanisms rather than purely scaling transformer-based approaches

The benchmark itself becomes a critical tool for the field, providing a standardized measurement of video reasoning that can guide future development. With data no longer the bottleneck, progress will depend on conceptual breakthroughs in how video models process and reason about spatial and physical information.

What This Means

Video AI is approaching a scaling plateau in reasoning tasks. Companies investing in video generation and understanding—including OpenAI, Google DeepMind, and others—will need to shift from data-centric strategies to algorithmic innovation. The research establishes that 1,000 times more data alone doesn't solve the problem, meaning the next generation of video models will require fundamentally different approaches to reasoning, not just bigger training runs.
