OmniVideoBench: New 1,000-question benchmark exposes gaps in audio-visual AI reasoning
Researchers have introduced OmniVideoBench, a large-scale evaluation framework of 1,000 manually verified question-answer pairs drawn from 628 videos ranging from a few seconds to 30 minutes in length, designed to measure synergistic audio-visual reasoning in multimodal large language models (MLLMs). Testing reveals a significant performance gap between open-source and closed-source MLLMs on genuine cross-modal reasoning tasks.
A new benchmark called OmniVideoBench has been released to address critical gaps in how multimodal AI models are evaluated on audio-visual understanding tasks.
The benchmark comprises 1,000 manually verified question-answer pairs derived from 628 diverse videos, with durations ranging from several seconds to 30 minutes. Each QA pair includes a step-by-step reasoning trace, and the questions span 13 distinct types, including temporal reasoning, spatial localization, counting, causal inference, and summarization.
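To make that structure concrete, here is a minimal sketch of what a single benchmark entry might look like. The field names (video_id, question_type, reasoning_steps, and so on) and the example question are illustrative assumptions, not the benchmark's published schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one OmniVideoBench-style entry.
# Field names and values are illustrative, not the published format.
@dataclass
class QAPair:
    video_id: str            # one of the 628 source videos
    duration_sec: float      # clip length, from a few seconds up to ~30 minutes
    question_type: str       # one of the 13 question types, e.g. "counting"
    question: str
    options: List[str]       # candidate answers, if multiple-choice
    answer: str
    reasoning_steps: List[str] = field(default_factory=list)  # step-by-step trace

example = QAPair(
    video_id="video_0421",
    duration_sec=184.0,
    question_type="counting",
    question="How many times does the speaker ring the bell shown on the table?",
    options=["2", "3", "4", "5"],
    answer="3",
    reasoning_steps=[
        "The bell on the table is visible in the opening shot (visual).",
        "Three distinct ring sounds occur while the bell is in frame (audio).",
        "Combining both cues gives the answer 3 (audio-visual fusion).",
    ],
)
```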
Key findings
Evaluation of multiple MLLMs on OmniVideoBench reveals substantial performance gaps. Open-source models significantly underperform compared to closed-source counterparts, indicating that genuine audio-visual reasoning—requiring synergistic understanding across both modalities—remains a challenging frontier for current models.
The benchmark emphasizes modality complementarity and logical consistency, addressing a core limitation in existing video understanding benchmarks. Researchers note that previous benchmarks often neglect one modality or integrate them in logically inconsistent ways, failing to assess true cross-modal reasoning.
Design methodology
Each video in OmniVideoBench was selected to require genuine fusion of audio and visual information for correct answering. All 1,000 QA pairs underwent manual verification to ensure correctness and uniqueness, distinguishing this benchmark from those relying on automated annotation pipelines.
The benchmark's scope spans practical video understanding challenges: identifying temporal sequences dependent on audio cues, localizing objects referenced in speech, counting entities with audio-visual context, inferring causal relationships across modalities, and generating comprehensive video summaries.
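Because results are reported across this spread of question types, a natural way to consume such a benchmark is to score a model's answers and break accuracy down per type. The loop below is a generic sketch under that assumption: `model_answer` is a placeholder for whatever MLLM inference call is used, and exact-match scoring is an illustrative choice, not the authors' official evaluation protocol.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable

def evaluate(
    pairs: Iterable[Dict[str, str]],
    model_answer: Callable[[Dict[str, str]], str],
) -> Dict[str, object]:
    """Score a model on QA pairs and break accuracy down by question type.

    Each pair is assumed to carry "question_type" and "answer" fields;
    `model_answer` stands in for an MLLM inference call, and exact-match
    scoring is an assumption rather than the benchmark's official metric.
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for pair in pairs:
        prediction = model_answer(pair)
        total[pair["question_type"]] += 1
        if prediction.strip().lower() == pair["answer"].strip().lower():
            correct[pair["question_type"]] += 1
    per_type = {qtype: correct[qtype] / total[qtype] for qtype in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall_accuracy": overall, "per_type_accuracy": per_type}
```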
Implications
The pronounced gap between model performance and human-level reasoning on OmniVideoBench suggests that current MLLMs struggle with genuine audio-visual integration. The shortfall is particularly stark for open-source models, highlighting a capability divide in the field.
The researchers plan to release OmniVideoBench publicly to accelerate development of MLLMs with stronger, more generalizable reasoning across modalities.
What this means
OmniVideoBench provides a much-needed evaluation framework for a critical gap in MLLM assessment. Most existing video benchmarks rely primarily on visual information, treating audio as supplementary. This benchmark's emphasis on genuine modality complementarity—where both audio and visual information are necessary for correct answers—creates a higher standard for video understanding evaluation. The significant performance gaps documented here suggest that even leading closed-source models haven't solved audio-visual reasoning, indicating substantial room for improvement across the field.