A large-scale benchmarking study finds that modern multimodal large language models (MLLMs) can extract information from business documents nearly as well as traditional OCR+MLLM pipelines. The research introduces an automated error analysis framework and suggests that careful schema design and prompt engineering can further close the performance gap.
New Research Challenges OCR as a Requirement for Document Extraction
A comprehensive benchmarking study questions whether optical character recognition (OCR) remains necessary for document information extraction when using modern multimodal large language models (MLLMs), suggesting image-only approaches can match traditional OCR-enhanced pipelines.
The paper, posted to arXiv (2603.02789), evaluates multiple off-the-shelf MLLMs on business document information extraction tasks. Researchers tested whether a simplified MLLM-only pipeline—which skips the traditional OCR preprocessing step—can achieve comparable performance to the conventional approach of first extracting text via OCR, then feeding both text and image to an MLLM.
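The two pipelines under comparison can be sketched as request builders. This is a minimal illustration, not the paper's code: the chat-message layout and field names here are hypothetical stand-ins for whatever MLLM API a given deployment uses.

```python
# Hypothetical request builders contrasting the two pipelines the paper
# benchmarks: image-only input vs. image plus an OCR transcript.

def build_image_only_request(image_b64: str, schema: dict) -> list[dict]:
    """Simplified MLLM-only pipeline: the model sees just the page image."""
    return [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Extract the fields {list(schema)} and return JSON."},
            {"type": "image", "data": image_b64},
        ],
    }]

def build_ocr_augmented_request(image_b64: str, ocr_text: str,
                                schema: dict) -> list[dict]:
    """Conventional pipeline: OCR text is supplied alongside the image."""
    messages = build_image_only_request(image_b64, schema)
    messages[0]["content"].insert(
        1, {"type": "text", "text": f"OCR transcript:\n{ocr_text}"})
    return messages
```

The only structural difference is one extra text part carrying the OCR transcript, which is why dropping OCR simplifies the pipeline without changing the downstream model call.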
Key Findings
The core finding is that powerful MLLMs can achieve comparable performance using image-only input without OCR preprocessing. This represents a potential simplification of document extraction workflows, as it eliminates a separate OCR step while maintaining effectiveness.
The researchers also found that performance can be further enhanced through deliberate design choices:
- Carefully structured extraction schemas
- Well-chosen few-shot examples (exemplars)
- Optimized prompting and instructions
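These three levers can be combined in a single prompt-assembly step. The sketch below is illustrative only: the JSON-Schema-style field definitions, the exemplar invoice, and the `build_prompt` helper are all invented for this example, not taken from the paper.

```python
import json

# Hypothetical extraction schema: field names, types, and descriptions
# spell out exactly what the model should return.
INVOICE_SCHEMA = {
    "invoice_number": {"type": "string", "description": "Printed invoice ID"},
    "issue_date":     {"type": "string", "description": "ISO 8601 date"},
    "total_amount":   {"type": "number", "description": "Grand total incl. tax"},
}

# One few-shot exemplar pairing a document snippet with its gold extraction.
EXEMPLAR = {
    "document": "INVOICE 2024-0117  Date: Jan 17, 2024  Total due: $1,250.00",
    "extraction": {"invoice_number": "2024-0117",
                   "issue_date": "2024-01-17",
                   "total_amount": 1250.00},
}

def build_prompt(schema: dict, exemplars: list[dict]) -> str:
    """Assemble schema, exemplars, and instructions into one prompt."""
    parts = ["Extract these fields and return JSON only:",
             json.dumps(schema, indent=2)]
    for ex in exemplars:
        parts += ["Example input:", ex["document"],
                  "Example output:", json.dumps(ex["extraction"])]
    parts.append("Now extract from the attached document image.")
    return "\n\n".join(parts)
```

A design point worth noting: putting descriptions in the schema itself (rather than in free-form instructions) keeps the field semantics adjacent to the output format the model must produce.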
Methodology: Automated Error Analysis
Beyond the main benchmark, the authors propose an automated hierarchical error analysis framework that uses large language models to systematically diagnose failure patterns. This approach moves beyond simple accuracy metrics to identify why extraction fails—differentiating between issues like text recognition, schema understanding, or instruction comprehension.
This framework could provide practical guidance for improving document extraction systems, as it pinpoints specific failure modes rather than aggregating results into a single performance number.
Implications for Document Processing
If confirmed at scale, these findings suggest that organizations could simplify document processing pipelines by eliminating OCR as a separate preprocessing step. This could reduce latency, complexity, and costs associated with maintaining separate OCR systems.
However, the research evaluates off-the-shelf ("out-of-the-box") MLLMs, whose capabilities vary considerably across models. The performance gap between image-only and OCR+MLLM approaches may therefore depend heavily on the specific MLLM used, the document types tested, and the complexity of the extraction task.
What This Means
The distinction between OCR-dependent and image-native approaches may matter less as MLLM capabilities mature. For organizations building document extraction systems, this research suggests exploring direct image processing with strong prompt engineering before defaulting to traditional OCR pipelines. The automated error analysis framework offers a practical tool for diagnosing where MLLM-based extraction fails, enabling targeted improvements. That said, OCR may still provide value for specific document types, languages, or quality requirements not covered in this benchmarking study.