
MLLMs can match OCR pipelines for document extraction, large-scale study finds

A large-scale benchmarking study comparing multimodal large language models (MLLMs) against traditional OCR-enhanced pipelines for document information extraction finds that image-only inputs can achieve performance comparable to OCR-augmented inputs. The research evaluates multiple out-of-the-box MLLMs on business documents and proposes an automated hierarchical error analysis framework that uses LLMs to diagnose failure modes.



A new research paper challenges the assumption that optical character recognition (OCR) preprocessing remains necessary for document information extraction in the era of multimodal large language models.

The study, conducted by researchers across multiple institutions, benchmarks various out-of-the-box MLLMs on business document information extraction tasks and directly compares MLLM-only pipelines against traditional OCR+MLLM approaches.

Key Findings

The central finding is straightforward: image-only MLLM inputs can achieve performance comparable to OCR-enhanced pipelines when extracting information from business documents. This suggests that as MLLMs' visual understanding improves, the intermediate OCR step may become redundant for many extraction tasks.

Beyond the core comparison, the researchers identified that performance can be further improved through careful engineering:

  • Well-designed extraction schemas
  • Strategic use of in-context exemplars
  • Refined task instructions

These refinements can push MLLM performance beyond baseline levels without requiring architectural changes to the models themselves.
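To make these three levers concrete, here is a minimal sketch of how a schema, an in-context exemplar, and refined instructions might be assembled into an extraction prompt. The invoice schema, field names, and exemplar are illustrative assumptions, not taken from the paper, and the paper's actual prompt format is not specified.

```python
import json

# Hypothetical extraction schema for an invoice document.
# Field names and type descriptions are illustrative only.
INVOICE_SCHEMA = {
    "invoice_number": "string",
    "issue_date": "date (YYYY-MM-DD)",
    "total_amount": "number",
    "currency": "ISO 4217 code",
}

# One in-context exemplar: (document excerpt, gold extraction).
EXEMPLAR = (
    "Invoice INV-0042, issued 2024-03-01, total due EUR 1,250.00",
    {"invoice_number": "INV-0042", "issue_date": "2024-03-01",
     "total_amount": 1250.00, "currency": "EUR"},
)

def build_prompt(schema: dict, exemplars: list, instructions: str) -> str:
    """Assemble a structured extraction prompt: refined instructions,
    then the schema definition, then worked exemplars."""
    parts = [instructions.strip(), "Fields to extract:"]
    parts += [f"- {name}: {desc}" for name, desc in schema.items()]
    for doc, gold in exemplars:
        parts.append(f"Example document:\n{doc}")
        parts.append(f"Example output:\n{json.dumps(gold)}")
    parts.append("Return a single JSON object with exactly these fields.")
    return "\n\n".join(parts)

prompt = build_prompt(
    INVOICE_SCHEMA,
    [EXEMPLAR],
    "Extract the fields below from the attached document image.",
)
```

The point of the sketch is that all three refinements live entirely in the prompt: tightening the schema, swapping exemplars, or rewording instructions requires no change to the model itself.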

Methodology: Automated Error Analysis

To understand where and why each approach fails, the researchers developed an automated hierarchical error analysis framework. Rather than relying on manual inspection of failure cases, this system uses LLMs to systematically diagnose error patterns across different document types and extraction scenarios.

This diagnostic approach reveals not only what models get wrong but why, grouping failures into categories so researchers can identify whether errors stem from visual understanding limitations, instruction-following issues, or other factors.
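The paper delegates this diagnosis to an LLM judge; as a rough illustration of what a coarse-to-fine error hierarchy looks like, here is a deterministic first-pass labeler over predicted versus gold fields. The category names are assumptions for illustration, not the paper's actual taxonomy.

```python
# Illustrative hierarchical error labeling for extraction outputs.
# Coarse and fine category names are assumed, not from the paper.

def classify_field_error(field: str, predicted: dict, gold: dict) -> tuple:
    """Return a (coarse, fine) error label for one schema field."""
    if field not in predicted or predicted[field] in (None, ""):
        return ("omission", "field_missing")
    pred, true = predicted[field], gold[field]
    if pred == true:
        return ("correct", "exact_match")
    # Same content, different surface form (e.g. "1,250.00" vs "1250.00").
    if str(pred).replace(",", "").strip() == str(true).replace(",", "").strip():
        return ("formatting", "normalization_mismatch")
    return ("content", "wrong_value")

def error_profile(predicted: dict, gold: dict) -> dict:
    """Aggregate per-field labels into counts per coarse category."""
    profile = {}
    for field in gold:
        coarse, _ = classify_field_error(field, predicted, gold)
        profile[coarse] = profile.get(coarse, 0) + 1
    return profile
```

An LLM-based version of the same idea would replace the hand-written rules with a judge prompt but keep the hierarchy, which is what makes aggregate failure patterns comparable across document types.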

Practical Implications

The findings have immediate engineering implications. Document extraction pipelines built on MLLMs can potentially be simplified by removing OCR preprocessing steps, reducing computational overhead and potential error accumulation from sequential pipeline failures.
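The structural difference between the two stacks can be sketched as a sequence of stages, where each stage is a place the pipeline can fail. The stage functions below are stubs standing in for real OCR and MLLM calls; names and payload shapes are illustrative assumptions.

```python
# Sketch of the two pipeline shapes compared in the study.
# All stages are stubs; no real OCR or model is invoked.

def run_pipeline(stages, document):
    """Apply stages in sequence; an error in any stage fails the whole run."""
    result = document
    for stage in stages:
        result = stage(result)
    return result

def ocr_stage(image):
    # Stub for OCR preprocessing: extracts text, keeps the image alongside.
    return {"text": f"<ocr text of {image}>", "image": image}

def mllm_extract_from_text(payload):
    # Stub for an MLLM call on OCR-extracted text.
    return {"fields": f"extracted from {payload['text']}"}

def mllm_extract_from_image(image):
    # Stub for an MLLM call directly on the document image.
    return {"fields": f"extracted from {image}"}

# Traditional stack: two sequential stages, two places to fail,
# and OCR errors propagate into the extraction step.
ocr_pipeline = [ocr_stage, mllm_extract_from_text]

# Image-only stack: one stage, one failure point.
mllm_only_pipeline = [mllm_extract_from_image]
```

Dropping the OCR stage removes both its compute cost and the class of errors where a correct model receives corrupted text from an upstream stage.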

However, the results appear conditional on model capability level. The study evaluates "various out-of-the-box MLLMs," suggesting that performance parity may not hold uniformly across all models or may depend on minimum capability thresholds.

The emphasis on schema design, exemplars, and instructions also highlights that MLLM performance on structured extraction tasks is highly sensitive to prompt engineering, a dependency that rigid, rule-based OCR pipelines largely avoid.

What This Means

For teams building document processing systems, this research validates MLLM-only architectures as viable alternatives to traditional OCR+MLLM stacks. The practical value comes from reduced complexity and fewer failure points in the extraction pipeline.

However, this is not a wholesale replacement for OCR in all domains. The study focuses on business documents with specific MLLM models at a particular capability level. Legacy systems, specialized document types, or scenarios requiring absolute extraction reliability may still benefit from OCR's deterministic approach.

The broader signal: as MLLMs continue improving visual understanding, the case for maintaining separate OCR preprocessing weakens. Teams should evaluate their specific use cases rather than assuming OCR remains necessary.