IBM releases Granite 4.0 3B Vision, compact multimodal model for enterprise document understanding
IBM announced Granite 4.0 3B Vision, a 3 billion parameter vision-language model designed for enterprise document processing. The model achieves 86.4% on Chart2Summary and 92.1% TEDS score on cropped table extraction, shipped as a LoRA adapter on Granite 4.0 Micro to enable modular text-only fallbacks.
IBM Releases Granite 4.0 3B Vision for Enterprise Document Processing
IBM announced Granite 4.0 3B Vision, a 3 billion parameter vision-language model purpose-built for extracting information from complex business documents. The model targets three specific capabilities: table extraction from multi-page PDFs, chart understanding and conversion to structured formats, and semantic key-value pair extraction from forms.
Model Architecture and Design
Granite 4.0 3B Vision ships as a LoRA adapter on top of Granite 4.0 Micro rather than as a standalone model. This modular approach allows the same deployment to serve both multimodal and text-only workloads, automatically falling back to the base language model when vision processing isn't required.
The model uses a novel "DeepStack Injection" architecture that routes abstract visual features into earlier transformer layers for semantic understanding, while high-resolution spatial features feed into later layers to preserve fine-grained detail. IBM claims this design is critical for layout-sensitive tasks like table extraction and form field location.
Performance Benchmarks
Chart Understanding: On the human-verified ChartNet benchmark using LLM-as-a-judge evaluation, Granite 4.0 3B Vision achieves 86.4% on Chart2Summary tasks—the highest score among all evaluated models including significantly larger competitors. On Chart2CSV conversion, it scores 62.1%, ranking second behind Qwen3.5-9B (63.4%), which has more than double the parameters.
Table Extraction: Across three industry benchmarks measured by TEDS (a metric capturing both structural and content accuracy), the model leads on:
- PubTablesV2 cropped: 92.1
- PubTablesV2 full-page: 79.3
- OmniDocBench: 64.0
- TableVQA: 88.1
Form Field Extraction: On VAREX, a benchmark of 1,777 U.S. government forms with complex nested and tabular structures, Granite 4.0 3B Vision achieves 85.5% exact match accuracy in zero-shot evaluation.
ChartNet Dataset
IBM developed ChartNet, described in an upcoming CVPR 2026 paper, containing 1.7 million synthetic chart samples spanning 24 chart types across 6 plotting libraries. Each sample includes aligned components: plotting code, rendered image, data table, natural language summary, and QA pairs. The dataset also includes human-annotated and real-world subsets filtered for visual fidelity and semantic accuracy.
Deployment Options
The model supports two integration patterns:
-
Standalone: Direct processing of individual images for targeted extraction in existing workflows without upstream modifications.
-
Integrated Pipeline: Seamless integration with Docling (IBM's document processing tool) for end-to-end multi-page PDF processing with automated detection, segmentation, and visual element cropping.
Use cases include invoice and form processing, financial report analysis, and automated document classification.
What This Means
Granite 4.0 3B Vision represents a shift toward specialized, compact models for document understanding rather than general-purpose scaling. At 3 billion parameters, it's significantly smaller than competing models while achieving comparable or superior performance on document-specific tasks. The modular LoRA design addresses a practical enterprise requirement: support for both vision and text-only workloads within a single deployment. IBM's emphasis on document layout understanding and the ChartNet dataset suggest the model is tuned for precision in spatial reasoning—a weakness of many larger VLMs. Pricing and general availability details were not disclosed.
Related Articles
Trump administration approves Anthropic's Mythos 5 release to 100 companies and federal agencies
The U.S. government approved Anthropic's release of its Mythos 5 model to roughly 100 companies and federal agencies on Friday. The limited distribution marks a controlled rollout requiring government clearance.
DeepReinforce Releases Ornith-1.0, Open-Source Agentic Coding Model in 9B to 397B Sizes
DeepReinforce has released Ornith-1.0, an MIT-licensed model designed for agentic coding tasks with variants ranging from 9B to 397B parameters. Built on top of Apache 2.0-licensed Gemma 4 and Qwen 3.5 base models, the company claims it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.
DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3
DeepSeek has released two new MoE models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one million token context window and use a hybrid attention architecture that requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.
DeepSeek Releases V4-Pro with 1.6T Parameters, 1M Token Context at 27% Inference Cost of V3
DeepSeek has released two Mixture-of-Experts models: V4-Pro with 1.6 trillion parameters (49B activated) and V4-Flash with 284B parameters (13B activated), both supporting 1 million token context windows. V4-Pro requires only 27% of inference FLOPs and 10% of KV cache compared to V3.2 at 1M token context, trained on over 32 trillion tokens.
Comments
Loading...