IBM releases Granite 4.0 3B Vision, compact multimodal model for enterprise document understanding
IBM announced Granite 4.0 3B Vision, a 3-billion-parameter vision-language model designed for enterprise document processing. The model achieves 86.4% on Chart2Summary and a 92.1 TEDS score on cropped table extraction, and it ships as a LoRA adapter on Granite 4.0 Micro, enabling a modular text-only fallback.
IBM announced Granite 4.0 3B Vision, a 3-billion-parameter vision-language model purpose-built for extracting information from complex business documents. The model targets three specific capabilities: table extraction from multi-page PDFs, chart understanding and conversion to structured formats, and semantic key-value pair extraction from forms.
Model Architecture and Design
Granite 4.0 3B Vision ships as a LoRA adapter on top of Granite 4.0 Micro rather than as a standalone model. This modular approach allows the same deployment to serve both multimodal and text-only workloads, automatically falling back to the base language model when vision processing isn't required.
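The appeal of the adapter design is that disabling it recovers the base model exactly. A minimal sketch of the underlying math (shapes, rank, and names here are illustrative, not IBM's implementation):

```python
import numpy as np

# Toy illustration of the modular-adapter idea: a LoRA adapter adds a
# low-rank update B @ A to a frozen base weight W. With the adapter
# disabled, the layer is exactly the base language model, so text-only
# requests need no separate deployment. All values are illustrative.

rng = np.random.default_rng(0)
d, r = 8, 2                      # hidden size, LoRA rank
W = rng.normal(size=(d, d))      # frozen base weight
A = rng.normal(size=(r, d))      # LoRA down-projection
B = rng.normal(size=(d, r))      # LoRA up-projection

def forward(x, use_vision_adapter):
    # Apply the low-rank update only when the request needs vision.
    W_eff = W + B @ A if use_vision_adapter else W
    return x @ W_eff.T

x = rng.normal(size=(d,))
base_out = forward(x, use_vision_adapter=False)
vision_out = forward(x, use_vision_adapter=True)
# Skipping the adapter reproduces the base model bit-for-bit.
assert np.allclose(base_out, x @ W.T)
```

This is why one deployment can serve both workloads: the vision path and the text-only path share every base parameter.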
The model uses a novel "DeepStack Injection" architecture that routes abstract visual features into earlier transformer layers for semantic understanding, while high-resolution spatial features feed into later layers to preserve fine-grained detail. IBM claims this design is critical for layout-sensitive tasks like table extraction and form field location.
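The routing idea above can be sketched as two feature streams entering the transformer stack at different depths. This is a hedged toy model of the concept, not IBM's architecture; layer indices, dimensions, and the tanh block are stand-ins:

```python
import numpy as np

# Toy sketch of layered injection: abstract visual features are added to
# the hidden state at an early layer (semantic understanding), while
# high-resolution spatial features are added at a later layer so that
# fine-grained detail is not washed out by earlier processing.

rng = np.random.default_rng(1)
n_layers, d = 6, 16
layers = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n_layers)]

def transformer_pass(h, abstract_feats, spatial_feats,
                     early_layer=1, late_layer=4):
    for i, W in enumerate(layers):
        if i == early_layer:
            h = h + abstract_feats     # semantic signal injected early
        if i == late_layer:
            h = h + spatial_feats      # fine-grained detail injected late
        h = np.tanh(h @ W)             # stand-in for a transformer block
    return h

h0 = rng.normal(size=(d,))
out = transformer_pass(h0, rng.normal(size=(d,)), rng.normal(size=(d,)))
```

Injecting spatial features late means only two blocks transform them before the output, which is one plausible reading of why the design helps layout-sensitive tasks like table extraction.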
Performance Benchmarks
Chart Understanding: On the human-verified ChartNet benchmark using LLM-as-a-judge evaluation, Granite 4.0 3B Vision achieves 86.4% on Chart2Summary tasks—the highest score among all evaluated models including significantly larger competitors. On Chart2CSV conversion, it scores 62.1%, ranking second behind Qwen3.5-9B (63.4%), which has more than double the parameters.
Table Extraction: Across three industry benchmarks measured by TEDS (a metric capturing both structural and content accuracy), the model leads on:
- PubTablesV2 cropped: 92.1
- PubTablesV2 full-page: 79.3
- OmniDocBench: 64.0
- TableVQA: 88.1
Form Field Extraction: On VAREX, a benchmark of 1,777 U.S. government forms with complex nested and tabular structures, Granite 4.0 3B Vision achieves 85.5% exact match accuracy in zero-shot evaluation.
ChartNet Dataset
IBM developed ChartNet, described in an upcoming CVPR 2026 paper, containing 1.7 million synthetic chart samples spanning 24 chart types across 6 plotting libraries. Each sample includes aligned components: plotting code, rendered image, data table, natural language summary, and QA pairs. The dataset also includes human-annotated and real-world subsets filtered for visual fidelity and semantic accuracy.
Deployment Options
The model supports two integration patterns:
- Standalone: Direct processing of individual images for targeted extraction in existing workflows, without upstream modifications.
- Integrated Pipeline: Integration with Docling (IBM's document processing tool) for end-to-end multi-page PDF processing with automated detection, segmentation, and visual element cropping.
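The integrated pattern amounts to a detect-crop-extract loop over pages. A minimal sketch of that control flow, with stub functions standing in for Docling and the model (none of these are real APIs):

```python
# Illustrative pipeline: a document tool detects and crops visual
# elements on each page, and each crop is handed to the vision model
# for targeted extraction. All functions are hypothetical stubs.

page_elements = {1: ["table"], 2: ["chart", "table"]}  # fake layout data

def detect_elements(page):
    # Stand-in for layout detection: yields (kind, crop) pairs per page.
    return [(kind, f"page{page}:{kind}") for kind in page_elements.get(page, [])]

def extract(kind, crop):
    # Stand-in for one Granite 4.0 3B Vision call on a cropped image.
    return {"kind": kind, "source": crop, "data": "..."}

def process_pdf(pages):
    results = []
    for page in pages:
        for kind, crop in detect_elements(page):
            results.append(extract(kind, crop))
    return results

records = process_pdf([1, 2])
```

In the standalone pattern, an application would call only the `extract` step on crops it already has.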
Use cases include invoice and form processing, financial report analysis, and automated document classification.
What This Means
Granite 4.0 3B Vision represents a shift toward specialized, compact models for document understanding rather than general-purpose scaling. At 3 billion parameters, it's significantly smaller than competing models while achieving comparable or superior performance on document-specific tasks. The modular LoRA design addresses a practical enterprise requirement: support for both vision and text-only workloads within a single deployment. IBM's emphasis on document layout understanding and the ChartNet dataset suggest the model is tuned for precision in spatial reasoning—a weakness of many larger VLMs. Pricing and general availability details were not disclosed.
Related Articles
IBM Releases Granite Embedding 311M R2 With 32K Context, 200+ Language Support
IBM released Granite Embedding 311M Multilingual R2, a 311-million parameter dense embedding model with 32,768-token context length and support for 200+ languages. The model scores 64.0 on Multilingual MTEB Retrieval (18 tasks), an 11.8-point improvement over its predecessor, and ships with ONNX and OpenVINO models for production deployment.
IBM Releases 97M-Parameter Granite Embedding Model With 60.3 MTEB Score — Highest Retrieval Quality Under 100M Parameters
IBM released two new multilingual embedding models under Apache 2.0: a 97M-parameter compact model scoring 60.3 on MTEB Multilingual Retrieval (highest in its size class) and a 311M full-size model scoring 65.2. Both support 200+ languages with enhanced retrieval for 52 languages, handle 32K-token context (64x increase over predecessors), and include code retrieval across 9 programming languages.
Baidu Releases Qianfan-OCR-Fast Model with 66K Context at $0.68 Per 1M Input Tokens
Baidu has released Qianfan-OCR-Fast, a multimodal model specialized for optical character recognition tasks. The model offers a 66,000 token context window and is priced at $0.68 per 1M input tokens and $2.81 per 1M output tokens.
Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens
Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with 33K context window, priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.