IBM Releases Granite 4.0 3B Vision for Enterprise Document Processing
IBM announced Granite 4.0 3B Vision, a 3-billion-parameter vision-language model purpose-built for extracting information from complex business documents. The model targets three specific capabilities: table extraction from multi-page PDFs, chart understanding and conversion to structured formats, and semantic key-value pair extraction from forms.
Model Architecture and Design
Granite 4.0 3B Vision ships as a LoRA adapter on top of Granite 4.0 Micro rather than as a standalone model. This modular approach allows the same deployment to serve both multimodal and text-only workloads, automatically falling back to the base language model when vision processing isn't required.
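In practice, this means one deployment can inspect each request and only engage the vision adapter when an image is attached. A minimal sketch of that routing logic, using hypothetical stub models (none of these names are IBM's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    images: list = field(default_factory=list)

def route(request, base_model, vision_adapter):
    """Dispatch to the vision adapter only when images are present;
    otherwise fall back to the text-only base model."""
    if request.images:
        return vision_adapter(request)
    return base_model(request)

# Stub models standing in for Granite 4.0 Micro and the vision adapter.
base = lambda r: f"text:{r.prompt}"
vision = lambda r: f"vision:{r.prompt}+{len(r.images)}img"

route(Request("summarize"), base, vision)                 # text-only path
route(Request("extract", images=["p1.png"]), base, vision)  # vision path
```

The design choice this illustrates: the adapter is additive, so text-only traffic pays no vision overhead and the base model's behavior is untouched.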
The model uses a novel "DeepStack Injection" architecture that routes abstract visual features into earlier transformer layers for semantic understanding, while high-resolution spatial features feed into later layers to preserve fine-grained detail. IBM claims this design is critical for layout-sensitive tasks like table extraction and form field location.
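The idea of routing different feature streams to different depths can be shown with a toy forward pass. This is an illustrative sketch of the injection pattern only, not IBM's architecture; layer counts, the split point, and the additive combination are all assumptions:

```python
N_LAYERS = 8

def layer(hidden, injected=None):
    # Placeholder transformer layer: add any injected features, then
    # apply an identity transform (standing in for attention/MLP).
    if injected is not None:
        hidden = [h + v for h, v in zip(hidden, injected)]
    return [h * 1.0 for h in hidden]

def forward(tokens, semantic_feats, spatial_feats):
    """Early layers receive abstract semantic visual features;
    later layers receive high-resolution spatial features."""
    hidden = tokens
    for i in range(N_LAYERS):
        if i < N_LAYERS // 2:
            hidden = layer(hidden, semantic_feats)  # early: semantics
        else:
            hidden = layer(hidden, spatial_feats)   # late: fine detail
    return hidden
```

The point of the split is that spatial detail injected late is not diluted by many rounds of abstraction, which is why layout-sensitive tasks would benefit.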
Performance Benchmarks
Chart Understanding: On the human-verified ChartNet benchmark using LLM-as-a-judge evaluation, Granite 4.0 3B Vision achieves 86.4% on Chart2Summary tasks—the highest score among all evaluated models including significantly larger competitors. On Chart2CSV conversion, it scores 62.1%, ranking second behind Qwen3.5-9B (63.4%), which has more than double the parameters.
Table Extraction: Measured by TEDS (a metric capturing both structural and content accuracy), the model leads across three industry benchmarks, evaluated in four settings:
- PubTablesV2 cropped: 92.1
- PubTablesV2 full-page: 79.3
- OmniDocBench: 64.0
- TableVQA: 88.1
Form Field Extraction: On VAREX, a benchmark of 1,777 U.S. government forms with complex nested and tabular structures, Granite 4.0 3B Vision achieves 85.5% exact match accuracy in zero-shot evaluation.
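Exact match is the strictest of these metrics: a field counts only if the predicted value matches the gold value character for character. A minimal sketch of how such a score is computed over extracted key-value pairs (the whitespace normalization is an assumed convention, not a documented detail of VAREX):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of gold fields whose predicted value matches exactly,
    after stripping surrounding whitespace (assumed normalization)."""
    hits = 0
    for key, gold in golds.items():
        if preds.get(key, "").strip() == gold.strip():
            hits += 1
    return hits / len(golds)

golds = {"name": "IBM", "date": "2025-01-02", "total": "$5"}
preds = {"name": "IBM", "date": "2025-01-01", "total": "$5"}
```

With one of three fields wrong, the sketch yields 2/3; on VAREX's nested forms, a single misread cell in a tabular field is enough to zero out that field.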
ChartNet Dataset
IBM developed ChartNet, described in an upcoming CVPR 2026 paper, containing 1.7 million synthetic chart samples spanning 24 chart types across 6 plotting libraries. Each sample includes aligned components: plotting code, rendered image, data table, natural language summary, and QA pairs. The dataset also includes human-annotated and real-world subsets filtered for visual fidelity and semantic accuracy.
Deployment Options
The model supports two integration patterns:
- Standalone: Direct processing of individual images for targeted extraction in existing workflows, without upstream modifications.
- Integrated Pipeline: Integration with Docling (IBM's document processing toolkit) for end-to-end multi-page PDF processing with automated detection, segmentation, and cropping of visual elements.
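The difference between the two patterns can be sketched as follows. Every name here is hypothetical; in the integrated pattern, the detection and cropping step is what Docling would provide upstream of the model:

```python
def extract_from_image(image_bytes, task="table"):
    """Standalone pattern: send one pre-cropped image straight to the
    model for a targeted extraction. (Stubbed model call.)"""
    return {"task": task, "source": "image"}

def process_pdf(pages):
    """Integrated pattern: walk a multi-page document, take the visual
    elements an upstream tool detected and cropped on each page, and
    run the standalone extractor on every crop."""
    results = []
    for page_no, page in enumerate(pages, start=1):
        for crop in page.get("elements", []):
            out = extract_from_image(crop["bytes"], task=crop["kind"])
            out["page"] = page_no
            results.append(out)
    return results

# A fake two-page document with three detected visual elements.
doc = [
    {"elements": [{"bytes": b"", "kind": "table"},
                  {"bytes": b"", "kind": "chart"}]},
    {"elements": [{"bytes": b"", "kind": "form"}]},
]
```

The standalone function is the unit both patterns share; the pipeline just adds page iteration and element provenance on top of it.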
Use cases include invoice and form processing, financial report analysis, and automated document classification.
What This Means
Granite 4.0 3B Vision represents a shift toward specialized, compact models for document understanding rather than general-purpose scaling. At 3 billion parameters, it's significantly smaller than competing models while achieving comparable or superior performance on document-specific tasks. The modular LoRA design addresses a practical enterprise requirement: support for both vision and text-only workloads within a single deployment. IBM's emphasis on document layout understanding and the ChartNet dataset suggest the model is tuned for precision in spatial reasoning—a weakness of many larger VLMs. Pricing and general availability details were not disclosed.