IBM Releases Granite 4.0 3B Vision for Enterprise Document Processing
IBM announced Granite 4.0 3B Vision, a 3-billion-parameter vision-language model purpose-built for extracting information from complex business documents. The model targets three specific capabilities: table extraction from multi-page PDFs, chart understanding and conversion to structured formats, and semantic key-value pair extraction from forms.
Model Architecture and Design
Granite 4.0 3B Vision ships as a LoRA adapter on top of Granite 4.0 Micro rather than as a standalone model. This modular approach allows the same deployment to serve both multimodal and text-only workloads, automatically falling back to the base language model when vision processing isn't required.
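In practice, this means one deployment can inspect each request and only engage the vision adapter when an image is attached. A minimal sketch of that routing logic, using hypothetical stub models (none of these names are IBM's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    images: list = field(default_factory=list)

def route(request, base_model, vision_adapter):
    """Dispatch to the vision adapter only when images are present;
    otherwise fall back to the text-only base model."""
    if request.images:
        return vision_adapter(request)
    return base_model(request)

# Stub models standing in for Granite 4.0 Micro and the vision adapter.
base = lambda r: f"text:{r.prompt}"
vision = lambda r: f"vision:{r.prompt}+{len(r.images)}img"

route(Request("summarize"), base, vision)                 # text-only path
route(Request("extract", images=["p1.png"]), base, vision)  # vision path
```

The design choice this illustrates: the adapter is additive, so text-only traffic pays no vision overhead and the base model's behavior is untouched.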
The model uses a novel "DeepStack Injection" architecture that routes abstract visual features into earlier transformer layers for semantic understanding, while high-resolution spatial features feed into later layers to preserve fine-grained detail. IBM claims this design is critical for layout-sensitive tasks like table extraction and form field location.
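The idea of routing different feature streams to different depths can be shown with a toy forward pass. This is an illustrative sketch of the injection pattern only, not IBM's architecture; layer counts, the split point, and the additive combination are all assumptions:

```python
N_LAYERS = 8

def layer(hidden, injected=None):
    # Placeholder transformer layer: add any injected features, then
    # apply an identity transform (standing in for attention/MLP).
    if injected is not None:
        hidden = [h + v for h, v in zip(hidden, injected)]
    return [h * 1.0 for h in hidden]

def forward(tokens, semantic_feats, spatial_feats):
    """Early layers receive abstract semantic visual features;
    later layers receive high-resolution spatial features."""
    hidden = tokens
    for i in range(N_LAYERS):
        if i < N_LAYERS // 2:
            hidden = layer(hidden, semantic_feats)  # early: semantics
        else:
            hidden = layer(hidden, spatial_feats)   # late: fine detail
    return hidden
```

The point of the split is that spatial detail injected late is not diluted by many rounds of abstraction, which is why layout-sensitive tasks would benefit.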
Performance Benchmarks
Chart Understanding: On the human-verified ChartNet benchmark using LLM-as-a-judge evaluation, Granite 4.0 3B Vision achieves 86.4% on Chart2Summary tasks—the highest score among all evaluated models including significantly larger competitors. On Chart2CSV conversion, it scores 62.1%, ranking second behind Qwen3.5-9B (63.4%), which has more than double the parameters.
Table Extraction: Measured by TEDS (a metric capturing both structural and content accuracy), the model leads across three industry benchmarks, evaluated in four settings:
- PubTablesV2 cropped: 92.1
- PubTablesV2 full-page: 79.3
- OmniDocBench: 64.0
- TableVQA: 88.1
Form Field Extraction: On VAREX, a benchmark of 1,777 U.S. government forms with complex nested and tabular structures, Granite 4.0 3B Vision achieves 85.5% exact match accuracy in zero-shot evaluation.
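Exact match is the strictest of these metrics: a field counts only if the predicted value matches the gold value character for character. A minimal sketch of how such a score is computed over extracted key-value pairs (the whitespace normalization is an assumed convention, not a documented detail of VAREX):

```python
def exact_match_accuracy(preds, golds):
    """Fraction of gold fields whose predicted value matches exactly,
    after stripping surrounding whitespace (assumed normalization)."""
    hits = 0
    for key, gold in golds.items():
        if preds.get(key, "").strip() == gold.strip():
            hits += 1
    return hits / len(golds)

golds = {"name": "IBM", "date": "2025-01-02", "total": "$5"}
preds = {"name": "IBM", "date": "2025-01-01", "total": "$5"}
```

With one of three fields wrong, the sketch yields 2/3; on VAREX's nested forms, a single misread cell in a tabular field is enough to zero out that field.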
ChartNet Dataset
IBM developed ChartNet, described in an upcoming CVPR 2026 paper, containing 1.7 million synthetic chart samples spanning 24 chart types across 6 plotting libraries. Each sample includes aligned components: plotting code, rendered image, data table, natural language summary, and QA pairs. The dataset also includes human-annotated and real-world subsets filtered for visual fidelity and semantic accuracy.
Deployment Options
The model supports two integration patterns:
- Standalone: Direct processing of individual images for targeted extraction in existing workflows, without upstream modifications.
- Integrated Pipeline: Integration with Docling (IBM's document processing toolkit) for end-to-end multi-page PDF processing with automated detection, segmentation, and cropping of visual elements.
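The difference between the two patterns can be sketched as follows. Every name here is hypothetical; in the integrated pattern, the detection and cropping step is what Docling would provide upstream of the model:

```python
def extract_from_image(image_bytes, task="table"):
    """Standalone pattern: send one pre-cropped image straight to the
    model for a targeted extraction. (Stubbed model call.)"""
    return {"task": task, "source": "image"}

def process_pdf(pages):
    """Integrated pattern: walk a multi-page document, take the visual
    elements an upstream tool detected and cropped on each page, and
    run the standalone extractor on every crop."""
    results = []
    for page_no, page in enumerate(pages, start=1):
        for crop in page.get("elements", []):
            out = extract_from_image(crop["bytes"], task=crop["kind"])
            out["page"] = page_no
            results.append(out)
    return results

# A fake two-page document with three detected visual elements.
doc = [
    {"elements": [{"bytes": b"", "kind": "table"},
                  {"bytes": b"", "kind": "chart"}]},
    {"elements": [{"bytes": b"", "kind": "form"}]},
]
```

The standalone function is the unit both patterns share; the pipeline just adds page iteration and element provenance on top of it.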
Use cases include invoice and form processing, financial report analysis, and automated document classification.
What This Means
Granite 4.0 3B Vision represents a shift toward specialized, compact models for document understanding rather than general-purpose scaling. At 3 billion parameters, it's significantly smaller than competing models while achieving comparable or superior performance on document-specific tasks. The modular LoRA design addresses a practical enterprise requirement: support for both vision and text-only workloads within a single deployment. IBM's emphasis on document layout understanding and the ChartNet dataset suggest the model is tuned for precision in spatial reasoning—a weakness of many larger VLMs. Pricing and general availability details were not disclosed.