NVIDIA releases Gemma 4 31B quantized model with 256K context, multimodal capabilities

TL;DR

NVIDIA has released a quantized version of Google DeepMind's Gemma 4 31B IT model, compressed to NVFP4 format for efficient inference on consumer GPUs. The 30.7B-parameter multimodal model supports 256K token context windows, handles text and image inputs with video frame processing, and maintains near-baseline performance across reasoning and coding benchmarks.

April 4, 2026 · 5:50 AM · 2 min read

Gemma 4 31B IT NVFP4 — Quick Specs

Context window: 262K tokens


NVIDIA Quantizes Google DeepMind's Gemma 4 31B for Efficient Inference

NVIDIA has released an NVFP4-quantized version of Google DeepMind's Gemma 4 31B IT model on Hugging Face, designed to run inference on consumer-grade NVIDIA GPUs while maintaining frontier-level performance for reasoning, coding, and multimodal tasks.

Model Specifications

The base Gemma 4 31B IT contains 30.7B parameters and supports a 256K-token context window—enabling extended document processing and multi-turn conversations. The model is multimodal, accepting text, image, and video inputs (up to 60 seconds at 1 FPS). It supports configurable visual token budgets (70, 140, 280, 560, 1120 tokens) and variable image aspect ratios. Vocabulary size is 262,144 tokens.
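
To get a feel for how the visual token budgets interact with the context window, the sketch below estimates how many tokens a maximum-length clip would consume. It assumes every sampled frame uses the full per-frame budget and that "256K" means 256 × 1024 = 262,144 tokens; neither detail is confirmed by the source, and the function names are illustrative.

```python
# Visual token budgets listed in the article.
VISUAL_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)

# Assumption: "256K" is read as 256 * 1024 tokens.
CONTEXT_WINDOW = 256 * 1024


def video_tokens(seconds: int, tokens_per_frame: int, fps: int = 1) -> int:
    """Visual tokens for a clip, assuming each sampled frame uses the full budget."""
    if tokens_per_frame not in VISUAL_TOKEN_BUDGETS:
        raise ValueError(f"unsupported budget: {tokens_per_frame}")
    return seconds * fps * tokens_per_frame


def context_share(tokens: int) -> float:
    """Fraction of the (assumed) context window taken up by `tokens`."""
    return tokens / CONTEXT_WINDOW


# Cost of a 60-second clip sampled at 1 FPS, for each budget.
for budget in VISUAL_TOKEN_BUDGETS:
    used = video_tokens(60, budget)
    print(f"{budget:>5} tokens/frame: {used:>6} tokens "
          f"({context_share(used):.1%} of context)")
```

Under these assumptions, a 60-second clip at the 1120-token budget uses 67,200 visual tokens, roughly a quarter of the window, while the 70-token budget uses 4,200.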

The model covers over 140 languages and uses a hybrid attention mechanism combining local sliding-window and global attention with Proportional RoPE for long-context stability.
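
The difference between local sliding-window and global attention can be illustrated with attention masks. The following is a generic sketch of the two mask types under causal decoding, not Gemma 4's actual implementation; the window size and the way local and global layers are interleaved are not given in the source, so the values here are purely illustrative.

```python
from typing import List, Optional


def attention_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Causal mask where mask[q][k] is True if query q may attend to key k.

    window=None gives global (full causal) attention. Otherwise each query
    sees only the most recent `window` positions, including itself.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            in_window = window is None or (q - k) < window
            row.append(causal and in_window)
        mask.append(row)
    return mask


# Illustrative sizes only: 6 tokens, local window of 3.
local_mask = attention_mask(6, window=3)
global_mask = attention_mask(6)

for name, mask in (("local", local_mask), ("global", global_mask)):
    print(name)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))
```

In the local mask each row has at most three allowed positions, so per-token attention cost stays bounded as the sequence grows; the global mask lets every token see the full prefix.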

Quantization Impact

NVIDIA's NVFP4 quantization (performed with nvidia-modelopt v0.42.0) shows minimal performance degradation:

GPQA Diamond: 75.71% → 75.46% (−0.25 points)
AIME 2025: 66.25% → 65.94% (−0.31 points)
MMLU Pro: 85.25% → 84.94% (−0.31 points)
LiveCodeBench (pass@1): 70.90% → 70.63% (−0.27 points)
Scicode (pass@1): 33.61% → 33.18% (−0.43 points)
Terminal-Bench Hard: 27.08% → 27.08% (no change)
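
The deltas above can be recomputed directly from the reported scores. This small sketch uses only the figures in the list; the variable and function names are illustrative.

```python
# (baseline, NVFP4) scores in percent, copied from the list above.
SCORES = {
    "GPQA Diamond": (75.71, 75.46),
    "AIME 2025": (66.25, 65.94),
    "MMLU Pro": (85.25, 84.94),
    "LiveCodeBench": (70.90, 70.63),
    "Scicode": (33.61, 33.18),
    "Terminal-Bench Hard": (27.08, 27.08),
}


def delta_points(baseline: float, quantized: float) -> float:
    """Absolute change in percentage points (negative means a drop)."""
    return round(quantized - baseline, 2)


def relative_change(baseline: float, quantized: float) -> float:
    """Change as a percentage of the baseline score."""
    return round(100 * (quantized - baseline) / baseline, 2)


deltas = {name: delta_points(*pair) for name, pair in SCORES.items()}
worst = min(deltas, key=deltas.get)  # benchmark with the largest drop
mean_delta = round(sum(deltas.values()) / len(deltas), 2)

for name, pair in SCORES.items():
    print(f"{name:<20} {deltas[name]:+.2f} points "
          f"({relative_change(*pair):+.2f}% relative)")
print(f"largest drop: {worst}; mean change: {mean_delta:+.2f} points")
```

Expressed relative to the baseline score, the largest drop (Scicode) is about 1.3%, which is why the deltas are best read as percentage points rather than percent.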

The quantized model reduces memory requirements and enables deployment on NVIDIA Hopper architecture (H100) and newer Blackwell systems via vLLM.
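
A rough estimate shows where the memory savings come from. The sketch below assumes a 16-bit (BF16) baseline and that every one of the 30.7B parameters is stored in NVFP4, with 4-bit values plus one 8-bit scale per block of 16 weights (about 4.5 bits per weight). Neither assumption comes from the source; in practice some layers usually stay in higher precision, and activations and the KV cache need additional memory, so treat the result as a lower bound on weight storage.

```python
PARAMS = 30.7e9  # parameter count reported in the article


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Assumption: baseline weights are 16-bit.
BF16_BITS = 16.0
# Assumption: 4-bit values plus one 8-bit scale shared by 16 weights.
NVFP4_BITS = 4.0 + 8.0 / 16.0

bf16_gb = weight_gb(PARAMS, BF16_BITS)
nvfp4_gb = weight_gb(PARAMS, NVFP4_BITS)

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"NVFP4 weights: {nvfp4_gb:.1f} GB")
print(f"reduction:     {bf16_gb / nvfp4_gb:.2f}x")
```

Under these assumptions the weights shrink from roughly 61.4 GB to about 17.3 GB, a reduction of about 3.6x.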

Training Data and Licensing

The underlying Gemma 4 model was trained on large-scale multimodal data (text, code, images, audio) with a knowledge cutoff of January 2025. Google DeepMind applied CSAM filtering and safety processing. NVIDIA calibrated the quantized version using the CNN DailyMail dataset (300K+ articles).

The model is available under Apache License 2.0 with NVIDIA's Open Model License Agreement governing usage. It is cleared for both commercial and non-commercial use, though NVIDIA notes the model may amplify biases and toxicity from its training data.

Known Limitations

Gemma 4 31B IT can generate inaccurate information, omit key details, and produce toxic responses—particularly when prompted with adversarial inputs. The model does not blur or maintain aspect ratios of people, personal health information, or copyrighted content in images.

NVIDIA recommends developers validate the model against internal requirements for specific industries and use cases before deployment.

Deployment and Integration

The quantized model is optimized for the vLLM inference engine, with a recommended tensor parallelism of 8 on H100 hardware. The model's Hugging Face page lists 6,292 downloads in the past month, and no commercial pricing for hosted inference has been announced.
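
As a sketch of what a self-hosted launch could look like, the helper below assembles a `vllm serve` command with the tensor parallelism mentioned above. The model identifier is a hypothetical placeholder (the article does not give the repository name), and the context-length cap is an optional illustration rather than a recommended setting.

```python
import shlex
from typing import List, Optional

# Hypothetical repository name; check the Hugging Face page for the real one.
MODEL_ID = "nvidia/Gemma-4-31B-IT-NVFP4"


def vllm_serve_command(
    model_id: str,
    tensor_parallel: int = 8,
    max_model_len: Optional[int] = None,
) -> List[str]:
    """Build the argument list for launching vLLM's OpenAI-compatible server."""
    cmd = ["vllm", "serve", model_id,
           "--tensor-parallel-size", str(tensor_parallel)]
    if max_model_len is not None:
        # Capping the context length reduces KV-cache memory use.
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


# Print a shell-ready command; nothing is launched here.
print(shlex.join(vllm_serve_command(MODEL_ID, max_model_len=32768)))
```

The printed command can be pasted into a shell on a machine with vLLM installed and enough GPUs for the chosen tensor-parallel size.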

What This Means

This release makes frontier-class multimodal reasoning accessible on consumer and datacenter GPUs without API dependencies. The small performance loss (under 0.5 percentage points on every benchmark listed above) supports the case for NVFP4 quantization in production deployments. However, with no hosted inference offering, organizations must self-host, which requires GPU infrastructure and carries operational overhead. The 256K context and multimodal capabilities position Gemma 4 31B as a direct competitor to larger proprietary models for organizations prioritizing deployment flexibility and cost control.

Source: huggingface.co
