
NVIDIA releases Gemma 4 31B quantized model with 256K context, multimodal capabilities

TL;DR

NVIDIA has released a quantized version of Google DeepMind's Gemma 4 31B IT model, compressed to NVFP4 format for efficient inference on consumer GPUs. The 30.7B-parameter multimodal model supports 256K token context windows, handles text and image inputs with video frame processing, and maintains near-baseline performance across reasoning and coding benchmarks.


NVIDIA Quantizes Google DeepMind's Gemma 4 31B for Efficient Inference

NVIDIA has released an NVFP4-quantized version of Google DeepMind's Gemma 4 31B IT model on Hugging Face, designed to run inference on consumer-grade NVIDIA GPUs while maintaining frontier-level performance for reasoning, coding, and multimodal tasks.

Model Specifications

The base Gemma 4 31B IT contains 30.7B parameters and supports a 256K-token context window—enabling extended document processing and multi-turn conversations. The model is multimodal, accepting text, image, and video inputs (up to 60 seconds at 1 FPS). It supports configurable visual token budgets (70, 140, 280, 560, 1120 tokens) and variable image aspect ratios. Vocabulary size is 262,144 tokens.
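As a rough illustration of how the configurable visual token budget interacts with the 256K context window, the sketch below assumes each video frame consumes one image's visual token budget (an assumption for illustration, not a documented formula):

```python
# Rough context-budget arithmetic for multimodal inputs.
# Assumption: each video frame consumes one full visual token budget.
CONTEXT_WINDOW = 256 * 1024        # 256K-token context window
VIDEO_SECONDS, FPS = 60, 1         # video input: up to 60 seconds at 1 FPS
frames = VIDEO_SECONDS * FPS       # 60 frames

for budget in (70, 140, 280, 560, 1120):   # supported visual token budgets
    video_tokens = frames * budget
    remaining = CONTEXT_WINDOW - video_tokens
    print(f"budget={budget:>4}: video uses {video_tokens:>6} tokens, "
          f"{remaining} left for text")
```

Even at the largest budget (1,120 tokens per frame), a full 60-frame video consumes roughly a quarter of the context window, leaving ample room for long documents or conversation history.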

The model covers over 140 languages and uses a hybrid attention mechanism combining local sliding-window and global attention with Proportional RoPE for long-context stability.

Quantization Impact

NVIDIA's NVFP4 quantization (performed with nvidia-modelopt v0.42.0) shows minimal performance degradation:

  • GPQA Diamond: 75.71% → 75.46% (−0.25 points)
  • AIME 2025: 66.25% → 65.94% (−0.31 points)
  • MMLU Pro: 85.25% → 84.94% (−0.31 points)
  • LiveCodeBench (pass@1): 70.90% → 70.63% (−0.27 points)
  • Scicode (pass@1): 33.61% → 33.18% (−0.43 points)
  • Terminal-Bench Hard: 27.08% → 27.08% (no change)
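The per-benchmark deltas above can be checked mechanically from the reported scores:

```python
# Baseline vs. NVFP4-quantized scores reported for Gemma 4 31B IT.
scores = {
    "GPQA Diamond":        (75.71, 75.46),
    "AIME 2025":           (66.25, 65.94),
    "MMLU Pro":            (85.25, 84.94),
    "LiveCodeBench":       (70.90, 70.63),
    "Scicode":             (33.61, 33.18),
    "Terminal-Bench Hard": (27.08, 27.08),
}

for name, (base, quant) in scores.items():
    delta = round(base - quant, 2)  # degradation in points
    print(f"{name:<20} {base:>6.2f} -> {quant:>6.2f} ({-delta:+.2f})")

worst = max(round(b - q, 2) for b, q in scores.values())
print(f"worst-case degradation: {worst} points")  # 0.43 (Scicode)
```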

The quantized model reduces memory requirements and enables deployment on NVIDIA Hopper architecture (H100) and newer Blackwell systems via vLLM.
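To see why 4-bit quantization matters for consumer GPUs, here is a back-of-the-envelope weight-memory estimate. The ~4.5 bits/parameter effective rate for NVFP4 (4-bit values plus block scaling factors) is an assumption for illustration, and activations, KV cache, and runtime overhead are ignored:

```python
PARAMS = 30.7e9  # Gemma 4 31B IT parameter count

def weight_gb(bits_per_param):
    """Approximate weight storage in GB for a given effective bit width."""
    return PARAMS * bits_per_param / 8 / 1e9

bf16 = weight_gb(16)     # ~61.4 GB: exceeds any single consumer GPU
nvfp4 = weight_gb(4.5)   # ~17.3 GB: fits on a single high-end consumer GPU
print(f"BF16 : {bf16:.1f} GB")
print(f"NVFP4: {nvfp4:.1f} GB  ({bf16 / nvfp4:.1f}x smaller)")
```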

Training Data and Licensing

The underlying Gemma 4 model was trained on large-scale multimodal data (text, code, images, audio) with a knowledge cutoff of January 2025. Google DeepMind applied CSAM filtering and safety processing. NVIDIA calibrated the quantized version using the CNN DailyMail dataset (300K+ articles).

The model is available under Apache License 2.0 with NVIDIA's Open Model License Agreement governing usage. It is cleared for both commercial and non-commercial use, though NVIDIA notes the model may amplify biases and toxicity from its training data.

Known Limitations

Gemma 4 31B IT can generate inaccurate information, omit key details, and produce toxic responses, particularly when prompted with adversarial inputs. The model does not automatically blur or redact people, personal health information, or copyrighted content appearing in image inputs.

NVIDIA recommends developers validate the model against internal requirements for specific industries and use cases before deployment.

Deployment and Integration

The quantized model is optimized for the vLLM inference engine, with a recommended tensor-parallel size of 8 on H100 hardware. The model page shows 6,292 downloads on Hugging Face over the past month, though no commercial pricing has been announced for hosted inference.
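A minimal deployment sketch using vLLM's serving command follows. The Hugging Face repo id is a placeholder (check the actual model card for the published name), and the flags assume a recent vLLM release with NVFP4 support:

```shell
# Serve the NVFP4 checkpoint across 8 H100 GPUs, per NVIDIA's recommendation.
# NOTE: the repo id below is illustrative -- substitute the real model name.
vllm serve <nvidia/gemma-4-31b-it-nvfp4-placeholder> \
    --tensor-parallel-size 8 \
    --max-model-len 262144
```

`--max-model-len` caps the context at the full 262,144-token window; lowering it reduces KV-cache memory pressure if the full window is not needed.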

What This Means

This release makes frontier-class multimodal reasoning accessible on consumer and datacenter GPUs without API dependencies. The minimal performance loss (under 0.5 points on most benchmarks) validates NVFP4 quantization for production deployments. However, the lack of hosted inference offerings means organizations must self-host, which requires GPU infrastructure and operational overhead. The 256K context and multimodal capabilities position Gemma 4 31B as a direct competitor to larger proprietary models for organizations prioritizing deployment flexibility and cost control.

