Google DeepMind releases Gemma 4, open multimodal models with 256K context and reasoning
Google DeepMind has released Gemma 4, a family of open-weights multimodal models ranging from 2.3B to 31B parameters with support for text, images, video, and audio. The models feature context windows up to 256K tokens, built-in reasoning modes, and native function calling for agentic workflows.
The release includes both dense and Mixture-of-Experts (MoE) architectures and targets deployment on devices ranging from mobile phones to data center servers.
Model Sizes and Specifications
Gemma 4 offers four distinct variants:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, text/image/audio support
- E4B: 4.5B effective parameters (8B with embeddings), 128K context, text/image/audio support
- 26B A4B (MoE): 25.2B total parameters with 3.8B active parameters, 256K context, text/image support
- 31B Dense: 30.7B parameters, 256K context, text/image support
The smaller E2B and E4B models use Per-Layer Embeddings (PLE) to keep their effective parameter counts low, which makes them practical to deploy on edge devices. The 26B A4B variant is a Mixture-of-Experts model with 128 experts in total, of which 8 are active. Google claims inference speeds comparable to a 4B model while the model retains its 25.2B total parameters.
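To make the "8 of 128 experts" figure concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Gemma 4's actual router: the random router scores, the softmax over the selected experts, and the names `route_token`, `NUM_EXPERTS` and `TOP_K` are illustrative assumptions. Only the expert counts and parameter totals come from the article.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, as reported for the 26B A4B variant
TOP_K = 8          # active experts, as reported


def route_token(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for a single token (seeded for repeatability).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

expert_fraction = TOP_K / NUM_EXPERTS  # share of experts that run
param_fraction = 3.8 / 25.2            # share of parameters reported as active

print(f"experts used: {len(selected)} of {NUM_EXPERTS} ({expert_fraction:.2%})")
print(f"active parameters: {param_fraction:.1%} of total")
```

The active-parameter share (about 15%) is larger than the expert share (6.25%). A plausible reading is that weights outside the experts, such as attention layers, run for every token, but the article does not break the numbers down.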
Capabilities and Architecture
All Gemma 4 models accept text and image inputs with variable aspect ratios and resolutions. The E2B and E4B models add native audio support, including automatic speech recognition and multilingual speech translation. Video understanding works by processing a video as a sequence of frames.
Key features include:
- Reasoning: Configurable thinking modes enabling step-by-step reasoning before response generation
- Function Calling: Native support for structured tool use and agentic workflows
- Hybrid Attention: Combines local sliding window attention with full global attention, with Proportional RoPE optimization for memory efficiency
- Multilingual: Pre-trained on 140+ languages with out-of-the-box support for 35+
- Native System Prompt Support: Structured conversation control
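The difference between the two attention types in the hybrid design can be sketched with boolean masks. This is a toy illustration, not Gemma 4's implementation: the sequence length, the window size and the function names are hypothetical, and the article does not say how local and global layers are interleaved.

```python
def causal_mask(seq_len):
    """Full global attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local attention: token i attends only to the last `window` tokens, itself included."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count the query/key pairs that a mask allows."""
    return sum(sum(row) for row in mask)


SEQ_LEN = 16  # hypothetical toy sequence length
WINDOW = 4    # hypothetical window size

global_pairs = attended_pairs(causal_mask(SEQ_LEN))
local_pairs = attended_pairs(sliding_window_mask(SEQ_LEN, WINDOW))

print(f"global attention pairs: {global_pairs}")
print(f"sliding-window pairs:   {local_pairs}")
```

Global attention grows quadratically with sequence length, while the windowed variant grows linearly. That is the usual motivation for mixing the two when context windows reach hundreds of thousands of tokens.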
Benchmark Performance
Reported results for the instruction-tuned models on reasoning, coding, and long-context benchmarks:
Gemma 4 31B achieves:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- MATH-Vision: 85.6%
- Long Context MRCR v2 (128K needle): 66.4%
Gemma 4 26B A4B demonstrates strong performance-to-efficiency trade-offs:
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces ELO: 1718
The smaller models are compared against Gemma 3 27B. E2B scores 60.0% on MMLU Pro, below Gemma 3 27B's 67.6%, but does so with only 2.3B effective parameters.
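One way to read the performance-to-efficiency trade-off is to put the 31B and 26B A4B scores side by side with the active parameter counts. The short script below does that arithmetic using only figures quoted in this article. The variable names are illustrative, and the comparison ignores everything the article does not report, such as latency or memory use.

```python
# Scores as quoted above for the four benchmarks both models report.
scores_31b = {
    "MMLU Pro": 85.2,
    "AIME 2026 (no tools)": 89.2,
    "LiveCodeBench v6": 80.0,
    "Codeforces ELO": 2150,
}
scores_26b_a4b = {
    "MMLU Pro": 82.6,
    "AIME 2026 (no tools)": 88.3,
    "LiveCodeBench v6": 77.1,
    "Codeforces ELO": 1718,
}

# Gap between the dense model and the MoE model on each benchmark.
gaps = {name: round(scores_31b[name] - scores_26b_a4b[name], 1)
        for name in scores_31b}

# Parameters used per token: 3.8B active (MoE) versus 30.7B (dense).
active_ratio = 3.8 / 30.7

for name, gap in gaps.items():
    print(f"{name}: 31B leads by {gap}")
print(f"26B A4B uses {active_ratio:.1%} of the dense model's per-token parameters")
```

The gap is under three points on the three percentage benchmarks but 432 points on Codeforces ELO, so the cost of the smaller active footprint varies by task.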
Release Details
The models are released under Apache 2.0 licensing as both pre-trained and instruction-tuned variants. Unsloth has released GGUF quantized versions optimized for local inference. The models are available through Hugging Face with support for the latest Transformers library.
Google DeepMind emphasizes on-device deployment viability for the smaller models while positioning larger variants for consumer GPU and server deployment. The hybrid architecture and context window scaling address trade-offs between inference speed and reasoning depth for long-context tasks.
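A rough way to judge which variant fits which device is to estimate weight storage from the parameter count and the bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement. The 16, 8 and 4 bit widths are illustrative assumptions, real quantized files usually carry extra per-block metadata, and the estimate leaves out the KV cache and activations. The parameter counts are the ones quoted in this article.

```python
# Parameter counts in billions, as quoted in the article.
PARAMS_BILLIONS = {
    "E2B (with embeddings)": 5.1,
    "E4B (with embeddings)": 8.0,
    "26B A4B (total)": 25.2,
    "31B Dense": 30.7,
}

BIT_WIDTHS = (16, 8, 4)  # illustrative precisions, not official release formats


def weight_gigabytes(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for name, params in PARAMS_BILLIONS.items():
    sizes = ", ".join(f"{bits}-bit: {weight_gigabytes(params, bits):.1f} GB"
                      for bits in BIT_WIDTHS)
    print(f"{name}: {sizes}")
```

For the MoE variant the estimate uses the total parameter count, on the assumption that all experts stay resident in memory even though only a subset runs at a time.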
What this means
Gemma 4 marks a shift toward production-ready open models that combine multimodal input and reasoning support at several scale points. The MoE variant gives teams a way to balance model capacity against inference latency. The release includes no cloud inference pricing, which is expected for open-weights models intended for self-hosted deployment. The 256K context window and a 66.4% score on MRCR v2 at 128K position these models as competitors to closed commercial alternatives for document analysis and extended reasoning tasks.