Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window
Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four distinct model sizes—E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B—are available under the Apache 2.0 license, with instruction-tuned and pre-trained variants.
Google DeepMind released the Gemma 4 family of open-source models today, introducing multimodal capabilities and significantly expanded context windows. The family includes four distinct model sizes, ranging from 2.3B to 31B parameters, all available under the Apache 2.0 license.
Model Specifications and Architectures
Gemma 4 employs both dense and Mixture-of-Experts (MoE) architectures:
Dense Models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window
- 31B: 30.7B parameters, 256K context window, 60 layers
MoE Model:
- 26B A4B: 25.2B total parameters with 3.8B active parameters, 256K context window, 8 active experts from 128 total
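The "8 active experts from 128 total" figure describes sparse routing: for each token, a router scores all experts and only the top few are evaluated. The article does not describe Gemma 4's actual router, so the following is a generic top-k routing sketch with random scores; the function name `route` and the softmax-over-selected weighting are illustrative assumptions.

```python
import math
import random

NUM_EXPERTS = 128   # total experts (from the spec above)
TOP_K = 8           # experts active per token (from the spec above)

def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

random.seed(0)
# Stand-in router scores for one token; a real router computes these from the hidden state.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)
print(len(selected), "of", NUM_EXPERTS, "experts run for this token")
```

Only the selected experts' feed-forward blocks would run for that token, which is why per-token compute tracks the active parameter count rather than the total.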
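The "E" in E2B/E4B denotes "effective parameters": these models use Per-Layer Embeddings (PLE) to improve on-device efficiency without adding layers or raising the effective parameter count, which is why the count with embeddings is much larger than the effective count. The "A" in 26B A4B denotes active parameters: only 3.8B of the 25.2B parameters are used per token, so the model runs at roughly the inference speed of a 4B model while keeping the capacity of a 26B one.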
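The gap between the naming and the full parameter counts is easy to quantify from the figures listed above. This short calculation is only arithmetic on the article's numbers; the helper name `embedding_share` is illustrative.

```python
# Figures in billions of parameters, copied from the spec list above.
SPECS = {
    "E2B": {"effective": 2.3, "with_embeddings": 5.1},
    "E4B": {"effective": 4.5, "with_embeddings": 8.0},
}
MOE_TOTAL, MOE_ACTIVE = 25.2, 3.8

def embedding_share(effective, with_embeddings):
    """Fraction of the full parameter count that lies outside the effective count."""
    return (with_embeddings - effective) / with_embeddings

for name, s in SPECS.items():
    share = embedding_share(s["effective"], s["with_embeddings"])
    print(f"{name}: {share:.0%} of parameters are not counted as effective")

active_fraction = MOE_ACTIVE / MOE_TOTAL
print(f"26B A4B: {active_fraction:.1%} of parameters active per token")
```

On these numbers, roughly half of the E2B and E4B parameter counts sit outside the effective figure, and the MoE model uses about 15% of its parameters per token.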
Multimodal Capabilities and Modalities
All four models process text and images with variable aspect ratios and resolutions. E2B and E4B additionally support:
- Audio: Native automatic speech recognition (ASR) and speech-to-translated-text across multiple languages
- Video: Frame sequence processing for video understanding
All models support interleaved multimodal input, allowing text and images to be freely mixed within prompts.
Benchmark Performance
Gemma 4 shows substantial improvements over Gemma 3 27B (no thinking mode):
| Benchmark | Gemma 4 31B | Gemma 4 26B A4B | Gemma 4 E4B | Gemma 3 27B |
|---|---|---|---|---|
| MMLU Pro | 85.2% | 82.6% | 69.4% | 67.6% |
| AIME 2026 | 89.2% | 88.3% | 42.5% | 20.8% |
| LiveCodeBench v6 | 80.0% | 77.1% | 52.0% | 29.1% |
| Codeforces Elo | 2150 | 1718 | 940 | 110 |
| GPQA Diamond | 84.3% | 82.3% | 58.6% | 42.4% |
| MMMLU | 88.4% | 86.3% | 76.6% | 70.7% |
| Vision MMMU Pro | 76.9% | 73.8% | 52.6% | 49.7% |
| MATH-Vision | 85.6% | 82.4% | 59.5% | 46.0% |
Coding gains are large across the family. The 31B model reaches a Codeforces Elo of 2150 against Gemma 3 27B's 110 and scores 80.0% on LiveCodeBench v6 against 29.1%. Even the much smaller E4B beats Gemma 3 27B on both, with an Elo of 940 and a LiveCodeBench score of 52.0%.
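The gaps over Gemma 3 27B can be read off the table directly. The snippet below simply subtracts the baseline column from a chosen model's column; the scores are copied from the table, and the `gains` helper is an illustrative name.

```python
# Scores copied from the benchmark table above (percent, except Codeforces rating points).
SCORES = {
    "MMLU Pro":         {"31B": 85.2, "26B A4B": 82.6, "E4B": 69.4, "Gemma 3 27B": 67.6},
    "AIME 2026":        {"31B": 89.2, "26B A4B": 88.3, "E4B": 42.5, "Gemma 3 27B": 20.8},
    "LiveCodeBench v6": {"31B": 80.0, "26B A4B": 77.1, "E4B": 52.0, "Gemma 3 27B": 29.1},
    "Codeforces":       {"31B": 2150, "26B A4B": 1718, "E4B": 940,  "Gemma 3 27B": 110},
    "GPQA Diamond":     {"31B": 84.3, "26B A4B": 82.3, "E4B": 58.6, "Gemma 3 27B": 42.4},
    "MMMLU":            {"31B": 88.4, "26B A4B": 86.3, "E4B": 76.6, "Gemma 3 27B": 70.7},
    "Vision MMMU Pro":  {"31B": 76.9, "26B A4B": 73.8, "E4B": 52.6, "Gemma 3 27B": 49.7},
    "MATH-Vision":      {"31B": 85.6, "26B A4B": 82.4, "E4B": 59.5, "Gemma 3 27B": 46.0},
}

def gains(model, baseline="Gemma 3 27B"):
    """Absolute difference (model minus baseline) for every benchmark."""
    return {bench: round(row[model] - row[baseline], 1) for bench, row in SCORES.items()}

for model in ("31B", "26B A4B", "E4B"):
    print(model, gains(model))
```

Differences are in percentage points for every row except Codeforces, which is in rating points and should not be compared with the others.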
Core Capabilities
All models feature:
- Reasoning/Thinking mode: Configurable step-by-step reasoning before generating answers
- Function calling: Native support for structured tool use and agentic workflows
- System prompt support: Native system role handling for structured conversations
- Multilingual: Pre-trained on 140+ languages, with 35+ languages supported
- Code generation: Full code completion, generation, and correction capabilities
Architecture and Efficiency
All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention (512-1024 tokens depending on model size) with full global attention. The final layer always uses global attention. For long-context optimization, global layers use unified Keys and Values with Proportional RoPE (p-RoPE).
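A toy model of the interleaving shows why it helps at long context: local layers only need to keep the most recent window of keys and values, while global layers keep everything. The article gives no local-to-global ratio, so the 5:1 pattern below is an assumption, as are the 1024-token window for the 31B model and the reading of "256K" as 256 × 1024 tokens. The sketch also ignores the unified Keys and Values and p-RoPE details.

```python
def layer_kinds(num_layers, locals_per_global=5):
    """Interleave local and global layers and force the final layer to be global.

    locals_per_global is an assumed ratio, not a figure from the article.
    """
    kinds = ["global" if (i + 1) % (locals_per_global + 1) == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"
    return kinds

def can_attend(query_pos, key_pos, kind, window):
    """Causal mask: global layers see all earlier tokens, local layers only the last `window`."""
    if key_pos > query_pos:
        return False
    if kind == "global":
        return True
    return query_pos - key_pos < window

def cached_positions(seq_len, kind, window):
    """Key/value positions a layer must keep for a sequence of seq_len tokens."""
    return seq_len if kind == "global" else min(seq_len, window)

kinds = layer_kinds(60)      # the 31B model has 60 layers
WINDOW = 1024                # assumed: upper end of the 512-1024 range
CONTEXT = 256 * 1024         # assumed reading of "256K" tokens
hybrid = sum(cached_positions(CONTEXT, k, WINDOW) for k in kinds)
full = CONTEXT * len(kinds)
print(f"{kinds.count('global')} global layers; "
      f"cached positions are {hybrid / full:.1%} of an all-global design")
```

Under these assumptions the cache is dominated by the global layers, so the saving depends almost entirely on how many layers are local.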
Vision encoders are approximately 150M parameters for smaller models and 550M for larger models. E2B and E4B include 300M-parameter audio encoders.
Availability and Deployment
All Gemma 4 models are available on Hugging Face with integration into the latest Transformers library. The smaller E2B and E4B models target mobile and edge devices, while 26B A4B and 31B target consumer GPUs and workstations. The MoE architecture makes 26B A4B particularly suitable for fast inference compared to the dense 31B variant.
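A rough way to match a model to a device is to estimate the memory its weights occupy at a given precision. The estimate below covers weights only (no KV cache, activations, or runtime overhead), uses the with-embeddings counts for E2B and E4B, and assumes the listed parameter counts are what gets loaded; the article does not say whether the vision and audio encoders are included in them.

```python
# Bytes per parameter at common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts in billions, from the spec list above.
PARAMS_BILLIONS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}

def weight_gb(params_billions, dtype):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[dtype]

for name, params in PARAMS_BILLIONS.items():
    row = ", ".join(f"{dtype}: {weight_gb(params, dtype):.1f} GB" for dtype in BYTES_PER_PARAM)
    print(f"{name:8s} {row}")
```

Note that the MoE model still has to hold all 25.2B parameters in memory; the 3.8B active figure reduces per-token compute, not the storage footprint.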
What This Means
Gemma 4 marks a significant step toward efficient, capable open multimodal models. Per-layer embeddings and the MoE variant give real deployment flexibility: E4B is sized for laptops and modern phones, while 26B A4B stays within about three points of the 31B model on every percentage-scored benchmark above at roughly 4B-equivalent inference speed. The 31B model's 89.2% on AIME 2026 and the large coding gains suggest these models compete meaningfully with closed-source offerings. Pre-training on 140+ languages and native audio and video handling address practical deployment requirements that many open models still lack.