Google DeepMind releases Gemma 4: multimodal models up to 31B parameters with 256K context
Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (25.2B total, 3.8B active), and 31B dense. All models support text and image input with 128K-256K context windows, reasoning modes, and native function calling for agentic workflows.
The family spans four sizes, from 2.3B effective parameters to 31B, and is available on Hugging Face under the Apache 2.0 license.
Model Specifications
The Gemma 4 lineup includes:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window, supports text, image, and audio
- 26B A4B: 25.2B total parameters with only 3.8B active during inference, 256K context window, supports text and image
- 31B: 30.7B parameters, 256K context window, supports text and image
The smaller models (E2B/E4B) use Per-Layer Embeddings (PLE) to reduce effective parameter counts while maintaining multilingual support across 140+ languages. The 26B A4B employs a Mixture-of-Experts architecture with 8 active experts selected from 128 total, enabling fast inference comparable to a 4B model despite 26B total parameters.
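To make the sparse-activation idea concrete, here is a minimal sketch of generic top-k expert routing using the 8-of-128 configuration quoted above. The article does not describe Gemma 4's actual router, so the softmax-over-selected-logits gating, the random logits, and the function name `route_token` are all assumptions for illustration only.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # experts active per token, per the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts and return {expert_index: gate_weight}.

    Gate weights are a softmax over the selected logits only, so they sum
    to 1. This is a generic top-k scheme, assumed here for illustration.
    """
    # Indices of the top_k largest router logits.
    chosen = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Stand-in router scores for one token (random, not real model output).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
gates = route_token(logits)

print(f"experts used: {len(gates)} of {NUM_EXPERTS}")
print(f"active parameter share: {3.8 / 25.2:.1%}")  # 3.8B active of 25.2B total
```

Only the selected experts run for a given token, which is why per-token compute tracks the 3.8B active parameters rather than the 25.2B total.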
Key Capabilities
All Gemma 4 models feature:
- Reasoning mode: Configurable thinking modes enabling step-by-step problem solving
- Multimodal input: Text and images with variable aspect ratio and resolution; video via frame sequences; audio (E2B/E4B only) for speech recognition (ASR) and speech translation
- Function calling: Native structured tool use for autonomous agent workflows
- Long context: 128K (E2B/E4B) or 256K (26B A4B/31B) token windows
- Coding support: Code generation, completion, and correction with notable benchmark improvements
- Native system prompts: Enhanced control over conversational behavior
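Function calling generally means the model emits a structured request that the host application executes. The sketch below shows that host-side step only. The article does not specify Gemma 4's tool-call format, so the JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are hypothetical.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would query an external service.
    return f"Sunny in {city}"


# Registry of tools the host application exposes to the model.
TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw: str) -> str:
    """Parse a JSON tool call and run the matching registered function."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Assumed model output; the real wire format may differ.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch_tool_call(model_output)
print(result)
```

In an agent loop, the returned string would be fed back to the model as the tool result before it produces its next turn.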
The architecture employs hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention on final layers, optimized with Proportional RoPE (p-RoPE) for long-context memory efficiency.
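The difference between local and global attention can be illustrated with attention masks. This is a toy sketch, not Gemma 4's implementation: the sequence length of 8 and window of 3 are chosen for readability (the article cites windows of 512-1024 tokens), and the function names are made up for this example.

```python
def sliding_window_mask(seq_len, window):
    """Causal local mask: query i may attend to key j when i - window < j <= i."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def causal_mask(seq_len):
    """Full (global) causal mask: query i may attend to every key j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count the query-key pairs the mask allows."""
    return sum(sum(row) for row in mask)


local = sliding_window_mask(8, 3)
full = causal_mask(8)

for row in local:
    print("".join("x" if allowed else "." for allowed in row))

print("local pairs:", attended_pairs(local))
print("global pairs:", attended_pairs(full))
```

With a fixed window, the number of allowed pairs grows linearly with sequence length instead of quadratically, which is the usual motivation for using local attention in most layers and reserving global attention for a few.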
Benchmark Performance
Instruction-tuned model evaluation shows:
31B Dense Model:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
26B A4B (MoE):
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces ELO: 1718
- GPQA Diamond: 82.3%
E4B:
- MMLU Pro: 69.4%
- LiveCodeBench v6: 52.0%
- Codeforces ELO: 940
Vision benchmarks show MMMU Pro scores of 76.9% (31B), 73.8% (26B A4B), and 52.6% (E4B). The 31B model achieved 66.4% on long-context needle-in-haystack evaluation at 128K tokens.
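As a quick way to compare the two larger models, the snippet below computes the gap between the 31B and 26B A4B scores using only the figures quoted above. It is plain arithmetic on the reported numbers, not a new measurement.

```python
# (31B dense, 26B A4B) scores in percent, as reported in the article.
scores = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "GPQA Diamond": (84.3, 82.3),
    "MMMU Pro": (76.9, 73.8),
}

# Percentage-point gap in favour of the dense model.
gaps = {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} points")

# Codeforces is reported as an ELO rating, so it is compared separately.
codeforces_gap = 2150 - 1718
print(f"Codeforces ELO: {codeforces_gap} rating points")
```

The percentage benchmarks differ by roughly one to three points, while the Codeforces rating shows a much wider gap of 432.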
Deployment Flexibility
Google positions Gemma 4 for diverse deployment scenarios: E2B and E4B for mobile and edge devices; 26B A4B for consumer GPUs and workstations balancing speed and capability via MoE; 31B for high-end servers requiring maximum performance. All models are available in both pre-trained and instruction-tuned variants.
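A rough way to reason about these tiers is to estimate the memory needed just to hold the weights. The sketch below assumes parameter counts from the article (using the with-embeddings totals for E2B and E4B) and standard bytes-per-parameter figures for common precisions. It ignores the KV cache, activations, and runtime overhead, so real requirements will be higher, and the article does not state which precisions Google ships.

```python
# Total parameters in billions, from the article.
PARAMS_BILLIONS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}

# Storage cost per parameter for common precisions (assumed, not from the article).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions, precision):
    """Weights-only footprint in GB: 1e9 params x bytes per param / 1e9."""
    return params_billions * BYTES_PER_PARAM[precision]


for model, params in PARAMS_BILLIONS.items():
    row = ", ".join(
        f"{precision}: {weight_gb(params, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{model:8s} {row}")
```

Note that a Mixture-of-Experts model typically needs all of its experts resident in memory, so the 26B A4B footprint is driven by its 25.2B total parameters even though only 3.8B are active per token.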
What This Means
Gemma 4 extends Google's open-model strategy to multimodal reasoning at multiple efficiency tiers. The 26B A4B model's sparse activation offers a compelling alternative to dense models: it stays within about three points of the 31B model on MMLU Pro, AIME 2026, LiveCodeBench v6, and GPQA Diamond while activating only 3.8B parameters per token, for a reported 6-7× speedup, though it trails more clearly on Codeforces ELO (1718 versus 2150). With 256K context windows and reasoning modes, Gemma 4 targets closed models in long-context and agentic use cases while keeping deployment options that range from phones to data centers. The Apache 2.0 license permits commercial use, subject to its attribution and notice requirements.