Google DeepMind releases Gemma 4: multimodal models up to 31B parameters with 256K context

TL;DR

Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (25.2B total, 3.8B active), and 31B dense. All models support text and image input with 128K-256K context windows, reasoning modes, and native function calling for agentic workflows.

April 2, 2026 · 6:20 PM

Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes ranging from 2.3B effective parameters to 31B parameters, available on Hugging Face under the Apache 2.0 license.

Model Specifications

The Gemma 4 lineup includes:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
E4B: 4.5B effective parameters (8B with embeddings), 128K context window, supports text, image, and audio
26B A4B: 25.2B total parameters with only 3.8B active during inference, 256K context window, supports text and image
31B: 30.7B parameters, 256K context window, supports text and image

The smaller models (E2B/E4B) use Per-Layer Embeddings (PLE) to reduce effective parameter counts while maintaining multilingual support across 140+ languages. The 26B A4B employs a Mixture-of-Experts architecture with 8 active experts selected from 128 total, enabling fast inference comparable to a 4B model despite 26B total parameters.
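The expert selection described above can be illustrated with a minimal top-k routing sketch. Only the counts (128 experts, 8 active) and the parameter totals come from the article. The router logits are random stand-ins, and renormalizing with a softmax over the selected logits is one common convention assumed here, not a confirmed detail of Gemma 4.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # experts active per token, per the article


def route_token(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only (an assumed convention).
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
# Random logits stand in for the output of a learned router layer.
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

# Share of the 25.2B total parameters that is active per token (3.8B).
active_fraction = 3.8 / 25.2
print(len(selected), f"{active_fraction:.1%}")
```

Each token touches only the selected experts, which is why per-token compute tracks the 3.8B active parameters rather than the 25.2B total.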

Key Capabilities

All Gemma 4 models feature:

Reasoning mode: Configurable thinking modes enabling step-by-step problem solving
Extended multimodality: Text and images with variable aspect ratio and resolution; video as frame sequences; audio (E2B/E4B only) for ASR and speech translation
Function calling: Native structured tool use for autonomous agent workflows
Long context: 128K (E2B/E4B) or 256K (26B A4B/31B) token windows
Coding support: Code generation, completion, and correction with notable benchmark improvements
Native system prompts: Enhanced control over conversational behavior
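To make the function-calling item above concrete, here is a minimal sketch of the application side of structured tool use: parsing a tool call and dispatching it to a local function. The JSON shape (`name`, `arguments`), the `get_weather` tool, and the `dispatch` helper are all hypothetical illustrations. The article does not specify Gemma 4's actual tool-call format.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool: returns canned data so the example runs offline."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping tool names to callables (illustrative only).
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> dict:
    """Parse a JSON tool call of the assumed shape and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# A string standing in for what a model might emit as a tool call.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
print(result)
```

In an agent loop, the returned dictionary would typically be serialized and passed back to the model as the tool result for its next turn.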

The architecture employs hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention on final layers, optimized with Proportional RoPE (p-RoPE) for long-context memory efficiency.
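The difference between the two attention types can be sketched with plain boolean masks. This is an illustration under stated assumptions: the masks are causal, the window is counted in tokens including the current one, and the helper names are hypothetical. The article does not give the exact layer pattern, and p-RoPE is not modeled here.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query q may attend to key k.

    Causal, and limited to the most recent `window` positions.
    """
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


def global_causal_mask(seq_len):
    """Full causal attention: every query sees all earlier positions."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


def cached_positions(context_len, window=None):
    """Key/value positions one layer must keep: bounded by the window if local."""
    return context_len if window is None else min(context_len, window)


local = sliding_window_mask(6, 3)
full = global_causal_mask(6)
print(attended_pairs(local), attended_pairs(full))
print(cached_positions(262144, window=1024), cached_positions(262144))
```

A local layer's cache stays bounded by its window regardless of context length, while a global layer's cache grows with the full context. That is the memory trade-off the hybrid design targets.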

Benchmark Performance

Reported results for the instruction-tuned models:

31B Dense Model:

MMLU Pro: 85.2%
AIME 2026 (no tools): 89.2%
LiveCodeBench v6: 80.0%
Codeforces Elo: 2150
GPQA Diamond: 84.3%

26B A4B (MoE):

MMLU Pro: 82.6%
AIME 2026 (no tools): 88.3%
LiveCodeBench v6: 77.1%
Codeforces Elo: 1718
GPQA Diamond: 82.3%

E4B:

MMLU Pro: 69.4%
LiveCodeBench v6: 52.0%
Codeforces Elo: 940

Vision benchmarks show MMMU Pro scores of 76.9% (31B), 73.8% (26B A4B), and 52.6% (E4B). The 31B model achieved 66.4% on long-context needle-in-haystack evaluation at 128K tokens.

Deployment Flexibility

Google positions Gemma 4 for diverse deployment scenarios: E2B and E4B for mobile and edge devices; 26B A4B for consumer GPUs and workstations balancing speed and capability via MoE; 31B for high-end servers requiring maximum performance. All models are available in both pre-trained and instruction-tuned variants.
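A rough way to relate these tiers to hardware is to estimate the memory needed for the weights alone. The sketch below uses the parameter counts quoted in the article (including embeddings for E2B/E4B). The bytes-per-parameter values are assumptions: 2 bytes for 16-bit weights and 0.5 bytes for a hypothetical 4-bit quantization. It ignores the KV cache, activations, and runtime overhead, so real requirements will be higher.

```python
# Stored parameter counts in billions, as quoted in the article.
PARAMS_BILLIONS = {
    "E2B": 5.1,       # with embeddings
    "E4B": 8.0,       # with embeddings
    "26B A4B": 25.2,  # all experts are stored, even though 3.8B are active
    "31B": 30.7,
}


def weight_memory_gb(params_billions, bytes_per_param=2.0):
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billions * bytes_per_param


for name, params in PARAMS_BILLIONS.items():
    fp16 = weight_memory_gb(params)         # assumed 16-bit weights
    int4 = weight_memory_gb(params, 0.5)    # assumed 4-bit quantization
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Note that the MoE model saves compute per token, not storage: its weight footprint follows the 25.2B total, not the 3.8B active parameters.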

What This Means

Gemma 4 extends Google's open-model strategy to multimodal reasoning at multiple efficiency tiers. The 26B A4B model's sparse activation offers a compelling alternative to dense models, coming within a few points of the 31B model on most of the reported benchmarks while running 6-7× faster. With 256K context windows and reasoning modes, Gemma 4 is positioned against closed models in long-context and agentic use cases while remaining deployable from phones to data centers. The Apache 2.0 license permits commercial use, subject to its attribution and notice requirements.

Source: huggingface.co
