Google DeepMind releases Gemma 4 with 4 model sizes, 256K context, and multimodal reasoning

TL;DR

Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (3.8B active), and 31B (30.7B parameters). All models support text and image input with 128K-256K context windows, while E2B and E4B add native audio capabilities and reasoning modes across 140+ languages.

April 2, 2026 · 8:50 PM3 min read

Gemma 4 E2B Instruction-Tuned — Quick Specs

Context window128K tokens

Compare Gemma 4 E2B Instruction-Tuned with other models →

Google DeepMind Releases Gemma 4: Four Open-Weights Models with Multimodal and Reasoning Capabilities

Google DeepMind released Gemma 4, an open-weights model family spanning four sizes optimized for deployment from mobile devices to high-end servers. The release includes both dense and Mixture-of-Experts variants under the Apache 2.0 license.

Model Specifications

The Gemma 4 family comprises:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
E4B: 4.5B effective parameters (8B with embeddings), 128K context window
26B A4B: 3.8B active parameters out of 25.2B total (MoE architecture), 256K context window
31B Dense: 30.7B parameters, 256K context window

The "E" designation indicates effective parameters achieved through Per-Layer Embeddings (PLE), while "A" denotes active parameters in the MoE variant. This architecture allows the 26B A4B to run nearly as fast as a 4B model during inference while maintaining frontier-level performance.

Multimodal and Reasoning Capabilities

All models handle text and image input with variable aspect ratio and resolution support. E2B and E4B add native audio support including automatic speech recognition (ASR) and speech-to-translated-text translation. All models include configurable thinking modes for step-by-step reasoning and support native function calling for agentic workflows.

The models support 140+ languages in pre-training with 35+ languages confirmed for downstream tasks.

Benchmark Performance

Gemma 4 31B achieved:

MMLU Pro: 85.2%
AIME 2026 (no tools): 89.2%
LiveCodeBench v6: 80.0%
Codeforces ELO: 2150
GPQA Diamond: 84.3%
Vision MMMU Pro: 76.9%
Long Context (MRCR v2, 8 needle @ 128K): 66.4%

The 26B A4B MoE variant tracked closely behind: MMLU Pro 82.6%, AIME 2026 88.3%, LiveCodeBench 77.1%, Codeforces ELO 1718, and GPQA Diamond 82.3%.

Smaller models show proportional scaling: E4B achieved MMLU Pro 69.4% and GPQA Diamond 58.6%, while E2B reached 60.0% and 43.4% respectively.

Technical Architecture

Gemma 4 employs a hybrid attention mechanism combining local sliding window attention (512-1024 tokens depending on size) with global full attention in the final layer. This balances computational efficiency with long-context awareness. Global layers use unified Keys and Values with Proportional RoPE (p-RoPE) for memory optimization.

Vision encoders add ~150M parameters to E2B/E4B and ~550M to larger models. Audio encoders add ~300M parameters to E2B and E4B only.

Deployment and Availability

Models are available via Hugging Face with full Transformers library support. The smaller E2B and E4B models target mobile phones and laptops, while 26B A4B and 31B Dense scale to consumer GPUs, workstations, and servers. All models include native system prompt support for structured conversations.

What This Means

Gemma 4 significantly expands Google's open-weights presence across the model size spectrum. The efficient parameter design—particularly effective parameters in E2B/E4B and active parameters in 26B A4B—enables deployment scenarios previously requiring much larger models. The reasoning modes and multimodal capabilities position Gemma 4 for complex reasoning tasks and agent applications without proprietary API dependencies. Performance metrics indicate competitive scaling within size classes, though 31B-class models from other vendors maintain leads on reasoning benchmarks. The extended context window (256K on larger models) addresses enterprise document processing and long-context reasoning requirements.

Source: huggingface.co ↗

gemma-4 google-deepmind open-weights multimodal reasoning code-generation mixture-of-experts moe

model releaseJuly 4, 2026

Mistral releases Leanstral 1.5: 119B parameter open-source model for Lean 4 proof assistance

Mistral AI has released Leanstral 1.5, an open-source 119B parameter mixture-of-experts model designed specifically for Lean 4 proof assistance. The model features 128 experts with 4 active per token (6.5B activated parameters), a 256k token context window, and multimodal input capabilities.

model releaseJune 29, 2026

DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3

DeepSeek has released two new MoE models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one million token context window and use a hybrid attention architecture that requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.

model releaseJune 27, 2026

DeepSeek Releases V4-Pro with 1.6T Parameters, 1M Token Context at 27% Inference Cost of V3

DeepSeek has released two Mixture-of-Experts models: V4-Pro with 1.6 trillion parameters (49B activated) and V4-Flash with 284B parameters (13B activated), both supporting 1 million token context windows. V4-Pro requires only 27% of inference FLOPs and 10% of KV cache compared to V3.2 at 1M token context, trained on over 32 trillion tokens.