Google DeepMind releases Gemma 4 with multimodal reasoning and up to 256K context window

TL;DR

Google DeepMind released Gemma 4, a multimodal model family supporting text, images, video, and audio with context windows up to 256K tokens. The release includes four sizes (E2B, E4B, 26B A4B, and 31B) designed for deployment from mobile devices to servers. The 31B dense model achieves 85.2% on MMLU Pro and 89.2% on AIME 2026.

April 4, 2026 · 12:50 AM · 3 min read

Google DeepMind Launches Gemma 4 with Multimodal Reasoning Capabilities

Google DeepMind released Gemma 4, a family of open-weight multimodal models supporting text, images, video, and audio inputs with reasoning modes and context windows up to 256K tokens.

Model Lineup and Architecture

Gemma 4 includes four distinct sizes:

Dense Models:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
E4B: 4.5B effective parameters (8B with embeddings), 128K context window
31B: 30.7B parameters, 256K context window

Mixture-of-Experts:

26B A4B: 25.2B total parameters, 3.8B active parameters, 256K context window, 8 active experts out of 128 total

The "E" designation stands for "effective" parameters, a figure made possible by Per-Layer Embeddings (PLE), in which each decoder layer keeps its own small embedding table for quick lookups. The "A" in 26B A4B denotes active parameters: only 3.8B of the 25.2B total parameters activate during inference, giving inference speed close to that of a 4B model at 26B scale.
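As a rough check on the naming, the parameter figures quoted above can be turned into ratios. The sketch below uses only numbers from this article; the function and variable names are illustrative, and it assumes the gap between the "effective" and "with embeddings" counts is entirely embedding parameters.

```python
# Parameter counts in billions, copied from the lineup above.
EFFECTIVE_MODELS = {
    "E2B": {"effective": 2.3, "with_embeddings": 5.1},
    "E4B": {"effective": 4.5, "with_embeddings": 8.0},
}
MOE_26B_A4B = {"total": 25.2, "active": 3.8, "experts_total": 128, "experts_active": 8}


def embedding_share(effective, with_embeddings):
    """Fraction of the full count that lies outside the 'effective' parameters.

    Assumption: the whole difference is embedding parameters.
    """
    return (with_embeddings - effective) / with_embeddings


def active_fraction(active, total):
    """Fraction of MoE parameters that are active during inference."""
    return active / total


for name, counts in EFFECTIVE_MODELS.items():
    print(f"{name}: {embedding_share(**counts):.1%} of parameters are outside the effective count")

moe = MOE_26B_A4B
print(f"26B A4B: {active_fraction(moe['active'], moe['total']):.1%} of parameters active")
print(f"26B A4B: {moe['experts_active'] / moe['experts_total']:.2%} of experts active")
```

On these figures roughly 15% of the 26B A4B parameters, and 8 of 128 experts, take part in inference, which is consistent with the "near-4B speed" description.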

All models employ hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention in the final layer. Global layers use unified Keys and Values with Proportional RoPE for memory optimization during long-context processing.
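To illustrate the difference between the two attention patterns, the toy sketch below builds a causal sliding-window mask and a full causal mask in plain Python. It is not Gemma 4's implementation: the sizes are tiny for readability, and the convention that a window of size w includes the current token is an assumption.

```python
def causal_mask(seq_len):
    """Global causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local causal attention: token i attends to the last `window` tokens,
    itself included (assumed convention)."""
    return [[i - window < j <= i for j in range(seq_len)] for i in range(seq_len)]


def attended_pairs(mask):
    """Count (query, key) pairs allowed by a mask; a proxy for attention cost."""
    return sum(sum(row) for row in mask)


# Toy sizes; the article cites windows of 512-1024 tokens.
seq_len, window = 8, 3
local = sliding_window_mask(seq_len, window)
full = causal_mask(seq_len)

for row in local:
    print("".join("x" if allowed else "." for allowed in row))

print(attended_pairs(local), "local pairs vs", attended_pairs(full), "global pairs")
```

The local mask grows linearly with sequence length for a fixed window, while the global mask grows quadratically, which is why long-context models tend to restrict how many layers use full attention.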

Multimodal and Reasoning Capabilities

Gemma 4 handles:

Text and Images: All models support variable aspect ratio and resolution image processing
Video: Frame sequence analysis available across the family
Audio: Native ASR and speech-to-translated-text on E2B and E4B models only
Reasoning: Built-in configurable thinking modes enabling step-by-step problem solving
Function Calling: Native structured tool use for agentic workflows
System Prompts: Native system role support for controlled conversations
Multilingual: Pre-trained on 140+ languages, with out-of-the-box support for 35+ languages

Benchmark Performance

Instruction-tuned benchmark results:

Benchmark	31B	26B A4B	E4B	E2B
MMLU Pro	85.2%	82.6%	69.4%	60.0%
AIME 2026 (no tools)	89.2%	88.3%	42.5%	37.5%
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%
Codeforces Elo	2150	1718	940	633
GPQA Diamond	84.3%	82.3%	58.6%	43.4%
MMMLU (Multilingual)	88.4%	86.3%	76.6%	67.4%
Vision MMMU Pro	76.9%	73.8%	52.6%	44.2%
MATH-Vision	85.6%	82.4%	59.5%	52.4%
BigBench Extra Hard	74.4%	64.8%	33.1%	21.9%

For long-context evaluation (MRCR v2, 128K tokens with 8 needles), the 31B model achieved 66.4% average accuracy.
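The size of the gap between models is easier to read as percentage-point differences. The short sketch below copies four rows from the table above and computes those differences; the `gap` helper and its defaults are illustrative, not part of any benchmark tooling.

```python
# Scores in percent, copied from the instruction-tuned table above.
SCORES = {
    "MMLU Pro": {"31B": 85.2, "26B A4B": 82.6, "E4B": 69.4, "E2B": 60.0},
    "AIME 2026 (no tools)": {"31B": 89.2, "26B A4B": 88.3, "E4B": 42.5, "E2B": 37.5},
    "GPQA Diamond": {"31B": 84.3, "26B A4B": 82.3, "E4B": 58.6, "E2B": 43.4},
    "BigBench Extra Hard": {"31B": 74.4, "26B A4B": 64.8, "E4B": 33.1, "E2B": 21.9},
}


def gap(benchmark, big="31B", small="E4B"):
    """Percentage-point lead of `big` over `small` on one benchmark."""
    row = SCORES[benchmark]
    return round(row[big] - row[small], 1)


for name in SCORES:
    print(
        f"{name}: 31B leads 26B A4B by {gap(name, small='26B A4B')} points, "
        f"E4B by {gap(name)} points"
    )
```

On these rows the 26B A4B stays within about 10 points of the 31B, while the E4B trails by roughly 16 to 47 points.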

Deployment and Availability

The models are released with open weights under the Apache 2.0 license. Unsloth offers optimized 4-bit GGUF quantized versions that enable local execution on laptops and mobile devices. All models are available through the Hugging Face Transformers library and are compatible with Unsloth Studio for fine-tuning and inference.

The family is designed for diverse deployment scenarios: E2B and E4B for edge/mobile, 26B A4B for consumer GPUs, and 31B for workstations and servers.
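A back-of-the-envelope estimate shows why 4-bit quantization matters for these deployment targets. The sketch below computes weight storage only, as parameters times bits per weight. It ignores quantization metadata, the KV cache, and activations, so real files and memory use will be larger. Using the "with embeddings" counts for E2B and E4B is an assumption.

```python
# Total parameter counts in billions, from the lineup in this article.
# Assumption: E2B and E4B are stored with their embeddings (5.1B and 8B).
PARAMS_BILLIONS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}


def weight_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes).

    Weights only: no quantization scales, KV cache, or activations.
    """
    return params_billions * bits_per_weight / 8


for name, params in PARAMS_BILLIONS.items():
    print(
        f"{name}: ~{weight_gb(params, 4):.1f} GB at 4-bit "
        f"vs ~{weight_gb(params, 16):.1f} GB at 16-bit"
    )
```

Note that the 26B A4B must still store all 25.2B parameters even though only 3.8B are active, so its memory footprint is closer to the 31B than to a 4B model.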

What This Means

Gemma 4 consolidates multimodal input, configurable reasoning, and function calling in a single open-weight family. The efficiency-focused variants (E2B, E4B, 26B A4B) extend deployment beyond high-end data centers, while the 31B variant posts the family's strongest reasoning and code results (85.2% on MMLU Pro, 89.2% on AIME 2026, 80.0% on LiveCodeBench v6). Native reasoning modes and function calling target the growing demand for agentic workflows. The smaller models, however, lose substantial ground on advanced reasoning: the E4B scores 69.4% on MMLU Pro against the 31B's 85.2%, and 42.5% on AIME 2026 against 89.2%, a size-dependent trade-off to weigh for edge deployments.

Source: huggingface.co
