Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window

TL;DR

Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four distinct model sizes—E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B—are available under the Apache 2.0 license, with instruction-tuned and pre-trained variants.

April 2, 2026 · 7:05 PM3 min read

Gemma 4 E4B Instruction-Tuned — Quick Specs

Context window128K tokens

Compare Gemma 4 E4B Instruction-Tuned with other models →

Google DeepMind Releases Gemma 4: Open-Source Multimodal Models with Extended Context

Google DeepMind released the Gemma 4 family of open-source models today, introducing multimodal capabilities and significantly expanded context windows. The family includes four distinct model sizes, ranging from 2.3B to 31B parameters, all available under the Apache 2.0 license.

Model Specifications and Architectures

Gemma 4 employs both dense and Mixture-of-Experts (MoE) architectures:

Dense Models:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
E4B: 4.5B effective parameters (8B with embeddings), 128K context window
31B: 30.7B parameters, 256K context window, 60 layers

MoE Model:

26B A4B: 25.2B total parameters with 3.8B active parameters, 256K context window, 8 active experts from 128 total

The "E" in E2B/E4B denotes "effective parameters"—the models use Per-Layer Embeddings (PLE) to maximize efficiency on-device without increasing layer or parameter counts. The "A" in 26B A4B indicates active parameters, allowing this model to match inference speed of a 4B model while maintaining 26B total capacity.

Multimodal Capabilities and Modalities

All four models process text and images with variable aspect ratios and resolutions. E2B and E4B additionally support:

Audio: Native automatic speech recognition (ASR) and speech-to-translated-text across multiple languages
Video: Frame sequence processing for video understanding

All models support interleaved multimodal input, allowing text and images to be freely mixed within prompts.

Benchmark Performance

Gemma 4 shows substantial improvements over Gemma 3 27B (no thinking mode):

Benchmark	Gemma 4 31B	Gemma 4 26B A4B	Gemma 4 E4B	Gemma 3 27B
MMLU Pro	85.2%	82.6%	69.4%	67.6%
AIME 2026	89.2%	88.3%	42.5%	20.8%
LiveCodeBench v6	80.0%	77.1%	52.0%	29.1%
Codeforces ELO	2150	1718	940	110
GPQA Diamond	84.3%	82.3%	58.6%	42.4%
MMMLU	88.4%	86.3%	76.6%	70.7%
Vision MMMU Pro	76.9%	73.8%	52.6%	49.7%
MATH-Vision	85.6%	82.4%	59.5%	46.0%

The E4B model demonstrates the most significant coding improvements, with a Codeforces ELO of 940 compared to Gemma 3's 110, and LiveCodeBench performance of 52.0% versus 29.1%.

Core Capabilities

All models feature:

Reasoning/Thinking mode: Configurable step-by-step reasoning before generating answers
Function calling: Native support for structured tool use and agentic workflows
System prompt support: Native system role handling for structured conversations
Multilingual: Pre-trained on 140+ languages with 35+ language support
Code generation: Full code completion, generation, and correction capabilities

Architecture and Efficiency

All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention (512-1024 tokens depending on model size) with full global attention. The final layer always uses global attention. For long-context optimization, global layers use unified Keys and Values with Proportional RoPE (p-RoPE).

Vision encoders are approximately 150M parameters for smaller models and 550M for larger models. E2B and E4B include 300M-parameter audio encoders.

Availability and Deployment

All Gemma 4 models are available on Hugging Face with integration into the latest Transformers library. The smaller E2B and E4B models target mobile and edge devices, while 26B A4B and 31B target consumer GPUs and workstations. The MoE architecture makes 26B A4B particularly suitable for fast inference compared to the dense 31B variant.

What This Means

Gemma 4 represents a significant shift toward efficient, capable open-source multimodal models. The per-layer embedding approach and MoE variants provide genuine deployment flexibility—the E4B model can run on laptops and modern phones while the 26B A4B delivers frontier performance at 4B-equivalent inference speed. The 89.2% AIME score on the 31B model and substantial coding improvements suggest these models compete meaningfully with closed-source offerings. Multilingual support (140+ languages) and native audio/video handling address practical deployment requirements that many open models still lack.

Source: huggingface.co ↗

google-deepmind open-source multimodal gemma-4 text-generation vision audio moe

model releaseJuly 4, 2026

Mistral releases Leanstral 1.5: 119B parameter open-source model for Lean 4 proof assistance

Mistral AI has released Leanstral 1.5, an open-source 119B parameter mixture-of-experts model designed specifically for Lean 4 proof assistance. The model features 128 experts with 4 active per token (6.5B activated parameters), a 256k token context window, and multimodal input capabilities.

model releaseJuly 1, 2026

Portugal releases Amália, open-source 9B parameter AI model trained on European Portuguese

Portugal has released Amália, its first national AI model trained specifically for European Portuguese. Built on EuroLLM-9B with 9 billion parameters, the model is fully open-source with weights, datasets, and code published under an open license. The government has committed €5.5m in initial funding through 2027.

model releaseJune 29, 2026

DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3

DeepSeek has released two new MoE models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one million token context window and use a hybrid attention architecture that requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.