model releaseGoogle DeepMind

Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window

TL;DR

Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four distinct model sizes—E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B—are available under the Apache 2.0 license, with instruction-tuned and pre-trained variants.

3 min read
0

Google DeepMind Releases Gemma 4: Open-Source Multimodal Models with Extended Context

Google DeepMind released the Gemma 4 family of open-source models today, introducing multimodal capabilities and significantly expanded context windows. The family includes four distinct model sizes, ranging from 2.3B to 31B parameters, all available under the Apache 2.0 license.

Model Specifications and Architectures

Gemma 4 employs both dense and Mixture-of-Experts (MoE) architectures:

Dense Models:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context window
  • 31B: 30.7B parameters, 256K context window, 60 layers

MoE Model:

  • 26B A4B: 25.2B total parameters with 3.8B active parameters, 256K context window, 8 active experts from 128 total

The "E" in E2B/E4B denotes "effective parameters"—the models use Per-Layer Embeddings (PLE) to maximize efficiency on-device without increasing layer or parameter counts. The "A" in 26B A4B indicates active parameters, allowing this model to match inference speed of a 4B model while maintaining 26B total capacity.

Multimodal Capabilities and Modalities

All four models process text and images with variable aspect ratios and resolutions. E2B and E4B additionally support:

  • Audio: Native automatic speech recognition (ASR) and speech-to-translated-text across multiple languages
  • Video: Frame sequence processing for video understanding

All models support interleaved multimodal input, allowing text and images to be freely mixed within prompts.

Benchmark Performance

Gemma 4 shows substantial improvements over Gemma 3 27B (no thinking mode):

Benchmark Gemma 4 31B Gemma 4 26B A4B Gemma 4 E4B Gemma 3 27B
MMLU Pro 85.2% 82.6% 69.4% 67.6%
AIME 2026 89.2% 88.3% 42.5% 20.8%
LiveCodeBench v6 80.0% 77.1% 52.0% 29.1%
Codeforces ELO 2150 1718 940 110
GPQA Diamond 84.3% 82.3% 58.6% 42.4%
MMMLU 88.4% 86.3% 76.6% 70.7%
Vision MMMU Pro 76.9% 73.8% 52.6% 49.7%
MATH-Vision 85.6% 82.4% 59.5% 46.0%

The E4B model demonstrates the most significant coding improvements, with a Codeforces ELO of 940 compared to Gemma 3's 110, and LiveCodeBench performance of 52.0% versus 29.1%.

Core Capabilities

All models feature:

  • Reasoning/Thinking mode: Configurable step-by-step reasoning before generating answers
  • Function calling: Native support for structured tool use and agentic workflows
  • System prompt support: Native system role handling for structured conversations
  • Multilingual: Pre-trained on 140+ languages with 35+ language support
  • Code generation: Full code completion, generation, and correction capabilities

Architecture and Efficiency

All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention (512-1024 tokens depending on model size) with full global attention. The final layer always uses global attention. For long-context optimization, global layers use unified Keys and Values with Proportional RoPE (p-RoPE).

Vision encoders are approximately 150M parameters for smaller models and 550M for larger models. E2B and E4B include 300M-parameter audio encoders.

Availability and Deployment

All Gemma 4 models are available on Hugging Face with integration into the latest Transformers library. The smaller E2B and E4B models target mobile and edge devices, while 26B A4B and 31B target consumer GPUs and workstations. The MoE architecture makes 26B A4B particularly suitable for fast inference compared to the dense 31B variant.

What This Means

Gemma 4 represents a significant shift toward efficient, capable open-source multimodal models. The per-layer embedding approach and MoE variants provide genuine deployment flexibility—the E4B model can run on laptops and modern phones while the 26B A4B delivers frontier performance at 4B-equivalent inference speed. The 89.2% AIME score on the 31B model and substantial coding improvements suggest these models compete meaningfully with closed-source offerings. Multilingual support (140+ languages) and native audio/video handling address practical deployment requirements that many open models still lack.

Related Articles

model release

Mistral releases Leanstral 1.5: 119B parameter open-source model for Lean 4 proof assistance

Mistral AI has released Leanstral 1.5, an open-source 119B parameter mixture-of-experts model designed specifically for Lean 4 proof assistance. The model features 128 experts with 4 active per token (6.5B activated parameters), a 256k token context window, and multimodal input capabilities.

model release

Portugal releases Amália, open-source 9B parameter AI model trained on European Portuguese

Portugal has released Amália, its first national AI model trained specifically for European Portuguese. Built on EuroLLM-9B with 9 billion parameters, the model is fully open-source with weights, datasets, and code published under an open license. The government has committed €5.5m in initial funding through 2027.

model release

DeepSeek Releases V4 Models: 1M Context Window, 90% Less KV Cache Than V3

DeepSeek has released two new MoE models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both models support a one million token context window and use a hybrid attention architecture that requires only 27% of single-token inference FLOPs and 10% of KV cache compared to DeepSeek-V3.2.

model release

NVIDIA releases Nemotron-Labs-TwoTower-30B: block-wise diffusion model claims 2.42× faster generation at 98.7% baseline

NVIDIA released Nemotron-Labs-TwoTower-30B-A3B-Base-BF16, a block-wise diffusion language model that generates text by denoising blocks of tokens in parallel rather than sequentially. According to NVIDIA, the model achieves 2.42× the wall-clock generation throughput of its autoregressive baseline while retaining 98.7% of aggregate benchmark quality.

Comments

Loading...

Gemma 4 Models Released: Open-Source AI with Multimodal Support | TPS