model releaseGoogle DeepMind

Google DeepMind releases Gemma 4 open models with multimodal capabilities and 256K context window

TL;DR

Google DeepMind released the Gemma 4 family of open-source models with multimodal capabilities (text, image, audio, video) and context windows up to 256K tokens. Four distinct model sizes—E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B—are available under the Apache 2.0 license, with instruction-tuned and pre-trained variants.

3 min read
0

Google DeepMind Releases Gemma 4: Open-Source Multimodal Models with Extended Context

Google DeepMind released the Gemma 4 family of open-source models today, introducing multimodal capabilities and significantly expanded context windows. The family includes four distinct model sizes, ranging from 2.3B to 31B parameters, all available under the Apache 2.0 license.

Model Specifications and Architectures

Gemma 4 employs both dense and Mixture-of-Experts (MoE) architectures:

Dense Models:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context window
  • 31B: 30.7B parameters, 256K context window, 60 layers

MoE Model:

  • 26B A4B: 25.2B total parameters with 3.8B active parameters, 256K context window, 8 active experts from 128 total

The "E" in E2B/E4B denotes "effective parameters"—the models use Per-Layer Embeddings (PLE) to maximize efficiency on-device without increasing layer or parameter counts. The "A" in 26B A4B indicates active parameters, allowing this model to match inference speed of a 4B model while maintaining 26B total capacity.

Multimodal Capabilities and Modalities

All four models process text and images with variable aspect ratios and resolutions. E2B and E4B additionally support:

  • Audio: Native automatic speech recognition (ASR) and speech-to-translated-text across multiple languages
  • Video: Frame sequence processing for video understanding

All models support interleaved multimodal input, allowing text and images to be freely mixed within prompts.

Benchmark Performance

Gemma 4 shows substantial improvements over Gemma 3 27B (no thinking mode):

Benchmark Gemma 4 31B Gemma 4 26B A4B Gemma 4 E4B Gemma 3 27B
MMLU Pro 85.2% 82.6% 69.4% 67.6%
AIME 2026 89.2% 88.3% 42.5% 20.8%
LiveCodeBench v6 80.0% 77.1% 52.0% 29.1%
Codeforces ELO 2150 1718 940 110
GPQA Diamond 84.3% 82.3% 58.6% 42.4%
MMMLU 88.4% 86.3% 76.6% 70.7%
Vision MMMU Pro 76.9% 73.8% 52.6% 49.7%
MATH-Vision 85.6% 82.4% 59.5% 46.0%

The E4B model demonstrates the most significant coding improvements, with a Codeforces ELO of 940 compared to Gemma 3's 110, and LiveCodeBench performance of 52.0% versus 29.1%.

Core Capabilities

All models feature:

  • Reasoning/Thinking mode: Configurable step-by-step reasoning before generating answers
  • Function calling: Native support for structured tool use and agentic workflows
  • System prompt support: Native system role handling for structured conversations
  • Multilingual: Pre-trained on 140+ languages with 35+ language support
  • Code generation: Full code completion, generation, and correction capabilities

Architecture and Efficiency

All Gemma 4 models employ a hybrid attention mechanism that interleaves local sliding window attention (512-1024 tokens depending on model size) with full global attention. The final layer always uses global attention. For long-context optimization, global layers use unified Keys and Values with Proportional RoPE (p-RoPE).

Vision encoders are approximately 150M parameters for smaller models and 550M for larger models. E2B and E4B include 300M-parameter audio encoders.

Availability and Deployment

All Gemma 4 models are available on Hugging Face with integration into the latest Transformers library. The smaller E2B and E4B models target mobile and edge devices, while 26B A4B and 31B target consumer GPUs and workstations. The MoE architecture makes 26B A4B particularly suitable for fast inference compared to the dense 31B variant.

What This Means

Gemma 4 represents a significant shift toward efficient, capable open-source multimodal models. The per-layer embedding approach and MoE variants provide genuine deployment flexibility—the E4B model can run on laptops and modern phones while the 26B A4B delivers frontier performance at 4B-equivalent inference speed. The 89.2% AIME score on the 31B model and substantial coding improvements suggest these models compete meaningfully with closed-source offerings. Multilingual support (140+ languages) and native audio/video handling address practical deployment requirements that many open models still lack.

Related Articles

model release

Google releases Gemini Omni Flash video generation model with conversational editing, withholds speech synthesis

Google DeepMind released Gemini Omni Flash, the first model in its new Omni family that generates and edits video from image, audio, video, and text inputs. The model is rolling out to Gemini app subscribers and YouTube Shorts with a 10-second clip limit, while speech-editing capabilities remain withheld pending safety testing.

model release

Google releases Gemini 3.5 Flash with 4x faster output and agentic capabilities, 3.5 Pro coming June

Google released Gemini 3.5 Flash today with 4x faster output token generation than competing frontier models while surpassing Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. The company announced Gemini 3.5 Pro will launch next month and introduced Gemini Omni, a new multimodal series that outputs video.

product update

Google DeepMind connects Genie world model to 280 billion Street View images, Waymo already using for self-driving train

Google DeepMind has integrated its Genie world model with Street View's 280 billion images spanning 110 countries, enabling users to explore AI-generated simulations of real locations. Waymo is already using Genie 3 to train self-driving cars on rare scenarios like tornadoes and unexpected obstacles.

model release

Google launches Gemini 3.5 Flash and new Omni multimodal AI family at I/O 2026

Google launched Gemini 3.5 Flash today as the default model for its Gemini app and AI Mode in Search, with Gemini 3.5 Pro following next month. The company also introduced Gemini Omni, a new multimodal AI family capable of generating video from text, photos, video, and audio inputs.

Comments

Loading...