Google DeepMind releases Gemma 4 open models with up to 256K context and multimodal reasoning

TL;DR

Google DeepMind has released Gemma 4, an open-weights model family in four sizes (2.3B effective to 31B parameters) with multimodal capabilities covering text, images, video, and audio. The 26B A4B variant uses a mixture-of-experts design that activates about 3.8B of its 25.2B parameters per token, and it supports a 256K-token context window and native reasoning modes.

Google DeepMind has released Gemma 4, an open-weights model family spanning four sizes designed for deployment from mobile devices to servers. The models feature multimodal capabilities, extended context windows up to 256K tokens, and built-in reasoning modes.

Model Lineup and Architecture

Gemma 4 ships in four variants:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context
  • 26B A4B: 3.8B active parameters (25.2B total), 256K context, mixture-of-experts
  • 31B: 30.7B dense parameters, 256K context

The smaller E2B and E4B models use Per-Layer Embeddings (PLE) to maximize parameter efficiency for on-device inference. The 26B A4B employs a mixture-of-experts architecture with 8 active experts selected from 128 total, enabling inference speeds comparable to 4B models while maintaining larger model capacity.
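To make the "8 of 128" routing concrete, here is a minimal sketch of generic top-k expert gating. It is not Gemma 4's actual router, which the release summary does not describe; the softmax-over-selected-experts weighting, the random logits, and the function name route_token are assumptions made purely for illustration. Only the expert counts and parameter totals come from the figures above.

```python
import math
import random

NUM_EXPERTS = 128  # total experts per MoE layer (figure from the article)
TOP_K = 8          # experts activated per token (figure from the article)


def route_token(router_logits, k=TOP_K):
    """Generic top-k gating: keep the k highest-scoring experts and
    softmax-normalize their scores into mixing weights.
    This is a textbook scheme, assumed here, not Gemma 4's confirmed router."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token (random stand-ins).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)

# Share of the model's weights touched per token, from the published totals.
active_fraction = 3.8 / 25.2

print(f"experts used per token: {len(selected)} of {NUM_EXPERTS}")
print(f"active parameter share: {active_fraction:.1%}")
```

Only the selected experts run their feed-forward computation for a given token, which is why per-token compute tracks the 3.8B active figure rather than the 25.2B total.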

All models use a hybrid attention mechanism combining local sliding window attention with global attention, optimized for long-context processing without excessive memory overhead.
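The memory argument can be illustrated by counting how many query-key pairs each kind of layer attends to. The sketch below builds a causal global mask and a causal sliding-window mask; the sequence length and window size are toy values chosen for readability, and the article does not state Gemma 4's real window size or how local and global layers are interleaved.

```python
def causal_mask(n):
    """Global causal attention: token i may attend to every token j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def sliding_window_mask(n, window):
    """Local causal attention: token i attends only to the last `window`
    tokens, itself included."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


# Toy sizes, assumed for illustration only.
n, window = 16, 4
global_pairs = attended_pairs(causal_mask(n))                 # 136
local_pairs = attended_pairs(sliding_window_mask(n, window))  # 58

print(f"global: {global_pairs} pairs, local: {local_pairs} pairs")
```

Global attention grows quadratically with sequence length, while the windowed variant grows linearly, so a local layer only needs to cache keys and values for the most recent window of tokens.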

Multimodal and Reasoning Capabilities

All Gemma 4 variants process text and images, with video handled as sequences of frames. The E2B and E4B additionally accept native audio input for automatic speech recognition and multilingual translation. The models support 140+ languages, of which 35+ are officially optimized.

Key capabilities include:

  • Thinking modes: Configurable step-by-step reasoning before generation
  • Function calling: Native structured tool use for agentic workflows
  • Variable resolution images: Flexible aspect ratio and resolution handling
  • System prompt support: Native system role integration
  • Coding enhancements: Improved code generation
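As a rough picture of what structured tool use involves on the host side, the sketch below parses a JSON tool call and dispatches it to a registered Python function. The JSON shape, the get_weather tool, and the dispatch helper are all hypothetical; the article does not specify Gemma 4's function-calling format, so treat this as a generic pattern rather than the model's actual protocol.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Weather lookup for {city} is not implemented in this sketch."


# Registry of tools the host application exposes to the model (hypothetical).
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Parse an assumed {"name": ..., "arguments": {...}} tool call
    and run the matching registered function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])


# Stand-in for text a model might emit when it decides to call a tool.
result = dispatch('{"name": "get_weather", "arguments": {"city": "Zurich"}}')
print(result)
```

In an agentic loop, the returned string would be fed back to the model as a tool result so it can continue reasoning with the new information.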

Benchmark Performance

Instruction-tuned models were evaluated across reasoning, coding, vision, and long-context tasks:

26B A4B performance:

  • MMLU Pro: 82.6%
  • AIME 2026 (no tools): 88.3%
  • LiveCodeBench v6: 77.1%
  • Codeforces ELO: 1718
  • GPQA Diamond: 82.3%
  • Vision MMMU Pro: 73.8%
  • Long Context (128K): 44.1% (8-needle MRCR v2)

31B dense performance:

  • MMLU Pro: 85.2%
  • AIME 2026 (no tools): 89.2%
  • LiveCodeBench v6: 80.0%
  • Codeforces ELO: 2150
  • Long Context (256K): 66.4%

The 26B A4B shows substantial gains in reasoning benchmarks compared to Gemma 3 27B, particularly in AIME (88.3% vs 20.8%), LiveCodeBench (77.1% vs 29.1%), and long-context retrieval (44.1% vs 13.5%).
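The size of these gains is easier to compare as point differences and ratios. The short calculation below uses only the scores quoted in the preceding paragraph; the variable names are illustrative.

```python
# (Gemma 3 27B score, Gemma 4 26B A4B score), as quoted in the article.
scores = {
    "AIME 2026 (no tools)": (20.8, 88.3),
    "LiveCodeBench v6": (29.1, 77.1),
    "Long context, 128K (8-needle MRCR v2)": (13.5, 44.1),
}

gains = {}
for name, (old, new) in scores.items():
    gains[name] = {
        "points": round(new - old, 1),  # absolute gain in percentage points
        "ratio": round(new / old, 2),   # relative improvement factor
    }

for name, g in gains.items():
    print(f"{name}: +{g['points']} points ({g['ratio']}x)")
```

On these figures the largest jump is AIME, at 67.5 points, followed by LiveCodeBench at 48.0 and long-context retrieval at 30.6.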

Availability and Deployment

All models are released under the Apache 2.0 license and are available on Hugging Face in multiple formats, including GGUF quantizations from Unsloth. The 26B A4B variant is positioned as a practical middle ground: significantly faster than the dense 31B while maintaining strong reasoning and coding performance.

Unsloth's GGUF quantizations ship with benchmarks, and all models are compatible with the latest Transformers library. The smaller variants (E2B, E4B) are explicitly optimized for laptop and mobile deployment, while the 26B A4B and 31B target consumer GPUs and workstations.
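A back-of-envelope estimate shows why quantization matters for these hardware targets. The sketch below computes weights-only memory from the parameter counts listed earlier, assuming a uniform bit width per weight. Real GGUF quantizations mix bit widths and add metadata, and the KV cache and activations need additional memory, so these numbers are lower bounds rather than measured footprints.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Weights-only memory in decimal gigabytes, assuming every parameter
    is stored at the same bit width (a simplifying assumption)."""
    return params_billion * bits_per_weight / 8


# Total parameter counts from the article, in billions. For the MoE model
# all 25.2B weights must be resident even though only 3.8B are active per token.
models = {
    "E2B (with embeddings)": 5.1,
    "E4B (with embeddings)": 8.0,
    "26B A4B (total)": 25.2,
    "31B": 30.7,
}

for name, params in models.items():
    fp16 = weight_memory_gb(params, 16)
    q4 = weight_memory_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

Under these assumptions the 31B model drops from roughly 61 GB at 16-bit to about 15 GB at 4-bit, which is the difference between needing multiple accelerators and fitting on a single high-memory consumer GPU.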

What This Means

Gemma 4 targets a specific deployment gap: frontier-level reasoning and coding capability without the compute requirements of 70B-parameter models. The 26B A4B's mixture-of-experts design activates only 3.8B of its 25.2B parameters per token, so its inference cost approaches that of a 4B model while it retains the capacity of a much larger one. That is a pragmatic architecture choice that competes directly with similarly sized dense models from other labs. The extended context windows (up to 256K tokens) and native reasoning modes position Gemma 4 for long-document analysis and agentic workflows where earlier open models struggled. The benchmark results nonetheless show the 31B as the family's strongest performer. The A4B variant trails it by 0.9 to 2.9 points on MMLU Pro, AIME 2026, and LiveCodeBench v6, with a wider gap on Codeforces (1718 vs 2150 ELO), in exchange for better throughput.
