Google DeepMind releases Gemma 4: multimodal models up to 31B parameters with 256K context

TL;DR

Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (25.2B total, 3.8B active), and 31B dense. All models support text and image input with 128K-256K context windows, reasoning modes, and native function calling for agentic workflows.

Google DeepMind has released Gemma 4, a family of open-weights multimodal models in four sizes, ranging from 2.3B effective parameters to 31B parameters. The models are available on Hugging Face under the Apache 2.0 license.

Model Specifications

The Gemma 4 lineup includes:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context window, supports text, image, and audio
  • 26B A4B: 25.2B total parameters with only 3.8B active during inference, 256K context window, supports text and image
  • 31B: 30.7B parameters, 256K context window, supports text and image

The smaller models (E2B and E4B) use Per-Layer Embeddings (PLE) to keep effective parameter counts low while maintaining multilingual support across 140+ languages. The 26B A4B uses a Mixture-of-Experts architecture with 8 active experts selected from 128 total, so inference speed is comparable to that of a 4B model even though the model holds about 25B parameters in total.
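The "8 of 128 experts" idea can be sketched with generic top-k gating. This is an illustration only: the article does not describe Gemma 4's router, so the function name `route_token`, the softmax-over-selected-logits weighting, and the random logits are all assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, per the article
TOP_K = 8          # active experts, per the article


def route_token(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Assumed gating scheme: softmax over the selected logits only, so the
    returned weights sum to 1. The real router may differ.
    """
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores
selected = route_token(logits)
print(len(selected), "experts used out of", NUM_EXPERTS)

# Share of parameters touched per token, from the article's own figures.
print("active fraction:", round(3.8 / 25.2, 3))  # → active fraction: 0.151
```

Only the selected experts run for a given token, which is why per-token compute tracks the 3.8B active figure rather than the 25.2B total.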

Key Capabilities

All Gemma 4 models feature:

  • Reasoning mode: Configurable thinking modes enabling step-by-step problem solving
  • Extended multimodality: Text and images with variable aspect ratio and resolution; video as frame sequences; audio (E2B/E4B only) for ASR and speech-to-text translation
  • Function calling: Native structured tool use for autonomous agent workflows
  • Long context: 128K (E2B/E4B) or 256K (26B A4B/31B) token windows
  • Coding support: Code generation, completion, and correction with notable benchmark improvements
  • Native system prompts: Enhanced control over conversational behavior
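To make the function-calling item concrete, here is a minimal sketch of the application side of a tool-use loop. The article does not specify Gemma 4's tool-call format, so the JSON shape (`name` plus `arguments`), the `get_weather` tool, and the `dispatch_tool_call` helper are all hypothetical.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical tool; a real one would query an external service."""
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def dispatch_tool_call(raw: str) -> str:
    """Parse a model-emitted tool call and return the result as JSON text.

    Assumed wire format (not from the article):
        {"name": "<tool name>", "arguments": {...}}
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid JSON: {exc.msg}"})

    name = call.get("name") if isinstance(call, dict) else None
    if not isinstance(name, str) or name not in TOOLS:
        return json.dumps({"error": "unknown tool"})

    try:
        result = TOOLS[name](**call.get("arguments", {}))
    except TypeError as exc:  # wrong or missing arguments
        return json.dumps({"error": str(exc)})
    return json.dumps({"result": result})


print(dispatch_tool_call('{"name": "get_weather", "arguments": {"city": "Zurich"}}'))
```

Returning errors as structured JSON rather than raising lets the result be fed back to the model, which can then retry with corrected arguments.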

The architecture employs hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention on final layers, optimized with Proportional RoPE (p-RoPE) for long-context memory efficiency.
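The difference between local and global layers can be pictured as two attention masks. The sketch below is a toy illustration: `causal_mask`, the 16-token sequence, and the window of 4 are assumptions chosen for readability, not Gemma 4's implementation or layer layout.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention. An integer window
    limits each query to its most recent `window` positions, itself included.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


global_mask = causal_mask(16)           # global layer
local_mask = causal_mask(16, window=4)  # sliding-window layer

print(attended_pairs(global_mask))  # → 136
print(attended_pairs(local_mask))   # → 58
```

The global count grows quadratically with sequence length while the windowed count grows linearly, which is the reason a sliding-window layer only needs to keep the most recent `window` keys and values in its cache.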

Benchmark Performance

Reported results for the instruction-tuned models:

31B Dense Model:

  • MMLU Pro: 85.2%
  • AIME 2026 (no tools): 89.2%
  • LiveCodeBench v6: 80.0%
  • Codeforces ELO: 2150
  • GPQA Diamond: 84.3%

26B A4B (MoE):

  • MMLU Pro: 82.6%
  • AIME 2026 (no tools): 88.3%
  • LiveCodeBench v6: 77.1%
  • Codeforces ELO: 1718
  • GPQA Diamond: 82.3%

E4B:

  • MMLU Pro: 69.4%
  • LiveCodeBench v6: 52.0%
  • Codeforces ELO: 940

Vision benchmarks show MMMU Pro scores of 76.9% (31B), 73.8% (26B A4B), and 52.6% (E4B). The 31B model achieved 66.4% on long-context needle-in-haystack evaluation at 128K tokens.
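The gap between the 31B dense model and the 26B A4B can be tabulated directly from the figures above. The snippet below only restates the article's numbers; the names `SCORES`, `CODEFORCES_ELO`, and `gaps` are illustrative.

```python
# (31B dense, 26B A4B) scores in percent, copied from the article.
SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "GPQA Diamond": (84.3, 82.3),
    "MMMU Pro": (76.9, 73.8),
}
CODEFORCES_ELO = (2150, 1718)  # rating points, not percent


def gaps(scores):
    """Dense-minus-MoE gap per benchmark, in percentage points."""
    return {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}


for name, gap in gaps(SCORES).items():
    print(f"{name}: {gap} points")
print("Codeforces ELO gap:", CODEFORCES_ELO[0] - CODEFORCES_ELO[1])
```

On the percentage-scored benchmarks the gap runs from 0.9 to 3.1 points; the Codeforces rating gap of 432 points is proportionally much larger.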

Deployment Flexibility

Google positions Gemma 4 for diverse deployment scenarios: E2B and E4B for mobile and edge devices; 26B A4B for consumer GPUs and workstations balancing speed and capability via MoE; 31B for high-end servers requiring maximum performance. All models are available in both pre-trained and instruction-tuned variants.
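A rough way to match a model to hardware is to estimate weight storage from parameter count and numeric precision. This is a back-of-the-envelope sketch: the precisions listed are common choices, not a statement about which formats Google ships, and the estimate ignores the KV cache, activations, and quantization overhead.

```python
# Parameter counts in billions, from the article.
PARAMS_BILLIONS = {
    "E2B (with embeddings)": 5.1,
    "E4B (with embeddings)": 8.0,
    "26B A4B (total)": 25.2,
    "31B": 30.7,
}

# Bytes per parameter for some common precisions (assumed, for illustration).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions, dtype):
    """Approximate weight storage in GB, taking 1 GB as 1e9 bytes."""
    return params_billions * BYTES_PER_PARAM[dtype]


for model, params in PARAMS_BILLIONS.items():
    row = ", ".join(f"{d}: {weight_gb(params, d):.1f} GB" for d in BYTES_PER_PARAM)
    print(f"{model}: {row}")
```

For the 26B A4B, the estimate uses the total count rather than the active one: unless experts are offloaded, every expert has to be resident in memory even though only a few run per token.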

What This Means

Gemma 4 extends Google's open-model strategy to multimodal reasoning at multiple efficiency tiers. The 26B A4B model's sparse activation offers a compelling alternative to dense models: on the benchmarks above it lands within about 3 percentage points of the 31B model (Codeforces ELO, at 1718 versus 2150, is the exception) while reportedly running 6-7× faster. With context windows of up to 256K and configurable reasoning modes, Gemma 4 is positioned to compete with closed models in long-context and agentic use cases while remaining deployable from phones to data centers. The Apache 2.0 license permits commercial use, subject to its attribution and notice requirements.
