model releaseGoogle DeepMind

Google DeepMind releases Gemma 4 with multimodal reasoning and up to 256K context window

TL;DR

Google DeepMind released Gemma 4, a multimodal model family supporting text, images, video, and audio with context windows up to 256K tokens. The release includes four sizes (E2B, E4B, 26B A4B, and 31B) designed for deployment from mobile devices to servers. The 31B dense model achieves 85.2% on MMLU Pro and 89.2% on AIME 2026.

3 min read
0

Google DeepMind Launches Gemma 4 with Multimodal Reasoning Capabilities

Google DeepMind released Gemma 4, a family of open-weight multimodal models supporting text, images, video, and audio inputs with reasoning modes and context windows up to 256K tokens.

Model Lineup and Architecture

Gemma 4 includes four distinct sizes:

Dense Models:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context window
  • 31B: 30.7B parameters, 256K context window

Mixture-of-Experts:

  • 26B A4B: 25.2B total parameters, 3.8B active parameters, 256K context window, 8 active experts out of 128 total

The "E" designation indicates "effective" parameters achieved through Per-Layer Embeddings (PLE), where each decoder layer maintains its own small embedding table for quick lookups. The "A" in the A4B model denotes active parameters—only 3.8B of 25.2B total parameters activate during inference, enabling near-4B inference speed at 26B model scale.

All models employ hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention in the final layer. Global layers use unified Keys and Values with Proportional RoPE for memory optimization during long-context processing.

Multimodal and Reasoning Capabilities

Gemma 4 handles:

  • Text and Images: All models support variable aspect ratio and resolution image processing
  • Video: Frame sequence analysis available across the family
  • Audio: Native ASR and speech-to-translated-text on E2B and E4B models only
  • Reasoning: Built-in configurable thinking modes enabling step-by-step problem solving
  • Function Calling: Native structured tool use for agentic workflows
  • System Prompts: Native system role support for controlled conversations
  • Multilingual: Pre-trained on 140+ languages with native 35+ language support

Benchmark Performance

Instruction-tuned benchmark results:

Benchmark 31B 26B A4B E4B E2B
MMLU Pro 85.2% 82.6% 69.4% 60.0%
AIME 2026 (no tools) 89.2% 88.3% 42.5% 37.5%
LiveCodeBench v6 80.0% 77.1% 52.0% 44.0%
Codeforces ELO 2150 1718 940 633
GPQA Diamond 84.3% 82.3% 58.6% 43.4%
MMMLU (Multilingual) 88.4% 86.3% 76.6% 67.4%
Vision MMMU Pro 76.9% 73.8% 52.6% 44.2%
MATH-Vision 85.6% 82.4% 59.5% 52.4%
BigBench Extra Hard 74.4% 64.8% 33.1% 21.9%

For long-context evaluation (MRCR v2, 128K tokens with 8 needles), the 31B model achieved 66.4% average accuracy.

Deployment and Availability

Models are available under Apache 2.0 license with open weights. Unsloth offers optimized GGUF (4-bit) quantized versions enabling local execution on laptops and mobile devices. All models are available via Hugging Face Transformers library and compatible with Unsloth Studio for fine-tuning and inference.

The family is designed for diverse deployment scenarios: E2B and E4B for edge/mobile, 26B A4B for consumer GPUs, and 31B for workstations and servers.

What This Means

Gemma 4 represents a significant consolidation of multimodal capabilities in open models. The efficiency-focused variants (E2B, E4B, 26B A4B) expand deployment options beyond high-end data centers, while the 31B variant approaches frontier performance on reasoning and code benchmarks (85.2% MMLU Pro, 89.2% AIME). The native reasoning modes and function-calling address the growing demand for agentic workflows. However, the smaller models show notable performance drops on advanced reasoning tasks—the E4B drops to 69.4% MMLU Pro versus 31B's 85.2%, suggesting size-dependent trade-offs for edge deployments.

Related Articles

model release

Google releases Gemini Omni Flash video generation model with conversational editing, withholds speech synthesis

Google DeepMind released Gemini Omni Flash, the first model in its new Omni family that generates and edits video from image, audio, video, and text inputs. The model is rolling out to Gemini app subscribers and YouTube Shorts with a 10-second clip limit, while speech-editing capabilities remain withheld pending safety testing.

model release

Google releases Gemini 3.5 Flash with 4x faster output and agentic capabilities, 3.5 Pro coming June

Google released Gemini 3.5 Flash today with 4x faster output token generation than competing frontier models while surpassing Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. The company announced Gemini 3.5 Pro will launch next month and introduced Gemini Omni, a new multimodal series that outputs video.

model release

DeepSeek Releases V4 Flash: 284B-Parameter MoE Model with 1M Context Window, Free via OpenRouter

DeepSeek has released V4 Flash, a Mixture-of-Experts model with 284B total parameters and 13B activated parameters per forward pass. The model supports a 1M-token context window and is available free through OpenRouter, targeting high-throughput coding and chat applications.

model release

Perceptron Launches Mk1 Vision-Language Model with Video Reasoning at $0.15/$1.50 per 1M Tokens

Perceptron has released Perceptron Mk1, a vision-language model designed for video understanding and embodied reasoning tasks. The model accepts image and video inputs with 33K context window, priced at $0.15 per 1M input tokens and $1.50 per 1M output tokens, and supports structured spatial annotations on demand.

Comments

Loading...