Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters

TL;DR

Google DeepMind released Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model eliminates separate encoders, processing text, images, audio, and video directly through a single decoder-only transformer with a context window of up to 256K tokens.

June 3, 2026 · 8:51 PM

Gemma 4 12B Unified — Quick Specs

Context window: 256K tokens

Google DeepMind released Gemma 4, a family of five Apache 2.0-licensed multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model introduces an encoder-free architecture that processes text, images, audio, and video directly through a single decoder-only transformer.

Model Lineup and Architecture

Gemma 4 consists of five models:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context
E4B: 4.5B effective parameters (8B with embeddings), 128K context
12B Unified: 11.95B parameters, 256K context
26B A4B (MoE): 25.2B total parameters with 3.8B active, 256K context
31B Dense: 30.7B parameters, 256K context

All models support text and image input. E2B, E4B, and 12B Unified include native audio and video capabilities.

The 12B Unified model eliminates the dedicated vision and audio encoders used in other Gemma 4 models. Instead, it projects raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers, reducing multimodal latency and enabling end-to-end fine-tuning.
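The idea of projecting raw patches into the embedding space can be illustrated with a minimal sketch. It is not Gemma 4's actual implementation: the patch size, channel count, embedding width, and the helper names `patchify` and `linear_project` are all hypothetical, and the weights are random stand-ins for learned parameters.

```python
import random

def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches

def linear_project(vectors, weight, bias):
    """Apply y = W x + b to each vector; weight has shape d_model x d_in."""
    return [
        [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]

# Hypothetical sizes chosen for illustration only.
PATCH, CHANNELS, D_MODEL = 4, 3, 8
rng = random.Random(0)
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(8)] for _ in range(8)]

d_in = PATCH * PATCH * CHANNELS  # 48 raw values per patch
weight = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

# Each patch becomes one "token" embedding the decoder can consume directly.
tokens = linear_project(patchify(image, PATCH), weight, bias)
print(len(tokens), len(tokens[0]))  # 4 patch embeddings, each of width 8
```

Because the projection is a single linear map rather than a separate pretrained encoder, its weights would sit in the same computation graph as the decoder, which is what makes end-to-end fine-tuning straightforward in a design like this.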

Technical Specifications

All models use a hybrid attention mechanism that alternates between local sliding window attention (512-1024 tokens) and full global attention. Global layers feature unified Keys and Values with Proportional RoPE (p-RoPE) to optimize memory for long contexts.
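A rough sketch of how local and global layers differ is shown here. The ratio of local to global layers is not stated above, so the `global_every` parameter and the `layer_schedule` helper are assumptions made for illustration; only the window range (512 to 1024 tokens) comes from the description.

```python
def causal_mask(seq_len, window=None):
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention. Otherwise each query
    sees only the `window` most recent positions, itself included.
    """
    mask = []
    for q in range(seq_len):
        lo = 0 if window is None else max(0, q - window + 1)
        mask.append([lo <= k <= q for k in range(seq_len)])
    return mask

def layer_schedule(num_layers, global_every):
    """Hypothetical interleaving: every `global_every`-th layer is global."""
    return ["global" if (i + 1) % global_every == 0 else "local"
            for i in range(num_layers)]

def kv_positions(seq_len, window=None):
    """Key/value positions a layer must keep cached for the newest query."""
    return seq_len if window is None else min(seq_len, window)

local = causal_mask(6, window=3)
print(local[5])  # [False, False, False, True, True, True]

print(layer_schedule(num_layers=4, global_every=2))

# With a 512-token window, a local layer caches far fewer positions
# than a global layer at a 256K-token context.
print(kv_positions(256_000, window=512), kv_positions(256_000))
```

The cache comparison hints at why this kind of hybrid helps with long contexts: only the global layers pay a memory cost that grows with the full sequence length.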

The MoE model (26B A4B) activates only 3.8B of its 25.2B parameters during inference, using 8 active experts from a pool of 128 total experts plus 1 shared expert.
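The routing pattern described above (8 routed experts out of 128, plus one always-on shared expert) can be sketched as follows. This is a generic top-k mixture-of-experts illustration, not Gemma 4's code: the gate normalisation, the toy expert functions, and the random router logits are assumptions standing in for learned components.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Pick the k highest-scoring experts and renormalise their gate weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))

def moe_forward(x, experts, shared_expert, router_logits, k):
    """Shared expert always runs; only the k routed experts are evaluated."""
    out = shared_expert(x)
    for idx, gate in route(router_logits, k):
        expert_out = experts[idx](x)
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out

NUM_EXPERTS, TOP_K = 128, 8  # counts taken from the article
rng = random.Random(0)

# Toy experts: each scales its input by a fixed factor (illustration only).
experts = [lambda x, s=i / NUM_EXPERTS: [s * v for v in x]
           for i in range(NUM_EXPERTS)]
shared_expert = lambda x: list(x)

# In a real model these logits come from a learned router applied to x.
router_logits = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]

chosen = route(router_logits, TOP_K)
y = moe_forward([1.0, 2.0], experts, shared_expert, router_logits, TOP_K)

active_fraction = 3.8 / 25.2  # active vs. total parameters, from the article
print(len(chosen), f"{active_fraction:.1%}")  # 8 15.1%
```

The 120 experts that are not selected are never evaluated for a given token, which is where the per-token compute saving comes from, although all 25.2B parameters still need to be held in memory.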

Benchmark Performance

According to Google DeepMind, Gemma 4 31B achieved:

MMLU Pro: 85.2%
AIME 2026 (no tools): 89.2%
LiveCodeBench v6: 80.0%
Codeforces ELO: 2150
GPQA Diamond: 84.3%
Vision MMMU Pro: 76.9%

The 12B Unified model scored 77.2% on MMLU Pro, 77.5% on AIME 2026, and 72.0% on LiveCodeBench v6.

Capabilities and Release Details

All models include configurable reasoning modes, native function calling support, variable aspect ratio image processing, and multilingual support across 140+ languages. The E2B, E4B, and 12B Unified models handle automatic speech recognition and speech-to-text translation.

The 12B Unified, 26B A4B, and 31B models support context windows of up to 256K tokens with interleaved multimodal input. The smaller E2B and E4B models use Per-Layer Embeddings (PLE) to maximize parameter efficiency for on-device deployment.

All models are available on Hugging Face under the Apache 2.0 license in both pre-trained and instruction-tuned variants.

What This Means

Gemma 4's encoder-free architecture in the 12B Unified model represents a significant shift in multimodal model design, potentially reducing deployment complexity and latency compared to designs that pair the language model with separate vision and audio encoders. The family's range, from the mobile-oriented 2.3B model to the 30.7B dense variant, provides options across the performance-efficiency spectrum, though performance on production workloads has yet to be validated independently. If the MoE model delivers near-31B performance while activating only 3.8B parameters, it could be attractive for inference-constrained deployments.

Source: huggingface.co
