Google DeepMind releases Gemma 4 12B Unified: encoder-free multimodal model with 256K context window

TL;DR

Google DeepMind has released Gemma 4 12B Unified, an encoder-free multimodal model that processes text, images, and audio through a single decoder-only transformer. The model features 11.95 billion parameters, a 256K token context window, and achieves 77.2% on MMLU Pro and 72.0% on LiveCodeBench v6.

June 3, 2026 · 5:51 PM

Gemma 4 12B Unified — Quick Specs

Context window: 256K tokens

Google DeepMind has released Gemma 4 12B Unified, an 11.95 billion parameter multimodal model that eliminates separate encoders by processing text, images, and audio directly through a single decoder-only transformer. The model is part of the larger Gemma 4 family, which includes five models ranging from 2.3B to 30.7B effective parameters.

Technical specifications

Gemma 4 12B Unified features 48 layers with a 256K token context window and a 262K token vocabulary. The model uses a hybrid attention mechanism that interleaves local sliding window attention (1024 tokens) with full global attention, ensuring the final layer always has global context. Unlike other Gemma 4 models, the 12B Unified version projects raw image patches and audio waveforms directly into the language model's embedding space through lightweight linear layers, removing the need for dedicated vision or audio encoders.
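The interleaving described above can be sketched in a few lines of Python. This is a minimal illustration, not Google DeepMind's implementation: the 48 layers and the 1024-token window come from the article, but the local-to-global ratio (`GLOBAL_EVERY`), the function names, and the use of a purely causal mask are assumptions made for the example.

```python
from typing import List, Optional

NUM_LAYERS = 48      # from the article
LOCAL_WINDOW = 1024  # sliding-window size in tokens, from the article
GLOBAL_EVERY = 6     # ASSUMPTION: illustrative ratio, not stated in the article


def layer_schedule(num_layers: int = NUM_LAYERS,
                   global_every: int = GLOBAL_EVERY) -> List[str]:
    """Return 'local' or 'global' for each layer; the last layer is forced global."""
    kinds = ["global" if (i + 1) % global_every == 0 else "local"
             for i in range(num_layers)]
    kinds[-1] = "global"  # the final layer always sees the full context
    return kinds


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    window=None gives full causal (global) attention. Otherwise each query
    sees only itself and the previous window - 1 positions.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


schedule = layer_schedule()
local_mask = causal_mask(6, window=3)   # tiny sizes so the mask is easy to inspect
global_mask = causal_mask(6)
```

In a local layer the attention cost per token is bounded by the window size rather than the full sequence length, which is the usual motivation for mixing the two layer types in long-context models.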

The model is released under the Apache 2.0 license, with both pre-trained and instruction-tuned variants available.

Benchmark performance

According to Google DeepMind, Gemma 4 12B Unified achieves:

77.2% on MMLU Pro
77.5% on AIME 2026 (no tools)
72.0% on LiveCodeBench v6
1659 Codeforces Elo rating
78.8% on GPQA Diamond
69.1% on Vision MMMU Pro
79.7% on MATH-Vision
38.5 on CoVoST audio translation (excluding Chinese)

The model supports configurable "thinking modes" for step-by-step reasoning and includes native function-calling capabilities for agentic workflows.

Gemma 4 family architecture

The full Gemma 4 family includes:

E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, text/image/audio
E4B: 4.5B effective parameters (8B with embeddings), 128K context, text/image/audio
12B Unified: 11.95B parameters, 256K context, encoder-free text/image/audio
26B A4B: 25.2B total parameters with 3.8B active (MoE), 256K context, text/image
31B: 30.7B parameters, 256K context, text/image

The "E" models use Per-Layer Embeddings (PLE) to maximize parameter efficiency for on-device deployment. The effective parameter count excludes large embedding lookup tables that don't contribute to compute during inference. The 26B A4B model uses mixture-of-experts architecture with 8 active experts out of 128 total, allowing it to run nearly as fast as a 4B model while maintaining larger model capacity.
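As a rough sketch of what "8 active experts out of 128" means in practice, the snippet below routes a single token with top-k gating. The expert counts and parameter totals are taken from the article. The router itself (softmax over the selected logits, the `route_token` name, random logits as input) is a generic illustration and an assumption, not the model's actual routing code.

```python
import math
import random
from typing import List, Tuple

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # from the article


def route_token(router_logits: List[float],
                top_k: int = TOP_K) -> List[Tuple[int, float]]:
    """Pick the top_k experts for one token and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract the max for stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical router scores for one token; a real model computes these
# from the token's hidden state.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = route_token(logits)

# Share of parameters used per token, from the figures quoted above.
active_fraction = 3.8 / 25.2
```

Only the selected experts run for a given token, so per-token compute tracks the 3.8B active parameters (about 15% of the 25.2B total) rather than the full model size.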

Multimodal capabilities

Gemma 4 12B Unified processes images at variable aspect ratios and resolutions, handles video through frame sequences, and supports interleaved multimodal inputs mixing text and images in any order. Audio capabilities include automatic speech recognition and speech-to-translated-text translation across multiple languages.

The model natively supports system prompts. It was pre-trained on data covering 140+ languages and supports 35+ languages out of the box.

What this means

The encoder-free architecture in Gemma 4 12B Unified represents a shift toward simpler multimodal models that can be fine-tuned end-to-end in a single pass. By eliminating separate encoders, the model reduces multimodal processing latency and deployment complexity. The 256K context window and strong coding benchmarks (72.0% on LiveCodeBench v6) position it for long-context reasoning tasks and agentic workflows, while the 12B parameter count targets consumer-grade GPUs rather than requiring high-end infrastructure. The Apache 2.0 license permits commercial use, subject to its attribution and notice requirements.
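To put the consumer-GPU point in numbers, here is a back-of-the-envelope estimate of the memory needed just to hold the weights. The parameter count is from the article. The precisions listed are common choices and are assumptions here, since the article does not say which quantized formats are available. The estimate ignores the KV cache and activations, which grow with context length.

```python
PARAMS = 11.95e9  # parameter count quoted in the article

# ASSUMPTION: typical storage precisions, not formats confirmed by the article.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


footprints = {name: weight_memory_gb(PARAMS, size)
              for name, size in BYTES_PER_PARAM.items()}

for name, gb in footprints.items():
    print(f"{name}: {gb:.2f} GB")
```

That works out to roughly 24 GB at bf16, 12 GB at 8 bits, and 6 GB at 4 bits, so under these assumptions a single consumer card is plausible mainly with quantized weights.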

Source: huggingface.co

