Google DeepMind Releases Gemma 4: Encoder-Free Multimodal Models from 2.3B to 30.7B Parameters
Google DeepMind released Gemma 4, a family of open-weight multimodal models ranging from 2.3B to 30.7B parameters. The flagship 12B Unified model eliminates separate encoders, processing text, images, audio, and video directly through a single decoder-only transformer with a context window of up to 256K tokens.
Model Lineup and Architecture
Gemma 4 consists of five models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context
- E4B: 4.5B effective parameters (8B with embeddings), 128K context
- 12B Unified: 11.95B parameters, 256K context
- 26B A4B (MoE): 25.2B total parameters with 3.8B active, 256K context
- 31B Dense: 30.7B parameters, 256K context
All models support text and image input. E2B, E4B, and 12B Unified include native audio and video capabilities.
The 12B Unified model eliminates the dedicated vision and audio encoders used in other Gemma 4 models. Instead, it projects raw image patches and audio waveforms directly into the LLM's embedding space through lightweight linear layers, reducing multimodal latency and enabling end-to-end fine-tuning.
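The idea of projecting raw patches through a linear layer can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the patch size, the embedding width, the random weights, and the function names `patchify`, `linear`, and `project_patches` are hypothetical and do not come from the Gemma 4 release.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def linear(vec, weight, bias):
    """Compute y = W x + b for one vector; weight has shape out_dim x in_dim."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def project_patches(image, patch, weight, bias):
    """Map every raw patch straight into the model's embedding space."""
    return [linear(p, weight, bias) for p in patchify(image, patch)]


# Toy sizes (hypothetical): an 8x8 RGB image, 4x4 patches, 16-dim embeddings.
rng = random.Random(0)
H, W, C, PATCH, D_MODEL = 8, 8, 3, 4, 16
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
in_dim = PATCH * PATCH * C
weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

tokens = project_patches(image, PATCH, weight, bias)
print(len(tokens), len(tokens[0]))  # 4 16
```

Each patch becomes one vector of the same width as a text token embedding, so the decoder can consume it without a separate vision tower. In a real model the projection weights would be learned rather than random.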
Technical Specifications
All models use a hybrid attention mechanism that alternates between local sliding-window attention (windows of 512 to 1,024 tokens) and full global attention. Global layers use unified keys and values with Proportional RoPE (p-RoPE) to reduce memory use at long context lengths.
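The difference between the two layer types can be sketched as a visibility rule plus a cache-size estimate. The ratio of local to global layers is not given in the article, so the `locals_per_global` value below is purely hypothetical, as are the function names; the sequence length assumes 256K means 256 × 1,024 tokens.

```python
def layer_schedule(num_layers, locals_per_global):
    """Label layers so that every (locals_per_global + 1)-th layer is global."""
    period = locals_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def can_attend(query_pos, key_pos, kind, window):
    """Causal visibility rule for one query/key pair."""
    if key_pos > query_pos:
        return False  # never look at future tokens
    if kind == "global":
        return True  # full attention over the whole prefix
    return query_pos - key_pos < window  # local: only the most recent `window` tokens


def kv_entries_needed(seq_len, kind, window):
    """Number of past positions a layer must keep cached for the next token."""
    return seq_len if kind == "global" else min(seq_len, window)


WINDOW = 1024          # upper end of the range quoted in the article
SEQ_LEN = 256 * 1024   # assumption: 256K read as 262,144 tokens

print(layer_schedule(6, 2))
print(can_attend(2000, 100, "local", WINDOW), can_attend(2000, 100, "global", WINDOW))
print(kv_entries_needed(SEQ_LEN, "local", WINDOW), kv_entries_needed(SEQ_LEN, "global", WINDOW))
```

Under these assumptions a local layer caches at most 1,024 positions however long the prompt is, while a global layer's cache grows with the full sequence, which is why memory optimizations target the global layers.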
The MoE model (26B A4B) activates only 3.8B of its 25.2B parameters during inference, using 8 active experts from a pool of 128 total experts plus 1 shared expert.
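A minimal sketch of top-k routing with a shared expert may make these numbers more concrete. The expert counts (128 routed, 8 active, 1 shared) and the parameter totals come from the article; the router logits, the toy scalar "experts", and the way the shared expert's output is summed with the routed outputs are assumptions for illustration, not a description of Gemma 4's actual layer.

```python
import math
import random

NUM_EXPERTS, TOP_K = 128, 8  # figures quoted in the article


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k):
    """Pick the top_k experts and renormalize their weights to sum to 1."""
    ranked = sorted(range(len(router_logits)), key=router_logits.__getitem__, reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, router_logits, experts, shared_expert, top_k):
    """Shared expert always runs; only the routed top_k experts are evaluated."""
    out = shared_expert(x)
    for idx, weight in route(router_logits, top_k):
        out += weight * experts[idx](x)
    return out


# Toy experts: each one just scales its scalar input (hypothetical stand-ins).
rng = random.Random(0)
experts = [lambda x, scale=i / NUM_EXPERTS: scale * x for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route(router_logits, TOP_K)
y = moe_forward(2.0, router_logits, experts, shared_expert, TOP_K)
print(len(selected), "of", NUM_EXPERTS, "routed experts ran")
print(f"active parameter share: {3.8 / 25.2:.1%}")  # 15.1%
```

Only 8 of 128 routed experts (6.25%) run per input, yet the article's figures imply about 15% of parameters are active. One plausible reading is that components such as attention, embeddings, and the shared expert run regardless of routing.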
Benchmark Performance
According to Google DeepMind, Gemma 4 31B achieved:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- Vision MMMU Pro: 76.9%
The 12B Unified model scored 77.2% on MMLU Pro, 77.5% on AIME 2026, and 72.0% on LiveCodeBench v6.
Capabilities and Release Details
All models include configurable reasoning modes, native function calling support, variable aspect ratio image processing, and multilingual support across 140+ languages. The E2B, E4B, and 12B models also handle automatic speech recognition and speech translation (speech in one language to text in another).
The 12B, 26B A4B, and 31B models support context windows of up to 256K tokens with interleaved multimodal input, while E2B and E4B support 128K. The smaller E2B and E4B models use Per-Layer Embeddings (PLE) to maximize parameter efficiency for on-device deployment.
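The lineup figures for the two small models give a rough sense of how much of their size sits in embeddings. The arithmetic below assumes the gap between the "with embeddings" total and the "effective" count consists entirely of embedding parameters; the article does not break this down further, and the helper name `embedding_share` is hypothetical.

```python
# (effective parameters, total with embeddings), in billions, from the lineup above
lineup = {"E2B": (2.3, 5.1), "E4B": (4.5, 8.0)}


def embedding_share(effective_b, total_b):
    """Return (embedding params in billions, fraction of the total they make up)."""
    embed_b = total_b - effective_b
    return embed_b, embed_b / total_b


for name, (effective_b, total_b) in lineup.items():
    embed_b, share = embedding_share(effective_b, total_b)
    print(f"{name}: {embed_b:.1f}B in embeddings ({share:.0%} of {total_b}B)")
```

On these numbers, embeddings would account for roughly 55% of E2B and 44% of E4B.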
All models are available on Hugging Face under Apache 2.0 license in both pre-trained and instruction-tuned variants.
What This Means
The 12B model's encoder-free design departs from the common approach of attaching separate vision and audio encoders to a language model, which could reduce deployment complexity and latency. The family spans mobile-oriented 2.3B models to the 30.7B dense variant, giving options across the performance-efficiency spectrum, though the benchmark figures are reported by Google DeepMind and performance on production workloads has yet to be validated independently. If the 26B A4B MoE model approaches the 31B model's quality while activating only 3.8B parameters during inference, it could be attractive for inference-constrained deployments.