Google DeepMind releases Gemma 4 12B Unified: encoder-free multimodal model with 256K context window

TL;DR

Google DeepMind has released Gemma 4 12B Unified, an encoder-free multimodal model that processes text, images, and audio through a single decoder-only transformer. The model features 11.95 billion parameters, a 256K token context window, and achieves 77.2% on MMLU Pro and 72.0% on LiveCodeBench v6.

Google DeepMind has released Gemma 4 12B Unified, an 11.95 billion parameter multimodal model that eliminates separate encoders by processing text, images, and audio directly through a single decoder-only transformer. The model is part of the larger Gemma 4 family, which includes five models ranging from 2.3B to 30.7B effective parameters.

Technical specifications

Gemma 4 12B Unified features 48 layers with a 256K token context window and a 262K token vocabulary. The model uses a hybrid attention mechanism that interleaves local sliding window attention (1024 tokens) with full global attention, ensuring the final layer always has global context. Unlike other Gemma 4 models, the 12B Unified version projects raw image patches and audio waveforms directly into the language model's embedding space through lightweight linear layers, removing the need for dedicated vision or audio encoders.
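
The interleaving of local and global layers can be illustrated with a minimal sketch. The article does not give the ratio of local to global layers, so the `global_every=6` schedule below is purely hypothetical, as are the function names and the exact window boundary; only the 48 layers, the 1024-token window, and the global final layer come from the description above.

```python
def layer_pattern(num_layers: int = 48, global_every: int = 6) -> list[str]:
    """Hypothetical schedule: every `global_every`-th layer is global, the rest local."""
    pattern = [
        "global" if (i + 1) % global_every == 0 else "local"
        for i in range(num_layers)
    ]
    # The article states that the final layer always has global context.
    pattern[-1] = "global"
    return pattern


def can_attend(query_pos: int, key_pos: int, kind: str, window: int = 1024) -> bool:
    """Causal visibility rule for a single query/key pair in one layer."""
    if key_pos > query_pos:
        return False  # causal decoding: a token never looks ahead
    if kind == "global":
        return True  # global layers see the whole prefix
    # Local layers see only the most recent `window` positions (assumed to
    # include the query position itself).
    return query_pos - key_pos < window


pattern = layer_pattern()
```

Under these assumptions, a token at position 5000 cannot see position 100 in a local layer (`can_attend(5000, 100, "local")` is false) but can in a global one, which is how distant context still reaches the output when most layers only look at a short window.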

The model is released under the Apache 2.0 license, with both pre-trained and instruction-tuned variants available.

Benchmark performance

According to Google DeepMind, Gemma 4 12B Unified achieves:

  • 77.2% on MMLU Pro
  • 77.5% on AIME 2026 (no tools)
  • 72.0% on LiveCodeBench v6
  • 1659 Codeforces Elo rating
  • 78.8% on GPQA Diamond
  • 69.1% on Vision MMMU Pro
  • 79.7% on MATH-Vision
  • 38.5 on CoVoST audio translation (excluding Chinese)

The model supports configurable "thinking modes" for step-by-step reasoning and includes native function-calling capabilities for agentic workflows.

Gemma 4 family architecture

The full Gemma 4 family includes:

  • E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, text/image/audio
  • E4B: 4.5B effective parameters (8B with embeddings), 128K context, text/image/audio
  • 12B Unified: 11.95B parameters, 256K context, encoder-free text/image/audio
  • 26B A4B: 25.2B total parameters with 3.8B active (MoE), 256K context, text/image
  • 31B: 30.7B parameters, 256K context, text/image

The "E" models use Per-Layer Embeddings (PLE) to maximize parameter efficiency for on-device deployment. The effective parameter count excludes large embedding lookup tables that don't contribute to compute during inference. The 26B A4B model uses a mixture-of-experts architecture with 8 active experts out of 128 total, allowing it to run nearly as fast as a 4B model while retaining the capacity of a larger one.
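
To make the "8 active experts out of 128" figure concrete, here is a minimal sketch of one common top-k routing scheme. The article does not describe Gemma 4's actual router, so the softmax renormalization, the random stand-in logits, and the `route_token` name are all assumptions made for illustration.

```python
import math
import random


def route_token(router_logits: list[float], k: int = 8) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    ranked = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    top = ranked[:k]
    peak = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


random.seed(0)
# Stand-in for the output of a learned router; real logits depend on the token.
logits = [random.gauss(0.0, 1.0) for _ in range(128)]
weights = route_token(logits)  # expert index -> mixing weight, 8 entries

# Share of parameters touched per token, using the article's figures.
active_fraction = 3.8 / 25.2
```

Only the selected experts run for a given token, so per-token compute tracks the 3.8B active parameters (about 15% of the 25.2B total) while all 128 experts still have to be held in memory.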

Multimodal capabilities

Gemma 4 12B Unified processes images at variable aspect ratios and resolutions, handles video through frame sequences, and supports interleaved multimodal inputs mixing text and images in any order. Audio capabilities include automatic speech recognition and speech-to-translated-text translation across multiple languages.
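
The encoder-free path for images, in which raw patches are mapped into the embedding space by a lightweight linear layer, can be sketched in a few lines. The patch size, the embedding width, and the helper names `patchify` and `project` are hypothetical, and the sketch omits positional information and any normalization, none of which the article specifies.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "sketch assumes divisible sizes"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def project(patches, weight, bias):
    """One linear layer: each flattened patch becomes one embedding vector."""
    return [
        [sum(x * w for x, w in zip(vec, w_row)) + b for w_row, b in zip(weight, bias)]
        for vec in patches
    ]


random.seed(0)
PATCH, CHANNELS, EMBED_DIM = 2, 3, 5  # toy sizes, chosen only for illustration
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(6)] for _ in range(4)]
in_dim = PATCH * PATCH * CHANNELS
weight = [[random.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

patches = patchify(image, PATCH)
tokens = project(patches, weight, bias)  # one embedding per patch
```

In this sketch the number of tokens follows directly from the image dimensions (a 4x6 image with 2x2 patches yields six tokens), which is one simple way variable aspect ratios and resolutions can be accommodated without a fixed-size vision encoder.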

The model includes native support for system prompts. It was pre-trained on data spanning 140+ languages and offers out-of-the-box support for 35+ languages.

What this means

The encoder-free architecture in Gemma 4 12B Unified represents a shift toward simpler multimodal models that can be fine-tuned end-to-end in a single pass. By eliminating separate encoders, the model reduces multimodal processing latency and deployment complexity. The 256K context window and strong coding benchmarks (72.0% on LiveCodeBench v6) position it for long-context reasoning tasks and agentic workflows, while the 12B parameter count targets consumer-grade GPUs rather than requiring high-end infrastructure. The Apache 2.0 license permits commercial use, subject to the license's attribution and notice conditions.
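
The consumer-GPU claim can be sanity-checked with back-of-the-envelope arithmetic on weight storage alone. The parameter count comes from the article; the storage formats are assumptions, and the estimate deliberately ignores the KV cache, activations, and framework overhead, all of which add to the real footprint, especially at long context lengths.

```python
PARAMS = 11.95e9  # parameter count reported in the article

# Assumed storage formats and their size in bytes per parameter.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


estimates = {name: weight_gb(PARAMS, size) for name, size in BYTES_PER_PARAM.items()}

for name, gigabytes in estimates.items():
    print(f"{name}: {gigabytes:.1f} GB")
```

Under these assumptions the weights take about 23.9 GB in bf16, 12.0 GB in int8, and 6.0 GB in int4, so fitting the model on a single 16 GB or 24 GB consumer card would in practice depend on quantization.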
