
Google DeepMind releases Gemma 4 12B: encoder-free multimodal model runs on 16GB RAM

TL;DR

Google DeepMind has released Gemma 4 12B, a 12-billion-parameter multimodal model that runs locally on laptops with 16GB of RAM. The model eliminates separate vision and audio encoders, processing raw inputs directly through its language model backbone, and is released under an Apache 2.0 license.

Google DeepMind has released Gemma 4 12B, a 12-billion parameter multimodal model designed to run locally on consumer laptops with 16GB of RAM. The model eliminates traditional multimodal encoders, processing vision and audio inputs directly through its language model backbone.

Technical architecture

Gemma 4 12B differs from conventional multimodal models by removing separate encoder modules:

  • Vision processing: Replaces the vision encoder with a single matrix multiplication, positional embedding, and normalization layers, allowing the LLM backbone to handle visual processing directly
  • Audio processing: Projects raw audio signals into the same dimensional space as text tokens without any encoder
  • Memory footprint: Requires 16GB of VRAM or unified memory for local inference

According to Google DeepMind, this architecture reduces latency and memory usage compared to encoder-based approaches.
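
The encoder-free input path described above can be sketched in a few lines of NumPy. This is a minimal illustration, not Google DeepMind's implementation: the patch size, frame length, model width, the order of the positional and normalization steps, and all function names are assumptions chosen to keep the example small.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    tiles = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, patch * patch * c)

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def embed_image(image, w_proj, pos_emb, patch):
    """Raw patches -> token space: one matmul, positions, normalization."""
    patches = patchify(image, patch)   # (num_patches, patch * patch * C)
    tokens = patches @ w_proj          # the single matrix multiplication
    tokens = tokens + pos_emb          # one positional vector per patch
    return layer_norm(tokens)          # assumed order: normalize last

def embed_audio(waveform, w_audio, frame):
    """Raw samples -> token space: fixed-size frames, one linear projection."""
    usable = len(waveform) // frame * frame   # drop the trailing partial frame
    frames = waveform[:usable].reshape(-1, frame)
    return frames @ w_audio

# Hypothetical sizes, far smaller than any real model.
PATCH, FRAME, D_MODEL = 4, 160, 16
rng = np.random.default_rng(0)

image = rng.random((8, 8, 3))
w_proj = rng.normal(size=(PATCH * PATCH * 3, D_MODEL))
pos_emb = rng.normal(size=(4, D_MODEL))
image_tokens = embed_image(image, w_proj, pos_emb, PATCH)

waveform = rng.random(1000)
w_audio = rng.normal(size=(FRAME, D_MODEL))
audio_tokens = embed_audio(waveform, w_audio, FRAME)

# Both modalities land in the same width, so they can share one sequence
# with text token embeddings before entering the LLM backbone.
sequence = np.concatenate([image_tokens, audio_tokens], axis=0)
print(image_tokens.shape, audio_tokens.shape, sequence.shape)
```

The point of the sketch is that nothing here is a stack of dedicated encoder layers: each modality reaches the backbone through a reshape and one learned projection.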

Performance and positioning

Google DeepMind claims Gemma 4 12B delivers benchmark performance approaching its larger 26B Mixture of Experts model at less than half the memory footprint. Specific benchmark scores were not disclosed. The model sits between the company's edge-focused E4B and the 26B MoE in terms of capability and size.
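
A rough back-of-the-envelope check of the memory claim, using only the nominal parameter counts from the article (12B and 26B). The bit widths are assumptions for illustration; the source does not say which precision the 16GB figure refers to. The estimate covers weights only and ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bits_per_weight / 8

# Assumed precisions; the article does not state which one is used.
for bits in (16, 8, 4):
    dense_12b = weight_memory_gb(12, bits)
    moe_26b = weight_memory_gb(26, bits)  # an MoE still has to hold all experts
    print(f"{bits:>2}-bit: 12B = {dense_12b:.1f} GB, 26B = {moe_26b:.1f} GB, "
          f"ratio = {dense_12b / moe_26b:.2f}")
```

At equal precision the 12B model's weights take about 46% of the 26B model's, which is consistent with "less than half." Under these assumptions, 16-bit weights alone would come to 24 GB, so fitting in 16GB would presumably involve a lower-precision build, though the article does not confirm this.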

Gemma 4 12B is the first mid-sized model in the Gemma family to support native audio inputs alongside text and vision.

Availability and ecosystem support

The model is released under an Apache 2.0 license and available now through:

  • Direct download from Hugging Face and Kaggle
  • Inference tools: LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM
  • Fine-tuning: Unsloth support
  • Deployment: Google Cloud via Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE

Google also released an official Skills Repository, a library for agent development with Gemma models.

The company states the Gemma 4 model family has reached 150 million downloads. The model includes Multi-Token Prediction drafters for latency reduction.
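
The article does not describe how the Multi-Token Prediction drafters are wired in. As a generic illustration of why a drafter can cut latency, here is a toy greedy draft-and-verify loop (the usual speculative decoding idea). The two "models" are hypothetical stand-in functions, not Gemma components.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy draft-and-verify; output matches plain greedy target decoding."""
    out = list(prompt)
    verify_rounds = 0
    while len(out) - len(prompt) < num_tokens:
        # 1. The cheap drafter proposes k tokens one after another.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target checks the proposals. A real system would score all
        #    k positions in one batched forward pass; here we just count it.
        verify_rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            accepted.append(expected)   # always keep the target's own choice
            if expected != tok:
                break                   # first mismatch ends this round
        out.extend(accepted)
    return out[: len(prompt) + num_tokens], verify_rounds

# Hypothetical toy models over digits 0-9: the target counts upward,
# and the drafter agrees except right after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10

tokens, rounds = speculative_decode(target_next, draft_next, [0], num_tokens=8)
print(tokens)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
print(rounds)  # 3 verification rounds instead of 8 sequential target steps
```

Because every kept token is the target's own choice, a weak drafter costs speed but never changes the output; the latency gain depends on how often the drafter's guesses are accepted.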

What this means

Gemma 4 12B represents a shift toward unified multimodal architectures that eliminate specialized encoder modules. By processing raw audio and simplified vision embeddings directly in the LLM backbone, Google is betting on architectural simplicity over modular design. The 16GB memory requirement puts the model within reach of developer laptops, though its performance relative to encoder-based alternatives has yet to be independently verified. The Apache 2.0 license and broad tooling support position it as a practical option for local multimodal inference.
