Google DeepMind releases Gemma 4 12B: encoder-free multimodal model runs on 16GB RAM

TL;DR

Google DeepMind has released Gemma 4 12B, a 12-billion-parameter multimodal model that runs locally on laptops with 16GB of RAM. The model eliminates separate vision and audio encoders, processing raw inputs directly through its language model backbone, and is released under an Apache 2.0 license.

June 9, 2026 · 2:21 PM · 2 min read


Google DeepMind has released Gemma 4 12B, a 12-billion parameter multimodal model designed to run locally on consumer laptops with 16GB of RAM. The model eliminates traditional multimodal encoders, processing vision and audio inputs directly through its language model backbone.

Technical architecture

Gemma 4 12B differs from conventional multimodal models by removing separate encoder modules:

Vision processing: Replaces the vision encoder with a single matrix multiplication, positional embedding, and normalization layers, allowing the LLM backbone to handle visual processing directly
Audio processing: Projects raw audio signals into the same dimensional space as text tokens without any encoder
Memory footprint: Requires 16GB of VRAM or unified memory for local inference

According to Google DeepMind, this architecture reduces latency and memory usage compared to encoder-based approaches.
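The vision path described above can be illustrated with a toy sketch. This is not Google DeepMind's implementation: the patch size, model width, random weights, the choice of RMS normalization, and the order of operations are all assumptions made for illustration, and the names `patchify`, `rms_norm`, and `embed_patches` are hypothetical. It only shows how image patches could become token-sized vectors with one linear projection instead of a dedicated encoder.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + r][left + c]
                            for r in range(patch) for c in range(patch)])
    return patches


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed norm type)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def embed_patches(patches, weight, pos_emb):
    """One matrix multiplication, then positional embedding, then normalization."""
    tokens = []
    for i, patch in enumerate(patches):
        projected = [sum(patch[k] * weight[k][j] for k in range(len(patch)))
                     for j in range(len(weight[0]))]
        with_pos = [a + b for a, b in zip(projected, pos_emb[i])]
        tokens.append(rms_norm(with_pos))
    return tokens


rng = random.Random(0)
PATCH, D_MODEL = 4, 8  # hypothetical sizes, far smaller than any real model
image = [[rng.random() for _ in range(8)] for _ in range(8)]
patches = patchify(image, PATCH)
weight = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)]
          for _ in range(PATCH * PATCH)]
pos_emb = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in patches]

tokens = embed_patches(patches, weight, pos_emb)
print(len(tokens), len(tokens[0]))  # 4 patches, each an 8-dimensional token
```

The resulting vectors have the same width as text token embeddings would in this toy setup, so a backbone could consume them alongside text. The article describes the audio path in similar terms, as a projection of raw signals into the text token space.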

Performance and positioning

Google DeepMind claims Gemma 4 12B delivers benchmark performance approaching its larger 26B Mixture of Experts model at less than half the memory footprint. Specific benchmark scores were not disclosed. The model sits between the company's edge-focused E4B and the 26B MoE in terms of capability and size.
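As a rough sanity check on the memory claim, the arithmetic below estimates weight storage only, assuming the nominal parameter counts (12 billion and 26 billion) and the same precision for both models. It ignores activations, the KV cache, and runtime overhead, and the bit widths are illustrative assumptions rather than published details. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes) for a given precision."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    small = weight_memory_gb(12, bits)
    large = weight_memory_gb(26, bits)
    print(f"{bits}-bit: 12B = {small:.1f} GB, "
          f"26B = {large:.1f} GB, ratio = {small / large:.2f}")
```

Under these assumptions the ratio is about 0.46 at any precision, which is consistent with "less than half." The 16-bit estimate for 12 billion parameters is 24 GB, so fitting in 16GB would presumably rely on lower-precision weights, though the announcement as summarized here does not say which format is used.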

Gemma 4 12B is the first mid-sized model in the Gemma family to support native audio inputs alongside text and vision.

Availability and ecosystem support

The model is released under an Apache 2.0 license and available now through:

Direct download from Hugging Face and Kaggle
Inference tools: LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM
Fine-tuning: Unsloth support
Deployment: Google Cloud via Gemini Enterprise Agent Platform Model Garden, Cloud Run, and GKE

Google also released an official Skills Repository library for agent development with Gemma models.

The company states the Gemma 4 model family has reached 150 million downloads. The model includes Multi-Token Prediction drafters for latency reduction.
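The article does not describe how the Multi-Token Prediction drafters work. Drafters are commonly used for speculative decoding, so the sketch below shows a generic greedy variant of that technique, not Gemma's mechanism. Both "models" are toy stand-in functions, and every name here (`target_next`, `draft_next`, `speculative_decode`) is hypothetical.

```python
def target_next(ctx):
    """Stand-in for the large model: a deterministic toy next-token rule."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx):
    """Stand-in drafter: cheap, and wrong whenever len(ctx) is a multiple of 4."""
    guess = target_next(ctx)
    return guess if len(ctx) % 4 else (guess + 1) % 50


def greedy_decode(prompt, n_new):
    """Baseline: one target-model call per generated token."""
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx


def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens, verify them, keep the target's tokens up to the first miss."""
    ctx = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(ctx) < goal:
        draft = []
        for _ in range(k):
            draft.append(draft_next(ctx + draft))
        # A real system scores all k positions in one parallel target pass;
        # here that single pass is simulated position by position.
        verify_passes += 1
        for tok in draft:
            expected = target_next(ctx)
            ctx.append(expected)  # always emit the target's own token
            if expected != tok or len(ctx) >= goal:
                break
    return ctx, verify_passes


prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
fast, passes = speculative_decode(prompt, 12)
print(fast == baseline, passes)
```

Because every emitted token is the target model's own choice, the output matches plain greedy decoding. The potential latency gain comes from needing fewer verification passes than generated tokens whenever the drafter guesses correctly.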

What this means

Gemma 4 12B represents a shift toward unified multimodal architectures that eliminate specialized encoder modules. By processing raw audio and simplified vision embeddings directly in the LLM backbone, Google is betting on architectural simplicity over modular design. The 16GB memory requirement puts the model within reach of developer laptops, though its actual performance relative to encoder-based alternatives has yet to be independently verified. The Apache 2.0 license and broad tooling support position it as a practical option for local multimodal inference.

Source: deepmind.google

