Google DeepMind releases Gemma 4 with four model sizes, up to 256K context, multimodal support
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (2.3B to 31B parameters) with context windows up to 256K tokens. All models support text and image input, with audio native to E2B and E4B variants. The Gemma 4 31B dense model scores 85.2% on MMLU Pro, 89.2% on AIME 2026, and 80.0% on LiveCodeBench—significant improvements over Gemma 3.
Google DeepMind today released Gemma 4, an open-weights model family spanning four sizes designed for deployment from mobile devices to data centers. The lineup includes the E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active parameters), and 31B dense models, all under Apache 2.0 licensing.
Model Specifications
Gemma 4 spans two architectures:
Dense Models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, ~150M vision encoder, ~300M audio encoder
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window, ~150M vision encoder, ~300M audio encoder
- 31B: 30.7B parameters, 256K context window, ~550M vision encoder, no native audio support
Mixture-of-Experts Model:
- 26B A4B: 25.2B total parameters with 3.8B active (8 active experts from 128 total, plus 1 shared), 256K context window, ~550M vision encoder
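The 8-of-128 expert routing in the 26B A4B can be illustrated with a generic top-k Mixture-of-Experts router. The sketch below is a toy NumPy implementation under standard MoE assumptions, not Gemma 4's actual router; the shared expert and load-balancing machinery are omitted, and the dimensions are made up for illustration.

```python
import numpy as np

N_EXPERTS, TOP_K = 128, 8  # figures from the Gemma 4 26B A4B spec

def route(hidden, router_w):
    """Toy top-k router: pick 8 of 128 experts per token.

    Mixing weights come from a softmax over the selected experts' logits.
    A shared expert (always active, not shown) would additionally process
    every token. Illustrative only, not Gemma 4's implementation.
    """
    logits = hidden @ router_w                      # (tokens, 128)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]   # indices of the 8 best
    sel = np.take_along_axis(logits, top, axis=-1)  # their logits
    sel = np.exp(sel - sel.max(-1, keepdims=True))
    weights = sel / sel.sum(-1, keepdims=True)      # rows sum to 1
    return top, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 64))        # 4 tokens, toy hidden size 64
router_w = rng.standard_normal((64, N_EXPERTS))
experts, weights = route(hidden, router_w)
print(experts.shape, weights.shape)  # (4, 8) (4, 8)
```

Because only 8 expert FFNs (plus the shared one) run per token, the per-token compute tracks the 3.8B active parameters rather than the 25.2B total.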
Small models (E2B, E4B) employ Per-Layer Embeddings (PLE) to maximize on-device efficiency. All models use hybrid attention combining local sliding window (512-1024 tokens) with global layers, applying Proportional RoPE for long-context optimization.
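The local/global split can be visualized with boolean attention masks. This is a minimal sketch of generic sliding-window versus full causal attention, using a toy sequence length; real kernels never materialize such masks, and this is not Gemma 4's code.

```python
import numpy as np

def causal_mask(seq_len, window=None):
    """Boolean mask: True where query position i may attend to key j.

    window=None gives a full causal (global) layer; a finite window gives
    a local sliding-window layer (512-1024 tokens in Gemma 4's hybrid
    stack). Toy sizes here for illustration.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    mask = j <= i                  # causal: never attend to the future
    if window is not None:
        mask &= (i - j) < window   # local: only the last `window` keys
    return mask

local = causal_mask(8, window=4)   # sliding-window layer
glob = causal_mask(8)              # global layer
print(local.sum(), glob.sum())     # 26 36 — local attends to fewer keys
```

Local layers keep KV-cache cost roughly constant per layer, while the interleaved global layers preserve access to the full context; this is the usual motivation for hybrid stacks at 128K-256K windows.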
Multimodal Capabilities
All four models handle text and image input with variable aspect ratio and resolution support. E2B and E4B uniquely feature native audio support for automatic speech recognition and speech-to-translated-text across multiple languages. All models support video understanding via frame sequences and offer out-of-the-box multilingual support for 140+ languages.
Core capabilities include: configurable thinking/reasoning modes, function calling for agentic workflows, code generation and correction, document/PDF parsing, OCR, and interleaved multimodal input (freely mixing text and images).
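A function-calling loop of the kind described above can be sketched in a few lines: the model emits a structured tool call, and the host application parses and executes it. The JSON format, tool name, and registry below are hypothetical illustrations, not Gemma 4's actual function-calling schema.

```python
import json

# Hypothetical tool registry; the JSON call format is illustrative,
# not Gemma 4's actual schema.
TOOLS = {
    "get_weather": lambda city: f"18°C and clear in {city}",
}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the tool."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(**call["arguments"])

# Pretend the model emitted this after seeing the tool declarations:
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch(model_output))  # → 18°C and clear in Zurich
```

In a real agentic workflow the tool result would be appended to the conversation and fed back to the model for the next turn.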
Benchmark Performance
Gemma 4 31B (instruction-tuned) achieves:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- BigBench Extra Hard: 74.4%
- Vision MMMU Pro: 76.9%
- MATH-Vision: 85.6%
The 26B A4B MoE variant scores 82.6% on MMLU Pro and 88.3% on AIME 2026, delivering near-31B performance with just 3.8B active parameters. At the small end, the E4B reaches 69.4% on MMLU Pro and 52.0% on LiveCodeBench, and the E2B 60.0% on MMLU Pro: substantial gains over the far larger Gemma 3 27B (67.6% MMLU Pro, 29.1% LiveCodeBench), allowing on-device deployment without sacrificing reasoning capability.
Availability and Deployment
All models are available via Hugging Face with Transformers integration. Google provides inference code supporting text generation, image/video/audio processing, and reasoning modes. The diverse architecture options enable deployment across phones, laptops, edge devices, consumer GPUs, and enterprise servers.
What This Means
Gemma 4 pushes aggressively along the efficiency-to-capability spectrum. The E2B and E4B variants with native audio represent Google's push into on-device multimodal AI, while the 31B and 26B A4B compete directly with Meta's Llama models on reasoning benchmarks. Google's emphasis on function calling and thinking modes positions Gemma 4 for agentic workflows. The Apache 2.0 licensing ensures commercial usability, though real-world inference cost and latency figures remain unreleased; those are critical metrics for weighing on-device against cloud deployment.