Google DeepMind releases Gemma 4 family: multimodal models from 2.3B to 31B parameters with up to 256K context
Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (25.2B total, 3.8B active, Mixture-of-Experts), and 31B dense. All models support text and image input with 128K-256K context windows; E2B and E4B add native audio capabilities. All models feature reasoning modes, function calling, and multilingual support across 140+ languages.
Google DeepMind released the Gemma 4 family of open-weights models across four distinct sizes, each optimized for different deployment scenarios from mobile devices to server infrastructure.
Model Lineup and Architecture
The release includes:
- Gemma 4 E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
- Gemma 4 E4B: 4.5B effective parameters (8B with embeddings), 128K context window, text, image, and audio
- Gemma 4 26B A4B: 25.2B total parameters with 3.8B active parameters (Mixture-of-Experts), 256K context window, text and image
- Gemma 4 31B: 30.7B parameters, 256K context window, text and image
The smaller E-series models use Per-Layer Embeddings (PLE) to reduce the effective parameter count while maintaining model capacity. The 26B A4B employs a Mixture-of-Experts architecture with 128 experts, activating only 8 per token during inference, which yields inference speed comparable to a 4B dense model.
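The compute saving from sparse routing follows directly from activating 8 of 128 experts per token. The sketch below illustrates top-k expert selection with a toy router; the expert count and top-k match the article, but the routing function itself is a generic illustration, not Gemma 4's actual implementation:

```python
import numpy as np

def topk_route(logits: np.ndarray, k: int = 8) -> np.ndarray:
    """Return the indices of the k highest-scoring experts for one token."""
    return np.argpartition(logits, -k)[-k:]

# Toy router over 128 experts, selecting the top 8 per token.
rng = np.random.default_rng(0)
num_experts, k = 128, 8
router_logits = rng.normal(size=num_experts)
chosen = topk_route(router_logits, k)

# Only the chosen experts run in the forward pass, so per-token expert
# compute scales with k / num_experts rather than the full expert count.
active_fraction = k / num_experts
print(active_fraction)  # 0.0625
```

With this ratio, roughly 1/16 of the expert weights participate in each token's forward pass, which is how a 25.2B-total-parameter model can run at 3.8B-active cost.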
All models use hybrid attention combining sliding-window local attention with global attention in final layers, optimized with Proportional RoPE for long-context efficiency.
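The hybrid attention pattern can be pictured as two mask types: local layers restrict each query to a recent window of keys, while global layers keep full causal attention. The window size and sequence length below are illustrative placeholders, not Gemma 4's actual configuration:

```python
from typing import Optional

import numpy as np

def attention_mask(seq_len: int, window: Optional[int]) -> np.ndarray:
    """Boolean causal mask: True where query i may attend to key j.

    window=None -> global causal attention;
    window=w    -> sliding-window causal attention (last w keys only).
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if window is None:
        return causal
    return causal & (j > i - window)

# Local layers see only a short window; global layers retain full
# causal reach for long-range recall.
local = attention_mask(8, window=4)
global_ = attention_mask(8, window=None)
print(int(local.sum()), int(global_.sum()))  # 26 36
```

Because the local mask caps attended positions at the window size, its cost grows linearly with sequence length instead of quadratically, which is the usual motivation for reserving global attention for a few final layers.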
Multimodal Capabilities
All four models handle text and image input with variable aspect ratios and resolutions. E2B and E4B additionally include dedicated audio encoders (~300M parameters each) for native automatic speech recognition and speech-to-translated-text across multiple languages.
Core capabilities include: reasoning with configurable thinking modes, function calling for agentic workflows, video understanding via frame sequences, document/PDF parsing, OCR across 140+ languages, and code generation.
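Function calling in agentic workflows typically follows a declare-call-return loop: the model emits a structured tool call, the host executes it, and the result is fed back as a new turn. The message shapes and tool below are generic illustrations; the article does not specify Gemma 4's exact function-calling schema:

```python
import json

# Hypothetical tool registry; names and payloads are illustrative only.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def handle_model_turn(turn: dict) -> dict:
    """Execute a tool call emitted by the model and wrap the result
    as a tool-role message to feed back into the conversation."""
    if turn.get("type") != "tool_call":
        return turn  # ordinary text turn, pass through unchanged
    result = TOOLS[turn["name"]](**turn["arguments"])
    return {"role": "tool", "name": turn["name"],
            "content": json.dumps(result)}

# A structured call as the model might emit it after seeing tool schemas.
model_turn = {"type": "tool_call", "name": "get_weather",
              "arguments": {"city": "Berlin"}}
reply = handle_model_turn(model_turn)
print(reply["content"])
```

The host-side loop, not the model, owns execution: the model only proposes calls, which keeps arbitrary code out of the generation path.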
Benchmark Performance
Instruction-tuned variants, evaluated against instruction-tuned baselines:
Reasoning and Coding:
- MMLU Pro: E2B 60.0% | E4B 69.4% | 26B A4B 82.6% | 31B 85.2%
- AIME 2026 (no tools): E2B 37.5% | E4B 42.5% | 26B A4B 88.3% | 31B 89.2%
- LiveCodeBench v6: E2B 44.0% | E4B 52.0% | 26B A4B 77.1% | 31B 80.0%
- Codeforces ELO: E2B 633 | E4B 940 | 26B A4B 1718 | 31B 2150
Multimodal Vision:
- MMMU Pro: E2B 44.2% | E4B 52.6% | 26B A4B 73.8% | 31B 76.9%
- MATH-Vision: E2B 52.4% | E4B 59.5% | 26B A4B 82.4% | 31B 85.6%
Long Context (MRCR v2 at 128K, 8-needle average):
- E2B 19.1% | E4B 25.4% | 26B A4B 44.1% | 31B 66.4%
Audio (E2B/E4B only):
- CoVoST2 (speech translation): E2B 33.47 | E4B 35.54
- FLEURS (character error rate): E2B 0.09 | E4B 0.08
Technical Details and Licensing
All models are released under the Apache 2.0 license, with weights available on Hugging Face. Models use a 262K-token vocabulary and include native system prompt support for structured conversations. The training cutoff date and exact training data composition were not disclosed.
Integration requires a recent version of the Transformers library; single-GPU inference is supported via the AutoModelForCausalLM and AutoModelForMultimodalLM APIs.
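A minimal text-only loading sketch with the standard Transformers API is shown below. The checkpoint id is a placeholder, not a confirmed repository name, and multimodal input would go through the separate multimodal class the article mentions; running this requires downloading the actual released weights:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id -- substitute the actual Gemma 4 repository
# published on Hugging Face.
model_id = "google/gemma-4-e2b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",   # place weights on the available GPU
    torch_dtype="auto",  # use the checkpoint's native precision
)

inputs = tokenizer("Explain mixture-of-experts in one sentence.",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```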
What this means
Gemma 4 significantly broadens deployment options. The efficient E-series models target edge and mobile devices with reasonable capability trade-offs, while the larger variants compete with dense rivals on reasoning benchmarks. The MoE variant offers a middle ground: competitive performance at inference speeds closer to 4B-class models. The 256K context in the larger models and the integrated audio/vision support position Gemma 4 as a comprehensive open alternative to closed multimodal systems, though long-context scores (19-66% on the 8-needle MRCR v2 task) suggest practical limitations remain at extreme context lengths.