Google DeepMind releases Gemma 4 family: multimodal models from 2.3B to 31B parameters with up to 256K context
Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active parameters), and 31B dense. All models support text and image input with 128K-256K context windows; E2B and E4B add native audio capabilities. Models feature reasoning modes, function calling, and multilingual support across 140+ languages.
Google DeepMind released the Gemma 4 family of open-weights models across four distinct sizes, each optimized for different deployment scenarios from mobile devices to server infrastructure.
Model Lineup and Architecture
The release includes:
- Gemma 4 E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
- Gemma 4 E4B: 4.5B effective parameters (8B with embeddings), 128K context window, text, image, and audio
- Gemma 4 26B A4B: 25.2B total parameters with 3.8B active parameters (Mixture-of-Experts), 256K context window, text and image
- Gemma 4 31B: 30.7B parameters, 256K context window, text and image
The smaller E-series models use Per-Layer Embeddings (PLE) to reduce effective parameter count while maintaining capacity. The 26B A4B employs a Mixture-of-Experts architecture with 128 total experts, activating only 8 per token during inference, enabling fast execution comparable to a 4B model.
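The routing described above can be sketched in a few lines. This is a minimal illustration of top-k expert selection, not Gemma 4's actual router: the expert count (128) and the number activated per token (8) come from the article, while the softmax over only the selected logits, the scalar toy experts, and the names `route_token` and `moe_output` are assumptions made for the example. Note that 8 of 128 experts is 6.25% of the experts, whereas the reported active share of parameters (3.8B of 25.2B) is about 15%, presumably because non-expert weights such as attention run for every token.

```python
import math
import random

NUM_EXPERTS = 128   # total experts, per the article
TOP_K = 8           # experts activated per token, per the article


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and normalize their weights.

    Assumption: softmax over only the selected logits. The real router
    may normalize differently.
    """
    # Indices of the top_k largest router logits.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Numerically stable softmax restricted to the chosen experts.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}


def moe_output(token, experts, router_logits):
    """Weighted sum of the selected experts' outputs for a scalar toy token."""
    weights = route_token(router_logits)
    # Only the selected experts are evaluated; the rest are skipped entirely.
    return sum(w * experts[i](token) for i, w in weights.items())


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
    # Toy experts: each scales its input by a different constant.
    experts = [lambda x, s=i: x * (s + 1) for i in range(NUM_EXPERTS)]
    weights = route_token(logits)
    print(len(weights), "of", NUM_EXPERTS, "experts run for this token")
    print("mixed output:", moe_output(1.0, experts, logits))
```

The compute saving comes from the fact that `moe_output` never calls the 120 unselected experts, which is why per-token cost tracks the active parameter count rather than the total.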
All models use hybrid attention combining sliding-window local attention with global attention in final layers, optimized with Proportional RoPE for long-context efficiency.
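To make the local/global distinction concrete, the sketch below builds causal attention masks with and without a sliding window and counts how many query-key pairs each one keeps. The window size, layer count, and number of global layers are illustrative assumptions, since the article does not give them, and the helper names are hypothetical. Proportional RoPE is not modeled here.

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean mask; True means query i may attend to key j.

    window=None gives full (global) causal attention. An integer restricts
    each query to the most recent `window` positions, itself included.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            visible = j <= i                      # causal: no looking ahead
            if window is not None:
                visible = visible and (i - j) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows; a rough proxy for cost."""
    return sum(sum(row) for row in mask)


def layer_schedule(num_layers, num_global_final, window):
    """Per-layer window sizes: local for early layers, None (global) for the last ones.

    Assumed schedule, matching the article's description only loosely.
    """
    return [window if layer < num_layers - num_global_final else None
            for layer in range(num_layers)]


if __name__ == "__main__":
    seq_len, window = 1024, 128                   # illustrative values only
    local = attended_pairs(causal_mask(seq_len, window))
    full = attended_pairs(causal_mask(seq_len))
    print("local pairs:", local, "global pairs:", full)
    print(layer_schedule(num_layers=6, num_global_final=1, window=window))
```

Local layers grow linearly with sequence length while global layers grow quadratically, which is the usual motivation for keeping most layers local at long context.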
Multimodal Capabilities
All four models handle text and image input with variable aspect ratio and resolution support. E2B and E4B additionally feature native audio processing for automatic speech recognition and speech-to-translated-text across multiple languages, each using a dedicated audio encoder (~300M parameters).
Core capabilities include: reasoning with configurable thinking modes, function calling for agentic workflows, video understanding via frame sequences, document/PDF parsing, OCR across 140+ languages, and code generation.
Benchmark Performance
Instruction-tuned variant results against instruction-tuned baselines:
Reasoning and Coding:
- MMLU Pro: E2B 60.0% | E4B 69.4% | 26B A4B 82.6% | 31B 85.2%
- AIME 2026 (no tools): E2B 37.5% | E4B 42.5% | 26B A4B 88.3% | 31B 89.2%
- LiveCodeBench v6: E2B 44.0% | E4B 52.0% | 26B A4B 77.1% | 31B 80.0%
- Codeforces ELO: E2B 633 | E4B 940 | 26B A4B 1718 | 31B 2150
Multimodal Vision:
- MMMU Pro: E2B 44.2% | E4B 52.6% | 26B A4B 73.8% | 31B 76.9%
- MATH-Vision: E2B 52.4% | E4B 59.5% | 26B A4B 82.4% | 31B 85.6%
Long Context (MRCR v2 at 128K, 8-needle average):
- E2B 19.1% | E4B 25.4% | 26B A4B 44.1% | 31B 66.4%
Audio (E2B/E4B only):
- CoVoST2: E4B 35.54 | E2B 33.47
- FLEURS character error rate: E4B 0.08 | E2B 0.09
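For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between the model's transcript and the reference, divided by the reference length, so lower is better and 0.08 corresponds to roughly 8 character edits per 100 reference characters. The sketch below shows that conventional definition; the article does not say what text normalization or scoring tool was used for the reported figures, so this is an illustration rather than the evaluation code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # delete a reference character
                            curr[j - 1] + 1,           # insert a hypothesis character
                            prev[j - 1] + (r != h)))   # substitute, or match at cost 0
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # One substituted character out of eleven in the reference.
    print(character_error_rate("hello world", "hallo world"))
```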
Technical Details and Licensing
All models are released under the Apache 2.0 license, with weights available on Hugging Face. The models use a 262K-token vocabulary and include native system prompt support for structured conversations. The training cutoff date and exact training data composition were not disclosed.
Integration requires the latest version of the Transformers library. The models run single-GPU inference through the AutoModelForCausalLM and AutoModelForMultimodalLM APIs.
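Whether a given model fits on one GPU depends mostly on weight storage, which can be estimated from the parameter counts in the lineup. The sketch below is back-of-the-envelope arithmetic only: the bytes-per-parameter figures are the standard sizes for each precision, while the 80% headroom budget, the example GPU sizes, and the helper names are assumptions. It uses the with-embeddings totals for the E-series as the conservative choice, and the full 25.2B for the MoE model, since all experts typically need to be resident in memory even though only 3.8B parameters are active per token.

```python
# Parameter counts in billions, taken from the model lineup in the article.
# E-series figures are the totals including embeddings.
TOTAL_PARAMS_B = {
    "Gemma 4 E2B": 5.1,
    "Gemma 4 E4B": 8.0,
    "Gemma 4 26B A4B": 25.2,
    "Gemma 4 31B": 30.7,
}

# Storage per parameter for common precisions.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion, dtype="bf16"):
    """Approximate weight storage in GB (1 GB = 1e9 bytes). Weights only."""
    return params_billion * BYTES_PER_PARAM[dtype]


def fits(params_billion, gpu_memory_gb, dtype="bf16", headroom=0.8):
    """True if the weights fit within a fraction of GPU memory.

    headroom is an assumed budget that leaves space for activations and
    the KV cache, which grows with context length.
    """
    return weight_memory_gb(params_billion, dtype) <= gpu_memory_gb * headroom


if __name__ == "__main__":
    for name, params in TOTAL_PARAMS_B.items():
        for gpu_gb in (24, 80):  # illustrative GPU sizes
            verdict = "fits" if fits(params, gpu_gb) else "does not fit"
            print(f"{name}: {weight_memory_gb(params):.1f} GB in bf16, "
                  f"{verdict} on a {gpu_gb} GB GPU")
```

At long context the KV cache can rival the weights in size, so these numbers are a lower bound on what a deployment actually needs.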
What this means
Gemma 4 significantly expands deployment options. The efficient E-series models target edge and mobile devices with reasonable capability trade-offs, while the larger variants compete with dense competitors on reasoning benchmarks. The MoE variant offers a middle ground: competitive performance with inference speed closer to 4B-class models. The 256K context on the two larger models and the integrated audio and vision support position Gemma 4 as a comprehensive open alternative to closed multimodal systems, though long-context results (19-66% on the 8-needle MRCR v2 test at 128K) suggest practical limitations remain well before the maximum context length.