Google DeepMind releases Gemma 4 with four model sizes, up to 256K context, multimodal support
Google DeepMind released Gemma 4, an open-weights multimodal model family in four sizes (2.3B to 31B parameters) with context windows up to 256K tokens. All models support text and image input, with audio native to E2B and E4B variants. The Gemma 4 31B dense model scores 85.2% on MMLU Pro, 89.2% on AIME 2026, and 80.0% on LiveCodeBench—significant improvements over Gemma 3.
Google DeepMind Releases Gemma 4: Four Multimodal Models with Up to 256K Context
Google DeepMind today released Gemma 4, an open-weights model family spanning four sizes designed for deployment from mobile devices to data centers. The lineup includes the E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active parameters), and 31B dense models, all under Apache 2.0 licensing.
Model Specifications
Gemma 4 introduces architectural innovations including:
Dense Models:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, ~150M vision encoder, ~300M audio encoder
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window, ~150M vision encoder, ~300M audio encoder
- 31B: 30.7B parameters, 256K context window, ~550M vision encoder, no native audio support
Mixture-of-Experts Model:
- 26B A4B: 25.2B total parameters with 3.8B active (8 active experts from 128 total, plus 1 shared), 256K context window, ~550M vision encoder
Small models (E2B, E4B) employ Per-Layer Embeddings (PLE) to maximize on-device efficiency. All models use hybrid attention combining local sliding window (512-1024 tokens) with global layers, applying Proportional RoPE for long-context optimization.
Multimodal Capabilities
All four models handle text and image input with variable aspect ratio and resolution support. E2B and E4B uniquely feature native audio support for automatic speech recognition and speech-to-translated-text across multiple languages. All models support video understanding via frame sequences and offer out-of-the-box multilingual support for 140+ languages.
Core capabilities include: configurable thinking/reasoning modes, function calling for agentic workflows, code generation and correction, document/PDF parsing, OCR, and interleaved multimodal input (freely mixing text and images).
Benchmark Performance
Gemma 4 31B (instruction-tuned) achieves:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- BigBench Extra Hard: 74.4%
- Vision MMMU Pro: 76.9%
- MATH-Vision: 85.6%
The 26B A4B MoE variant scores 82.6% on MMLU Pro and 88.3% on AIME, delivering near-31B performance with 4B active parameters. The E4B achieves 69.4% on MMLU Pro and 52.0% on LiveCodeBench—substantial improvements over Gemma 3 27B (67.6% MMLU Pro, 29.1% LiveCodeBench).
Smaller models (E2B: 60.0% MMLU Pro, E4B: 69.4%) target on-device deployment without sacrificing reasoning capability.
Availability and Deployment
All models are available via Hugging Face with Transformers integration. Google provides inference code supporting text generation, image/video/audio processing, and reasoning modes. The diverse architecture options enable deployment across phones, laptops, edge devices, consumer GPUs, and enterprise servers.
What This Means
Gemma 4 targets the efficiency-to-capability spectrum aggressively. The E2B and E4B variants with native audio represent Google's push into on-device multimodal AI, while the 31B and 26B A4B compete directly with Meta's Llama models on reasoning benchmarks. Google's emphasis on function calling and thinking modes positions Gemma 4 for agentic workflows. The Apache 2.0 licensing ensures commercial usability, though real-world inference costs and latency data remain unreleased—critical metrics for evaluating on-device vs. cloud deployment trade-offs.
Related Articles
Cohere Releases Command A+ Open Source Model with 25B Active Parameters, 128K Context
Cohere has released Command A+ as an open source model under Apache 2.0 license. The sparse mixture-of-experts architecture features 25 billion active parameters out of 218B total parameters, supports 128K input context length, and includes vision capabilities alongside tool use and reasoning features.
Cohere Releases Command A+: 218B-Parameter MoE Model With 4-Bit Quantization Runs on Single B200 GPU
Cohere has released Command A+, an open-source sparse mixture-of-experts model with 218 billion total parameters and 25 billion active parameters. The model features W4A4 quantization allowing deployment on a single Nvidia B200 GPU, supports 128K input context, and includes built-in chain-of-thought reasoning with vision capabilities.
Google releases Gemini Omni Flash video generation model with conversational editing, withholds speech synthesis
Google DeepMind released Gemini Omni Flash, the first model in its new Omni family that generates and edits video from image, audio, video, and text inputs. The model is rolling out to Gemini app subscribers and YouTube Shorts with a 10-second clip limit, while speech-editing capabilities remain withheld pending safety testing.
Google releases Gemini 3.5 Flash with 4x faster output and agentic capabilities, 3.5 Pro coming June
Google released Gemini 3.5 Flash today with 4x faster output token generation than competing frontier models while surpassing Gemini 3.1 Pro on coding, agentic, and multimodal benchmarks. The company announced Gemini 3.5 Pro will launch next month and introduced Gemini Omni, a new multimodal series that outputs video.
Comments
Loading...