Google DeepMind releases Gemma 4: multimodal models up to 31B parameters with 256K context
Google DeepMind released the Gemma 4 family of open-weights multimodal models in four sizes: E2B (2.3B effective), E4B (4.5B effective), 26B A4B (25.2B total, 3.8B active), and 31B dense. All models support text and image input with 128K-256K context windows, reasoning modes, and native function calling for agentic workflows.
Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes, from 2.3B effective parameters to 31B parameters, available on Hugging Face under the Apache 2.0 license.
Model Specifications
The Gemma 4 lineup includes:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window, supports text, image, and audio
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window, supports text, image, and audio
- 26B A4B: 25.2B total parameters with only 3.8B active during inference, 256K context window, supports text and image
- 31B: 30.7B parameters, 256K context window, supports text and image
The smaller models (E2B/E4B) use Per-Layer Embeddings (PLE), which is why their effective parameter counts are lower than their totals with embeddings, while maintaining multilingual support across 140+ languages. The 26B A4B uses a Mixture-of-Experts architecture that activates 8 of its 128 experts, so inference speed is comparable to a 4B model even though the model holds roughly 26B parameters in total.
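To make the 8-of-128 expert selection concrete, here is a minimal sketch of top-k routing for a single token. It is an illustration under stated assumptions, not Gemma 4's actual router: the random logits stand in for learned router scores, and normalizing the weights with a softmax over only the selected experts is one common choice that the article does not confirm. The names route_token and softmax are hypothetical helpers.

```python
import math
import random

NUM_EXPERTS = 128  # total experts, as stated in the article
TOP_K = 8          # experts activated at a time, as stated in the article


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This normalization is an assumption, not documented Gemma 4 behavior.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in router scores
routing = route_token(logits)

# Only 8 of 128 experts run for this token: 8 / 128 = 0.0625
active_fraction = TOP_K / NUM_EXPERTS
```

Only the chosen experts' feed-forward blocks would run for this token, which is how a model with 25.2B total parameters can get by with 3.8B active ones.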
Key Capabilities
All Gemma 4 models feature:
- Reasoning mode: Configurable thinking modes enabling step-by-step problem solving
- Multimodal input: Text and images with variable aspect ratio and resolution; video as frame sequences; audio (E2B/E4B only) for ASR and speech translation
- Function calling: Native structured tool use for autonomous agent workflows
- Long context: 128K (E2B/E4B) or 256K (26B A4B/31B) token windows
- Coding support: Code generation, completion, and correction with notable benchmark improvements
- Native system prompts: Enhanced control over conversational behavior
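As a rough illustration of what structured tool use involves on the application side, the sketch below parses a tool call and dispatches it to a local function. Everything in it is hypothetical: the article does not describe the format Gemma 4 emits, so the JSON shape with "name" and "arguments" keys, the get_weather tool, and the dispatch helper are assumptions chosen for illustration.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool that returns a canned forecast string."""
    return f"Sunny in {city}"


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse a JSON tool call and invoke the matching registered tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for text a model might emit; the real format may differ.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
result = dispatch(model_output)
```

In an agent loop, the result would typically be fed back to the model as a new turn so it can continue reasoning with the tool output.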
The architecture employs hybrid attention mechanisms combining local sliding window attention (512-1024 tokens) with full global attention on final layers, optimized with Proportional RoPE (p-RoPE) for long-context memory efficiency.
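The difference between local sliding-window attention and full global attention can be sketched with boolean masks. This is a toy illustration, not Gemma 4's implementation: the sequence length and window below are tiny stand-ins for the 512-1024 token windows cited above, and the function names are hypothetical.

```python
def causal_mask(seq_len):
    """Global attention: position q may attend to every position k <= q."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local attention: position q attends to the last `window` positions, itself included."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


def attended_pairs(mask):
    """Count the (query, key) pairs a mask allows."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 16, 4  # toy sizes chosen for readability

global_cost = attended_pairs(causal_mask(SEQ_LEN))                  # 16 * 17 / 2 = 136
local_cost = attended_pairs(sliding_window_mask(SEQ_LEN, WINDOW))   # 10 + 12 * 4 = 58
```

The global count grows quadratically with sequence length while the local count grows linearly, which is the usual motivation for keeping most layers local and reserving global attention for a few.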
Benchmark Performance
Instruction-tuned model evaluation shows:
31B Dense Model:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces Elo: 2150
- GPQA Diamond: 84.3%
26B A4B (MoE):
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces Elo: 1718
- GPQA Diamond: 82.3%
E4B:
- MMLU Pro: 69.4%
- LiveCodeBench v6: 52.0%
- Codeforces Elo: 940
Vision benchmarks show MMMU Pro scores of 76.9% (31B), 73.8% (26B A4B), and 52.6% (E4B). The 31B model achieved 66.4% on long-context needle-in-haystack evaluation at 128K tokens.
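For readers who want the dense-versus-MoE gap at a glance, the following sketch computes the point differences between the 31B and 26B A4B scores reported above. The numbers are copied from this article; the variable names are illustrative only.

```python
# (31B dense, 26B A4B) percentage scores as reported in this article
SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "GPQA Diamond": (84.3, 82.3),
    "MMMU Pro": (76.9, 73.8),
}

# Percentage-point gap per benchmark, rounded to one decimal place
gaps = {name: round(dense - moe, 1) for name, (dense, moe) in SCORES.items()}
largest_gap = max(gaps.values())  # 3.1 points, on MMMU Pro

# Codeforces is reported as a rating rather than a percentage
codeforces_gap = 2150 - 1718  # 432 rating points
```

On the percentage-scored benchmarks the MoE model trails by about one to three points, while the Codeforces rating gap is proportionally larger.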
Deployment Flexibility
Google positions Gemma 4 for diverse deployment scenarios: E2B and E4B for mobile and edge devices; 26B A4B for consumer GPUs and workstations balancing speed and capability via MoE; 31B for high-end servers requiring maximum performance. All models are available in both pre-trained and instruction-tuned variants.
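A back-of-the-envelope estimate of weight memory helps explain these tiers. The sketch below multiplies each model's total parameter count (from the specifications above) by bytes per parameter. It covers weights only and ignores activations, KV cache, and runtime overhead. The precision options are assumptions for illustration, since the article does not say which quantized formats are offered, and it assumes all expert weights of the 26B A4B stay resident in memory.

```python
# Bytes per parameter at common precisions (illustrative assumption)
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Total parameters in billions, taken from the article
PARAMS_BILLIONS = {
    "E2B": 5.1,       # with embeddings
    "E4B": 8.0,       # with embeddings
    "26B A4B": 25.2,  # total, not the 3.8B active
    "31B": 30.7,
}


def weight_gb(params_billions, precision):
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


footprints = {
    model: {p: weight_gb(n, p) for p in BYTES_PER_PARAM}
    for model, n in PARAMS_BILLIONS.items()
}
# Example: the 31B model needs about 61.4 GB for bf16 weights alone.
```

Under these assumptions the MoE model saves compute per token rather than memory: its weights still occupy roughly as much space as a dense model of the same total size.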
What This Means
Gemma 4 extends Google's open-model strategy to multimodal reasoning at multiple efficiency tiers. The 26B A4B model's sparse activation offers an alternative to dense models: it scores within about three points of the 31B model on MMLU Pro, AIME 2026, LiveCodeBench v6, and GPQA Diamond while activating only 3.8B of its 25.2B parameters and running an estimated 6-7× faster, though its Codeforces rating is notably lower (1718 versus 2150). With 256K context windows and reasoning modes, Gemma 4 is positioned to compete with closed models in long-context and agentic use cases while remaining deployable from phones to data centers. The Apache 2.0 license permits commercial use, subject to its attribution and notice requirements.