Google DeepMind releases Gemma 4 with 4 model sizes, 256K context, and multimodal reasoning
Google DeepMind released Gemma 4, a family of open-weights multimodal models in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), 26B A4B (3.8B active), and 31B (30.7B parameters). All models accept text and image input, offer configurable reasoning modes, and have context windows of 128K to 256K tokens; E2B and E4B add native audio input. Pre-training covers 140+ languages.
The Gemma 4 family spans four sizes aimed at deployment from mobile devices to high-end servers. It includes both dense and Mixture-of-Experts (MoE) variants, all released under the Apache 2.0 license.
Model Specifications
The Gemma 4 family comprises:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context window
- E4B: 4.5B effective parameters (8B with embeddings), 128K context window
- 26B A4B: 3.8B active parameters out of 25.2B total (MoE architecture), 256K context window
- 31B Dense: 30.7B parameters, 256K context window
The "E" designation indicates effective parameters, a count achieved through Per-Layer Embeddings (PLE), while "A" denotes the parameters active per token in the MoE variant. Because only 3.8B of its 25.2B parameters are active at a time, the 26B A4B runs nearly as fast as a 4B model during inference while, according to Google, maintaining frontier-level performance.
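The ratios behind the "E" and "A" labels can be worked out directly from the counts listed above. The following is a minimal sketch, not anything from the release itself: the function names are illustrative, and it assumes that the gap between the "with embeddings" and "effective" counts consists entirely of embedding parameters.

```python
# Parameter counts in billions, as quoted in the article.
EFFECTIVE_MODELS = {
    "E2B": {"effective": 2.3, "with_embeddings": 5.1},
    "E4B": {"effective": 4.5, "with_embeddings": 8.0},
}
MOE_ACTIVE_B = 3.8   # 26B A4B: parameters active per token
MOE_TOTAL_B = 25.2   # 26B A4B: total parameters


def embedding_share(effective_b, with_embeddings_b):
    """Share of the with-embeddings count that is not counted as 'effective'.

    Assumption: the whole difference is embedding parameters.
    """
    return (with_embeddings_b - effective_b) / with_embeddings_b


def active_fraction(active_b, total_b):
    """Fraction of an MoE model's parameters used for each token."""
    return active_b / total_b


for name, counts in EFFECTIVE_MODELS.items():
    share = embedding_share(counts["effective"], counts["with_embeddings"])
    print(f"{name}: {share:.1%} of the with-embeddings count is embeddings")

moe = active_fraction(MOE_ACTIVE_B, MOE_TOTAL_B)
print(f"26B A4B: {moe:.1%} of parameters active per token")
```

Under these assumptions, roughly 15% of the 26B A4B's parameters take part in any one forward step, which is the basis for the comparison with a 4B model's inference speed.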
Multimodal and Reasoning Capabilities
All models handle text and image input with variable aspect ratio and resolution support. E2B and E4B add native audio input, covering automatic speech recognition (ASR) and speech-to-text translation. All models include configurable thinking modes for step-by-step reasoning and support native function calling for agentic workflows.
The models support 140+ languages in pre-training with 35+ languages confirmed for downstream tasks.
Benchmark Performance
Gemma 4 31B achieved:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- Vision MMMU Pro: 76.9%
- Long Context (MRCR v2, 8 needle @ 128K): 66.4%
The 26B A4B MoE variant tracked closely behind on most benchmarks, with MMLU Pro 82.6%, AIME 2026 88.3%, LiveCodeBench 77.1%, and GPQA Diamond 82.3%. The gap was wider on Codeforces, where it reached an ELO of 1718.
Scores drop with model size: E4B achieved MMLU Pro 69.4% and GPQA Diamond 58.6%, while E2B reached 60.0% and 43.4% respectively.
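To make the distance between the two larger models concrete, the figures quoted above can be subtracted benchmark by benchmark. This is a small illustrative sketch; it assumes the AIME 2026 and LiveCodeBench numbers reported for the 26B A4B use the same settings as the 31B entries (no tools, v6), which the article does not state explicitly.

```python
# Scores as quoted in the article (percentages, except Codeforces ELO).
SCORES = {
    "31B": {
        "MMLU Pro": 85.2,
        "AIME 2026": 89.2,
        "LiveCodeBench": 80.0,
        "Codeforces ELO": 2150,
        "GPQA Diamond": 84.3,
    },
    "26B A4B": {
        "MMLU Pro": 82.6,
        "AIME 2026": 88.3,
        "LiveCodeBench": 77.1,
        "Codeforces ELO": 1718,
        "GPQA Diamond": 82.3,
    },
}


def score_gaps(larger, smaller):
    """Per-benchmark difference (larger minus smaller), rounded to one decimal."""
    return {name: round(larger[name] - smaller[name], 1)
            for name in larger if name in smaller}


gaps = score_gaps(SCORES["31B"], SCORES["26B A4B"])
for benchmark, gap in gaps.items():
    print(f"{benchmark}: {gap}")
```

The percentage benchmarks differ by under three points each, while the Codeforces rating differs by 432 ELO points.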
Technical Architecture
Gemma 4 employs a hybrid attention mechanism combining local sliding window attention (512-1024 tokens depending on size) with global full attention in the final layer. This balances computational efficiency with long-context awareness. Global layers use unified Keys and Values with Proportional RoPE (p-RoPE) for memory optimization.
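The efficiency argument can be illustrated by comparing attention masks. The sketch below is a toy example rather than Gemma 4's implementation: it uses a sequence of 8 tokens and a window of 3 instead of the 512-1024 token windows mentioned above, and it assumes a causal window in which each token attends to itself and the tokens immediately before it.

```python
def causal_mask(seq_len):
    """Global (full) causal attention: each query sees every earlier position."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]


def sliding_window_mask(seq_len, window):
    """Local causal attention: each query sees itself and the previous
    window - 1 positions (one common convention; others exist)."""
    return [[(k <= q) and (q - k < window) for k in range(seq_len)]
            for q in range(seq_len)]


def attended_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


SEQ_LEN, WINDOW = 8, 3  # toy sizes chosen for readability
global_pairs = attended_pairs(causal_mask(SEQ_LEN))
local_pairs = attended_pairs(sliding_window_mask(SEQ_LEN, WINDOW))
print(f"global: {global_pairs} pairs, local: {local_pairs} pairs")
```

The global mask grows quadratically with sequence length, while the local mask grows linearly once the sequence exceeds the window, which is why local layers are cheaper at long context.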
Vision encoders add ~150M parameters to E2B/E4B and ~550M to larger models. Audio encoders add ~300M parameters to E2B and E4B only.
Deployment and Availability
Models are available via Hugging Face with full Transformers library support. The smaller E2B and E4B models target mobile phones and laptops, while 26B A4B and 31B Dense scale to consumer GPUs, workstations, and servers. All models include native system prompt support for structured conversations.
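A rough way to match these models to hardware is to estimate the memory needed for the weights alone. The sketch below is a back-of-the-envelope calculation under stated assumptions: the precision names and bytes-per-parameter values are generic, not a list of formats Google ships; the E2B and E4B entries use the with-embeddings counts; the MoE entry uses total parameters, on the assumption that all experts must be resident; and KV cache, activations, and the vision and audio encoders are ignored.

```python
# Generic bytes-per-parameter values (assumed, not from the release).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Parameter counts in billions, as quoted in the article.
PARAMS_B = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}


def weight_memory_gb(params_billion, dtype):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * BYTES_PER_PARAM[dtype]


for model, params in PARAMS_B.items():
    estimates = ", ".join(
        f"{dtype}: {weight_memory_gb(params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{model} -> {estimates}")
```

Actual requirements will be higher once the KV cache for long contexts is included, so these figures are best read as lower bounds.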
What This Means
Gemma 4 extends Google's open-weights lineup across the model-size spectrum. Its parameter-efficient designs (effective parameters in E2B and E4B, active parameters in 26B A4B) target deployments that previously required much larger models. Reasoning modes, multimodal input, and function calling make the family suitable for complex reasoning and agent applications without depending on proprietary APIs. The benchmark results indicate competitive scaling within each size class, though 31B-class models from other vendors maintain leads on reasoning benchmarks. The 256K context window on the two larger models addresses enterprise document processing and other long-context workloads.