Google DeepMind releases Gemma 4, open multimodal models with 256K context and reasoning
Google DeepMind has released Gemma 4, a family of open-weights multimodal models ranging from 2.3B to 31B parameters with support for text, images, video, and audio. The models feature context windows up to 256K tokens, built-in reasoning modes, and native function calling for agentic workflows.
The release includes both dense and Mixture-of-Experts (MoE) architectures, designed for deployment on hardware ranging from mobile phones to data center servers.
Model Sizes and Specifications
Gemma 4 offers four distinct variants:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context, text/image/audio support
- E4B: 4.5B effective parameters (8B with embeddings), 128K context, text/image/audio support
- 26B A4B (MoE): 25.2B total parameters with 3.8B active parameters, 256K context, text/image support
- 31B Dense: 30.7B parameters, 256K context, text/image support
The smaller E2B and E4B models use Per-Layer Embeddings (PLE) to reduce their effective parameter counts, enabling efficient deployment on edge devices. The 26B A4B variant uses a Mixture-of-Experts design with 128 experts in total, 8 of which are active at a time. Google claims inference speeds comparable to a 4B model while retaining 26B total capacity.
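The parameter figures above translate into rough weight-memory numbers with simple arithmetic. The sketch below is a back-of-the-envelope estimate only: the bytes-per-parameter values for bf16, 8-bit, and 4-bit storage are generic assumptions rather than figures from the release, and the estimate ignores activations, KV cache, and quantization overhead.

```python
# Rough weight-memory and active-parameter arithmetic for the Gemma 4 variants.
# Parameter counts (in billions) come from the list above; the bytes-per-parameter
# values are generic assumptions, not numbers published with the release.

TOTAL_PARAMS_B = {
    "E2B": 5.1,       # with embeddings
    "E4B": 8.0,       # with embeddings
    "26B A4B": 25.2,  # total MoE parameters
    "31B": 30.7,
}

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}  # assumed storage widths


def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * bytes_per_param


def active_fraction(active_billion: float, total_billion: float) -> float:
    """Share of an MoE model's parameters that are active at a time."""
    return active_billion / total_billion


if __name__ == "__main__":
    for name, params in TOTAL_PARAMS_B.items():
        sizes = ", ".join(
            f"{fmt}: {weight_memory_gb(params, width):.1f} GB"
            for fmt, width in BYTES_PER_PARAM.items()
        )
        print(f"{name:8s} {sizes}")
    print(f"26B A4B active parameter share: {active_fraction(3.8, 25.2):.1%}")
    print(f"26B A4B active expert share: {8 / 128:.1%}")
```

An MoE model still needs all 25.2B parameters in memory; only the per-token compute scales with the 3.8B active parameters.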
Capabilities and Architecture
All Gemma 4 models accept text and image inputs with variable aspect ratios and resolutions. The E2B and E4B models additionally include native audio support, covering automatic speech recognition and multilingual speech translation. Video understanding is available by processing a video as a sequence of frames.
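Frame sequence processing means a video has to be reduced to a list of still frames before it reaches the model. How Gemma 4's tooling chooses frames is not described here; the sketch below shows one common approach, uniform sampling under a fixed frame budget. The function name `sample_frame_indices` and the budget of 8 frames are illustrative assumptions.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames indices spread evenly across a video.

    Each index is the midpoint of one of max_frames equal-length segments,
    so samples cover the whole clip instead of clustering at the start.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    segment = total_frames / max_frames
    return [int(segment * i + segment / 2) for i in range(max_frames)]


if __name__ == "__main__":
    # A 10-second clip at 30 fps, reduced to an assumed budget of 8 frames.
    print(sample_frame_indices(total_frames=300, max_frames=8))
```

The selected frames would then be decoded and passed to the model as ordinary images, in order.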
Key features include:
- Reasoning: Configurable thinking modes enabling step-by-step reasoning before response generation
- Function Calling: Native support for structured tool use and agentic workflows
- Hybrid Attention: Combines local sliding window attention with full global attention, with Proportional RoPE optimization for memory efficiency
- Multilingual: Pre-trained on 140+ languages with out-of-the-box support for 35+
- Native System Prompt Support: Structured conversation control
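The hybrid attention item can be illustrated with attention masks. The sketch below builds a causal sliding-window mask (as a local layer would use) and a full causal mask (as a global layer would use), then counts how many query-key pairs each one scores. The sequence length and window size are arbitrary illustrative values, not Gemma 4's actual configuration, and the way the release interleaves local and global layers is not specified here.

```python
from typing import List, Optional


def causal_mask(seq_len: int, window: Optional[int] = None) -> List[List[bool]]:
    """Return mask[q][k] = True where query position q may attend to key k.

    window=None gives full (global) causal attention. An integer window
    restricts each query to itself and the previous window - 1 tokens.
    """
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            visible = k <= q  # causal: no attention to future tokens
            if window is not None:
                visible = visible and (q - k) < window
            row.append(visible)
        mask.append(row)
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the query-key pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    seq_len, window = 8, 3  # illustrative values only
    print("global pairs:", attended_pairs(causal_mask(seq_len)))
    print("local pairs: ", attended_pairs(causal_mask(seq_len, window)))
```

With a fixed window, the number of pairs a local layer scores grows linearly with sequence length, while a global layer grows quadratically. That difference is why mixing the two reduces memory use at long context.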
Benchmark Performance
The instruction-tuned models post strong scores on reasoning and coding benchmarks.
Gemma 4 31B achieves:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- MATH-Vision: 85.6%
- Long Context MRCR v2 (128K needle): 66.4%
Gemma 4 26B A4B demonstrates strong performance-to-efficiency trade-offs:
- MMLU Pro: 82.6%
- AIME 2026 (no tools): 88.3%
- LiveCodeBench v6: 77.1%
- Codeforces ELO: 1718
The smaller models trade accuracy for size: E2B scores 60.0% on MMLU Pro, below the 67.6% baseline of the much larger Gemma 3 27B.
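The gap between the two larger models is easier to read as per-benchmark differences. The sketch below only rearranges the scores listed above for the 31B and 26B A4B models; it adds no new results.

```python
# Scores from the lists above, as (Gemma 4 31B, Gemma 4 26B A4B).
SCORES = {
    "MMLU Pro": (85.2, 82.6),
    "AIME 2026 (no tools)": (89.2, 88.3),
    "LiveCodeBench v6": (80.0, 77.1),
    "Codeforces ELO": (2150, 1718),
}


def gaps(scores: dict) -> dict:
    """Return the 31B score minus the 26B A4B score for each benchmark."""
    return {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}


if __name__ == "__main__":
    for name, gap in gaps(SCORES).items():
        print(f"{name:22s} {gap:+}")
```

The percentage-based benchmarks differ by under three points, with AIME 2026 under one point, while the Codeforces rating shows a much wider gap of 432 points.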
Release Details
The models are released under the Apache 2.0 license as both pre-trained and instruction-tuned variants. Unsloth has released GGUF quantized versions optimized for local inference. The models are available through Hugging Face and are supported by the latest version of the Transformers library.
Google DeepMind emphasizes on-device deployment viability for the smaller models while positioning larger variants for consumer GPU and server deployment. The hybrid architecture and context window scaling address trade-offs between inference speed and reasoning depth for long-context tasks.
What this means
Gemma 4 represents a significant shift toward production-ready open models with genuine multimodal capabilities and reasoning support at multiple scale points. The MoE variant gives teams an option for balancing model capacity against inference latency. The release includes no pricing for cloud inference, which is expected: unlike proprietary alternatives, these are open-weights models suitable for self-hosted deployment. The 256K context window and the long-context benchmark results position these models competitively against closed commercial alternatives for document analysis and extended reasoning tasks.