Google releases Gemma 4 family with 31B model, 256K context, multimodal capabilities
Google DeepMind released the Gemma 4 family of open-weights models ranging from 2.3B to 31B parameters, featuring up to 256K token context windows and native support for text, image, video, and audio inputs. The flagship 31B model scores 85.2% on MMLU Pro and 89.2% on AIME 2026, with a smaller 26B MoE variant requiring only 3.8B active parameters for faster inference.
Google DeepMind launched Gemma 4, a family of open-weights models ranging from 2.3B to 31B parameters, introducing multimodal capabilities including text, image, video, and audio processing alongside native reasoning modes.
Model Sizes and Architecture
The release includes four model variants:
- E2B: 2.3B effective parameters (5.1B with embeddings), 128K context
- E4B: 4.5B effective parameters (8B with embeddings), 128K context
- 26B A4B: 25.2B total parameters with 3.8B active (MoE), 256K context
- 31B: 30.7B parameters, 256K context
All models employ a hybrid attention mechanism combining local sliding window attention with full global attention. The architecture uses Per-Layer Embeddings (PLE) in smaller models to optimize on-device deployment, while the 26B variant uses Mixture-of-Experts with 8 active experts from 128 total.
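The "8 active experts from 128 total" figure can be illustrated with a generic top-k gating sketch. The article does not describe Gemma 4's actual router, so everything below (softmax over the selected logits only, random logits, the `route_token` name) is an assumption chosen for illustration, not the model's implementation.

```python
import math
import random

NUM_EXPERTS = 128   # total experts, per the article
TOP_K = 8           # experts activated per token, per the article


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for the top_k highest-scoring experts.

    Generic top-k gating; Gemma 4's real routing rule is not given in the article.
    """
    # Indices of the top_k largest router logits.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i],
                 reverse=True)[:top_k]
    # Softmax over the selected logits only, so the weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in router output for one token (random values, purely illustrative).
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route_token(logits)
print(len(selected), "of", NUM_EXPERTS, "experts active")
```

Only the selected experts' feed-forward blocks run for that token, which is why the active parameter count (3.8B) is far below the total (25.2B).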
Capabilities and Features
Gemma 4 models support:
- Multimodal Input: Text, images with variable aspect ratios and resolutions (all models), video frame processing, and native audio for E2B/E4B
- Reasoning Modes: Configurable thinking modes enabling step-by-step reasoning before generation
- Extended Context: 128K tokens for E2B/E4B, 256K for larger models
- Function Calling: Native structured tool use for agentic workflows
- Multilingual Support: 140+ languages in pre-training, 35+ in production
- Audio Processing: Automatic speech recognition (ASR) and speech translation on E2B and E4B only
- System Prompt Support: Native support for system role in conversations
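The context and audio constraints in the list above can be turned into a small selection helper. This is a sketch under stated assumptions: "128K" and "256K" are read as 128 × 1024 and 256 × 1024 tokens, and the `candidate_variants` function and its parameters are hypothetical names, not part of any Gemma tooling.

```python
# Context windows as listed in the article. Reading "128K"/"256K" as
# multiples of 1024 tokens is an assumption.
CONTEXT_WINDOW = {
    "E2B": 128 * 1024,
    "E4B": 128 * 1024,
    "26B A4B": 256 * 1024,
    "31B": 256 * 1024,
}

# Per the article, native audio input is limited to the two small variants.
AUDIO_VARIANTS = {"E2B", "E4B"}


def candidate_variants(prompt_tokens, reserved_output_tokens=0, needs_audio=False):
    """List variants whose context window fits the request (hypothetical helper)."""
    needed = prompt_tokens + reserved_output_tokens
    return [
        name
        for name, window in CONTEXT_WINDOW.items()
        if needed <= window and (not needs_audio or name in AUDIO_VARIANTS)
    ]


print(candidate_variants(200_000))                    # long text-only prompt
print(candidate_variants(200_000, needs_audio=True))  # long prompt that needs audio
```

The second call returns an empty list, which reflects a trade-off in the lineup as described: a workload that needs both audio input and more than 128K tokens of context has no single matching variant.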
Benchmark Performance
The 31B model achieves:
- MMLU Pro: 85.2%
- AIME 2026 (no tools): 89.2%
- LiveCodeBench v6: 80.0%
- Codeforces ELO: 2150
- GPQA Diamond: 84.3%
- Vision MMMU Pro: 76.9%
- MATH-Vision: 85.6%
The 26B A4B MoE variant scores 82.6% on MMLU Pro and 88.3% on AIME 2026 while requiring significantly less compute per token due to sparse activation. The smaller E4B and E2B models, which target on-device deployment, score 69.4% and 60.0% on MMLU Pro, respectively.
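The gap between the dense and MoE models can be worked out directly from the figures quoted above. The snippet below uses only numbers from the article; treating the ratio of active to total parameters as a proxy for per-token compute is a simplifying assumption.

```python
# Scores quoted in the article (percent).
scores = {
    "31B": {"MMLU Pro": 85.2, "AIME 2026": 89.2},
    "26B A4B": {"MMLU Pro": 82.6, "AIME 2026": 88.3},
}

# Percentage-point gap of the MoE variant behind the dense 31B model.
gaps = {
    bench: round(scores["31B"][bench] - scores["26B A4B"][bench], 1)
    for bench in scores["31B"]
}

# Share of the MoE model's own parameters that are active per token.
active_fraction = 3.8 / 25.2

print(gaps)                       # {'MMLU Pro': 2.6, 'AIME 2026': 0.9}
print(f"{active_fraction:.1%}")   # 15.1%
```

On these two benchmarks the MoE variant trails the dense model by 2.6 and 0.9 percentage points while activating roughly 15% of its weights per token.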
Deployment and Licensing
All models are available under the Apache 2.0 license through Hugging Face. The size range targets deployment scenarios from mobile and edge devices (E2B/E4B) to consumer GPUs, workstations, and servers (26B/31B). Models can be loaded with the latest version of the Hugging Face Transformers library using single-line calls to AutoProcessor and AutoModelForCausalLM.
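A minimal loading sketch using the two classes named above might look like the following. The repository name is a placeholder assumption (the article does not give exact Hugging Face model IDs), and `load_gemma` is a hypothetical wrapper; only `AutoProcessor.from_pretrained` and `AutoModelForCausalLM.from_pretrained` are real Transformers calls.

```python
# Placeholder repository name: an assumption, not confirmed by the article.
# Check the Hugging Face hub for the actual identifier before use.
MODEL_ID = "google/gemma-4-31B-it"


def load_gemma(model_id=MODEL_ID):
    """Load the processor and model for a checkpoint from Hugging Face."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    # The processor bundles the tokenizer with any image/audio preprocessing.
    processor = AutoProcessor.from_pretrained(model_id)
    # device_map="auto" spreads weights across available devices and
    # requires the accelerate package.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    return processor, model
```

Calling `load_gemma()` downloads the weights on first use, so the smaller variants are a more practical starting point on a laptop.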
Google emphasizes efficient on-device execution for smaller variants, with E2B and E4B specifically optimized for laptops and phones. Vision encoder parameters total ~150M (E2B/E4B) and ~550M (larger models), while audio encoders add ~300M parameters to smaller variants.
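To relate the parameter counts to hardware, a rough weight-only memory estimate multiplies parameters by bytes per parameter. This is a back-of-the-envelope sketch: it uses the total counts listed in the article, ignores the KV cache and activations, and assumes common precisions (bf16, int8, int4) that the article does not mention. Whether the listed totals already include the vision and audio encoders is not stated, so they are left out here.

```python
# Total parameter counts from the article, in billions.
TOTAL_PARAMS_BILLIONS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}

# Bytes per parameter for some common precisions (assumed, not from the article).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(variant, precision):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    return TOTAL_PARAMS_BILLIONS[variant] * BYTES_PER_PARAM[precision]


for name in TOTAL_PARAMS_BILLIONS:
    row = ", ".join(f"{p}: {weight_gb(name, p):.1f} GB" for p in BYTES_PER_PARAM)
    print(f"{name:8s} {row}")
```

Note that for the MoE variant all 25.2B parameters generally need to be resident in memory even though only 3.8B are active per token, so sparse activation reduces compute more than it reduces memory.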
What This Means
Gemma 4 represents Google's commitment to open-weights multimodal models across the size spectrum. The MoE variant offers a compelling middle ground: it comes within about three points of the dense 31B model on MMLU Pro and within one point on AIME 2026 while activating only 3.8B parameters per token. For on-device deployment, E2B and E4B with native audio support fill a gap between pure language models and larger multimodal systems. Benchmark improvements in coding (Codeforces ELO 2150 vs. Gemma 3's 110) and reasoning tasks position these models as competitive with closed-source alternatives, though pricing and hardware requirements differ significantly from API-based competitors.