model releaseGoogle DeepMind

Amazon Bedrock adds Gemma 4 models with 256K context and built-in reasoning mode

TL;DR

Amazon Web Services today announced availability of Google DeepMind's Gemma 4 family on Amazon Bedrock. The open-weight models include three instruction-tuned variants spanning 2.3B to 30.7B parameters, with 256K context windows, multimodal input support, and built-in reasoning mode.

2 min read
0

Amazon Web Services today announced availability of Google DeepMind's Gemma 4 family on Amazon Bedrock. The open-weight models include three instruction-tuned variants spanning 2.3B to 30.7B parameters, with 256K context windows, multimodal input support, and built-in reasoning mode.

Three model variants

The Gemma 4 family includes:

  • Gemma 4 31B: Dense architecture with 30.7B parameters, 256K context window
  • Gemma 4 26B-A4B: Mixture-of-experts design with 25.2B total parameters but only 3.8B active per token, 256K context window
  • Gemma 4 E2B: Compact model with 5.1B total parameters (2.3B effective), 128K context window

All three variants support text and image input, native function calling, and over 35 languages. According to AWS, independent benchmarks from Artificial Analysis report an Intelligence Index of 39 for Gemma 4 31B, compared to a median of 15 in the 4B-40B open-weights class.

Technical architecture

The models use hybrid attention that interleaves local and global attention to maintain long context support while reducing memory footprint. The 26B-A4B variant activates only 3.8B parameters per token despite having 25.2B total, delivering what AWS describes as "4B-class cost and latency with the knowledge capacity of a larger model."

The E2B variant uses Per-Layer Embeddings (PLE) to keep its effective parameter count at 2.3B of 5.1B total parameters.

Built-in reasoning mode

All Gemma 4 variants include a built-in reasoning mode that, when enabled, emits the model's internal thought process before producing the final answer. AWS documentation notes that in multi-turn conversations, only final answers from previous turns should be sent back to the model, not their reasoning items, as "replaying prior reasoning back to the model can degrade its responses."

Service access

The models are accessed through Amazon Bedrock's bedrock-mantle endpoint, which uses an OpenAI-compatible API. The endpoint URL is https://bedrock-mantle.{region}.api.aws/openai/v1 and supports both Chat Completions and Responses APIs.

All three variants are available in Standard, Priority, and Flex service tiers. AWS states that prompts and completions are not used to train any models and content is not shared with third parties.

The models are released under the Apache 2.0 license, allowing independent evaluation of model architecture and training methodology.

What this means

Gemma 4's availability on Bedrock gives enterprises access to competitive open-weight models through AWS infrastructure without managing inference stacks. The MoE variant's 3.8B active parameters at 25.2B total capacity represents a meaningful efficiency gain for high-throughput workloads. The 256K context window matches or exceeds most competing models, though pricing details were not disclosed in the announcement, making direct cost comparisons premature.

Related Articles

model release

Google DeepMind releases DiffusionGemma, a 26B parameter model generating 15-20 tokens per forward pass via discrete dif

Google DeepMind released DiffusionGemma, a 26B parameter mixture-of-experts model that generates text using discrete diffusion instead of autoregression. The model processes blocks of 256 tokens in parallel, achieving generation speeds exceeding 1100 tokens per second on H100 GPUs in low-batch settings.

model release

Moonshot AI releases Kimi K2.7 Code with 1T parameters, 256K context window, 30% lower thinking token usage

Moonshot AI has released Kimi K2.7 Code, a 1 trillion parameter Mixture-of-Experts model designed for long-horizon coding tasks. The model features a 256K context window and reduces thinking token usage by approximately 30% compared to its predecessor K2.6.

model release

Google DeepMind releases Gemma 4 12B: encoder-free multimodal model runs on 16GB RAM

Google DeepMind has released Gemma 4 12B, a 12-billion parameter multimodal model that runs locally on laptops with 16GB of RAM. The model eliminates separate vision and audio encoders, processing raw inputs directly through its language model backbone under an Apache 2.0 license.

model release

MiniMax Releases M3: 428B-Parameter Multimodal Model with 1M Context Window and 15× Decode Speedup

MiniMax has released M3, a multimodal model with approximately 428 billion parameters and 23 billion activated parameters. The model supports a 1 million token context window and uses MiniMax Sparse Attention to achieve 9× prefill and 15× decode speedups compared to its predecessor M2.

Comments

Loading...