Gemma 4, DeepSeek V4, and ZAYA1 Deploy KV Cache Compression to Cut Long-Context Memory Costs
Recent open-weight LLM releases from Google, DeepSeek, and others are adopting architectural techniques that reduce KV cache size by approximately 50% at long contexts. These include cross-layer KV sharing in Gemma 4, which saves 2.7 GB at 128K context for the E2B model, and compressed convolutional attention in ZAYA1-8B.
Multiple open-weight LLM releases in April and May 2026 have adopted architectural techniques specifically designed to reduce KV cache size and memory traffic at long contexts, according to a technical analysis by Sebastian Raschka.
Cross-Layer KV Sharing in Gemma 4
Google's Gemma 4 suite, released in early April, implements cross-layer KV sharing in its E2B and E4B variants. Instead of computing separate key-value projections in each transformer layer, later layers reuse KV tensors from earlier layers while still computing their own query projections.
The Gemma 4 E2B model has 35 transformer layers, but only 15 compute their own KV projections; the final 20 layers reuse KV tensors from earlier layers. According to Raschka's calculations, this saves approximately 2.7 GB of memory at 128K context length (bfloat16 precision). The E4B variant, with 42 layers (24 computing KV, 18 sharing), saves approximately 6 GB at the same context length.
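The 2.7 GB figure can be sanity-checked with the usual KV cache size formula. The sketch below is a rough estimate, not Raschka's actual calculation. It assumes a head dimension of 256 (not stated in this article), a single KV head (E2B uses MQA), and a 131,072-token context, and it ignores the fact that sliding-window layers cache fewer tokens than full-attention layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_layers layers.

    bytes_per_value=2 corresponds to bfloat16.
    """
    # Factor 2: each layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


CONTEXT = 128 * 1024  # 128K tokens
HEAD_DIM = 256        # ASSUMPTION: not given in the article
N_KV_HEADS = 1        # MQA: a single KV head

total_layers = 35     # Gemma 4 E2B, per the article
kv_layers = 15        # layers that compute their own KV
shared_layers = total_layers - kv_layers

without_sharing = kv_cache_bytes(total_layers, N_KV_HEADS, HEAD_DIM, CONTEXT)
with_sharing = kv_cache_bytes(kv_layers, N_KV_HEADS, HEAD_DIM, CONTEXT)
saved = without_sharing - with_sharing

print(f"saved: {saved / 1e9:.2f} GB")                    # saved: 2.68 GB
print(f"fraction saved: {saved / without_sharing:.0%}")  # fraction saved: 57%
```

Under these assumptions the estimate lands close to the reported 2.7 GB. The fraction saved depends only on the share of layers that reuse KV (20 of 35 here), while the absolute saving grows linearly with context length.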
The technique did not originate with Gemma 4. Cross-layer attention was described in Brandon et al.'s "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (NeurIPS 2024), but Gemma 4 represents the first major open-weight implementation.
Additional Architecture Techniques
Beyond Gemma 4, other recent releases have implemented complementary approaches:
- ZAYA1-8B: Uses compressed convolutional attention to reduce memory footprint
- Laguna XS.2: Implements layer-wise attention budgeting to allocate attention compute selectively
- DeepSeek V4: Combines manifold-constrained hyper-connections (mHC) with compressed attention techniques
All three models also use Grouped Query Attention (GQA), which shares each key-value head across multiple query heads and is now a standard technique for KV cache reduction.
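The head sharing in GQA comes down to an index mapping from query heads to KV heads. The following is a minimal sketch with hypothetical head counts (8 query heads, 2 KV heads); none of these numbers are taken from the models above, and the function name is illustrative.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Return the index of the KV head that a given query head reads from."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # query heads per KV head
    return q_head // group_size


N_Q_HEADS = 8  # hypothetical head count, for illustration only

# GQA with 2 KV heads: query heads 0-3 share KV head 0, heads 4-7 share KV head 1.
gqa = [kv_head_for_query_head(q, N_Q_HEADS, 2) for q in range(N_Q_HEADS)]
print(gqa)  # [0, 0, 0, 0, 1, 1, 1, 1]

# MQA is the special case with a single KV head shared by every query head.
mqa = [kv_head_for_query_head(q, N_Q_HEADS, 1) for q in range(N_Q_HEADS)]
print(mqa)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Because only the KV heads are cached, the cache shrinks by a factor of n_kv_heads / n_q_heads relative to standard multi-head attention: 4x smaller in the GQA example and 8x smaller in the MQA example.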
Model Variants and Target Use Cases
The Gemma 4 release includes three categories:
- E2B and E4B models: Optimized for mobile and embedded devices (IoT)
- 26B mixture-of-experts (MoE): Designed for efficient local inference
- 31B dense model: Optimized for maximum quality and easier fine-tuning
The E2B and E4B variants combine cross-layer KV sharing with a 4:1 pattern of regular GQA and sliding-window attention layers. E2B goes a step further and uses MQA (the single-KV-head special case of GQA) rather than multi-group GQA.
Technical Implementation Details
In cross-layer KV sharing, sliding-window attention layers share KV with previous sliding-window layers, while full-attention layers share with previous full-attention layers. Each layer still computes its own query projections, allowing distinct attention patterns while eliminating redundant KV cache storage.
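The type-matched sharing rule described above can be sketched as a small lookup that tells each layer whose KV tensors it reads. This is an illustration under stated assumptions, not Gemma 4's actual code: the 10-layer stack, the layer ordering, and the choice to reuse the most recent KV-computing layer of the same type are all hypothetical, and `kv_source_layers` is an illustrative name.

```python
def kv_source_layers(layer_types, n_kv_layers):
    """Map each layer index to the layer whose KV tensors it uses.

    The first n_kv_layers layers compute and cache their own KV. Every later
    layer reuses the KV of the most recent KV-computing layer of the same
    attention type (ASSUMPTION: the article only says "previous" layers).
    """
    sources = {}
    last_kv_layer_by_type = {}
    for idx, kind in enumerate(layer_types):
        if idx < n_kv_layers:
            sources[idx] = idx                   # computes its own KV
            last_kv_layer_by_type[kind] = idx
        elif kind in last_kv_layer_by_type:
            sources[idx] = last_kv_layer_by_type[kind]  # reuses earlier KV
        else:
            raise ValueError(f"no KV-computing layer of type {kind!r}")
    return sources


# Hypothetical 10-layer stack: four sliding-window layers, then one full-attention layer, repeated.
layer_types = (["sliding"] * 4 + ["full"]) * 2
sources = kv_source_layers(layer_types, n_kv_layers=5)

print(sources)
# {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 3, 6: 3, 7: 3, 8: 3, 9: 4}
print(len(set(sources.values())))  # 5 cached KV entries instead of 10
```

Each layer still applies its own query projection to the shared keys and values, so only the cache entries are deduplicated, not the attention computation.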
The memory savings scale with context length. At very long contexts (128K+), the KV cache becomes the dominant memory consumer, making these optimizations critical for reasoning models and agent workflows that maintain extended conversation history.
What This Means
The convergence on KV cache reduction techniques across multiple independent releases signals that long-context efficiency has become a primary architectural constraint. With reasoning models and agent workflows keeping more tokens active for longer periods, memory traffic and attention costs now outweigh pure compute. The roughly 50% KV cache reduction achieved through cross-layer sharing makes 128K+ context windows more practical on consumer hardware. Expect these techniques to become standard in future model releases, particularly for edge deployment and long-context applications.