cross-layer-attention
1 article tagged with cross-layer-attention
May 16, 2026
research
Gemma 4, DeepSeek V4, and ZAYA1 Deploy KV Cache Compression to Cut Long-Context Memory Costs
Recent open-weight LLM releases from Google, DeepSeek, and others are adopting architectural techniques that reduce KV cache size by approximately 50% at long contexts. These include cross-layer KV sharing in Gemma 4, which saves 2.7 GB at 128K context for the E2B model, and compressed convolutional attention in ZAYA1-8B.
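To make the headline claim concrete, here is a minimal back-of-the-envelope sketch of how cross-layer KV sharing roughly halves KV cache memory: when pairs of adjacent layers share one KV cache, only half the layers need to store keys and values. The function name `kv_cache_bytes`, the `share_factor` parameter, and the example configuration are all illustrative assumptions, not the actual Gemma 4 E2B or ZAYA1-8B settings.

```python
# Illustrative KV-cache sizing sketch; the config below is hypothetical,
# not the real parameters of any model mentioned in the article.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, share_factor=1):
    """Total KV cache size in bytes.

    share_factor=1 -> every layer stores its own K/V.
    share_factor=2 -> pairs of adjacent layers share one K/V cache
    (cross-layer KV sharing), so only half the layers hold a cache.
    """
    layers_with_cache = num_layers // share_factor
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers_with_cache * num_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical 128K-context configuration for illustration only.
baseline = kv_cache_bytes(num_layers=30, num_kv_heads=8, head_dim=256,
                          context_len=128_000, share_factor=1)
shared = kv_cache_bytes(num_layers=30, num_kv_heads=8, head_dim=256,
                        context_len=128_000, share_factor=2)
print(f"baseline: {baseline / 2**30:.1f} GiB, cross-layer shared: {shared / 2**30:.1f} GiB")
```

With a share factor of 2, the cache comes out at roughly half the baseline, which is the same order of reduction the article attributes to these architectural changes.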