LLM News

Every LLM release, update, and milestone.

Filtered by: model-compression
research

1.58-bit BitNet models naturally support structured sparsity with minimal accuracy loss

Researchers have demonstrated that 1.58-bit quantized language models are naturally more compatible with semi-structured N:M sparsity than full-precision models. The Sparse-BitNet framework combines the two techniques, achieving up to 1.30× speedups in training and inference while incurring less accuracy degradation than full-precision baselines at equivalent sparsity levels.
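The two operations being combined can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the Sparse-BitNet implementation: `ternary_quantize` follows the BitNet-style absmean recipe for mapping weights to {-1, 0, +1}, and `apply_nm_sparsity` keeps the N largest-magnitude weights in every group of M (here 2:4). Function names are assumptions for illustration.

```python
import numpy as np

def ternary_quantize(w):
    """1.58-bit (ternary) quantization: map weights to {-1, 0, +1},
    scaled by the mean absolute value (BitNet-style absmean)."""
    scale = np.mean(np.abs(w)) + 1e-8
    return np.clip(np.round(w / scale), -1, 1), scale

def apply_nm_sparsity(w, n=2, m=4):
    """Semi-structured N:M sparsity: in every group of m consecutive
    weights, keep only the n largest by magnitude."""
    flat = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude weights per group
    drop = np.argsort(np.abs(flat), axis=1)[:, : m - n]
    mask = np.ones_like(flat)
    np.put_along_axis(mask, drop, 0.0, axis=1)
    return (flat * mask).reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
q, scale = ternary_quantize(w)
sparse_q = apply_nm_sparsity(q)
# every group of 4 now holds at most 2 nonzero ternary weights
assert all(np.count_nonzero(g) <= 2 for g in sparse_q.reshape(-1, 4))
```

The intuition behind the compatibility result is visible here: ternary weights already contain many exact zeros, so an N:M mask often removes weights that quantization had zeroed anyway.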

2 min read · via arxiv.org
research

ButterflyMoE achieves 150× memory reduction for mixture-of-experts models via geometric rotations

Researchers introduce ButterflyMoE, a technique that replaces independent expert weight matrices with learned geometric rotations applied to a shared quantized substrate. The method reduces memory scaling from linear to sub-linear in the number of experts, achieving 150× compression at 256 experts with negligible accuracy loss on language modeling tasks.
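The memory argument can be sketched as follows. The paper's exact butterfly parameterization is not reproduced here; as a stand-in, cheap Givens (planar) rotations play the role of the learned factors, and each expert stores only a handful of angles rather than a full weight matrix. All names and sizes are illustrative assumptions.

```python
import numpy as np

def givens_rotation(d, i, j, theta):
    """Orthogonal rotation in the (i, j) coordinate plane of R^d."""
    r = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    r[i, i] = c
    r[j, j] = c
    r[i, j] = -s
    r[j, i] = s
    return r

def expert_weight(shared, angles):
    """Hypothetical reconstruction: an expert's weight matrix is the
    shared substrate transformed by a product of planar rotations,
    so each expert adds O(#angles) parameters instead of O(d^2)."""
    d = shared.shape[0]
    w = shared
    for k, theta in enumerate(angles):
        i, j = (2 * k) % d, (2 * k + 1) % d
        w = givens_rotation(d, i, j, theta) @ w
    return w

d, n_experts, n_angles = 16, 256, 8
shared = np.random.default_rng(2).normal(size=(d, d))
# memory comparison: independent experts vs. shared substrate + angles
dense_params = n_experts * d * d            # linear in expert count
rotated_params = d * d + n_experts * n_angles  # sub-linear growth
assert dense_params / rotated_params > 25
```

Because per-expert storage shrinks from a matrix to a short angle vector, total memory grows only marginally as experts are added, which is the scaling behavior the headline 150× figure reflects.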

research

StructLens reveals hidden structural patterns across language model layers

Researchers introduce StructLens, an interpretability framework that analyzes language models by constructing maximum spanning trees from residual streams to uncover inter-layer structural relationships. The approach reveals similarity patterns distinct from conventional cosine similarity and demonstrates practical benefits for layer pruning optimization.
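The core construction, a maximum spanning tree over pairwise layer similarities, can be sketched directly. The similarity measure below (correlation of flattened activations) is an illustrative placeholder, not the metric StructLens uses; the MST step itself is standard Prim's algorithm.

```python
import numpy as np

def max_spanning_tree(sim):
    """Prim's algorithm on a dense similarity matrix: returns the
    edge list of the maximum spanning tree over layers."""
    n = sim.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = (-np.inf, None, None)
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and sim[u, v] > best[0]:
                    best = (sim[u, v], u, v)
        edges.append((best[1], best[2]))
        in_tree.append(best[2])
    return edges

# toy "residual streams": one activation matrix per layer
rng = np.random.default_rng(1)
layers = [rng.normal(size=(16, 8)) for _ in range(5)]
# illustrative layer-pair similarity: correlation of flattened activations
sim = np.array([[np.corrcoef(a.ravel(), b.ravel())[0, 1]
                 for b in layers] for a in layers])
np.fill_diagonal(sim, -np.inf)  # ignore self-edges
tree = max_spanning_tree(sim)
assert len(tree) == 4  # a spanning tree over n layers has n - 1 edges
```

Layers that end up as leaves of such a tree, weakly coupled to the rest, are natural candidates for pruning, which is the practical benefit the summary points to.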

research

SiNGER framework improves vision transformer distillation by suppressing high-norm artifacts

Researchers introduce SiNGER (Singular Nullspace-Guided Energy Reallocation), a knowledge distillation framework that improves how Vision Transformer features transfer to smaller student models. The method suppresses high-norm artifacts that degrade representation quality while preserving informative signals from teacher models.
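A minimal sketch of the general idea, flagging norm-outlier tokens and projecting out their dominant direction while leaving other tokens untouched, is below. This is an illustrative stand-in, not SiNGER's actual nullspace-guided reallocation; the threshold and SVD step are assumptions.

```python
import numpy as np

def suppress_high_norm(feats, z=2.5):
    """Illustrative artifact suppression: flag tokens whose feature
    norm is a z-score outlier, estimate the dominant direction of
    those tokens via SVD, and project it out of the outlier tokens
    only, so informative tokens pass through unchanged."""
    norms = np.linalg.norm(feats, axis=1)
    outliers = norms > norms.mean() + z * norms.std()
    if not outliers.any():
        return feats
    _, _, vt = np.linalg.svd(feats[outliers], full_matrices=False)
    d = vt[0]                      # leading artifact direction
    out = feats.copy()
    out[outliers] -= np.outer(out[outliers] @ d, d)  # remove component
    return out

rng = np.random.default_rng(3)
feats = rng.normal(size=(32, 16))   # toy teacher features (tokens x dim)
feats[5, 0] += 50.0                 # inject one high-norm artifact token
cleaned = suppress_high_norm(feats)
assert np.linalg.norm(cleaned[5]) < np.linalg.norm(feats[5])
```

The design point the sketch captures is selectivity: suppression acts only on the artifact tokens, which is how such methods avoid degrading the informative signal a student model distills from.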

research

Researchers develop pruning method that challenges attention-sink assumptions in diffusion language models

A new pruning method challenges the conventional wisdom, inherited from autoregressive LLMs, that attention-sink tokens must be preserved. Researchers demonstrate that attention sinks in diffusion language models are substantially less stable than in autoregressive models, enabling more aggressive pruning without retraining.

research

New pruning technique cuts diffusion language model inference costs by identifying unstable attention sinks

Researchers have identified a fundamental difference in how attention mechanisms work in diffusion language models versus traditional autoregressive LLMs, enabling a new pruning strategy that removes unstable attention sinks without retraining. The finding challenges existing pruning assumptions inherited from autoregressive models and promises better quality-efficiency trade-offs during inference.