LLM News

Every LLM release, update, and milestone.

research · NVIDIA

POET-X reduces LLM training memory by 40%, enables billion-parameter models on single H100

Researchers introduce POET-X, a memory-efficient variant of the Reparameterized Orthogonal Equivalence Training (POET) framework that reduces the computory and computational overhead of LLM training. The method enables pretraining of billion-parameter models on a single NVIDIA H100 GPU, a setting where standard optimizers such as AdamW run out of memory.
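The core idea behind orthogonal equivalence training is to freeze a randomly initialized weight matrix and learn orthogonal rotations applied to either side of it, which preserves the frozen matrix's singular-value spectrum throughout training. A minimal NumPy sketch of that reparameterization (the Cayley parameterization and all names here are illustrative assumptions, not POET-X's exact construction):

```python
import numpy as np

def cayley(A):
    """Map a skew-symmetric matrix A to an orthogonal matrix via the
    Cayley transform Q = (I - A) @ inv(I + A)."""
    I = np.eye(A.shape[0])
    return (I - A) @ np.linalg.inv(I + A)

def skew(P):
    """Build a skew-symmetric matrix from an unconstrained parameter P."""
    return P - P.T

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W0 = rng.standard_normal((d_out, d_in))  # frozen, randomly initialized weight

# Trainable parameters: two unconstrained matrices that generate
# orthogonal rotations R (output side) and Q (input side).
P_r = 0.1 * rng.standard_normal((d_out, d_out))
P_q = 0.1 * rng.standard_normal((d_in, d_in))
R = cayley(skew(P_r))
Q = cayley(skew(P_q))

# Effective weight: the frozen matrix rotated on both sides.
W = R @ W0 @ Q

# Orthogonal equivalence: W has the same singular values as W0.
spectrum_preserved = np.allclose(np.linalg.svd(W, compute_uv=False),
                                 np.linalg.svd(W0, compute_uv=False))
print(spectrum_preserved)  # True
```

Only `P_r` and `P_q` receive gradients; memory-efficient variants further shrink the optimizer state by restricting or factorizing these rotation parameters.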

research

ButterflyMoE achieves 150× memory reduction for mixture-of-experts models via geometric rotations

Researchers introduce ButterflyMoE, a technique that replaces independent expert weight matrices with learned geometric rotations applied to a shared quantized substrate. The method reduces memory scaling from linear to sub-linear in the number of experts, achieving 150× compression at 256 experts with negligible accuracy loss on language modeling tasks.
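The sub-linear scaling follows from storing one shared substrate plus a cheap structured rotation per expert, rather than a full weight matrix per expert. A NumPy sketch of an FFT-style butterfly rotation and the resulting parameter count (the factorization and all sizes are hypothetical, for illustration only, and do not reproduce the paper's exact design):

```python
import numpy as np

def butterfly_apply(x, angles):
    """Apply a butterfly orthogonal transform to a vector x whose length is a
    power of 2. angles[l] holds n/2 rotation angles for stage l; stage l
    rotates pairs of entries at stride 2**l, like an FFT butterfly."""
    x = x.copy()
    n = x.shape[0]
    for l in range(int(np.log2(n))):
        stride = 2 ** l
        idx = 0
        for start in range(0, n, 2 * stride):
            for j in range(start, start + stride):
                a, b = x[j], x[j + stride]
                c, s = np.cos(angles[l][idx]), np.sin(angles[l][idx])
                x[j], x[j + stride] = c * a - s * b, s * a + c * b
                idx += 1
    return x

rng = np.random.default_rng(1)
n = 8
angles = rng.uniform(-np.pi, np.pi, size=(int(np.log2(n)), n // 2))
x = rng.standard_normal(n)
y = butterfly_apply(x, angles)
print(np.isclose(np.linalg.norm(x), np.linalg.norm(y)))  # True: rotations preserve norm

# Illustrative parameter count (hypothetical sizes, not the paper's numbers):
E, d = 256, 4096
naive = E * d * d                                # independent expert matrices
shared = d * d + E * (d // 2) * int(np.log2(d))  # one substrate + per-expert angles
print(f"{naive / shared:.0f}x smaller")          # same order as the reported 150x
```

Each expert then amounts to its own small set of angles applied around the shared (possibly quantized) substrate, so adding an expert costs O(d log d) parameters instead of O(d²).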