LLM News

Every LLM release, update, and milestone.

Filtered by:long-context✕ clear

research

Research: Token-wise KV cache compression cuts memory to 6% while retaining 94% performance

Researchers propose DynaKV, a post-training framework that dynamically allocates compression rates to individual tokens based on semantic importance. The method achieves 94% baseline performance while reducing KV cache to just 6% of original size on LongBench benchmarks.

March 6, 2026 · 6:08 AM2 min read

kv-cache inference-optimization model-compression

via arxiv.org ↗

research

New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.

March 6, 2026 · 6:06 AM2 min read

context-window compression llm-architecture

via arxiv.org ↗

research

MemSifter uses smaller proxy models to handle LLM memory retrieval, reducing computational overhead

Researchers introduce MemSifter, a framework that offloads memory retrieval to smaller proxy models instead of burdening the primary LLM. The approach uses outcome-driven reinforcement learning to optimize retrieval accuracy while minimizing computational overhead during inference.

March 5, 2026 · 5:54 AM2 min read

llm-research memory-retrieval reinforcement-learning

via arxiv.org ↗

benchmark

AMA-Bench reveals major gaps in LLM agent memory systems with real-world evaluation

Researchers introduce AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based autonomous agents using real-world trajectories and synthetic scaling. Existing memory systems underperform due to lack of causality and reliance on lossy similarity-based retrieval. The proposed AMA-Agent system with causality graphs and tool-augmented retrieval achieves 57.22% accuracy, outperforming baselines by 11.16 percentage points.

March 5, 2026 · 5:10 AM2 min read

benchmarks agents memory-systems

via arxiv.org ↗