LLM News

Every LLM release, update, and milestone.

research

SureLock cuts masked diffusion language model decoding compute by 30-50%

Researchers propose SureLock, a technique that reduces decoding FLOPs for masked diffusion language models by 30-50% on LLaDA-8B by skipping attention and feed-forward computation for tokens whose predictions have converged. The method caches key-value pairs for locked positions while continuing to compute only for unlocked tokens, reducing per-iteration complexity from O(N²d) to O(MNd) (N sequence length, M unlocked tokens, d model dimension).
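The mechanism described above can be illustrated with a minimal sketch. This is not the authors' code: the convergence criterion, function names, and the stability threshold are all hypothetical, and the sketch only models the control flow (lock converged positions, reuse their cached result, recompute only unlocked ones), not the actual attention/KV arithmetic.

```python
# Illustrative sketch of convergence-based locking in iterative masked
# decoding (hypothetical names; not SureLock's actual implementation).
# A position "locks" once its prediction is stable for `stable_rounds`
# iterations; locked positions reuse a cached result and skip recompute.

def decode_with_locking(positions, predict, max_iters=10, stable_rounds=2):
    """predict(pos, it) -> prediction for position `pos` at iteration `it`."""
    cache = {}                        # pos -> cached (locked) prediction
    stable = {p: 0 for p in positions}
    last = {p: None for p in positions}
    compute_count = 0                 # stands in for attention/FFN FLOPs

    for it in range(max_iters):
        unlocked = [p for p in positions if p not in cache]
        if not unlocked:
            break                     # everything converged early
        for p in unlocked:
            pred = predict(p, it)     # expensive compute happens only here
            compute_count += 1
            if pred == last[p]:
                stable[p] += 1
                if stable[p] >= stable_rounds:
                    cache[p] = pred   # lock: reuse this value from now on
            else:
                stable[p] = 0
            last[p] = pred

    final = {p: cache.get(p, last[p]) for p in positions}
    return final, compute_count

# Toy predictor: position p settles on value p after p iterations.
preds, cost = decode_with_locking(range(4), lambda p, it: min(it, p))
```

With 4 positions and 10 iterations, naive decoding would do 40 per-position computes; here positions lock as they converge, so the count is substantially lower.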

research

xLLM: Open-source inference framework claims 2.2x vLLM throughput on Ascend accelerators

Researchers have released xLLM, an open-source large language model inference framework designed for enterprise-scale serving. The framework claims up to 2.2x higher throughput than vLLM-Ascend when serving Qwen-series models under identical latency constraints, attributing the gain to a decoupled architecture that separates service-level scheduling from engine-level optimization.
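The decoupling claim can be made concrete with a generic sketch. This is not xLLM's actual API; the class names, the token-budget batching policy, and the interface between the two halves are assumptions used only to show the separation of concerns: the scheduler owns admission and batch formation, while the engine only executes batches it is handed.

```python
# Hypothetical sketch of scheduler/engine decoupling (not xLLM's real API).
# The scheduler encodes service policy; the engine is policy-free.

from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int

class Scheduler:
    """Service side: admission control and batch formation under a token budget."""
    def __init__(self, max_batch_tokens):
        self.queue = deque()
        self.max_batch_tokens = max_batch_tokens

    def submit(self, req):
        self.queue.append(req)

    def next_batch(self):
        batch, tokens = [], 0
        while self.queue and tokens + self.queue[0].prompt_len <= self.max_batch_tokens:
            req = self.queue.popleft()
            batch.append(req)
            tokens += req.prompt_len
        return batch

class Engine:
    """Engine side: executes a batch; knows nothing about queuing policy."""
    def run(self, batch):
        return {r.rid: f"output-{r.rid}" for r in batch}

sched = Scheduler(max_batch_tokens=64)
for i, n in enumerate([30, 20, 40]):
    sched.submit(Request(i, n))
engine = Engine()
first = engine.run(sched.next_batch())  # requests 0 and 1 fit (30 + 20 <= 64)
```

Because the two halves meet only at `next_batch()`/`run()`, either side can be swapped or tuned independently, which is the kind of separation the framework's decoupled design describes.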

via arxiv.org