LLM News

Every LLM release, update, and milestone.

research

SureLock cuts masked diffusion language model decoding compute by 30-50%

Researchers propose SureLock, a technique that reduces decoding FLOPs for masked diffusion language models by 30-50% on LLaDA-8B by skipping attention and feed-forward computation for tokens whose predictions have converged. The method caches key-value pairs for locked positions while continuing to compute only for unlocked tokens, reducing per-iteration complexity from O(N²d) to O(MNd) (N sequence length, M unlocked tokens, d model dimension).
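The mechanism described above can be illustrated with a minimal sketch. This is not the authors' code: the convergence criterion, function names, and the stability threshold are all hypothetical, and the sketch only models the control flow (lock converged positions, reuse their cached result, recompute only unlocked ones), not the actual attention/KV arithmetic.

```python
# Illustrative sketch of convergence-based locking in iterative masked
# decoding (hypothetical names; not SureLock's actual implementation).
# A position "locks" once its prediction is stable for `stable_rounds`
# iterations; locked positions reuse a cached result and skip recompute.

def decode_with_locking(positions, predict, max_iters=10, stable_rounds=2):
    """predict(pos, it) -> prediction for position `pos` at iteration `it`."""
    cache = {}                        # pos -> cached (locked) prediction
    stable = {p: 0 for p in positions}
    last = {p: None for p in positions}
    compute_count = 0                 # stands in for attention/FFN FLOPs

    for it in range(max_iters):
        unlocked = [p for p in positions if p not in cache]
        if not unlocked:
            break                     # everything converged early
        for p in unlocked:
            pred = predict(p, it)     # expensive compute happens only here
            compute_count += 1
            if pred == last[p]:
                stable[p] += 1
                if stable[p] >= stable_rounds:
                    cache[p] = pred   # lock: reuse this value from now on
            else:
                stable[p] = 0
            last[p] = pred

    final = {p: cache.get(p, last[p]) for p in positions}
    return final, compute_count

# Toy predictor: position p settles on value p after p iterations.
preds, cost = decode_with_locking(range(4), lambda p, it: min(it, p))
```

With 4 positions and 10 iterations, naive decoding would do 40 per-position computes; here positions lock as they converge, so the count is substantially lower.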

research

xLLM: Open-source inference framework claims 2.2x vLLM throughput on Ascend accelerators

Researchers have released xLLM, an open-source large language model inference framework designed for enterprise-scale serving. The framework claims up to 2.2x higher throughput than vLLM-Ascend when serving Qwen-series models under identical latency constraints, attributing the gain to a decoupled architecture that separates service-level scheduling from engine-level optimization.
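The decoupling claim can be made concrete with a generic sketch. This is not xLLM's actual API; the class names, the token-budget batching policy, and the interface between the two halves are assumptions used only to show the separation of concerns: the scheduler owns admission and batch formation, while the engine only executes batches it is handed.

```python
# Hypothetical sketch of scheduler/engine decoupling (not xLLM's real API).
# The scheduler encodes service policy; the engine is policy-free.

from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prompt_len: int

class Scheduler:
    """Service side: admission control and batch formation under a token budget."""
    def __init__(self, max_batch_tokens):
        self.queue = deque()
        self.max_batch_tokens = max_batch_tokens

    def submit(self, req):
        self.queue.append(req)

    def next_batch(self):
        batch, tokens = [], 0
        while self.queue and tokens + self.queue[0].prompt_len <= self.max_batch_tokens:
            req = self.queue.popleft()
            batch.append(req)
            tokens += req.prompt_len
        return batch

class Engine:
    """Engine side: executes a batch; knows nothing about queuing policy."""
    def run(self, batch):
        return {r.rid: f"output-{r.rid}" for r in batch}

sched = Scheduler(max_batch_tokens=64)
for i, n in enumerate([30, 20, 40]):
    sched.submit(Request(i, n))
engine = Engine()
first = engine.run(sched.next_batch())  # requests 0 and 1 fit (30 + 20 <= 64)
```

Because the two halves meet only at `next_batch()`/`run()`, either side can be swapped or tuned independently, which is the kind of separation the framework's decoupled design describes.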

via arxiv.org