LLM News

Every LLM release, update, and milestone.

Filtered by: inference-optimization
research

vLLM Semantic Router enables intelligent model selection across multimodal deployments

Researchers presented vLLM Semantic Router, a production-deployed routing system that selects optimal models for each query using composable signal orchestration. The framework extracts signals ranging from sub-millisecond heuristics (keyword patterns, language detection) to neural classifiers (domain, embedding similarity) and composes them through configurable Boolean rules, enabling cost-optimized, privacy-regulated, and latency-sensitive deployments across multiple providers including OpenAI, Anthropic, Google, and AWS.
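The core idea of composing cheap signals through Boolean rules can be sketched in a few lines. This is an illustrative toy, not the vLLM Semantic Router API; the signal names, rules, and target model labels are invented for the example.

```python
# Toy sketch of composable signal routing (illustrative names, not the
# actual vLLM Semantic Router interfaces).
from dataclasses import dataclass
from typing import Callable


@dataclass
class Signal:
    name: str
    fn: Callable[[str], bool]  # cheap predicate over the raw query text


# Fast heuristic signals (the sub-millisecond class mentioned above);
# neural classifiers would slot in as additional Signal entries.
signals = {
    "has_code": Signal("has_code", lambda q: "```" in q or "def " in q),
    "is_long": Signal("is_long", lambda q: len(q.split()) > 200),
    "mentions_pii": Signal("mentions_pii", lambda q: "ssn" in q.lower()),
}


def route(query: str) -> str:
    """Compose signal values through configurable Boolean rules."""
    s = {name: sig.fn(query) for name, sig in signals.items()}
    if s["mentions_pii"]:
        return "private-deployment"      # privacy-regulated path
    if s["has_code"] and not s["is_long"]:
        return "fast-code-model"         # latency-sensitive path
    if s["is_long"]:
        return "long-context-model"
    return "default-cheap-model"         # cost-optimized fallback
```

The appeal of this shape is that the expensive part (neural classifiers) only runs when a rule actually consults it, and the rule table, not the code, encodes the deployment's cost/privacy/latency policy.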

research

FlyThinker: Researchers propose parallel reasoning during generation for personalized responses

Researchers introduce FlyThinker, a framework that runs reasoning and generation concurrently rather than sequentially, addressing limitations of existing "think-then-generate" approaches in long-form personalized text generation. The method uses a separate reasoning model that generates token-level guidance in parallel with the main generation model, enabling more adaptive reasoning without sacrificing computational efficiency.
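The contrast with "think-then-generate" is a scheduling one, and can be illustrated with a toy interleaving: guidance is consumed step-by-step while decoding, instead of all reasoning finishing before any text is emitted. The function names below are stand-ins, not FlyThinker's actual interfaces.

```python
# Toy illustration of reasoning running alongside generation
# (illustrative stand-ins, not FlyThinker's real models).

def reasoning_stream(profile):
    """Stand-in reasoning model: yields token-level guidance hints."""
    for trait in profile:
        yield f"[steer:{trait}]"


def generate(prompt, profile):
    """Stand-in generator: consumes guidance as each token is produced."""
    guidance = reasoning_stream(profile)
    out = []
    for word in prompt.split():
        hint = next(guidance, "")  # guidance arrives during decoding
        out.append(word + hint)
    return " ".join(out)
```

In the sequential baseline, `reasoning_stream` would be fully drained into a plan before `generate` ran at all; running them concurrently is what lets guidance adapt as the text unfolds without adding a second full pass.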

research

New test-time training method improves LLM reasoning through self-reflection

Researchers propose TTSR, a test-time training framework where a single LLM alternates between Student and Teacher roles to improve its own reasoning. The method generates targeted variant questions based on analyzed failure patterns, showing consistent improvements across mathematical reasoning benchmarks without relying on unreliable pseudo-labels.

research

OSCAR: New RAG compression method achieves 2-5x speedup with minimal accuracy loss

Researchers have introduced OSCAR, a query-dependent compression method for Retrieval-Augmented Generation that speeds up inference 2-5x while preserving accuracy. Unlike traditional approaches, OSCAR compresses retrieved information dynamically at inference time rather than offline, eliminating storage overhead and enabling higher compression rates.
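"Query-dependent compression at inference time" can be made concrete with a toy: condense retrieved text conditioned on the query, rather than storing a pre-compressed copy. OSCAR itself uses learned compressors; the term-overlap scorer below is a deliberately simple stand-in.

```python
# Toy query-dependent compressor (term overlap stands in for OSCAR's
# learned compression).

def compress(query, passage, max_sents=2):
    """Keep only the sentences most relevant to *this* query."""
    q_terms = set(query.lower().split())
    sents = [s.strip() for s in passage.split(".") if s.strip()]
    scored = sorted(sents, key=lambda s: -len(q_terms & set(s.lower().split())))
    kept = set(scored[:max_sents])
    return ". ".join(s for s in sents if s in kept) + "."


doc = ("The Eiffel Tower is in Paris. It was completed in 1889. "
       "Paris is the capital of France. Many tourists visit yearly.")
```

Because the compressed form depends on the query, there is nothing to precompute or store per document, which is exactly the storage overhead the offline approaches pay, and the compressor can be as aggressive as the query allows.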

research

MeanFlowSE enables single-step speech enhancement by learning mean velocity fields instead of instantaneous flows

Researchers introduced MeanFlowSE, a generative speech enhancement model that eliminates the computational bottleneck of multistep inference by learning average velocity over finite intervals rather than instantaneous velocity fields. The single-step approach achieves comparable quality to multistep baselines on VoiceBank-DEMAND while requiring substantially lower computational cost and no knowledge distillation.
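The instantaneous-versus-average distinction can be stated compactly. Following the mean-flow formulation this line of work builds on (generic symbols, not necessarily MeanFlowSE's exact notation), the average velocity over an interval $[r, t]$ is defined from the instantaneous field $v$:

```latex
u(z_t, r, t) \;=\; \frac{1}{t - r} \int_{r}^{t} v(z_s, s)\, ds
```

A network trained to predict $u$ directly permits the single-step update $z_r = z_t - (t - r)\, u(z_t, r, t)$; taking $r = 0$, $t = 1$ traverses the whole trajectory in one evaluation, which is what removes the multistep inference bottleneck without distillation.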

research

LaDiR uses latent diffusion to improve LLM reasoning beyond autoregressive decoding

Researchers propose LaDiR (Latent Diffusion Reasoner), a framework that combines variational autoencoders and latent diffusion models to improve LLM reasoning. The approach encodes reasoning steps into continuous latent representations, enabling iterative refinement and parallel generation of diverse solutions beyond traditional autoregressive decoding.
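The "iterative refinement in a continuous latent" idea contrasts with one-shot autoregressive decoding, and the loop shape is simple to show. LaDiR's denoiser is learned and its latents come from a VAE; the linear interpolation below is a toy stand-in for both.

```python
# Toy iterative refinement of a continuous latent (stand-in for LaDiR's
# learned VAE + diffusion denoiser).

def denoise_step(latent, target, rate=0.5):
    """Stand-in denoiser: move the latent partway toward a clean target."""
    return [l + rate * (t - l) for l, t in zip(latent, target)]


def refine(latent, target, steps=5):
    for _ in range(steps):
        latent = denoise_step(latent, target)  # revisit the whole "thought"
    return latent
```

The contrast with autoregressive decoding is that every step revises the entire latent at once, so earlier parts of a reasoning trace can still be corrected late, and several latents can be refined in parallel to yield diverse solutions.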

research

Researchers develop pruning method that challenges attention-sink assumptions in diffusion language models

A new pruning method challenges the conventional wisdom, inherited from autoregressive LLMs, that attention-sink tokens must be preserved. Researchers demonstrate that attention sinks in diffusion language models are substantially less stable than in AR models, enabling more aggressive pruning without retraining.
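The stability observation itself is easy to operationalize: track how much attention mass a candidate sink token receives across decoding/denoising steps, and treat high variance as a license to prune. The numbers below are synthetic and the threshold is invented for illustration, not taken from the paper.

```python
# Toy sink-stability check (synthetic attention values, illustrative
# threshold; not the paper's actual criterion).

def sink_stability(attn_to_sink):
    """Mean and variance of attention mass on a sink token across steps."""
    mean = sum(attn_to_sink) / len(attn_to_sink)
    var = sum((a - mean) ** 2 for a in attn_to_sink) / len(attn_to_sink)
    return mean, var


# AR-like: sink attention stays high and steady across decoding steps.
ar_steps = [0.61, 0.60, 0.62, 0.61]
# Diffusion-like: sink attention drifts across denoising steps.
dlm_steps = [0.55, 0.30, 0.12, 0.05]
```

Under this toy criterion the AR-style sink (near-zero variance) would be preserved, while the drifting diffusion-style sink would be a pruning candidate, mirroring the asymmetry the researchers report.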

research

New pruning technique cuts diffusion language model inference costs by identifying unstable attention sinks

Researchers have identified a fundamental difference in how attention mechanisms work in diffusion language models versus traditional autoregressive LLMs, enabling a new pruning strategy that removes unstable attention sinks without retraining. The finding challenges existing pruning assumptions inherited from autoregressive models and promises better quality-efficiency trade-offs during inference.