LLM News

Every LLM release, update, and milestone.

Filtered by: inference-efficiency
research

New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.
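The summary above describes the architecture only at a high level. As a minimal toy sketch of the general idea, assuming a mean-pooling stand-in for the compressor and plain dot-product attention for the decoder (the chunk size, tree construction, and all function names here are illustrative assumptions, not the paper's implementation):

```python
# Toy sketch of the SharedLLM idea: a "compressor" condenses past-context
# chunks into compact summaries stored in a tree, and a "decoder" attends
# only to those compressed nodes, so per-step cost scales with the number
# of summaries rather than the raw context length.
# The mean-pooling compressor and binary tree here are stand-in assumptions.

import math

CHUNK = 8  # tokens per leaf chunk (toy value)

def compress(vectors):
    """Stand-in compressor: mean-pool a list of vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def build_tree(context):
    """Compress leaf chunks, then merge pairs bottom-up into progressively
    coarser summaries (coarser nodes can represent more distant context)."""
    level = [compress(context[i:i + CHUNK]) for i in range(0, len(context), CHUNK)]
    tree = [level]
    while len(level) > 1:
        level = [compress(level[i:i + 2]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree

def decode_step(query, tree):
    """Stand-in decoder step: softmax attention over compressed nodes only."""
    nodes = [n for lvl in tree for n in lvl]
    scores = [sum(q * k for q, k in zip(query, n)) for n in nodes]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    dim = len(query)
    return [sum(w / z * n[d] for w, n in zip(weights, nodes)) for d in range(dim)]

# 64 "tokens" of 4-dim embeddings stand in for a long context.
ctx = [[float(i % 5), 1.0, 0.0, float(i % 3)] for i in range(64)]
tree = build_tree(ctx)        # 8 leaf summaries, merged up to a single root
out = decode_step([1.0, 0.0, 0.0, 0.0], tree)
print(len(tree[0]), len(out))  # → 8 4
```

The point of the tree is that the decoder's attention runs over a logarithmically sized set of summaries instead of every past token, which is where the claimed inference speedup would come from.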

research
ByteDance

ByteDance study: reasoning models know when to stop, but sampling methods force continued thinking

A new ByteDance study finds that large reasoning models internally recognize when they have reached the correct answer, but common sampling methods prevent them from stopping: the models continue with unnecessary cross-checking and reformulation despite having already solved the problem.
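The failure mode described above can be illustrated with a toy generation loop, assuming a made-up next-token distribution in which the stop token's probability rises once the answer is reached; the probe-and-exit threshold is an illustrative assumption, not ByteDance's method:

```python
# Toy illustration: a model may assign high probability to an
# end-of-reasoning token, yet temperature sampling can keep drawing
# "continue" tokens. A confidence-threshold probe exits deterministically
# once the stop probability is high enough. All distributions here are
# fabricated stand-ins for a real reasoning model.

import math
import random

random.seed(0)
STOP = "</think>"

def fake_step_probs(step):
    """Stand-in next-token distribution: confidence in stopping grows
    after the (hypothetical) answer is reached around step 3."""
    p_stop = min(0.9, 0.1 + 0.25 * step)
    return {STOP: p_stop, "recheck": (1 - p_stop) / 2, "rephrase": (1 - p_stop) / 2}

def sample(probs, temperature=1.0):
    """Temperature sampling: even a dominant stop token can lose the draw."""
    logits = {t: math.log(p) / temperature for t, p in probs.items()}
    m = max(logits.values())
    weights = {t: math.exp(l - m) for t, l in logits.items()}
    z = sum(weights.values())
    r, acc = random.random(), 0.0
    for t, w in weights.items():
        acc += w / z
        if r <= acc:
            return t
    return t  # float-rounding fallback: last token

def generate(max_steps=20, threshold=None):
    """Return the number of reasoning steps taken. With a threshold, exit
    as soon as the stop token's probability crosses it; otherwise rely on
    sampling alone, which may keep "thinking" past the answer."""
    for step in range(max_steps):
        probs = fake_step_probs(step)
        if threshold is not None and probs[STOP] >= threshold:
            return step + 1
        if sample(probs) == STOP:
            return step + 1
    return max_steps

print(generate(threshold=0.8), generate())  # probe-and-exit vs. sampling-only
```

The probe-and-exit variant is bounded by the step where the stop probability crosses the threshold, whereas the sampling-only loop can continue rechecking even when the model already "prefers" to stop, which is the behavior the study highlights.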