LLM News

Every LLM release, update, and milestone.

Filtered by: compression
research

New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.
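The compress-then-decode idea can be sketched in a few lines. This is an illustrative toy, not the SharedLLM implementation: the function names and the fixed 4:1 pooling ratio are assumptions standing in for the paper's learned compressor model and tree-based retrieval.

```python
# Toy sketch of a two-model long-context pipeline (hypothetical names,
# not the actual SharedLLM method): older context is collapsed by a
# "compressor" stage, and the "decoder" stage sees compressed summaries
# plus the most recent tokens verbatim.

def compress(chunk, ratio=4):
    """Collapse every `ratio` tokens into one summary unit (a stand-in
    for the learned short-context compressor model)."""
    return [tuple(chunk[i:i + ratio]) for i in range(0, len(chunk), ratio)]

def decoder_inputs(long_context, window=8, ratio=4):
    """Split a long context: everything before the recent window is
    compressed; the window itself is passed through unchanged."""
    older, recent = long_context[:-window], long_context[-window:]
    return compress(older, ratio) + recent

context = list(range(32))            # pretend 32-token context
inputs = decoder_inputs(context)     # 24 old tokens -> 6 summaries, + 8 recent
print(len(inputs))                   # 14 effective positions instead of 32
```

The speedup intuition is visible even in the toy: the decoder attends over far fewer effective positions than the raw context length.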

research

ByteFlow Net removes tokenizers, learns adaptive byte compression for language models

Researchers introduce ByteFlow Net, a tokenizer-free language model architecture that learns to segment raw byte streams into semantically meaningful units through compression-driven segmentation. The method adapts internal representation granularity per input, outperforming both BPE-based Transformers and previous byte-level approaches in experiments.
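A crude way to see what "compression-driven segmentation" means: start a new unit wherever the next byte is poorly predicted by what came before. The sketch below is purely illustrative and is not the ByteFlow Net architecture; it uses self-estimated bigram statistics and a hand-picked threshold as a stand-in for the paper's learned, adaptive segmenter.

```python
from collections import Counter

def segment(data: bytes, threshold=0.5):
    """Toy information-driven byte segmenter (illustrative only):
    estimate bigram transition probabilities from the stream itself,
    then open a new unit wherever the next byte is surprising - a
    crude proxy for a compression-driven segmentation objective."""
    pairs = Counter(zip(data, data[1:]))
    totals = Counter(data[:-1])
    units, cur = [], bytearray([data[0]])
    for prev, nxt in zip(data, data[1:]):
        p = pairs[(prev, nxt)] / totals[prev]
        if p < threshold:            # surprising transition -> boundary
            units.append(bytes(cur))
            cur = bytearray()
        cur.append(nxt)
    units.append(bytes(cur))
    return units

print(segment(b"aaaabaaaab"))        # -> [b'aaaa', b'baaaa', b'b']
```

Predictable runs stay merged into one unit while rare transitions open a new one, so unit granularity adapts to the input rather than being fixed by a tokenizer vocabulary.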

research

OSCAR: New RAG compression method achieves 2-5x speedup with minimal accuracy loss

Researchers introduce OSCAR, a query-dependent compression method for Retrieval-Augmented Generation that speeds up inference by 2-5x while preserving accuracy. Unlike traditional approaches, OSCAR compresses retrieved information dynamically at inference time rather than offline, eliminating storage overhead and enabling higher compression rates.
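The key contrast with offline compression can be sketched as follows. This is a deliberately simple stand-in, not OSCAR itself: it scores retrieved sentences by lexical overlap with the query at inference time, whereas the paper uses a learned compressor; the function name and `keep` fraction are assumptions.

```python
def compress_retrieved(query: str, passages: list[str], keep: float = 0.5) -> str:
    """Toy query-dependent compressor (illustrative, not OSCAR itself):
    at inference time, score each sentence of the retrieved passages by
    word overlap with the query and keep only the top fraction.
    Nothing is precomputed or stored offline."""
    q = set(query.lower().split())
    sents = [s.strip() for p in passages for s in p.split(".") if s.strip()]
    scored = sorted(sents,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    kept = scored[:max(1, int(len(sents) * keep))]
    return ". ".join(kept) + "."

query = "context window size"
passages = [
    "The model has a context window of 128K. Training ran for a week.",
    "Compression reduces cost.",
]
print(compress_retrieved(query, passages))
```

Because compression is conditioned on the query, a different question over the same retrieved passages keeps different sentences, which is exactly what offline (query-agnostic) compression cannot do.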