LLM News

Every LLM release, update, and milestone.

Filtered by: compression
research

New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.
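The compress-then-decode idea can be sketched in a few lines. This is an illustrative toy, not the SharedLLM implementation: the function names and the fixed 4:1 pooling ratio are assumptions standing in for the paper's learned compressor model and tree-based retrieval.

```python
# Toy sketch of a two-model long-context pipeline (hypothetical names,
# not the actual SharedLLM method): older context is collapsed by a
# "compressor" stage, and the "decoder" stage sees compressed summaries
# plus the most recent tokens verbatim.

def compress(chunk, ratio=4):
    """Collapse every `ratio` tokens into one summary unit (a stand-in
    for the learned short-context compressor model)."""
    return [tuple(chunk[i:i + ratio]) for i in range(0, len(chunk), ratio)]

def decoder_inputs(long_context, window=8, ratio=4):
    """Split a long context: everything before the recent window is
    compressed; the window itself is passed through unchanged."""
    older, recent = long_context[:-window], long_context[-window:]
    return compress(older, ratio) + recent

context = list(range(32))            # pretend 32-token context
inputs = decoder_inputs(context)     # 24 old tokens -> 6 summaries, + 8 recent
print(len(inputs))                   # 14 effective positions instead of 32
```

The speedup intuition is visible even in the toy: the decoder attends over far fewer effective positions than the raw context length.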

research

ByteFlow Net removes tokenizers, learns adaptive byte compression for language models

Researchers introduce ByteFlow Net, a tokenizer-free language model architecture that learns to segment raw byte streams into semantically meaningful units through compression-driven segmentation. The method adapts internal representation granularity per input, outperforming both BPE-based Transformers and previous byte-level approaches in experiments.
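A crude way to see what "compression-driven segmentation" means: start a new unit wherever the next byte is poorly predicted by what came before. The sketch below is purely illustrative and is not the ByteFlow Net architecture; it uses self-estimated bigram statistics and a hand-picked threshold as a stand-in for the paper's learned, adaptive segmenter.

```python
from collections import Counter

def segment(data: bytes, threshold=0.5):
    """Toy information-driven byte segmenter (illustrative only):
    estimate bigram transition probabilities from the stream itself,
    then open a new unit wherever the next byte is surprising - a
    crude proxy for a compression-driven segmentation objective."""
    pairs = Counter(zip(data, data[1:]))
    totals = Counter(data[:-1])
    units, cur = [], bytearray([data[0]])
    for prev, nxt in zip(data, data[1:]):
        p = pairs[(prev, nxt)] / totals[prev]
        if p < threshold:            # surprising transition -> boundary
            units.append(bytes(cur))
            cur = bytearray()
        cur.append(nxt)
    units.append(bytes(cur))
    return units

print(segment(b"aaaabaaaab"))        # -> [b'aaaa', b'baaaa', b'b']
```

Predictable runs stay merged into one unit while rare transitions open a new one, so unit granularity adapts to the input rather than being fixed by a tokenizer vocabulary.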

research

OSCAR: New RAG compression method achieves 2-5x speedup with minimal accuracy loss

Researchers introduce OSCAR, a query-dependent compression method for Retrieval-Augmented Generation that speeds up inference by 2-5x while preserving accuracy. Unlike traditional approaches, OSCAR compresses retrieved information dynamically at inference time rather than offline, eliminating storage overhead and enabling higher compression rates.
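The key contrast with offline compression can be sketched as follows. This is a deliberately simple stand-in, not OSCAR itself: it scores retrieved sentences by lexical overlap with the query at inference time, whereas the paper uses a learned compressor; the function name and `keep` fraction are assumptions.

```python
def compress_retrieved(query: str, passages: list[str], keep: float = 0.5) -> str:
    """Toy query-dependent compressor (illustrative, not OSCAR itself):
    at inference time, score each sentence of the retrieved passages by
    word overlap with the query and keep only the top fraction.
    Nothing is precomputed or stored offline."""
    q = set(query.lower().split())
    sents = [s.strip() for p in passages for s in p.split(".") if s.strip()]
    scored = sorted(sents,
                    key=lambda s: len(q & set(s.lower().split())),
                    reverse=True)
    kept = scored[:max(1, int(len(sents) * keep))]
    return ". ".join(kept) + "."

query = "context window size"
passages = [
    "The model has a context window of 128K. Training ran for a week.",
    "Compression reduces cost.",
]
print(compress_retrieved(query, passages))
```

Because compression is conditioned on the query, a different question over the same retrieved passages keeps different sentences, which is exactly what offline (query-agnostic) compression cannot do.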