New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.

A new research paper on arXiv proposes a practical method for extending language model context windows to 128K tokens without requiring expensive retraining on long sequences.

The framework, called SharedLLM, addresses a fundamental limitation of modern LLMs: their restricted context window, which constrains real-world applications requiring document summarization, code analysis, and long-form reasoning.

How It Works

Instead of continual pre-training on long-context data—which researchers note is prohibitively expensive—the approach uses two stacked short-context LLMs derived from the same base model:

  • Lower model: Acts as a compressor, reducing long inputs into compact, multi-grained representations
  • Upper model: Functions as a decoder, processing these compressed representations for context-aware output
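The division of labor between the two models can be sketched in plain Python. This is a hypothetical, heavily simplified stand-in (mean-pooling in place of learned compression, list concatenation in place of attention); the actual SharedLLM components are transformer stacks derived from the same base model.

```python
# Illustrative sketch of the compressor/decoder split. All functions
# here are simplified stand-ins, not the paper's implementation.

def compress(chunk, ratio=4):
    """Lower-model stand-in: reduce a chunk of token embeddings to a
    shorter, coarser representation (here: mean-pool groups of `ratio`)."""
    return [sum(chunk[i:i + ratio]) / len(chunk[i:i + ratio])
            for i in range(0, len(chunk), ratio)]

def decode(compressed_context, recent_tokens):
    """Upper-model stand-in: condition generation on the compressed
    distant context plus the uncompressed recent window."""
    return compressed_context + recent_tokens  # placeholder for attention

# A "long" input of 32 scalars standing in for token embeddings.
long_input = list(range(32))
chunks = [long_input[i:i + 8] for i in range(0, len(long_input), 8)]

# Compress everything except the most recent chunk, then decode.
compressed = [v for chunk in chunks[:-1] for v in compress(chunk)]
output_context = decode(compressed, chunks[-1])

print(len(long_input), "->", len(compressed) + len(chunks[-1]))  # 32 -> 14
```

The point of the sketch is the asymmetry: the decoder only ever sees a short effective sequence, even though the original input was much longer.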

The key innovation is "self-injection"—information transfer between the stacked models occurs only at the lowest layers, avoiding lengthy forward passes and redundant cross-attention operations. A specialized tree-based data structure enables efficient encoding and query-aware retrieval of contextual information.

Empirical Results

Despite training on sequences of only 8K tokens, SharedLLM generalizes to inputs exceeding 128K tokens. The method demonstrates:

  • Performance that matches or exceeds strong baselines on long-context benchmarks
  • 2x inference speedup over streaming architectures
  • 3x inference speedup over traditional encoder-decoder approaches
  • Substantially reduced memory footprint compared to alternatives

The balance between efficiency gains and maintained accuracy suggests the approach addresses both computational and practical constraints simultaneously.

Technical Significance

The research tackles a persistent challenge in LLM deployment: extending capability without proportional computational overhead. By leveraging hierarchical compression and selective information routing, the method avoids the computational explosion that typically accompanies longer contexts.
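A quick back-of-the-envelope calculation shows why this matters. The numbers below are illustrative, not from the paper: self-attention computes a score for every token pair, so cost grows quadratically with sequence length, and attending over a compressed context instead of the raw input shrinks the dominant term sharply.

```python
# Illustrative arithmetic (assumed example lengths, not the paper's
# measured figures): pairwise attention scores scale quadratically.

full_len = 128_000        # raw long input, in tokens
compressed_len = 8_000    # effective window the decoder attends over

full_cost = full_len ** 2         # ~1.6e10 pairwise scores
reduced_cost = compressed_len ** 2  # ~6.4e7 pairwise scores

print(f"{full_cost / reduced_cost:.0f}x fewer attention scores")  # 256x
```

Even granting overhead for the compressor pass, a quadratic term shrinking by two orders of magnitude is consistent with the 2-3x end-to-end speedups the authors report.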

The tree-based retrieval mechanism for query-aware information acquisition appears novel, enabling the system to surface relevant compressed context efficiently rather than processing all information uniformly.

What This Means

This work provides a practical path for extending existing models' context windows without retraining from scratch. If validated at production scale, such techniques could democratize access to longer-context capabilities and reduce the engineering burden of context window extension. The approach may be particularly valuable for resource-constrained deployments where retraining costs are prohibitive.