New technique extends LLM context windows to 128K tokens without expensive retraining
Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.
A new research paper on arXiv proposes a practical method for extending language model context windows to 128K tokens without requiring expensive retraining on long sequences.
The framework, called SharedLLM, addresses a fundamental limitation of modern LLMs: their restricted context window, which constrains real-world applications such as document summarization, code analysis, and long-form reasoning.
How It Works
Instead of continual pre-training on long-context data—which researchers note is prohibitively expensive—the approach uses two stacked short-context LLMs derived from the same base model:
- Lower model: Acts as a compressor, reducing long inputs into compact, multi-grained representations
- Upper model: Functions as a decoder, processing these compressed representations for context-aware output
The key innovation is "self-injection"—information transfer between the stacked models occurs only at the lowest layers, avoiding lengthy forward passes and redundant cross-attention operations. A specialized tree-based data structure enables efficient encoding and query-aware retrieval of contextual information.
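The compressor/decoder split and the self-injection step can be sketched in a few lines. This is a toy illustration, not the paper's implementation: plain Python lists stand in for hidden states, average pooling stands in for the lower model's learned compression, and all names (`compress`, `decode`, `CHUNK`, `RATIO`) are illustrative assumptions.

```python
# Toy sketch of SharedLLM's two-model split (illustrative names, not the paper's API).

CHUNK, RATIO = 8, 4  # tokens per chunk, compression ratio

def compress(chunk):
    """Lower model (compressor): pool every RATIO token states into one,
    yielding a compact, shorter representation of the chunk."""
    return [sum(chunk[i:i + RATIO]) / RATIO for i in range(0, len(chunk), RATIO)]

def decode(query_states, injected):
    """Upper model (decoder): "self-injection" prepends the compressed
    context to the decoder's lowest-layer key/value stream only; higher
    layers then run an ordinary short-context forward pass."""
    return injected + query_states  # states the lowest layer attends over

# A "long" input split into 4 chunks, each compressed independently.
long_context = [float(t) for t in range(4 * CHUNK)]  # 32 token states
chunks = [long_context[i:i + CHUNK] for i in range(0, len(long_context), CHUNK)]
compressed = [v for c in chunks for v in compress(c)]

query = [float(t) for t in range(CHUNK)]             # recent tokens
kv = decode(query, compressed)
print(len(compressed), len(kv))  # 32 context tokens shrink to 8; decoder attends over 16
```

The point of restricting injection to the lowest layer is visible even in this sketch: the decoder's remaining layers never see the long input, so their cost is that of a short-context forward pass.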
Empirical Results
Despite being trained on sequences of only 8K tokens, SharedLLM generalizes to inputs exceeding 128K tokens. The method demonstrates:
- Superior or comparable performance relative to strong baselines across long-context benchmarks
- 2x inference speedup over streaming architectures
- 3x inference speedup over traditional encoder-decoder approaches
- Substantially reduced memory footprint compared to alternatives
These efficiency gains come without sacrificing accuracy, suggesting the approach addresses computational and practical constraints simultaneously.
Technical Significance
The research tackles a persistent challenge in LLM deployment: extending capability without proportional computational overhead. By leveraging hierarchical compression and selective information routing, the method avoids the computational explosion that typically accompanies longer contexts.
The tree-based retrieval mechanism for query-aware information acquisition appears novel, enabling the system to surface relevant compressed context efficiently rather than processing all information uniformly.
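The retrieval idea can be illustrated with a minimal sketch, under loose assumptions: chunk summaries are plain vectors, parents average their children to form coarser grains, similarity is a dot product, and the greedy descent stands in for whatever learned scoring the paper actually uses. All function names here are hypothetical.

```python
# Toy sketch of query-aware retrieval over a binary tree of
# multi-grained chunk summaries (illustrative, not the paper's method).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(chunk_summaries):
    """Leaves hold per-chunk summaries; each parent averages its two
    children, so higher levels are coarser-grained views of the context."""
    level = [{"vec": v, "children": [], "chunk": i}
             for i, v in enumerate(chunk_summaries)]
    while len(level) > 1:
        level = [{"vec": [(x + y) / 2 for x, y in zip(a["vec"], b["vec"])],
                  "children": [a, b]}
                 for a, b in zip(level[0::2], level[1::2])]
    return level[0]

def retrieve(root, query_vec):
    """Descend greedily toward the child whose summary best matches the
    query, surfacing one relevant chunk without scoring every chunk."""
    node = root
    while node["children"]:
        node = max(node["children"], key=lambda ch: dot(ch["vec"], query_vec))
    return node["chunk"]

# Four chunks whose summaries point along orthogonal axes.
summaries = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
query = [0, 0, 1.0, 0]                        # a query resembling chunk 2
best = retrieve(build_tree(summaries), query)
print(best)  # -> 2
```

The design point this illustrates: descent touches only O(log n) nodes rather than all n chunks, which is where the "rather than processing all information uniformly" efficiency comes from.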
What This Means
This work provides a practical path for extending existing models' context windows without retraining from scratch. If validated at production scale, such techniques could democratize access to longer-context capabilities and reduce the engineering burden of context window extension. The approach may be particularly valuable for resource-constrained deployments where retraining costs are prohibitive.