New technique extends LLM context windows to 128K tokens without expensive retraining

Researchers propose a novel framework called SharedLLM that extends language model context windows from 8K to 128K tokens without costly continual pre-training. The method uses two stacked short-context models—one as a compressor, one as a decoder—with specialized tree-based information retrieval, achieving 2-3x inference speedups while maintaining competitive performance.

A new research paper on arXiv proposes a practical method for extending language model context windows to 128K tokens without requiring expensive retraining on long sequences.

The framework, called SharedLLM, addresses a fundamental limitation of modern LLMs: their restricted context window, which constrains real-world applications requiring document summarization, code analysis, and long-form reasoning.

How It Works

Instead of continual pre-training on long-context data—which researchers note is prohibitively expensive—the approach uses two stacked short-context LLMs derived from the same base model:

  • Lower model: Acts as a compressor, reducing long inputs into compact, multi-grained representations
  • Upper model: Functions as a decoder, processing these compressed representations for context-aware output
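The division of labor between the two models can be sketched in plain Python. This is a hypothetical, heavily simplified stand-in (mean-pooling in place of learned compression, list concatenation in place of attention); the actual SharedLLM components are transformer stacks derived from the same base model.

```python
# Illustrative sketch of the compressor/decoder split. All functions
# here are simplified stand-ins, not the paper's implementation.

def compress(chunk, ratio=4):
    """Lower-model stand-in: reduce a chunk of token embeddings to a
    shorter, coarser representation (here: mean-pool groups of `ratio`)."""
    return [sum(chunk[i:i + ratio]) / len(chunk[i:i + ratio])
            for i in range(0, len(chunk), ratio)]

def decode(compressed_context, recent_tokens):
    """Upper-model stand-in: condition generation on the compressed
    distant context plus the uncompressed recent window."""
    return compressed_context + recent_tokens  # placeholder for attention

# A "long" input of 32 scalars standing in for token embeddings.
long_input = list(range(32))
chunks = [long_input[i:i + 8] for i in range(0, len(long_input), 8)]

# Compress everything except the most recent chunk, then decode.
compressed = [v for chunk in chunks[:-1] for v in compress(chunk)]
output_context = decode(compressed, chunks[-1])

print(len(long_input), "->", len(compressed) + len(chunks[-1]))  # 32 -> 14
```

The point of the sketch is the asymmetry: the decoder only ever sees a short effective sequence, even though the original input was much longer.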

The key innovation is "self-injection"—information transfer between the stacked models occurs only at the lowest layers, avoiding lengthy forward passes and redundant cross-attention operations. A specialized tree-based data structure enables efficient encoding and query-aware retrieval of contextual information.

Empirical Results

Despite training on sequences of only 8K tokens, SharedLLM generalizes to inputs exceeding 128K tokens. The method demonstrates:

  • Performance that matches or exceeds strong baselines on long-context benchmarks
  • 2x inference speedup over streaming architectures
  • 3x inference speedup over traditional encoder-decoder approaches
  • Substantially reduced memory footprint compared to alternatives

The balance between efficiency gains and maintained accuracy suggests the approach addresses both computational and practical constraints simultaneously.

Technical Significance

The research tackles a persistent challenge in LLM deployment: extending capability without proportional computational overhead. By leveraging hierarchical compression and selective information routing, the method avoids the computational explosion that typically accompanies longer contexts.
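A quick back-of-the-envelope calculation shows why this matters. The numbers below are illustrative, not from the paper: self-attention computes a score for every token pair, so cost grows quadratically with sequence length, and attending over a compressed context instead of the raw input shrinks the dominant term sharply.

```python
# Illustrative arithmetic (assumed example lengths, not the paper's
# measured figures): pairwise attention scores scale quadratically.

full_len = 128_000        # raw long input, in tokens
compressed_len = 8_000    # effective window the decoder attends over

full_cost = full_len ** 2         # ~1.6e10 pairwise scores
reduced_cost = compressed_len ** 2  # ~6.4e7 pairwise scores

print(f"{full_cost / reduced_cost:.0f}x fewer attention scores")  # 256x
```

Even granting overhead for the compressor pass, a quadratic term shrinking by two orders of magnitude is consistent with the 2-3x end-to-end speedups the authors report.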

The tree-based retrieval mechanism for query-aware information acquisition appears novel, enabling the system to surface relevant compressed context efficiently rather than processing all information uniformly.

What This Means

This work provides a practical path for extending existing models' context windows without retraining from scratch. If validated at production scale, such techniques could democratize access to longer-context capabilities and reduce the engineering burden of context window extension. The approach may be particularly valuable for resource-constrained deployments where retraining costs are prohibitive.