OSCAR: New RAG compression method achieves 2-5x speedup with minimal accuracy loss

Researchers have introduced OSCAR, a query-dependent compression method for Retrieval-Augmented Generation that speeds up inference 2-5x while preserving accuracy. Unlike traditional approaches, OSCAR compresses retrieved information dynamically at inference time rather than offline, eliminating storage overhead and enabling higher compression rates.

OSCAR Brings Dynamic Compression to RAG Pipelines

A new technique called OSCAR (Online Soft Compression And Reranking) addresses a critical bottleneck in Retrieval-Augmented Generation systems: the computational cost of processing large retrieved document sets.

The Problem

RAG enhances LLMs by integrating external knowledge, improving accuracy and relevance. However, as the number of retrieved documents grows, the computational cost of processing them becomes prohibitive. Existing compression approaches have fundamental limitations:

  • Hard compression shortens retrieved texts but risks information loss
  • Soft compression maps documents to continuous embeddings, but requires offline processing and storage overhead

OSCAR's Approach

OSCAR introduces online soft compression—a query-dependent method that compresses retrieved information dynamically at inference time. This eliminates the need to store pre-computed embeddings and enables higher compression rates than offline methods.

The technique simultaneously performs reranking, further optimizing RAG pipeline efficiency. By making compression decisions based on the specific query, OSCAR achieves better context preservation than static approaches.
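The idea can be sketched in a few lines. The following is an illustrative toy, not the authors' implementation: the encoder, the pooling scheme, and the reranking score are all stand-ins, assumed here only to show the shape of a query-dependent online compressor that emits a few soft vectors per document plus a relevance score.

```python
import numpy as np

DIM, K = 64, 4  # embedding dimension; soft vectors kept per document

def embed(text: str) -> np.ndarray:
    # Stand-in encoder: deterministic pseudo-embeddings, one row per token.
    # A real system would use a trained compressor model here.
    g = np.random.default_rng(sum(map(ord, text)) % (2**32))
    return g.normal(size=(max(len(text.split()), 1), DIM))

def compress_online(query: str, doc: str, k: int = K):
    """Query-dependent soft compression (toy): pool the document's token
    embeddings into k vectors, weighting tokens by similarity to the query,
    and reuse the peak weight as a reranking score."""
    q = embed(query).mean(axis=0)            # single query vector
    toks = embed(doc)                        # token embeddings for the doc
    logits = toks @ q                        # per-token query relevance
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Partition tokens into k contiguous groups; weighted mean per group.
    groups = np.array_split(np.arange(len(toks)), k)
    vecs = np.stack([
        np.average(toks[g], axis=0, weights=weights[g])
        if len(g) else np.zeros(DIM)
        for g in groups
    ])
    return vecs, float(weights.max())

query = "what is soft compression?"
docs = [
    "soft compression maps documents to continuous embeddings",
    "hard compression shortens the retrieved text",
]
compressed = [compress_online(query, d) for d in docs]
# Rerank retrieved documents by the compressor's relevance score, then the
# k soft vectors per document (not the full token sequences) would be fed
# to the LLM, shrinking the context it must process.
ranked = sorted(zip(docs, compressed), key=lambda x: -x[1][1])
for doc, (vecs, score) in ranked:
    print(vecs.shape, round(score, 3), doc[:30])
```

Nothing is precomputed or stored: compression runs per query at inference time, which is what lets the method condition on the query and avoid the storage overhead of offline soft compression.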

Experimental Results

Across models ranging from 1B to 24B parameters:

  • Inference speedup: 2-5x faster
  • Accuracy impact: Minimal to no loss
  • Compression method: Query-dependent, inference-time processing

The research demonstrates state-of-the-art performance, suggesting OSCAR significantly improves the practical feasibility of RAG systems at scale.

Availability

Models and implementation are available through the Hugging Face Hub at the NAVER collection, enabling immediate adoption by researchers and practitioners.

What This Means

OSCAR addresses a real constraint in production RAG systems—computational cost during inference. The 2-5x speedup without accuracy loss makes RAG viable for larger-scale applications. Query-dependent compression is a meaningful shift from static offline approaches: it trades inference-time computation for better context preservation and eliminates storage requirements entirely. For teams deploying RAG at scale, this technique could substantially reduce infrastructure costs while maintaining retrieval quality.
