xLLM: Open-source inference framework claims 2.2x vLLM throughput on Ascend accelerators
Researchers have published a technical report introducing xLLM, an open-source Large Language Model (LLM) inference framework designed for enterprise-scale serving. The report claims up to 2.2x higher throughput than vLLM-Ascend when serving Qwen-series models under identical latency constraints, achieved through a decoupled architecture that separates service scheduling from engine optimization.
Architecture and Design
xLLM separates inference into two layers: a service layer and an engine layer. At the service level, xLLM-Service manages request scheduling and workload orchestration through several mechanisms:
- Intelligent scheduling module: Processes multimodal requests and co-locates online and offline tasks via unified elastic scheduling to maximize cluster utilization
- Adaptive disaggregation policies: Implements workload-adaptive Prefill-Decode (PD) disaggregation and a novel Encode-Prefill-Decode (EPD) disaggregation policy for multimodal inputs
- Distributed KV Cache management: Global cache coordination and fault tolerance for high availability
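The report does not include code, but the core idea behind Prefill-Decode (PD) disaggregation can be sketched: prompt processing (prefill) and token generation (decode) run on separate worker pools, with the KV cache handed off between them so each pool can be sized and batched independently. A minimal illustration, with all class and method names hypothetical rather than taken from xLLM:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    kv_cache: dict = field(default_factory=dict)  # populated during prefill

class PDScheduler:
    """Toy Prefill-Decode disaggregation: requests first enter a prefill
    queue; once their KV cache is built, they are handed off to a decode
    queue served by a separate worker pool."""

    def __init__(self):
        self.prefill_queue = deque()
        self.decode_queue = deque()

    def submit(self, req: Request):
        self.prefill_queue.append(req)

    def run_prefill_step(self):
        if self.prefill_queue:
            req = self.prefill_queue.popleft()
            # Stand-in for one forward pass over the full prompt.
            req.kv_cache = {"layers": req.prompt_tokens}
            self.decode_queue.append(req)  # KV cache handed to decode pool

    def run_decode_step(self):
        # Decode workers batch all waiting requests for one token each.
        return [req.req_id for req in self.decode_queue]

sched = PDScheduler()
sched.submit(Request(req_id=1, prompt_tokens=128))
sched.run_prefill_step()
print(sched.run_decode_step())  # → [1]
```

A workload-adaptive policy, as described in the report, would additionally decide at runtime how many workers serve each queue; the EPD variant adds a third stage for multimodal encoding ahead of prefill.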
At the engine layer, xLLM-Engine co-optimizes system and algorithmic components:
- Multi-layer execution pipeline optimizations
- Adaptive graph execution mode
- xTensor memory management system
- Optimized speculative decoding
- Dynamic EPLB (Expert Parallelism Load Balancing)
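Of the engine-layer techniques listed, speculative decoding is the most widely documented in general: a cheap draft model proposes several tokens, and the target model verifies them in a single pass, so accepted tokens cost far less than one full forward pass each. The sketch below shows the generic draft-then-verify loop with toy stand-in models, not xLLM's actual implementation:

```python
def draft_model(prefix, k):
    # Toy draft: proposes the next k tokens (stand-in for a small LM).
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model_next(prefix):
    # Toy target: the "true" next token (stand-in for the large LM).
    return (prefix[-1] + 1) % 100

def speculative_decode(prefix, steps=8, k=4):
    """Generic draft-then-verify loop: accept drafted tokens while they
    match the target model's choice; on the first mismatch, emit the
    target's token instead and draft again from the corrected prefix."""
    out = list(prefix)
    while len(out) < len(prefix) + steps:
        proposal = draft_model(out, k)
        for tok in proposal:
            expected = target_model_next(out)
            if tok == expected:
                out.append(tok)       # accepted: draft agreed with target
            else:
                out.append(expected)  # rejected: fall back to target token
                break
            if len(out) >= len(prefix) + steps:
                break
    return out[len(prefix):]

print(speculative_decode([1], steps=4))  # → [2, 3, 4, 5]
```

In a real engine the verification of all k drafted tokens happens in one batched forward pass of the target model, which is where the speedup comes from.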
Performance Claims
According to the technical report, xLLM achieves the following throughput improvements under identical time-per-output-token (TPOT) constraints:
- 1.7x throughput compared to MindIE with Qwen-series models
- 2.2x throughput compared to vLLM-Ascend with Qwen-series models
- 1.7x throughput on average compared to MindIE with DeepSeek-series models
The reported benchmarks target Huawei's Ascend accelerators, though the framework is designed to support diverse hardware platforms. The authors emphasize resource efficiency alongside throughput improvements.
Availability
xLLM is available as open-source software on GitHub under two repositories: the core framework (xllm) and the service layer (xllm-service), both maintained under JD.com's open-source organization.
What This Means
xLLM represents an engineering-focused approach to LLM inference optimization that explicitly targets enterprise deployment constraints. The 2.2x throughput gain over vLLM-Ascend, if independently verified, would address a significant gap in the Ascend hardware ecosystem. The framework's separation of service scheduling from engine optimization could provide a replicable pattern for other inference systems, though real-world performance will depend heavily on specific hardware configurations and workload characteristics. The open-source release enables broader evaluation and potential adoption in production environments.