xLLM: Open-source inference framework claims 2.2x vLLM throughput on Ascend accelerators
Researchers have published a technical report introducing xLLM, an open-source Large Language Model (LLM) inference framework designed for enterprise-scale serving. The report claims up to 2.2x higher throughput than vLLM-Ascend when serving Qwen-series models under identical latency constraints, achieved through a decoupled architecture that separates service scheduling from engine optimization.
Architecture and Design
xLLM separates inference into two layers: a service layer and an engine layer. At the service level, xLLM-Service manages request scheduling and workload orchestration through several mechanisms:
- Intelligent scheduling module: Processes multimodal requests and co-locates online and offline tasks via unified elastic scheduling to maximize cluster utilization
- Adaptive disaggregation policies: Implements workload-adaptive Prefill-Decode (PD) disaggregation and a novel Encode-Prefill-Decode (EPD) disaggregation policy for multimodal inputs
- Distributed KV Cache management: Global cache coordination and fault tolerance for high availability
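The report does not include code, but the core idea behind Prefill-Decode (PD) disaggregation can be sketched: prompt processing (prefill) and token generation (decode) run on separate worker pools, with the KV cache handed off between them so each pool can be sized and batched independently. A minimal illustration, with all class and method names hypothetical rather than taken from xLLM:

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    req_id: int
    prompt_tokens: int
    kv_cache: dict = field(default_factory=dict)  # populated during prefill

class PDScheduler:
    """Toy Prefill-Decode disaggregation: requests first enter a prefill
    queue; once their KV cache is built, they are handed off to a decode
    queue served by a separate worker pool."""

    def __init__(self):
        self.prefill_queue = deque()
        self.decode_queue = deque()

    def submit(self, req: Request):
        self.prefill_queue.append(req)

    def run_prefill_step(self):
        if self.prefill_queue:
            req = self.prefill_queue.popleft()
            # Stand-in for one forward pass over the full prompt.
            req.kv_cache = {"layers": req.prompt_tokens}
            self.decode_queue.append(req)  # KV cache handed to decode pool

    def run_decode_step(self):
        # Decode workers batch all waiting requests for one token each.
        return [req.req_id for req in self.decode_queue]

sched = PDScheduler()
sched.submit(Request(req_id=1, prompt_tokens=128))
sched.run_prefill_step()
print(sched.run_decode_step())  # → [1]
```

A workload-adaptive policy, as described in the report, would additionally decide at runtime how many workers serve each queue; the EPD variant adds a third stage for multimodal encoding ahead of prefill.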
At the engine layer, xLLM-Engine co-optimizes system and algorithmic components:
- Multi-layer execution pipeline optimizations
- Adaptive graph execution mode
- xTensor memory management system
- Optimized speculative decoding
- Dynamic EPLB (Expert Parallelism Load Balancing)
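Of the engine-layer techniques listed, speculative decoding is the most widely documented in general: a cheap draft model proposes several tokens, and the target model verifies them in a single pass, so accepted tokens cost far less than one full forward pass each. The sketch below shows the generic draft-then-verify loop with toy stand-in models, not xLLM's actual implementation:

```python
def draft_model(prefix, k):
    # Toy draft: proposes the next k tokens (stand-in for a small LM).
    return [(prefix[-1] + 1 + i) % 100 for i in range(k)]

def target_model_next(prefix):
    # Toy target: the "true" next token (stand-in for the large LM).
    return (prefix[-1] + 1) % 100

def speculative_decode(prefix, steps=8, k=4):
    """Generic draft-then-verify loop: accept drafted tokens while they
    match the target model's choice; on the first mismatch, emit the
    target's token instead and draft again from the corrected prefix."""
    out = list(prefix)
    while len(out) < len(prefix) + steps:
        proposal = draft_model(out, k)
        for tok in proposal:
            expected = target_model_next(out)
            if tok == expected:
                out.append(tok)       # accepted: draft agreed with target
            else:
                out.append(expected)  # rejected: fall back to target token
                break
            if len(out) >= len(prefix) + steps:
                break
    return out[len(prefix):]

print(speculative_decode([1], steps=4))  # → [2, 3, 4, 5]
```

In a real engine the verification of all k drafted tokens happens in one batched forward pass of the target model, which is where the speedup comes from.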
Performance Claims
According to the technical report, xLLM achieves the following throughput improvements under identical time-per-output-token (TPOT) constraints:
- 1.7x throughput compared to MindIE with Qwen-series models
- 2.2x throughput compared to vLLM-Ascend with Qwen-series models
- 1.7x throughput on average compared to MindIE with DeepSeek-series models
The reported benchmarks target Huawei's Ascend accelerators, though the framework is designed to support diverse hardware platforms. The authors emphasize resource efficiency alongside throughput improvements.
Availability
xLLM is available as open-source software on GitHub under two repositories: the core framework (xllm) and the service layer (xllm-service), both maintained under JD.com's open-source organization.
What This Means
xLLM represents an engineering-focused approach to LLM inference optimization that explicitly targets enterprise deployment constraints. The 2.2x throughput gain over vLLM-Ascend, if independently verified, would address a significant gap in the Ascend hardware ecosystem. The framework's separation of service scheduling from engine optimization could provide a replicable pattern for other inference systems, though real-world performance will depend heavily on specific hardware configurations and workload characteristics. The open-source release enables broader evaluation and potential adoption in production environments.