Mistral AI traces 400MB/minute memory leak in vLLM to kernel-level mmap calls outside heap
Mistral AI's engineering team documented their investigation of a memory leak in vLLM that caused 400MB/minute memory growth during disaggregated serving with Mistral Medium 3.1. The leak, which only appeared with specific conditions including graph compilation and NIXL-based KV cache transfer, was eventually traced to mmap allocations outside the traditional heap that standard profiling tools couldn't detect.
Mistral AI has published a detailed technical breakdown of their investigation into a memory leak in vLLM that caused 400MB per minute memory growth under specific production conditions. The leak only manifested during disaggregated Prefill/Decode serving with Mistral Medium 3.1 and graph compilation enabled.
The issue surfaced during pre-production testing, where it led to out-of-memory states after several hours under production-like traffic. Notably, the leak produced no crashes or errors, only steady linear memory growth that was visible solely in system monitoring.
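A leak of this kind shows up only as a slope in monitoring data, so it helps to turn RSS samples into a growth rate and a rough time-to-exhaustion. The sketch below is illustrative and not taken from Mistral's writeup: the function names and the sample format (seconds, bytes) are assumptions, and it simply fits a least-squares line through the samples. For scale, 400MB per minute is roughly 24GB per hour.

```python
from typing import List, Tuple


def growth_rate_bytes_per_sec(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of RSS over time.

    `samples` is a list of (timestamp_seconds, rss_bytes) pairs
    (a hypothetical format chosen for this sketch).
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("all samples share the same timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    return cov / var_t


def minutes_until_limit(current_rss: float, limit: float, rate: float) -> float:
    """Minutes until `limit` bytes is reached at `rate` bytes per second."""
    if rate <= 0:
        return float("inf")  # flat or shrinking: the limit is never reached
    return (limit - current_rss) / rate / 60.0
```

Feeding this one sample per minute from whatever already scrapes process memory is enough to distinguish a steady linear leak from ordinary warm-up growth.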
Isolation proved difficult
The leak appeared exclusively on the decode side of disaggregated setups using NIXL for KV cache transfer. NIXL relies on UCX (Unified Communication X), a high-performance communication library that enables optimized data transfer over technologies including InfiniBand.
Standard Python profiling tools including Memray and Guppy 3 showed no leak. Heavier tools like Valgrind were impractically slow for the vLLM setup. GDB caused the entire process to crash.
Heaptrack revealed the problem's location
The team turned to Heaptrack, a memory profiler that overrides malloc and free operations to record allocation events with stack traces. While Heaptrack confirmed heap memory remained stable, it revealed a discrepancy: peak Resident Set Size (RSS) increased between benchmark snapshots despite stable heap allocations.
This indicated the leak was occurring outside the heap, in memory regions that Heaptrack doesn't monitor. RSS includes not just the heap (grown through the brk system call, usually via the sbrk wrapper), but also the stack and anonymous memory mappings allocated directly via mmap system calls.
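On Linux, the split between total RSS and its anonymous part can be read from /proc/<pid>/status. The sketch below is a minimal illustration, not the team's tooling: the function names are invented for this example, and the RssAnon, RssFile and RssShmem fields are assumed to be present, which requires a reasonably recent kernel.

```python
from typing import Dict, Optional


def parse_status_kb(status_text: str) -> Dict[str, int]:
    """Extract every '<Key>: <number> kB' field from /proc/<pid>/status text."""
    fields: Dict[str, int] = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        parts = value.split()
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def rss_breakdown_kb(status_text: str) -> Dict[str, Optional[int]]:
    """Return total RSS and its anonymous / file-backed / shared-memory parts.

    A value is None when the running kernel does not report that field.
    """
    fields = parse_status_kb(status_text)
    names = ("VmRSS", "RssAnon", "RssFile", "RssShmem")
    return {name: fields.get(name) for name in names}


def read_status(pid: str = "self") -> str:
    """Read the raw status file for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()
```

Calling `rss_breakdown_kb(read_status())` at intervals and comparing the results shows whether growth sits in anonymous memory, which covers both the heap and direct mmap regions, or in file-backed pages.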
Why standard tools missed it
Heaptrack only hooks into glibc's malloc and free functions. Modern memory allocators often use mmap with anonymous mappings for larger blocks rather than traditional sbrk calls, because mmap offers more flexibility and supports huge pages (2MB or 1GB depending on system configuration).
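The same blind spot can be reproduced in miniature with Python's standard library. This is an analogy rather than the setup from the investigation: `tracemalloc` stands in for an allocator-level profiler because it only sees allocations made through Python's memory allocators, while `mmap.mmap(-1, size)` requests an anonymous mapping directly from the operating system. The function name and the 8MB size are arbitrary choices for the sketch.

```python
import mmap
import tracemalloc
from typing import Tuple


def traced_growth_for_anonymous_mapping(size_bytes: int = 8 * 1024 * 1024) -> Tuple[int, int]:
    """Map anonymous memory and report (traced_growth_bytes, mapped_bytes)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()

    # fileno=-1 asks for an anonymous mapping that is not backed by a file.
    region = mmap.mmap(-1, size_bytes)

    # Write one byte per page so the pages are actually touched.
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1

    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    mapped = len(region)
    region.close()
    return after - before, mapped
```

The traced growth stays tiny compared with the mapped size, because the mapping never passes through the allocator being traced. A profiler that hooks only malloc and free is in the same position with respect to native code that calls mmap itself.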
The Mistral team's investigation moved to the /proc filesystem, a virtual filesystem through which the kernel exposes information about running processes. Specifically, /proc/<pid>/maps lists a process's memory regions, including the heap, stack, shared libraries, and anonymous mappings.
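Two snapshots of /proc/<pid>/maps taken some time apart can be compared to see which kind of region is growing. The following is a rough sketch under stated assumptions rather than the method from the writeup: the helper names and category labels are invented here, and the totals are virtual sizes computed from address ranges, not resident memory (per-region Rss figures are in /proc/<pid>/smaps).

```python
from collections import defaultdict
from typing import Dict


def classify(path: str) -> str:
    """Bucket a maps entry by its pathname column."""
    if path == "":
        return "anonymous"       # no pathname: anonymous mapping
    if path == "[heap]":
        return "heap"
    if path.startswith("[stack"):
        return "stack"
    if path.startswith("["):
        return "special"         # e.g. [vdso], [vvar]
    return "file"                # shared libraries and other mapped files


def summarize_maps(maps_text: str) -> Dict[str, int]:
    """Sum region sizes in bytes per category for /proc/<pid>/maps text."""
    totals: Dict[str, int] = defaultdict(int)
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue
        start_hex, _, end_hex = fields[0].partition("-")
        size = int(end_hex, 16) - int(start_hex, 16)
        path = fields[5].strip() if len(fields) > 5 else ""
        totals[classify(path)] += size
    return dict(totals)


def growth_between(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-category size change between two summaries."""
    keys = set(before) | set(after)
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}
```

A stable "heap" total alongside a rising "anonymous" total would match the pattern described here: memory growing in mappings that never pass through malloc.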
According to Mathis Felardos, the engineer who authored the writeup, the investigation required "descending into kernel-level tracing" to uncover allocations happening through direct mmap calls outside glibc's control.
Broader context
The issue affected only a specific configuration: disaggregated Prefill/Decode serving where prefill requests (set with max_tokens=1 and empty KV Transfer metadata) are sent to prefill instances, then KV cache metadata is transferred alongside decode requests to decode instances via NIXL.
Mistral confirmed the issue with the vLLM team through a GitHub issue, establishing that other users had seen similar behavior. The writeup launches Mistral's new Engineering Deep Dive series focused on sharing technical investigations.
What this means
This investigation highlights the complexity of modern ML serving infrastructure, where multiple abstraction layers, from Python frameworks down to system-level memory allocators, can obscure performance issues. The leak appeared only with disaggregated serving, a particular model, and graph compilation enabled, which shows why production workloads often surface bugs that smaller-scale testing misses. For teams running vLLM at scale, particularly with disaggregated architectures, this case study provides a blueprint for debugging memory issues that fall outside the coverage of traditional profiling tools.