Mistral AI traces 400MB/minute memory leak in vLLM to kernel-level mmap calls outside heap

TL;DR

Mistral AI's engineering team documented their investigation of a memory leak in vLLM that caused 400MB/minute memory growth during disaggregated serving with Mistral Medium 3.1. The leak, which only appeared with specific conditions including graph compilation and NIXL-based KV cache transfer, was eventually traced to mmap allocations outside the traditional heap that standard profiling tools couldn't detect.

June 18, 2026 · 8:54 AM · 3 min read

Mistral AI has published a detailed technical breakdown of their investigation into a memory leak in vLLM that caused 400MB per minute memory growth under specific production conditions. The leak only manifested during disaggregated Prefill/Decode serving with Mistral Medium 3.1 and graph compilation enabled.

The issue surfaced during pre-production testing and led to out-of-memory states after several hours under production-like traffic. Notably, the leak produced no crashes or errors, only steady linear memory growth visible only in system monitoring.
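To make the growth rate concrete, a slope such as 400MB per minute can be estimated from periodic RSS readings with an ordinary least-squares fit. The following is a minimal sketch, not Mistral's tooling: the function names and the sample readings are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import Sequence, Tuple


def estimate_growth_mb_per_min(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of RSS (MB) against time (seconds), in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two (time, rss) samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    return num / den * 60.0  # convert MB/second to MB/minute


def minutes_until_exhausted(headroom_mb: float, rate_mb_per_min: float) -> float:
    """Time until the remaining headroom is consumed at a constant growth rate."""
    if rate_mb_per_min <= 0:
        raise ValueError("growth rate must be positive")
    return headroom_mb / rate_mb_per_min


if __name__ == "__main__":
    # Hypothetical readings: (seconds since start, RSS in MB).
    readings = [(0, 1000.0), (60, 1400.0), (120, 1800.0)]
    rate = estimate_growth_mb_per_min(readings)
    print(f"growth: {rate:.1f} MB/min")
    # Hypothetical headroom of 96,000 MB of free memory.
    print(f"exhausted in: {minutes_until_exhausted(96_000, rate):.0f} min")
```

At 400MB per minute a process gains about 24GB per hour, so even a host with a large amount of free memory would run out within a matter of hours.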

Isolation proved difficult

The leak appeared exclusively on the decode side of disaggregated setups using NIXL for KV cache transfer. NIXL relies on UCX (Unified Communication X), a high-performance communication library that enables optimized data transfer over technologies including InfiniBand.

Standard Python profiling tools including Memray and Guppy 3 showed no leak. Heavier tools like Valgrind were impractically slow for the vLLM setup. GDB caused the entire process to crash.

Heaptrack revealed the problem's location

The team turned to Heaptrack, a memory profiler that overrides malloc and free operations to record allocation events with stack traces. While Heaptrack confirmed heap memory remained stable, it revealed a discrepancy: peak Resident Set Size (RSS) increased between benchmark snapshots despite stable heap allocations.

This indicated the leak was occurring outside the heap—in memory regions that Heaptrack doesn't monitor. RSS includes not just the heap (managed by sbrk and brk system calls), but also the stack and anonymous memory mappings allocated directly via mmap system calls.
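On Linux, the kernel reports a coarse breakdown of RSS in /proc/<pid>/status. The sketch below parses the VmRSS, RssAnon, RssFile, and RssShmem fields (present in Linux 4.5 and later) from text in that format. It is an illustration rather than the method used in the writeup; the helper name and the sample text are hypothetical.

```python
import os
from typing import Dict

RSS_FIELDS = ("VmRSS", "RssAnon", "RssFile", "RssShmem")

# Hypothetical excerpt in the format of /proc/<pid>/status.
SAMPLE_STATUS = (
    "Name:\tpython3\n"
    "VmRSS:\t  204800 kB\n"
    "RssAnon:\t  190000 kB\n"
    "RssFile:\t   14000 kB\n"
    "RssShmem:\t     800 kB\n"
)


def parse_rss_fields(status_text: str) -> Dict[str, int]:
    """Return the RSS-related fields of a status file, in kB."""
    fields: Dict[str, int] = {}
    for line in status_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name in RSS_FIELDS:
            # Values look like "  204800 kB"; keep the number only.
            fields[name] = int(rest.split()[0])
    return fields


if __name__ == "__main__":
    path = "/proc/self/status"
    if os.path.exists(path):
        with open(path) as handle:
            text = handle.read()
    else:
        text = SAMPLE_STATUS  # fall back to the sample on non-Linux systems
    print(parse_rss_fields(text))
```

RssAnon covers resident private anonymous memory, which includes the heap, the stack, and anonymous mmap regions together, so a rising RssAnon alongside a stable heap profile points at the kind of gap described above.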

Why standard tools missed it

Heaptrack only hooks into glibc's malloc and free functions. Modern memory allocators often use mmap with anonymous mappings for larger blocks rather than traditional sbrk calls, because mmap offers more flexibility and supports huge pages (2MB or 1GB depending on system configuration).
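The same blind spot can be reproduced in miniature. In the sketch below, Python's tracemalloc stands in for an allocator-hooking profiler (it is not Heaptrack, and this is not the vLLM leak): it tracks allocations made through Python's memory allocators, while the region here is requested directly from the kernel as an anonymous mapping. The function name and sizes are hypothetical, and the flags used are Unix-only.

```python
import mmap
import tracemalloc
from typing import Tuple


def map_and_measure(size_bytes: int = 64 * 1024 * 1024) -> Tuple[int, int]:
    """Map anonymous memory and report (bytes mapped, bytes seen by tracemalloc)."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()

    # fileno -1 requests an anonymous mapping that bypasses malloc entirely.
    region = mmap.mmap(-1, size_bytes, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    # Touch one byte per page so the kernel backs the mapping with real pages,
    # which is what makes it count toward RSS.
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1

    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    region.close()
    return size_bytes, after - before


if __name__ == "__main__":
    mapped, traced = map_and_measure()
    print(f"mapped via mmap:      {mapped} bytes")
    print(f"seen by tracemalloc:  {traced} bytes")
```

The traced figure stays far below the mapped size because only the small Python wrapper object passes through the tracked allocator; the mapped pages themselves never do.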

The Mistral team's investigation moved to the /proc filesystem, a virtual filesystem through which the Linux kernel exposes information about running processes. Specifically, /proc/<pid>/maps lists a process's memory regions, including the heap, stack, shared libraries, and anonymous mappings.
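Each line of a maps file has the form "start-end perms offset dev inode [pathname]", and anonymous mappings have an empty pathname. A minimal sketch of totalling mapped bytes per category might look like the following. The sample lines and function names are made up for illustration, and the sizes are virtual address ranges rather than resident memory (per-mapping resident sizes are reported in /proc/<pid>/smaps).

```python
from collections import defaultdict
from typing import Dict

# Hypothetical excerpt in the format of /proc/<pid>/maps.
SAMPLE_MAPS = """\
55d0c0a00000-55d0c0a21000 rw-p 00000000 00:00 0                          [heap]
7f2c8c000000-7f2c8c200000 rw-p 00000000 00:00 0
7f2c8d000000-7f2c8d1c0000 r-xp 00000000 08:01 1311234                    /usr/lib/x86_64-linux-gnu/libc.so.6
7ffd4a5e0000-7ffd4a601000 rw-p 00000000 00:00 0                          [stack]
"""


def classify(pathname: str) -> str:
    """Bucket a mapping by its pathname column."""
    if pathname == "":
        return "anonymous"
    if pathname == "[heap]":
        return "heap"
    if pathname.startswith("[stack"):
        return "stack"
    if pathname.startswith("["):
        return "special"  # e.g. [vdso], [vvar]
    return "file"


def summarize_maps(maps_text: str) -> Dict[str, int]:
    """Sum the virtual size in bytes of each category of mapping."""
    totals: Dict[str, int] = defaultdict(int)
    for line in maps_text.splitlines():
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(value, 16) for value in parts[0].split("-"))
        pathname = parts[5].strip() if len(parts) > 5 else ""
        totals[classify(pathname)] += end - start
    return dict(totals)


if __name__ == "__main__":
    for category, size in sorted(summarize_maps(SAMPLE_MAPS).items()):
        print(f"{category:10s} {size} bytes")
```

Comparing such summaries between two points in time shows whether growth sits in the heap or in anonymous regions that an allocator-level profiler would not see.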

According to Mathis Felardos, the engineer who authored the writeup, the investigation required "descending into kernel-level tracing" to uncover allocations happening through direct mmap calls outside glibc's control.

Broader context

The issue affected only a specific configuration: disaggregated Prefill/Decode serving where prefill requests (set with max_tokens=1 and empty KV Transfer metadata) are sent to prefill instances, then KV cache metadata is transferred alongside decode requests to decode instances via NIXL.

Mistral confirmed the issue with the vLLM team through a GitHub issue, establishing that other users had seen similar behavior. The writeup launches Mistral's new Engineering Deep Dive series focused on sharing technical investigations.

What this means

This investigation highlights the complexity of modern ML serving infrastructure, where multiple abstraction layers, from Python frameworks down to system-level memory allocators, can obscure performance issues. The leak appeared only in disaggregated serving with a particular model and specific optimizations enabled, which demonstrates why production workloads often surface bugs that smaller-scale testing misses. For teams running vLLM at scale, particularly with disaggregated architectures, this case study provides a blueprint for debugging memory issues that fall outside the coverage of traditional profiling tools.

Source: mistral.ai


