Taming OOM on DGX Spark: Debugging Unified Memory Pressure in a 2-Node vLLM Cluster
Your cluster loads the model. Weights land in GPU memory. Then — crash. NV_ERR_NO_MEMORY. The OOM killer takes down VLLM::Worker_TP. You restart. Same thing. Every time.
This is not a memory leak. It is a subtle interaction between Linux page cache, NVIDIA UVM (Unified Virtual Memory), and the physics of loading a 148 GiB checkpoint into a system with only 121 GiB of physical RAM.
Here is exactly what went wrong, how we diagnosed it, and how we fixed it — permanently.
The Setup
Two NVIDIA DGX Sparks (GB10 Grace Blackwell Superchip) running DeepSeek-V4-Flash via vLLM 0.21.1rc1 in Docker. Tensor parallelism across both nodes (TP=2), expert parallelism enabled, NCCL over RDMA.
| Node | RAM | Role | Hostname |
|---|---|---|---|
| DGX Spark 1 | 121 GiB | Master (rank 0) | ai1 |
| DGX Spark 2 | 119 GiB | Worker (rank 1) | ai2 |
The model is loaded with --load-format safetensors, which memory-maps 46 shard files totaling 148 GiB. The model itself takes 74 GiB of GPU memory.
The Symptom
Every relaunch followed the same pattern:
- Both nodes start loading the checkpoint files from NVMe
- Model loads successfully (65–106s, 74 GiB allocated)
- Distributed barrier via GLOO connects both nodes
- vLLM attempts KV cache allocation
- NVIDIA driver returns NV_ERR_NO_MEMORY
- OOM killer terminates VLLM::Worker_TP
- Connection closed by peer on the other node
The dmesg output told the real story. On ai2:
NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY]
returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
VLLM::Worker_TP invoked oom-killer: gfp_mask=0x102cc2(GFP_HIGHUSER)
oom_kill_process+0x1f8/0x3f0
sysmemAllocResources+0xa8/0x2a0 [nvidia]
On ai1, the path went through UVM:
VLLM::Worker_TP invoked oom-killer: gfp_mask=0x80dc0(GFP_KERNEL|__GFP_ZERO)
phys_mem_allocate+0x14c/0x230 [nvidia_uvm]
uvm_va_range_map_rm_allocation+0x21c/0xa08 [nvidia_uvm]
Both point to the same root cause: the NVIDIA kernel driver could not allocate physical pages.
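To spot these signatures quickly after a failed launch, the kernel log can be filtered for the strings quoted above. This is a minimal sketch: the function name filter_oom_lines is ours, and the patterns come only from the excerpts in this article, so other driver versions may word the messages differently.

```shell
#!/bin/sh
# Filter kernel log text (on stdin) down to the OOM signatures quoted above.
# Patterns are taken from this article's dmesg excerpts only.
filter_oom_lines() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'NV_ERR_NO_MEMORY|invoked oom-killer|sysmemAllocResources|phys_mem_allocate' || true
}

# Typical use on a node (reading the kernel log usually needs root):
#   sudo dmesg | filter_oom_lines
```

An empty result after a crash would suggest the failure took a different path than the two shown here.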
Root Cause: The Physics of Unified Memory
DGX Spark uses a unified memory architecture: CPU and GPU share the same physical RAM, and there is no dedicated VRAM. When vLLM loads model weights, the 148 GiB of checkpoint files are read from NVMe into the Linux page cache and then mapped into GPU memory via NVIDIA UVM.
The critical sequence:
- NVMe reads shard files → kernel fills page cache (~148 GiB of I/O)
- UVM pins pages → physical pages transition from page cache to GPU ownership
- KV cache allocation → NVIDIA driver needs fresh physical pages
The problem is timing. Step 1 fills the page cache faster than step 2 can drain it. At peak I/O, the page cache consumes the remaining free memory. When step 3 arrives, the kernel has no pages to give — the NVIDIA driver goes straight into OOM.
This affects ai2 (119 GiB) more severely than ai1 (121 GiB) because it has 2 GiB less headroom.
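The race between page cache growth and free memory can be watched directly in /proc/meminfo while the model loads. The helpers below are an illustrative sketch, not part of the launch scripts; the names meminfo_gib and mem_snapshot are hypothetical, and the optional file argument exists only so the parsing can be tried against a saved copy.

```shell
#!/bin/sh
# Print one /proc/meminfo field converted from kB to whole GiB.
# $1 = field name (e.g. MemFree, Cached); $2 = file to read,
# defaulting to /proc/meminfo.
meminfo_gib() {
    awk -v key="$1:" '$1 == key { printf "%d\n", $2 / 1048576 }' "${2:-/proc/meminfo}"
}

# One-line snapshot of free memory versus page cache.
mem_snapshot() {
    echo "free=$(meminfo_gib MemFree "$1")GiB cached=$(meminfo_gib Cached "$1")GiB"
}

# Example: sample once per second in a second terminal during loading:
#   while sleep 1; do mem_snapshot; done
```

If the sequence described above holds, cached should climb while free falls toward zero shortly before the KV cache allocation.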
The Fixes
After many failed attempts (CUDA graph compilation crash, repeated OOM, I/O thrashing), the working solution combines four mechanisms:
1. Reserve Emergency Memory (vm.min_free_kbytes)
This is the single most impactful setting. vm.min_free_kbytes tells the kernel to keep a reserve of free pages that cannot be consumed by the page cache.
# ai2 (bottleneck node with 119 GiB)
echo 5242880 | sudo tee /proc/sys/vm/min_free_kbytes
# ai1 (121 GiB — slightly more headroom)
echo 3145728 | sudo tee /proc/sys/vm/min_free_kbytes
With 5 GiB reserved, the kernel starts aggressive page reclaim as soon as the page cache tries to consume the last 5 GiB of free memory, instead of letting the system run dry. In our setup this left sysmemAllocResources with pages to allocate on every launch.
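The two values written above are 5 GiB and 3 GiB expressed in kB, the unit vm.min_free_kbytes expects. A small helper makes the conversion explicit; the name gib_to_kib is ours and not part of any tool.

```shell
#!/bin/sh
# Convert whole GiB to the kB units used by vm.min_free_kbytes.
gib_to_kib() {
    echo $(( $1 * 1024 * 1024 ))
}

# Example (needs root to take effect):
#   gib_to_kib 5 | sudo tee /proc/sys/vm/min_free_kbytes
```

Deriving the number from a GiB figure makes it harder to drop or add a digit when sizing the reserve for a different node.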
We made the settings permanent in /etc/sysctl.d/90-vllm.conf:
# vm.min_free_kbytes: 5242880 on ai2, 3145728 on ai1
vm.min_free_kbytes = 5242880
vm.vfs_cache_pressure = 200
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
2. Aggressive Cache Dropping
The launch script runs a background loop that drops clean page cache every 0.1 seconds during model loading:
while true; do
sleep 0.1
echo 3 > /proc/sys/vm/drop_caches
done
At the original 0.5s interval, the page cache could grow by 1.5 GiB between drops (NVMe reads at ~3 GiB/s). At 0.1s, the maximum accumulation falls to ~300 MiB, small enough to prevent system-wide memory exhaustion.
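The figures above are just throughput multiplied by the drop interval. As a rough sketch (the function name is hypothetical, and it assumes a whole-number throughput in GiB/s):

```shell
#!/bin/sh
# Upper bound on page cache growth between two drops, in MiB.
# $1 = read throughput in GiB/s (integer), $2 = drop interval in milliseconds.
max_accumulation_mib() {
    echo $(( $1 * 1024 * $2 / 1000 ))
}

# Example:
#   max_accumulation_mib 3 500   # 0.5s interval at ~3 GiB/s
#   max_accumulation_mib 3 100   # 0.1s interval at ~3 GiB/s
```

For the two intervals discussed here this gives 1536 MiB and 307 MiB. It is an upper bound only, since it ignores pages that UVM takes over between drops.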
3. Readahead Throttling
The NVMe default readahead of 128 KB aggressively prefetches checkpoint shards, accelerating page cache buildup. We reduce it to 8 KB during model loading:
sudo blockdev --setra 16 /dev/nvme0n1 # 16 sectors × 512B = 8 KB
This slows model loading from ~68s to ~106s but keeps the page cache steady under 5 GiB throughout. A background timer restores the default 128 KB after 180 seconds.
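The throttle-and-restore step might look like the sketch below. It is not the actual wrapper script: the function names are hypothetical, and it saves the current value with blockdev --getra rather than assuming 128 KB, so the restore is correct even if the device default differs.

```shell
#!/bin/sh
# blockdev --setra takes a count of 512-byte sectors, not KB.
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Lower readahead on a device, then restore the previous value after a delay.
# $1 = device, $2 = temporary readahead in KB, $3 = seconds before restoring.
# Needs root; the restore runs in a background subshell.
throttle_readahead() {
    prev=$(blockdev --getra "$1") || return 1
    blockdev --setra "$(kb_to_sectors "$2")" "$1" || return 1
    ( sleep "$3"; blockdev --setra "$prev" "$1" ) &
}

# Example, run as root before launching vLLM:
#   throttle_readahead /dev/nvme0n1 8 180
```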
4. Aggressive Dirty Page Settings
# Max 5% dirty pages before blocking writers
vm.dirty_ratio = 5
# Start background writeback at 2%
vm.dirty_background_ratio = 2
# Aggressively reclaim inode/dentry caches
vm.vfs_cache_pressure = 200
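The two dirty-page limits keep the kernel from accumulating large amounts of dirty pages that compete with GPU allocations, and the higher vfs_cache_pressure makes it reclaim inode and dentry caches sooner.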
What DIDN'T Work
Several approaches failed before we found the right fix:
- CUDA graphs (--enforce-eager=False): Crashed both nodes during kernel compilation. DGX Spark's GB10 does not handle CUDA graph capture reliably with DeepSeek V4 Flash's MoE architecture. Required hard reboot.
- Lower gpu_memory_utilization (0.78 → 0.75 → 0.70): Merely reduced KV cache size without fixing the allocation failure. The OOM happens before vLLM's allocator runs.
- Slower cache drops (every 0.5s): Not fast enough. Page cache still accumulated faster than the drops cleared it.
- No cache drops: Immediate OOM on every launch.
- External cache drop loops (via SSH): The SSH connection dropped after the Docker container started, killing the drop loop. The launch script must run the loop locally.
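One way to keep the drop loop local to the launch script is to start it as a background job and stop it with an exit trap. This is a hedged sketch rather than the real launch-cluster.sh: start_drop_loop is a hypothetical name, and the optional target argument exists only so the loop can be exercised against an ordinary file.

```shell
#!/bin/sh
# Start a background loop that writes 3 to a drop_caches file at a fixed
# interval, and print the loop's PID.
# $1 = interval in seconds; $2 = target file, defaulting to
# /proc/sys/vm/drop_caches (writing there needs root).
start_drop_loop() {
    target="${2:-/proc/sys/vm/drop_caches}"
    (
        while true; do
            sleep "$1"
            echo 3 > "$target"
        done
    ) >/dev/null 2>&1 &
    echo $!
}

# Example inside a launch script:
#   loop_pid=$(start_drop_loop 0.1)
#   trap 'kill "$loop_pid" 2>/dev/null' EXIT
#   ... launch vLLM and wait for the model to load ...
```

Because the loop is a child of the script on the node itself, a dropped SSH session between machines no longer decides whether it keeps running.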
Results
After applying all fixes:
| Metric | Before | After |
|---|---|---|
| Launch reliability | ~20% (5+ attempts) | 100% (first try) |
| Model loading time | 65s | 106s (reduced readahead; acceptable tradeoff) |
| KV cache allocation | 313,395 tokens at gpu_mem=0.76 | ✅ |
| ai1 used memory | ~108 GiB / 121 GiB | ✅ |
| ai2 used memory | ~113 GiB / 119 GiB | ✅ |
| Page cache peak | >40 GiB (OOM threshold) | <5 GiB |
| Inference TPS | 5.5 | 11.0 |
The cluster runs stable with gpu_memory_utilization=0.76, max_model_len=180000, and max_num_seqs=2.
The Configuration
# /etc/sysctl.d/90-vllm.conf (same on both nodes except vm.min_free_kbytes)
# vm.min_free_kbytes: 5242880 on ai2, 3145728 on ai1
vm.min_free_kbytes = 5242880
vm.vfs_cache_pressure = 200
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
# launch-cluster.sh (built-in): cache drops every 0.1s
# start-cluster-wrapper.sh: drops caches before launch,
# reduces NVMe readahead to 8KB, restores after 180s
# Restart after reboot:
cd ~/workspace/spark-vllm-docker
./start-cluster-wrapper.sh
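After a reboot it is worth confirming that the sysctl values were actually applied before launching. The check below is a minimal sketch that reads procfs directly; check_sysctl is a hypothetical helper, and its optional third argument (an alternative root directory) is there only for testing.

```shell
#!/bin/sh
# Compare one sysctl against an expected value by reading procfs directly.
# $1 = dotted key (e.g. vm.dirty_ratio); $2 = expected value;
# $3 = procfs sys root, defaulting to /proc/sys.
check_sysctl() {
    path="${3:-/proc/sys}/$(echo "$1" | tr . /)"
    actual=$(cat "$path" 2>/dev/null) || { echo "MISSING $1"; return 1; }
    if [ "$actual" = "$2" ]; then
        echo "OK $1=$actual"
    else
        echo "MISMATCH $1=$actual (expected $2)"
        return 1
    fi
}

# Example on ai2:
#   check_sysctl vm.min_free_kbytes 5242880
#   check_sysctl vm.vfs_cache_pressure 200
#   check_sysctl vm.dirty_ratio 5
#   check_sysctl vm.dirty_background_ratio 2
```

The function returns non-zero on a mismatch, so a wrapper script could refuse to launch until every check passes.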
Key Takeaways
1. Unified memory changes the OOM physics. On a system with dedicated VRAM, GPU memory pressure doesn't affect host memory. On Grace Blackwell, CPU and GPU share the same physical memory, so page cache from checkpoint I/O can starve GPU allocations.
2. vm.min_free_kbytes is the safety net; throttling is not. Cache dropping and readahead reduction only lower the peak. They don't prevent a transient spike from tipping the system over. The min_free reserve guarantees breathing room.
3. Slower loading is cheaper than OOM recovery. A 40-second increase in model loading time is nothing compared to a 5-minute reboot cycle after OOM.
4. The same config ran stable for 31 hours. The OOM was a launch-time issue, not a runtime issue. Once the KV cache was allocated and the page cache settled, the cluster was solid.
See Also
- Two-Node DGX Spark Cluster: Running DeepSeek V4 Flash at 16 TPS — The same cluster's performance optimization journey, baseline to 3× throughput
- n8n Automation on GB10: AI-Powered Workflows — Using GB10 hardware for self-hosted workflow automation