Taming OOM on DGX Spark: Debugging Unified Memory Pressure in a 2-Node vLLM Cluster

Your cluster loads the model. Weights land in GPU memory. Then — crash. NV_ERR_NO_MEMORY. The OOM killer takes down VLLM::Worker_TP. You restart. Same thing. Every time.

This is not a memory leak. It is a subtle interaction between Linux page cache, NVIDIA UVM (Unified Virtual Memory), and the physics of loading a 148 GiB checkpoint into a system with only 121 GiB of physical RAM.

Here is exactly what went wrong, how we diagnosed it, and how we fixed it — permanently.

The Setup

Two NVIDIA DGX Sparks (GB10 Grace Blackwell Superchip) running DeepSeek-V4-Flash via vLLM 0.21.1rc1 in Docker. Tensor parallelism across both nodes (TP=2), expert parallelism enabled, NCCL over RDMA.

Node	RAM	Role	Hostname
DGX Spark 1	121 GiB	Master (rank 0)	`ai1`
DGX Spark 2	119 GiB	Worker (rank 1)	`ai2`

The model is loaded with --load-format safetensors, which memory-maps 46 shard files totaling 148 GiB. The model weights take 74 GiB of GPU memory per node.

The Symptom

Every relaunch followed the same pattern:

1. Both nodes start loading the checkpoint files from NVMe
2. Model loads successfully (65–106s, 74 GiB allocated)
3. Distributed barrier via GLOO connects both nodes
4. vLLM attempts KV cache allocation
5. NVIDIA driver returns NV_ERR_NO_MEMORY
6. OOM killer terminates VLLM::Worker_TP
7. Connection closed by peer on the other node

The dmesg output told the real story. On ai2:

NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory [NV_ERR_NO_MEMORY]
       returned from _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359
VLLM::Worker_TP invoked oom-killer: gfp_mask=0x102cc2(GFP_HIGHUSER)
  oom_kill_process+0x1f8/0x3f0
  sysmemAllocResources+0xa8/0x2a0 [nvidia]

On ai1, the path went through UVM:

VLLM::Worker_TP invoked oom-killer: gfp_mask=0x80dc0(GFP_KERNEL|__GFP_ZERO)
  phys_mem_allocate+0x14c/0x230 [nvidia_uvm]
  uvm_va_range_map_rm_allocation+0x21c/0xa08 [nvidia_uvm]

Both point to the same root cause: the NVIDIA kernel driver could not allocate physical pages.
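To pull the same signatures out of a noisy kernel log, a small filter helps. This is a minimal sketch: the function name filter_oom_lines is made up for illustration, and the patterns are taken only from the log excerpts above.

```shell
#!/usr/bin/env bash
# Keep only kernel log lines that mention the NVIDIA allocation failure,
# an OOM-killer invocation, or a stack frame inside an NVIDIA kernel module.
# (filter_oom_lines is a hypothetical helper name, not part of any tool.)
filter_oom_lines() {
    grep -E 'NV_ERR_NO_MEMORY|invoked oom-killer|\[nvidia(_uvm)?\]'
}

# Usage (reading the kernel log typically needs root):
#   sudo dmesg -T | filter_oom_lines
```

Run it on both nodes right after a failed launch, before the log rotates, so the two call paths can be compared side by side.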

Root Cause: The Physics of Unified Memory

DGX Spark uses a unified memory architecture: CPU and GPU share the same physical LPDDR5X memory. There is no dedicated VRAM. When vLLM loads model weights, the 148 GiB of checkpoint files are read from NVMe into the Linux page cache and then mapped into GPU memory via NVIDIA UVM.

The critical sequence:

1. NVMe reads shard files → kernel fills page cache (~148 GiB of I/O)
2. UVM pins pages → physical pages transition from page cache to GPU ownership
3. KV cache allocation → NVIDIA driver needs fresh physical pages

The problem is timing. Step 1 fills the page cache faster than step 2 can drain it. At peak I/O, the page cache consumes the remaining free memory. When step 3 arrives, the kernel has no pages to give — the NVIDIA driver goes straight into OOM.

This affects ai2 (119 GiB) more severely than ai1 (121 GiB) because it has 2 GiB less headroom.
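One way to watch this race as it happens is to sample /proc/meminfo while the checkpoint loads. The sketch below is an illustration, not the tooling used in the original investigation; the helper names meminfo_gib and watch_memory are hypothetical. MemFree and Cached are standard /proc/meminfo fields reported in kB.

```shell
#!/usr/bin/env bash
# Print one /proc/meminfo field (reported in kB) converted to GiB.
# The optional second argument overrides the file, which is handy for testing.
meminfo_gib() {
    awk -v key="$1:" '$1 == key { printf "%.1f\n", $2 / 1048576 }' "${2:-/proc/meminfo}"
}

# Print free memory and page cache size once per second until interrupted.
watch_memory() {
    while true; do
        printf '%s free=%s GiB cached=%s GiB\n' \
            "$(date +%T)" "$(meminfo_gib MemFree)" "$(meminfo_gib Cached)"
        sleep 1
    done
}

# Usage: run watch_memory in a second terminal on each node during launch.
```

If the cached figure climbs while the free figure approaches zero just before the KV cache step, the timing explanation above fits what the node is doing.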

The Fixes

After many failed attempts (CUDA graph compilation crash, repeated OOM, I/O thrashing), the working solution combines four mechanisms:

1. Reserve Emergency Memory (`vm.min_free_kbytes`)

This is the single most impactful setting. vm.min_free_kbytes tells the kernel to keep a reserve of free pages that cannot be consumed by the page cache.

# ai2 (bottleneck node with 119 GiB)
echo 5242880 | sudo tee /proc/sys/vm/min_free_kbytes
# ai1 (121 GiB — slightly more headroom)
echo 3145728 | sudo tee /proc/sys/vm/min_free_kbytes

At 5 GiB reserved, the kernel begins aggressive page reclaim as free memory approaches the last 5 GiB, instead of letting the page cache run the system dry. In our runs, that left sysmemAllocResources with pages to allocate.
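The values written to vm.min_free_kbytes are plain GiB-to-kB conversions (1 GiB = 1024 × 1024 kB). A minimal sketch of that arithmetic, with gib_to_kb as a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Convert a whole number of GiB to kB, the unit vm.min_free_kbytes expects.
# (gib_to_kb is a hypothetical helper name used for illustration.)
gib_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

# Usage:
#   gib_to_kb 5    # 5242880, the ai2 value
#   gib_to_kb 3    # 3145728, the ai1 value
#   gib_to_kb 5 | sudo tee /proc/sys/vm/min_free_kbytes
```

Deriving the number from a GiB figure makes it easier to retune the reserve for a node with a different amount of RAM.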

Made permanent via /etc/sysctl.d/90-vllm.conf:

# ai2 value shown; use 3145728 on ai1
vm.min_free_kbytes = 5242880
vm.vfs_cache_pressure = 200
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2
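After editing the file, sudo sysctl --system reloads every sysctl.d file without a reboot. The sketch below then checks that the live values match. The helper names sysctl_path and check_sysctl are hypothetical, and the expected values are the ones from the file above.

```shell
#!/usr/bin/env bash
# Map a sysctl key such as vm.dirty_ratio to its path under /proc/sys.
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr '.' '/')"
}

# Compare the live value of a key against the value we expect to be set.
check_sysctl() {
    local key="$1" expected="$2" actual
    actual=$(cat "$(sysctl_path "$key")")
    if [ "$actual" = "$expected" ]; then
        echo "ok   $key = $actual"
    else
        echo "DIFF $key = $actual (expected $expected)"
    fi
}

# Usage on ai2 (use 3145728 for vm.min_free_kbytes on ai1):
#   sudo sysctl --system
#   check_sysctl vm.min_free_kbytes 5242880
#   check_sysctl vm.vfs_cache_pressure 200
#   check_sysctl vm.dirty_ratio 5
#   check_sysctl vm.dirty_background_ratio 2
```

Running the check after every reboot is a cheap way to confirm the settings survived before starting a launch.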

2. Aggressive Cache Dropping

The launch script runs a background loop that drops clean page cache, along with reclaimable dentries and inodes (the effect of writing 3), every 0.1 seconds during model loading:

while true; do
    sleep 0.1
    echo 3 > /proc/sys/vm/drop_caches
done

At the earlier 0.5s interval, the page cache could fill 1.5 GiB between drops (NVMe reads at ~3 GiB/s). At 0.1s, the maximum accumulation drops to ~300 MiB, small enough to prevent system-wide memory exhaustion.
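The loop above runs forever, so something has to stop it once loading is done. One possible variant, sketched here under assumptions, ties the loop to the lifetime of a process: it keeps dropping caches only while a given PID is alive. The function name drop_caches_while and the idea of passing the loader's PID are illustrative and not taken from the launch script. The third argument exists only so the function can be exercised without root.

```shell
#!/usr/bin/env bash
# Drop caches every INTERVAL seconds for as long as process PID is alive.
# (drop_caches_while is a hypothetical helper; must run as root in real use.)
drop_caches_while() {
    local pid="$1"
    local interval="${2:-0.1}"
    local target="${3:-/proc/sys/vm/drop_caches}"
    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        echo 3 > "$target"
        sleep "$interval"
    done
}

# Usage (as root), assuming LOADER_PID holds the PID of the loading process:
#   drop_caches_while "$LOADER_PID" 0.1 &
```

Because the loop runs on the node itself, it does not depend on an SSH session staying open.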

3. Readahead Throttling

The NVMe default readahead of 128 KB aggressively prefetches checkpoint shards, accelerating page cache buildup. We reduce it to 8 KB during model loading:

sudo blockdev --setra 16 /dev/nvme0n1   # 16 sectors × 512B = 8 KB

This slows model loading from ~68s to ~106s but keeps the page cache steady under 5 GiB throughout. A background timer restores the default 128 KB after 180 seconds.
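blockdev --setra takes its value in 512-byte sectors, so 8 KB is 16 sectors and the 128 KB default is 256. The sketch below shows that conversion plus one alternative to a fixed 180-second timer: restoring the previous readahead as soon as a wrapped command returns. The helper names kb_to_sectors and with_readahead are hypothetical, and the original wrapper uses the timer approach, not this one.

```shell
#!/usr/bin/env bash
# Convert a readahead size in KB to the 512-byte sector count blockdev expects.
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Run a command with reduced readahead on DEV, then restore the old setting.
# (with_readahead is a hypothetical helper; blockdev needs root, hence sudo.)
with_readahead() {
    local dev="$1" kb="$2"
    shift 2
    local previous status=0
    previous=$(sudo blockdev --getra "$dev")
    sudo blockdev --setra "$(kb_to_sectors "$kb")" "$dev"
    "$@" || status=$?
    sudo blockdev --setra "$previous" "$dev"
    return "$status"
}

# Usage:
#   with_readahead /dev/nvme0n1 8 ./launch-cluster.sh
```

Restoring from the saved value avoids hard-coding 256 sectors, which matters if a node was already tuned to something other than the default. This only works if the wrapped command blocks until loading has finished; otherwise the timer is the safer choice.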

4. Aggressive Dirty Page Settings

# Block writers once dirty pages reach 5% of available memory
vm.dirty_ratio = 5
# Start background writeback at 2%
vm.dirty_background_ratio = 2
# Reclaim inode/dentry caches more aggressively than the default of 100
vm.vfs_cache_pressure = 200

The dirty-page limits keep the kernel from accumulating large amounts of dirty pages that compete with GPU allocations, and the higher vfs_cache_pressure makes it reclaim dentry and inode caches sooner.

What DIDN'T Work

Several approaches failed before we found the right fix:

CUDA graphs (--enforce-eager=False): Crashed both nodes during kernel compilation. In our setup, DGX Spark's GB10 did not handle CUDA graph capture reliably with DeepSeek V4 Flash's MoE architecture, and recovery required a hard reboot.
Lower gpu_memory_utilization (0.78 → 0.75 → 0.70): Merely reduced KV cache size without fixing the allocation failure. The OOM happens before vLLM's allocator runs.
Less frequent cache drops (0.5s): Not fast enough. The page cache still accumulated faster than the drops cleared it.
No cache drops: Immediate OOM on every launch.
External cache drop loops (via SSH): The SSH connection dropped after the Docker container started, killing the drop loop. The launch script must run the loop locally.

Results

After applying all fixes:

Metric	Before	After
Launch reliability	~20% (5+ attempts)	100% (first try)
Model loading time	65s	106s (reduced readahead slows NVMe reads; acceptable tradeoff)
KV cache allocation	313,395 tokens at gpu_mem=0.76	✅
ai1 used memory	~108 GiB / 121 GiB	✅
ai2 used memory	~113 GiB / 119 GiB	✅
Page cache peak	>40 GiB (OOM threshold)	<5 GiB
Inference TPS	5.5	11.0

The cluster runs stably with gpu_memory_utilization=0.76, max_model_len=180000, and max_num_seqs=2.

The Configuration

# /etc/sysctl.d/90-vllm.conf (identical on both nodes except vm.min_free_kbytes)
# ai2 value shown; use 3145728 on ai1
vm.min_free_kbytes = 5242880
vm.vfs_cache_pressure = 200
vm.dirty_ratio = 5
vm.dirty_background_ratio = 2

# launch-cluster.sh (built-in): cache drops every 0.1s
# start-cluster-wrapper.sh: drops caches before launch,
#   reduces NVMe readahead to 8KB, restores after 180s

# Restart after reboot:
cd ~/workspace/spark-vllm-docker
./start-cluster-wrapper.sh

Key Takeaways

1. Unified memory changes the OOM physics. On a system with dedicated VRAM, host page cache and GPU allocations do not compete for the same pages. On Grace Blackwell, they share the same physical memory, so page cache from checkpoint I/O can starve GPU allocations.

2. vm.min_free_kbytes is the safety net; throttling is not. Cache dropping and readahead reduction only lower the peak; they don't prevent a transient spike from tipping the system over. The min_free reserve provides the breathing room.

3. Slower loading is cheaper than OOM recovery. A 40-second increase in model loading time is nothing compared to a 5-minute reboot cycle after OOM.

4. The same config ran stable for 31 hours. The OOM was a launch-time issue, not a runtime issue. Once the KV cache was allocated and the page cache settled, the cluster was solid.
