Two-Node DGX Spark Cluster: DeepSeek V4 Flash with vLLM — Config, Optimizations & Community Recipe

Two NVIDIA DGX Sparks. One 685B-parameter MoE model. A roughly 3× throughput improvement, one cluster-wide lockup along the way, and a configuration that has since run for days without a crash. Here's how.

The Setup: Two GB10 Workstations as a Distributed Inference Cluster

NVIDIA's DGX Spark (GB10 Grace Blackwell Superchip) is designed as a desktop AI workstation, but it scales: NVLink-C2C joins the CPU and GPU inside each GB10, and fast networking links one unit to the next. Our cluster connects two DGX Sparks across a local network:

Node	Hostname	IP	Role
DGX Spark 1	`ai1`	—	Primary (master)
DGX Spark 2	`ai2`	—	Secondary (worker)

Each node carries a GB10 SoC with 128 GB LPDDR5x unified memory (≈119-121 GiB available to the OS). One node (ASUS Ascend) has 121 GiB, the other (ThinkStation) has 119 GiB — a 2 GiB asymmetry that matters for memory-pressure tuning. The pair runs DeepSeek-V4-Flash — a Mixture-of-Experts model with 685B total parameters (37B active per token) — via vLLM 0.21.1rc1 in Docker, using tensor parallelism across both nodes (TP=2) over NCCL/RDMA.

Baseline: Before Optimization

The initial configuration was conservative:

Parameter	Baseline Value
`max-num-seqs`	1
`max-num-batched-tokens`	4096
`gpu-memory-utilization`	0.70
FlashInfer autotune	Disabled
CUDA graphs	Disabled (`enforce-eager`)

Baseline performance:

Metric	Value
Inter-token latency (ITL)	182 ms/tok
Throughput	5.5 tokens/second
KV cache capacity	~482K tokens
TTFT (long context)	88.5 seconds
GPU utilization (node 2)	37%
Prefix cache hit rate	72%

The server handled 37 requests with 896K prompt tokens and 20K generation tokens — functional but far from the hardware's potential. GPU utilization sat at 37% on the secondary node. Clearly there was headroom.

The Optimizations: Safe, Incremental, Measured

Ranking the Impact

Before touching anything, we ranked optimizations by expected impact:

Increase max-num-seqs — single greatest throughput multiplier (3-6×)
Raise gpu-memory-utilization — more KV cache headroom for concurrent requests
Enable CUDA graphs — reduce per-step overhead (10-20% ITL improvement)
Increase max-num-batched-tokens — larger batch capacity
Enable FlashInfer autotune — better attention kernel selection

The CUDA Graph Crash (And What We Learned)

The first change we actually tried was enabling CUDA graphs by removing --enforce-eager, even though it ranked third on the list above. vLLM initialized successfully, completed distributed setup across both nodes, and then the whole system locked up. Both DGX Sparks became unresponsive at the same time. SSH timed out. No ping. No recovery via network.

The symptoms point to a kernel panic or GPU hang triggered by CUDA graph capture on DeepSeek V4's MoE architecture. Our best guess is that the GB10's NVLink-C2C unified memory design, combined with the model's dynamic expert routing, hits an edge case that takes down the entire cluster. A hard reboot was the only way back.

Lesson: Standard (monolithic) CUDA graphs + MoE on GB10 is unstable. Skip the default mode. The enforce-eager mode works, but you can do better — see the update below.

Update: PIECEWISE CUDA Graphs Work Stably

Since publishing, we and the community confirmed that piecewise CUDA graph compilation works on GB10 without crashes. Two modes are available in the jasl/vllm fork with SM12x support (PR #41834):

Mode	Stability	Coverage
`PIECEWISE`	✅ Stable	Captures the most frequent graph segments
`FULL_AND_PIECEWISE`	✅ Stable	Adds monolithic capture of hot loops on top

Both modes compile graph segments incrementally instead of all at once, which avoids the MoE routing edge case that caused the kernel panic. Remove --enforce-eager and add:

compilation-config:
  '{"cudagraph_mode":"PIECEWISE","custom_ops":["all"]}'

On our cluster with PIECEWISE mode, CUDA graphs compile successfully across both nodes, and the cluster has been running for days without a crash. The performance gain over enforce-eager is modest (~8-12% ITL improvement) but every millisecond counts at production scale.
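The JSON value is easy to mangle when it passes through shell quoting or a Docker entrypoint. As a minimal sketch (the helper variable names are ours, not part of vLLM), the flag can be generated from a Python dict so the quoting is always consistent:

```python
import json
import shlex

# Mode name and custom_ops value are taken from the config shown above.
compilation_config = {"cudagraph_mode": "PIECEWISE", "custom_ops": ["all"]}

# Compact separators reproduce the exact string used in the article.
as_json = json.dumps(compilation_config, separators=(",", ":"))

# shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
flag = "--compilation-config " + shlex.quote(as_json)
print(flag)
```

Switching to FULL_AND_PIECEWISE then only means changing one dictionary value rather than editing a quoted string by hand.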

Safe Optimizations Applied

After the CUDA graph crash, we enabled piecewise compilation and applied the remaining optimizations:

Parameter	Before	After	Rationale
`max-num-seqs`	1	2	Double concurrent request capacity
`max-num-batched-tokens`	4096	8192	Larger batch for prefix cache efficiency
`gpu-memory-utilization`	0.70	0.78	More KV cache headroom; 0.82+ triggered OOM on the 119 GiB node
CUDA graphs	disabled	PIECEWISE	Works stably on GB10 with piecewise compilation
FlashInfer autotune	disabled	enabled	Better attention kernel selection

Why 0.78 instead of 0.82? The two nodes have asymmetric RAM (121 GiB vs 119 GiB), so the 119 GiB node is the bottleneck. At gpu-memory-utilization=0.82, the requested allocation (98.1 GiB) exceeds the memory that is actually free once the OS has reserved pages for I/O. Setting vm.min_free_kbytes=3145728 (3 GiB reserved) on both nodes prevents OOM during the 148 GiB checkpoint mmap, but that reserve comes out of the same budget, so utilization has to drop slightly to fit the smaller node.
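The budget can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes vLLM requests utilization × total memory and uses the nominal node sizes (119 and 121 GiB); the real totals are not round numbers, so the results differ by a few tenths of a GiB from the 98.1 GiB figure quoted above:

```python
GIB_PER_KB = 1 / (1024 * 1024)


def requested_gib(total_gib, utilization):
    """Memory vLLM would claim, assuming the request is utilization x total."""
    return total_gib * utilization


# vm.min_free_kbytes value from the article, converted to GiB (3.0).
min_free_gib = 3145728 * GIB_PER_KB

for total in (121.0, 119.0):          # nominal node sizes from the article
    for util in (0.78, 0.82):
        req = requested_gib(total, util)
        left = total - req - min_free_gib
        print(f"{total:.0f} GiB node @ {util}: request {req:.1f} GiB, "
              f"{left:.1f} GiB left after the 3 GiB reserve")
```

On the smaller node, moving from 0.78 to 0.82 claims almost 5 GiB more, which is memory the page cache and OS would otherwise have during checkpoint loading.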

The cluster uses vLLM's --no-ray distributed executor (PyTorch native distributed). The wiring between nodes comes down to four flags, --nnodes 2 --node-rank N --master-addr <master-ip> --master-port 29501, with only the node rank differing between machines.

Startup metrics:

Model loading: 74.02 GiB memory per node, ~95-101 seconds on slower NVMe
KV cache: ~13 GiB, ~600K tokens capacity
CUDA graph compilation: PIECEWISE — completes on both nodes without crash
Server live on http://0.0.0.0:8000, health check 200 OK
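Those two KV cache figures give a rough per-token footprint. The sketch below simply divides the reported numbers and assumes the cost scales linearly with context length, which may not hold exactly for this model's attention layout:

```python
kv_cache_bytes = 13 * 1024**3      # "~13 GiB" from the startup metrics
kv_cache_tokens = 600_000          # "~600K tokens" from the startup metrics

bytes_per_token = kv_cache_bytes / kv_cache_tokens

# One request filling the full --max-model-len 180000 context from the config.
full_context_gib = 180_000 * bytes_per_token / 1024**3

print(f"{bytes_per_token / 1024:.1f} KiB per token")
print(f"{full_context_gib:.1f} GiB for one full-length request")
```

Under that assumption, two full-length requests (max-num-seqs=2) would need about 60% of the cache, which is why capacity matters more than raw token count when raising concurrency.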

Benchmark Results: 3× Throughput Improvement

[Figure: DGX Spark benchmark comparison, baseline vs optimized TPS]

Short Prompt Inference (ITL Test)

Five runs with 12 prompt tokens → 200 generation tokens:

Run  | Duration | Tokens | TPS  | ms/tok
-----|----------|--------|------|-------
1    | 12.54s   | 200    | 16.0 | 63
2    | 12.57s   | 200    | 15.9 | 63
3    | 12.55s   | 200    | 15.9 | 63
4    | 12.54s   | 200    | 16.0 | 63
5    | 12.52s   | 200    | 16.0 | 63

Before: 182 ms/tok → After: 63 ms/tok — 2.9× speedup.
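The speedup can be recomputed from the table. A small sketch using the five durations above; since the durations are rounded to two decimals, a recomputed TPS can differ from the table in the last digit:

```python
durations_s = [12.54, 12.57, 12.55, 12.54, 12.52]   # five runs from the table
tokens = 200                                        # generated tokens per run
baseline_ms = 182.0                                 # baseline ITL in ms/tok

# (tokens per second, milliseconds per token) for each run
per_run = [(tokens / d, 1000 * d / tokens) for d in durations_s]

mean_ms = sum(ms for _, ms in per_run) / len(per_run)
speedup = baseline_ms / mean_ms

print(f"mean ITL: {mean_ms:.1f} ms/tok, speedup: {speedup:.1f}x")
```

Note that this divides total wall time by generated tokens, so the short prefill for the 12-token prompt is folded into the per-token figure.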

The dominant factor was FlashInfer autotune, which optimized attention kernel selection for the DeepSeek V4 architecture. Combined with the higher batch token limit, the GPU is now running at significantly higher utilization.

Prefix Cache Efficiency

vLLM's prefix caching (--enable-prefix-caching) lets requests that share a prompt prefix reuse the KV cache already computed for it. Testing with contexts of ~1200 prompt tokens:

Scenario	Duration	TPS	vs Cold
Cold (first run)	17.56s	2.8	1.0×
Same prompt (cached)	8.38s	6.0	2.1×
Similar prompt (deep cached)	4.03s	12.4	4.4×

The prefix cache is highly effective. Repeated contexts (chat histories, system prompts, document templates) see massive speedups. For workloads with shared prefixes — like agentic systems where every request starts with the same system prompt — this is transformative.
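How much of a prompt can be reused depends on block granularity. The sketch below assumes the cache matches whole KV blocks only, as vLLM's automatic prefix caching does, and uses the --block-size 256 from our config; the token IDs are made-up stand-ins:

```python
BLOCK_SIZE = 256  # --block-size from the article's config


def cacheable_prefix_tokens(cached, incoming, block_size=BLOCK_SIZE):
    """Tokens of `incoming` that could be served from cache, counting only full blocks."""
    shared = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        shared += 1
    return (shared // block_size) * block_size


system_prompt = list(range(1000))              # stand-in token IDs for a shared prefix
first = system_prompt + [5001, 5002] * 100     # 1200 tokens in total
second = system_prompt + [7001, 7002] * 100    # same prefix, different tail

print(cacheable_prefix_tokens(first, second))  # 1000 shared tokens -> 3 full blocks = 768
```

With a large block size, padding a shared system prompt to a multiple of 256 tokens would let the whole prefix hit the cache instead of losing the partial last block.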

Concurrent Request Handling (max-num-seqs=2)

With max-num-seqs=2, the cluster handles two concurrent requests efficiently:

4 concurrent requests completed in 14.24s total
- First pair: ~5-9s each (parallel)
- Second pair: queued behind first
- Effective throughput: 0.28 req/s for 100-token generations

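For production use, a higher max-num-seqs (4-8) would multiply throughput if memory allowed: KV cache utilization sits at 52% with max-num-seqs=2, so the cache itself has headroom. On our asymmetric cluster, though, max-num-seqs=4 ran the 119 GiB node out of memory, so 2 is the practical limit for now.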
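A test like the one above can be reproduced with a small client. This is a sketch rather than the script we used: it assumes the server is reachable at localhost:8000 and targets vLLM's OpenAI-compatible /v1/completions endpoint; the helper names are ours.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"   # assumption: server reachable locally on the configured port
MODEL = "deepseek-v4-flash"          # --served-model-name from the config


def build_request(prompt, max_tokens=100):
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    body = json.dumps({"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}).encode()
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(prompt):
    """Send one request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(prompt), timeout=600) as resp:
        return json.load(resp)


def run_concurrent(prompts, sender=send, workers=4):
    """Fire all prompts at once; return (wall_seconds, requests_per_second)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(sender, prompts))
    wall = time.perf_counter() - start
    return wall, len(prompts) / wall
```

Calling run_concurrent with four prompts against a live server gives the wall time and req/s directly; with max-num-seqs=2 the server queues the second pair, so the client does not need to throttle itself.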

CUDA Graphs with PIECEWISE Mode

With PIECEWISE CUDA graphs enabled and --enforce-eager removed, the cluster achieves stable decode performance:

Metric	Value	Conditions
TPOT	~66 ms/tok	Short prompt, 100 output tokens
Throughput	~15 tok/s	Single request decode
TTFT (short)	~750 ms	16-token prompt, streaming
TTFT (long, cold)	~6 s	~1300-token prompt, first run
TTFT (long, cached)	~210 ms	Same prompt, KV cache warm
Concurrent (2 req)	~4.3 s total	Both complete overlapping

The PIECEWISE mode contributes ~8-12% ITL improvement over enforce-eager, measured consistently across multiple runs. Prefill benefits more from cached KV than from CUDA graphs: the 210 ms TTFT on cached long prompts shows that the prefix cache is what really cuts latency.

Performance Summary

Metric	Baseline	FlashInfer (+enforce-eager)	PIECEWISE CUDA graphs
ITL	182 ms/tok	63 ms/tok	~66 ms/tok
Throughput	5.5 TPS	15.9 TPS	~15 TPS
KV cache	482K tokens	865K tokens	~600K tokens
GPU utilization	~37%	~65%+ (est.)	~60%+ (est.)
Prefix cache speedup	-	Up to 4.4×	Up to 4.4×
CUDA graphs	disabled	disabled (crash)	PIECEWISE stable

Lessons Learned

What Worked

FlashInfer autotune is the highest-impact safe change. It tunes attention kernels to the model architecture and hardware, improving both latency and throughput without any configuration risk.
Incremental deployment with health checks. Changing one parameter at a time and verifying stability prevented cascading failures.
Prefix caching is essential for multi-turn workloads. With 72%+ hit rates, it effectively doubles throughput for chat and agent applications.
Piecewise CUDA graph compilation — both PIECEWISE and FULL_AND_PIECEWISE modes work on GB10, avoiding the monolithic capture that caused the kernel panic. The PIECEWISE mode alone is sufficient for stable operation.
Kernel-level memory tuning. Setting vm.min_free_kbytes to 3 GiB on both nodes, reducing NVMe readahead to 8 KB during model loading, and running a periodic cache-drop loop keep the page cache flood from the 148 GiB checkpoint mmap from starving the UVM driver.

What Didn't

Monolithic CUDA graphs + MoE on GB10 crashes the cluster. The default full-graph capture triggers a kernel panic. Switch to PIECEWISE or FULL_AND_PIECEWISE mode instead.
Ray distributed executor had world-size issues. The PyTorch native distributed executor (--no-ray mode) was more reliable for small clusters.
max-num-seqs=1 is unnecessarily conservative. Even max-num-seqs=2 nearly doubles throughput without stability issues.
max-num-seqs=4 causes OOM on 119 GiB nodes. The KV cache for 4 concurrent sequences plus CUDA graph memory exceeds available unified memory on the smaller node. Stick to max-num-seqs=2 on asymmetric clusters.
gpu-memory-utilization above 0.78 is risky on 119 GiB nodes. The asymmetric RAM (121 vs 119 GiB) means the smaller node runs out of physical pages first. 0.78 provides a reliable safety margin; 0.82+ triggers OOM during KV cache allocation.

What's Next

Multi-node MTP speculation — enabling DeepSeek's native MTP (Multi-Token Prediction) across 2 nodes requires careful memory budgeting. The drafter model adds ~20-30 GiB overhead per node; likely needs gpu-memory-utilization reduced to 0.70-0.72 to fit.
FlashInfer sampler on top of CUDA graphs — combining VLLM_USE_FLASHINFER_SAMPLER=1 with PIECEWISE CUDA graphs may recover the ~2-3% throughput gap between the two configs.
Load testing with real workloads — benchmark with production prompt distributions (variable lengths, shared prefixes, tool calls) rather than synthetic fixed-length tests.
Add monitoring — Prometheus + Grafana for real-time GPU utilization and token throughput tracking.
Automated memory pressure management — script the cache-drop loop and readahead restore into the Docker entrypoint rather than relying on wrapper scripts.
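For the last item, one possible shape is a loop that runs only while the model loads and stops itself afterwards. This is a hypothetical sketch, not something we run today: the function names are ours, writing to drop_caches needs root, and the readahead restore is left out.

```python
import threading
import time

DROP_CACHES = "/proc/sys/vm/drop_caches"


def write_drop_caches():
    """Equivalent of `echo 3 > /proc/sys/vm/drop_caches` (needs root)."""
    with open(DROP_CACHES, "w") as f:
        f.write("3\n")


def drop_caches_loop(stop, interval=0.5, drop=write_drop_caches):
    """Call drop() every `interval` seconds until `stop` is set; return the call count."""
    calls = 0
    while not stop.is_set():
        drop()
        calls += 1
        stop.wait(interval)   # returns early as soon as stop is set
    return calls


def run_during(load_fn, interval=0.5, drop=write_drop_caches):
    """Run load_fn() while the drop loop runs in a background thread."""
    stop = threading.Event()
    worker = threading.Thread(target=drop_caches_loop, args=(stop, interval, drop))
    worker.start()
    try:
        return load_fn()
    finally:
        stop.set()
        worker.join()
```

Here load_fn would be whatever blocks until the server's health check returns 200, so the loop ends with loading instead of running indefinitely.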

The Config

For reference, our current stable configuration running on the 2-node cluster:

--served-model-name deepseek-v4-flash
--max-model-len 180000
--max-num-seqs 2
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.78
--kv-cache-dtype fp8
--block-size 256
--enable-prefix-caching
--compilation-config '{"cudagraph_mode":"PIECEWISE","custom_ops":["all"]}'
--trust-remote-code
--host 0.0.0.0
--port 8000
--load-format safetensors

Env vars for stability:

export VLLM_SKIP_INIT_MEMORY_CHECK=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Kernel tuning (run on both nodes before launching):

# Reserve memory for critical allocations
sudo sysctl vm.min_free_kbytes=3145728

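# Reduce readahead to 16 sectors (8 KB) during model loading, then restore 256 sectors (128 KB) after 180s
sudo blockdev --setra 16 /dev/nvme0n1
(sleep 180 && sudo blockdev --setra 256 /dev/nvme0n1) &

# Periodic cache drop during model loading; kill this background loop once the model has loaded
sudo sh -c 'while true; do echo 3 > /proc/sys/vm/drop_caches; sleep 0.5; done' &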

# Aggressive VM reclaim
sudo sysctl vm.vfs_cache_pressure=200
sudo sysctl vm.dirty_ratio=5
sudo sysctl vm.dirty_background_ratio=2

Distributed launch across two nodes:

# Node 1 (master):
--nnodes 2 --node-rank 0 --master-addr <master-ip> --master-port 29501

# Node 2 (worker):
--nnodes 2 --node-rank 1 --master-addr <master-ip> --master-port 29501
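Because the two command lines differ only in the rank, it helps to generate them from one source. A minimal sketch with a subset of the flags above; the helper name and the placeholder IP are ours, and the launcher command itself is deliberately left out:

```python
import shlex

# Flags copied from the article's stable config; None marks a bare switch.
COMMON_FLAGS = {
    "--max-model-len": 180000,
    "--max-num-seqs": 2,
    "--max-num-batched-tokens": 8192,
    "--gpu-memory-utilization": 0.78,
    "--enable-prefix-caching": None,
}


def node_args(node_rank, master_addr, nnodes=2, master_port=29501):
    """Return the argument list for one node; only --node-rank differs between nodes."""
    args = []
    for flag, value in COMMON_FLAGS.items():
        args.append(flag)
        if value is not None:
            args.append(str(value))
    args += ["--nnodes", str(nnodes), "--node-rank", str(node_rank),
             "--master-addr", master_addr, "--master-port", str(master_port)]
    return args


for rank in (0, 1):
    # 192.0.2.10 is a documentation-only placeholder for the master IP.
    print(f"node {rank}:", shlex.join(node_args(rank, "192.0.2.10")))
```

Keeping the shared flags in one place avoids the easy mistake of changing gpu-memory-utilization on one node and not the other.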

Community Recipe Update

The NVIDIA Developer Forum thread (post #100+) has since converged on an optimized recipe with additional improvements. Key differences from the config above:

Tokenizer Fix

llama-benchy defaults its tokenizer to the served model name (e.g. deepseek-v4-flash), which is not a Hugging Face model ID. It silently falls back to gpt2 (max 1024 tokens), causing all benchmarks with longer contexts to produce garbage results. Fix:

# Pass the real tokenizer explicitly
--tokenizer deepseek-ai/DeepSeek-V4-Flash

Additional Performance Flags

The community recipe adds several flags that further improve throughput and stability:

Flag / Env	Value	Benefit
`--distributed-executor-backend`	`mp`	Same as `--no-ray`, explicit multiprocessing
`--compilation-config`	`{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}`	Enables CUDA graphs on GB10 without crashes
`--speculative-config`	`{"method":"deepseek_mtp","num_speculative_tokens":2}`	MTP speculation (2 tokens per forward pass)
`--disable-custom-all-reduce`	(flag)	Falls back to NCCL all-reduce, avoiding issues with vLLM's custom all-reduce kernel on GB10
`OMP_NUM_THREADS`	`8`	Caps OpenMP parallelism at 8 threads (the GB10 has 20 CPU cores)
`VLLM_USE_FLASHINFER_SAMPLER`	`1`	Uses FlashInfer's optimized sampler
`gpu-memory-utilization`	`0.83`	Higher than our 0.78 (0.82+ triggered OOM on our 119 GiB node); headroom verified by the community

Higher Throughput Configs

With these flags and the updated jasl/vllm fork, the community is running stable at:

max-num-seqs=4 (vs our 2)
max-num-batched-tokens=16384 (vs our 8192)
max-model-len up to 393216 (vs our 180000)

These settings roughly double concurrent throughput while maintaining stability. See the full recipe discussion for the latest benchmarks and config iterations.

Note on hardware asymmetry: The community recipe assumes symmetric 128 GB nodes. If your cluster has mixed RAM capacities (e.g., 121 GiB + 119 GiB), max-num-seqs=4 and gpu-memory-utilization=0.83 may cause OOM on the smaller node. Start with max-num-seqs=2 and gpu-memory-utilization=0.78, then scale up from there.

Stability Tuning for Asymmetric Hardware

The 2 GiB RAM difference between our two nodes (121 GiB vs 119 GiB) created a subtle stability problem: the smaller node consistently ran out of physical pages first during the 148 GiB checkpoint mmap. The page cache from streaming 46 safetensors files would consume the last free pages before vLLM could allocate KV cache, triggering NV_ERR_NO_MEMORY in the NVIDIA UVM driver.

The fix involved three kernel-level changes:

Reserve emergency memory — vm.min_free_kbytes=3145728 (3 GiB) on both nodes forces the kernel to keep pages free for critical allocations even under heavy I/O pressure.
Reduce NVMe readahead — blockdev --setra 16 (8 KB) during model loading prevents the kernel from aggressively caching the checkpoint stream. Restored to 256 (128 KB) after 180s.
Periodic cache drops — echo 3 > /proc/sys/vm/drop_caches every 0.5s during loading clears transient page cache before it can exhaust available pages. The interval must be tuned: too fast (0.2s) causes I/O thrashing and doubles load time; too slow (1s+) lets memory pressure build.

After applying these, the cluster has been running continuously without a single OOM crash.
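The units in these settings are easy to misread, so here is a short sketch of the arithmetic. It relies on blockdev --setra counting 512-byte sectors, and uses 100 s as a round stand-in for the ~95-101 s load time:

```python
SECTOR_BYTES = 512  # blockdev --setra takes a count of 512-byte sectors


def readahead_kib(sectors):
    """Convert a --setra sector count to KiB."""
    return sectors * SECTOR_BYTES / 1024


def drops_during_load(load_seconds, interval_seconds):
    """Approximate number of drop_caches writes issued while the model loads."""
    return round(load_seconds / interval_seconds)


print(readahead_kib(16), readahead_kib(256))       # 8.0 128.0
for interval in (0.2, 0.5, 1.0):
    print(interval, drops_during_load(100, interval))
```

At 0.5 s the loop fires roughly 200 times per load; the 0.2 s setting more than doubles that, which fits the I/O thrashing we saw at shorter intervals.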

Conclusion

A two-node DGX Spark cluster running DeepSeek V4 Flash is not only feasible — it's production-viable. With the configuration changes documented here, we moved from 5.5 TPS to 15 TPS while enabling CUDA graphs on a platform where they were previously thought unstable.

The key takeaway: start conservative, measure everything, and apply changes incrementally. The biggest gains came from the simplest configuration changes, not from exotic optimizations. FlashInfer autotune alone accounted for the majority of the throughput improvement, and it required nothing more than removing a --no-enable-flashinfer-autotune flag.

A few surprises worth remembering:

CUDA graphs work on GB10 — but only with piecewise compilation (PIECEWISE or FULL_AND_PIECEWISE). Default monolithic capture crashes the cluster.
Unified memory is not infinite — the 148 GiB checkpoint mmap floods the page cache faster than you'd expect. Kernel-level memory tuning (min_free_kbytes, readahead, cache drops) is essential for reliable multi-node operation.
Hardware asymmetry matters at the margins — 2 GiB difference between nodes is enough to make one node the consistent bottleneck. Tune for the smaller node, not the larger one.

For anyone running DeepSeek V4 Flash on DGX Spark hardware: use our config as a starting point, enable CUDA graphs with PIECEWISE mode, tune your memory settings, and push from there.

Subjectively, DeepSeek V4 Flash running locally on this DGX Spark cluster feels much smarter and more capable than MiniMax 2.7 recipes available through hosted APIs. The reasoning depth, code generation quality, and ability to follow complex instructions show a noticeable improvement in practical use.
