Two-Node DGX Spark Cluster: DeepSeek V4 Flash with vLLM — Config, Optimizations & Community Recipe
Two NVIDIA DGX Sparks. One 685B-parameter MoE model. A 3× throughput improvement and, after one hard lockup along the way, a configuration that has run for days without a crash. Here's how.
The Setup: Two GB10 Workstations as a Distributed Inference Cluster
NVIDIA's DGX Spark (GB10 Grace Blackwell Superchip) is designed as a desktop AI workstation, but with NVLink-C2C and fast networking, it scales. Our cluster connects two DGX Sparks across a local network:
| Node | Hostname | IP | Role |
|---|---|---|---|
| DGX Spark 1 | ai1 | — | Primary (master) |
| DGX Spark 2 | ai2 | — | Secondary (worker) |
Each node carries a GB10 SoC with 128 GB LPDDR5x unified memory (≈119-121 GiB available to the OS). One node (ASUS Ascent) has 121 GiB, the other (ThinkStation) has 119 GiB, a 2 GiB asymmetry that matters for memory-pressure tuning. The pair runs DeepSeek-V4-Flash, a Mixture-of-Experts model with 685B total parameters (37B active per token), via vLLM 0.21.1rc1 in Docker, using tensor parallelism across both nodes (TP=2) over NCCL/RDMA.
Baseline: Before Optimization
The initial configuration was conservative:
| Parameter | Baseline Value |
|---|---|
| max-num-seqs | 1 |
| max-num-batched-tokens | 4096 |
| gpu-memory-utilization | 0.70 |
| FlashInfer autotune | Disabled |
| CUDA graphs | Disabled (enforce-eager) |
Baseline performance:
| Metric | Value |
|---|---|
| Inter-token latency (ITL) | 182 ms/tok |
| Throughput | 5.5 tokens/second |
| KV cache capacity | ~482K tokens |
| TTFT (long context) | 88.5 seconds |
| GPU utilization (node 2) | 37% |
| Prefix cache hit rate | 72% |
The server handled 37 requests with 896K prompt tokens and 20K generation tokens — functional but far from the hardware's potential. GPU utilization sat at 37% on the secondary node. Clearly there was headroom.
The Optimizations: Safe, Incremental, Measured
Ranking the Impact
Before touching anything, we ranked optimizations by expected impact:
- Increase max-num-seqs: single greatest throughput multiplier (3-6×)
- Raise gpu-memory-utilization: more KV cache headroom for concurrent requests
- Enable CUDA graphs: reduce per-step overhead (10-20% ITL improvement)
- Increase max-num-batched-tokens: larger batch capacity
- Enable FlashInfer autotune: better attention kernel selection
The CUDA Graph Crash (And What We Learned)
The first change we tried was enabling CUDA graphs by removing --enforce-eager. vLLM initialized successfully, performed distributed setup across both nodes, and then the whole system locked up. Both DGX Sparks became unresponsive at the same time: SSH timed out, ping failed, and nothing could be recovered over the network.
This was a kernel panic or GPU hang triggered by CUDA graph compilation on DeepSeek V4's MoE architecture. Our working hypothesis is that the GB10's NVLink-C2C coupled with the model's dynamic expert routing hit an edge case that took down the entire cluster. A hard reboot was the only way back.
Lesson: Standard (monolithic) CUDA graphs + MoE on GB10 is unstable. Skip the default mode. The enforce-eager mode works, but you can do better — see the update below.
Update: PIECEWISE CUDA Graphs Work Stably
Since publishing, we and the community confirmed that piecewise CUDA graph compilation works on GB10 without crashes. Two modes are available in the jasl/vllm fork with SM12x support (PR #41834):
| Mode | Stability | Coverage |
|---|---|---|
| PIECEWISE | ✅ Stable | Captures the most frequent graph segments |
| FULL_AND_PIECEWISE | ✅ Stable | Adds monolithic capture of hot loops on top |
Both modes compile graph segments incrementally instead of all at once, which avoids the MoE routing edge case that caused the kernel panic. Remove --enforce-eager and add:
compilation-config:
'{"cudagraph_mode":"PIECEWISE","custom_ops":["all"]}'
On our cluster with PIECEWISE mode, CUDA graphs compile successfully across both nodes, and the cluster has been running for days without a crash. The performance gain over enforce-eager is modest (~8-12% ITL improvement) but every millisecond counts at production scale.
Safe Optimizations Applied
After the CUDA graph crash, we enabled piecewise compilation and applied the remaining optimizations:
| Parameter | Before | After | Rationale |
|---|---|---|---|
| max-num-seqs | 1 | 2 | Double concurrent request capacity |
| max-num-batched-tokens | 4096 | 8192 | Larger batch for prefix cache efficiency |
| gpu-memory-utilization | 0.70 | 0.78 | More KV cache headroom; 0.82+ triggered OOM on the 119 GiB node |
| CUDA graphs | disabled | PIECEWISE | Works stably on GB10 with piecewise compilation |
| FlashInfer autotune | disabled | enabled | Better attention kernel selection |
Why 0.78 instead of 0.82? The two nodes have asymmetric RAM (121 GiB vs 119 GiB). The 119 GiB node becomes the bottleneck: at 0.82 gpu_mem, the requested allocation (98.1 GiB) exceeds available free memory after the OS reserves pages for I/O. Setting vm.min_free_kbytes=3145728 (3 GiB reserved) on both nodes prevents OOM during the 148 GiB checkpoint mmap, but forces a slightly lower utilization to fit within the smaller node's budget.
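The budget arithmetic behind that choice can be sketched in shell. This is a rough model, not vLLM's actual accounting: it assumes the requested allocation is roughly gpu-memory-utilization times the node's total memory. The function names budget_gib and fits are hypothetical, and the free-memory figure passed to fits is illustrative, not a measurement from our cluster. Because the inputs are nominal (119 GiB), the output will not exactly match the 98.1 GiB quoted above, which implies a reported total closer to 119.6 GiB.

```shell
#!/usr/bin/env bash
# Sketch: estimate the memory vLLM will try to claim on a node.
# Assumption: budget ~= gpu-memory-utilization x total visible memory.

# budget_gib TOTAL_GIB UTIL -> requested budget in GiB, one decimal place
budget_gib() {
  awk -v total="$1" -v util="$2" 'BEGIN { printf "%.1f\n", total * util }'
}

# fits TOTAL_GIB UTIL FREE_GIB RESERVE_GIB -> exit 0 if the budget fits into
# free memory once the vm.min_free_kbytes reserve has been subtracted
fits() {
  awk -v total="$1" -v util="$2" -v free="$3" -v reserve="$4" \
    'BEGIN { exit !((total * util) <= (free - reserve)) }'
}

# Size the setting for the smaller node (119 GiB), not the larger one.
echo "0.78 -> $(budget_gib 119 0.78) GiB, 0.82 -> $(budget_gib 119 0.82) GiB"
```

With a hypothetical 100 GiB free and the 3 GiB reserve, fits 119 0.78 100 3 succeeds while fits 119 0.82 100 3 fails, which is the shape of the problem described above.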
The cluster uses vLLM's --no-ray distributed executor (PyTorch native distributed), which handles the --nnodes 2 --node-rank N --master-addr <master-ip> --master-port 29501 wiring automatically.
Startup metrics:
- Model loading: 74.02 GiB memory per node, ~95-101 seconds on slower NVMe
- KV cache: ~13 GiB, ~600K tokens capacity
- CUDA graph compilation: PIECEWISE — completes on both nodes without crash
- Server live on http://0.0.0.0:8000, health check 200 OK
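Since model loading takes well over a minute, it helps to wait for the health check instead of guessing. A minimal sketch, assuming the server exposes /health on port 8000 as in the startup log above; the function name wait_for_health and its timeout defaults are our own illustrative choices.

```shell
#!/usr/bin/env bash
# Sketch: poll the /health endpoint until it answers 200 or a deadline passes.

# wait_for_health URL [TIMEOUT_S] [INTERVAL_S]
wait_for_health() {
  local url="$1" timeout="${2:-600}" interval="${3:-5}"
  local deadline=$(( $(date +%s) + timeout ))
  while :; do
    # -f turns HTTP errors into a non-zero exit; --max-time bounds each probe
    if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}
```

Typical use after launching both nodes: wait_for_health "http://<master-ip>:8000/health" 900 && echo "server ready".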
Benchmark Results: 3× Throughput Improvement

Short Prompt Inference (ITL Test)
Five runs with 12 prompt tokens → 200 generation tokens:
| Run | Duration | Tokens | TPS | ms/tok |
|---|---|---|---|---|
| 1 | 12.54s | 200 | 16.0 | 63 |
| 2 | 12.57s | 200 | 15.9 | 63 |
| 3 | 12.55s | 200 | 15.9 | 63 |
| 4 | 12.54s | 200 | 16.0 | 63 |
| 5 | 12.52s | 200 | 16.0 | 63 |
Before: 182 ms/tok → After: 63 ms/tok — 2.9× speedup.
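For anyone repeating the measurement, the per-run figures reduce to two small calculations. The helper names ms_per_tok and speedup below are hypothetical, and the inputs are taken from the table and baseline above.

```shell
#!/usr/bin/env bash
# Sketch: derive ms/tok from a run's wall time, and the speedup vs. baseline.

# ms_per_tok DURATION_S TOKENS -> milliseconds per generated token (rounded)
ms_per_tok() {
  awk -v d="$1" -v n="$2" 'BEGIN { printf "%.0f\n", d * 1000 / n }'
}

# speedup BEFORE_MS AFTER_MS -> ratio with one decimal place
speedup() {
  awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

echo "run 1: $(ms_per_tok 12.54 200) ms/tok, speedup $(speedup 182 63)x"
# prints: run 1: 63 ms/tok, speedup 2.9x
```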
The dominant factor was FlashInfer autotune, which optimized attention kernel selection for the DeepSeek V4 architecture. Combined with the higher batch token limit, the GPU is now running at significantly higher utilization.
Prefix Cache Efficiency
Prefix caching (--enable-prefix-caching) lets vLLM reuse KV cache blocks for prompts that share a prefix. Testing with ~1200-token prompt contexts:
| Scenario | Duration | TPS | vs Cold |
|---|---|---|---|
| Cold (first run) | 17.56s | 2.8 | 1.0× |
| Same prompt (cached) | 8.38s | 6.0 | 2.1× |
| Similar prompt (deep cached) | 4.03s | 12.4 | 4.4× |
The prefix cache is highly effective. Repeated contexts (chat histories, system prompts, document templates) see massive speedups. For workloads with shared prefixes — like agentic systems where every request starts with the same system prompt — this is transformative.
Concurrent Request Handling (max-num-seqs=2)
With max-num-seqs=2, the cluster handles two concurrent requests efficiently:
4 concurrent requests completed in 14.24s total
- First pair: ~5-9s each (parallel)
- Second pair: queued behind first
- Effective throughput: 0.28 req/s for 100-token generations
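For production use, raising max-num-seqs further (4-8) would multiply throughput, provided memory allows. KV cache utilization sat at 52% with max-num-seqs=2, so the cache itself has headroom, but on our asymmetric cluster max-num-seqs=4 ran into OOM on the 119 GiB node (see What Didn't).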
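A concurrency test of this kind can be approximated with curl and background jobs. This is a sketch rather than the harness we used: BASE_URL, the prompt text, and the function names build_payload and run_concurrent are placeholders, and the model name is the --served-model-name from the config in this article. It targets the OpenAI-compatible /v1/completions route.

```shell
#!/usr/bin/env bash
# Sketch: fire N identical completion requests in parallel and time the batch.
# Assumptions: BASE_URL and the prompt are placeholders for your own setup.

BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-deepseek-v4-flash}"

# build_payload MAX_TOKENS -> JSON body for /v1/completions
build_payload() {
  printf '{"model":"%s","prompt":"Explain tensor parallelism briefly.","max_tokens":%d}' \
    "$MODEL" "$1"
}

# run_concurrent N MAX_TOKENS -> prints wall time and requests per second
run_concurrent() {
  local n="$1" max_tokens="$2" start end i
  start=$(date +%s.%N)
  for i in $(seq 1 "$n"); do
    curl -sS -o /dev/null -H 'Content-Type: application/json' \
      -d "$(build_payload "$max_tokens")" "$BASE_URL/v1/completions" &
  done
  wait   # block until every background request has finished
  end=$(date +%s.%N)
  awk -v s="$start" -v e="$end" -v n="$n" \
    'BEGIN { t = e - s; printf "%d requests in %.2fs, %.2f req/s\n", n, t, n / t }'
}
```

Running run_concurrent 4 100 mirrors the test above: four requests against a server that admits two at a time, so the second pair queues behind the first.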
CUDA Graphs with PIECEWISE Mode
With PIECEWISE CUDA graphs enabled and --enforce-eager removed, the cluster achieves stable decode performance:
| Metric | Value | Conditions |
|---|---|---|
| TPOT | ~66 ms/tok | Short prompt, 100 output tokens |
| Throughput | ~15 tok/s | Single request decode |
| TTFT (short) | ~750 ms | 16-token prompt, streaming |
| TTFT (long, cold) | ~6 s | ~1300-token prompt, first run |
| TTFT (long, cached) | ~210 ms | Same prompt, KV cache warm |
| Concurrent (2 req) | ~4.3 s total | Both complete overlapping |
The PIECEWISE mode contributes ~8-12% ITL improvement over enforce-eager, measured consistently across multiple runs. The prefill benefits more from cached KV than from CUDA graphs — the 210ms TTFT on cached long prompts shows the prefix cache is the real latency killer.
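TTFT figures like these can be reproduced by timestamping the first streamed chunk. The sketch below is one way to do it, not our exact tooling: it assumes the server streams server-sent events with "data:" lines on /v1/completions, and BASE_URL, first_data_time and measure_ttft are hypothetical names. The prompt is interpolated into JSON as-is, so it must not contain quotes or backslashes.

```shell
#!/usr/bin/env bash
# Sketch: measure time-to-first-token from a streaming completion.

BASE_URL="${BASE_URL:-http://localhost:8000}"
MODEL="${MODEL:-deepseek-v4-flash}"

# first_data_time: read an SSE stream on stdin, print the epoch time at which
# the first "data:" line arrives, then drain the rest of the stream.
first_data_time() {
  local line seen=0
  while IFS= read -r line; do
    case "$line" in
      data:*) if [ "$seen" = 0 ]; then date +%s.%N; seen=1; fi ;;
    esac
  done
}

# measure_ttft PROMPT -> prints TTFT in milliseconds
measure_ttft() {
  local start first
  start=$(date +%s.%N)
  first=$(curl -sS -N -H 'Content-Type: application/json' \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"$1\",\"max_tokens\":100,\"stream\":true}" \
    "$BASE_URL/v1/completions" | first_data_time)
  awk -v s="$start" -v f="$first" 'BEGIN { printf "TTFT %.0f ms\n", (f - s) * 1000 }'
}
```

Calling measure_ttft twice with the same long prompt gives a cold and a warm reading, which is the comparison the table above makes.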
Performance Summary
| Metric | Baseline | FlashInfer (+enforce-eager) | PIECEWISE CUDA graphs |
|---|---|---|---|
| ITL | 182 ms/tok | 63 ms/tok | ~66 ms/tok |
| Throughput | 5.5 TPS | 15.9 TPS | ~15 TPS |
| KV cache | 482K tokens | 865K tokens | ~600K tokens |
| GPU utilization | ~37% | ~65%+ (est.) | ~60%+ (est.) |
| Prefix cache speedup | - | Up to 4.4× | Up to 4.4× |
| CUDA graphs | disabled | disabled (crash) | PIECEWISE stable |
Lessons Learned
What Worked
- FlashInfer autotune is the highest-impact safe change. It tunes attention kernels to the model architecture and hardware, improving both latency and throughput without any configuration risk.
- Incremental deployment with health checks. Changing one parameter at a time and verifying stability prevented cascading failures.
- Prefix caching is essential for multi-turn workloads. With 72%+ hit rates, it effectively doubles throughput for chat and agent applications.
- Piecewise CUDA graph compilation: both PIECEWISE and FULL_AND_PIECEWISE modes work on GB10, avoiding the monolithic capture that caused the kernel panic. The PIECEWISE mode alone is sufficient for stable operation.
- Kernel-level memory tuning: setting vm.min_free_kbytes (3 GiB on the bottleneck node), reducing NVMe readahead to 8 KB during model loading, and running a periodic cache drop loop prevents the page cache flood from the 148 GiB checkpoint mmap from starving the UVM driver.
What Didn't
- Monolithic CUDA graphs + MoE on GB10 crashes the cluster. The default full-graph capture triggers a kernel panic. Switch to PIECEWISE or FULL_AND_PIECEWISE mode instead.
- Ray distributed executor had world-size issues. The PyTorch native distributed executor (--no-ray mode) was more reliable for small clusters.
- max-num-seqs=1 is unnecessarily conservative. Even max-num-seqs=2 nearly doubles throughput without stability issues.
- max-num-seqs=4 causes OOM on 119 GiB nodes. The KV cache for 4 concurrent sequences plus CUDA graph memory exceeds available unified memory on the smaller node. Stick to max-num-seqs=2 on asymmetric clusters.
- gpu-memory-utilization above 0.78 is risky on 119 GiB nodes. The asymmetric RAM (121 vs 119 GiB) means the smaller node runs out of physical pages first. 0.78 provides a reliable safety margin; 0.82+ triggers OOM during KV cache allocation.
What's Next
- Multi-node MTP speculation: enabling DeepSeek's native MTP (Multi-Token Prediction) across 2 nodes requires careful memory budgeting. The drafter model adds ~20-30 GiB overhead per node; likely needs gpu-memory-utilization reduced to 0.70-0.72 to fit.
- FlashInfer sampler on top of CUDA graphs: combining VLLM_USE_FLASHINFER_SAMPLER=1 with PIECEWISE CUDA graphs may recover the ~2-3% throughput gap between the two configs.
- Load testing with real workloads: benchmark with production prompt distributions (variable lengths, shared prefixes, tool calls) rather than synthetic fixed-length tests.
- Add monitoring: Prometheus + Grafana for real-time GPU utilization and token throughput tracking.
- Automated memory pressure management: script the cache-drop loop and readahead restore into the Docker entrypoint rather than relying on wrapper scripts.
The Config
For reference, our current stable configuration running on the 2-node cluster:
--served-model-name deepseek-v4-flash
--max-model-len 180000
--max-num-seqs 2
--max-num-batched-tokens 8192
--gpu-memory-utilization 0.78
--kv-cache-dtype fp8
--block-size 256
--enable-prefix-caching
--compilation-config '{"cudagraph_mode":"PIECEWISE","custom_ops":["all"]}'
--trust-remote-code
--host 0.0.0.0
--port 8000
--load-format safetensors
Env vars for stability:
export VLLM_SKIP_INIT_MEMORY_CHECK=1
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Kernel tuning (run on both nodes before launching):
# Reserve memory for critical allocations
sudo sysctl vm.min_free_kbytes=3145728
# Reduce readahead during model loading (restore after 180s)
sudo blockdev --setra 16 /dev/nvme0n1
# Periodic cache drop during model loading (stop this loop once the model has loaded)
sudo sh -c 'while true; do echo 3 > /proc/sys/vm/drop_caches; sleep 0.5; done' &
# Aggressive VM reclaim
sudo sysctl vm.vfs_cache_pressure=200
sudo sysctl vm.dirty_ratio=5
sudo sysctl vm.dirty_background_ratio=2
Distributed launch across two nodes:
# Node 1 (master):
--nnodes 2 --node-rank 0 --master-addr <master-ip> --master-port 29501
# Node 2 (worker):
--nnodes 2 --node-rank 1 --master-addr <master-ip> --master-port 29501
Community Recipe Update
The NVIDIA Developer Forum thread (post #100+) has since converged on an optimized recipe with additional improvements. Key differences from the config above:
Tokenizer Fix
llama-benchy defaults its tokenizer to the served model name (e.g. deepseek-v4-flash), which is not a Hugging Face model ID. It silently falls back to gpt2 (max 1024 tokens), causing all benchmarks with longer contexts to produce garbage results. Fix:
# Pass the real tokenizer explicitly
--tokenizer deepseek-ai/DeepSeek-V4-Flash
Additional Performance Flags
The community recipe adds several flags that further improve throughput and stability:
| Flag / Env | Value | Benefit |
|---|---|---|
| --distributed-executor-backend | mp | Same as --no-ray, explicit multiprocessing |
| --compilation-config | {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]} | Enables CUDA graphs on GB10 without crashes |
| --speculative-config | {"method":"deepseek_mtp","num_speculative_tokens":2} | MTP speculation (2 tokens per forward pass) |
| --disable-custom-all-reduce | (flag) | Falls back to NCCL all-reduce, avoiding custom all-reduce kernel issues on GB10 |
| OMP_NUM_THREADS | 8 | Caps OpenMP parallelism at 8 threads per process |
| VLLM_USE_FLASHINFER_SAMPLER | 1 | Uses FlashInfer's optimized sampler |
| gpu-memory-utilization | 0.83 | Higher than our 0.78 (0.82+ triggered OOM on our 119 GiB node); safe headroom verified by the community |
Higher Throughput Configs
With these flags and the updated jasl/vllm fork, the community is running stable at:
- max-num-seqs=4 (vs our 2)
- max-num-batched-tokens=16384 (vs our 8192)
- max-model-len up to 393216 (vs our 180000)
These settings roughly double concurrent throughput while maintaining stability. See the full recipe discussion for the latest benchmarks and config iterations.
Note on hardware asymmetry: The community recipe assumes symmetric 128 GB nodes. If your cluster has mixed RAM capacities (e.g., 121 GiB + 119 GiB), max-num-seqs=4 and gpu-memory-utilization=0.83 may cause OOM on the smaller node. Start with max-num-seqs=2 / gpu_mem=0.78 and scale up from there.
Stability Tuning for Asymmetric Hardware
The 2 GiB RAM difference between our two nodes (121 GiB vs 119 GiB) created a subtle stability problem: the smaller node consistently ran out of physical pages first during the 148 GiB checkpoint mmap. The page cache from streaming 46 safetensors files would consume the last free pages before vLLM could allocate KV cache, triggering NV_ERR_NO_MEMORY in the NVIDIA UVM driver.
The fix involved three kernel-level changes:
- Reserve emergency memory: vm.min_free_kbytes=3145728 (3 GiB) on both nodes forces the kernel to keep pages free for critical allocations even under heavy I/O pressure.
- Reduce NVMe readahead: blockdev --setra 16 (8 KB) during model loading prevents the kernel from aggressively caching the checkpoint stream. Restored to 256 (128 KB) after 180s.
- Periodic cache drops: echo 3 > /proc/sys/vm/drop_caches every 0.5s during loading clears transient page cache before it can exhaust available pages. The interval must be tuned: too fast (0.2s) causes I/O thrashing and doubles load time; too slow (1s+) lets memory pressure build.
After applying these, the cluster has been running continuously without a single OOM crash.
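The three changes can be wrapped so the temporary ones are undone once loading finishes. The following is a sketch of that idea, not our production wrapper: DEVICE, LOAD_WINDOW_S, DRY_RUN and the function names are illustrative, the values are the ones quoted above, and DRY_RUN defaults to 1 here so the script only prints what it would run.

```shell
#!/usr/bin/env bash
# Sketch: apply the load-time tuning, then undo the temporary parts.
# Assumption: DRY_RUN=1 (the default here) prints commands instead of running them.

DEVICE="${DEVICE:-/dev/nvme0n1}"
LOAD_WINDOW_S="${LOAD_WINDOW_S:-180}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

apply_load_tuning() {
  run sudo sysctl -w vm.min_free_kbytes=3145728   # keep 3 GiB free
  run sudo blockdev --setra 16 "$DEVICE"          # 16 sectors = 8 KB readahead
}

# Drop the page cache every 0.5 s, but only for the given number of seconds.
drop_caches_for() {
  local end=$(( $(date +%s) + $1 ))
  while [ "$(date +%s)" -lt "$end" ]; do
    run sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    sleep 0.5
  done
}

restore_readahead() {
  run sudo blockdev --setra 256 "$DEVICE"         # 256 sectors = 128 KB
}
```

A wrapper would call apply_load_tuning, start vLLM, run drop_caches_for "$LOAD_WINDOW_S", then restore_readahead. Bounding the loop by time avoids leaving a cache-drop loop running after the model is resident.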
Conclusion
A two-node DGX Spark cluster running DeepSeek V4 Flash is not only feasible — it's production-viable. With the configuration changes documented here, we moved from 5.5 TPS to 15 TPS while enabling CUDA graphs on a platform where they were previously thought unstable.
The key takeaway: start conservative, measure everything, and apply changes incrementally. The biggest gains came from the simplest configuration changes, not from exotic optimizations. FlashInfer autotune alone accounted for the majority of the throughput improvement, and it required nothing more than removing a --no-enable-flashinfer-autotune flag.
A few surprises worth remembering:
- CUDA graphs work on GB10, but only with piecewise compilation (PIECEWISE or FULL_AND_PIECEWISE). Default monolithic capture crashes the cluster.
- Unified memory is not infinite: the 148 GiB checkpoint mmap floods the page cache faster than you'd expect. Kernel-level memory tuning (min_free_kbytes, readahead, cache drops) is essential for reliable multi-node operation.
- Hardware asymmetry matters at the margins: a 2 GiB difference between nodes is enough to make one node the consistent bottleneck. Tune for the smaller node, not the larger one.
For anyone running DeepSeek V4 Flash on DGX Spark hardware: use our config as a starting point, enable CUDA graphs with PIECEWISE mode, tune your memory settings, and push from there.
Subjectively, DeepSeek V4 Flash running locally on this DGX Spark cluster feels much smarter and more capable than MiniMax 2.7 recipes available through hosted APIs. The reasoning depth, code generation quality, and ability to follow complex instructions show a noticeable improvement in practical use.
See Also
- Taming OOM on DGX Spark — The same 2-node cluster, debugging the page cache / UVM interaction that made model loading unreliable
- n8n Automation on GB10: AI-Powered Workflows — Using the same GB10 hardware for self-hosted workflow automation with n8n
- Atlas Engine on DGX Spark — Multi-model orchestration with a Rust-based inference engine on single-node GB10
- NVIDIA Developer Forum Discussion — Community recipe, benchmarks, and ongoing optimizations for DeepSeek V4 Flash on 2-node DGX Spark