~/blog/dgx-spark-deployment-guide

DGX Spark · part 0

[DGX Spark] From Unboxing to Running: Complete Deployment Guide

cat --toc

TL;DR

From sealed box to serving LLMs in under an hour. Ollama for day one (5 minutes), vLLM for production (30 minutes). Check power delivery before anything else — some units ship defective.

Plain-Language Version: Setting Up Your Personal AI Server

You bought a desktop computer that can run AI models locally — the same kind of models that power ChatGPT, but on your desk, with no cloud fees and no data leaving your network. The problem is that "AI workstation" documentation assumes you already know what you're doing.

This guide assumes you don't. It covers everything from plugging in the power cable to choosing which AI model to run and why, with two paths: a 5-minute quickstart for experimentation, and a production setup for always-on AI agents. Every command is real, tested on this specific hardware, and includes the non-obvious traps that official docs skip.

If you came here from a specific problem (overheating, slow inference, crashes), the series index at the bottom points to the right article.


Preface

A new machine without a deployment guide is a box of potential energy sitting on your desk doing nothing. The gap between "it powers on" and "it's serving useful inference" is where most people's weekends disappear.

This is the article I wish existed when my ASUS Ascent GX10 arrived. Not a benchmark (that's Part 1), not a deep dive into a specific model (that's Parts 2-14), but the straight line from sealed box to working inference endpoint.


Day 1: Hardware Check (Do This First)

Step 1: Power Delivery Verification

Some DGX Spark units ship with a defective PD controller that silently caps power at 30W. The machine boots, runs, and looks normal — just painfully slow. Check this before anything else.

Boot the machine and open two terminals. A single matrix multiply finishes in milliseconds, so loop it to keep the GPU busy while you measure. In the first terminal:

# Sustained GPU workload (roughly a minute of matrix multiplies)
python3 -c "import torch; x = torch.randn(4096, 4096, device='cuda'); [x @ x for _ in range(10000)]; torch.cuda.synchronize(); print('GPU works')"

While it runs, check power draw from the second terminal:

nvidia-smi --query-gpu=power.draw,utilization.gpu,clocks.sm --format=csv,noheader

Healthy output:

35.65 W, 96 %, 2522 MHz

Defective unit (30W safety mode):

4.80 W, 2 %, 2411 MHz

If you see the second output, stop here. This is a hardware defect requiring RMA — no amount of software configuration will fix it. See the full diagnostic guide.
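If you want to script this check for a fleet or a provisioning run, a minimal sketch that classifies the nvidia-smi CSV line is below. The helper name and thresholds are my assumptions, not NVIDIA-documented values:

```python
# Hypothetical helper: classify the nvidia-smi CSV line from the check above.
def looks_power_capped(csv_line, min_watts=10.0):
    """True if the GPU draws almost no power while a workload is running."""
    power_field, util_field, _clocks = [f.strip() for f in csv_line.split(",")]
    watts = float(power_field.split()[0])   # "4.80 W" -> 4.80
    util = float(util_field.rstrip(" %"))   # "2 %"    -> 2.0
    # A healthy unit under sustained load draws well above 10 W
    # at high utilization; the defect pattern is low on both.
    return watts < min_watts and util < 50.0
```

Feeding it the defective sample above returns True; the healthy sample returns False. Only trust the result while the workload is actually running.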

Step 2: Driver and CUDA Version

nvidia-smi

You need:

  • Driver: 580.x or later
  • CUDA: 13.0 or later

Older drivers (550.x + CUDA 12.4) have a separate bug that makes the GPU appear stuck at 5W/0% utilization. This one is fixable with a driver upgrade.
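If you want to fail fast in a setup script, the driver's major version is enough to gate on. A sketch (helper names are mine; the nvidia-smi query flags are the same ones used elsewhere in this guide):

```python
import subprocess

def driver_ok(version, minimum=580):
    """True if the driver's major version meets the 580.x floor above."""
    return int(version.split(".")[0]) >= minimum

def installed_driver_version():
    # Query the installed driver the same way as the other checks here.
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True).strip()
```

`driver_ok("550.54.15")` is False, which is exactly the 5W/0% bug case: upgrade before debugging anything else.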

Step 3: Confirm GPU Identity

nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader

Expected:

NVIDIA GB10, 12.1, 128 GB

The compute capability 12.1 is SM121 — this is not the same as SM100 (GB200) or SM89 (RTX 4090). Many CUDA kernels don't support SM121 yet. This matters when choosing software (Ollama handles it; vLLM needs cu130-nightly).
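The mapping from compute capability to SM name is mechanical, and making it explicit helps when reading kernel compatibility matrices. A tiny hypothetical helper:

```python
def sm_arch(compute_cap):
    """'12.1' -> 'SM121': the naming scheme behind SM121/SM100/SM89."""
    major, minor = compute_cap.split(".")
    return f"SM{major}{minor}"
```

`sm_arch("12.1")` gives "SM121" while an RTX 4090's "8.9" gives "SM89"; kernels compiled for one architecture will not load on the other, which is the root of the software-choice constraints below.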


Path A: Ollama (5 Minutes to First Inference)

Ollama is the fastest path to running a model. One install, one command.

Install

curl -fsSL https://ollama.com/install.sh | sh

Run Your First Model

ollama run qwen3-coder-next

This downloads a ~20 GB model and starts an interactive chat. First run takes a few minutes for the download; subsequent runs start in seconds.

Why Ollama First?

  • Works out of the box on GB10 (SM121 support built in)
  • No Docker, no configuration files, no flags
  • Good for experimenting with different models quickly
  • Ollama also serves an OpenAI-compatible /v1/chat/completions endpoint (its native API lives at /api/chat), which is compatible enough for most tools
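To see the OpenAI-compatible surface in practice, here is a stdlib-only sketch that talks to Ollama's /v1/chat/completions. The helper names are mine; the endpoint path and body shape follow the OpenAI chat convention:

```python
import json
from urllib.request import Request, urlopen

def build_chat_payload(prompt, model="qwen3-coder-next"):
    # Body shape expected by OpenAI-style chat endpoints.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ollama_chat(prompt, base="http://localhost:11434"):
    """One non-streaming chat turn against a local Ollama server."""
    req = Request(
        f"{base}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the body matches the OpenAI shape, the same code works unchanged against the vLLM endpoint set up later, with only the base URL and model name swapped.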

Ollama Limitations (Why You'll Eventually Want vLLM)

  • No prefix caching: every request re-processes the full system prompt. TTFT is 2-4 seconds on long prompts.
  • No NVFP4: Ollama uses GGUF quantization (Q4_K_M, Q8_0). vLLM's NVFP4 is 30% faster on the same model.
  • KEEP_ALIVE trap: Ollama holds models in memory for 2 hours by default. On shared unified memory, this blocks vLLM from starting.

For experimentation and quick testing, Ollama is perfect. For production agent workloads, move to vLLM.


Path B: vLLM (Production Deployment)

vLLM is the production serving engine. More setup, but prefix caching alone justifies the migration for always-on agents.

Prerequisites

# Install Docker if not present
sudo apt-get update && sudo apt-get install -y docker.io nvidia-container-toolkit
sudo systemctl restart docker

# Verify GPU access in Docker
docker run --rm --gpus all nvidia/cuda:13.0-base nvidia-smi

Download a Model

pip install "huggingface_hub[cli,hf_transfer]"

# Qwen3.5-35B — the workhorse for agents
HF_HUB_ENABLE_HF_TRANSFER=1 hf download Qwen/Qwen3.5-35B-A3B-FP8 \
  --local-dir ~/models/qwen35-35b-hf

HF_HUB_ENABLE_HF_TRANSFER=1 enables the Rust-based transfer backend — meaningfully faster for multi-GB downloads.

Before Starting vLLM: Unload Ollama

If Ollama is installed, it may be holding a model in the shared 128 GB memory pool. vLLM will OOM on startup if Ollama is occupying memory.

# Check what Ollama has loaded
curl -s http://localhost:11434/api/ps

# Unload everything
curl -s -X POST http://localhost:11434/api/generate \
  -d '{"model": "MODEL_NAME", "keep_alive": 0}'

The Working Docker Command

docker run -d --name qwen35 --restart unless-stopped \
  --gpus all --ipc host --shm-size 64gb -p 8000:8000 \
  -v ~/models/qwen35-35b-hf:/models/qwen35 \
  vllm/vllm-openai:cu130-nightly \
  --model /models/qwen35 \
  --served-model-name qwen3.5-35b \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 4096 \
  --enable-prefix-caching \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Critical notes:

  • Use cu130-nightly, not the stable image. Stable doesn't support GB10/SM121.
  • Do NOT add --enable-chunked-prefill — it causes a 9x throughput regression on SSM+MoE models.
  • Do NOT add --kv-cache-dtype fp8 — it causes output repetition loops on GB10.

Cold start takes 2-3 minutes (model loading + torch.compile + FlashInfer autotuning). Watch the logs:

docker logs -f qwen35

Ready when you see vLLM engine started.
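Because cold start takes minutes, anything that depends on the endpoint should poll rather than sleep a fixed time. A sketch with an injectable probe (function names are mine, not vLLM's; /health returning 200 once the engine is up is the behavior relied on in the Verify step below):

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

def wait_until_ready(probe, timeout_s=300.0, interval_s=2.0):
    """Call probe() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

def health_probe(url="http://localhost:8000/health"):
    # True once vLLM's health endpoint answers with HTTP 200.
    try:
        return urlopen(url, timeout=2).status == 200
    except URLError:
        return False
```

Usage: `wait_until_ready(health_probe)` blocks until the server answers, then your agent startup can proceed.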

Verify

# Health check
curl -s http://localhost:8000/health

# Test inference
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Hello, what hardware am I running on?"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }'

chat_template_kwargs must be sent with every request; Qwen3.5's thinking mode cannot be disabled server-side.
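Since the flag has to travel with each call, it is worth centralizing the request body in one helper (the function name is mine; the chat_template_kwargs field itself is what vLLM's chat endpoint accepts):

```python
def qwen_chat_body(messages, model="qwen3.5-35b", thinking=False):
    # chat_template_kwargs rides along with every request; vLLM feeds it
    # into the chat template, which is how Qwen3.5's thinking mode is toggled.
    return {
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }
```

Every caller then gets thinking suppressed by default, instead of each tool remembering to add the field.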


Which Model Should You Run?

This depends on your use case and hardware. Here's the decision tree:

On DGX Spark (128 GB, 273 GB/s)

| Use Case | Model | Runtime | Speed | Why |
| --- | --- | --- | --- | --- |
| Agent workloads | Qwen3.5-35B-A3B FP8 | vLLM | 47 tok/s | Best tool calling, SSM hybrid for stable long-context |
| Vision tasks | Gemma 4 26B-A4B NVFP4 | vLLM | 52 tok/s | Multimodal, only 16 GB — leaves room for other models |
| Quick experimentation | Any model | Ollama | varies | One command, no config |
| Maximum capability | gpt-oss-120B MXFP4 | vLLM | ~40 tok/s | 120B parameters, fits in 128 GB |

Speed Floor Principle

Intelligence matters more than speed — but only above a usability threshold. The floor is roughly 15-20 tok/s for interactive use. Below that, you're waiting long enough that the intelligence advantage evaporates into frustration.

On DGX Spark, the 31B Dense model hits 7 tok/s — below the floor. Choose MoE variants instead. On RTX 5090, the same 31B Dense runs at 62 tok/s — well above the floor, making it the smartest model you can comfortably run there.
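The floor is easy to sanity-check with arithmetic. Assuming a typical ~500-token answer:

```python
def wait_seconds(response_tokens, tok_per_s):
    # Pure generation time; ignores TTFT / prompt prefill.
    return response_tokens / tok_per_s
```

At 7 tok/s a 500-token answer takes about 71 seconds of watching text crawl; at 47 tok/s it takes under 11. That gap is the difference between an agent you use and one you abandon.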

Full model comparison with all four Gemma 4 variants across three machines: Complete Guide


The 5 Gotchas Nobody Warns You About

These are the traps that cost hours. Each links to the full article.

1. Power Delivery Defect

Some units are permanently power-capped at 30W. Looks like a software problem. It's hardware. Full diagnostic →

2. Unified Memory Is Shared

Ollama and vLLM compete for the same 128 GB pool. nvidia-smi shows N/A for memory — use vLLM's /metrics endpoint instead. Details →
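A small parser for the Prometheus-format /metrics output covers the common case. The metric name shown is the KV-cache usage gauge vLLM exposes, but treat exact names as version-dependent:

```python
from urllib.request import urlopen

def read_metric(metrics_text, name):
    """Return the first sample whose line starts with `name`, or None."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    return None

def gpu_cache_usage(url="http://localhost:8000/metrics"):
    text = urlopen(url, timeout=2).read().decode()
    # e.g. vllm:gpu_cache_usage_perc{model_name="qwen3.5-35b"} 0.42
    return read_metric(text, "vllm:gpu_cache_usage_perc")
```

This gives you the memory signal nvidia-smi can't provide on unified-memory hardware.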

3. SM121 Is Not SM100

The GB10's compute capability (SM121) is different from GB200 (SM100). Many CUDA kernels don't support it. This is why you need cu130-nightly for vLLM and why llama.cpp crashes on this hardware.

4. FP8 KV Cache Causes Repetition

Adding --kv-cache-dtype fp8 looks like a free memory optimization. On GB10, it causes outputs to degrade into repetition loops after ~500 tokens due to missing calibration data. Root cause →

5. Chunked Prefill Kills SSM Models

--enable-chunked-prefill is a standard vLLM optimization for Transformers. On SSM+MoE hybrids (Qwen3.5, Mamba-based models), it causes a 9x throughput regression. Analysis →


What Was Gained

The shortest path: Ollama installed → ollama run qwen3-coder-next → inference in 5 minutes. No Docker, no flags, no config.

The production path: vLLM with cu130-nightly → prefix caching drops TTFT from 2-4s to 0.12s → agent workloads become responsive.

The diagnostic: one nvidia-smi command before anything else. If power draw shows 5W under load, stop troubleshooting software.


Every section above has a full article behind it:

| Topic | Article |
| --- | --- |
| Model benchmark (8 models) | Part 1: Finding the Best Stack |
| Ollama → vLLM migration | Part 2: Qwen3.5 at 47 tok/s |
| 120B model deployment | Part 3: Nemotron-120B Debug Log |
| FP8 KV cache bug | Part 5: Repetition Loops |
| Power/thermal diagnostic | Part 6: 30W Safety Mode |
| Gemma 4 model selection | Part 14: Complete Guide |
| vLLM vs Ollama speed gap | Part 8: Why 30% Faster |

FAQ

What is the DGX Spark and how much does it cost?
The DGX Spark is NVIDIA's desktop AI workstation with a GB10 Grace Blackwell Superchip — 128 GB unified memory, 273 GB/s bandwidth, SM121 compute. MSRP $4,699 as of February 2026 (up from $3,999 due to memory shortages). The ASUS Ascent GX10 is the same hardware in an ASUS chassis at a similar price point.
How long does it take to go from unboxing to running an LLM on DGX Spark?
About 15 minutes with Ollama (download + one command). About 30-45 minutes with vLLM (Docker setup + model download + configuration). The hardware check takes 2 minutes.
Is the DGX Spark worth it in 2026?
At $4,699, the DGX Spark is worth it if you need to run models larger than 32 GB locally (70B-120B+). Its 128 GB unified memory is unmatched at this price point. For models under 32 GB, an RTX 5090 system ($3,700+ for the GPU alone) is 6.6x faster on bandwidth. The DGX Spark's value is capacity and silence, not raw speed.
DGX Spark vs RTX 5090 — which should I buy for local AI?
DGX Spark: 128 GB memory, 273 GB/s bandwidth, $4,699, runs 120B+ models. RTX 5090: 32 GB GDDR7, 1,792 GB/s bandwidth, ~$3,700+ GPU only, 6.6x faster but limited to ~30B models. Buy DGX Spark for large models and always-on serving. Buy RTX 5090 for maximum speed on smaller models.
Should I use Ollama or vLLM on DGX Spark?
Start with Ollama (works in 5 minutes, great for experimentation). Move to vLLM when you need prefix caching (0.12s vs 2-4s TTFT), production stability, or NVFP4 quantization. Both can coexist on the same machine — just don't run them simultaneously.