DGX Spark · part 0
[DGX Spark] From Unboxing to Running: Complete Deployment Guide
❯ cat --toc
- Plain-Language Version: Setting Up Your Personal AI Server
- Preface
- Day 1: Hardware Check (Do This First)
- Step 1: Power Delivery Verification
- Step 2: Driver and CUDA Version
- Step 3: Confirm GPU Identity
- Path A: Ollama (5 Minutes to First Inference)
- Install
- Run Your First Model
- Why Ollama First?
- Ollama Limitations (Why You'll Eventually Want vLLM)
- Path B: vLLM (Production Deployment)
- Prerequisites
- Download a Model
- Before Starting vLLM: Unload Ollama
- The Working Docker Command
- Verify
- Which Model Should You Run?
- On DGX Spark (128 GB, 273 GB/s)
- Speed Floor Principle
- The 5 Gotchas Nobody Warns You About
- 1. Power Delivery Defect
- 2. Unified Memory Is Shared
- 3. SM121 Is Not SM100
- 4. FP8 KV Cache Causes Repetition
- 5. Chunked Prefill Kills SSM Models
- What Was Gained
- Deep Dive Links
TL;DR
From sealed box to serving LLMs in under an hour. Ollama for day one (5 minutes), vLLM for production (30 minutes). Check power delivery before anything else — some units ship defective.
Plain-Language Version: Setting Up Your Personal AI Server
You bought a desktop computer that can run AI models locally — the same kind of models that power ChatGPT, but on your desk, with no cloud fees and no data leaving your network. The problem is that "AI workstation" documentation assumes you already know what you're doing.
This guide assumes you don't. It covers everything from plugging in the power cable to choosing which AI model to run and why, with two paths: a 5-minute quickstart for experimentation, and a production setup for always-on AI agents. Every command is real, tested on this specific hardware, and includes the non-obvious traps that official docs skip.
If you came here from a specific problem (overheating, slow inference, crashes), the series index at the bottom points to the right article.
Preface
A new machine without a deployment guide is a box of potential energy sitting on your desk doing nothing. The gap between "it powers on" and "it's serving useful inference" is where most people's weekends disappear.
This is the article I wish existed when my ASUS Ascent GX10 arrived. Not a benchmark (that's Part 1), not a deep dive into a specific model (that's Parts 2-14), but the straight line from sealed box to working inference endpoint.
Day 1: Hardware Check (Do This First)
Step 1: Power Delivery Verification
Some DGX Spark units ship with a defective PD controller that silently caps power at 30W. The machine boots, runs, and looks normal — just painfully slow. Check this before anything else.
Boot the machine, open a terminal, and run:
# Start any GPU workload first (a simple matrix multiply works)
python3 -c "import torch; x = torch.randn(4096, 4096, device='cuda'); y = x @ x; print('GPU works')"
# Then check power draw
nvidia-smi --query-gpu=power.draw,utilization.gpu,clocks.sm --format=csv,noheader
Healthy output:
35.65 W, 96 %, 2522 MHz
Defective unit (30W safety mode):
4.80 W, 2 %, 2411 MHz
If you see the second output, stop here. This is a hardware defect requiring RMA — no amount of software configuration will fix it. See the full diagnostic guide.
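If you script this check, the defective-unit signature is easy to flag automatically. A minimal sketch, assuming the CSV line comes from the nvidia-smi query above; the numeric thresholds are illustrative assumptions, not NVIDIA-documented limits:

```python
# Parse the CSV line from nvidia-smi and flag the 30W-safety-mode
# signature: single-digit watts AND near-zero utilization while a
# GPU workload is running.

def looks_power_capped(csv_line: str) -> bool:
    """Return True if the reading matches the defective-unit pattern."""
    draw_s, util_s, _clock = [f.strip() for f in csv_line.split(",")]
    draw = float(draw_s.split()[0])   # "4.80 W" -> 4.80
    util = float(util_s.split()[0])   # "2 %"    -> 2.0
    return draw < 10.0 and util < 10.0

print(looks_power_capped("4.80 W, 2 %, 2411 MHz"))    # defective pattern
print(looks_power_capped("35.65 W, 96 %, 2522 MHz"))  # healthy pattern
```

Remember the precondition: this only means anything while a GPU workload is actually running, since an idle healthy unit also draws a few watts.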
Step 2: Driver and CUDA Version
nvidia-smi
You need:
- Driver: 580.x or later
- CUDA: 13.0 or later
Older drivers (550.x + CUDA 12.4) have a separate bug that makes the GPU appear stuck at 5W/0% utilization. This one is fixable with a driver upgrade.
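The version gate above can also be checked in a script. A hedged sketch, assuming the version string comes from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`; pure string comparison, no GPU required:

```python
# Compare a driver version string against the 580.x minimum.

def driver_ok(version: str, minimum=(580, 0)) -> bool:
    parts = tuple(int(p) for p in version.strip().split(".")[:2])
    # Pad a bare major version like "580" to (580, 0)
    parts = parts + (0,) * (2 - len(parts))
    return parts >= minimum

print(driver_ok("580.65.06"))  # new enough
print(driver_ok("550.54.15"))  # too old -> upgrade before troubleshooting
```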
Step 3: Confirm GPU Identity
nvidia-smi --query-gpu=name,compute_cap,memory.total --format=csv,noheader
Expected:
NVIDIA GB10, 12.1, 128 GB
The compute capability 12.1 is SM121 — this is not the same as SM100 (GB200) or SM89 (RTX 4090). Many CUDA kernels don't support SM121 yet. This matters when choosing software (Ollama handles it; vLLM needs cu130-nightly).
Path A: Ollama (5 Minutes to First Inference)
Ollama is the fastest path to running a model. One install, one command.
Install
curl -fsSL https://ollama.com/install.sh | sh
Run Your First Model
ollama run qwen3-coder-next
This downloads a ~20 GB model and starts an interactive chat. First run takes a few minutes for the download; subsequent runs start in seconds.
Why Ollama First?
- Works out of the box on GB10 (SM121 support built in)
- No Docker, no configuration files, no flags
- Good for experimenting with different models quickly
- The /api/chat endpoint is OpenAI-compatible enough for most tools
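To see what that API looks like in practice, here is a minimal sketch of a non-streaming /api/chat call using only the standard library. It assumes the Ollama daemon is listening on its default port 11434 and that the model below has already been pulled:

```python
import json
import urllib.request

def ollama_chat_request(model: str, prompt: str) -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a token stream
    }
    return urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = ollama_chat_request("qwen3-coder-next", "Say hello in one word.")
# With a live daemon, uncomment to send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["message"]["content"])
print(json.loads(req.data)["messages"][0]["role"])
```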
Ollama Limitations (Why You'll Eventually Want vLLM)
- No prefix caching: every request re-processes the full system prompt. TTFT is 2-4 seconds on long prompts.
- No NVFP4: Ollama uses GGUF quantization (Q4_K_M, Q8_0). vLLM's NVFP4 is 30% faster on the same model.
- KEEP_ALIVE trap: Ollama holds models in memory for 2 hours by default. On shared unified memory, this blocks vLLM from starting.
For experimentation and quick testing, Ollama is perfect. For production agent workloads, move to vLLM.
Path B: vLLM (Production Deployment)
vLLM is the production serving engine. More setup, but prefix caching alone justifies the migration for always-on agents.
Prerequisites
# Install Docker if not present
sudo apt-get update && sudo apt-get install -y docker.io nvidia-container-toolkit
sudo systemctl restart docker
# Verify GPU access in Docker
docker run --rm --gpus all nvidia/cuda:13.0.0-base-ubuntu24.04 nvidia-smi
Download a Model
pip install "huggingface_hub[cli,hf_transfer]"
# Qwen3.5-35B — the workhorse for agents
HF_HUB_ENABLE_HF_TRANSFER=1 hf download Qwen/Qwen3.5-35B-A3B-FP8 \
--local-dir ~/models/qwen35-35b-hf
HF_HUB_ENABLE_HF_TRANSFER=1 enables the Rust-based transfer backend — meaningfully faster for multi-GB downloads.
Before Starting vLLM: Unload Ollama
If Ollama is installed, it may be holding a model in the shared 128 GB memory pool. vLLM will OOM on startup if Ollama is occupying memory.
# Check what Ollama has loaded
curl -s http://localhost:11434/api/ps
# Unload everything
curl -s -X POST http://localhost:11434/api/generate \
-d '{"model": "MODEL_NAME", "keep_alive": 0}'
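If more than one model is loaded, you can drive the same two endpoints programmatically. A sketch, assuming the /api/ps response lists loaded models under a "models" key (the sample dict below is illustrative, not captured output):

```python
import json

def unload_payloads(ps_response: dict) -> list[str]:
    """Build one keep_alive=0 body per loaded model."""
    return [
        json.dumps({"model": m["name"], "keep_alive": 0})
        for m in ps_response.get("models", [])
    ]

sample_ps = {"models": [{"name": "qwen3-coder-next"}]}
for body in unload_payloads(sample_ps):
    # Each body would be POSTed to http://localhost:11434/api/generate
    print(body)
```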
The Working Docker Command
docker run -d --name qwen35 --restart unless-stopped \
--gpus all --ipc host --shm-size 64gb -p 8000:8000 \
-v ~/models/qwen35-35b-hf:/models/qwen35 \
vllm/vllm-openai:cu130-nightly \
--model /models/qwen35 \
--served-model-name qwen3.5-35b \
--max-model-len 200000 \
--gpu-memory-utilization 0.90 \
--max-num-batched-tokens 4096 \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Critical notes:
- Use cu130-nightly, not the stable image. Stable doesn't support GB10/SM121.
- Do NOT add --enable-chunked-prefill — it causes a 9x throughput regression on SSM+MoE models.
- Do NOT add --kv-cache-dtype fp8 — it causes output repetition loops on GB10.
Cold start takes 2-3 minutes (model loading + torch.compile + FlashInfer autotuning). Watch the logs:
docker logs -f qwen35
Ready when you see vLLM engine started.
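Rather than watching logs by hand, a startup script can poll until the server answers. A sketch with an injectable probe so the loop can be exercised without a live server; in practice the probe would GET http://localhost:8000/health:

```python
import time

def wait_until_ready(probe, timeout_s=300, interval_s=2.0) -> bool:
    """Poll `probe` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# Fake probe that succeeds on the third attempt, standing in for the
# 2-3 minute cold start (model load + torch.compile + autotuning).
attempts = {"n": 0}
def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3

print(wait_until_ready(fake_probe, timeout_s=5, interval_s=0.01))
```

The 300-second default timeout is a guess sized to the cold start described above; tune it to your model.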
Verify
# Health check
curl -s http://localhost:8000/health
# Test inference
curl -s http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.5-35b",
"messages": [{"role": "user", "content": "Hello, what hardware am I running on?"}],
"chat_template_kwargs": {"enable_thinking": false}
}'
The chat_template_kwargs field is required on every request to suppress Qwen3.5's thinking mode; it cannot be set server-side.
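Because that field is per-request, it is worth centralizing in a small helper so no call path forgets it. A sketch, assuming the served model name from the Docker command above:

```python
import json

def chat_body(prompt: str, thinking: bool = False) -> str:
    """Build the request body for vLLM's OpenAI-compatible endpoint,
    always attaching chat_template_kwargs (it cannot be set server-side)."""
    return json.dumps({
        "model": "qwen3.5-35b",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": thinking},
    })

body = chat_body("Hello, what hardware am I running on?")
# POST this to http://localhost:8000/v1/chat/completions
print(body)
```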
Which Model Should You Run?
This depends on your use case and hardware. Here's the decision tree:
On DGX Spark (128 GB, 273 GB/s)
| Use Case | Model | Runtime | Speed | Why |
|---|---|---|---|---|
| Agent workloads | Qwen3.5-35B-A3B FP8 | vLLM | 47 tok/s | Best tool calling, SSM hybrid for stable long-context |
| Vision tasks | Gemma 4 26B-A4B NVFP4 | vLLM | 52 tok/s | Multimodal, only 16 GB — leaves room for other models |
| Quick experimentation | Any model | Ollama | varies | One command, no config |
| Maximum capability | gpt-oss-120B MXFP4 | vLLM | ~40 tok/s | 120B parameters, fits in 128 GB |
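A quick way to sanity-check whether a model fits: weight footprint is roughly parameters times bytes per parameter for the quantization, plus headroom for KV cache and runtime. A back-of-envelope sketch; real footprints vary with architecture, so treat these as rough fits only:

```python
# Approximate bytes per parameter for common quantization formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "nvfp4": 0.5, "mxfp4": 0.5}

def weight_gb(params_b: float, quant: str) -> float:
    """Params in billions -> approximate weight size in GB."""
    return params_b * BYTES_PER_PARAM[quant]

for name, p, q in [("Qwen3.5-35B FP8", 35, "fp8"),
                   ("gpt-oss-120B MXFP4", 120, "mxfp4")]:
    gb = weight_gb(p, q)
    # 0.90 mirrors the --gpu-memory-utilization setting above
    print(f"{name}: ~{gb:.0f} GB weights, fits in 128 GB: {gb < 128 * 0.9}")
```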
Speed Floor Principle
Intelligence matters more than speed — but only above a usability threshold. The floor is roughly 15-20 tok/s for interactive use. Below that, you're waiting long enough that the intelligence advantage evaporates into frustration.
On DGX Spark, a 31B dense model hits 7 tok/s — below the floor. Choose MoE variants instead. On RTX 5090, the same 31B dense model runs at 62 tok/s — well above the floor, making it the smartest comfortable choice.
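The arithmetic behind those numbers: single-stream decode is roughly memory-bandwidth-bound, so an upper bound on tok/s is bandwidth divided by the bytes read per token (the active weights). A sketch of that estimate; the per-machine quantization assumptions below are mine, and real throughput lands under the ceiling:

```python
def decode_ceiling_toks(bandwidth_gbs: float, active_params_b: float,
                        bytes_per_param: float) -> float:
    """Bandwidth-bound upper limit on decode tok/s."""
    return bandwidth_gbs / (active_params_b * bytes_per_param)

# 31B dense at FP8 (~1 byte/param) on DGX Spark, 273 GB/s:
# ceiling ~8.8 tok/s, consistent with the observed 7 tok/s.
print(decode_ceiling_toks(273, 31, 1.0))

# 31B dense at 4-bit (~0.5 byte/param) on RTX 5090, 1792 GB/s:
# ceiling ~115 tok/s, comfortably above the observed 62 tok/s.
print(decode_ceiling_toks(1792, 31, 0.5))
```

This is also why MoE models beat dense models on low-bandwidth hardware: only the active parameters (e.g. 3B for an A3B model) are read per token.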
Full model comparison with all four Gemma 4 variants across three machines: Complete Guide
The 5 Gotchas Nobody Warns You About
These are the traps that cost hours. Each links to the full article.
1. Power Delivery Defect
Some units are permanently power-capped at 30W. Looks like a software problem. It's hardware. Full diagnostic →
2. Unified Memory Is Shared
Ollama and vLLM compete for the same 128 GB pool. nvidia-smi shows N/A for memory — use vLLM's /metrics endpoint instead. Details →
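The /metrics endpoint returns Prometheus text, so a gauge can be pulled out with a few lines. A sketch — the metric name vllm:gpu_cache_usage_perc exists in current vLLM builds, but treat both it and the sample line below as assumptions to verify against your own /metrics output:

```python
def read_gauge(metrics_text: str, name: str) -> float:
    """Return the value of the first metric line starting with `name`."""
    for line in metrics_text.splitlines():
        if line.startswith(name):
            return float(line.rsplit(" ", 1)[1])
    raise KeyError(name)

# Illustrative sample, as if fetched from http://localhost:8000/metrics
sample = 'vllm:gpu_cache_usage_perc{model_name="qwen3.5-35b"} 0.42\n'
print(read_gauge(sample, "vllm:gpu_cache_usage_perc"))  # 0.42
```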
3. SM121 Is Not SM100
The GB10's compute capability (SM121) is different from GB200 (SM100). Many CUDA kernels don't support it. This is why you need cu130-nightly for vLLM and why llama.cpp crashes on this hardware.
4. FP8 KV Cache Causes Repetition
Adding --kv-cache-dtype fp8 looks like a free memory optimization. On GB10, it causes outputs to degrade into repetition loops after ~500 tokens due to missing calibration data. Root cause →
5. Chunked Prefill Kills SSM Models
--enable-chunked-prefill is a standard vLLM optimization for Transformers. On SSM+MoE hybrids (Qwen3.5, Mamba-based models), it causes a 9x throughput regression. Analysis →
What Was Gained
The shortest path: Ollama installed → ollama run qwen3-coder-next → inference in 5 minutes. No Docker, no flags, no config.
The production path: vLLM with cu130-nightly → prefix caching drops TTFT from 2-4s to 0.12s → agent workloads become responsive.
The diagnostic: one nvidia-smi command before anything else. If power draw shows 5W under load, stop troubleshooting software.
Deep Dive Links
Every section above has a full article behind it:
| Topic | Article |
|---|---|
| Model benchmark (8 models) | Part 1: Finding the Best Stack |
| Ollama → vLLM migration | Part 2: Qwen3.5 at 47 tok/s |
| 120B model deployment | Part 3: Nemotron-120B Debug Log |
| FP8 KV cache bug | Part 5: Repetition Loops |
| Power/thermal diagnostic | Part 6: 30W Safety Mode |
| Gemma 4 model selection | Part 14: Complete Guide |
| vLLM vs Ollama speed gap | Part 8: Why 30% Faster |
FAQ
- What is the DGX Spark and how much does it cost?
- The DGX Spark is NVIDIA's desktop AI workstation with a GB10 Grace Blackwell Superchip — 128 GB unified memory, 273 GB/s bandwidth, SM121 compute. MSRP $4,699 as of February 2026 (up from $3,999 due to memory shortages). The ASUS Ascent GX10 is the same hardware in an ASUS chassis at a similar price point.
- How long does it take to go from unboxing to running an LLM on DGX Spark?
- About 15 minutes with Ollama (download + one command). About 30-45 minutes with vLLM (Docker setup + model download + configuration). The hardware check takes 2 minutes.
- Is the DGX Spark worth it in 2026?
- At $4,699, the DGX Spark is worth it if you need to run models larger than 32 GB locally (70B-120B+). Its 128 GB unified memory is unmatched at this price point. For models under 32 GB, an RTX 5090 system ($3,700+ for the GPU alone) is 6.6x faster on bandwidth. The DGX Spark's value is capacity and silence, not raw speed.
- DGX Spark vs RTX 5090 — which should I buy for local AI?
- DGX Spark: 128 GB memory, 273 GB/s bandwidth, $4,699, runs 120B+ models. RTX 5090: 32 GB GDDR7, 1,792 GB/s bandwidth, ~$3,700+ GPU only, 6.6x faster but limited to ~30B models. Buy DGX Spark for large models and always-on serving. Buy RTX 5090 for maximum speed on smaller models.
- Should I use Ollama or vLLM on DGX Spark?
- Start with Ollama (works in 5 minutes, great for experimentation). Move to vLLM when you need prefix caching (0.12s vs 2-4s TTFT), production stability, or NVFP4 quantization. Both can coexist on the same machine — just don't run them simultaneously.