❯ ls ~/blog/series/dgx-spark
26 posts
part · date · title
- 0 · 2026-04-13 · [DGX Spark] From Unboxing to Running: Complete Deployment Guide
Everything you need to go from a sealed DGX Spark box to serving your first local LLM. Hardware check, Ollama quickstart, vLLM production setup, model selection, and the 5 gotchas that cost hours.
- 1 · 2026-02-19 · [Benchmark] 8 Models on DGX Spark: Finding the Best Stack for AI Agents
Benchmarking 8 local LLMs on NVIDIA GB10 (128GB unified memory) across 7 task categories. Quantization surprises, a 120B model that fails at JSON, and thinking models that spend their entire budget thinking.
- 2 · 2026-03-05 · [vLLM] Qwen3.5-35B at 47 tok/s on DGX Spark: Ollama to vLLM Migration Guide
Step-by-step guide: Ollama to vLLM on DGX Spark GB10. Qwen3.5-35B hits 47 tok/s with TTFT dropping from 3s to 0.12s (a minimal TTFT-measurement sketch follows this list). Covers 6 real gotchas including the SSM + chunked-prefill trap and GPU memory conflicts.
- 3 · 2026-03-30 · [Benchmark] TurboQuant on GX10: Is 3-bit KV Cache Compression Actually Lossless?
Real benchmark numbers for Google's TurboQuant on a GB10/SM121 (DGX Spark) — actual compression ratios, Qwen2.5-3B accuracy validation, and why Qwen3.5-35B's hybrid attention architecture makes things complicated.
- 3 · 2026-03-13 · [vLLM] Nemotron-3-Super-120B on a Single GB10: Full Day Debug Log
Getting NVIDIA's Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (SM121, 128GB). Four SM121-specific pitfalls, the env-var-that-does-nothing, and a working docker command.
- 4 · 2026-03-17 · [vLLM] Why Your DGX Spark Only Says "!!!!!": Debugging NVFP4 on SM121
CUTLASS FP4 kernels target SM120 (GB200). On SM121 (GB10, DGX Spark) they run silently and produce garbage. Here's the full diagnostic story — 4 bugs, the row-identical failure signature, and the working fix.
- 5 · 2026-03-21 · [vLLM] FP8 KV Cache on GB10: Why Outputs Collapse into Repetition Loops
Adding --kv-cache-dtype fp8 to a vLLM serve script on GB10 causes outputs to degrade into repetition after ~500 tokens. Root cause: no calibration data, so q_scale defaults to 1.0 (a config sketch follows this list).
- 6 · 2026-04-02 · [DGX Spark] Overheating, 100W Power Cap, 30W Safety Mode — Complete Diagnostic Guide
DGX Spark power and thermal issues blew up after Carmack's criticism. This guide covers three distinct symptoms: 30W PD controller defect (needs RMA), 100W thermal throttling, and 5W driver bug (fixable). One command, 30 seconds to diagnose.
- 7 · 2026-04-05 · [Benchmark] Gemma 4 26B on DGX Spark: 52 tok/s at Only 16 GB — vLLM NVFP4 Real Numbers
Real benchmark: Gemma 4 26B-A4B MoE in NVFP4 runs at 52 tok/s on DGX Spark (GB10) with just 16.5 GB model weight, leaving 82 GB free for KV cache. Includes the Phase 0 analysis that eliminated the 31B Dense variant.
- 8 · 2026-04-05 · [Benchmark] vLLM vs Ollama on the Same Model: Why 30% Faster on GB10
Same Gemma 4 26B-A4B, same GPU, 30% speed gap. vLLM NVFP4 hits 52 tok/s while Ollama Q4_K_M tops out at 40. Root cause: Marlin kernels, CUDA graphs, and an Ollama CPU/GPU split trap.
- 9 · 2026-04-05 · [Benchmark] Gemma 4 31B Dense on DGX Spark: 7 tok/s and the Bandwidth Wall
Gemma 4 31B-IT NVFP4 on GB10 maxes out at 7.0 tok/s — bandwidth-bound at 273 GB/s. The math predicted a theoretical 4.4 tok/s; NVFP4 compression buys 60% but can't escape the wall (the arithmetic is sketched after this list). Choose MoE.
- 10 · 2026-04-07 · [Benchmark] From 19 to 50 tok/s: We Quantized Gemma 4 E4B to NVFP4 Before Anyone Else
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6x faster than BF16. First NVFP4 checkpoint on HuggingFace. PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 11 · 2026-04-07 · [Benchmark] Gemma 4 E2B vs E4B: 81 tok/s vs 52 on Three Machines — Bandwidth Is Everything
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 12 · 2026-04-08 · [Benchmark] 4 Machines, 4 Models, 1 Answer: Memory Decides Everything
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 13 · 2026-04-08 · [Benchmark] Rescuing Gemma 4 31B on a 32GB MacBook Pro: From 1.5 to 12.8 tok/s
Gemma 4 31B runs at 1.5 tok/s on MBP M1 Max with Ollama due to swap. The fix: reduce context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size (the sizing formula is sketched after this list).
- 14 · 2026-04-13 · [Benchmark] Gemma 4 Complete Guide on DGX Spark — Which Model Should You Pick?
Gemma 4 E2B / E4B / 26B MoE / 31B Dense benchmarked on DGX Spark, RTX 5090, and MacBook Pro. One table with speed, memory, quantization format. Selection guide included.
- 15 · 2026-04-13 · [AI Agent] Gemma 4 Went from 40 Errors to a 9-Step Bug Fix — by Switching One Thing
A feasibility test: can open-source models run SWE-Bench locally for free? Gemma 4 26B failed on OpenHands (40+ errors) but fixed a test bug in 9 steps on SWE-agent. Same model — the action format was the difference.
- 16 · 2026-04-15 · [AI Agent] Gemma 4 26B Cleared a SWE-bench Lite Instance — After 28 Tries Across Two Days
Two days running mini-swe-agent + vLLM on a GB10. From wrong doc conclusions to Gemma 4 self-submitting a clean patch in 38 steps — what actually unlocked it.
- 17 · 2026-04-17 · [Benchmark] SWE-bench Lite 38.67% with a 26B Local Model — 0.33% from Claude 3.5 Sonnet Scaffolds
Gemma 4 26B-A4B FP8 scored 116/300 on SWE-bench Lite, ranking #16 globally. Zero API cost on a DGX Spark. The scaffold — not the model — was the differentiator.
- 18 · 2026-04-20 · [Benchmark] Same Scaffold, Three Models: 16% → 38% → 48% on SWE-bench Lite
One scaffold (backticks + edit-tool + budget prompt), three models (Gemma 4 E4B, Gemma 4 26B, Qwen 3.6 35B), zero code changes between runs. Qwen 3.6 hit 48.33% — beating SWE-agent + Claude 3.7 Sonnet. The scaffold is the fixed cost; the model is the variable.
- 19 · 2026-04-21 · [Benchmark] NVFP4 Is a Trap on GB10: FP8 Wins by 32% (vLLM + SGLang Tested)
NVFP4 should be faster than FP8 — fewer bits, less bandwidth. On DGX Spark's GB10 (SM121), it's 32% slower. Root cause: missing hardware instruction. Dual-engine proof with vLLM and SGLang.
- 20 · 2026-04-22 · [Hands-On] Making NVFP4 17% Faster on GB10 with a Triton FP8 Bypass
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequants NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code.
- 21 · 2026-04-25 · [Benchmark] TMMLU+ Paired Eval: Qwen 3.6 35B Sweeps Gemma 4 26B 51-of-51 on Traditional Chinese
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 22 · 2026-04-26 · [Benchmark] Abliteration Costs 1.85pp on Traditional Chinese — and 7.7pp on Trust Law
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects lose the most (trust law 信託 −7.7, administrative law 行政法 −7.1), while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 23 · 2026-04-28 · [llm-compressor] Self-Quantizing a 35B Abliterated MoE to FP8 on DGX Spark: 4 OOMs, 3 Prefix Bugs, and Why the First Success Wasn't Actually FP8
Quantizing huihui-ai's Qwen3.6-35B-A3B abliterated to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why streaming save_pretrained returns BF16 not FP8. Final result: 51.72 tok/s, 1.68× BF16.
- 24 · 2026-04-28 · [SWE-bench] Where Qwen 3.6 35B Loses on SWE-bench Lite: Anatomy of 155 Unresolved Tasks
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive on Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
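A few back-of-envelope sketches for claims referenced above. For Part 2's TTFT numbers, the measurement itself is easy to reproduce against any OpenAI-compatible endpoint (both vLLM and Ollama expose one). The snippet below is only a minimal sketch of that measurement; the endpoint URL and model id are placeholders, not the post's exact setup.

```python
# Minimal TTFT probe against a local OpenAI-compatible endpoint.
# base_url and model id are placeholders, not the post's exact config.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="Qwen3.5-35B",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
    stream=True,
    max_tokens=128,
)

ttft = None
chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first streamed token
        chunks += 1
total = time.perf_counter() - start
print(f"TTFT: {ttft:.2f}s  ({chunks} content chunks in {total:.2f}s total)")
```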
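Part 5's failure mode is triggered by configuration alone. Here is a minimal sketch of the relevant knob through vLLM's Python API, assuming a placeholder checkpoint with no KV-scale calibration; why the scales fall back to 1.0, and what to do about it, is the subject of the post.

```python
# Sketch of the configuration discussed in Part 5 (placeholder model id).
# With kv_cache_dtype="fp8" but no calibration data, the attention scales
# (q_scale and friends) default to 1.0, which is what the post blames for
# outputs collapsing into repetition after a few hundred tokens.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b-it",  # placeholder; any checkpoint without KV-scale calibration
    kv_cache_dtype="fp8",           # Python-API equivalent of --kv-cache-dtype fp8
)
out = llm.generate(
    ["Write a 600-token story about a lighthouse keeper."],
    SamplingParams(max_tokens=800, temperature=0.7),
)
print(out[0].outputs[0].text)
```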
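Part 9's bandwidth-wall figure works out from first principles: a dense model must stream every weight for every generated token, so decode speed is capped by bandwidth divided by weight bytes. The numbers below are the ones quoted in the blurb; the measured runs are in the post.

```python
# Back-of-envelope decode ceiling for a dense model that streams all
# weights per generated token: tok/s <= bandwidth / bytes_of_weights.
bandwidth_gb_s = 273        # GB10 memory bandwidth used in Part 9
params_b = 31e9             # Gemma 4 31B dense

bf16_bytes = params_b * 2   # 2 bytes/param -> 62 GB of weights
ceiling_bf16 = bandwidth_gb_s / (bf16_bytes / 1e9)
print(f"BF16 ceiling: {ceiling_bf16:.1f} tok/s")  # ~4.4 tok/s, the 'theoretical' figure

# Part 9 measures 7.0 tok/s with NVFP4, roughly a 60% lift over that ceiling;
# compression shrinks the bytes moved, but the run stays bandwidth-bound.
print(f"Measured NVFP4: 7.0 tok/s ({7.0 / ceiling_bf16:.2f}x the BF16 ceiling)")
```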
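Part 13's point that the culprit is KV-cache allocation rather than weights comes down to one formula. The sketch below uses made-up layer and head counts, not Gemma 4 31B's real configuration (that's in the post); only the shape of the calculation is meant to carry over.

```python
# Generic KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element. All model numbers below are
# hypothetical placeholders, not Gemma 4 31B's actual config.
def kv_cache_bytes(layers, kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * ctx_len * bytes_per_elem

gib = 1024 ** 3
for ctx in (4_096, 32_768, 131_072):
    size = kv_cache_bytes(layers=48, kv_heads=8, head_dim=128, ctx_len=ctx)
    print(f"ctx={ctx:>7}: {size / gib:5.1f} GiB")
# On a 32 GB laptop, the jump from under 1 GiB to tens of GiB is what pushes
# weights plus cache into swap; shrinking the context window is the cheap fix.
```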