❯ ls ~/blog/series/dgx-spark
38 posts
- part | date | title
- 0 | 2026-04-13 | [DGX Spark] From Unboxing to Running: Complete Deployment Guide
Everything you need to go from a sealed DGX Spark box to serving your first local LLM. Hardware check, Ollama quickstart, vLLM production setup, model selection, and the 5 gotchas that cost hours.
- 1 | 2026-02-19 | [Benchmark] 8 Models on DGX Spark: Finding the Best Stack for AI Agents
Benchmarking 8 local LLMs on NVIDIA GB10 (128GB unified memory) across 7 task categories. Quantization surprises, a 120B model that fails at JSON, and thinking models that spend their entire budget thinking.
- 2 | 2026-03-05 | [vLLM] Qwen3.5-35B at 47 tok/s on DGX Spark: Ollama to vLLM Migration Guide
Step-by-step guide: Ollama to vLLM on DGX Spark GB10. Qwen3.5-35B hits 47 tok/s with TTFT dropping from 3s to 0.12s. Covers 6 real gotchas, including the SSM + chunked prefill trap and GPU memory conflicts.
- 3 | 2026-03-13 | [vLLM] Nemotron-3-Super-120B on a Single GB10: Full Day Debug Log
Getting NVIDIA's Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (SM121, 128GB). Four SM121-specific pitfalls, the env-var-that-does-nothing, and a working docker command.
- 4 | 2026-03-17 | [vLLM] Why Your DGX Spark Only Says "!!!!!": Debugging NVFP4 on SM121
CUTLASS FP4 kernels target SM120 (RTX 50-series Blackwell). On SM121 (GB10, DGX Spark) they run silently and produce garbage. Here's the full diagnostic story: 4 bugs, the row-identical failure signature, and the working fix.
- 5 | 2026-03-21 | [vLLM] FP8 KV Cache on GB10: Why Outputs Collapse into Repetition Loops
Adding --kv-cache-dtype fp8 to a vLLM serve script on GB10 causes outputs to degrade into repetition after ~500 tokens. Root cause: calibration data is missing, so q_scale defaults to 1.0.
- 6 | 2026-04-02 | [DGX Spark] Overheating, 100W Power Cap, 30W Safety Mode — Complete Diagnostic Guide
DGX Spark power and thermal issues blew up after Carmack's criticism. This guide covers three distinct symptoms: 30W PD controller defect (needs RMA), 100W thermal throttling, and 5W driver bug (fixable). One command, 30 seconds to diagnose.
- 7 | 2026-04-05 | Gemma 4 on DGX Spark: 52 tok/s in 16 GB with NVFP4 (Benchmark)
Benchmark: Gemma 4 26B-A4B + NVFP4 on DGX Spark GB10 reaches 52 tok/s decode. It uses 16.5 GB and leaves 82 GB for KV cache. 7.5× faster than the 31B dense variant. vLLM 0.19 setup notes.
- 8 | 2026-04-05 | [Benchmark] vLLM vs Ollama on the Same Model: Why 30% Faster on GB10
Same Gemma 4 26B-A4B, same GPU, 30% speed gap. vLLM NVFP4 hits 52 tok/s while Ollama Q4_K_M tops out at 40. Root cause: Marlin kernels, CUDA graphs, and an Ollama CPU/GPU split trap.
- 9 | 2026-04-05 | [Benchmark] Gemma 4 31B Dense on DGX Spark: 7 tok/s and the Bandwidth Wall
Gemma 4 31B-IT NVFP4 on GB10 maxes out at 7.0 tok/s — bandwidth-bound at 273 GB/s. The math predicted 4.4 tok/s theoretical; NVFP4 compression buys 60% but can't escape the wall. Choose MoE.
- 10 | 2026-04-07 | [Benchmark] From 19 to 50 tok/s: We Quantized Gemma 4 E4B to NVFP4 Before Anyone Else
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6x faster than BF16. First NVFP4 checkpoint on HuggingFace. PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 11 | 2026-04-07 | [Benchmark] Gemma 4 E2B vs E4B: 81 tok/s vs 52 on Three Machines — Bandwidth Is Everything
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 12 | 2026-04-08 | [Benchmark] 4 Machines, 4 Models, 1 Answer: Memory Decides Everything
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 13 | 2026-04-08 | [Benchmark] Rescuing Gemma 4 31B on a 32GB MacBook Pro: From 1.5 to 12.8 tok/s
Gemma 4 31B runs at 1.5 tok/s on MBP M1 Max with Ollama due to swap. The fix: reduce context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size.
- 14 | 2026-04-13 | [Benchmark] Gemma 4 Complete Guide on DGX Spark — Which Model Should You Pick?
Gemma 4 E2B / E4B / 26B MoE / 31B Dense benchmarked on DGX Spark, RTX 5090, and MacBook Pro. One table with speed, memory, quantization format. Selection guide included.
- 15 | 2026-04-13 | [AI Agent] Gemma 4 Went from 40 Errors to a 9-Step Bug Fix — by Switching One Thing
A feasibility test: can open-source models run SWE-Bench locally for free? Gemma 4 26B failed on OpenHands (40+ errors) but fixed a test bug in 9 steps on SWE-agent. Same model — the action format was the difference.
- 16 | 2026-04-15 | [AI Agent] Gemma 4 26B Cleared a SWE-bench Lite Instance — After 28 Tries Across Two Days
Two days running mini-swe-agent + vLLM on a GB10. From wrong doc conclusions to Gemma 4 self-submitting a clean patch in 38 steps — what actually unlocked it.
- 17 | 2026-04-17 | [Benchmark] SWE-bench Lite 38.67% with a 26B Local Model — 0.33% from Claude 3.5 Sonnet Scaffolds
Gemma 4 26B-A4B FP8 scored 116/300 on SWE-bench Lite, ranking #16 globally. Zero API cost on a DGX Spark. The scaffold — not the model — was the differentiator.
- 18 | 2026-04-20 | [Benchmark] Same Scaffold, Three Models: 16% → 38% → 48% on SWE-bench Lite
One scaffold (backticks + edit-tool + budget prompt), three models (Gemma 4 E4B, Gemma 4 26B, Qwen 3.6 35B), zero code changes between runs. Qwen 3.6 hit 48.33% — beating SWE-agent + Claude 3.7 Sonnet. The scaffold is the fixed cost; the model is the variable.
- 19 | 2026-04-21 | [Benchmark] NVFP4 Is a Trap on GB10: FP8 Wins by 32% (vLLM + SGLang Tested)
NVFP4 should be faster than FP8 — fewer bits, less bandwidth. On DGX Spark's GB10 (SM121), it's 32% slower. Root cause: missing hardware instruction. Dual-engine proof with vLLM and SGLang.
- 20 | 2026-04-22 | [Hands-On] Making NVFP4 17% Faster on GB10 with a Triton FP8 Bypass
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequants NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code.
- 21 | 2026-04-25 | [Benchmark] TMMLU+ Paired Eval: Qwen 3.6 35B Sweeps Gemma 4 26B 51-of-51 on Traditional Chinese
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 22 | 2026-04-26 | [Benchmark] Abliteration Costs 1.85pp on Traditional Chinese — and 7.7pp on Trust Law
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects (信託 −7.7, 行政法 −7.1) lose the most, while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 23 | 2026-04-28 | [llm-compressor] Self-Quantizing a 35B Abliterated MoE to FP8 on DGX Spark: 4 OOMs, 3 Prefix Bugs, and Why the First Success Wasn't Actually FP8
Quantizing huihui-ai's Qwen3.6-35B-A3B abliterated to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why streaming save_pretrained returns BF16 not FP8. Final result: 51.72 tok/s, 1.68× BF16.
- 24 | 2026-04-28 | [SWE-bench] Where Qwen 3.6 35B Loses on SWE-bench Lite: Anatomy of 155 Unresolved Tasks
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive on Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
- 25 | 2026-05-01 | [vLLM] Nemotron 3 Nano on DGX Spark: 74.75 tok/s NVFP4 — 11.5% Past the Public Baseline
Ten days ago I called NVFP4 a trap on DGX Spark GB10. Today the same hardware hits 74.75 tok/s on Nemotron 3 Nano W4A16, beating my own FP8 ceiling and the public 67 tok/s forum number. The 4-layer patch stack, the quant variant choice, and the bandwidth math behind it.
- 26 | 2026-05-01 | [vLLM] Watching English Videos with DGX Spark: Nemotron Omni Multimodal on GB10
Same DGX Spark, different goal: watch a 3-minute Andrej Karpathy talk and output the spoken content + visual scene. 89 seconds wall, 53,842 prompt tokens, factually correct. The use_audio_in_video flag, the upstream-image gotcha, and the long-video knob math.
- 27 | 2026-05-06 | Liftoff: Gemma 4 hits 670 tok/s aggregate on DGX Spark (108 tok/s single-stream)
Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline), 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with -it, not base.
- 28 | 2026-05-09 | Want MTP speedup on abliterated Gemma 4? Vanilla draft can't track the modified body
I self-quantized huihui's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. After sweeping num_speculative_tokens 1→4, the abliterated body is exactly as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent — but per-position acceptance decays so steeply that deeper speculation is wasted. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.
- 29 | 2026-05-14 | 30 lines of docker for +34% on DGX Spark: huihui Gemma 4 FP8 + vanilla MTP n=1 deployment recipe
Part 28 explained why deep speculation breaks on an abliterated body; this post is the recipe for the part that already works. huihui Gemma 4 26B-A4B FP8 + Google's vanilla MTP draft at num_speculative_tokens=1 takes baseline 39.3 tok/s to 52.6 tok/s (+34%) on GB10, no retraining required. ~30 lines of docker plus a bind-mount of PR #41745's gemma4_mtp.py. Includes a 3-step sanity check and a clear list of when n=1 stops being enough.
- 30 | 2026-05-16 | EAGLE-3 fine-tune against an abliterated Gemma 4 body — Round 1 flattens the acceptance curve (plus a measurement lesson)
RedHatAI's EAGLE-3 drafter fine-tuned to realign with huihui Gemma 4 26B-A4B abliterated FP8 on a single DGX Spark GB10 — 1 epoch / 50k Magpie samples / 11h. Inference bench on raw `/v1/completions`: pos 3 acceptance climbs from vanilla's 20.5% to 72.7%; n=4 throughput goes from ~50 to 100.36 tok/s aggregate. **A later paired bench revealed the throughput comparison used different endpoints for baseline (chat) and retrain (raw) — on production chat workloads the real uplift is far smaller than 2×; see the endpoint correction at the top of the post**. Part 28's mechanism observation (deep speculation acceptance scatters on abliterated distributions) still holds. Includes a Speculators upstream create_empty_sample dtype bug + patch and a Phase 0 catalog of 6 community prior-art repos.
- 31 | 2026-05-21 | Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup
After Part 30's endpoint correction showed Round 1 didn't actually 2x chat throughput, Round 2 added 30k regenerated Chinese instruction samples and trained for 41 hours. Result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45. The EAGLE-3 small head hits an architectural ceiling against the abliterated body; more data doesn't fix it. Plus we found a scheduler deadlock in the vLLM Gemma 4 preview image (`gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`) under long-running extract_hidden_states use (hit three times, mitigated with a watchdog).
- 32 | 2026-05-30 | NVFP4 is 1.5× FP8 on a DGX Spark — but it's compression, not the FP4 cores
On a GB10 DGX Spark, NVFP4 beats FP8 by ~1.5× for single-stream decode on a dense model. But the win is bandwidth (smaller weights), not the FP4 tensor cores — the fastest path never touches them.
- 33 | 2026-06-01 | [Benchmark] NVFP4 W4A4 beats FP8 on a DGX Spark MoE: 67 vs 52 tok/s once CUDA graphs fire
On a GB10 DGX Spark, NVFP4 W4A4 went from 23 to 67 tok/s the moment I dropped --enforce-eager — beating FP8 by 29% and saving 16GB. The catch from Part 32 was real, just dense-only.
- 34 | 2026-06-01 | [Benchmark] NVFP4 shrinks a video model 33% on a DGX Spark — with zero speed gain
NVFP4 took a distilled Sulphur 2 (LTX-2.3) video model from 29 to 19.5 GB on a GB10 DGX Spark with no quality loss and — since video is compute-bound — no speed gain (if anything a hair slower).
- 35 | 2026-06-02 | [AI Agent] My Local Agent Flailed at Image Gen — It Was the Harness, Not the Weights
My local 35B agent went haywire generating images until I read its tool-call logs: 0% malformed calls. The model was fine — a broken ComfyUI tool was making it improvise. The fix was a clean ACI skill, not fine-tuning.
- 36 | 2026-06-04 | [Benchmark] Gemma 4 12B Omni on DGX Spark: Weight-Only NVFP4 Beats W4A4 (and Keeps Multimodal)
I quantized Google's new omni Gemma 4 12B on a DGX Spark GB10. Weight-only NVFP4 hits 24.9 tok/s in 7.7 GB and keeps image/audio/video working — full W4A4 is slower AND breaks multimodal.
- 37 | 2026-06-05 | [Benchmark] NVFP4 Weight-Only Quantization Taxes Chinese ~2x Harder Than English (gemma-4-12B)
I benchmarked BF16 vs FP8 vs NVFP4 weight-only on gemma-4-12B across English (MMLU) and Traditional Chinese (TMMLU+) on a DGX Spark. FP8 is near-lossless on both; NVFP4 drops Chinese ~6pp but English only ~3pp.