Qwen3.5-122B on DGX Spark · part 2
[Benchmark] Qwen3.5-122B on DGX Spark: the 17 tok/s GDN wall was real — but the 2× fix was outside vLLM
❯ cat --toc
- Plain-Language Version: a 122B model that fits on a desktop but runs slow — until I changed engines
- 17 tok/s: the GDN wall, and it's a hardware gap not a tuning problem
- MTP makes it slower: 15.2 < 17
- Abliteration removes the refusal, not the trained framing
- vLLM PR #44700 did exactly nothing here: still 16.9 tok/s
- The fix was outside vLLM: swap the engine, or swap the format
- Debugging the upgrade: four ways the build fought back
- Takeaways
- Conclusion
TL;DR
Qwen3.5-122B-A10B (hybrid-GDN MoE, NVFP4, 76GB) runs on a 128GB DGX Spark: serves under vLLM, calls tools for real, writes uncensored prose. But it tops out at 17 tok/s — the GDN linear-attention layers have no fast kernel on the GB10's sm_121, so vLLM runs a Triton/FLA fallback. I spent an afternoon on vLLM PR #44700 (a merged GDN perf change, ~1.93x microbench / +24% on a B200): result 16.9 tok/s, zero gain — its speedup routes to a fast recurrent kernel sm_121 doesn't have. The real lever was outside the quant toolbox: I measured the Atlas engine at 33.9 tok/s (36.5 with MTP, ~2×, not yet a daily-stable setup) on the same abliterated weights, and a community INT4-AutoRound recipe reports ~51 (3x) on the stock model. 17 tok/s wasn't a wrong measurement — it was a correct local optimum inside one toolbox. Measure first, then widen the search space.
The GDN wall on consumer Blackwell — and why the 2x was outside vLLM
Plain-Language Version: a 122B model that fits on a desktop but runs slow — until I changed engines
Big AI models are usually too large for a desktop. This one — Qwen3.5-122B, 122 billion parameters — fits on a single small NVIDIA DGX Spark, a desktop AI box with 128GB of shared memory. It loads, it answers, it can even use tools (look up the weather, do multi-step tasks). On paper that's impressive: a frontier-class model running off your desk.
The problem is speed. It produces about 17 words' worth of text per second — readable, but slow, the kind of slow you feel waiting for a chat reply. So I tried to speed it up with a change the vLLM team had just merged — one they measured at nearly 2x on this kind of model. A full afternoon of building later, the result was identical: 17 became 16.9. Nothing.
The reason is simple: that speedup only helps a fast code path that my desktop's GPU can't use. The model was already on the slow path; the upgrade only sped up a fast path this chip never takes. So far that's the obvious lesson: before you sink an afternoon into an upgrade, check that the thing it speeds up is actually your bottleneck.
But there's a second half to the story. The speedup did exist — just not inside the tool I was using. When I stopped tuning vLLM and ran the exact same model under a different inference engine (Atlas), the speed roughly doubled. The afternoon wasn't wasted, but the real move wasn't climbing the wall — it was stepping around it.
New here? If "MoE", "GDN", or "NVFP4" don't mean much yet, What Quantization Algorithms Actually Do covers quantization and TurboQuant on GX10 covers KV cache compression. This one is about a hardware wall — a model architecture that has no fast kernel for this specific GPU — and the two ways out of it.
The interesting thing about a 128GB box isn't what runs fast on it — it's what runs at all. A 122B model has no business fitting on a desktop, and the fact that it does is its own small thrill. This is a follow-up to Part 1, where the same model first ran at 14 tok/s; a sister piece on DeepSeek-V4-Flash (284B on the same box, hitting the same family of wall from a different direction) is coming separately.
Everything below was measured on a GX10 (GB10 / sm_121, 128GB unified, CUDA 13.0 driver). All numbers are mine, this box, this week.
17 tok/s: the GDN wall, and it's a hardware gap not a tuning problem
Steady-state generation lands at 17 tok/s. That's the ceiling under vLLM, and tuning flags don't move it.
The reason is the GDN layers. vLLM's fast GDN prefill backends (FlashInfer / CuteDSL) gate to datacenter Blackwell (sm_10x), which has the tcgen05/TMEM tensor-memory instructions the fast paths lean on, and to H100. The GB10 is sm_121 (compute capability 12.1), which isn't on that list, so vLLM falls back to a Triton/FLA implementation. 17 tok/s is the Triton-fallback ceiling. (To be clear: GDN isn't impossible without those instructions. The Triton path runs fine; the fast path just isn't selected here.)
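A toy sketch of what "gated by compute capability" means in practice. This is not vLLM's selection code: the function name, the backend labels, and the allow-list are illustrative assumptions that only mirror the description above (fast path on H100 and sm_10x, fallback elsewhere).

```python
# Toy model of capability-gated kernel selection. NOT vLLM's actual code:
# the names and the allow-list are assumptions made for illustration.

# Compute-capability majors assumed to have a fast GDN path, per the text:
# 9 = Hopper (H100), 10 = datacenter Blackwell (sm_10x).
FAST_GDN_MAJORS = {9, 10}


def select_gdn_backend(major: int, minor: int) -> str:
    """Return the backend a capability-gated dispatcher would pick."""
    if major in FAST_GDN_MAJORS:
        return "fast"        # stands in for a FlashInfer / CuteDSL kernel
    return "triton_fla"      # portable fallback that runs on any SM


if __name__ == "__main__":
    devices = {"H100": (9, 0), "B200": (10, 0), "GB10": (12, 1)}
    for name, (major, minor) in devices.items():
        print(f"{name} sm_{major}{minor}: {select_gdn_backend(major, minor)}")
```

On a real machine the `(major, minor)` pair comes from `torch.cuda.get_device_capability()`. The point of the sketch is that the GB10 never reaches the "fast" branch, so nothing that improves that branch can show up in its numbers.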
The silver lining: Triton isn't as bad as I'd predicted. I'd modeled ~14 tok/s; the real fallback runs at 17, beating the DeepSeek-V4-Flash number (13.6) on the same box. A Triton kernel is still a GPU kernel.
It's worth being precise about what "wall" means here. This isn't "the fast path is slow." It's "the fast path isn't selected" — the optimized GDN backends don't run on this SM, so any work that targets them is invisible to this machine. Hold that thought for the #44700 section.
MTP makes it slower: 15.2 < 17
The model ships with MTP (multi-token prediction) draft heads, and the official recipe enables speculative decoding:
--speculative-config '{"method":"mtp","num_speculative_tokens":6}'
On the vendor's 2× RTX 6000 Pro (TP=2) that's a win. On a single GB10 under vLLM it's a loss: 15.2 tok/s, down from 17.
It's the same constraint that makes MTP slower on my daily 35B model. Speculative decoding trades spare compute/bandwidth for fewer sequential steps. On a single bandwidth- and GDN-bound box there's no spare to trade — the draft-and-verify overhead just adds latency. NVFP4 already cashed in the memory-traffic savings; MTP is trying to spend the same budget twice. (Hold this one too — it doesn't generalize to every engine, as the Atlas section shows.)
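The trade-off can be put into a small cost model. This is a sketch under the usual simplifying assumption that each drafted token is accepted independently with probability `alpha`; the acceptance rates and cost ratios you feed it are hypothetical inputs, not measurements from this box.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verify step when k tokens are drafted
    and each is accepted independently with probability alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speculative_speedup(alpha: float, k: int, draft_cost: float,
                        verify_cost: float = 1.0) -> float:
    """Throughput relative to plain decoding (1.0 means break-even).

    draft_cost:  cost of drafting one token, in units of one plain decode step.
    verify_cost: cost of the verify pass over k+1 positions, same units.
                 Close to 1.0 when there is spare compute; larger when the
                 box is already saturated.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + verify_cost)


if __name__ == "__main__":
    # Hypothetical inputs: same acceptance rate, different verify overhead.
    for verify_cost in (1.0, 2.5):
        s = speculative_speedup(alpha=0.6, k=6, draft_cost=0.05,
                                verify_cost=verify_cost)
        print(f"verify_cost={verify_cost}: relative throughput {s:.2f}")
```

With everything else fixed, raising `verify_cost` is enough to push the ratio below 1.0, which is the shape of the vLLM result: the draft heads are not wrong, the step just costs more than the accepted tokens pay back.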
Abliteration removes the refusal, not the trained framing
The "heretic" abliteration is genuinely uncensored in one specific sense and genuinely not in another, and the split is the interesting part.
- NSFW / erotica: fully open. No disclaimer, no hedging, and the prose is good.
- Lockpicking and similar: it answers, but prepends a disclaimer — a softer, partial unlock.
- Politically sensitive (e.g. Tiananmen 1989): still produces the original CCP-aligned framing. The giveaway is the reasoning trace, where the model talks itself into "needing to stay compliant" and "referencing official commemorative phrasing."
So abliteration removes the refusal direction — the model no longer says "I can't help with that" — but it does not remove trained answer framing. The model learned how to talk about certain topics during training, and that's baked into the weights in a way that knocking out refusal doesn't touch. Compare a more thoroughly abliterated DeepSeek-V4 variant, which does answer politically sensitive prompts directly: the difference is how hard the abliteration was pushed and how the base model's censorship was implemented — not "Chinese base model, therefore locked."
This matters for the speed story, too. Most of the community's fast paths are built for the stock model. The question that kept the wall standing for me wasn't just "what's faster" — it was "what's faster and still runs my abliterated weights." Keep that in mind for the next two sections.
vLLM PR #44700 did exactly nothing here: still 16.9 tok/s
First, the dead end. vLLM PR #44700 — "[PERF][Qwen3.5] Split mixed prefill+decode batches: route decodes to the recurrent kernel" — is merged and reports ~1.93x on a GDN kernel microbench plus +24% end-to-end on a B200. A 122B model bottlenecked on GDN layers — this looked like the exact fix. I built it.
Result: 16.9 tok/s. The same 17, minus measurement noise. Zero gain.
Why: the PR's win comes from routing decode batches to the fast recurrent kernel. On the GB10 that recurrent path is the Triton/FLA fallback — there's no fast native version on sm_121 (see the 17 tok/s section). So better routing still lands the decodes on the slow kernel, and the end-to-end number doesn't budge. The +24% is real — on a B200, where the fast recurrent kernel exists. On sm_121 it's a no-op.
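A back-of-envelope way to see the no-op, using Amdahl's law. The assumption here is mine, not the PR's: that the ~1.93x microbench speedup applies uniformly to some fixed share of end-to-end runtime. Under that assumption the B200's +24% implies a share, and the same formula with a kernel speedup of 1.0 gives exactly nothing.

```python
def end_to_end_speedup(fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime
    becomes `kernel_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the runtime share the sped-up part must have had."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Numbers reported for the B200: ~1.93x microbench, +24% end-to-end.
    share = implied_fraction(overall=1.24, kernel_speedup=1.93)
    print(f"implied share of runtime on the B200: {share:.2f}")

    # On sm_121 the routed-to kernel is the same fallback, so its speedup is 1.0.
    print(f"end-to-end on sm_121: {end_to_end_speedup(share, 1.0):.2f}x")
```

The implied share comes out around 0.40 on the B200. The exact value matters less than the second line: whatever the share is, a kernel speedup of 1.0 yields an end-to-end speedup of 1.0.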
Put bluntly: the GDN wall on GB10 is "the fast road isn't built here," not "the road is slow." A PR that improves how you merge onto the fast road can't help if that road doesn't exist on your machine. As long as I stayed inside vLLM, 122B on GB10 was 17 tok/s, and no amount of vLLM upgrading was going to change that.
That last clause is the trap I sat in for an afternoon: as long as I stayed inside vLLM.
The fix was outside vLLM: swap the engine, or swap the format
Here's the part I almost didn't write, because my first conclusion was "17 is the ceiling, wait for an upstream sm_121 GDN kernel." That conclusion was true — but only inside the NVFP4-on-vLLM toolbox. I'd spent the afternoon trying to climb the wall with a vLLM PR when the move was to step outside it. There are two doors.
Same 122B, same GB10: vLLM stuck at 15–17, Atlas jumps to 33.9 / 36.5 (~2×), community INT4-AutoRound reports 51 on the stock model.
Door 1 — change the engine: Atlas measured 33.9 tok/s (36.5 with MTP), about 2×. Atlas is a Rust inference engine (Avarok-Cybersecurity/atlas), currently targeted at a single platform — GB10 / DGX Spark / sm_121. I pointed it straight at my local abliterated NVFP4 122B — no re-quant, no re-download. It loads cleanly for a reason rooted in my checkpoint: the heretic build is native Qwen3.5 with the GDN layers left in plain BF16, not packed into a quant format, and Atlas reads those unquantized linear-attention layers directly (I assume it leaves them untouched the way vLLM does, but I couldn't find an explicit precision policy in Atlas's docs — I'm inferring it from the fact that it loads at all). Measured on the GB10 (warmup dropped, same prompt ×3 at steady state): 33.9 tok/s baseline (33.9/34.0/33.9, rock-steady), 36.5 with MTP (K=2) — about 2× vLLM's 17, on the exact same abliterated weights, with the uncensored behavior carried over intact — NSFW, politically sensitive, all of it. To be clear: the numbers are real; I just haven't turned it into a stable, reproducible daily setup.
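For anyone reproducing the measurement method (drop the warmup, repeat the same prompt, report the steady state), here is a minimal sketch. It is not the script behind the numbers above. It assumes the engine exposes an OpenAI-compatible `/v1/completions` endpoint that reports `usage.completion_tokens`; the URL, model name, and prompt in the usage comment are placeholders.

```python
import json
import statistics
import time
import urllib.request


def time_completion(base_url: str, model: str, prompt: str,
                    max_tokens: int = 256):
    """Send one non-streaming request; return (completion_tokens, seconds).

    Assumes an OpenAI-compatible /v1/completions endpoint. Wall time
    includes prefill, so keep the prompt short relative to max_tokens.
    """
    body = json.dumps({"model": model, "prompt": prompt,
                       "max_tokens": max_tokens, "temperature": 0}).encode()
    req = urllib.request.Request(f"{base_url}/v1/completions", data=body,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        payload = json.load(resp)
    elapsed = time.perf_counter() - start
    return payload["usage"]["completion_tokens"], elapsed


def steady_state_tok_s(samples, warmup: int = 1):
    """Drop the first `warmup` (tokens, seconds) samples and return
    (mean tok/s, max-min spread) over the rest."""
    rates = [tokens / seconds for tokens, seconds in samples[warmup:]]
    if not rates:
        raise ValueError("need at least one sample after warmup")
    return statistics.mean(rates), max(rates) - min(rates)


# Usage (placeholders, requires a running server):
# samples = [time_completion("http://localhost:8000", "my-model",
#                            "Write a short story.") for _ in range(4)]
# print(steady_state_tok_s(samples, warmup=1))
```

Reporting the spread next to the mean is what lets a 16.9 be read as "17 minus noise" rather than as a regression.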
Note what just happened to the MTP rule. Under vLLM, MTP made the 122B slower (15.2 < 17); under Atlas it still helps — 36.5 vs 33.9 baseline, though only +7.6% (SSM-hybrid and bandwidth-bound, so the gain is small). Same model, same card — whether speculative decoding pays off comes down to the engine. "Don't enable MTP" was never a law of the hardware; it was a property of vLLM's path on this box.
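The percentages fall straight out of the measured numbers. A short worked check, where the only choice of mine is to use the mean of the three Atlas baseline runs as the baseline:

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of `value` against `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Measured tok/s from the text.
vllm_base, vllm_mtp = 17.0, 15.2
atlas_runs = [33.9, 34.0, 33.9]
atlas_base = sum(atlas_runs) / len(atlas_runs)
atlas_mtp = 36.5

print(f"vLLM  MTP vs baseline: {pct_change(vllm_base, vllm_mtp):+.1f}%")    # -10.6%
print(f"Atlas MTP vs baseline: {pct_change(atlas_base, atlas_mtp):+.1f}%")  # +7.6%
print(f"Atlas vs vLLM baseline: {atlas_base / vllm_base:.2f}x")             # 2.00x
```

Same flag, opposite sign: roughly ten percent slower under one engine, several percent faster under the other.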
Up front, because this 2× has caveats:
- Not a daily-stable setup yet. I can measure ~35, but it's a way off a config I'd run every day.
- Context is capped at 4096 for now. The 76GB of weights leave only ~12GB free, so the KV pool needs hand-tuning to not OOM:
--max-prefill-tokens 2048, --max-batch-size 1, --gpu-memory-utilization 0.99, --max-seq-len 4096
Pushing context to 16–32K is plausible but untested.
- My ~34–37 is still under the ~46–48 the community reports for stock 122B on Atlas, probably vision-encoder plus heretic overhead. I haven't run the base-vs-MTP A/B to close that gap.
For those reasons I've parked Atlas for the 122B rather than promoting it to daily — but the direction stands: the wall I'd called permanent opened to an engine swap, on my own weights, with uncensored output intact.
Door 2 — change the format: INT4-AutoRound is reported at ~51 tok/s (3x), but stock. A community recipe — INT4-AutoRound weights, an INT8 LM head, and MTP on FlashInfer — is reported at ~51 tok/s on the GB10 (NVIDIA forum thread). The catch is the one from the abliteration section: those are stock weights. To hit that number with my abliterated model I'd have to redo the AutoRound quant myself, and I haven't. So 51 is a door I can see but haven't walked through — a measured community result, not my number.
So is 17 tok/s the ceiling? Under vLLM, yes — under a different engine, not necessarily. Same weights, same box, and Atlas doubled it (the same engine runs Qwen3.6-35B at 90+ tok/s, for what it's worth). I almost stopped at "17 is the ceiling, wait for upstream." What flipped it wasn't a better flag — it was routinely scanning for new tools and fresh info. Keep your intel current, and the wall turns into a door.
Debugging the upgrade: four ways the build fought back
The afternoon I spent getting #44700 to even run was mostly fighting the build, not the model. Four gotchas, all reusable:
1. The prebuilt wheel silently pulled CPU torch. Installing a rolling prebuilt vLLM wheel (d20260607.cu133) without an override let FlashInfer's unconstrained torch dependency make uv re-resolve to torch 2.10.0+cpu — no CUDA at all. Fix: pin it.
echo 'torch==2.11.0+cu130' > overrides.txt   # uv's --override takes a requirements-style file
uv pip install vllm-*.whl --override overrides.txt   # ← keep CUDA torch
2. A NO-GO that was wrong, caught by an adversarial check. I'd concluded "torch 2.10 + cu133 doesn't exist, dead end." Running it past Codex (reading primary sources) falsified that: the wheel is actually built against torch 2.11.0+cu130, and the .cu133 is just a build-base label. The real difference is at the ABI level, in symbol export (CUDAStream::query() changed export behavior between the two torch versions). The NO-GO was based on a label, not the ABI. Lesson logged: verify NO-GO conclusions against the actual artifact.
3. A PTX wall on the vision tower. cudaErrorUnsupportedPtxVersion, dying in the ViT attention path: the _vllm_fa2_C flash-attn-2 kernel shipped only cu133 PTX with no sm_121 cubin, so the CUDA 13.0 driver couldn't JIT it. The main decode path (flashinfer + GDN) was fine; only the encoder choked. Fix: route the encoder around fa2.
--mm-encoder-attn-backend TORCH_SDPA # ViT uses torch-native attention
4. The runner Docker stage didn't inherit the timeout. Only the base stage had UV_HTTP_TIMEOUT; the runner stage timed out downloading a 517MB cublas wheel from pypi.nvidia.com. Fix: set it in the runner stage too (ENV UV_HTTP_TIMEOUT=600).
None of this produced a faster model. It produced a working build that proved the speedup wasn't available on this hardware — a null result, but a real one.
Takeaways
Where the time went. Almost none of it was the model. It was the build: torch resolution, the PTX/cubin mismatch, the wheel-label confusion. The actual benchmark — load, generate, count tokens — took minutes. The "make it faster" attempt that produced no speedup ate the afternoon. The thing that actually made it faster — running a different engine — took less time than the build did.
Reusable diagnostics.
- When a prebuilt CUDA wheel "works" but inference is slow or CPU-bound, check torch.version.cuda first: an unconstrained transitive dep can downgrade you to +cpu without an error.
- cudaErrorUnsupportedPtxVersion on one subsystem (vision) while the main path loads means a kernel shipped PTX-only for a newer toolkit than your driver. Route that subsystem to a native backend instead of rebuilding everything.
- Run NO-GO conclusions past a source-reading verifier before you act on them. A build-base label (.cu133) is not an ABI.
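The first diagnostic is cheap enough to run right after every install. A minimal sketch: `is_cpu_only_build` is a helper name made up for this example, while `torch.__version__` and `torch.version.cuda` are the real attributes (the latter is None on CPU-only builds).

```python
def is_cpu_only_build(torch_version: str, cuda_version) -> bool:
    """True if the installed torch wheel has no CUDA support.

    torch_version: torch.__version__, e.g. "2.10.0+cpu" or "2.11.0+cu130".
    cuda_version:  torch.version.cuda, which is None on CPU-only builds.
    """
    return cuda_version is None or torch_version.endswith("+cpu")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "| cuda", torch.version.cuda)
        if is_cpu_only_build(torch.__version__, torch.version.cuda):
            print("WARNING: CPU-only torch; a dependency likely re-resolved it")
```

Running this inside the same environment (or container stage) that serves the model matters, since the re-resolve happens per environment.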
The general principle, in two halves. First: before an upgrade, ask whether your bottleneck is even on the path the upgrade optimizes. PR #44700's win comes from routing work to a fast kernel sm_121 doesn't run, so it's structurally incapable of helping here — no matter how real its +24% is on a B200. Measure the path, not the changelog. Second, and the one I learned the hard way: when you hit a wall, check whether the wall is the problem or the toolbox is. 17 tok/s was a true number with a false implication. The fix wasn't a flag inside vLLM — it was a different engine, or a different format.
Conclusion
If you're running a hybrid-GDN model on a GB10 / DGX Spark (sm_121):
- It fits and it works. Qwen3.5-122B-A10B NVFP4 (~76GB) serves under vLLM, calls tools (qwen3_coder parser), writes uncensored prose.
- 17 tok/s is the vLLM ceiling, not the box's ceiling. GDN runs a Triton/FLA fallback; the fast GDN backends aren't selected on sm_121. Inside vLLM, that number won't move.
- The 2× lives outside vLLM. Atlas measured 33.9 tok/s (36.5 with MTP, about 2×) on the same abliterated weights — not a daily-stable setup yet. A community INT4-AutoRound recipe reports ~51 for the stock model — a format swap, if you're willing to re-quant.
- MTP is engine-dependent, not hardware-law. Under vLLM it's slower (15.2 < 17); under Atlas it's part of the win. Don't generalize a flag's behavior across engines.
- Don't chase GDN speedup PRs inside vLLM. PR #44700's +24% comes from routing to a fast recurrent kernel sm_121 doesn't run — zero gain here (16.9). The leverage is outside vLLM, not in it.
- Abliteration ≠ uncensored. It removes refusal, not trained framing. And it narrows your fast-path options, since most community speedups target the stock model.
The wall was real: inside vLLM, 122B on this box is 17 tok/s and an afternoon of upgrading won't change it. But the wall had a door. The measurement was right; the search space was too narrow. Change the engine, or change the format — don't keep tuning the slow one.
Also in this series: Part 1 — Qwen3.5-122B Runs. But at 14 tok/s.
Background: What Quantization Algorithms Actually Do · TurboQuant 3-bit KV Cache on GX10
FAQ
- Can a 122B model run on a 128GB DGX Spark?
- Yes. Qwen3.5-122B-A10B is a hybrid-GDN MoE; the NVFP4 quant is ~76GB and fits in the GB10's 128GB unified memory with room for KV cache. Under vLLM it serves, calls tools, and runs as a real multi-step agent at about 17 tok/s. I measured a different engine (Atlas) at 33.9 tok/s (36.5 with MTP, ~2×) on the same weights — I just haven't turned it into a stable daily setup yet.
- Why is Qwen3.5-122B only 17 tok/s under vLLM on GB10?
- Its Gated DeltaNet (GDN) linear-attention layers have no fast kernel path on the GB10's sm_121. vLLM's fast GDN backends gate to datacenter Blackwell (sm_10x, which has the tcgen05/TMEM tensor-memory instructions); on sm_121 it falls back to a Triton/FLA path. 17 tok/s is the Triton-fallback ceiling, which is actually a bit faster than the predicted 14.
- Is there a faster way to run Qwen3.5-122B on a DGX Spark than 17 tok/s?
- Yes, but not by tuning vLLM. I measured the Atlas engine (Rust) at 33.9 tok/s baseline / 36.5 with MTP (about 2×) on my abliterated NVFP4 122B, same weights, still fully uncensored — I just haven't turned it into a stable, reproducible daily setup yet (context capped at 4096 for now). A community INT4-AutoRound + INT8-LM-head recipe is reported at ~51 tok/s, but only for the stock model and not by me. The point: the vLLM NVFP4 17 tok/s was a correct local optimum; the way out is outside vLLM — change the engine or the quant format, not the flags.
- Does vLLM PR #44700 help on DGX Spark?
- No. PR #44700 ('Split mixed prefill+decode batches: route decodes to the recurrent kernel') reports ~1.93x kernel microbench and +24% end-to-end on a B200. Its gain comes from routing decodes to the fast recurrent kernel — which on sm_121 is the Triton/FLA fallback, not a fast native kernel. So the routing still lands on the slow kernel and the number doesn't move: 17 to 16.9 tok/s, zero gain.