Qwen3.5-122B on DGX Spark · part 2
[vLLM] Qwen3.5-122B Runs. But at 14 tok/s.
Preface
This article is about a result that looks like failure but isn't.
After patching four SM121-specific bugs (Part 1), Qwen3.5-122B runs correctly on a DGX Spark. The output is coherent. The model works. But the speed settles at 14 tok/s — roughly a quarter of what gpt-oss-120B achieves on the same hardware.
No amount of tuning moves it. That's the story.
The analogy: imagine a 12-lane highway where six lanes are freshly paved and six are still dirt. Traffic flows — it's not broken — but half the lanes are running at construction-zone speed. The hardware is fine. The pavement (software kernel support) for half the architecture isn't there yet. You can't fix a missing road by adjusting your driving.
What I Wanted
The goal was to use Qwen3.5-122B-A12B-NVFP4 as the primary model on the DGX Spark. 122B parameters, NVFP4 quantization, 128 GB unified memory — the math works out: 122B × 4-bit ≈ 61 GB, leaving room for KV cache. With gpt-oss-120B hitting 59 tok/s on the same box, the expectation for 122B NVFP4 was in the same range. Not identical — larger model, different architecture — but comparable.
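The memory math above, spelled out. This is simple arithmetic only; it ignores embeddings, non-expert tensors, and runtime buffers:

```python
# Back-of-envelope memory check for a 122B-parameter model at NVFP4 (4-bit).
# 128 GB is the DGX Spark unified-memory figure from the article.
params = 122e9          # total parameters
bits_per_param = 4      # NVFP4 quantization
weights_gb = params * bits_per_param / 8 / 1e9
headroom_gb = 128 - weights_gb
print(f"weights: {weights_gb:.0f} GB, headroom: {headroom_gb:.0f} GB")
```

The ~67 GB of headroom is what makes the KV cache and activation memory fit at all on a 128 GB box.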
14 tok/s was not the expectation.
What I Tried
After the Part 1 fixes (SM121 PTX path, Marlin thread race, SupportsQuant, CUTLASS_FP4 exclusion), the model loads and generates correct output. The first tokens arrive fast. Then throughput stabilizes at 14 tok/s.
Everything obvious was tried:
- `--enforce-eager` removed — wasn't there, not the issue
- KV cache dtype — fp8, same as gpt-oss, no change
- `--max-num-batched-tokens` — tuned up and down, marginal effect only
- `VLLM_MXFP4_BACKEND=marlin` — already set and confirmed in startup log
- `--moe-backend marlin` — already set
- GPU memory utilization — 0.90, standard
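For reference, the settings above assemble into a launch line roughly like this (a sketch: the model path is illustrative, and exact flag spellings vary across vLLM versions):

```shell
# Illustrative vLLM launch combining the settings tried above.
# Model path is a placeholder, not a confirmed Hub ID.
VLLM_MXFP4_BACKEND=marlin \
vllm serve Qwen/Qwen3.5-122B-A12B-NVFP4 \
  --moe-backend marlin \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-num-batched-tokens 8192
```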
14 tok/s held steady regardless. This is when "tuning problem" becomes "architecture problem."
Where the Problem Actually Is
Qwen3.5-122B-A12B is a hybrid architecture — not a standard Transformer. It combines two types of layers:
- Standard attention layers — normal Transformer MHA/MLA, Marlin handles these with a specialized kernel
- GDN (Gated Delta Network) layers — SSM-style recurrent layers, must stay in BF16
Marlin has fast, optimized kernels for standard linear layers and MoE expert GEMMs. It does not have a specialized kernel for GDN. Every GDN GEMM falls back to a generic compute path — functional, but slow.
This is also why the SupportsQuant fix from Part 1 is non-negotiable: without it, vLLM quantizes the GDN layers to NVFP4, which corrupts their recurrent hidden state and breaks the model entirely. The fix keeps GDN in BF16 — correct behavior, but it means GDN runs on the unoptimized path.
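A toy, one-dimensional sketch of why recurrent layers are so quantization-sensitive. This is not the GDN formulation, just the compounding-error principle: a small error in a recurrence coefficient is applied at every step, so it grows with sequence length instead of staying bounded:

```python
# Toy recurrence h_t = a * h_{t-1} (zero input for simplicity).
# A feed-forward layer applies its weights once per token; a recurrent
# layer re-applies them every step, so quantization error compounds.
def run_recurrence(a, steps=1000, h0=1.0):
    h = h0
    for _ in range(steps):
        h = a * h
    return h

a_exact = 0.995
a_quant = round(a_exact * 8) / 8   # snap to a coarse 4-bit-style grid -> 1.0

h_exact = run_recurrence(a_exact)  # decays toward zero
h_quant = run_recurrence(a_quant)  # never decays at all
print(h_exact, h_quant)
```

After 1000 steps the exact coefficient has decayed the state to under 1% of its initial value while the quantized one hasn't moved, which is the flavor of corruption that turns output into garbage.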
On GB10's 273 GB/s bandwidth, the math is straightforward. The A12B MoE activates roughly 12B parameters per token, so each decode step streams about 6 GB of weights at 4-bit, roughly 22 ms per step, which caps decode near 45 tok/s from bandwidth alone. Add attention, KV cache traffic, and kernel overhead, and the practical ceiling for 122B NVFP4 on GB10 is around 30 tok/s. To reach it, every layer needs an efficient kernel. When half the architecture (GDN) falls back to generic, throughput averages down well below the ceiling. 14 tok/s is where that average lands.
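A quick sanity check of the bandwidth arithmetic, using only the figures already on the table (273 GB/s, 61 GB total weights, ~12B active parameters at 4-bit). Note that the observed 14 tok/s already exceeds what streaming all 61 GB per step would allow, so only the active-expert path can be what's streaming:

```python
# Bandwidth-bound decode ceilings on GB10 (273 GB/s).
# Two scenarios: streaming all 61 GB of weights per decode step, vs.
# only the ~12B active parameters (6 GB at 4-bit) of the A12B MoE.
BW = 273e9  # bytes/s

def ceiling_toks(weight_bytes):
    return BW / weight_bytes

print(f"all weights (61 GB): {ceiling_toks(61e9):.1f} tok/s")
print(f"active only (6 GB):  {ceiling_toks(6e9):.1f} tok/s")
```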
What Was Gained
The result is 14 tok/s, not the ~30 tok/s ceiling. But the work wasn't wasted.
Confirmed:
- The SM121 fix stack from Part 1 works correctly for Qwen3.5-122B. Output is coherent and accurate.
- `SupportsQuant` is required — without it, GDN layers get quantized and output is garbage. This is a non-obvious bug that would cost hours to diagnose from scratch.
- The theoretical decode ceiling on GB10 for 122B NVFP4 is ~30 tok/s. When Marlin adds GDN kernel support, this is the target to validate against.
Established:
- 14 tok/s is the current floor, not a bug. There is nothing misconfigured.
- The bottleneck is a software gap (missing Marlin GDN kernel), not a hardware limitation.
- GB10 has sufficient compute for Qwen3.5-122B. The problem is squarely in the software stack.
The diagnostic pattern: if Qwen3.5-122B runs slowly on SM121 and you've already set `VLLM_MXFP4_BACKEND=marlin` and `--moe-backend marlin`, confirmed the startup log shows `Using backend: marlin`, and output is correct — you've hit the kernel gap. Stop tuning. The number will move when the kernel arrives.
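A quick way to check the log condition before ruling anything in or out (illustrative; the log file path and exact message wording vary by vLLM version and launch setup):

```shell
# Confirm the Marlin backend actually engaged before suspecting config.
grep -i "backend: marlin" vllm.log
```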
Conclusion
Qwen3.5-122B runs correctly on DGX Spark after the Part 1 patches. 14 tok/s is not a failure mode — it's the current software ceiling given what Marlin supports today.
For interactive use on GB10 right now: gpt-oss-120B at 59 tok/s is the working choice. Qwen3.5-122B at 14 tok/s is usable for offline batch workloads where latency matters less.
If you found this because your Qwen is running at 14 tok/s: you haven't missed a flag. You've found the ceiling. Come back when the Marlin GDN kernel lands.