Qwen3.5-122B on DGX Spark · part 1
[vLLM] Qwen3.5-122B Runs. But at 14 tok/s.
TL;DR
Qwen3.5-122B boots correctly on DGX Spark (GB10/SM121) after the Part 1 SM121 fixes — output is coherent, model works. Speed stabilizes at 14 tok/s regardless of tuning. Root cause: Marlin has no specialized kernel for GDN (Gated Delta Network) SSM layers, so they fall back to a generic compute path. No flag fixes this. Bandwidth-only ceiling for a 10B-active MoE on GB10 is ~55 tok/s; observed is 14 — the gap is compute, not bandwidth. Use gpt-oss-120B (~59 tok/s) for interactive workloads until Marlin ships GDN support.
Plain-Language Version: A 122-Billion-Parameter AI Model — Why Only 14 tok/s?
Let's start with the numbers. 122 billion parameters means this AI model has 122 billion adjustable "knobs" — more parameters generally means a more capable model that can handle more complex tasks. Qwen3.5-122B is one of the largest open-source models available, with capabilities approaching ChatGPT's level.
14 tok/s means 14 tokens generated per second (a token is roughly a word, or a piece of one). Is it usable? Yes. Is it comfortable? Not really. Chatting with it feels like texting someone who types slowly. On the same machine, another model of similar size (gpt-oss-120B) runs at 59 tok/s — roughly four times faster.
Why the huge gap? It's not the hardware, and it's not misconfiguration. Qwen3.5-122B uses a newer architecture (GDN — a hybrid of traditional Transformer and SSM layers), and the current software doesn't have optimized kernels for it yet. Imagine a 12-lane highway where 6 lanes are freshly paved and 6 are still dirt — traffic flows, but half the lanes are stuck at construction-zone speed.
This article documents my journey from "I must have configured something wrong" to "this is a software-level limitation, not a user error." If you're considering running large models yourself, this story could save you hours of troubleshooting.
Preface
This article is about a result that looks like failure but isn't.
After patching four SM121-specific bugs (Part 1), Qwen3.5-122B runs correctly on a DGX Spark. The output is coherent. The model works. But the speed settles at 14 tok/s — roughly a quarter of what gpt-oss-120B achieves on the same hardware.
No amount of tuning moves it. That's the story.
The analogy: imagine a 12-lane highway where six lanes are freshly paved and six are still dirt. Traffic flows — it's not broken — but half the lanes are running at construction-zone speed. The hardware is fine. The pavement (software kernel support) for half the architecture isn't there yet. You can't fix a missing road by adjusting your driving.
What Was the Expected Performance Target?
The goal was to use Qwen3.5-122B-A12B-NVFP4 as the primary model on the DGX Spark. 122B parameters, NVFP4 quantization, 128 GB unified memory — the math works out: 122B × 4-bit ≈ 61 GB, leaving room for KV cache. With gpt-oss-120B hitting 59 tok/s on the same box, the expectation for 122B NVFP4 was in the same range. Not identical — larger model, different architecture — but comparable.
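That sizing math, as a quick sketch (back-of-the-envelope only: it ignores quantization scales, higher-precision embeddings, and activation memory):

```python
# Back-of-the-envelope weight sizing for a 122B model at 4-bit (NVFP4).
# Ignores quantization scales, higher-precision embeddings, and activations.
total_params = 122e9        # parameters
bits_per_weight = 4         # NVFP4
unified_memory_gb = 128     # DGX Spark (GB10) unified memory

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"quantized weights ~ {weights_gb:.0f} GB")                          # ~61 GB
print(f"headroom for KV cache ~ {unified_memory_gb - weights_gb:.0f} GB")  # ~67 GB
```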
14 tok/s was not the expectation.
What Configuration Changes Were Tried Before Diagnosing the Root Cause?
After the Part 1 fixes (SM121 PTX path, Marlin thread race, SupportsQuant, CUTLASS_FP4 exclusion), the model loads and generates correct output. The first tokens arrive fast. Then throughput stabilizes at 14 tok/s.
Everything obvious was tried:
- --enforce-eager removed — wasn't there, not the issue
- KV cache dtype — fp8, same as gpt-oss, no change
- --max-num-batched-tokens — tuned up and down, marginal effect only
- VLLM_MXFP4_BACKEND=marlin — already set and confirmed in startup log
- --moe-backend marlin — already set
- GPU memory utilization — 0.90, standard
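For concreteness, here is a minimal sketch of that configuration through vLLM's offline Python API. Parameter names follow standard vLLM engine arguments; the model ID is a placeholder and the env var simply mirrors what this post sets, so treat it as illustrative rather than a verified launch recipe (the --moe-backend flag from the CLI setup is omitted here).

```python
# Illustrative sketch of the configuration that was tried, via vLLM's offline
# Python API. Not a tuning recipe: none of these settings moved the 14 tok/s.
import os
from vllm import LLM, SamplingParams

os.environ["VLLM_MXFP4_BACKEND"] = "marlin"   # confirmed in the startup log

llm = LLM(
    model="Qwen3.5-122B-A12B-NVFP4",   # placeholder: point at the actual checkpoint
    kv_cache_dtype="fp8",              # same as the gpt-oss setup; no change
    max_num_batched_tokens=8192,       # tuned up and down; marginal effect only
    gpu_memory_utilization=0.90,       # standard
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=32))[0].outputs[0].text)
```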
14 tok/s held steady regardless. This is when "tuning problem" becomes "architecture problem."
Where Is the Actual Performance Bottleneck?
Qwen3.5-122B-A12B is a hybrid architecture — not a standard Transformer. It combines two types of layers:
- Standard attention layers — normal Transformer MHA/MLA, Marlin handles these with a specialized kernel
- GDN (Gated Delta Network) layers — SSM-style recurrent layers, must stay in BF16
Marlin has fast, optimized kernels for standard linear layers and MoE expert GEMMs. It does not have a specialized kernel for GDN. Every GDN GEMM falls back to a generic compute path — functional, but slow.
This is also why the SupportsQuant fix from Part 1 is non-negotiable: without it, vLLM quantizes the GDN layers to NVFP4, which corrupts their recurrent hidden state and breaks the model entirely. The fix keeps GDN in BF16 — correct behavior, but it means GDN runs on the unoptimized path.
On GB10's 273 GB/s bandwidth, the math needs an MoE correction. Qwen3.5-122B-A10B routes 8 of 256 experts per token, so only ~10B params are active per decode step. At 4-bit that's ~5 GB streamed per token — a bandwidth ceiling of ~55 tok/s, not 30. The earlier "61 GB streamed per step" framing was wrong: that's the total weights, not the per-token active slice. A fully optimized stack should land well above 14 tok/s. When 36 of 48 layers (the linear-attention + GDN layers) fall back to a generic compute path, throughput collapses to 14 tok/s — about 4x below the bandwidth ceiling, which says compute is the bottleneck, not memory bandwidth.
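The same arithmetic, written out (numbers from this section; the ceiling is an upper bound since it ignores KV-cache traffic and attention/norm overhead):

```python
# Decode-throughput ceiling from memory bandwidth alone. Upper bound only:
# ignores KV-cache reads, attention/norm overhead, and any compute limits.
bandwidth_gb_s  = 273      # GB10 memory bandwidth
active_params   = 10e9     # ~10B params routed per token (8 of 256 experts)
bytes_per_param = 0.5      # 4-bit NVFP4 weights

gb_per_token  = active_params * bytes_per_param / 1e9    # ~5 GB streamed per decode step
ceiling_tok_s = bandwidth_gb_s / gb_per_token            # ~55 tok/s

observed_tok_s = 14
print(f"bandwidth ceiling ~ {ceiling_tok_s:.0f} tok/s")
print(f"observed fraction ~ {observed_tok_s / ceiling_tok_s:.0%}")  # ~26% -> compute-bound
```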
What Was Gained
The result is 14 tok/s, far below what an efficient kernel stack would deliver. But the work wasn't wasted.
Confirmed:
- The SM121 fix stack from Part 1 works correctly for Qwen3.5-122B. Output is coherent and accurate.
- SupportsQuant is required — without it, GDN layers get quantized and output is garbage. This is a non-obvious bug that would cost hours to diagnose from scratch.
- The bandwidth ceiling on GB10 for a 10B-active MoE is ~55 tok/s; real-world targets will be lower once attention/norm overhead is included. When Marlin adds GDN kernel support, validate against this ceiling, not the 14 tok/s currently observed.
Established:
- 14 tok/s is the current floor, not a bug. There is nothing misconfigured.
- The bottleneck is a software gap (missing Marlin GDN kernel), not a hardware limitation.
- GB10 has sufficient compute for Qwen3.5-122B. The problem is squarely in the software stack.
The diagnostic pattern: if Qwen3.5-122B runs slowly on SM121 and you've already set VLLM_MXFP4_BACKEND=marlin and --moe-backend marlin, confirmed startup logs show Using backend: marlin, and output is correct — you've hit the kernel gap. Stop tuning. The number will move when the kernel arrives.
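If you want to script that check, a minimal sketch (the log filename is an assumption; point it at wherever you capture the server's stdout; the message strings are the ones quoted above):

```python
# Quick sanity check against a saved vLLM startup log before concluding
# you've hit the kernel gap.
from pathlib import Path

log = Path("vllm_startup.log").read_text()   # assumed capture of server stdout
checks = {
    "Marlin backend selected": "Using backend: marlin" in log,
    "MXFP4 env var visible":   "VLLM_MXFP4_BACKEND" in log,
}
for name, ok in checks.items():
    print(f"{'OK  ' if ok else 'MISS'} {name}")
```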
Conclusion
Qwen3.5-122B runs correctly on DGX Spark after the Part 1 patches. 14 tok/s is not a failure mode — it's the current software ceiling given what Marlin supports today.
For interactive use on GB10 right now: gpt-oss-120B at 59 tok/s is the working choice. Qwen3.5-122B at 14 tok/s is usable for offline batch workloads where latency matters less.
If you found this because your Qwen is running at 14 tok/s: you haven't missed a flag. You've found the ceiling. Come back when the Marlin GDN kernel lands.
FAQ
- Why is Qwen3.5-122B running at 14 tok/s on DGX Spark even after applying all SM121 fixes?
- Qwen3.5-122B-A10B is a hybrid MoE architecture where 36 of 48 layers are linear-attention + GDN (Gated Delta Network) SSM layers. Marlin has optimized kernels for standard linear/MoE layers but no specialized kernel for GDN. Every GDN GEMM falls back to a generic compute path. On GB10 (273 GB/s), a 10B-active MoE should hit ~55 tok/s on bandwidth alone; the GDN kernel gap drags actual throughput to ~14 tok/s — compute-bound, not bandwidth-bound.
- Is 14 tok/s on Qwen3.5-122B a configuration error or a kernel gap?
- A software kernel gap, not a configuration error. If you've set VLLM_MXFP4_BACKEND=marlin, --moe-backend marlin, and the startup log confirms 'Using backend: marlin', you've hit the ceiling. No flag fixes it. The bottleneck is a missing Marlin GDN kernel. Stop tuning and wait for the kernel to land.
- What is the theoretical decode ceiling for Qwen3.5-122B NVFP4 on GB10?
- ~55 tok/s on bandwidth alone. The correct math for MoE: only ~10B params are active per token (8 of 256 experts routed), so ~5 GB streams per decode step at 4-bit. 273 GB/s ÷ 5 GB ≈ 55 tok/s. Real-world practical ceiling is lower after attention/norm overhead but well above the 14 tok/s observed. Earlier text in this post claimed a 30 tok/s ceiling based on streaming all 61 GB of weights — that assumption is wrong for MoE.
- What is the practical choice between gpt-oss-120B and Qwen3.5-122B on DGX Spark today?
- For interactive use, gpt-oss-120B at ~59 tok/s is the correct choice. Qwen3.5-122B at 14 tok/s is usable for offline batch workloads where latency matters less. Revisit Qwen3.5-122B when Marlin adds GDN kernel support.