
Qwen3.5-122B on DGX Spark · part 2

[vLLM] Qwen3.5-122B Runs. But at 14 tok/s.

2026-03-19 · 4 min read · #dgx-spark #sm121 #qwen3.5-122b #vllm

Preface

This article is about a result that looks like failure but isn't.

After patching four SM121-specific bugs (Part 1), Qwen3.5-122B runs correctly on a DGX Spark. The output is coherent. The model works. But the speed settles at 14 tok/s — roughly a quarter of what gpt-oss-120B achieves on the same hardware.

No amount of tuning moves it. That's the story.

The analogy: imagine a 12-lane highway where six lanes are freshly paved and six are still dirt. Traffic flows — it's not broken — but half the lanes are running at construction-zone speed. The hardware is fine. The pavement (software kernel support) for half the architecture isn't there yet. You can't fix a missing road by adjusting your driving.


What I Wanted

The goal was to use Qwen3.5-122B-A12B-NVFP4 as the primary model on the DGX Spark. 122B parameters, NVFP4 quantization, 128 GB unified memory — the math works out: 122B × 4-bit ≈ 61 GB, leaving room for KV cache. With gpt-oss-120B hitting 59 tok/s on the same box, the expectation for 122B NVFP4 was in the same range. Not identical — larger model, different architecture — but comparable.
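
The fit can be sanity-checked in a couple of lines of arithmetic (the 0.90 memory-utilization figure from the tuning attempts below is used as an assumption):

```python
# Memory-fit check: 122B parameters at 4 bits (NVFP4) in 128 GB unified memory.
params = 122e9
weights_gb = params * 4 / 8 / 1e9        # 4-bit weights -> GB (~61 GB)
budget_gb = 128 * 0.90                   # usable budget at 0.90 utilization
kv_headroom_gb = budget_gb - weights_gb  # what's left for KV cache and activations
print(f"weights ~{weights_gb:.0f} GB, headroom ~{kv_headroom_gb:.0f} GB")
```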

14 tok/s was not the expectation.


What I Tried

After the Part 1 fixes (SM121 PTX path, Marlin thread race, SupportsQuant, CUTLASS_FP4 exclusion), the model loads and generates correct output. The first tokens arrive fast. Then throughput stabilizes at 14 tok/s.

Everything obvious was tried:

  • --enforce-eager — confirmed it was never set; eager mode wasn't the issue
  • KV cache dtype — fp8, same as gpt-oss, no change
  • --max-num-batched-tokens — tuned up and down, marginal effect only
  • VLLM_MXFP4_BACKEND=marlin — already set and confirmed in startup log
  • --moe-backend marlin — already set
  • GPU memory utilization — 0.90, standard
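
For reference, the settings above assemble into a launch line roughly like this — a reconstruction from the flags listed, not the exact command used (the model path in particular is assumed):

```shell
# Hypothetical launch line combining the settings discussed above.
# Model path and flag values are assumptions, not the original invocation.
VLLM_MXFP4_BACKEND=marlin vllm serve Qwen/Qwen3.5-122B-A12B-NVFP4 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.90 \
  --moe-backend marlin
```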

14 tok/s held steady regardless. This is when "tuning problem" becomes "architecture problem."


Where the Problem Actually Is

Qwen3.5-122B-A12B is a hybrid architecture — not a standard Transformer. It combines two types of layers:

  • Standard attention layers — normal Transformer MHA/MLA, Marlin handles these with a specialized kernel
  • GDN (Gated Delta Network) layers — SSM-style recurrent layers, must stay in BF16

Marlin has fast, optimized kernels for standard linear layers and MoE expert GEMMs. It does not have a specialized kernel for GDN. Every GDN GEMM falls back to a generic compute path — functional, but slow.

This is also why the SupportsQuant fix from Part 1 is non-negotiable: without it, vLLM quantizes the GDN layers to NVFP4, which corrupts their recurrent hidden state and breaks the model entirely. The fix keeps GDN in BF16 — correct behavior, but it means GDN runs on the unoptimized path.
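
The idea behind that fix can be sketched generically — module names and the predicate below are hypothetical illustrations, not vLLM's actual SupportsQuant interface:

```python
# Illustrative sketch: exclude recurrent (GDN) modules from quantization so
# their weights stay BF16. The suffixes and function are hypothetical, not
# vLLM's real API.
UNQUANTIZED_MARKERS = ("gdn", "linear_attn")  # hypothetical GDN module name markers

def should_quantize(module_name: str) -> bool:
    """Quantize everything except modules matching the exclusion list."""
    return not any(marker in module_name for marker in UNQUANTIZED_MARKERS)

print(should_quantize("model.layers.3.mlp.experts.w13"))  # MoE expert GEMM -> NVFP4
print(should_quantize("model.layers.7.gdn.in_proj"))      # GDN -> stays BF16
```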

On GB10's 273 GB/s memory bandwidth, the math is straightforward. Qwen3.5-122B-A12B is MoE: only ~12B of the 122B parameters are active per token, so a decode step reads roughly 12B × 4-bit ≈ 6 GB of weights — about 45 tok/s at the pure bandwidth limit. Factor in attention, KV cache traffic, and kernel overheads, and the practical ceiling for 122B NVFP4 on GB10 is around 30 tok/s. To reach it, every layer needs an efficient kernel. When half the architecture (GDN) falls back to generic paths, the throughput averages down toward half the ceiling. 14 tok/s is where that average lands.
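
A back-of-envelope model of those numbers, assuming decode is bandwidth-bound on the ~12B active parameters (the "A12B") and taking the ~30 tok/s practical ceiling as given; the 3× slowdown for the generic path is purely illustrative, not measured:

```python
# Rough decode-throughput model for Qwen3.5-122B-A12B-NVFP4 on GB10.
# All inputs are assumptions from the article, not measurements.
BW_GBPS = 273          # GB10 memory bandwidth, GB/s
ACTIVE_PARAMS = 12e9   # active parameters per token (the "A12B")
BYTES_PER_PARAM = 0.5  # NVFP4: 4 bits per weight

def bandwidth_ceiling(active_params=ACTIVE_PARAMS,
                      bytes_per_param=BYTES_PER_PARAM,
                      bw_gbps=BW_GBPS):
    """Decode tok/s if each step only streams the active weights once."""
    return bw_gbps * 1e9 / (active_params * bytes_per_param)

def mixed_throughput(ceiling, slow_fraction, slowdown):
    """Throughput when `slow_fraction` of per-token work runs `slowdown`x slower."""
    return ceiling / ((1 - slow_fraction) + slow_fraction * slowdown)

print(f"pure streaming bound: ~{bandwidth_ceiling():.0f} tok/s")
print(f"half generic, 3x slower: ~{mixed_throughput(30, 0.5, 3):.0f} tok/s")
```

With a ~30 tok/s practical ceiling, half of the per-token work running a few times slower lands in the mid-teens — consistent with the observed 14 tok/s.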


What Was Gained

The result is 14 tok/s, not the ~30 tok/s ceiling. But the work wasn't wasted.

Confirmed:

  • The SM121 fix stack from Part 1 works correctly for Qwen3.5-122B. Output is coherent and accurate.
  • SupportsQuant is required — without it, GDN layers get quantized and output is garbage. This is a non-obvious bug that would cost hours to diagnose from scratch.
  • The theoretical decode ceiling on GB10 for 122B NVFP4 is ~30 tok/s. When Marlin adds GDN kernel support, this is the target to validate against.

Established:

  • 14 tok/s is the current floor, not a bug. There is nothing misconfigured.
  • The bottleneck is a software gap (missing Marlin GDN kernel), not a hardware limitation.
  • GB10 has sufficient compute for Qwen3.5-122B. The problem is squarely in the software stack.

The diagnostic pattern: if Qwen3.5-122B runs slowly on SM121 and you've already set VLLM_MXFP4_BACKEND=marlin and --moe-backend marlin, confirmed startup logs show Using backend: marlin, and output is correct — you've hit the kernel gap. Stop tuning. The number will move when the kernel arrives.


Conclusion

Qwen3.5-122B runs correctly on DGX Spark after the Part 1 patches. 14 tok/s is not a failure mode — it's the current software ceiling given what Marlin supports today.

For interactive use on GB10 right now: gpt-oss-120B at 59 tok/s is the working choice. Qwen3.5-122B at 14 tok/s is usable for offline batch workloads where latency matters less.

If you found this because your Qwen is running at 14 tok/s: you haven't missed a flag. You've found the ceiling. Come back when the Marlin GDN kernel lands.