DGX Spark · part 19
[Benchmark] NVFP4 Is a Trap on GB10: FP8 Wins by 32% (vLLM + SGLang Tested)
❯ cat --toc
- White-Box Version: Why Smaller Numbers Aren't Always Faster
- Preface
- 40.8 vs 53.8 tok/s: FP8 Beats NVFP4 by 32%
- Root Cause: SM121 Is Missing a Hardware Instruction
- SGLang: Three Attempts, Three Failures
- Driver 580.142: A Free +4% for FP8
- What Was Gained
- Most Time-Consuming: SGLang Compatibility Chain
- Transferable Diagnostic: Check Compute Capability Before Quantization Format
- Universal Pattern
- Conclusion
TL;DR
NVFP4 on DGX Spark's GB10 (SM121) is slower than FP8, not faster — FP8 wins by 32% (53.8 vs 40.8 tok/s). Root cause: SM121 lacks the cvt.e2m1x2 hardware instruction that SM120 (RTX 5090) and SM100 (B200) have natively. Qwen 3.6-35B-A3B FP8 on vLLM hits 53.8 tok/s (driver 580.142); NVFP4 on vLLM falls back to Marlin dequant at 40.8 tok/s. SGLang NVFP4 crashes entirely. This is silicon — no driver update fixes it.
White-Box Version: Why Smaller Numbers Aren't Always Faster
NVFP4 is NVIDIA's 4-bit floating point format — half the bits of FP8, so in theory it should move data faster and use less memory. On most Blackwell GPUs (RTX 5090, B200), it does. But DGX Spark uses a different chip — the GB10, codenamed SM121 — and this chip is missing a key instruction that makes FP4 math work at full speed.
The result: FP4 on DGX Spark is actually slower than FP8. I tested two inference engines (vLLM and SGLang) to confirm this isn't a software problem. It's hardware. If you have a DGX Spark, use FP8 — it's 32% faster than NVFP4 and actually works.
Preface
Part 7 showed NVFP4 running Gemma 4 at 52 tok/s on GB10. Part 8 confirmed vLLM beats Ollama by 30% on the same model. Both articles used NVFP4 and everything looked fine.
Then I tried Qwen 3.6-35B-A3B — a model that ships with both NVFP4 and FP8 checkpoints — and the numbers didn't add up. NVFP4 was slower. So I spent a day testing every combination of engine, quantization format, and driver version to find out why.
The answer is in the silicon.
40.8 vs 53.8 tok/s: FP8 Beats NVFP4 by 32%
All tests: Qwen 3.6-35B-A3B on DGX Spark (GB10, 128GB unified memory), single-user generation.
| Engine | Quantization | Driver | tok/s | Notes |
|---|---|---|---|---|
| vLLM 0.19.1 | FP8 | 580.142 | 53.8 | Best config |
| vLLM 0.19.1 | FP8 | 580.126 | 51.8 | Previous driver |
| vLLM 0.19.1 | NVFP4 | 580.142 | 40.8 | Marlin fallback (-24%) |
| SGLang 0.5.4 (spark) | NVFP4 | 580.142 | — | Model not recognized |
| SGLang 0.5.10 | NVFP4 | 580.142 | — | CUDA kernel crash |
| SGLang 0.5.10 + no CUDA graph | NVFP4 | 580.142 | — | OOM → crash |
FP8 with the latest driver is the only configuration that both works and performs well.
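One number, two baselines: "FP8 is 32% faster" and "NVFP4 is 24% slower" both come from the same pair of measurements in the table above. A quick sanity check:

```python
# Same two measurements from the table above, two baselines.
fp8_tok_s = 53.8    # vLLM FP8, driver 580.142
nvfp4_tok_s = 40.8  # vLLM NVFP4 via Marlin fallback

speedup = fp8_tok_s / nvfp4_tok_s - 1   # FP8 relative to NVFP4: ~31.9%
slowdown = 1 - nvfp4_tok_s / fp8_tok_s  # NVFP4 relative to FP8: ~24.2%

print(f"FP8 is {speedup:.1%} faster than NVFP4")
print(f"NVFP4 is {slowdown:.1%} slower than FP8")
```

Whenever a percentage in this article looks off by a few points, it's this baseline difference, not a different measurement.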
Root Cause: SM121 Is Missing a Hardware Instruction
The GB10 chip in DGX Spark is SM121. It's Blackwell architecture, but it's not the same Blackwell as RTX 5090 (SM120) or B200 (SM100).
The critical difference: SM121 does not have the cvt.rn.satfinite.e2m1x2.f32 PTX instruction. This instruction converts between FP4 (e2m1) and FP32 in hardware — a single clock cycle on SM120/SM100.
Without it, SM121 has two options:
- Software emulation — the driver added `cvt.e2m1x2` support in PTX 8.6+ (driver 580.142), but it runs through software emulation. NVIDIA's changelog confirms this is ~30x slower than native hardware execution.
- Fallback path — vLLM detects that SM121 is not in `FP4_ARCHS` and routes through Marlin kernels, which decompress FP4 weights to BF16 before computation. This is the 40.8 tok/s path.
This is a silicon-level limitation. No driver update, no firmware flash, no software patch can add a missing transistor circuit. The instruction simply does not exist on this die.
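To make concrete what the software path has to do, here's an illustrative decoder for e2m1 — 1 sign bit, 2 exponent bits, 1 mantissa bit, the element format inside NVFP4 before per-block scaling. This is a plain-Python sketch for intuition, not the Marlin kernel; on SM120/SM100 this conversion is a single hardware instruction.

```python
# Illustrative e2m1 (FP4) nibble -> float decoder.
# Layout: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.
# Representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.

def decode_e2m1(nibble: int) -> float:
    sign = -1.0 if nibble & 0b1000 else 1.0
    exp = (nibble >> 1) & 0b11
    man = nibble & 0b1
    if exp == 0:
        mag = man * 0.5                      # subnormal: 0 or 0.5
    else:
        mag = (1 + man * 0.5) * 2 ** (exp - 1)
    return sign * mag

# All 16 code points: 0, 0.5, 1, 1.5, 2, 3, 4, 6 and their negatives.
print([decode_e2m1(n) for n in range(16)])
```

A Marlin-style fallback has to run this kind of decode (vectorized, plus block scales) for every weight before the matmul — which is exactly the overhead FP8 avoids on this chip.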
For comparison:
| GPU | Compute Capability | Native FP4 | NVFP4 Performance |
|---|---|---|---|
| B200 | SM100 | ✅ Yes | Full speed |
| RTX 5090 | SM120 | ✅ Yes | Full speed |
| DGX Spark GB10 | SM121 | ❌ No | Marlin fallback or crash |
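The table above collapses into a lookup you could drop into a deployment script. The support map is transcribed from this article's findings, not from any official NVIDIA API — treat it as a sketch:

```python
# Native FP4 support by compute capability, per the table above.
# Keys are (major, minor) as returned by torch.cuda.get_device_capability().
NATIVE_FP4 = {
    (10, 0): True,   # B200 (SM100)
    (12, 0): True,   # RTX 5090 (SM120)
    (12, 1): False,  # DGX Spark GB10 (SM121) -- Marlin fallback or crash
}

def pick_quant_format(capability: tuple[int, int]) -> str:
    """Prefer NVFP4 only where it runs at native speed; otherwise FP8."""
    return "nvfp4" if NATIVE_FP4.get(capability, False) else "fp8"

print(pick_quant_format((12, 1)))  # GB10 should not use NVFP4
```

Defaulting unknown capabilities to FP8 is the conservative choice: a larger format with native support beats a smaller one without it.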
SGLang: Three Attempts, Three Failures
I tried SGLang to rule out vLLM's fallback path as the bottleneck — maybe SGLang shipped a different NVFP4 kernel that could work on SM121. It didn't.
Attempt 1: SGLang 0.5.4 (spark image)
The lmsysorg/sglang:spark Docker image exists specifically for DGX Spark with SM121 patches. But it ships with transformers 4.57.1, which doesn't recognize Qwen 3.6's qwen3_5_moe architecture. Dead on arrival.
Attempt 2: SGLang 0.5.10 (latest)
Upgrading to the latest SGLang with transformers 5.3.0 fixed model recognition. The TRITON_PTXAS_PATH=/usr/local/cuda/bin/ptxas workaround resolved the sm_121a compilation error. The model loaded, selected the native NVFP4 path — and then crashed on the first CUDA kernel call.
The DeltaNet hybrid attention in Qwen 3.6 combined with NVFP4 on SM121 hits a kernel that simply doesn't exist.
Attempt 3: SGLang 0.5.10 + --disable-cuda-graph
I disabled CUDA graphs to reduce memory overhead. It OOMed immediately, even after lowering `--mem-fraction-static`. The model needs the memory that CUDA graphs would normally manage.
Driver 580.142: A Free +4% for FP8
While NVFP4 was a dead end, upgrading from driver 580.126 to 580.142 gave an unexpected FP8 boost:
| Driver | FP8 tok/s | Change |
|---|---|---|
| 580.126 | 51.8 | baseline |
| 580.142 | 53.8 | +3.9% |
Key changes in 580.142:
- UMA memory reporting fix — GB10's unified memory pressure values were reported incorrectly. This may explain some of the SGLang OOM issues on older drivers.
- SM121 ISA fallback fix — the driver no longer falls back to SM80 (Ampere) instruction paths. FP8 kernels now run native SM121 code instead of compatibility mode.
- Status upgrade — from Beta to Production/Recommended.
If you're running a DGX Spark, update to 580.142. It's a free speed boost for FP8 workloads.
What Was Gained
Most Time-Consuming: SGLang Compatibility Chain
The three SGLang attempts took the most time — not because each one was hard, but because each failure revealed the next dependency. Old transformers → upgrade → ptxas error → workaround → kernel crash → disable CUDA graph → OOM. Five links in a chain, each one requiring a full model load cycle (~3 minutes) to discover.
Transferable Diagnostic: Check Compute Capability Before Quantization Format
Before choosing a quantization format, check what your GPU actually supports at the silicon level:
```
python3 -c "import torch; print(torch.cuda.get_device_capability())"
```
If the output is (12, 1) — that's SM121, and NVFP4 will not run at native speed. Cross-reference against the CUDA compute capability table to verify which instructions your hardware supports.
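If you want this probe to degrade gracefully on machines without a GPU (or without PyTorch), here's a slightly more defensive version. `torch.cuda.get_device_capability()` and `torch.cuda.is_available()` are the real PyTorch API; everything else is plumbing:

```python
# Probe compute capability without assuming a GPU (or torch) is present.
def get_capability() -> "tuple[int, int] | None":
    try:
        import torch
        if not torch.cuda.is_available():
            return None
        return torch.cuda.get_device_capability()
    except ImportError:
        return None

cap = get_capability()
if cap is None:
    print("no CUDA device visible from this environment")
elif cap == (12, 1):
    print("SM121 (GB10): NVFP4 will not run at native speed -- prefer FP8")
else:
    print(f"compute capability {cap}: check NVIDIA's table for FP4 support")
```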
Universal Pattern
Fewer bits does not mean faster inference. The path from weight format to actual computation passes through hardware instructions — and if the instruction doesn't exist, the software workaround will always be slower than a larger format with native support.
Conclusion
For DGX Spark (GB10/SM121) owners:
- Use FP8, not NVFP4. FP8 is 32% faster because it has native hardware support on SM121. NVFP4 does not.
- Update to driver 580.142. Free +4% on FP8, plus fixes for memory reporting and ISA fallback.
- Use vLLM, not SGLang — for NVFP4 models on SM121, SGLang crashes. For FP8, vLLM is the tested, stable path.
- Don't try to patch around it. Adding SM121 to vLLM's `FP4_ARCHS` produces garbage output. SGLang's native NVFP4 path crashes. The limitation is hardware.
When to revisit: If vLLM or SGLang announce SM121-native FP4 kernels, or if CUTLASS 4.x ships SM121 FP4 support via bit-manipulation workarounds. Until then, FP8 at 53.8 tok/s is the ceiling.
Also in this series: Part 7 — Gemma 4 NVFP4 at 52 tok/s · Part 8 — vLLM vs Ollama · Part 14 — Gemma 4 Complete Guide · Part 18 — Scaffold Transfers Three Models
FAQ
- Why is NVFP4 slower than FP8 on DGX Spark GB10?
- GB10's SM121 GPU lacks the cvt.rn.satfinite.e2m1x2.f32 PTX instruction for native FP4↔FP32 conversion. SM120 (RTX 5090) and SM100 (B200) have it in hardware. On SM121, NVFP4 either falls back to Marlin dequant (40.8 tok/s, -24% vs FP8) or crashes. This is a silicon-level limitation — no driver or firmware update can fix it.
- What's the fastest Qwen 3.6 inference speed on DGX Spark?
- 53.8 tok/s with Qwen 3.6-35B-A3B FP8 on vLLM 0.19.1 with NVIDIA driver 580.142. Upgrading from driver 580.126 gave a free +4% boost.
- Does SGLang work on DGX Spark with Qwen 3.6?
- Not reliably. SGLang 0.5.4 (spark image) doesn't recognize Qwen 3.6's architecture. SGLang 0.5.10 recognizes it but crashes on CUDA kernel execution with NVFP4. FP8 on SGLang was not tested because the NVFP4 crashes blocked the pipeline.
- Should I use NVFP4 or FP8 on DGX Spark?
- FP8. Always. NVFP4 on GB10 (SM121) is ~24% slower than FP8 (put the other way: FP8 is 32% faster) due to hardware limitations, and SGLang NVFP4 crashes entirely. FP8 on vLLM 0.19.1 with driver 580.142 gives 53.8 tok/s — that's the ceiling for now.