DGX Spark · part 32
NVFP4 is 1.5× FP8 on a DGX Spark — but it's compression, not the FP4 cores
❯ cat --toc
- Plain-Language Version: 4-bit Isn't "Computes Faster," It's "Smaller File"
- Preface
- The Goal: Is NVFP4's Slowness a Format Problem or a Kernel Problem?
- Why Pure Dense: a Hybrid's BF16 Floor Eats the FP4 Win
- Result: Both FP4 Variants Beat FP8 by ~1.5×
- The Fastest Path Never Touches the FP4 Cores
- cudagraph ≈ eager: the Speed Didn't Come From Graph Capture
- CUTLASS #3227 Is Fixed in 4.5.1 — and It Changes Nothing for Speed
- What Was Gained
- Conclusion
TL;DR
On a DGX Spark (GB10, SM121, 273 GB/s) at single-stream decode, NVFP4 beats FP8 on a pure dense Qwen3-8B by ~1.5×: FP8 25.65 tok/s, NVFP4 W4A4 38.59, NVFP4A16 W4A16 40.85 (N=3 median, kv-cache fp8). The win is bandwidth, not compute — the fastest path (W4A16) dequantizes FP4 weights to BF16 and never uses the FP4 tensor cores. Earlier I saw FP8 beat NVFP4 on a hybrid Qwen3.6-35B-A3B; that was the un-quantizable BF16 GDN layers diluting the format, not a property of NVFP4. Rule: FP4 speedup ∝ the weight bytes it actually shrinks.
Plain-Language Version: 4-bit Isn't "Computes Faster," It's "Smaller File"
NVFP4 is a format that squeezes an AI model down to 4 bits per weight — half the size of FP8. The intuition "smaller should be faster" is right, but where the speed comes from trips a lot of people up.
When a DGX Spark (NVIDIA's desktop AI box) generates text, the real bottleneck is how fast it can move data from memory into the chip — not how fast the chip can do math. So a model that's half the size means half the bytes to move per word generated, which means faster. That has nothing to do with the chip's dedicated 4-bit math unit (the FP4 tensor core). In my tests the fastest path doesn't use it at all — it just unpacks the 4-bit weights back to a normal format and computes the regular way.
There's a catch: this speedup only counts for the parts you can actually compress. A plain dense model compresses end to end, so it gets faster. But some hybrid models keep a big chunk incompressible, and there FP4 doesn't help; FP8 is actually faster. So the answer isn't "NVFP4 is faster," it's "depends how much of your model is squeezable."
Preface
Four months ago I changed the oil on a DGX Spark and found its engine computer was from a different car — that was Part 1 (SM121 ≠ SM120). Part 19 then concluded: "NVFP4 is a trap on GB10, FP8 wins by 32%."
That conclusion was wrong, or more precisely, contaminated. Part 19 measured Qwen3.6-35B-A3B, a hybrid MoE with a large block of un-quantizable BF16 layers. This time I re-ran the comparison on a pure dense model to remove that variable, and the result flipped. This post is the payoff, and a lesson in clean baselines.
The Goal: Is NVFP4's Slowness a Format Problem or a Kernel Problem?
When NVFP4 underperforms on GB10, there are two very different explanations:
- Format problem — 4 bits is fundamentally a bad trade on this silicon.
- Kernel problem — the format is fine; the software hasn't matured enough to exploit the hardware.
Part 19 couldn't tell them apart because it tested on a hybrid model, which carries its own confound. To answer the question, you need a pure dense transformer so that "is this layer quantizable" stops being a variable.
Why Pure Dense: a Hybrid's BF16 Floor Eats the FP4 Win
GB10 single-stream decode is memory-bandwidth-bound (273 GB/s): the time per token is roughly "bytes moved per token ÷ bandwidth," so tokens per second scale with the inverse. NVFP4 halves the weight bytes relative to FP8, so in the ideal case decode throughput would roughly double, but only for the weights that are both quantizable and touched every step.
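As a rough sanity check on that model, the sketch below computes the bandwidth ceiling per format. It assumes a round 8e9 weights (not Qwen3-8B's exact parameter count), ignores KV-cache and activation traffic and any layers left unquantized, and counts NVFP4 at 4.5 bits per weight (a 4-bit value plus one 8-bit scale per 16-weight block). That accounting is mine, not a measurement.

```python
BANDWIDTH_BYTES_PER_S = 273e9   # GB10 memory bandwidth quoted in the text
PARAMS = 8e9                    # assumption: round figure for an "8B" dense model

# Assumed storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.0 + 8.0 / 16.0,  # assumption: 4-bit value + 8-bit scale per 16 weights
}


def decode_ceiling_tok_s(bits_per_weight, params=PARAMS,
                         bandwidth=BANDWIDTH_BYTES_PER_S):
    """Upper bound on tok/s if every weight is read exactly once per token."""
    bytes_per_token = params * bits_per_weight / 8.0
    return bandwidth / bytes_per_token


ceilings = {name: decode_ceiling_tok_s(bits)
            for name, bits in BITS_PER_WEIGHT.items()}

for name, tok_s in ceilings.items():
    print(f"{name:6s} ceiling ~ {tok_s:5.1f} tok/s")
```

Under these assumptions the ideal NVFP4-over-FP8 ratio is 8 / 4.5 ≈ 1.78×, a little under a clean doubling, and real runs will land below the ceiling because weights are not the only bytes moved per token.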
Qwen3.6-35B-A3B is hybrid: its GDN (Gated Delta Network) SSM layers are precision-sensitive and must stay BF16, excluded from quantization. My earlier profiling showed those BF16 layers ate ~59% of decode time. So whether you quantize the MoE experts to FP8 or FP4, that 59% doesn't move — FP4's benefit gets diluted, and FP8's mature kernel wins.
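The dilution can be put into numbers with an Amdahl-style bound. This is only a sketch: it treats the ~59% figure as a fixed share of decode time that no weight format changes, and it assumes the remaining share shrinks by an ideal 2× when going from FP8 to FP4. The function name and both inputs are illustrative.

```python
def diluted_speedup(fixed_fraction, shrink_factor):
    """Best-case speedup when only the non-fixed share of time scales with weight bytes."""
    if not 0.0 <= fixed_fraction <= 1.0:
        raise ValueError("fixed_fraction must be between 0 and 1")
    if shrink_factor <= 0:
        raise ValueError("shrink_factor must be positive")
    compressible = 1.0 - fixed_fraction
    return 1.0 / (fixed_fraction + compressible / shrink_factor)


# Assumption: 59% of decode time sits in BF16 layers that stay BF16.
hybrid = diluted_speedup(fixed_fraction=0.59, shrink_factor=2.0)
# A fully quantizable dense model has no fixed share.
dense = diluted_speedup(fixed_fraction=0.0, shrink_factor=2.0)

print(f"hybrid best case: {hybrid:.2f}x, dense best case: {dense:.2f}x")
```

With these inputs the hybrid's best case is about 1.26× against 2× for dense, and that is before any difference in kernel maturity is counted.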
A pure dense model has no such floor: all of it compresses. So I grabbed Qwen3-8B (dense) and quantized three variants to compare.
Result: Both FP4 Variants Beat FP8 by ~1.5×
Same dense base, single-stream decode, N=3 median, kv-cache fp8, 400 tokens. To get W4A4's true cudagraph number, this set ran on an older stack that can compile the sm_121a FP4 kernel (vLLM 0.20.2 + FlashInfer 0.6.9 + cutlass-dsl 4.4.2).
| Format | Repo | Gen tok/s (median) | Kernel | vs FP8 |
|---|---|---|---|---|
| FP8 (W8A8) | RedHatAI/Qwen3-8B-FP8-dynamic | 25.65 | Cutlass FP8 | 1.00× |
| NVFP4 (W4A4) | RedHatAI/Qwen3-8B-NVFP4 | 38.59 | FlashInfer Cutlass NVFP4 (FP4 MMA) | 1.50× |
| NVFP4A16 (W4A16) | ELVISIO/Qwen3-8B-NVFP4A16 | 40.85 | Marlin-family (dequant→BF16) | 1.59× |
On dense, the format delivers: half the weight bytes, ~1.5× the throughput. My "compression should be faster" hypothesis held — once the hybrid contamination was gone.
One honest caveat: the W4A16 model is from a different publisher (ELVISIO) than the FP8/W4A4 (RedHatAI), so the ~6% gap between W4A16 and W4A4 could be the quantization recipe, not the kernel. Treat "W4A16 > W4A4" as soft. What's robust is that both 4-bit formats beat 8-bit by ~1.5×.
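The ratios in the table, and the ~6% gap mentioned in the caveat, can be recomputed directly from the medians. The snippet only uses the three tok/s figures reported above; the dictionary keys are just labels.

```python
# Median generation throughput from the results table (tok/s).
results_tok_s = {
    "FP8 (W8A8)": 25.65,
    "NVFP4 (W4A4)": 38.59,
    "NVFP4A16 (W4A16)": 40.85,
}

baseline = results_tok_s["FP8 (W8A8)"]
speedups = {name: tok_s / baseline for name, tok_s in results_tok_s.items()}

for name, ratio in speedups.items():
    print(f"{name:18s} {ratio:.2f}x vs FP8")

# Relative gap between the two 4-bit paths.
w4a16_over_w4a4 = results_tok_s["NVFP4A16 (W4A16)"] / results_tok_s["NVFP4 (W4A4)"] - 1.0
print(f"W4A16 over W4A4: {w4a16_over_w4a4:.1%}")
```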
The Fastest Path Never Touches the FP4 Cores
Here's the part that keeps surprising me. The only path that actually runs FP4 math on GB10's tensor cores is W4A4 — and it came second (38.59). The winner, W4A16 (40.85), dequantizes the FP4 weights to BF16 and does a normal BF16 matmul. It uses FP4 purely as a storage format. It never fires the FP4 ALU.
That's the whole story of NVFP4 on this box in one sentence: the path that uses the fancy FP4 compute unit loses to the path that ignores it and treats FP4 as a smaller file. At batch=1 you're bandwidth-bound, so compute throughput, which is what the FP4 tensor cores provide, isn't the bottleneck. I'd expect them to pay off only under heavy concurrency, which single-stream chat is not and which I haven't measured.
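To make "FP4 as a storage format" concrete, here is a toy dequantize-then-multiply sketch in plain Python. It is not the Marlin kernel. The nibble order (low nibble first) and the single float scale per block are simplifying assumptions, and all names are mine. The eight magnitudes are the values a 4-bit E2M1 float can represent.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_fp4(code):
    """Map a 4-bit code to its float value; bit 3 is the sign bit."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def dequantize_block(packed, scale):
    """Unpack two FP4 codes per byte and apply the block scale.

    Assumption: the low nibble holds the first weight of each pair.
    """
    weights = []
    for byte in packed:
        weights.append(decode_fp4(byte & 0x0F) * scale)
        weights.append(decode_fp4(byte >> 4) * scale)
    return weights


def dot_w4a16(packed, scale, activations):
    """Weights are stored in 4 bits, but the multiply-accumulate uses ordinary floats."""
    weights = dequantize_block(packed, scale)
    return sum(w * a for w, a in zip(weights, activations))


packed_row = bytes([0x21, 0xF7])          # codes 1, 2, 7, 15
row = dequantize_block(packed_row, scale=2.0)
dot = dot_w4a16(packed_row, 2.0, [1.0, 1.0, 1.0, 1.0])
print(row, dot)
```

The point of the sketch is where the bytes live: only the packed 4-bit codes and a scale are read from memory, and everything after `dequantize_block` is a regular higher-precision multiply.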
cudagraph ≈ eager: the Speed Didn't Come From Graph Capture
When I first ran this on the current stack, W4A4 could only serve in --enforce-eager (its compiled path was broken — more below), while FP8 and W4A16 ran on cudagraph. That's an unfair comparison, so it's worth checking how much it cost.
It cost almost nothing. W4A4 in enforce-eager: 39.62 tok/s. W4A4 on the cudagraph-capable stack: 38.59. Within noise. And the FP8 / W4A16 numbers barely moved across two completely different toolchains (FP8 26.27 → 25.65, W4A16 41.64 → 40.85). On a bandwidth-bound workload, cudagraph mostly removes kernel-launch overhead that single-stream decode doesn't expose. Worth knowing before spending a day rebuilding a stack to "fix" eager mode.
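For scale, the relative gaps between each pair of runs can be computed from the figures in this section. Nothing here is new data; the helper name is illustrative.

```python
# (first run, second run) in tok/s, as reported in the text.
pairs = {
    "W4A4: enforce-eager vs cudagraph-capable stack": (39.62, 38.59),
    "FP8: current stack vs older stack": (26.27, 25.65),
    "W4A16: current stack vs older stack": (41.64, 40.85),
}


def relative_gap(a, b):
    """Absolute difference as a fraction of the smaller value."""
    return abs(a - b) / min(a, b)


gaps = {name: relative_gap(a, b) for name, (a, b) in pairs.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1%}")
```

All three gaps come out under 3%, which is small next to the roughly 50% difference between the 8-bit and 4-bit formats.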
CUTLASS #3227 Is Fixed in 4.5.1 — and It Changes Nothing for Speed
When I first ran this, W4A4 wouldn't compile at all: the image shipped cutlass-dsl 4.5.0, which emits invalid PTX for the sm_121a FP4 MMA (ptxas fatal: Unexpected instruction types for '_mma') — CUTLASS #3227. So I first measured W4A4 on a reverted pre-bug stack (FlashInfer 0.6.9 + cutlass-dsl 4.4.2 + vLLM 0.20.2), then went back to check the obvious fix.
The clean fix is cutlass-dsl 4.5.1: it ships sm_121a in admissible_archs natively — no monkeypatch — and on it W4A4 compiles straight onto the b12x cudagraph path (FlashInferB12xNvFp4LinearKernel, no enforce-eager, zero ptxas errors). The vLLM b12x integration (PR #40082) is in the v0.22.0 pre-release and the dispatch already lands in recent 0.21.1 dev builds.
Here's the punchline. Retested on 4.5.1, W4A4 is 38.22 tok/s — within 1% of the old-stack 38.59, and still below W4A16's 40.41. The fix unlocks clean compilation; it does not move throughput. On a bandwidth-bound box, a properly-compiled FP4 MMA kernel still can't beat the memory ceiling that W4A16's dequant path hits anyway. So the "had to pin an old stack" problem is gone — and the conclusion it might have threatened (FP4 compute buys nothing single-stream) is, if anything, now nailed down on the current toolchain.
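If you want a guard for the broken release in a launch script, a minimal sketch follows. It only encodes what is stated above (4.5.0 affected; 4.4.2 and 4.5.1 not). The distribution name `nvidia-cutlass-dsl` is my assumption about how the package is registered with pip, and the helper names are hypothetical.

```python
from importlib import metadata

# Releases reported above as emitting invalid PTX for the sm_121a FP4 MMA.
KNOWN_BAD = {(4, 5, 0)}


def parse_version(text):
    """Parse 'X.Y.Z' into a tuple of ints, ignoring local or pre-release suffixes."""
    parts = []
    for piece in text.split("+")[0].split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def hits_cutlass_3227(version_text):
    """True only for releases listed in KNOWN_BAD."""
    return parse_version(version_text) in KNOWN_BAD


def installed_version(dist_name="nvidia-cutlass-dsl"):
    """Return the installed version string, or None if the name is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("cutlass-dsl not found under the assumed distribution name")
elif hits_cutlass_3227(found):
    print(f"cutlass-dsl {found}: affected by CUTLASS #3227, expect ptxas errors on W4A4")
else:
    print(f"cutlass-dsl {found}: not in the known-bad list")
```

The check is deliberately an exact-match deny list rather than a version range, since the text only vouches for three specific releases.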
What Was Gained
What cost the most time: the rebuild to get W4A4 on cudagraph — which turned out to be the least important result. The cross-stack comparison it produced is methodologically muddy (different vLLM/FlashInfer/cutlass all at once). What redeemed it was the side effect: it proved cudagraph ≈ eager, which is the thing that actually mattered.
Transferable diagnostic: when a quantization format underperforms, separate the format from the kernel and the model. Run a pure dense transformer first, all on one frozen stack, before trusting any number from a hybrid or spec-decode path. The Part 19 "trap" conclusion was contaminated precisely because that step was skipped.
Universal pattern: fewer bits buy speed only for the bytes you actually shrink and read — not for the bytes a hybrid model keeps in BF16, and not through a compute unit that a bandwidth-bound workload never needs.
Conclusion
For single-user chat on a DGX Spark, as of 2026-05-30:
- NVFP4 beats FP8 by ~1.5× on a dense model — measured on Qwen3-8B single-stream. Use it.
- It's compression, not compute. W4A16 (dequant) is the fastest path and never touches the FP4 cores. Don't chase the FP4 ALU for single-stream.
- On hybrid MoE (e.g. the daily Qwen3.6), FP8 still wins because the BF16 GDN layers dilute FP4. Choose the format per model rather than assuming one format always wins.
- Don't rebuild a stack to "un-eager" W4A4. cudagraph ≈ eager on this bandwidth-bound box.
- The compiled FP4 path was broken on cutlass-dsl 4.5.0 (CUTLASS #3227); 4.5.1 fixes it — W4A4 compiles on the b12x cudagraph path natively. But I retested on 4.5.1 and it's the same speed (38.22, still ≤ W4A16): the fix buys clean compilation, not throughput.
Scope: 8B dense, single-stream (batch=1), GB10. High-concurrency serving is a different regime — FP4 compute may matter there, and I haven't measured it.
Also in this series: Part 1 — Why your DGX Spark says "!!!!!" · Part 19 — NVFP4 Is a Trap on GB10 · Part 25 — Nemotron 3 Nano at 74 tok/s
FAQ
- Is NVFP4 faster than FP8 on a DGX Spark (GB10)?
  On a pure dense model at single-stream decode, yes: about 1.5× faster. On Qwen3-8B I measured FP8 at 25.65 tok/s and NVFP4 at 38.59–40.85 tok/s. But it depends on the model: on a hybrid MoE like Qwen3.6-35B-A3B, FP8 actually wins, because the un-quantizable BF16 layers dominate the per-token bytes.
- Does NVFP4 use the GB10's FP4 tensor cores?
  Only the W4A4 path does, and it's not the fastest. The quickest single-stream path on GB10 is W4A16: it dequantizes FP4 weights to BF16 and computes in BF16, never touching the FP4 ALU. NVFP4's value on this box is smaller files (less bandwidth), not FP4 math.
- Why is NVFP4 faster on dense but slower on hybrid models?
  GB10 decode is memory-bandwidth-bound at batch=1. NVFP4 only helps for the bytes it actually shrinks. A dense transformer is 100% quantizable, so halving the weight bytes directly buys throughput. A hybrid model keeps its GDN/SSM layers in BF16. On Qwen3.6 those were ~59% of decode time, so FP4 can't move the needle.
- Does cudagraph matter for NVFP4 decode on GB10?
  Barely. W4A4 in enforce-eager hit 39.62 tok/s; on a cudagraph-capable stack it was 38.59. On a bandwidth-bound workload, cudagraph mostly removes kernel-launch overhead that single-stream decode doesn't expose.