DGX Spark · part 32

NVFP4 is 1.5× FP8 on a DGX Spark — but it's compression, not the FP4 cores

TL;DR

On a DGX Spark (GB10, SM121, 273 GB/s) at single-stream decode, NVFP4 beats FP8 on a pure dense Qwen3-8B by ~1.5×: FP8 25.65 tok/s, NVFP4 W4A4 38.59, NVFP4A16 W4A16 40.85 (N=3 median, kv-cache fp8). The win is bandwidth, not compute — the fastest path (W4A16) dequantizes FP4 weights to BF16 and never uses the FP4 tensor cores. Earlier I saw FP8 beat NVFP4 on a hybrid Qwen3.6-35B-A3B; that was the un-quantizable BF16 GDN layers diluting the format, not a property of NVFP4. Rule: FP4 speedup ∝ the weight bytes it actually shrinks.

Plain-Language Version: 4-bit Isn't "Computes Faster," It's "Smaller File"

NVFP4 is a format that squeezes an AI model down to 4 bits per weight — half the size of FP8. The intuition "smaller should be faster" is right, but where the speed comes from trips a lot of people up.

When a DGX Spark (NVIDIA's desktop AI box) generates text, the real bottleneck is how fast it can move data from memory into the chip — not how fast the chip can do math. So a model that's half the size means half the bytes to move per word generated, which means faster. That has nothing to do with the chip's dedicated 4-bit math unit (the FP4 tensor core). In my tests the fastest path doesn't use it at all — it just unpacks the 4-bit weights back to a normal format and computes the regular way.

There's a catch: this speedup only counts for the parts you can actually compress. A plain dense model compresses end to end, so it gets faster. But some hybrid models keep a big chunk incompressible, and there FP4 doesn't help; FP8 is actually faster. So the answer isn't "NVFP4 is faster," it's "depends how much of your model is squeezable."


Preface

Four months ago I changed the oil on a DGX Spark and found its engine computer was from a different car — that was Part 1 (SM121 ≠ SM120). Part 19 then concluded: "NVFP4 is a trap on GB10, FP8 wins by 32%."

That conclusion was wrong — or more precisely, contaminated. Part 19 measured Qwen3.6-35B-A3B, a hybrid MoE with a large block of un-quantizable BF16 layers. This time I re-ran it on a pure dense model to remove that variable, and the result flipped. This is the payoff, and a lesson in clean baselines.

The Goal: Is NVFP4's Slowness a Format Problem or a Kernel Problem?

When NVFP4 underperforms on GB10, there are two very different explanations:

  • Format problem — 4 bits is fundamentally a bad trade on this silicon.
  • Kernel problem — the format is fine; the software hasn't matured enough to exploit the hardware.

Part 19 couldn't tell them apart because it tested on a hybrid model, which carries its own confound. To answer the question, you need a pure dense transformer so that "is this layer quantizable" stops being a variable.

Why Pure Dense: a Hybrid's BF16 Floor Eats the FP4 Win

GB10 single-stream decode is memory-bandwidth-bound (273 GB/s): each token's latency is roughly "bytes moved per token ÷ bandwidth," so tokens per second scale with the inverse of the bytes read. NVFP4 halves the weight bytes relative to FP8, so decode should roughly double, but only for the weights that are both quantizable and touched every step.
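As a rough sanity check on that model, the ceiling can be computed directly. This is a minimal sketch, not a measurement: the 8e9 parameter count is a round illustrative assumption (not taken from any model card), it pretends every weight is read exactly once per token, and it ignores KV-cache reads, quantization scales, and any layers left unquantized.

```python
# Bandwidth-bound decode ceiling: tok/s <= bandwidth / bytes read per token.
BANDWIDTH_BYTES_PER_S = 273e9  # GB10 memory bandwidth quoted in the text


def decode_ceiling_tok_s(n_params: float, bits_per_weight: float,
                         bandwidth: float = BANDWIDTH_BYTES_PER_S) -> float:
    """Upper bound on single-stream tok/s if every weight is read once per token."""
    bytes_per_token = n_params * bits_per_weight / 8
    return bandwidth / bytes_per_token


N_PARAMS = 8e9  # ASSUMPTION: round number for an "8B" dense model

fp8_ceiling = decode_ceiling_tok_s(N_PARAMS, 8)
fp4_ceiling = decode_ceiling_tok_s(N_PARAMS, 4)
print(f"FP8 ceiling: {fp8_ceiling:.1f} tok/s")
print(f"FP4 ceiling: {fp4_ceiling:.1f} tok/s")
print(f"ideal FP4/FP8 ratio: {fp4_ceiling / fp8_ceiling:.2f}x")
```

Real throughput will land below these ceilings, since the sketch leaves out everything except weight reads. The useful part is the ratio: halving the bytes doubles the ceiling, whatever the absolute numbers turn out to be.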

Qwen3.6-35B-A3B is hybrid: its GDN (Gated Delta Network) SSM layers are precision-sensitive and must stay BF16, excluded from quantization. My earlier profiling showed those BF16 layers ate ~59% of decode time. So whether you quantize the MoE experts to FP8 or FP4, that 59% doesn't move — FP4's benefit gets diluted, and FP8's mature kernel wins.
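The dilution follows the usual Amdahl's-law shape. A minimal sketch, assuming the ~59% figure is the share of decode time spent in BF16 layers under the 8-bit baseline (the profiling baseline isn't stated here, so that is an assumption) and that going from 8-bit to 4-bit speeds up the remaining share by an ideal 2×:

```python
def overall_speedup(fixed_fraction: float, local_speedup: float) -> float:
    """Amdahl's law: only (1 - fixed_fraction) of the time gets faster."""
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / local_speedup)


hybrid = overall_speedup(0.59, 2.0)  # BF16 floor: 59% of decode time stays put
dense = overall_speedup(0.0, 2.0)    # no floor: everything shrinks
print(f"hybrid best case: {hybrid:.2f}x, dense best case: {dense:.2f}x")
```

Under those assumptions the best case on the hybrid is roughly 1.26×, against 2× on dense. That leaves little margin, so a modest kernel-efficiency gap in FP8's favor is enough to erase it.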

A pure dense model has no such floor: all of it compresses. So I grabbed Qwen3-8B (dense) and quantized three variants to compare.

Result: Both FP4 Variants Beat FP8 by ~1.5×

Same dense base, single-stream decode, N=3 median, kv-cache fp8, 400 tokens. To get W4A4's true cudagraph number, this set ran on an older stack that can compile the sm_121a FP4 kernel (vLLM 0.20.2 + FlashInfer 0.6.9 + cutlass-dsl 4.4.2).
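The measurement protocol (N=3, median, fixed token count) can be sketched as below. `run_once` is a hypothetical callable standing in for whatever actually drives the server; it is not a vLLM API, and it is assumed to return the number of generated tokens and the decode time in seconds.

```python
from statistics import median
from typing import Callable, Tuple


def median_tok_s(run_once: Callable[[], Tuple[int, float]], n_runs: int = 3) -> float:
    """Run the benchmark n_runs times and return the median tokens/second."""
    rates = []
    for _ in range(n_runs):
        tokens, seconds = run_once()
        rates.append(tokens / seconds)
    return median(rates)


# Example with canned timings (made-up numbers, for illustration only).
_fake_runs = iter([(400, 10.0), (400, 8.0), (400, 12.5)])
example = median_tok_s(lambda: next(_fake_runs))
print(f"{example:.1f} tok/s")
```

The median is used rather than the mean so that a single slow or fast run doesn't shift the reported number.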

| Format | Repo | Gen tok/s (median) | Kernel | vs FP8 |
|---|---|---|---|---|
| FP8 (W8A8) | RedHatAI/Qwen3-8B-FP8-dynamic | 25.65 | Cutlass FP8 | 1.00× |
| NVFP4 (W4A4) | RedHatAI/Qwen3-8B-NVFP4 | 38.59 | FlashInfer Cutlass NVFP4 (FP4 MMA) | 1.50× |
| NVFP4A16 (W4A16) | ELVISIO/Qwen3-8B-NVFP4A16 | 40.85 | Marlin-family (dequant→BF16) | 1.59× |

On dense, the format delivers: half the weight bytes, ~1.5× the throughput. My "compression should be faster" hypothesis held — once the hybrid contamination was gone.

One honest caveat: the W4A16 model is from a different publisher (ELVISIO) than the FP8/W4A4 (RedHatAI), so the ~6% gap between W4A16 and W4A4 could be the quantization recipe, not the kernel. Treat "W4A16 > W4A4" as soft. What's robust is that both 4-bit formats beat 8-bit by ~1.5×.
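The ratios in the table and the ~6% gap are plain arithmetic on the medians. A short check, using only the numbers reported above:

```python
# Median decode throughput (tok/s) as reported in the results table.
MEDIAN_TOK_S = {
    "FP8 (W8A8)": 25.65,
    "NVFP4 (W4A4)": 38.59,
    "NVFP4A16 (W4A16)": 40.85,
}

baseline = MEDIAN_TOK_S["FP8 (W8A8)"]
vs_fp8 = {name: rate / baseline for name, rate in MEDIAN_TOK_S.items()}
w4a16_over_w4a4 = MEDIAN_TOK_S["NVFP4A16 (W4A16)"] / MEDIAN_TOK_S["NVFP4 (W4A4)"] - 1

for name, ratio in vs_fp8.items():
    print(f"{name}: {ratio:.2f}x vs FP8")
print(f"W4A16 over W4A4: {w4a16_over_w4a4:.1%}")
```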

The Fastest Path Never Touches the FP4 Cores

Here's the part that keeps surprising me. The only path that actually runs FP4 math on GB10's tensor cores is W4A4 — and it came second (38.59). The winner, W4A16 (40.85), dequantizes the FP4 weights to BF16 and does a normal BF16 matmul. It uses FP4 purely as a storage format. It never fires the FP4 ALU.

That's the whole story of NVFP4 on this box in one sentence: the path that uses the fancy FP4 compute unit loses to the path that ignores it and treats FP4 as a smaller file. At batch=1 you're bandwidth-bound, so compute throughput — the thing FP4 tensor cores give you — isn't the bottleneck. You'd only see them pay off under heavy concurrency, which single-stream chat is not.
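To make "FP4 as a storage format" concrete, here is a pure-Python sketch of the dequantize-then-multiply idea. It is illustrative only and is not the Marlin kernel. The E2M1 value table and the 16-element scale blocks follow the NVFP4 format as I understand it; the low-nibble-first packing order, the plain-float scales (real scales are stored in FP8, and an additional per-tensor scale is omitted here), and the function names are all assumptions.

```python
# Magnitudes for the 3 low bits of a 4-bit E2M1 code; bit 3 is the sign.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # weights that share one scale


def decode_fp4(code: int) -> float:
    """Map one 4-bit code to its float value."""
    magnitude = E2M1_MAGNITUDES[code & 0b0111]
    return -magnitude if code & 0b1000 else magnitude


def dequantize_row(packed: bytes, block_scales: list) -> list:
    """Unpack two codes per byte (ASSUMED low nibble first) and apply block scales."""
    values = []
    for byte in packed:
        for code in (byte & 0x0F, byte >> 4):
            scale = block_scales[len(values) // BLOCK_SIZE]
            values.append(decode_fp4(code) * scale)
    return values


def matvec_w4a16(packed_rows, scale_rows, activations):
    """Dequantize each weight row, then do an ordinary float dot product."""
    return [
        sum(w * x for w, x in zip(dequantize_row(packed, scales), activations))
        for packed, scales in zip(packed_rows, scale_rows)
    ]


row = bytes([0x21, 0xF9])  # 4-bit codes 1, 2, 9, 15
weights = dequantize_row(row, [2.0])
print(weights)
```

The multiply itself never sees a 4-bit value. Only the bytes read from memory shrink, which is the whole benefit on a bandwidth-bound decode.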

cudagraph ≈ eager: the Speed Didn't Come From Graph Capture

When I first ran this on the current stack, W4A4 could only serve in --enforce-eager (its compiled path was broken — more below), while FP8 and W4A16 ran on cudagraph. That's an unfair comparison, so it's worth checking how much it cost.

It cost almost nothing. W4A4 in enforce-eager: 39.62 tok/s. W4A4 on the cudagraph-capable stack: 38.59. Within noise. And the FP8 / W4A16 numbers barely moved across two completely different toolchains (FP8 26.27 → 25.65, W4A16 41.64 → 40.85). On a bandwidth-bound workload, cudagraph mostly removes kernel-launch overhead that single-stream decode doesn't expose. Worth knowing before spending a day rebuilding a stack to "fix" eager mode.
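For a sense of scale, the relative changes between the paired numbers above can be computed directly. This only restates the quoted figures; run-to-run variance isn't reported, so how large a gap counts as "noise" remains a judgment call.

```python
# (first number, second number) decode tok/s pairs quoted in the paragraph above.
PAIRS = {
    "W4A4 eager -> cudagraph": (39.62, 38.59),
    "FP8 across stacks": (26.27, 25.65),
    "W4A16 across stacks": (41.64, 40.85),
}


def relative_change(before: float, after: float) -> float:
    """Signed fractional change from 'before' to 'after'."""
    return (after - before) / before


changes = {name: relative_change(a, b) for name, (a, b) in PAIRS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

All three shifts are under 3% and point the same way, which is consistent with a small stack-wide offset rather than a cudagraph effect specific to W4A4.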

CUTLASS #3227 Is Fixed in 4.5.1 — and It Changes Nothing for Speed

When I first ran this, W4A4 wouldn't compile at all: the image shipped cutlass-dsl 4.5.0, which emits invalid PTX for the sm_121a FP4 MMA (ptxas fatal: Unexpected instruction types for '_mma') — CUTLASS #3227. So I first measured W4A4 on a reverted pre-bug stack (FlashInfer 0.6.9 + cutlass-dsl 4.4.2 + vLLM 0.20.2), then went back to check the obvious fix.

The clean fix is cutlass-dsl 4.5.1: it ships sm_121a in admissible_archs natively — no monkeypatch — and on it W4A4 compiles straight onto the b12x cudagraph path (FlashInferB12xNvFp4LinearKernel, no enforce-eager, zero ptxas errors). The vLLM b12x integration (PR #40082) is in the v0.22.0 pre-release and the dispatch already lands in recent 0.21.1 dev builds.

Here's the punchline. Retested on 4.5.1, W4A4 is 38.22 tok/s — within 1% of the old-stack 38.59, and still below W4A16's 40.41. The fix unlocks clean compilation; it does not move throughput. On a bandwidth-bound box, a properly-compiled FP4 MMA kernel still can't beat the memory ceiling that W4A16's dequant path hits anyway. So the "had to pin an old stack" problem is gone — and the conclusion it might have threatened (FP4 compute buys nothing single-stream) is, if anything, now nailed down on the current toolchain.
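If you want a guard in a deployment script, the three cutlass-dsl versions discussed above can be encoded as a small lookup. This is a sketch that records only what this post observed; any other version is reported as untested rather than guessed. How you obtain the installed version string depends on the distribution name, which is deliberately left out here.

```python
# Status of the sm_121a FP4 MMA compile path per cutlass-dsl version, as observed in this post.
KNOWN_STATUS = {
    (4, 4, 2): "compiles (pre-bug)",
    (4, 5, 0): "broken (CUTLASS #3227: invalid PTX)",
    (4, 5, 1): "compiles (fixed)",
}


def fp4_compile_status(version: str) -> str:
    """Classify a version string; anything not tested in the post is 'untested'."""
    try:
        key = tuple(int(part) for part in version.split(".")[:3])
    except ValueError:
        return "untested"
    return KNOWN_STATUS.get(key, "untested")


for v in ("4.4.2", "4.5.0", "4.5.1"):
    print(v, "->", fp4_compile_status(v))
```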

What Was Gained

What cost the most time: the rebuild to get W4A4 on cudagraph — which turned out to be the least important result. The cross-stack comparison it produced is methodologically muddy (different vLLM/FlashInfer/cutlass all at once). What redeemed it was the side effect: it proved cudagraph ≈ eager, which is the thing that actually mattered.

Transferable diagnostic: when a quantization format underperforms, separate the format from the kernel and the model. Run a pure dense transformer first, all on one frozen stack, before trusting any number from a hybrid or spec-decode path. The Part 19 "trap" conclusion was contaminated precisely because that step was skipped.

Universal pattern: fewer bits buy speed only for the bytes you actually shrink and read — not for the bytes a hybrid model keeps in BF16, and not through a compute unit that a bandwidth-bound workload never needs.

Conclusion

For single-user chat on a DGX Spark, as of 2026-05-30:

  1. NVFP4 beats FP8 by ~1.5× on a dense model — measured on Qwen3-8B single-stream. Use it.
  2. It's compression, not compute. W4A16 (dequant) is the fastest path and never touches the FP4 cores. Don't chase the FP4 ALU for single-stream.
  3. On hybrid MoE (e.g. the daily Qwen3.6), FP8 still wins: the BF16 GDN layers dilute FP4. Pick the format per model, not as a blanket rule.
  4. Don't rebuild a stack to "un-eager" W4A4. cudagraph ≈ eager on this bandwidth-bound box.
  5. The compiled FP4 path was broken on cutlass-dsl 4.5.0 (CUTLASS #3227); 4.5.1 fixes it — W4A4 compiles on the b12x cudagraph path natively. But I retested on 4.5.1 and it's the same speed (38.22, still ≤ W4A16): the fix buys clean compilation, not throughput.

Scope: 8B dense, single-stream (batch=1), GB10. High-concurrency serving is a different regime — FP4 compute may matter there, and I haven't measured it.


Also in this series: Part 1 — Why your DGX Spark says "!!!!!" · Part 19 — NVFP4 Is a Trap on GB10 · Part 25 — Nemotron 3 Nano at 74 tok/s

FAQ

Is NVFP4 faster than FP8 on a DGX Spark (GB10)?
On a pure dense model at single-stream decode, yes: about 1.5× faster. On Qwen3-8B I measured FP8 at 25.65 tok/s and NVFP4 at 38.59–40.85 tok/s. But it depends on the model: on a hybrid MoE like Qwen3.6-35B-A3B, FP8 actually wins, because the un-quantizable BF16 layers dominate per-token decode time.

Does NVFP4 use the GB10's FP4 tensor cores?
Only the W4A4 path does, and it's not the fastest. The quickest single-stream path on GB10 is W4A16: it dequantizes FP4 weights to BF16 and computes in BF16, never touching the FP4 ALU. NVFP4's value on this box is smaller files (less bandwidth), not FP4 math.

Why is NVFP4 faster on dense but slower on hybrid models?
GB10 decode is memory-bandwidth-bound at batch=1. NVFP4 only helps for the bytes it actually shrinks. A dense transformer is 100% quantizable, so halving the weight bytes directly buys throughput. A hybrid model keeps its GDN/SSM layers in BF16; on Qwen3.6 those were ~59% of decode time, so FP4 can't move the needle.

Does cudagraph matter for NVFP4 decode on GB10?
Barely. W4A4 in enforce-eager hit 39.62 tok/s; on a cudagraph-capable stack it was 38.59. On a bandwidth-bound workload, cudagraph mostly removes kernel-launch overhead that single-stream decode doesn't expose.