DGX Spark · part 36

[Benchmark] Gemma 4 12B Omni on DGX Spark: Weight-Only NVFP4 Beats W4A4 (and Keeps Multimodal)

TL;DR

I quantized Google's brand-new omni Gemma 4 12B (text + image + audio + video) on a DGX Spark GB10 and benchmarked it under vLLM. Headline: weight-only NVFP4 (W4A16) is the winner — 7.7 GB, 24.9 tok/s, and all four modalities still work. The "obvious" full NVFP4 (W4A4) buys you nothing: it's no faster (23.9 vs 24.9 tok/s — within run-to-run noise) and it breaks image/audio/video, because its activation quantization is calibrated on text and clips the out-of-distribution multimodal embeddings. FP8 sits in the middle at 15.9 tok/s / 13 GB. One trap: in plain transformers every quant looks slower than BF16 (no native kernel) — the real speedups only show up under vLLM.

Plain-language version

Gemma 4 12B is a single model that reads text, images, audio, and video. Quantizing it makes the file smaller and (on the right runtime) faster. The surprising part: the most aggressive 4-bit setting (quantize both the weights and the activations) actually breaks the image/audio/video understanding and isn't even faster. The setting that quantizes only the weights is smaller, faster, and keeps every capability. So for a multimodal model, less-aggressive quantization wins.


The numbers: NVFP4 weight-only is the winner

All on one DGX Spark (GB10, sm_121a, 273 GB/s), vLLM 0.22.1 native, single-stream decode, warm, EN+ZH prompts:

Format                      Disk     tok/s   vs BF16   Omni
BF16                        23 GB    7.7
FP8 dynamic                 13 GB    15.9    2.1×
NVFP4 W4A4                  7.7 GB   23.9    3.1×      ❌ broken
NVFP4 W4A16 (weight-only)   7.7 GB   24.9    3.2×

The weight-only build ties W4A4 for the smallest checkpoint (7.7 GB), is at least as fast, and is the only 4-bit build that keeps the model's multimodal capabilities. The naive choice, full W4A4, gives no speed upside (the gap is within noise) and loses multimodal entirely. Its extra activation quantization is pure downside here.

Why W4A4 breaks the omni capabilities

NVFP4 W4A4 quantizes both the weights and the activations to 4-bit. The activation scales come from a calibration pass — and I calibrated on text (UltraChat), like everyone does for an LLM.

That's the bug. When you feed an image or an audio clip, the encoder-free projectors push raw patch / waveform embeddings into the language model. Those activations live in a different distribution than text. The text-calibrated 4-bit activation range clips them, and the model falls apart: in my tests, image and video returned empty output and audio came back as garbled nonsense, while plain text stayed perfectly coherent.

The tell was the audio: it broke even though the audio projector itself was kept in BF16. The culprit wasn't a quantized projector — it was the text tower's activation quantization being out-of-distribution on multimodal activations.
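To make the clipping mechanism concrete, here is a toy sketch rather than the real NVFP4 kernel path. It assumes a single static scale derived from the largest text activation (real NVFP4 also carries per-block scales), and the Gaussian "text" and "multimodal" activations with standard deviations 1 and 8 are made-up stand-ins, not measured Gemma 4 statistics. All function names are hypothetical.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]

def calibrate_scale(samples):
    """Static scale that maps the largest calibration value onto FP4_MAX."""
    return max(abs(x) for x in samples) / FP4_MAX

def fake_quant(x, scale):
    """Quantize then dequantize one value; out-of-range values saturate."""
    mag = min(abs(x) / scale, FP4_MAX)
    nearest = min(FP4_GRID, key=lambda g: abs(g - mag))
    return (nearest if x >= 0 else -nearest) * scale

def relative_error(samples, scale):
    """RMS quantization error divided by the RMS of the signal."""
    num = sum((x - fake_quant(x, scale)) ** 2 for x in samples)
    den = sum(x * x for x in samples)
    return (num / den) ** 0.5

def clipped_fraction(samples, scale):
    """Share of values that exceed the representable range."""
    return sum(abs(x) > FP4_MAX * scale for x in samples) / len(samples)

rng = random.Random(0)
text_acts = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # assumed text-like
mm_acts = [rng.gauss(0.0, 8.0) for _ in range(4096)]    # assumed wider range

scale = calibrate_scale(text_acts)  # calibrated on "text" only
text_err = relative_error(text_acts, scale)
mm_err = relative_error(mm_acts, scale)
mm_clipped = clipped_fraction(mm_acts, scale)

print(f"text relative error:       {text_err:.2f}")
print(f"multimodal relative error: {mm_err:.2f}")
print(f"multimodal values clipped: {mm_clipped:.0%}")
```

Under these assumptions the in-distribution signal survives with modest error, while a large share of the wider signal saturates at the calibrated maximum. That is the same failure shape as the empty image output and garbled audio, just in miniature.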

Weight-only NVFP4 (W4A16) has no activation quantization — activations stay BF16 — so there's no calibration mismatch. Image, audio, and video all came back intact:

  • Image: "two cats in a recording studio, headphones, an 'ON AIR' neon sign, studio microphones, laptops."
  • Audio: transcribed a LibriSpeech clip — "Mr. Quilter is the apostle of the middle classes..."
  • Video: "a woman in a black bodysuit walks down a wet street at night."

And weight-only costs nothing on speed: on a bandwidth-bound dense model the W4A16 path (dequant to BF16) is, if anything, a hair ahead (24.9 vs 23.9 tok/s — call it a tie). So W4A4's activation quantization buys no throughput here. That tracks with what I found on a dense model in Part 32 — the win is bandwidth, not the FP4 ALU.
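A rough way to sanity-check the bandwidth argument is a back-of-the-envelope ceiling: if every decoded token has to stream all the weights once, tok/s cannot exceed bandwidth divided by weight bytes. The sketch below uses the 273 GB/s figure and the disk sizes from this post as a proxy for bytes read per token. That proxy is an assumption: it ignores KV-cache traffic and counts modules that a text-only decode may not touch.

```python
BANDWIDTH_GB_S = 273.0  # GB10 memory bandwidth quoted in this post

# (checkpoint size in GB, measured vLLM tok/s), taken from the results table
BUILDS = {
    "BF16": (23.0, 7.7),
    "FP8 dynamic": (13.0, 15.9),
    "NVFP4 W4A16": (7.7, 24.9),
}

def decode_ceiling(weights_gb, bandwidth_gb_s=BANDWIDTH_GB_S):
    """Upper bound on single-stream tok/s if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

for name, (size_gb, measured) in BUILDS.items():
    ceiling = decode_ceiling(size_gb)
    print(f"{name:12s} ceiling {ceiling:5.1f} tok/s, "
          f"measured {measured:5.1f} ({measured / ceiling:.0%} of ceiling)")
```

With these inputs the three builds land at roughly 65%, 76%, and 70% of their ceilings. The measured speedups track the shrinking weight footprint, which is what a bandwidth-bound decode would predict.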

The HF-eager trap: quant looks slower until you serve it right

Here's the number that would have sent me down the wrong path. Running the exact same quantized weights through plain transformers (HF eager):

Format        HF eager tok/s   vLLM native tok/s
BF16          7.3              7.7
FP8           5.3              15.9
NVFP4 W4A16   4.7              24.9

In transformers, NVFP4 is the slowest — because there is no native FP4 kernel, so it decompresses the weights to BF16 on every forward pass: all the overhead of quantization, none of the bandwidth benefit. Switch to vLLM's native kernels and the ordering flips completely — NVFP4 becomes the fastest, a 5× swing on identical weights. Always benchmark the runtime you'll actually deploy.
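One way to keep such comparisons honest is to run every runtime through the same timing loop. The helper below is a minimal, hypothetical sketch and not the harness used for the numbers above: `generate` is whatever callable wraps your runtime (a transformers call, a request to a vLLM server) and returns the number of tokens it produced. It times whole generations, so prefill is included unless your wrapper excludes it.

```python
import time

def measure_decode_tps(generate, prompts, warmup=1, clock=time.perf_counter):
    """Return tokens per second for `generate` over `prompts`.

    `generate(prompt)` must run one full generation and return the number
    of tokens it produced. The first `warmup` prompts are run once untimed
    so that compilation and cache effects do not pollute the measurement.
    """
    for prompt in prompts[:warmup]:
        generate(prompt)

    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = clock() - start
    return total_tokens / elapsed

if __name__ == "__main__":
    # Stand-in runtime for illustration: pretends to emit 32 tokens in ~10 ms.
    def fake_generate(prompt):
        time.sleep(0.01)
        return 32

    print(f"{measure_decode_tps(fake_generate, ['hello', 'world']):.0f} tok/s")
```

Swapping in a wrapper per runtime, with the same prompts and the same output length, is what makes a table like the one above comparable row to row.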

Serving gemma4_unified on a GB10: two things you need

Gemma 4 12B is a new architecture (gemma4_unified, Gemma4UnifiedForConditionalGeneration). Two non-obvious requirements to serve it on a DGX Spark:

  1. A vLLM build with the native class (~0.22.x / main). Older builds fall back to vLLM's generic transformers backend, which crashes on the attention output projection — because Gemma 4's attention is non-square: head_dim 256 × 16 heads = 4096, which is not the 3840 hidden size (it's GQA, 8 KV heads). The generic backend assumes the square case and mis-shapes the projection.
  2. VLLM_ATTENTION_BACKEND=TRITON_ATTN, the backend that handles that non-square attention.

VLLM_ATTENTION_BACKEND=TRITON_ATTN \
vllm serve coolthor/gemma-4-12B-it-NVFP4A16 --max-model-len 4096

(Caveat: vLLM's generic multimodal wrapper is image-only today, so full audio/video serving is still pending upstream — but all four modalities work through transformers.)
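The non-square attention shape from requirement 1 is easier to see as arithmetic. This sketch uses only the dimensions quoted above; the projection names (q_proj, k_proj, v_proj, o_proj) follow the usual Hugging Face convention and are an assumption about this architecture's module naming.

```python
HIDDEN_SIZE = 3840
NUM_HEADS = 16
NUM_KV_HEADS = 8   # GQA: fewer key/value heads than query heads
HEAD_DIM = 256

q_width = NUM_HEADS * HEAD_DIM       # 16 * 256 = 4096
kv_width = NUM_KV_HEADS * HEAD_DIM   # 8 * 256 = 2048

# (input features, output features) for each attention projection
projection_shapes = {
    "q_proj": (HIDDEN_SIZE, q_width),
    "k_proj": (HIDDEN_SIZE, kv_width),
    "v_proj": (HIDDEN_SIZE, kv_width),
    "o_proj": (q_width, HIDDEN_SIZE),
}

for name, (n_in, n_out) in projection_shapes.items():
    print(f"{name}: {n_in} -> {n_out}")

# A backend that assumes num_heads * head_dim == hidden_size would expect
# a 3840-wide input to o_proj, but the real input is 4096 wide.
print("square attention:", q_width == HIDDEN_SIZE)
```

The output projection maps 4096 features back to 3840, so any code path that sizes it from the hidden size alone ends up with the wrong shape.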

The quant recipe that actually loads on vLLM

This part is fiddly and undocumented. vLLM's native gemma4_unified quantizes a specific set of modules, and if your checkpoint's quantized/un-quantized layers don't match exactly, it refuses to load. The recipe that works:

QuantizationModifier(targets="Linear", scheme="NVFP4A16",
    ignore=["lm_head", "re:.*embedding_projection.*"])

In words: quantize the text tower and the vision patch_dense; keep lm_head and both embedding_projections (vision and audio) in BF16. That mirrors how vLLM builds the model — patch_dense is a ColumnParallelLinear constructed with a quant config, while the projectors are plain BF16 linears. (Also: use llmcompressor's basic pipeline, not sequential — the sequential tracer hits a UserDict error on this brand-new arch.)
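To see which layers the ignore list actually spares, here is a small stand-alone imitation of the matching rule: a plain entry must equal the module name, and a `re:` entry is treated as a regular expression. This is a simplified sketch, not llmcompressor's real matcher, and the module names below are hypothetical placeholders rather than the checkpoint's actual names.

```python
import re

IGNORE = ["lm_head", "re:.*embedding_projection.*"]

def is_ignored(module_name, ignore=IGNORE):
    """Return True if the module should stay un-quantized (BF16)."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            # Regex entry: match against the full module name.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain entry: exact name match.
            return True
    return False

# Hypothetical module names, for illustration only.
MODULES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
    "model.vision.patch_dense",
    "model.vision.embedding_projection",
    "model.audio.embedding_projection",
    "lm_head",
]

quantized = [m for m in MODULES if not is_ignored(m)]
kept_bf16 = [m for m in MODULES if is_ignored(m)]

print("NVFP4A16:", *quantized, sep="\n  ")
print("BF16:", *kept_bf16, sep="\n  ")
```

Running a check like this over the real `named_modules()` output before quantizing is a cheap way to confirm the split matches what vLLM expects, instead of finding out at load time.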

Models

Both are on the Hub, Apache 2.0, omni intact:

TL;DR

  • On a DGX Spark GB10, weight-only NVFP4 wins for Gemma 4 12B omni: 7.7 GB, 24.9 tok/s (3.2× BF16), all four modalities intact.
  • Full W4A4 is a trap for multimodal models: it is no faster (23.9 vs 24.9 tok/s, within noise) and it breaks image/audio/video (text-calibrated activation quant is out-of-distribution on multimodal embeddings).
  • HF eager lies — quant looks slower there; the real speedup needs vLLM's native kernels (a 5× swing).
  • To serve it: vLLM with native Gemma4Unified + TRITON_ATTN.

Also in this series: Part 32 — NVFP4 is compression, not the FP4 cores, Part 33 — NVFP4 W4A4 beats FP8 on a MoE, Part 34 — NVFP4 shrinks a video model with zero speed gain, Part 26 — Nemotron Omni on a GB10.

FAQ

Should I quantize a multimodal LLM with NVFP4 W4A4 or weight-only W4A16?
For an omni model, use weight-only NVFP4 (W4A16). On a DGX Spark GB10 with Gemma 4 12B, W4A16 ran at 24.9 tok/s and kept image/audio/video working, while full W4A4 (weight + activation 4-bit) was no faster (23.9 tok/s, within noise) and broke the multimodal capabilities. W4A4 calibrates its activation scales on text, so the image/audio embeddings fall outside that distribution and get clipped, for no speed benefit.

How fast is Gemma 4 12B quantized on a DGX Spark GB10?
Single-stream decode under vLLM 0.22.1: BF16 7.7 tok/s (23 GB), FP8 dynamic 15.9 tok/s (13 GB), NVFP4 weight-only 24.9 tok/s (7.7 GB). NVFP4 is 3.2× BF16. The same quantized weights run slower than BF16 under plain transformers (HF eager) because there is no native FP4/FP8 kernel there; the speedups are real only under vLLM.

Why does NVFP4 look slower than BF16 in Hugging Face transformers?
transformers has no native FP4/FP8 kernel, so it dequantizes the weights back to BF16 on every forward pass: pure overhead, no bandwidth benefit. In HF eager, NVFP4 measured 4.7 tok/s (slowest). Under vLLM's native kernels the same weights hit 24.9 tok/s (fastest). That is a 5× swing on identical weights, so always benchmark the runtime you'll actually serve with.

How do you serve Gemma 4 (gemma4_unified) on vLLM?
Use a vLLM build with the native Gemma4UnifiedForConditionalGeneration class (around 0.22.x / main) and set VLLM_ATTENTION_BACKEND=TRITON_ATTN. Gemma 4's attention is non-square (head_dim 256 × 16 heads = 4096, not equal to the 3840 hidden size; it's GQA with 8 KV heads), and TRITON_ATTN is the backend that handles that correctly. Older vLLM falls back to a transformers backend that crashes on the attention projection.