Does NVFP4 make video generation faster on a DGX Spark?

No, and if anything it is a hair slower. At 832×480, a 3-second clip took ~18.0s warm in FP8 and ~19.1s in NVFP4 (N=3, essentially zero variance). That is about 6s of compute per 1s of video, with NVFP4 ~6% slower and never faster. (The first gen after a cold start is ~2 min while kernels compile.) Video diffusion is compute-bound, and weight-only NVFP4 dequantizes back to BF16 before the matmul, so it never touches the FP4 cores. The win is size: 29.2 GB → 19.5 GB (−33%), same quality.

Why does NVFP4 speed up LLM decode but not video generation on the same box?

Roofline. Single-stream LLM decode is bandwidth-bound — 4-bit weights are half the bytes to stream, so it goes faster. Video diffusion is compute-bound — the bottleneck is matmul throughput, and weight-only NVFP4 unpacks to BF16 and computes the same way, so you only get the smaller file, not more speed.

Why does the NVFP4 video model produce blurry output?

It doesn't. The blur was a VAE artifact, not the quantization. The NVFP4 checkpoint's embedded VAE hangs on a GB10 (it falls into a CPU dequant path), so it's tempting to fall back to the tiny taeltx2_3 preview VAE, which is genuinely blurry (Laplacian ~45 vs ~205 for the full VAE). Use the extracted full LTX VAE and NVFP4 decodes as sharply as FP8.

What ComfyUI flags do you need to run NVFP4 video on a DGX Spark?

Launch ComfyUI with --disable-async-offload --disable-dynamic-vram. The joint audio+video path crashes otherwise with 'NoneType object has no attribute wait_stream' in model_prefetch.py — the async prefetch stream resolves device=None for the NVFP4 model. GB10's 128GB unified memory doesn't need offload anyway.

[Benchmark] NVFP4 shrinks a video model 33% on a DGX Spark — with zero speed gain

TL;DR

On a DGX Spark (GB10, 128GB unified memory), NVFP4 took a distilled Sulphur 2 (uncensored LTX-2.3) text-to-video model from 29.2 GB → 19.5 GB (−33%) with no speed gain (832×480 3s clip, warm: FP8 ~18.0s vs NVFP4 ~19.1s, so NVFP4 is if anything ~6% slower) and no quality loss (sharp, Laplacian 205.79, synced audio). This is the mirror image of Part 33: there, NVFP4 made the LLM faster because decode is bandwidth-bound; here, video diffusion is compute-bound, so weight-only NVFP4 only shrinks the file, and the speed has to come from a bigger GPU. The point of the 19.5 GB build: it now fits a 32 GB RTX 5090. The model is on HuggingFace.

Plain-Language Version: 4-bit made the video model smaller, not faster — and that's the whole point

In Part 33 I shrank a chatbot to 4-bit and it got faster. So I tried the same trick on a video-generation model — and the speed didn't improve at all (if anything it got a touch slower). Same clip time, just a smaller file.

That sounds like a failure. It isn't, and the reason is the most useful idea in this whole series. A chatbot, when it writes one word at a time, spends its time moving data — so making the data smaller makes it faster. A video model spends its time doing math — and 4-bit weights get unpacked back to normal numbers before the math happens, so the math takes essentially as long (the unpacking even costs a hair). Smaller file, same clock.

So why bother? Because "smaller" is its own prize. The full-size model doesn't fit on a normal gaming GPU (an RTX 5090, 32GB). The 19.5 GB version does. The DGX Spark — NVIDIA's big-memory desktop box — is where I make the small version; the 5090, with far more raw compute, is where it can actually run faster. Big-memory box to shrink it, fast box to run it.

Preface

Part 33 ended with NVFP4 winning on speed because LLM decode is bandwidth-bound. The obvious next question: does the same 4-bit trick help the other thing this box does all day — generating video?

Same hardware as the whole series: one DGX Spark, GB10, 128GB unified memory at 273 GB/s, running ComfyUI alongside the daily LLM. The model is Sulphur 2, an uncensored fine-tune of Lightricks' LTX-2 video DiT (~22B params, with a Gemma-3-12B text encoder and joint audio generation). No prebuilt NVFP4 build existed, so I quantized the distilled variant myself, and the answer to "does it help" is a clean, instructive no.

19.5 GB vs 29.2 GB, 18.0s vs 19.1s: NVFP4 video is a size play, not a speed play

Same distilled model, same 8 steps at cfg 1.0, same extracted LTX VAE, single-stream on the GB10, 832×480, a 3.04s clip (73 frames @ 24fps). Warm, N=3, only the weight format changes:

Format	File size	Time (3s clip, warm)	Quality (Laplacian)
FP8 (distilled)	29.2 GB	18.0s	sharp
NVFP4 (distilled)	19.5 GB	19.1s	205.79 (sharp)

That's ~6s of compute per 1s of video, and NVFP4 is not faster: it's a reproducible ~6% slower (19.1 vs 18.0s, N=3 with essentially zero variance), exactly what a compute-bound workload plus a little dequant overhead predicts. Those are warm numbers; the first gen after a cold start takes ~2 minutes while the kernels compile. I'll say the quiet part out loud: I earlier thought NVFP4 was ~28% faster here. That was an N=1 measurement comparing the wrong two models (a 30-step dev model against an 8-step distilled one), and the "fast" reruns were partly ComfyUI cache hits. Hold the model, steps, resolution, and VAE fixed, change the seed on every run to bust the cache, and the only thing 4-bit changes is the file size: 9.7 GB less disk and memory.
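A minimal sketch of that measurement discipline, assuming a hypothetical `generate(seed)` callable that runs one full generation and blocks until it finishes (for example, a wrapper that submits a workflow to ComfyUI). The helper names are illustrative and not part of ComfyUI. One untimed cold run absorbs kernel compilation, then every timed run gets a fresh seed so a cached result cannot be served. The constants at the bottom are the numbers from the table.

```python
import time
from typing import Callable, List


def time_warm_runs(generate: Callable[[int], None], n: int = 3,
                   base_seed: int = 1000) -> List[float]:
    """Return wall-clock seconds for n warm runs of generate(seed)."""
    generate(base_seed)  # cold run (kernel compile, model load): not timed
    timings = []
    for i in range(1, n + 1):
        start = time.perf_counter()
        generate(base_seed + i)  # new seed each run, so no cache hit
        timings.append(time.perf_counter() - start)
    return timings


def relative_change(baseline: float, candidate: float) -> float:
    """Fractional change of candidate versus baseline (0.06 means +6%)."""
    return (candidate - baseline) / baseline


# Numbers reported in the table above.
FP8_SECONDS, NVFP4_SECONDS = 18.0, 19.1
FP8_GB, NVFP4_GB = 29.2, 19.5
CLIP_SECONDS = 73 / 24  # 73 frames at 24 fps

slowdown = relative_change(FP8_SECONDS, NVFP4_SECONDS)
size_change = relative_change(FP8_GB, NVFP4_GB)
compute_per_video_second = FP8_SECONDS / CLIP_SECONDS

print(f"NVFP4 time vs FP8: {slowdown:+.1%}")
print(f"NVFP4 size vs FP8: {size_change:+.1%}")
print(f"FP8 compute per second of video: {compute_per_video_second:.2f}s")
```

Passing the same `generate` wrapper for both formats, with only the checkpoint swapped, keeps the comparison to a single variable.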

Why the same 4-bit format speeds up the LLM but not the video — roofline

This is the payoff for reading both parts. NVFP4 here is weight-only: the 4-bit weights are unpacked back to BF16, then the matmul runs in BF16. The FP4 tensor cores never fire. So the only thing 4-bit can buy is fewer bytes to move.

LLM decode (Part 33) is bandwidth-bound. At batch=1 the GPU spends its time streaming weights from memory. Halve the weight bytes and you halve the dominant cost — NVFP4 wins on speed.
Video diffusion is compute-bound. The 8 denoise steps are dense matmuls that saturate the math units; memory traffic isn't the bottleneck. Unpacking 4-bit back to BF16 hands the math the same BF16 it always had, so the clock doesn't improve — and the dequant step even costs a few percent, which is why NVFP4 lands a hair behind FP8 here.

Same format, same box, opposite outcome — and you can predict which one you'll get by asking a single question: is this workload waiting on memory, or waiting on math? To actually speed up the video you'd need either FP4 compute (a W4A4 path that quantizes activations too and fires the FP4 cores — a calibration project I haven't landed) or simply more math units. Which is why the real home for this 19.5 GB file is an RTX 5090: video is compute-bound, the 5090 has far more compute than a GB10, and 19.5 GB is what makes it fit in 32 GB (the FP8 model at 29 GB plus a 9 GB text encoder does not).
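The "memory or math?" question can be sketched as a toy roofline model. This is a deliberate simplification under stated assumptions: `PEAK_FLOPS` is a hypothetical placeholder rather than a GB10 spec, `TOKENS_DIFFUSION` is an invented token count, the ~22B parameter count is reused for both workloads purely for illustration, and 4-bit is treated as exactly half the bytes of 8-bit (the real file shrinks less because of scales and unquantized layers). Only the 273 GB/s bandwidth comes from this post. Dequant cost, activations, and attention are ignored.

```python
PEAK_BANDWIDTH = 273e9    # bytes/s, the unified-memory figure quoted in this post
PEAK_FLOPS = 100e12       # FLOP/s, HYPOTHETICAL placeholder, not a GB10 spec
N_PARAMS = 22e9           # ~22B weights, reused for both cases for illustration
TOKENS_DECODE = 1         # batch=1 decode: each weight is used for one token
TOKENS_DIFFUSION = 10_000 # HYPOTHETICAL number of latent tokens per step


def times(tokens: int, bytes_per_weight: float):
    """Return (math_seconds, memory_seconds) for one pass over the weights."""
    flops = 2.0 * N_PARAMS * tokens           # one multiply-add per weight per token
    bytes_moved = N_PARAMS * bytes_per_weight
    # Weight-only quantization: math still runs in BF16, so FLOPs are unchanged.
    return flops / PEAK_FLOPS, bytes_moved / PEAK_BANDWIDTH


def step_time(tokens: int, bytes_per_weight: float) -> float:
    """A pass takes as long as the slower of math and memory."""
    return max(times(tokens, bytes_per_weight))


def bound_by(tokens: int, bytes_per_weight: float) -> str:
    math_s, mem_s = times(tokens, bytes_per_weight)
    return "math" if math_s >= mem_s else "memory"


decode_speedup = step_time(TOKENS_DECODE, 1.0) / step_time(TOKENS_DECODE, 0.5)
diffusion_speedup = step_time(TOKENS_DIFFUSION, 1.0) / step_time(TOKENS_DIFFUSION, 0.5)

print("decode is bound by", bound_by(TOKENS_DECODE, 1.0), "speedup", decode_speedup)
print("diffusion is bound by", bound_by(TOKENS_DIFFUSION, 1.0), "speedup", diffusion_speedup)
```

In this model, halving the weight bytes doubles decode speed and leaves diffusion untouched, which matches the direction of both measurements even though the absolute numbers are made up.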

The VAE that hangs — and why I almost blamed the quantization

For a while I thought NVFP4 had wrecked the quality: every NVFP4 clip came out blurry. It was a measurement artifact. The NVFP4 checkpoint's embedded VAE hangs on the GB10 (it falls into a CPU dequant path and sits there), so I'd been routing decode through taeltx2_3, a tiny preview VAE. That preview VAE is genuinely blurry: Laplacian ~45 versus ~205 for the full LTX VAE, roughly 4.5× softer.
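The post does not spell out how the Laplacian score is computed. A common sharpness proxy is the variance of a 4-neighbour Laplacian over a grayscale frame, and the sketch below assumes that definition, so its absolute values may not match the ~45 and ~205 figures. It needs only NumPy; `laplacian_variance` is an illustrative name.

```python
import numpy as np


def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian of a 2D grayscale image.

    Higher means more high-frequency detail (sharper). Border pixels are
    skipped so no padding choice affects the score.
    """
    g = gray.astype(np.float64)
    lap = (g[:-2, 1:-1] + g[2:, 1:-1]      # up + down neighbours
           + g[1:-1, :-2] + g[1:-1, 2:]    # left + right neighbours
           - 4.0 * g[1:-1, 1:-1])          # minus 4x the centre pixel
    return float(lap.var())


# Synthetic check: blurring a noisy frame should lower the score.
rng = np.random.default_rng(0)
sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
blurred = (sharp[:-1, :-1] + sharp[1:, :-1] + sharp[:-1, 1:] + sharp[1:, 1:]) / 4.0

print("sharp:  ", laplacian_variance(sharp))
print("blurred:", laplacian_variance(blurred))
```

Scoring the same frame index from both decoders, at the same resolution, keeps the comparison fair.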

The fix is to decode with the full VAE, not the tiny one. The embedded full VAE works fine for FP8 but hangs for NVFP4, so I extracted it into a standalone 1.45 GB file (strip the vae. prefix from the FP8 checkpoint's tensors) and point a VAELoader at it. With that, the same NVFP4 sampling that looked blurry through tae decodes at Laplacian 205.79 — indistinguishable from FP8. The 4-bit weights were never the problem; the preview VAE was.
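A sketch of the extraction step, assuming the VAE tensors in the FP8 checkpoint are exactly those whose names start with `vae.` as described above. The file paths in the usage comment are hypothetical. The key-renaming helper is pure Python; the file I/O uses `safe_open` and `save_file` from the `safetensors` package and is imported lazily so the helper can be tried without it.

```python
from typing import Dict, Iterable


def strip_prefix(keys: Iterable[str], prefix: str = "vae.") -> Dict[str, str]:
    """Map each key that starts with prefix to the same key without it."""
    return {k: k[len(prefix):] for k in keys if k.startswith(prefix)}


def extract_vae(src_path: str, dst_path: str, prefix: str = "vae.") -> int:
    """Copy prefixed tensors from src_path into a standalone file.

    Returns the number of tensors written. Requires safetensors and torch.
    """
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors = {}
    with safe_open(src_path, framework="pt") as f:
        for old_name, new_name in strip_prefix(f.keys(), prefix).items():
            tensors[new_name] = f.get_tensor(old_name)
    save_file(tensors, dst_path)
    return len(tensors)


# Hypothetical usage (paths are placeholders):
# extract_vae("sulphur2_distilled_fp8.safetensors", "ltx_full_vae.safetensors")

print(strip_prefix(["vae.decoder.conv.weight", "model.blocks.0.weight"]))
```

Checking the returned tensor count against zero is a cheap guard against a wrong prefix silently producing an empty file.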

(One unrelated trap: there's an ae.safetensors floating around in the VAE folder — that's a FLUX VAE, not LTX. It decodes garbage. Don't grab it by name.)

The crash that needs two ComfyUI flags

The first NVFP4 generation crashed at the sampler:

AttributeError: 'NoneType' object has no attribute 'wait_stream'
  in comfy/model_prefetch.py

Root cause: ComfyUI's async weight offloading (two CUDA streams) plus DynamicVRAM resolved device=None for the NVFP4 LTXAV model on the joint audio+video forward pass, so current_stream(None) returned None. Video-only generation doesn't hit it; the audio+video joint path does. The fix is to launch ComfyUI with both prefetch optimizations off:

python main.py --listen 0.0.0.0 --port 8188 --enable-cors-header \
  --disable-async-offload --disable-dynamic-vram

A GB10 has 128 GB of unified memory, so it never needed to offload weights to host in the first place. Disabling offload costs nothing here except a slightly higher cold-run peak (the cold video run dipped to ~10 GB free with the daily LLM resident, well clear of trouble).

On HuggingFace, with the caveats written down

The quantized model is up at coolthor/Sulphur-2-distilled-NVFP4 — the 19.5 GB NVFP4 weights, the extracted 1.45 GB ltx_full_vae.safetensors it requires, and the ComfyUI workflow. It carries the upstream LTX-2 Community License, which permits quantized redistribution; credit goes to Lightricks (LTX-2) and the Sulphur fine-tune. The model card leads with the two traps above, because a 4-bit checkpoint that silently decodes blurry or hangs on its own VAE is worse than no checkpoint.

Takeaways

What took the most time: controlling the variable. The "NVFP4 is 28% faster / NVFP4 is blurry" detour cost the most, and both were the same mistake: changing two things at once. Faster-but-blurry was a 30-step dev model decoded through a preview VAE, compared against an 8-step distilled model. Hold the model, steps, and VAE fixed and change only the format, and the truth is boring and correct: no faster (a reproducible hair slower), same quality, smaller file.

The portable diagnostic: ask "memory or math?" before quantizing for speed. Weight-only 4-bit only buys speed on a bandwidth-bound workload. LLM decode: yes. Diffusion sampling: no. The roofline question predicts the result before you run anything, and it's the same question that explains Part 33.

The general principle: match the box to the bottleneck. Big-memory box (GB10) to hold and shrink a model; high-compute box (5090) to run the compute-bound part fast. NVFP4 is the bridge: it makes the 29 GB model fit the 32 GB card.
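The fit argument is plain addition, sketched here with the sizes quoted in this post (19.5 GB and 29.2 GB of weights, a 9 GB text encoder, a 32 GB card). This counts file sizes only; activations, the VAE, and framework overhead are not included, so treat a "fits" result as necessary rather than sufficient. The `headroom_gb` parameter is an illustrative knob, not a measured value.

```python
def fits(vram_gb: float, *components_gb: float, headroom_gb: float = 0.0) -> bool:
    """True if the listed components plus headroom fit within vram_gb."""
    return sum(components_gb) + headroom_gb <= vram_gb


RTX_5090_GB = 32.0
TEXT_ENCODER_GB = 9.0

print("NVFP4 + text encoder fits:", fits(RTX_5090_GB, 19.5, TEXT_ENCODER_GB))
print("FP8 + text encoder fits:  ", fits(RTX_5090_GB, 29.2, TEXT_ENCODER_GB))
```

With these numbers the NVFP4 build leaves 3.5 GB of slack on the card, while the FP8 build overshoots by about 6 GB.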

Conclusion

If you're running LTX-2 / Sulphur video on a GB10:

NVFP4 weight-only is a size win (−33%, 29.2 → 19.5 GB), not a speed win — video is compute-bound. Expect FP8's clock or a touch slower (~6% in my runs), never faster. Warm, a 3s 832×480 clip is ~18-19s (~6s of compute per 1s of video); the first gen after a cold start is ~2 min.
Decode with the full LTX VAE (extract it to a standalone file — the embedded one hangs under NVFP4). Never ship taeltx2_3 as the final decoder; it's a blurry preview.
Launch ComfyUI with --disable-async-offload --disable-dynamic-vram or the audio+video path crashes in model_prefetch.py.
Distilled settings: 8 steps, cfg 1.0; frame count (N×8)+1; audio latent video_frames×4−3.
Want it faster, not just smaller? That's an FP4-compute (W4A4) or bigger-GPU problem — the 19.5 GB file exists precisely so it fits a 5090.
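The frame-count rule in the settings above, (N×8)+1, can be wrapped in a small helper. This is a sketch with illustrative function names; it covers only the video frame count and leaves the audio-latent formula out, since the post states it without defining its terms. The 24 fps default matches the benchmark clip.

```python
def is_valid_frame_count(frames: int) -> bool:
    """True if frames has the form (N * 8) + 1 for some integer N >= 1."""
    return frames >= 9 and (frames - 1) % 8 == 0


def nearest_valid_frames(target_seconds: float, fps: int = 24) -> int:
    """Closest valid frame count to the requested duration."""
    n = max(1, round((target_seconds * fps - 1) / 8))
    return n * 8 + 1


frames = nearest_valid_frames(3.0)
print(frames, "frames,", round(frames / 24, 2), "seconds")
```

A 3-second request at 24 fps lands on the 73-frame clip used in the benchmark.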