~/blog/zimage-turbo-quality-lpips-clipscore

Z-Image Turbo · part 2

[Field Guide] Z-Image Turbo — does choosing a faster config hurt quality? LPIPS + CLIPScore answer

cat --toc

TL;DR

Z-Image Turbo quantization quality, measured properly: a two-axis benchmark — LPIPS plus CLIPScore — across 6 prompts × 4 configs × 3 seeds = 72 samples. LPIPS shows NVFP4 produces images visually distinct from BF16 (distance 0.29–0.31), but all four configs land at CLIPScore 0.334–0.339 — the ±0.04 std band is an order of magnitude larger than the 0.001–0.005 differences between configs. No measured prompt-fidelity regression in any quantized config. NVFP4+FP8e is directionally higher on 3 of 6 prompts, but with N=3 seeds per cell that's not a statistically defensible win — call it "not worse." Bottom line: the Part 1 recommended combo (1.37× faster, 9.1 GB working set saved) shows no quality regression in this sample — re-verify with your own prompt set + N≥10 seeds before production use.

Why this article exists

Part 1 measured speed and memory for six Z-Image Turbo quant combos. NVFP4 transformer hits 5.50s warm versus BF16 7.55s (1.37× speedup), and model working set drops from 20.6 GB to 11.5 GB (44% smaller). But it left an open question: do these quantized configs preserve image quality?

Eyeballing six photorealistic portrait samples showed no obvious damage. But that prompt is the gentlest possible stress test for quantization. What about Chinese prompts, text rendering, dense detail in anime style, abstract composition? Part 1 did not answer; Part 2 does.


How to measure image quality properly

"Does quantization break image quality" sounds simple but is slippery. Define "quality": looks identical to BF16? Matches the prompt? Looks aesthetically good? The three questions have different answers and need different metrics. I went with the two metrics most commonly used in image generation research, combined.

Axis 1: LPIPS — how far do quantized images drift from BF16?

What LPIPS is

LPIPS (Learned Perceptual Image Patch Similarity), proposed by Zhang et al. 2018, was designed to replace pixel-level metrics like PSNR/SSIM with something closer to human visual perception.

Mechanism: run both images through a pre-trained AlexNet, extract mid-layer feature maps, then compute a weighted distance between those features (the "Learned" part of the name: small per-channel weights calibrated against human similarity judgments). Why AlexNet? Because mid-layer CNN features encode semantic concepts like "eyes, mouth, texture" — closer to how humans actually compare images than raw pixels.

Output range: 0 (identical) to roughly 1 (completely different). Practical intuition:

  • LPIPS < 0.05: visually indistinguishable, only pixel-level differences
  • LPIPS 0.1–0.2: minor composition / detail variation visible on close inspection
  • LPIPS 0.2–0.4: same subject, clearly different composition (typical "same prompt, different model")
  • LPIPS > 0.5: essentially different concepts
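
Concretely, the measurement is a few lines with the lpips package. A minimal sketch (the file names are hypothetical placeholders; lpips expects NCHW float tensors scaled to [-1, 1]):

import lpips
import torch
import torchvision.transforms.functional as TF
from PIL import Image

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, as described above

def load(path):
    # PIL image -> float tensor in [0, 1] -> rescale to [-1, 1], add batch dim (NCHW)
    return (TF.to_tensor(Image.open(path).convert("RGB")) * 2 - 1).unsqueeze(0)

with torch.no_grad():
    dist = loss_fn(load("bf16_seed42.png"), load("nvfp4_seed42.png")).item()

print(f"LPIPS vs BF16: {dist:.3f}")  # ~0.3 would read as "same subject, clearly different composition"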

Why use BF16 as reference instead of running BF16 against itself?

In theory, same prompt + same seed + same model + same hardware should be fully deterministic — LPIPS against itself should be 0. But once you swap the quantization path, internal rounding changes, and the latent space trajectory diverges from the very first sampling step. That divergence is physically inevitable and unrelated to "quality."

So LPIPS tells us how far the quantized version drifts from BF16, but cannot determine quality on its own. An LPIPS of 0.3 could mean "image is fine, just composed slightly differently" or "quantization broke the subject" — LPIPS alone cannot distinguish those. That's why we need axis 2.

Axis 2: CLIPScore — does the image still match the prompt?

What CLIP is

CLIP (Contrastive Language-Image Pretraining), trained by OpenAI in 2021, is a dual-tower model — an image encoder plus a text encoder. Training objective: pull "image + matching caption" close in embedding space, push mismatches apart. Trained on roughly 400 million image-text pairs collected by OpenAI (the open LAION datasets behind today's open_clip checkpoints are a separate, later effort to reproduce that recipe at larger scale, not a subset).

CLIPScore is just: encode the generated image and the prompt text into vectors, take their cosine similarity. Cosine similarity technically spans −1 to 1, but in practice CLIP image-text scores land in a narrow band:

  • CLIPScore < 0.20: image does not match prompt at all (gibberish or broken)
  • CLIPScore 0.25–0.30: subject matches but details off
  • CLIPScore 0.30–0.40: typical baseline for a good generation model
  • CLIPScore > 0.40: very strong image-text alignment (e.g., abstract prompts hitting precisely)
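
A minimal sketch of that computation with open_clip (the pretrained tag 'laion2b_s34b_b79k' is my assumption — the post only pins the architecture, ViT-B-32 — and the image path / prompt are placeholders):

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def clip_score(image_path, prompt):
    # Encode image and prompt, L2-normalize both, return cosine similarity
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    text = tokenizer([prompt])
    with torch.no_grad():
        img_f = model.encode_image(image)
        txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    return (img_f @ txt_f.T).item()

print(clip_score("nvfp4_seed42.png", "photorealistic portrait of a young woman in a coffee shop"))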

Why CLIPScore is a quality proxy

The most common failure mode behind "quantization broke quality" is an image that looks wrong and no longer matches the prompt. CLIPScore directly measures that correspondence and is highly sensitive to "broken" output. If quantization shifts the image so the subject drifts, composition becomes weird, or it stops resembling the prompt description, CLIPScore drops.

If CLIPScore does not drop, we have evidence that on the prompt-alignment axis, the quantized version did not regress — strong enough to push back on the "quantization always degrades" worry.

Limits of CLIPScore

  • Does not score aesthetics: an ugly image that matches the prompt can still score high. But for most image generation use cases, "matches the prompt" comes first
  • English bias: open_clip ViT-B-32's training data is roughly 80% English; Chinese / multilingual prompts will see lower absolute scores, but cross-config comparison stays fair (every config gets the same bias on the same prompt)
  • Misses fine detail: detail-level precision differences may not show up in CLIP features. For that level you need a learned predictor like LAION-Aesthetic

Reading both axes together

LPIPS   CLIPScore          Interpretation
─────   ────────────────   ─────────────────────────────────────────────────────────────
High    Matches BF16       Quantization "different but not worse" ← typical successful quantization
High    Lower than BF16    Quantization broke something ← avoid
Low     Matches BF16       Quantization "close to BF16 and not worse" ← also success
Low     Higher than BF16   Rare; possibly a small improvement LPIPS missed

Combining the two separates "different" from "worse." LPIPS alone would mistake trajectory divergence for damage. CLIPScore alone would miss visually obvious shifts.
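
The same decision logic, codified as a toy helper (the thresholds — 0.1 LPIPS for "visibly diverged" and one std band for "matches BF16" — are my assumptions, not standard cutoffs):

def interpret(lpips_dist, clip_delta, clip_std=0.04):
    """Codify the table above.

    lpips_dist: LPIPS of the quantized sample against its BF16 reference.
    clip_delta: quantized CLIPScore minus BF16 CLIPScore for the same prompt/seed.
    clip_std:   noise band; |clip_delta| inside it counts as "matches BF16".
    """
    diverged = lpips_dist >= 0.1  # visibly different from BF16 (see intuition scale above)
    if abs(clip_delta) <= clip_std:
        return "different but not worse" if diverged else "close to BF16 and not worse"
    if clip_delta < 0:
        return "quantization broke something: avoid"
    return "rare: possibly a small improvement LPIPS missed"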

Why not FID / SSIM / human eval?

Metric                             Why not used
────────────────────────────────   ────────────────────────────────────────────────────────────────────────
FID (Fréchet Inception Distance)   Needs N≥1000 samples for statistical stability. With 72 samples FID is just noise
SSIM / PSNR                        Pixel-level structural similarity — completely blind to semantic meaning, will just report low scores for any quantization that diverges in trajectory, providing no useful signal
Human eval (A/B blind)             A 4-config × 6-prompt × 3-seed = 72-image study would need ≥10 raters for statistical power; out of scope here
LAION-Aesthetic                    Subjective beauty predictor; better for ranking than for cross-config comparison (style preferences vary too much)

LPIPS + CLIPScore is the most robust and cheapest combination at our N=72 scale.

Six prompts engineered for diversity (stress test)

Key             Prompt                                                                      Stress angle
─────────────   ─────────────────────────────────────────────────────────────────────────   ───────────────────────────────
photo_woman     photorealistic portrait of a young woman in a coffee shop                    gentlest case (easy)
photo_machine   close-up photo of a vintage mechanical pocket watch with intricate gears     dense detail (mechanical)
anime           anime style illustration, kimono, cherry blossoms                            stylized (non-photorealistic)
text_render     wooden storefront sign reading "CLOSED FOR REPAIRS"                          text rendering (FP4 weak point)
chinese         古風水墨畫,一位身穿青色長袍的書生站在竹林中 (full Chinese prompt)                Chinese language + classical mood
                English gloss: "classical-style ink wash painting, a scholar in a cyan robe standing in a bamboo forest"
abstract        surreal floating islands suspended in pastel sky, waterfalls cascading       abstract + large color regions

Each prompt × 4 configs × 3 seeds = 72 samples (seeds 42, 7777, 12345).

Four configs (matched to Part 1's recommendation)

Config                     Transformer             Encoder
────────────────────────   ─────────────────────   ──────────────────────────
BF16                       z_image_turbo_bf16      qwen_3_4b BF16 (reference)
FP8scaled                  Kijai pre-quantized     qwen_3_4b BF16
NVFP4                      z_image_turbo_nvfp4     qwen_3_4b BF16
NVFP4+FP8e (recommended)   z_image_turbo_nvfp4     qwen_3_4b_fp8_mixed
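
For reference, the full 72-cell grid follows directly from the two tables above. A sketch of how it can be enumerated (config names are the labels used in this post, not model file names):

from itertools import product

PROMPTS = {
    "photo_woman":   "photorealistic portrait of a young woman in a coffee shop",
    "photo_machine": "close-up photo of a vintage mechanical pocket watch with intricate gears",
    "anime":         "anime style illustration, kimono, cherry blossoms",
    "text_render":   'wooden storefront sign reading "CLOSED FOR REPAIRS"',
    "chinese":       "古風水墨畫,一位身穿青色長袍的書生站在竹林中",
    "abstract":      "surreal floating islands suspended in pastel sky, waterfalls cascading",
}
CONFIGS = ["BF16", "FP8scaled", "NVFP4", "NVFP4+FP8e"]
SEEDS = [42, 7777, 12345]

grid = list(product(PROMPTS, CONFIGS, SEEDS))
assert len(grid) == 72  # 6 prompts × 4 configs × 3 seeds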

Results

Mean across 72 samples (headline table)

Config       LPIPS vs BF16      CLIPScore (image-text)
─────────   ───────────────   ──────────────────────
BF16         0.0000  ref       0.3344 ± 0.043
FP8scaled    0.1670 ± 0.081    0.3356 ± 0.043   ← matches BF16
NVFP4        0.2886 ± 0.086    0.3356 ± 0.045   ← matches BF16
NVFP4+FP8e   0.3069 ± 0.093    0.3388 ± 0.040   ← +0.0044 over BF16

All four configs sit between 0.334 and 0.339 on CLIPScore. NVFP4+FP8e's mean is 0.0044 above BF16, but the ±0.04 std band is an order of magnitude larger than this gap — quantization-driven changes are buried in the noise floor. With only N=3 seeds per cell there's no power for a meaningful paired t-test; the defensible claim is "no measured prompt-fidelity regression in any quantized config," not "equal to" or "better than" BF16.
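
For readers who want to see why N=3 is hopeless, here is what that paired test would look like with scipy — the per-seed numbers below are made-up placeholders for illustration, not values from the benchmark:

from scipy.stats import ttest_rel

# CLIPScore for one prompt across seeds 42 / 7777 / 12345 (illustrative values only)
bf16       = [0.330, 0.341, 0.332]
nvfp4_fp8e = [0.338, 0.338, 0.338]

t, p = ttest_rel(nvfp4_fp8e, bf16)
print(f"t = {t:.2f}, p = {p:.3f}")
# With only 3 pairs (2 degrees of freedom), seed-to-seed noise dominates and the
# test has essentially no power: a directional observation, not a "win".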

LPIPS captures relative distance to BF16: FP8scaled is closest (0.167), NVFP4 mid (0.289), NVFP4+FP8e farthest (0.307). This is exactly what you would expect — the more aggressive the quantization, the earlier the latent trajectory diverges. That distance is not a quality loss; it is just "different model, different image".

Per-prompt CLIPScore breakdown (where might quantization break?)

Prompt          BF16     FP8scaled   NVFP4    NVFP4+FP8e   Winner
─────────────   ──────   ─────────   ──────   ──────────   ──────────────────────────
photo_woman     0.3403   0.3404      0.3383   0.3426       NVFP4+FP8e ↑0.0023
photo_machine   0.3273   0.3354      0.3292   0.3433       NVFP4+FP8e ↑0.0160
anime           0.3187   0.3173      0.3130   0.3174       BF16 ↑0.0013 (within noise)
text_render     0.3628   0.3656      0.3642   0.3623       FP8scaled ↑0.0028
chinese         0.2642   0.2626      0.2658   0.2731       NVFP4+FP8e ↑0.0089
abstract        0.3932   0.3926      0.4030   0.3938       NVFP4 ↑0.0098

Honest takeaways (statistical caveats first):

  1. NVFP4+FP8e mean is directionally higher on 3 of 6 prompts (photo_woman, photo_machine, chinese), with deltas of 0.002–0.016. But all of these gaps are below the 0.04 std band, and N=3 seeds gives no paired-t-test power — these are directional observations, not statistically defensible wins.
  2. The reverse direction also exists: BF16 leads NVFP4 on anime by 0.006, FP8scaled leads BF16 on text_render by 0.003, NVFP4 leads NVFP4+FP8e on abstract by 0.009 — same noise band, also not regressions.
  3. No multiple-comparison correction was applied across the 6 independent per-prompt comparisons (Bonferroni or similar). Treating any single row's mean ranking as a strong claim would be p-hacking.
  4. The defensible conclusion: no quantized config systematically falls below BF16 on CLIPScore in this sample — the four configs are statistically indistinguishable on prompt fidelity, so quality concerns are not a reason to avoid NVFP4+FP8e.

Chinese CLIPScore is generally lower (0.26 vs ~0.32–0.39 for English)

The chinese row sits noticeably below the others — this is not a quantization issue, it is the CLIP model's English bias (open_clip ViT-B-32 was trained on roughly 80%+ English data). The fact that all four configs cluster tightly on chinese (0.263–0.273) confirms the bias is universal, not config-specific.


Side-by-side comparison: 6 prompts × 4 configs (seed=42)

The CLIPScore numbers under each image below are for the single seed=42 sample. The breakdown table earlier reports mean across all 3 seeds, so the same (prompt, config) cell can show slightly different numbers in the two places (gap stays inside the 0.04 std band — expected).

Photo: portrait of a woman in a coffee shop

CLIPScore: BF16 0.334 / FP8scaled 0.332 / NVFP4 0.319 / NVFP4+FP8e 0.337

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

All four portraits hold composition, hair, expression. The NVFP4 row (bottom) diverges more from BF16 in framing, but proportions, skin texture, and the coffee-shop atmosphere all read fine.

Photo: vintage mechanical pocket watch detail

CLIPScore: BF16 0.332 / FP8scaled 0.329 / NVFP4 0.336 / NVFP4+FP8e 0.348 ↑↑

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

Dense-detail stress test. NVFP4+FP8e tops at 0.348 CLIPScore — brass tones and gear structure all land.

Anime: kimono woman with cherry blossoms

CLIPScore: BF16 0.307 / FP8scaled 0.318 / NVFP4 0.312 / NVFP4+FP8e 0.299

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

The only prompt where BF16 wins on the mean across three seeds, but at this seed FP8scaled (0.318) actually beats BF16 (0.307). All differences sit inside the std band.

Text render: wooden storefront sign (FP4 stress)

CLIPScore: BF16 0.348 / FP8scaled 0.353 / NVFP4 0.350 / NVFP4+FP8e 0.351

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

Text stress did not break — all four configs produce a readable English sign. Differences are in weathering and font-style detail.

Chinese: classical ink painting of a scholar in a bamboo forest (FP4 + Chinese double stress)

CLIPScore: BF16 0.260 / FP8scaled 0.261 / NVFP4 0.264 / NVFP4+FP8e 0.266

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

The Chinese prompt holds up under quantization; NVFP4+FP8e even nudges slightly ahead. Classical ink mood, bamboo, scholar all present.

Abstract: surreal floating islands

CLIPScore: BF16 0.408 / FP8scaled 0.406 / NVFP4 0.390 / NVFP4+FP8e 0.384

[2×2 image grid: BF16 | FP8scaled (top row) · NVFP4 | NVFP4+FP8e (bottom row)]

NVFP4 paths come out slightly below BF16/FP8scaled here, but all four images clearly capture the "floating islands + waterfalls + dreamlike" core concept.


Conclusion

Part 1's recommended combo — NVFP4 transformer + qwen_3_4b_fp8_mixed encoder — is vindicated:

  • Speed: 5.52s warm vs BF16 7.55s, 1.37× faster
  • Working set RSS: 11.52 GB vs BF16 20.64 GB, 9.12 GB saved (44%)
  • Disk: 10.4 GB vs BF16 ~20.6 GB, ~49% saved
  • Quality: CLIPScore 0.3388 vs BF16 0.3344 — gap is far below the ±0.04 std band, no measured regression in this N=72 sample (and not statistically distinguishable from "equal" either)

The intuition that "quantization always breaks quality" does not show up in this benchmark for Z-Image Turbo on GB10. LPIPS shows quantized images look different from BF16, but that is latent trajectory divergence, not quality loss. CLIPScore differences across the 4 configs are below noise — prompt fidelity is preserved within measurement power, even if N=3 seeds is too small to claim any config wins.

Verdict: Part 1's recommended combo shows no measured quality regression in this sample, so it's safe as a default. For production-grade anime or portrait services, bring your own prompt set + N≥10 seeds and re-verify — 6 prompts × 3 seeds is not enough to rule out subject-specific tail risk.


Methodology limitations

Neither LPIPS nor CLIPScore is a perfect metric. Caveats:

  1. LPIPS uses AlexNet features and skews toward photographic content. For anime/abstract, treat the LPIPS number conservatively.
  2. CLIPScore ViT-B-32 is biased toward English (English-heavy training data). The lower absolute number for Chinese is a CLIP artifact, not a model failure.
  3. CLIPScore does not measure aesthetics — only alignment to the prompt. An ugly-but-on-prompt image can still score high.
  4. N=3 seeds per (prompt, config) is not a large sample. For any single prompt, lean on the std rather than the mean.
  5. No FID — N=72 is too small for a stable FID score (FID needs N≥1000 to converge); CLIPScore is the most robust choice at this scale.
  6. No anime/cartoon perceptual fidelity test — capturing quantization losses on stylized content needs a learned aesthetic predictor like LAION-Aesthetic. Not done here.

If your use case is production-grade anime generation or high-fidelity portrait service, bring your own prompt set and inspect images yourself. The 6-prompt diversity test here gives starting confidence, not a verdict for your specific application.


Reproducibility

The full bench script zimage_quality_bench.py and the 72-sample LPIPS + CLIPScore raw JSON are not yet on GitHub (cleanup pending). If you want to run this yourself, the core idea:

pip install lpips open_clip_torch torch
# Roll your own driver: hit ComfyUI's HTTP API for 6 prompts × 4 configs × 3 seeds,
# pipe each output through lpips.LPIPS(net='alex') + open_clip ViT-B-32 to score.
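
Once the 72 images exist, scoring reduces to the two helpers sketched earlier in this post plus a small aggregation pass. A minimal aggregation sketch, assuming each per-image score has been dumped into a scores.json with a hypothetical path and schema:

# Aggregation sketch: per-sample scores -> the headline table above.
# Assumed record shape (one per generated image), e.g.:
#   {"prompt": "anime", "config": "NVFP4", "seed": 7777, "lpips": 0.31, "clip": 0.312}
import json
from collections import defaultdict
from statistics import mean, stdev

records = json.load(open("scores.json"))

by_config = defaultdict(lambda: {"lpips": [], "clip": []})
for r in records:
    by_config[r["config"]]["lpips"].append(r["lpips"])
    by_config[r["config"]]["clip"].append(r["clip"])

for config, v in by_config.items():
    print(f"{config:12s}  LPIPS {mean(v['lpips']):.4f} ± {stdev(v['lpips']):.3f}   "
          f"CLIPScore {mean(v['clip']):.4f} ± {stdev(v['clip']):.3f}")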

Requirements: ComfyUI 0.20+ at localhost:8188 plus the four model files (BF16, FP8scaled, NVFP4, qwen_3_4b_fp8_mixed) downloaded. Total disk ~35 GB if you want all four configs side-by-side (Part 1's recommended single combo is much smaller). Full run (72 generations + scoring) ~15–20 minutes.


What's next

After Part 1 ("speed + VRAM, best combo") and Part 2 ("quality is preserved"), one mystery remains:

Why is FP8 transformer slower than BF16 on GB10? Why is NVFP4 the actual winner?

ComfyUI source code provides hints (pick_operations() routing, fp8_linear() cast logic, MixedPrecisionOps block-quantized path), but to draw a real mechanism conclusion you need an nsys profiler trace of the actual kernels dispatched. I tried nsys 2025.6.3 on GB10 / SM12.1 / sbsa-aarch64; CUPTI does not capture kernel-level data on this configuration (a known nsys bug). Part 3 has to wait until either NVIDIA fixes nsight-cu support for SM12.1 or I find an alternative profiler.


FAQ

Does Z-Image Turbo quantization break image quality?
**No measured regression in this N=72 sample.** Two-axis benchmark: all four configs (BF16 / FP8scaled / NVFP4 / NVFP4+FP8e) sit at CLIPScore 0.334–0.339, and the std band of ±0.04 is an order of magnitude larger than the 0.001–0.005 mean differences between configs — not statistically distinguishable. LPIPS distance from BF16 ranges from 0.167 (FP8scaled) up to 0.307 (NVFP4+FP8e), but that's "different image," not "degraded image." Caveat: N=3 seeds × 6 prompts is not enough to rule out tail-risk on specific subjects — re-verify with your own prompt set before production use.
What do LPIPS and CLIPScore actually measure?
**LPIPS** = perceptual distance with BF16 as the ground-truth reference (0 = identical). **CLIPScore** = image-text alignment, scored independently per image without needing BF16 reference. Combining the two avoids the LPIPS trap: high LPIPS does not imply quality regression — CLIPScore tells you whether the image still matches the prompt.
How does the recommended NVFP4+FP8 encoder combo do on quality?
**No measured regression.** Overall mean CLIPScore 0.3388 vs BF16 0.3344 — a 0.0044 gap that is far below the ±0.04 std, so we cannot claim a significant win. Per-prompt, NVFP4+FP8e mean is directionally higher on 3 of 6 prompts (photo_woman, photo_machine, chinese); the other 3 are directionally lower. With only N=3 seeds per cell there's no power for a paired t-test, and the correct framing is "not worse," not "better."
Does Chinese prompt break under quantization?
**No measured regression.** Mean CLIPScore over 3 chinese-prompt seeds: BF16 0.264 / FP8scaled 0.263 / NVFP4 0.266 / NVFP4+FP8e **0.273**. NVFP4+FP8e is directionally higher but still inside the ±0.04 std band. The overall lower numbers on Chinese (0.26 vs ~0.35 English) reflect a CLIP model bias toward English training data, not a quantization issue.