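Does quantizing Z-Image Turbo degrade quality?

**No quality regression showed up in this N=72 sample.** Across 72 samples scored on two axes, all 4 configs (BF16 / FP8scaled / NVFP4 / NVFP4+FP8e) land at a CLIPScore of 0.334-0.339. The std band (±0.04) is an order of magnitude larger than the gap between config means (0.001-0.005), so there is no significant difference. LPIPS vs BF16 ranges from 0.167 to 0.307, but that means "the images look different", not "the quality got worse". Note: 3 seeds × 6 prompts is not enough to rule out tail risk on specific subjects, so validate on your own prompt set before production use.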

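What do LPIPS and CLIPScore each measure?

**LPIPS** = perceptual distance: with BF16 as the reference, how far the quantized output deviates (0 = identical). **CLIPScore** = image-text alignment, scored independently with no BF16 reference needed. Using both axes avoids the LPIPS trap: a high LPIPS does not mean poor quality; CLIPScore is the one that answers "does this image match the prompt?"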

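How good is the recommended combo (NVFP4 + FP8 encoder)?

**No regression measured.** The overall mean CLIPScore is 0.3388 vs 0.3344 for BF16, a gap of 0.0044 that is far below the std of 0.04, so a significant win cannot be claimed. Per prompt: NVFP4+FP8e has the highest mean on 3 prompts (photo_woman / photo_machine / chinese), and on the other 3 one of BF16 / FP8scaled / NVFP4 is highest, but everything sits inside the noise band. N=3 seeds is too few for a paired t-test; the correct conclusion is "did not lose", not "won".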

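Do Chinese prompts break under quantization?

**No breakage measured.** Mean CLIPScore for the 1 Chinese prompt × 3 seeds: BF16 0.264 / FP8scaled 0.263 / NVFP4 0.266 / NVFP4+FP8e **0.273**. NVFP4+FP8e is directionally a little higher but still within the std of 0.04. The Chinese prompt scores lower overall than the English ones because the CLIP model is biased toward English, not because of quantization.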

[Hands-on] Z-Image Turbo field guide: does switching configs degrade quality? A two-axis check with LPIPS + CLIPScore

TL;DR

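Z-Image Turbo quantization quality, measured: LPIPS + CLIPScore on two axes across 6 prompts × 4 configs × 3 seeds = 72 samples. LPIPS shows that NVFP4 images look different from BF16 (distance 0.29-0.31), but CLIPScore for all 4 configs sits at 0.334-0.339, and the std band (±0.04) is an order of magnitude larger than the gap between config means (0.001-0.005). In this sample, no quantization path showed a prompt-fidelity regression. NVFP4+FP8e has a higher mean than BF16 on 3 prompts, but N=3 seeds is too few for a significance test; the right wording is "did not lose", not "won". Conclusion: the NVFP4+FP8e combo recommended in Part 1 (1.37× faster, 9.1 GB smaller working set) showed no quality regression in this sample, but before production use, validate it on your own prompt set with N≥10 seeds.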

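Why this post

Part 1 measured speed and resource usage for 6 quantization combos: the NVFP4 transformer runs warm in 5.50s vs 7.55s for BF16 (1.37× faster), and the model working set drops from 20.6 GB to 11.5 GB (44% saved). But it left one big open question: does the quality of the quantized outputs break down?

By eye, 6 photorealistic portraits showed no obvious breakage, but that is the weakest stress test for quantization. Chinese prompts, text rendering, detail-dense anime, abstract styles: can quantization cope with those? Part 1 did not answer that; Part 2 does.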

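How do you measure quality?

"Does quantization degrade quality?" sounds simple but is slippery in practice. What is "quality"? That the image looks the same as BF16? That it matches the prompt? That it looks "beautiful"? The three questions have different answers and need different metrics. I use two metrics that are widely used in image-generation research, side by side.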

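Axis 1: LPIPS. How far is the quantized image from BF16?

What LPIPS is

LPIPS (Learned Perceptual Image Patch Similarity) is a metric proposed by Zhang et al. (2018). Its goal is to replace pixel-level comparisons such as PSNR / SSIM with a measure of whether two images look alike to the human eye.

In practice: run both images through a pretrained AlexNet (the backbone used here), extract the intermediate-layer feature maps, and compute the distance between those features. Why AlexNet? Because mid-level CNN features have learned semantics such as "eyes, mouths, textures", which track human judgment better than comparing raw pixels.

Output range: 0 (the two images are identical) to roughly 1 (completely different). In practice:

LPIPS < 0.05: almost indistinguishable, only pixel-level differences
LPIPS 0.1-0.2: on close inspection, slight differences in composition / detail
LPIPS 0.2-0.4: same subject but clearly different composition (typical of "same prompt, different model")
LPIPS > 0.5: essentially images of different concepts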
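The computation described above can be sketched with the `lpips` package. This is a minimal sketch, not the bench script used for this post: the function names `to_unit_range`, `load_for_lpips`, and `lpips_vs_reference` are hypothetical, and it assumes both images have the same resolution and that `lpips`, `torch`, `numpy`, and `Pillow` are installed.

```python
def to_unit_range(pixels):
    """Map 8-bit pixel values (0..255) to the [-1, 1] range LPIPS expects."""
    return pixels / 127.5 - 1.0


def load_for_lpips(path):
    """Load an image file as a 1x3xHxW float tensor scaled to [-1, 1]."""
    # Heavy dependencies are imported lazily so the helper above stays
    # usable without them.
    import numpy as np
    import torch
    from PIL import Image

    arr = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    return torch.from_numpy(to_unit_range(arr)).permute(2, 0, 1).unsqueeze(0)


def lpips_vs_reference(ref_path, test_path):
    """LPIPS distance of a quantized render from its BF16 reference."""
    import lpips
    import torch

    metric = lpips.LPIPS(net="alex")  # AlexNet backbone with pretrained weights
    with torch.no_grad():
        distance = metric(load_for_lpips(ref_path), load_for_lpips(test_path))
    return float(distance.item())
```

In a real run you would build the metric once and reuse it for all image pairs instead of constructing it per call.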

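Why use BF16 as the reference instead of running BF16 twice?

In theory, the same prompt + seed + model + hardware should be fully deterministic, so LPIPS against itself should be 0. But once the quantization path changes, the model's internal rounding changes, and the latent trajectory starts to diverge from the very first sampling step. That divergence is an unavoidable consequence of the changed numerics and says nothing about whether quality is good or bad.

So LPIPS can tell us how far the quantized output deviates from BF16, but it cannot support a quality verdict on its own. An LPIPS of 0.3 might mean "the image is perfectly fine, just a different take on the composition", or it might mean "quantization broke things so badly the subject is gone". LPIPS by itself cannot tell the two apart. That is what axis 2 is for.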

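Axis 2: CLIPScore. Does the image match the prompt?

What CLIP is

CLIP (Contrastive Language-Image Pretraining) is a two-tower model released by OpenAI in 2021: an image encoder plus a text encoder, trained so that an image and its matching description sit close together in embedding space while mismatched pairs sit far apart. OpenAI's original model was trained on about 400 million image-text pairs; open_clip checkpoints are trained on open datasets such as LAION instead.

CLIPScore here means encoding a generated image and its prompt text into vectors and taking their cosine similarity. Cosine similarity can in principle range from -1 to 1, but the practically meaningful range is:

CLIPScore < 0.20: the image does not match the prompt at all (random output / broken)
CLIPScore 0.25-0.30: the subject is right but details are off
CLIPScore 0.30-0.40: baseline for a generally good generative model
CLIPScore > 0.40: very strong image-text alignment (for example, an abstract prompt hit precisely)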
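The score described above can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, and the `pretrained` tag `laion2b_s34b_b79k` is only one of the ViT-B-32 checkpoints that open_clip offers; the post does not say which checkpoint was used, so swap in your own.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_cosine(image_path, prompt,
                model_name="ViT-B-32", pretrained="laion2b_s34b_b79k"):
    """Image-text cosine similarity from an open_clip model."""
    # Imported lazily so cosine_similarity works without these packages.
    import open_clip
    import torch
    from PIL import Image

    # The checkpoint tag above is an assumption, not the post's setting.
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained)
    tokenizer = open_clip.get_tokenizer(model_name)
    model.eval()

    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    tokens = tokenizer([prompt])
    with torch.no_grad():
        image_vec = model.encode_image(image)[0].tolist()
        text_vec = model.encode_text(tokens)[0].tolist()
    return cosine_similarity(image_vec, text_vec)
```

Because the score is a plain cosine, the absolute numbers depend on the checkpoint; what matters in this post is the comparison across configs scored by the same model.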

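Why is CLIPScore a proxy for quality?

The most common failure mode of "quantization broke quality" is an image that looks odd and does not match the prompt. CLIPScore measures exactly that correspondence and is sensitive to breakage. If quantization shifts the image so far that the subject is lost, the composition is strange, or it no longer looks like what the prompt describes, CLIPScore drops.

If CLIPScore does not drop, then on the prompt-correspondence dimension the quantized version did not lose, which is enough to counter the worry that "quantization always breaks quality".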

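Limitations of CLIPScore

No aesthetic judgment: an ugly image that matches the prompt can still score high. For most image-gen use cases, though, "matches the prompt" is the first priority
English bias: the open_clip ViT-B-32 training data is 80%+ English, so Chinese / multilingual prompts get lower absolute scores overall, but cross-config comparison is still fair (for a given prompt every config is affected by the same bias)
Not fine-grained on typos / detail errors: CLIP may miss detail-level differences. That level of precision needs a learned predictor such as LAION-Aesthetic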

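How to read the two axes together

LPIPS axis	CLIPScore axis	Conclusion
High	Same as BF16	Quantized output is "different but not worse" ← typical successful quantization
High	Lower than BF16	Quantization broke the output ← to be avoided
Low	Same as BF16	Quantized output is "close to BF16 and not worse" ← also a success
Low	Higher than BF16	Rare; possibly a small improvement that LPIPS did not pick up

Reading both axes together separates "different" from "worse". LPIPS alone would mistake trajectory divergence for breakage; CLIPScore alone would miss visually obvious shifts.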
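The table above can be written as a small decision function. This is only a sketch: the function name `interpret` and both thresholds are assumptions picked for illustration (0.2 as the "high LPIPS" cutoff, 0.04 as the CLIPScore noise band taken from the std reported in this post), not values the method prescribes.

```python
def interpret(lpips_vs_bf16, clip_delta, lpips_high=0.2, clip_noise=0.04):
    """Map one config's two-axis result to a verdict from the table.

    clip_delta is (config CLIPScore - BF16 CLIPScore). Both thresholds
    are illustrative assumptions.
    """
    far_from_bf16 = lpips_vs_bf16 >= lpips_high
    if clip_delta < -clip_noise:
        # The table only lists "high LPIPS + lower CLIPScore"; treating a
        # drop with low LPIPS the same way is a conservative assumption.
        return "degraded"
    if clip_delta > clip_noise:
        return "possible improvement"
    if far_from_bf16:
        return "different but not worse"
    return "close to BF16 and not worse"


# Mean values from the results section of this post:
print(interpret(0.3069, 0.3388 - 0.3344))  # NVFP4+FP8e
print(interpret(0.1670, 0.3356 - 0.3344))  # FP8scaled
```

With these assumed thresholds, NVFP4+FP8e falls into "different but not worse" and FP8scaled into "close to BF16 and not worse".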

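Why not FID / SSIM / human raters?

Metric	Why it was not used
FID (Fréchet Inception Distance)	Needs N≥1000 samples to be statistically stable. FID on 72 samples is noise
SSIM / PSNR	Pixel-level structural similarity. When rounding differences push the sampling trajectory to a different composition, these metrics see none of the semantics and report low similarity every time, so they carry no useful signal
Human A/B scoring	A blind A/B eval over 4 configs × 6 prompts × 3 seeds = 72 images needs ≥10 raters for statistical power; this post did not have that resource
LAION-Aesthetic	A subjective aesthetics predictor; suited to ranking, not to cross-config comparison (its preferences vary a lot by style)

At our N=72 scale, LPIPS + CLIPScore is the most robust and cheapest choice.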

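6 diverse prompts (stress-test design)

key	prompt	stress focus
photo_woman	Photorealistic portrait of a woman in a café	Weakest stress for quantization (easy)
photo_machine	Photorealistic shot of antique pocket-watch gear detail	Dense detail (mechanical structure)
anime	Anime woman in a kimono + cherry blossoms	Stylized (non-photorealistic)
text_render	Wooden shop sign reading "CLOSED FOR REPAIRS"	Text rendering (an FP4 weak spot)
chinese	Classical ink-wash scholar in a bamboo grove (prompt entirely in Chinese)	Chinese + classical atmosphere
abstract	Surreal floating islands + cloud waterfalls	Abstract + large color fields

All 6 prompts × 4 configs × 3 seeds = 72 samples were run (seeds = 42, 7777, 12345).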
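The full grid is small enough to enumerate directly. A minimal sketch (the names `build_jobs` and `reference_job` are hypothetical; the keys, config labels, and seeds are the ones listed in this post):

```python
from itertools import product

PROMPT_KEYS = ["photo_woman", "photo_machine", "anime",
               "text_render", "chinese", "abstract"]
CONFIGS = ["BF16", "FP8scaled", "NVFP4", "NVFP4+FP8e"]
SEEDS = [42, 7777, 12345]


def build_jobs():
    """One job per (prompt, config, seed) combination."""
    return [{"prompt": p, "config": c, "seed": s}
            for p, c, s in product(PROMPT_KEYS, CONFIGS, SEEDS)]


def reference_job(job):
    """The BF16 job with the same prompt and seed, used as the LPIPS reference."""
    return {**job, "config": "BF16"}


jobs = build_jobs()
print(len(jobs))  # 6 * 4 * 3 = 72
```

Pairing every quantized job with `reference_job(job)` keeps the LPIPS comparison like-for-like: same prompt, same seed, only the config differs.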

4 configs (matching the Part 1 recommended combo)

Config	Transformer	Encoder
BF16	z_image_turbo_bf16	qwen_3_4b BF16 (reference)
FP8scaled	Kijai pre-quantized	qwen_3_4b BF16
NVFP4	z_image_turbo_nvfp4	qwen_3_4b BF16
NVFP4+FP8e (recommended)	z_image_turbo_nvfp4	qwen_3_4b_fp8_mixed

Results

Mean across the 72 samples (main table; 18 images per config)

config       LPIPS vs BF16     CLIPScore (image-text)
─────────   ───────────────   ──────────────────────
BF16         0.0000  ref       0.3344 ± 0.043
FP8scaled    0.1670 ± 0.081    0.3356 ± 0.043   ← on par with BF16
NVFP4        0.2886 ± 0.086    0.3356 ± 0.045   ← on par with BF16
NVFP4+FP8e   0.3069 ± 0.093    0.3388 ± 0.040   ← 0.0044 above BF16

All 4 configs score a CLIPScore of 0.334-0.339. The NVFP4+FP8e mean is 0.0044 above BF16, but the overall std band is ±0.04, an order of magnitude larger than that gap, so any change in image-text alignment caused by the quantization path is buried in noise. With N=3 seeds a meaningful paired t-test is not possible. The conclusion this sample supports is "no prompt-fidelity regression was measured for any quantized config", not "equivalent to BF16" and not "beats BF16".
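To make the "N=3 is too few" point concrete for a single prompt: a paired t-test on n seed-matched differences uses t = mean(d) / (sd(d) / sqrt(n)) with n - 1 degrees of freedom. The sketch below (function name hypothetical) turns that around and asks how large the mean difference has to be before the test would flag it. The per-seed standard deviation of the differences is not reported in this post, so it stays a parameter.

```python
import math

# Two-sided 5% critical value of Student's t with 2 degrees of freedom
# (n = 3 paired seeds), taken from a standard t table.
T_CRIT_DF2 = 4.303


def min_significant_mean_diff(sd_of_diffs, n=3, t_crit=T_CRIT_DF2):
    """Smallest |mean paired difference| a paired t-test would call significant."""
    return t_crit * sd_of_diffs / math.sqrt(n)


# With n = 3 the mean difference must exceed about 2.48x the standard
# deviation of the per-seed differences.
print(round(min_significant_mean_diff(1.0), 2))
```

That factor applies before any multiple-comparison correction, which would raise the bar further.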

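LPIPS reflects relative distance from BF16: FP8scaled is closest (0.167), NVFP4 is in the middle (0.289), and NVFP4+FP8e is farthest (0.307). This is expected: the more aggressive the quantization, the earlier the latent trajectory diverges from BF16. That distance is not "quality loss"; it is "a different model producing a different image".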

Per-prompt CLIPScore breakdown (which prompts break under quantization?)

prompt	BF16	FP8scaled	NVFP4	NVFP4+FP8e	Highest mean (Δ vs BF16)
photo_woman	0.3403	0.3404	0.3383	0.3426	NVFP4+FP8e ↑0.0023
photo_machine	0.3273	0.3354	0.3292	0.3433	NVFP4+FP8e ↑0.0160
anime	0.3187	0.3173	0.3130	0.3174	BF16 ↑0.0013 vs NVFP4+FP8e (within noise)
text_render	0.3628	0.3656	0.3642	0.3623	FP8scaled ↑0.0028
chinese	0.2642	0.2626	0.2658	0.2731	NVFP4+FP8e ↑0.0089
abstract	0.3932	0.3926	0.4030	0.3938	NVFP4 ↑0.0098

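Takeaways (the honest version, statistical pitfalls first):

NVFP4+FP8e has a higher mean than BF16 on 3 prompts (photo_woman / photo_machine / chinese), by 0.002-0.016. But every one of those "wins" is below the std band of 0.04, and with N=3 seeds a paired t-test would not be significant in any direction. "Winning" here is a directional observation, not a statistically significant result.
Looking the other way: on anime BF16 is 0.006 above NVFP4, on text_render FP8scaled is 0.003 above BF16, and on abstract NVFP4 is 0.009 above NVFP4+FP8e. These 3 directions are equally inside the noise band and are not evidence of regression either.
The 6 per-prompt comparisons were made independently with no multiple-comparison correction (Bonferroni or similar), so no single row's mean ordering supports a strong inference.
What can be concluded: in this sample, no quantization path broke badly enough to pull CLIPScore systematically below BF16. The 4 configs are hard to separate on prompt fidelity, so quality concerns are not a reason to avoid switching to NVFP4+FP8e.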
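As a sanity check, the per-config means in the main table can be recomputed from the per-prompt breakdown. A minimal sketch with hypothetical function names; the numbers are copied from the breakdown table.

```python
from statistics import stdev

CONFIGS = ["BF16", "FP8scaled", "NVFP4", "NVFP4+FP8e"]

# Per-prompt mean CLIPScore (3 seeds each), copied from the breakdown table.
PER_PROMPT = {
    "photo_woman":   [0.3403, 0.3404, 0.3383, 0.3426],
    "photo_machine": [0.3273, 0.3354, 0.3292, 0.3433],
    "anime":         [0.3187, 0.3173, 0.3130, 0.3174],
    "text_render":   [0.3628, 0.3656, 0.3642, 0.3623],
    "chinese":       [0.2642, 0.2626, 0.2658, 0.2731],
    "abstract":      [0.3932, 0.3926, 0.4030, 0.3938],
}


def config_means(table):
    """Average each config's column over all prompts."""
    n = len(table)
    return {cfg: sum(row[i] for row in table.values()) / n
            for i, cfg in enumerate(CONFIGS)}


def best_config(row):
    """Config with the highest mean CLIPScore in one table row."""
    return CONFIGS[row.index(max(row))]


def spread_across_prompts(table, config):
    """Sample standard deviation of one config's per-prompt means."""
    i = CONFIGS.index(config)
    return stdev([row[i] for row in table.values()])


for cfg, mean in config_means(PER_PROMPT).items():
    print(f"{cfg:11s} {mean:.4f}")
```

The recomputed means agree with the main table to within rounding. One thing this makes visible: the spread of the six BF16 per-prompt means alone is about 0.044, so the ±0.04 band in the main table appears to reflect prompt-to-prompt variation more than seed-to-seed noise. That is an inference from the published table, not something measured separately here, and it is one more reason to compare configs per prompt with more seeds before drawing strong conclusions.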

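The Chinese prompt scores lower overall (0.26 vs 0.32-0.39 for English)

Note that the numbers in the chinese row are clearly lower than the rest. This is not a quantization problem; it is the CLIP model's own bias toward English (the open_clip ViT-B-32 training data is 80%+ English). You can see it from the 4 configs all landing in the same tier on chinese (0.263-0.273): quantization did not break Chinese; CLIP simply scores Chinese text lower across the board.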

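Visual comparison, 6 prompts × 4 configs (seed=42)

The CLIPScore shown with each image set below is the single-image score at seed=42, while the breakdown table above is the mean over 3 seeds, so the numbers for the same prompt and config differ slightly between the two places (the difference stays within the std of 0.04, which is reasonable).

Photo: portrait of a woman in a café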

CLIPScore: BF16 0.334 / FP8scaled 0.332 / NVFP4 0.319 / NVFP4+FP8e 0.337 ↑

BF16	FP8scaled

NVFP4	NVFP4+FP8e

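All four portraits hold up in composition, hairstyle, and expression. The two on the NVFP4 path (bottom row) differ more from BF16 in composition, but body proportions, skin texture, and the café atmosphere are all normal.

Photo: antique pocket-watch gear detail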

CLIPScore: BF16 0.332 / FP8scaled 0.329 / NVFP4 0.336 / NVFP4+FP8e 0.348 ↑↑

BF16	FP8scaled

NVFP4	NVFP4+FP8e

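The gear-detail stress test. The NVFP4+FP8e image has the highest CLIPScore at 0.348, with brass texture and gear structure both in place.

Anime: woman in a kimono + cherry blossoms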

CLIPScore: BF16 0.307 / FP8scaled 0.318 / NVFP4 0.312 / NVFP4+FP8e 0.299

BF16	FP8scaled

NVFP4	NVFP4+FP8e

The only prompt type where BF16 wins on the mean, yet at this seed FP8scaled (0.318) still scores above BF16 (0.307). All within the overall std range.

Text render: wooden shop sign (FP4 stress)

CLIPScore: BF16 0.348 / FP8scaled 0.353 / NVFP4 0.350 / NVFP4+FP8e 0.351

BF16	FP8scaled

NVFP4	NVFP4+FP8e

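The text stress test did not break: all 4 configs produce a legible English sign. The differences are in the weathering texture and lettering style.

Chinese: classical ink-wash scholar in a bamboo grove (FP4 + Chinese double stress)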

CLIPScore: BF16 0.260 / FP8scaled 0.261 / NVFP4 0.264 / NVFP4+FP8e 0.266 ↑

BF16	FP8scaled

NVFP4	NVFP4+FP8e

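The Chinese prompt did not break under quantization. NVFP4+FP8e is directionally a little higher at this seed but still inside the noise band. The classical ink-wash atmosphere, the bamboo grove, and the scholar are all there.

Abstract: surreal floating islands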

CLIPScore: BF16 0.408 / FP8scaled 0.406 / NVFP4 0.390 / NVFP4+FP8e 0.384

BF16	FP8scaled

NVFP4	NVFP4+FP8e

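The NVFP4 family scores slightly below BF16/FP8scaled, but all 4 images clearly capture the core concept of "floating islands + cloud waterfalls + dreamlike".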

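Conclusion

The NVFP4 transformer + qwen_3_4b_fp8_mixed encoder combo recommended in Part 1 holds up:

Speed: warm 5.52s vs 7.55s for BF16, 1.37× faster
Working-set RSS: 11.52 GB vs 20.64 GB for BF16, saving 9.12 GB (44%)
Disk: 10.4 GB vs 20.6 GB for BF16, saving ~49%
Quality: CLIPScore 0.3388 vs 0.3344 for BF16, a gap far below the std band of ±0.04; this N=72 sample measured no regression (and cannot claim a significant win either)

The intuition that "quantization inevitably degrades quality" does not hold for Z-Image Turbo on GB10. LPIPS shows that quantized images differ from BF16, but that is only latent-trajectory divergence, not quality degradation. CLIPScore shows no significant difference across the 4 configs, so prompt fidelity did not lose. At the same time, N=3 seeds is not strong enough to declare any winner.

Bottom line: the NVFP4+FP8e combo recommended in Part 1 showed no quality regression in this sample and can be used as the default. If you run a production-grade anime / portrait service, though, validate once more on your own prompt set with N≥10 seeds. 6 prompts × 3 seeds is not enough to rule out tail risk on specific subjects.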

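Methodological limitations

Neither LPIPS nor CLIPScore is a perfect metric. The limitations are:

LPIPS uses AlexNet features, which are more sensitive on photographic content than on anime/abstract. Read the LPIPS numbers for anime / abstract conservatively
CLIPScore with ViT-B-32 is biased toward English (the training data skews English); a low overall score on the Chinese prompt does not mean Chinese broke
CLIPScore does not judge aesthetics, only alignment to the prompt. A very ugly image that matches the prompt can still score high
N=3 seeds per (prompt, config): the sample is not large, so per-prompt conclusions should lean on the std, not the mean
FID was not measured, because N=72 is too small for FID to be meaningful (FID needs N≥1000 images to be stable); CLIPScore is the most robust choice at this N
Detail fidelity for anime / cartoon content was not measured: capturing perceptual loss on stylized content needs a learned aesthetic predictor such as LAION-Aesthetic, which this post did not run

If your use case is production-grade anime generation or a high-fidelity portrait service, run your own prompt set and compare the images by eye. This 6-prompt diversity test can only give you a starting level of confidence; it cannot make the call for your use case.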

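Experiment reproducibility

The full bench script zimage_quality_bench.py and the raw LPIPS + CLIPScore JSON for the 72 samples have not been cleaned up and published on GitHub yet. If you want to run it yourself, the core logic is:

pip install lpips open_clip_torch torch
# Write your own script: drive 6 prompts × 4 configs × 3 seeds through the ComfyUI HTTP API,
# then score the results with lpips.LPIPS(net='alex') + open_clip ViT-B-32

You need ComfyUI 0.20+ running on localhost:8188, with the 4 models (Z-Image Turbo BF16 / FP8scaled / NVFP4 and qwen_3_4b_fp8_mixed) already downloaded to disk. All four are required because the test compares four configs, which comes to about 35 GB on disk; using only the single combo recommended in Part 1 takes much less. Running all 72 generations plus scoring takes about 15-20 minutes.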
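For the generation half, queuing a job through the ComfyUI HTTP API can be sketched as below. This is a sketch under assumptions, not the unpublished bench script: the workflow must be one you exported from ComfyUI in API format, the sampler node id (`"3"` in the usage comment) is hypothetical and depends on your own workflow, and all function names are made up for illustration.

```python
import copy
import json
import urllib.request

COMFYUI_URL = "http://localhost:8188"  # address given in this post


def with_seed(workflow, sampler_node_id, seed):
    """Return a copy of an API-format workflow with the sampler seed replaced."""
    patched = copy.deepcopy(workflow)
    patched[sampler_node_id]["inputs"]["seed"] = seed
    return patched


def build_prompt_request(workflow, client_id, base_url=COMFYUI_URL):
    """Build the POST /prompt request that queues one generation."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(workflow, client_id):
    """Send the request; the JSON response carries the queued prompt_id."""
    with urllib.request.urlopen(build_prompt_request(workflow, client_id)) as resp:
        return json.load(resp)["prompt_id"]


# Usage (requires a running ComfyUI and your own exported workflow dict):
# prompt_id = queue_workflow(with_seed(workflow, "3", 7777), "quality-bench")
```

Switching configs means pointing the loader nodes in the workflow at a different transformer / encoder file, which can be patched the same way as the seed once you know the node ids in your workflow.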

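What's next

After Part 1 ("best combo for speed + resources") and Part 2 ("quantized quality holds up"), one mystery remains:

Why is the FP8 transformer actually slower than BF16 on GB10? And why is NVFP4 the real winner?

The ComfyUI source code gives some hints (pick_operations() routing / the fp8_linear() cast / the MixedPrecisionOps block-quantized path), but drawing a conclusion about the mechanism requires the nsys profiler to show which kernels are actually dispatched. I tried nsys 2025.6.3 on GB10 / SM12.1 / sbsa-aarch64, and CUPTI does not capture kernel-level data there (a known nsys bug), so Part 3 has to wait until NVIDIA fixes nsight-cu or I find an alternative profiler.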