~/blog/dgx-spark-nvfp4-quant-tax-chinese-vs-english

DGX Spark · part 37

[Benchmark] NVFP4 Weight-Only Quantization Taxes Chinese ~2x Harder Than English (gemma-4-12B)

cat --toc

TL;DR

I quantized Google's gemma-4-12B three ways — BF16, FP8 dynamic, and NVFP4 weight-only — and scored each on English (MMLU) and Traditional Chinese (TMMLU+) on a DGX Spark GB10. The finding: NVFP4 weight-only quantization is not language-neutral. It costs −2.7 points on English but −6.0 points on Traditional Chinese (−3.5% vs −12.6% relative) — the Chinese tax is roughly twice the English tax in absolute terms. FP8 dynamic stays near-lossless on both (within 0.4 points). This refines my Part 36 result, where weight-only NVFP4 won on speed, size, and keeping multimodal — it does win those, but the accuracy bill is real and it lands harder on the weaker language.

Plain-language version

When you shrink an AI model with quantization, you are rounding its numbers to fewer bits to make it smaller and faster. The usual assumption is that this costs a little accuracy, evenly, across everything the model does. It turns out that is not true: when I shrank this model to 4-bit weights, it lost about twice as much accuracy on Traditional Chinese as it did on English. A gentler 8-bit setting (FP8) lost almost nothing on either language. So the cheap, aggressive quantization is not just "a bit worse" — it is unevenly worse, and it punishes the language the model was already weaker at. If you mostly serve English you may never notice; if you serve Chinese it matters.


Preface

Part 36 crowned a winner. I quantized gemma-4-12B (Google's new omni model) on the GB10 and concluded that weight-only NVFP4 was the build to ship: 7.7 GB, 24.9 tok/s, and — unlike full W4A4 — it kept image, audio, and video working. Speed, size, modalities: three checkmarks.

What I did not check in Part 36 was the bill on a fourth axis: accuracy. "Keeps multimodal" only means the modalities still respond; it says nothing about whether the answers got dumber. This post is me going back and reading that bill, paired across two languages. The short version: NVFP4 weight-only is not free, and it is not even-handed.

The numbers: NVFP4 drops Chinese 6 points but English only 3

Everything below is my own bench on one DGX Spark (GB10, sm_121a). I scored each format on MMLU (English, 57 subjects) and TMMLU+ (Traditional Chinese, 66 subjects) through lm-evaluation-harness, 5-shot, with the chat template applied (more on why that matters below), limit=30 per subject (N ≈ 1,710 English / 1,980 Chinese, ±~1.0 point standard error).

FormatMMLU (EN)TMMLU+ (TC)EN taxTC tax
BF1678.30%47.21%
FP8 dynamic77.95%46.97%−0.35−0.24
NVFP4 W4A1675.56%41.24%−2.74−5.97

Read the last two columns. FP8 is a rounding error on both languages — call it lossless. NVFP4 weight-only, though, drops English by 2.7 points and Traditional Chinese by 6.0. In relative terms that is −3.5% on English versus −12.6% on Chinese: the Chinese tax is ~2.2x the English tax in absolute points, ~3.6x in relative terms.

A note on what is solid and what is softer. The Chinese drop (−6.0 points) is about 3.8 standard errors — that one is real. The English drop (−2.7 points) is about 1.8 standard errors — smaller, and I would not bet the house on the exact magnitude, but it is not zero either. So the honest framing is not "NVFP4 is free on English and breaks Chinese." It is: both languages get taxed, and Chinese gets taxed about twice as hard.

FP8 is the near-lossless, language-symmetric option

The FP8 row is the quiet hero. −0.35 on English, −0.24 on Chinese — both inside the noise, and roughly equal. FP8 keeps 8 bits of weight precision and quantizes activations dynamically per-token, so there is no aggressive rounding of the weights and no calibration set to overfit to one language. The result is a tax you cannot measure and that does not care which language you feed it.

That symmetry is the interesting contrast. It is not that quantization is unfair to Chinese — it is that this particular aggressive 4-bit weight rounding is. Move from 4-bit back to 8-bit and the asymmetry vanishes.

Why the cheaper language pays more: a fragility hypothesis

I cannot prove the mechanism from two benchmark columns, so label this a hypothesis. English is enormously over-represented in pretraining data. Over-representation buys redundancy — the same knowledge encoded across many partly-redundant directions in weight space. When you round those weights to 4 bits, redundancy absorbs the damage: enough of the signal survives. Traditional Chinese gets a far smaller, thinner slice of pretraining, so its representations are less redundant and more brittle. The same 4-bit rounding that English shrugs off knocks Chinese off balance.

This rhymes with what 4-bit did to the modalities in Part 36. There, full W4A4 (which also quantizes activations) completely broke image, audio, and video, because the image/audio embeddings were out-of-distribution for a text-calibrated 4-bit range. Different mechanism — that was activation quantization, this is weight quantization — but the same theme: the most fragile thing the model does is the first thing 4-bit takes away. For W4A4 that was the non-text modalities; for weight-only NVFP4 it is the lower-resource language.

For scale: gemma-4-12B at 47% Traditional Chinese ties the bigger gemma-4-26B (46.30% in my earlier TMMLU+ run) and sits far behind Qwen 3.6 35B (75.07%). The Gemma family is just weak at Traditional Chinese regardless of precision — that is the backdrop. The new point here is the quantization asymmetry on top of it.

The methodology trap that almost cost me a night: no chat template = below random

This experiment nearly died at the start, and the reason is worth stealing.

My first pass scored every format at ~23% aggregate — below the 25% random floor for 4-choice questions. BF16, FP8, NVFP4: all ~23%. That looks exactly like a broken model or a broken quantization, and I briefly believed it.

It was neither. The default lm-eval path scores multiple-choice by raw-completion loglikelihood with no chat template. An instruct/omni model fed a bare completion prompt never "enters answering mode" — it does not engage the question, and the loglikelihoods over A/B/C/D end up worse than chance. The fix is one flag pair:

lm_eval --model hf \
  --model_args pretrained=<model>,trust_remote_code=True,dtype=bfloat16 \
  --tasks tmmluplus --num_fewshot 5 --limit 30 --batch_size 4 \
  --apply_chat_template --fewshot_as_multiturn \
  --trust_remote_code

With --apply_chat_template --fewshot_as_multiturn, BF16 jumped from 24.3% to 46.1%. The model was fine the whole time; I was feeding it the wrong format. The reusable lesson: a below-random aggregate on an instruct model is the signature of a missing chat template, not a broken model. Check that before you blame the weights.

One more snag specific to this model: vLLM's /v1/completions (loglikelihood) path 500s and kills the engine on gemma4_unified, so I could not run the eval through a vLLM server at all. I ran it through transformers directly (--model hf), which needs a transformers new enough to know the gemma4_unified class — I ran 5.10.1 built from main (PyPI stable was 5.9.0 at the time, so a plain pip install transformers==5.10.1 won't find it) — plus datasets==2.21 pinned so the TMMLU+ builder script still loads. HF eager is slow — dequant-per-forward, no native 4-bit kernel — but for an accuracy eval, slow is fine; the logits are the same.

What I got out of this

Where the time went. Not the compute — the chat-template trap. Watching three formats all land below random and not immediately knowing whether it was the model, the quant, the eval harness, or the brand-new architecture cost me the most. The compute was a few unattended hours overnight.

Transferable diagnostics. Two that apply far beyond this model: (1) below-random aggregate on an instruct model → suspect the chat template before the weights; (2) when a serving runtime's loglikelihood path is broken on a fresh architecture, fall back to transformers --model hf for the accuracy eval — you lose speed, not correctness. And the design pattern that made the asymmetry visible at all: always pair the benchmark across the axis you suspect. A single-language number would have shown "NVFP4 costs 6 points" and I would have called it a generic quant tax. Running English alongside Chinese is what turned a number into a finding.

The one-liner. 4-bit weight quantization does not spend its accuracy budget evenly — it spends it first on whatever the model is already weakest at. Here that was the lower-resource language. Pair your evals across the axis you care about, or you will not see it.

Conclusion

  • FP8 dynamic is near-lossless and language-symmetric on gemma-4-12B (−0.35 EN, −0.24 TC). If you cannot afford an accuracy hit, this is the safe quant.
  • NVFP4 weight-only is not free. It still wins speed (24.9 vs 15.9 tok/s) and size (7.7 vs 13 GB) and keeps multimodal — but it costs ~2.7 points on English and ~6.0 on Traditional Chinese.
  • The 4-bit tax is uneven. Chinese paid ~2x the English tax in absolute points (~3.6x relative). If you serve a non-English or lower-resource language, measure it before you ship 4-bit.
  • If you see a below-random MMLU/TMMLU+ aggregate, add --apply_chat_template --fewshot_as_multiturn before you blame the model or the quant.
  • Scope: one model family (gemma-4-12B omni), one quant recipe (llmcompressor NVFP4A16 weight-only), limit=30. Indicative, not a universal law — but a clean enough data point to change how I pick a quant for Chinese-heavy work.

Also in this series: Gemma 4 12B Omni: Weight-Only NVFP4 Beats W4A4 · Qwen 3.6 vs Gemma 4 on Traditional Chinese (TMMLU+)

FAQ

Does NVFP4 weight-only quantization hurt accuracy?
Yes, measurably — and unevenly by language. On gemma-4-12B I measured NVFP4 weight-only (W4A16) dropping English MMLU by 2.7 points (78.3 to 75.6) but Traditional Chinese TMMLU+ by 6.0 points (47.2 to 41.2). That is a ~3.5% relative hit on English versus ~12.6% on Chinese. FP8 dynamic, by contrast, was near-lossless on both (within 0.4 points).
Is FP8 or NVFP4 better for quantizing an LLM?
It depends on what you optimize for. NVFP4 weight-only is smaller (7.7 GB vs 13 GB) and faster (24.9 vs 15.9 tok/s on a GB10), but on gemma-4-12B it cost ~6 points of Traditional Chinese accuracy. FP8 dynamic was near-lossless on both English and Chinese. If your workload is heavy in a non-English or lower-resource language, FP8 is the conservative choice; if you mostly serve English and want the speed and memory, NVFP4 is reasonable.
Why does my lm-eval MMLU score come out below random (25%)?
Almost always because you are scoring an instruct or chat model with raw-completion loglikelihood and no chat template. Add --apply_chat_template --fewshot_as_multiturn. On gemma-4-12B that single flag moved the aggregate from 24.3% (below the 25% random floor) to 46.1%. A below-random aggregate is the signature of a missing chat template, not a broken model or a broken quantization.
Does 4-bit quantization affect non-English languages more than English?
In this one careful experiment, yes. NVFP4 weight-only on gemma-4-12B degraded Traditional Chinese about twice as much as English in absolute points, and about 3.5x as much in relative terms. The likely reason is that English is heavily over-represented in pretraining, so its internal representations are more redundant and survive 4-bit weight rounding, while lower-resource language representations are more fragile. This is one model and one quant recipe, so treat it as indicative, not a universal law.