DGX Spark · part 28
Want MTP speedup on abliterated Gemma 4? Vanilla draft can't track the modified body
❯ cat --toc
- Where the mismatch lives
- Why self-quantize
- Quantization recipe
- Smoke test (three modalities)
- The setup: vanilla and huihui benched side-by-side
- Evidence 1: sweep `n=1..4`, watch acceptance fall off
- Self-correction: three fabricated numbers I smuggled in
- 1. Vanilla baseline 60 tok/s (first draft, real number is 39.4)
- 2. Vanilla per-position decay 92/85/78/70 (second draft, never measured)
- 3. Gemma 4 license "not pure Apache 2.0" (second draft, it actually is)
- So where did Part 27's 108 tok/s come from?
- Evidence 2: per-position decay pins the mechanism to numbers
- Recommendation for hikari/kiriha-style use cases
- Next: abliteration tuned for spec-decode acceptance
- Usage
- Related
TL;DR
Want to use MTP to speed up your abliterated Gemma 4? Google's official draft model was trained against the vanilla body — once the body is abliterated, the draft mispredicts at every position, error compounds as n grows, and deep speculation effectively breaks.
I self-quantized huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated to FP8-Dynamic and shipped it to HF (as of 2026-05-09 I couldn't find another vLLM-loadable FP8 quant of this base). Same-config side-by-side: the abliterated and vanilla bodies baseline identically (39.3 vs 39.4 tok/s), and the n=1 MTP boost is also a wash — because the draft only fires once, eating a single distribution mismatch. At n=4 the draft chains its own predictions four deep, the mismatch compounds, and acceptance crashes from 65% (pos 0) to 21% (pos 3). Vanilla previously hit 108 tok/s at n=4 (Part 27 config); abliterated caps at 50. ⚠️ Two earlier drafts of this post smuggled in three different fabrications (a 60 tok/s vanilla baseline, a 92/85/78/70 vanilla decay curve, the wrong Gemma 4 license); each was caught by Codex in adversarial review. This third version keeps the corrections in as a reverse case study.
Where the mismatch lives
Quick refresher on the three things speculative decoding pivots on: the target body, the small draft model, and num_speculative_tokens=n (how many tokens the draft proposes per step). The draft proposes n tokens at once; the body verifies them in one parallel forward pass; matching tokens are accepted, mismatches are rejected.
Google's released gemma-4-26B-A4B-it-assistant is a 4-layer Gemma4Assistant trained against the vanilla google/gemma-4-26B-A4B-it body. What the draft has internalized is "given this body's hidden state, the next-token distribution looks like this."
But the abliteration recipes in the wild (huihui / TrevorJS / llmfan46 / Heretic ARA) modify the body's weights to shift its distribution — pulling out the refusal direction has a side effect on the entire hidden-state subspace. The draft no longer sees the distribution it was trained against, so its first-position prediction mismatches more often.
It gets worse the deeper you speculate, because the draft is autoregressive within a single speculative block:
- n=1 → the draft predicts one token from the body's real hidden state. One mismatch tax. Tolerable.
- n=4 → the draft predicts pos 0, then feeds its own pos 0 prediction back in as the input to predict pos 1, then pos 1's into pos 2's, and so on. Each step layers another mismatch on top of the previous one. The deeper n goes, the lower the per-position acceptance.
So deep speculation against an abliterated body structurally breaks: not because the draft is slow, but because the answers it produces drift far enough that the body rejects them outright. The rest of this post pins this mechanism to the actual numbers.
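To make the compounding concrete, here's a toy, self-contained sketch of one speculative block. Nothing below is vLLM's actual code; the two "models" are stand-in hash functions over ints. What it shows is the control flow that makes deep speculation fragile: the draft chains its own guesses, the body verifies the block in one pass, and the first mismatch throws away everything after it.
def target_next(seq):
    # Stand-in for the body's greedy next token (a toy hash, not a real model).
    return (sum(seq) * 31 + 7) % 100

def draft_next(seq):
    # Stand-in for the MTP draft: agrees with the body early on, then drifts once the
    # block gets deep (a toy stand-in for the compounding distribution mismatch).
    t = target_next(seq)
    return t if len(seq) < 5 else (t + 1) % 100

def speculative_step(context, n):
    # 1. Draft proposes n tokens autoregressively, feeding its OWN guesses back in.
    proposed, draft_ctx = [], list(context)
    for _ in range(n):
        tok = draft_next(draft_ctx)
        proposed.append(tok)
        draft_ctx.append(tok)  # a wrong guess here poisons every later position
    # 2. Body verifies the whole block in one parallel pass: accept the longest prefix
    #    matching what it would have generated itself; the first mismatch ends it.
    accepted, true_ctx = [], list(context)
    for tok in proposed:
        if tok != target_next(true_ctx):
            break  # everything after this position is thrown away
        accepted.append(tok)
        true_ctx.append(tok)
    return len(proposed), len(accepted)

print(speculative_step([1, 2, 3], n=1))  # (1, 1): one shallow guess, no compounding
print(speculative_step([1, 2, 3], n=4))  # (4, 2): the deep half of the block is wasted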
Why self-quantize
huihui-ai is one of the more prolific abliteration publishers; their 26B Gemma 4 BF16 has 14,188 downloads on HF as of 2026-05-09. The quant ecosystem around it (also as of 2026-05-09):
| Format | Status |
|---|---|
| BF16 original | ✅ huihui-ai itself |
| FP8-Dynamic (vLLM) | ❌ couldn't find one on HF |
| NVFP4 | ✅ sakamakismile |
| GGUF (multi-publisher) | ✅ groxaxo / bullerwins / mradermacher / iamhsouna |
| MLX (Apple Silicon) | ✅ vanch007 |
| Q8_0 GGUF | ✅ puppert |
For a vLLM user there were two paths: runtime dynamic quantization via --quantization fp8 (~6× slower in my testing) or self-quantizing to FP8 safetensors. GGUF is fine for llama.cpp / ollama / oobabooga / LM Studio, but vLLM's GGUF support is still experimental. So this was an FP8-shaped gap, and I filled it.
Quantization recipe
Tooling: llm-compressor with the FP8_DYNAMIC scheme. Critical ignore list — re:.*router.* is non-negotiable: MoE router weights cannot be quantized, or expert dispatch will collapse (verified by trial and error here).
ignore = [
"re:.*router.*", # ← critical, MoE router stays in BF16
"lm_head",
"re:.*embed_tokens.*",
"re:.*norm.*", "re:.*layernorm.*", "re:.*layer_norm.*",
"re:.*rmsnorm.*", "re:.*rms_norm.*",
"re:.*conv1d.*", "re:.*linear_attn.*",
"re:visual.*", "re:model.visual.*",
"re:.*patch_embed.*", "re:.*vision.*", "re:.*image.*",
"re:.*video.*", "re:.*projector.*", "re:.*merger.*",
"re:.*mlp.gate$", "re:.*shared_expert_gate.*",
"re:.*embed_audio.*", "re:.*embed_vision.*",
"re:.*audio_tower.*", "re:.*audio_projector.*",
]
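For context, the ignore list plugs into llm-compressor roughly like this. A minimal sketch patterned on llm-compressor's FP8_DYNAMIC examples: the save path is a placeholder, exact import paths depend on your llm-compressor version, and (as the trap below explains) the model class needs a transformers branch newer than the published pin.
from transformers import AutoProcessor, Gemma4ForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated"
SAVE_DIR = "Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic"  # placeholder output dir

model = Gemma4ForConditionalGeneration.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# FP8_DYNAMIC: weight scales come from the weights, activation scales at runtime,
# so no calibration dataset is needed. `ignore` is the list defined above.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=ignore)

oneshot(model=model, recipe=recipe)  # data-free pass

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)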
Trap: llm-compressor's stable PyPI metadata pins transformers>=4.56.1,<=4.57.6, but Gemma4ForConditionalGeneration isn't in transformers v4.57.6 — you need a newer transformers branch than the pin allows. That in turn means patching llmcompressor/entrypoints/utils.py to drop the use_auth_token=... kwarg (newer transformers' from_pretrained takes token=... instead). Quantization itself runs in 2.9 minutes on GB10 — FP8-Dynamic is a data-free pipeline (weight scales are derived from the weights themselves, activation scales at runtime; no calibration dataset needed). Output is 27 GB of FP8 safetensors.
Smoke test (three modalities)
| Modality | Result |
|---|---|
| English text | ✅ haiku, semantically correct (1.7 s) |
| Traditional Chinese text | ✅ 80-token self-deprecating humor — abliteration is intact (2.3 s) |
| Vision (hanfu portrait of a classical-style beauty) | ✅ correctly described hanfu, bamboo, mist, and downward gaze (2.8 s) |
| Audio | N/A — audio_config: null on this Gemma 4 26B-A4B-it variant, not a quantization issue |
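To reproduce the text and vision rows against a locally served copy (launch command in the Usage section), the OpenAI-compatible API is enough. A sketch: the endpoint, model name, and image URL are placeholders, and note that the n=1 command in Usage disables image input via --limit-mm-per-prompt, so the vision row needs a launch without that flag.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local vLLM server
MODEL = "coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic"

# Text smoke test
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Write a haiku about FP8 quantization."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)

# Vision smoke test (requires a launch config that allows images; URL is a placeholder)
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this portrait."},
            {"type": "image_url", "image_url": {"url": "https://example.com/hanfu-portrait.jpg"}},
        ],
    }],
)
print(resp.choices[0].message.content)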
The setup: vanilla and huihui benched side-by-side
Part 27 recorded vanilla Gemma 4 + MTP at n=4 hitting 108.78 tok/s (N=5, std < 1%) on a 40.85 tok/s baseline (2.66×). Part 27's launch config is max_model_len=4096 + gpu_memory_utilization=0.85; it didn't publish a token-level acceptance number — only the vLLM test suite's 80% prompt-similarity gate, which is not an acceptance metric.
For an apples-to-apples picture this round, I re-benched both vanilla and huihui on the same GB10 with the same stack instead of leaning on Part 27 numbers (different quantization provenance, different bench conditions).
| Item | Value |
|---|---|
| max_model_len | 8192 (small-KV control) / 65536 (huihui large-KV attempt) |
| gpu_memory_utilization | 0.65 / 0.85 |
| kv_cache_dtype | fp8 |
| temperature | 0.7 |
| tool_call_parser | gemma4 |
| MTP draft model | google/gemma-4-26B-A4B-it-assistant |
| vLLM image | vllm/vllm-openai:gemma4-0505-arm64-cu130 |
| MTP fix mod | bind-mounted gemma4_mtp.py from PR #41745 head |
| FP8 source | huihui (self-quantized, full ignore list) / vanilla (RedHatAI gemma-4-26B-A4B-it-FP8-Dynamic, lighter ignore list) |
⚠️ Confound disclosure: vanilla and huihui were quantized through different FP8 pipelines (different publisher, different ignore list). Body baselines coming out at ~39 tok/s on both is a useful signal that quant pipeline differences don't move throughput much — but per-position acceptance comparisons across the two still carry this confound, and readers should mentally subtract it.
Evidence 1: sweep n=1..4, watch acceptance fall off
If the mechanism above is right, throughput and acceptance should both decay as n grows — n=1 eats one distribution mismatch, n=4 eats four compounded ones.
huihui FP8 throughput sweep (same stack, only n varies):
| num_speculative_tokens | Throughput | Token-level acceptance | Pos-0 acceptance |
|---|---|---|---|
| 1 | 52.6 tok/s | 69% | — |
| 2 | 51.4 tok/s | 57% | 68% |
| 3 | 46.9 tok/s | 46% | 64% |
| 4 | 50.0 tok/s | 39% | 66% |
huihui peaks at n=1; every n>1 setting is slower, with no sweet spot.
Bench methodology: each cell is N=10 prompts × T=0.7 × batch=1, fixed prompt pool, no fixed seed, server warmed up with one prompt before timing started. No run-to-run repetition, so ±1-2 tok/s gaps should be read as noise (not as "n=2 is genuinely slower than n=1"). Acceptance numbers are pulled directly from vLLM's /metrics vllm:spec_decode_num_accepted_tokens_per_pos_total counter, averaged across the 10 prompts.
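For reference, the acceptance columns can be reproduced by scraping that counter off the /metrics endpoint. A minimal sketch, with the caveat that only the per-pos counter name is quoted from vLLM's output above; the position label key and the drafts counter name are assumptions to verify against your build:
import re
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # local vLLM server
text = urllib.request.urlopen(METRICS_URL).read().decode()

# Accepted tokens per speculative position, e.g.
#   vllm:spec_decode_num_accepted_tokens_per_pos_total{position="0",...} 1234.0
accepted = {}
pat = r'vllm:spec_decode_num_accepted_tokens_per_pos_total\{[^}]*position="(\d+)"[^}]*\}\s+([\d.eE+]+)'
for pos, val in re.findall(pat, text):
    accepted[int(pos)] = accepted.get(int(pos), 0.0) + float(val)

# Total number of speculative steps (counter name is an assumption; check your /metrics).
drafts = sum(
    float(v)
    for v in re.findall(r'vllm:spec_decode_num_drafts_total(?:\{[^}]*\})?\s+([\d.eE+]+)', text)
)

for pos in sorted(accepted):
    rate = accepted[pos] / drafts if drafts else float("nan")
    print(f"pos {pos}: {rate:.1%}")  # per-position acceptance, as in the table above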
Vanilla Gemma 4 reference (Part 27, already published, N=5 std<1%):
| n | Throughput | Baseline | Speedup |
|---|---|---|---|
| 4 | 108.78 tok/s | 40.85 | 2.66× |
(Part 27's launch config is max_model_len=4096 + gpu_mem=0.85, which doesn't perfectly match my huihui large-KV attempt at 65536/0.85. Useful for direction — vanilla deep spec scales — but not a point-to-point comparison. Part 27 didn't publish token-level acceptance; its only "80%" number is the vLLM test suite's prompt-similarity gate, which is not an acceptance metric. An earlier draft of this post conflated those two and got called out.)
Vanilla still scales past 2× at n=4; huihui at n=4 is slower than huihui at n=1.
Self-correction: three fabricated numbers I smuggled in
Across two drafts of this post I inserted three numbers that I hadn't actually verified, and Codex caught all three across the /debate attack and the Step 9 fact-check pass. Worth keeping the receipts as an "LLM writes tech post without checking" reverse case study.
1. Vanilla baseline 60 tok/s (first draft, real number is 39.4)
The first draft claimed "vanilla baseline ~60 tok/s" without measuring it, then derived "huihui 39.3 vs vanilla 60 = 50% abliteration tax." When I ran /debate attack, Codex actually pulled up Part 27's source file and noted that no one had ever measured vanilla in this exact config. The 60 was vibes.
I re-ran vanilla in the same small-KV config:
| | baseline (no spec) | + MTP n=1 |
|---|---|---|
| vanilla FP8 | 39.4 tok/s | 51.9 tok/s |
| huihui FP8 | 39.3 tok/s | 52.6 tok/s |
| Δ | 0.1 (noise) | 0.7 (noise) |
The body takes no tax at all. The original "50% tax" claim was the wrong shape entirely — and that's also what triggered the rewrite from test-first to mechanism-first framing.
2. Vanilla per-position decay 92/85/78/70 (second draft, never measured)
Second draft, trying to give a fair comparison against huihui's 65/43/29/21, I inserted a "vanilla estimated 92/85/78/70" curve. That was geometric-series numerology back-derived from token-level 80% — not a measurement. Codex's second attack pass found that Part 27 itself never published per-position numbers; my "estimate" was fake precision dressed up as a measurement. Removed. A hole is better than fabricated precision.
3. Gemma 4 license "not pure Apache 2.0" (second draft, it actually is)
Second draft I also wrote "Gemma 4 is governed by the Gemma Terms of Use, not pure Apache 2.0." Codex's second pass curl'd the receipts:
- The Gemma terms page explicitly says: For Gemma 4 terms, see the Gemma 4 license
- The Gemma 4 license page is straightforwardly Apache License 2.0
- The HF model card for google/gemma-4-26B-A4B-it declares license: apache-2.0 in its frontmatter
Gemma 1/2/3 use the ToU; Gemma 4 switched to Apache 2.0. The HF release using apache-2.0 is correct.
So where did Part 27's 108 tok/s come from?
Lining up the three configs shows the gap isn't a pure abliteration tax:
- Part 27's launch config: max_model_len=4096 + gpu_memory_utilization=0.85 + num_speculative_tokens=4
- My small-KV control bench (vanilla + huihui at n=1): max_model_len=8192 + gpu_mem=0.65 → ~52 tok/s for both
- My huihui large-KV attempt: max_model_len=65536 + gpu_mem=0.85 + n=4 → 50 tok/s
Vanilla can pull >2× speedup at deep speculation (52 → 108); huihui essentially can't (52 → 50). The variable that matters here is num_speculative_tokens, not KV budget.
Evidence 2: per-position decay pins the mechanism to numbers
The "draft layers another mismatch on top of the previous one" claim from the opening section, read directly from vLLM's vllm:spec_decode_num_accepted_tokens_per_pos_total counter, looks like this:
huihui per-position (n=4 large KV, real measurement):
pos 0: 65.6%
pos 1: 43.3% ← drop 22.3pp
pos 2: 29.2% ← drop 14.1pp
pos 3: 20.5% ← drop 8.7pp
Each position deeper, the draft is stacking another guess on top of its own previous (possibly wrong) prediction, and acceptance drops another 22.3, then 14.1, then 8.7 percentage points. By pos 3 the draft is essentially guessing — 80% of its proposals get rejected and the body falls back to its own forward output.
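A back-of-the-envelope check on why this decay kills the n=4 payoff (treating the per-position rates as independent contributions to expected tokens per verification step is a simplification, not a measurement):
per_pos = [0.656, 0.433, 0.292, 0.205]  # measured huihui n=4 acceptance, from above
tokens_per_step_n4 = 1 + sum(per_pos)    # body's own token + expected accepted drafts, ~2.59
tokens_per_step_n1 = 1 + 0.69            # n=1, 69% acceptance from the sweep table, ~1.69
print(round(tokens_per_step_n4, 2), round(tokens_per_step_n1, 2),
      round(tokens_per_step_n4 / tokens_per_step_n1, 2))
# Only ~1.5x more tokens per step at n=4, while each step also pays for four draft
# forwards plus a wider verify, which is roughly why measured throughput comes out
# flat-to-worse (50.0 vs 52.6 tok/s) instead of 1.5x faster.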
I do not have measured vanilla per-position numbers. Part 27 didn't publish them either. The most I can claim is "vanilla scaling to 2.66× baseline at n=4 implies its average token-level acceptance must be much higher (or the compounded payoff couldn't be that large)." I deliberately don't make a paired comparison here (see the self-correction section above for why).
Recommendation for hikari/kiriha-style use cases
For a hikari + kiriha style stack running speculative decoding on this model:
| Workload | Recommendation |
|---|---|
| Chat / brainstorm (short replies, latency-sensitive) | n=1 — both vanilla and huihui at ~52 tok/s |
| Long generation (throughput matters) | vanilla n=4 → 108 tok/s (~2× faster), but you give up abliteration |
| Abliterated + don't care about deep spec | huihui FP8 n=1, 52 tok/s, uncensored |
For Traditional Chinese-heavy tasks I'd still keep Qwen 3.6 abliterated as the primary (75% TMMLU+ vs Gemma 4's 46%). This Gemma 4 abliterated FP8 is more useful for English brainstorming and image-gen prompt writing.
Next: abliteration tuned for spec-decode acceptance
This sweep points at a research direction I haven't seen anyone take: all abliteration variants today optimize for "lowest refusal rate" — nobody's optimizing for "highest MTP acceptance." huihui's abliteration is heavy (their own README admits "crude POC") and decays fast per position; the p-e-w/Heretic + ARA family (used by llmfan46) is lighter but still hasn't been tuned with spec decode in mind.
Plan: run Heretic + Optuna against vanilla google/gemma-4-26B-A4B-it, target KLD ~0.025 (about half of llmfan46's 0.0468), conservative hyperparameters. Watching trial 2 of the actual run showed that Heretic's ARA defaults to modifying both attn.o_proj and mlp.down_proj (I had previously assumed only attn.o_proj — the trial output corrected that), so the impact on MoE expert paths needs to be measured rather than predicted.
This is hypothesis, not commitment. The only thing I can reasonably bet on is that body baseline won't degrade (already verified abliteration doesn't hurt body throughput). pos-0 acceptance, decay slope, and n=4 throughput might improve, stay flat, or get worse — Heretic ARA has never been tuned with "maximize MTP acceptance" as the objective.
If it works and decay flattens enough that n=4 actually pays off, I'll ship a niche release plus a Part B post. If it doesn't, I'll write up "this path didn't work" — failure with this much instrumentation is informative either way.
Usage
pip install vllm # Note: Gemma 4 MTP needs vLLM PR #41745 or later — build from main, or use the preview image below
# Recommended: huihui FP8 + vanilla MTP at n=1 (+34% over baseline)
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
--speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":1}' \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.65 \
--max-model-len 8192 \
--limit-mm-per-prompt '{"image":0,"audio":0,"video":0}' \
--enable-auto-tool-choice --tool-call-parser gemma4 \
--trust-remote-code
MTP gotcha: Gemma 4 MTP support landed in vLLM PR #41745 (merged 2026-05-06). I run eugr/spark-vllm-docker's vllm/vllm-openai:gemma4-0505-arm64-cu130 preview image and bind-mount the PR head's gemma4_mtp.py into the container to overwrite the older copy baked into the image. (That bind-mount is a local mod I keep in my own mods/ directory — it's not pushed upstream yet.) The trap I hit: using docker exec to start vLLM bypasses the recipe runner, which silently drops the bind-mount, which produces 0% spec-decode acceptance. Sanity check: after the container starts, md5sum gemma4_mtp.py against the PR head before benching anything.
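The sanity check is easy to script. A sketch; the container name and the in-container path are placeholders for whatever your compose/recipe setup uses:
import hashlib
import subprocess
from pathlib import Path

LOCAL = Path("mods/gemma4_mtp.py")                  # PR-head copy kept locally
CONTAINER = "vllm-gemma4"                           # placeholder container name
REMOTE = "/path/inside/container/to/gemma4_mtp.py"  # placeholder install path

local_md5 = hashlib.md5(LOCAL.read_bytes()).hexdigest()
remote_md5 = subprocess.run(
    ["docker", "exec", CONTAINER, "md5sum", REMOTE],
    capture_output=True, text=True, check=True,
).stdout.split()[0]

if local_md5 == remote_md5:
    print("bind-mount OK, safe to bench")
else:
    print("MISMATCH: the container is running the stale gemma4_mtp.py (expect 0% acceptance)")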
Related
- HF release: coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic
- Upstream base: huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated
- MTP draft model: google/gemma-4-26B-A4B-it-assistant
- vLLM PR: #41745 (Gemma 4 MTP integration)
- Quantization tool: vllm-project/llm-compressor
- Reference: Part 27 — Gemma 4 26B + MTP at 108 tok/s
FAQ
- Does abliteration cost throughput on GB10?
- It depends entirely on `num_speculative_tokens`. With the same FP8 stack and the same launch config, the abliterated body's baseline (no spec) is 39.3 tok/s — vanilla measured at 39.4 tok/s. They're identical within noise; the body itself takes no tax. With MTP at n=1, vanilla hits 51.9 tok/s and abliterated 52.6 tok/s — also a wash. The gap only opens up with deep speculation: at n=4 in a large-KV config, vanilla scales to 108 tok/s (Part 27's number) while abliterated only reaches 50 tok/s — a 54% relative slowdown. The cost is conditional on speculation depth, not on the body.
- Why does abliterated Gemma 4 fail to scale with deeper speculative decoding?
- Per-position acceptance decay. Read directly from vLLM `/metrics`, abliterated huihui at n=4 shows acceptance dropping ~22pp per position: 65.6% at pos 0, 43.3% at pos 1, 29.2% at pos 2, 20.5% at pos 3. The compounded effect cuts the effective speedup in half. The MTP draft model `google/gemma-4-26B-A4B-it-assistant` was trained against vanilla's prediction distribution; once the body is abliterated, the drafter's auto-regressive guess for position 1+ stops matching what the verifier actually wants. I did not measure vanilla's per-position curve, so I can't give a paired comparison — earlier drafts of this post invented one and got caught.
- Is this the first vLLM-loadable FP8 quant of this abliterated base?
- As of 2026-05-09, I couldn't find another FP8 derivative of `huihui-ai/Huihui-gemma-4-26B-A4B-it-abliterated` on Hugging Face. GGUF (multiple), MLX, NVFP4, and Q8_0 quants exist; FP8-Dynamic for vLLM was the gap I happened to fill. Saying 'first' would be a stronger claim than I can prove from a search snapshot, so I won't.
- What's the recommended config for production use of huihui FP8?
- If you want speculative decoding, run `num_speculative_tokens=1`. Same throughput as vanilla MTP n=1 (~52 tok/s), 69% pos-0 acceptance, no abliteration penalty. Don't go deeper — n=2/3/4 monotonically lose throughput on this body. If you don't care about spec decode, baseline is 39 tok/s either way. For higher-tok/s workloads on Traditional Chinese, the original Qwen 3.6 abliterated FP8 is still the better target.