How does Part 30 differ from Parts 28 and 29?

Part 28 covered the mechanism (why a vanilla draft cannot keep up with an abliterated body), Part 29 was the deploy recipe (n=1 out of the box, +34%), and Part 30 reports round 1 of the retrain attempt: we fine-tune the EAGLE-3 drafter to match the abliterated body's distribution, with the goal of unlocking deep speculation at n=2/3/4.

Did it unlock n=4?

Yes, by a wide margin. In the inference bench, pos 3 acceptance went from 20.5% with the vanilla draft to 72.7% (+52 pp), and n=4 throughput went from ~50 tok/s to 100.36 tok/s aggregate (2.01×; per-prompt mean 107.59 tok/s). The structural failure of deep speculation on an abliterated body that Part 28 documented is resolved by retraining the drafter: the steep 65→43→29→20% acceptance decay becomes a nearly flat 84→75→74→73% curve.

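Did Phase 0 find that others had already done similar work?

Yes. Six HF repos had shipped an abliterated body combined with a spec-decode drafter before Part 28 was published (guglxni, AEON-7 ×2, OptimizeLLM, huginnfork, llmfan46). However, we could not find anyone who had published the specific pairing of EAGLE-3 with the huihui Gemma 4 26B-A4B abliterated model, and we also publish per-position acceptance numbers. That is our narrow novelty.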

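What is the Speculators bug?

`create_empty_sample()` in vllm-project/speculators calls `torch.empty()` without a dtype argument, so the placeholder defaults to fp32. When a vLLM extraction request times out, the fallback uses this empty sample, and the downstream BF16 layers (fc / verifier_lm_head) hit a dtype mismatch that crashes the whole training run. Our patch changes the default to `torch.bfloat16`; an upstream PR is in preparation.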

Fine-tuning the EAGLE-3 drafter on abliterated Gemma 4: n=4 throughput doubles to 100 tok/s

TL;DR

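Part 28's "bad news" is resolved. We fine-tuned RedHatAI's pretrained EAGLE-3 drafter on huihui Gemma 4 26B-A4B abliterated FP8 for 1 epoch over 50k Magpie samples, about 11 h of training on a DGX Spark GB10. Key inference-bench numbers: pos 3 acceptance rose from 20.5% with the vanilla draft to 72.7% (+52 pp), and n=4 throughput rose from ~50 tok/s to 100.36 tok/s aggregate (107.59 per-prompt mean), roughly a 2.0× speedup. Part 28's mechanism argument (deep speculation structurally fails on an abliterated body) still holds, but the bottleneck goes away once the drafter is retrained to match the new distribution.

Side artifact: an upstream Speculators bug. create_empty_sample() creates an fp32 placeholder by default, which collides with BF16 models and occasionally crashes training. Our patch changes the default to torch.bfloat16; the PR is in preparation.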

TL;DR

Goal: Part 28 showed that deep speculation fails for the vanilla MTP draft on an abliterated body (pos 0/1/2/3 acceptance 65/43/29/20%). Part 30 tries to pull the drafter back onto the abliterated body's distribution by fine-tuning an EAGLE-3 drafter.
Result: WIN. Inference acceptance becomes 84/75/74/73% (the first-step drop shrinks from 22 pp to about 10 pp, and later steps barely decay), and n=4 throughput goes from 50 to 100.36 tok/s aggregate, about 2.0× (per-prompt mean 107.59). The drafter is shipped on HF: coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft
Side artifact: found and fixed a dtype bug in upstream Speculators' create_empty_sample; the upstream PR is in preparation.

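Phase 0 prior art (listed honestly, plus a Part 28 erratum)

Before writing this post we ran a Phase 0 prior-art search, and Codex found six HF repos that predate ours (created before Part 28 was published on 2026-05-09):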

Repo	Created	Contents
`OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4`	2026-04-11	heretic + grafted MTP, ~190 tok/s
`AEON-7/DFlash-Qwen3.5-27B-Uncensored`	2026-04-12	uncensored + external z-lab DFlash, 33.2 vs 12.2 tok/s
`guglxni/Qwen3.5-9B-abliterated-DFlash`	2026-04-15	The most direct precedent: fine-tunes a DFlash drafter on abliterated activations, exactly the mechanism described in Part 28
`AEON-7/supergemma4-26b-dflash-pilot`	2026-04-15	DFlash 5K-sample pilot, 5.79% top-1, explicitly reports a negative speedup
`huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp`	2026-04-26	heretic + MTP
`llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved`	2026-05-06	Created 3 days before Part 28 was published; our article even called it out by name as "untuned"

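✏️ Part 28 correction: Part 28 stated that "the abliteration crowd is all pushing refusal rate down, and almost nobody is tuning MTP acceptance" and that "the Heretic + ARA approach used by llmfan46 has still not been tuned for spec decode". Both statements were wrong. Phase 0 found the six repos above, and llmfan46 had shipped a Native-MTP-Preserved version 3 days before Part 28 was published. Part 30 builds on the corrected facts.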

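Our narrow novelty (confirmed by Codex):

No public EAGLE-3 + abliterated pairing exists (the others all use DFlash or native MTP)
Nobody has published per-position acceptance numbers (as far as we could find, Part 28 was the community's first, and Part 30 adds the post-retrain comparison)
An EAGLE-3 fine-tune for the huihui Gemma 4 26B-A4B family, one of the most widely published model families in vLLM circles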

Pipeline

Training stack

Component	Setting
Hardware	NVIDIA GB10 (DGX Spark), sm_12.1, 121 GB unified memory, 273 GB/s
Framework	`vllm-project/speculators` `v0.5.0.dev0`
Verifier (body)	`coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic` (the huihui base, quantized by us)
Drafter starting point	`RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3` (pretrained against the vanilla model)
Training data	Magpie 50k (instructions taken from `Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered`); responses regenerated with huihui FP8 and used as (input, output) pairs
vLLM role	Runs the `extract_hidden_states` speculative_config together with `ExampleHiddenStatesConnector` as the hidden-states producer; gpu_memory_utilization=0.5 leaves the rest for the trainer
Epochs / seq_len	1 epoch / 4096 packed

Data pipeline

response_regeneration/script.py feeds the 50k Magpie instructions to huihui FP8 to regenerate the responses (~24 h)
prepare_data.py converts the jsonl into an Arrow dataset (token_freq.pt + masked positions)
train.py --on-missing generate --on-generate delete: the trainer and vLLM run concurrently. When the trainer needs hidden states it calls vLLM, vLLM writes safetensors to a shared directory, and the trainer deletes them after reading.

Speculators upstream bug (side artifact)

At step ~9485 (83%) training crashed with a RuntimeError caused by a dtype mismatch. Root cause:

# create_empty_sample in data.py (original)
return {
    "hidden_states": torch.empty(0, 3 * hidden_size),  # ← defaults to fp32!
    "verifier_last_hidden_states": torch.empty(0, hidden_size),  # ← also fp32
    ...
}

When it triggers: a vLLM extraction request times out (15 s default), every retry fails, collate_fn falls back to create_empty_sample(), and the downstream fc() and verifier_lm_head() in eagle3/core.py expect BF16. The mismatch kills the whole training run.

Fix:

def create_empty_sample(hidden_size: int, dtype: torch.dtype = torch.bfloat16):
    return {
        "hidden_states": torch.empty(0, 3 * hidden_size, dtype=dtype),
        "input_ids": torch.empty(0, dtype=torch.long),
        "verifier_last_hidden_states": torch.empty(0, hidden_size, dtype=dtype),
        ...
    }

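PR: in preparation; the link will be added after this post is published. (The patch itself is verified: the v4 training run finished without crashing. Internally it is a 4-line change at data.py:67.)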
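To make the failure mode concrete, here is a minimal reproduction sketch. It is not the speculators code path: the toy `hidden_size`, the helper names `make_placeholder` and `passes_bf16_layer`, and the `fc` layer built here are assumptions for illustration only. It shows that a placeholder created without a dtype picks up torch's default dtype, and checks whether a BF16 linear layer accepts it.

```python
import torch

hidden_size = 8  # toy size, not the real model dimension


def make_placeholder(dtype=None):
    # Same shape as the hidden_states entry in create_empty_sample.
    # dtype=None reproduces the original call, which uses torch's default dtype.
    return torch.empty(0, 3 * hidden_size, dtype=dtype)


def passes_bf16_layer(sample):
    # Hypothetical stand-in for a BF16 projection layer such as fc().
    fc = torch.nn.Linear(3 * hidden_size, hidden_size, dtype=torch.bfloat16)
    try:
        fc(sample)
        return True
    except RuntimeError:
        # A dtype mismatch between input and weights surfaces as RuntimeError.
        return False


print("default placeholder dtype:", make_placeholder().dtype)
print("original placeholder passes:", passes_bf16_layer(make_placeholder()))
print("patched placeholder passes:", passes_bf16_layer(make_placeholder(torch.bfloat16)))
```

With torch's stock default dtype, the first placeholder should be fp32 and be rejected by the BF16 layer, while the one created with `dtype=torch.bfloat16` should pass. That is the whole point of giving the function a dtype parameter.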

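One more fragility that we did not fix but that deserves a mention: the speculators default --checkpoint-freq is measured in epochs, so a crash partway through a 1-epoch run leaves zero checkpoints. Our first crash cost 9 hours of training. It is worth opening a separate issue to add step-level checkpoints.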
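A rough back-of-the-envelope sketch of what step-level checkpoints would have saved. The numbers rest on assumptions: the total step count is inferred from step 9485 being 83% of the run, the per-step time is inferred from the ~11 h full run, and the 500-step interval is purely hypothetical (speculators does not offer it per the text above).

```python
from typing import Optional


def lost_steps(crash_step: int, checkpoint_every: Optional[int]) -> int:
    """Steps that must be redone after a crash.

    checkpoint_every=None models epoch-only checkpointing in a 1-epoch run:
    nothing has been saved yet, so every step so far is lost.
    """
    if checkpoint_every is None:
        return crash_step
    return crash_step % checkpoint_every


crash_step = 9485                        # from the text
total_steps = round(crash_step / 0.83)   # assumption: 9485 is 83% of the run
hours_per_step = 11 / total_steps        # assumption: ~11 h for the full run

for every in (None, 500):                # 500 is a hypothetical interval
    lost = lost_steps(crash_step, every)
    print(f"checkpoint_every={every}: lost {lost} steps "
          f"(~{lost * hours_per_step:.2f} h)")
```

Under these assumptions the epoch-only setting loses roughly the 9 hours reported above, while a 500-step interval would lose about half an hour.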

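Training trajectory (v4 run)

⚠️ Metric clarification: the trainer logs both full_acc_N and cond_acc_N. full_acc is the unconditional probability that the pos N prediction is correct, and corresponds to the per-position acceptance reported by vLLM's /metrics at serving time. cond_acc is the conditional probability that pos N is correct given that positions 0..N-1 were all correct. Comparisons against the Part 28 vanilla baseline (65.6 / 43.3 / 29.2 / 20.5) must use full_acc. Every acceptance figure in this post is reported as full_acc.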
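The difference between the two metrics is easiest to see on a toy example. The sketch below follows the definitions as stated in the text; the trainer's actual implementation is not shown here, and the sample data and function bodies are invented for illustration.

```python
# Each row is one draft; entry k is True if the drafter's token at pos k
# matched the target. Toy data, invented for illustration.
samples = [
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, True),
    (False, True,  True),
    (True,  False, False),
]


def full_acc(rows, pos):
    # Unconditional: fraction of all rows where position `pos` is correct.
    return sum(r[pos] for r in rows) / len(rows)


def cond_acc(rows, pos):
    # Conditional: only rows where positions 0..pos-1 are all correct count.
    eligible = [r for r in rows if all(r[:pos])]
    if not eligible:
        return float("nan")
    return sum(r[pos] for r in eligible) / len(eligible)


for pos in range(3):
    print(pos, full_acc(samples, pos), cond_acc(samples, pos))
```

On this toy data the two metrics agree at pos 0 and diverge from pos 1 onward, which is why mixing them when comparing against a baseline gives misleading deltas.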

Loss and full_acc curves (single-step samples, unsmoothed)

Step	Loss	full_acc_0	full_acc_1	full_acc_2
1k	7.51	66.8%	39.9%	25.1%
2k	6.66	69.5%	44.0%	27.9%
4k	4.74	77.0%	55.9%	42.6%
6k	7.18	64.8%	39.0%	24.8%
8k	7.39	65.7%	38.9%	24.5%
Final val (N=1266 batches)	6.94	66.8%	41.4%	26.4%

Note: the trainer defaults to ttt_steps=3, so training only covers pos 0/1/2 and there is no full_acc_3. At inference time pos 3 is an extrapolation by the drafter (benchmarked in the next section).

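Convergence observations

The loss trajectory is bouncy rather than monotonic: it starts at 7.5 at 1k, reaches a low of 4.74 at 4k, and returns to the ~7 range afterwards. Our working explanation is the cosine LR schedule: warmup ends and the LR peaks near 4k, where the model jumps around in the high-LR regime but still learns the shape of the distribution; later the LR decays into a fine-refinement phase, where single-step samples look noisier.
The training metrics look unimpressive: val full_acc_2 is only 26.4%, level with or even slightly below the Part 28 vanilla baseline of 29.2%. After training finished and I saw the val numbers, I briefly thought Part 30 would have to be written up as "NO WIN".
But training is not inference: validation is a teacher-forced argmax match against the Magpie ground truth (strict), whereas inference is rejection sampling against the body's actual distribution at T=0.7 (lenient). The inference bench in the next section is the final verdict.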
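One way to see why a strict argmax match can sit far below sampling-time acceptance is to compare the two on a single toy position. This is an illustration of the general speculative-sampling rule (accept a draft token x with probability min(1, p(x)/q(x))), not a measurement from our run; the distributions `p` and `q` are made up.

```python
def argmax_match_rate(p, q):
    # Probability that a token sampled from the target distribution p
    # equals the drafter's single most likely token under q.
    top = max(range(len(q)), key=q.__getitem__)
    return p[top]


def rejection_accept_rate(p, q):
    # A draft token x ~ q is accepted with probability min(1, p[x] / q[x]).
    # Averaged over q, the acceptance rate is sum_x min(p[x], q[x]).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


p = [0.50, 0.30, 0.20]  # toy target (body) distribution
q = [0.45, 0.35, 0.20]  # toy drafter distribution

print("argmax match:", argmax_match_rate(p, q))
print("rejection-sampling acceptance:", rejection_accept_rate(p, q))
```

In this toy case the drafter's top token matches a sampled target token only half the time, yet rejection sampling accepts about 95% of drafts, because the two distributions overlap almost entirely. A drafter can therefore look mediocre on top-1 accuracy and still be accepted often at T=0.7.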

Inference bench (the real test)

After training, we load the new drafter into vLLM (same config as Part 29, with only the draft model swapped for our fine-tuned version) and measure per-position acceptance and throughput.

Setup

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-prefix-caching --trust-remote-code

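N=10 prompts × T=0.7 × batch=1, max_tokens=200
Same vLLM container and same prompt pool; only the draft model is switched between the two groups, vanilla MTP and fine-tuned EAGLE-3
The Part 28 vanilla baseline uses the previously published numbers (same hardware, same body). We have not yet done a separate paired run, but an acceptance gap of 19 to 52 pp is far too large to be plausibly explained by prompt-set variation.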
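For anyone reproducing the sweep over speculation depth, the serve command can be generated rather than hand-edited. The sketch below only reuses flags that appear in the setup command above; the helper name `serve_command` is ours, and treating depth 0 as "omit --speculative-config entirely" is an assumption about how the no-spec baseline was run.

```python
import json
import shlex

BODY = "coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic"
DRAFT = "coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft"


def serve_command(num_spec_tokens: int) -> str:
    """Build the bench serve command for one speculation depth."""
    args = ["vllm", "serve", BODY]
    if num_spec_tokens > 0:
        spec = {
            "method": "eagle3",
            "model": DRAFT,
            "num_speculative_tokens": num_spec_tokens,
        }
        # Compact JSON, shell-quoted below so it survives as one argument.
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    args += [
        "--kv-cache-dtype", "fp8",
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", "8192",
        "--enable-prefix-caching",
        "--trust-remote-code",
    ]
    return " ".join(shlex.quote(a) for a in args)


for n in range(5):
    print(serve_command(n))
```

Generating the JSON with `json.dumps` and quoting with `shlex.quote` avoids the easy mistake of a broken quote inside the `--speculative-config` argument when changing the depth by hand.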

Per-position acceptance (key numbers)

Position	Part 28 vanilla draft	Fine-tuned drafter (this post)	Δ
pos 0	65.6%	84.4%	+18.8 pp
pos 1	43.3%	74.9%	+31.6 pp
pos 2	29.2%	74.1%	+44.9 pp
pos 3	20.5%	72.7%	+52.2 pp 🚀

On the abliterated body, the vanilla draft decays by −22.3/−14.1/−8.7 pp per step; after fine-tuning the decay is −9.5/−0.8/−1.4 pp per step, a nearly flat curve.
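The table can also be turned into an expected number of tokens emitted per verification step. This sketch assumes the per-position rates are counted the way vLLM counts them, meaning "accepted at position k" divided by the number of drafts, so that summing them gives the mean number of accepted draft tokens. If the rates were measured differently, the sum would not have this meaning.

```python
vanilla = [65.6, 43.3, 29.2, 20.5]  # Part 28 vanilla draft, pos 0..3 (%)
tuned = [84.4, 74.9, 74.1, 72.7]    # fine-tuned drafter, pos 0..3 (%)


def step_drops(acc):
    # Change in acceptance between consecutive positions, in pp.
    return [round(b - a, 1) for a, b in zip(acc, acc[1:])]


def mean_tokens_per_step(acc):
    # One token always comes from the verifier itself, plus the
    # expected number of accepted draft tokens.
    return 1 + sum(a / 100 for a in acc)


for name, acc in (("vanilla", vanilla), ("tuned", tuned)):
    print(name, step_drops(acc), round(mean_tokens_per_step(acc), 3))
```

Under that assumption the vanilla draft yields about 2.6 tokens per step at n=4 and the fine-tuned drafter about 4.1. This is not a throughput prediction on its own, since throughput also depends on the cost of running the drafter and the verifier.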

Throughput sweep

`num_spec_tokens`	Part 28 vanilla (huihui body)	Fine-tuned EAGLE-3 (this post)	Speedup
0 (no spec)	39.3 tok/s	39.3 tok/s	1.00×
1	52.6 tok/s	59.04 tok/s	1.12×
2	51.4 tok/s	66.96 tok/s	1.30×
3	46.9 tok/s	74.90 tok/s	1.60×
4	50.0 tok/s	100.36 tok/s aggregate / 107.59 per-prompt mean	~2.01×

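(Bench details: gpu_memory_utilization=0.85, max-model-len=8192, kv_cache_dtype=fp8, temperature=0.7, N=10 prompts × max_tokens=200, batch=1. Speedup is the fine-tuned throughput divided by the vanilla throughput at the same num_spec_tokens.)

For the vanilla draft, throughput drops as soon as n>1 (acceptance at deeper positions is too low, so verification overhead outweighs the speculation gain). For our fine-tuned drafter, throughput climbs steadily from n=1 to n=4, and n=4 is the sweet spot among the depths tested. This is the flip side of Part 28's mechanism argument: once the drafter matches the body's distribution, deep speculation becomes usable again.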
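The speedup column and the "best depth" claims can be rechecked directly from the throughput figures. The sketch below only uses numbers from the table (aggregate tok/s); the helper names are ours.

```python
# Aggregate throughput in tok/s, keyed by num_spec_tokens (from the table).
vanilla = {0: 39.3, 1: 52.6, 2: 51.4, 3: 46.9, 4: 50.0}
tuned = {0: 39.3, 1: 59.04, 2: 66.96, 3: 74.90, 4: 100.36}


def speedups(base, new):
    # Ratio of fine-tuned to vanilla throughput at the same depth.
    return {n: round(new[n] / base[n], 2) for n in base}


def best_depth(table):
    # Depth with the highest throughput.
    return max(table, key=table.get)


print(speedups(vanilla, tuned))
print("best vanilla depth:", best_depth(vanilla))
print("best tuned depth:", best_depth(tuned))
print("tuned n=4 vs no spec:", round(tuned[4] / tuned[0], 2))
```

The vanilla draft peaks at n=1 and the fine-tuned drafter at n=4. Relative to running with no speculation at all, the fine-tuned n=4 configuration comes out at roughly 2.55×.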

Verdict: WIN

Fine-tuning the EAGLE-3 drafter unlocks n=4 deep speculation on the abliterated body.

Inference acceptance: the vanilla draft drops steeply from 65% to 20% on the abliterated body, while our drafter stays nearly flat from 84% to 73% (+52 pp at pos 3)
Throughput @ n=4: vanilla ~50 tok/s versus our 100.36 tok/s aggregate (2.01×) / 107.59 tok/s per-prompt mean (~2.15×)
The drafter is shipped on HF (coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft), 1.86 GB of safetensors, the same concept as the vanilla MTP assistant
Part 28's mechanism argument is not overturned: the vanilla draft really does decay on the abliterated body, but retraining the drafter to match the distribution removes the bottleneck

Production recipe (daily-use config):

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.65 \
  --max-model-len 65536 \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching --trust-remote-code

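Compared with the Part 29 n=1 recipe, four flags differ: method changes from mtp to eagle3, model is swapped for our drafter, num_speculative_tokens goes from 1 to 4, and max-model-len goes from 8192 to 65536 (daily use needs to support long hermes contexts). The other vLLM flags are unchanged.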

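Round 2 plan (Part 31 preview)

With the WIN in hand, the directions for Round 2:

Cross-workload bench: the current N=10 prompts are English instruction-following. We will test several real hikari/kiriha use cases (Traditional Chinese, code, podcast summarization, image-generation prompt writing) to see whether acceptance stays as high
TurboQuant 3-bit KV cache together with EAGLE-3 (Phase 2 of the GB10 stack's 4-phase upgrade), to see whether a 4× larger KV budget brings extra gains (expectation: no noticeable effect for single-user chat; it only matters at batch ≥ 4)
Training with ttt_steps=4/5: the drafter is currently trained only on pos 0/1/2, so n=4 at inference is an extrapolation. The measured pos 3 acceptance of 72.7% is still good, but training natively through pos 3 should be more robust
DFlash comparison: check whether the guglxni route (DFlash on abliterated) runs faster on Gemma 4 (different architecture, different trade-offs)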

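Advice for readers right now

Want faster inference on abliterated Gemma 4: use this post's production recipe for n=4 at ≈ 100 tok/s aggregate (the older Part 29 n=1 recipe still works, but n=4 is the new best)
Want to fine-tune your own drafter: follow this post's pipeline, watch out for the Speculators create_empty_sample bug, and apply our patch (change the dtype default to bfloat16 at speculators/train/data.py:67)
Want to follow community-wide progress on abliterated body + spec decode: all six Phase 0 repos above are worth tracking, along with our Round 2 results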
