Round 2 跟 v1 差異在哪?

Data:v1 用 50K Magpie EN instruction + huihui body 重生 response;Round 2 用 50K EN + 30K ZH(Magpie-Qwen2-Pro-200K-Chinese 抽樣 + huihui body 重生)。Training:1 epoch,ttt_steps=3,跟 v1 同 hyperparams。原本還規劃 Train C(ttt=4)但收尾不跑。

Round 2 chat 數字到底多少?

Round 2 B drafter @ n=4 chat:EN 45 tok/s pos-0 56%,ZH 29 tok/s pos-0 15%。對照組 vanilla MTP n=4 chat EN 53 tok/s pos-0 71% / ZH 45 tok/s pos-0 57%。Round 2 沒贏,EN 還微輸 vanilla MTP 8 tok/s。

為什麼加 30K 中文 data 對 ZH 改善這麼小(+2 tok/s)?

EAGLE-3 small head 架構在 chat 上的天花板大概在 pos-0 60%,加更多 data 也很難破。Vanilla MTP `gemma4-26b-a4b-it-assistant` 是 Google 預訓的「full Gemma layer」big drafter,結構就比我們的 EAGLE-3 head 大很多 — 那是物理性差距,training data 量補不回來。

Train C(ttt=4)為什麼沒跑?

Train B 訓 41h + Train C 還要再 30h = daily 斷 70h 以上。Round 2 B 跑完已經很明顯:就是架構天花板。就算 Train C 把 ttt=2/3 那些長尾改善撈回來,也很難贏過 vanilla MTP 53/45。這輪先收在這裡,ttt=4 之後有資源再跑。

vLLM Gemma 4 preview image 的 scheduler deadlock 是什麼?

在 long-run extract_hidden_states 用法下(trainer 持續 query vLLM 拿 hidden states),vLLM Gemma 4 preview image(`vllm/vllm-openai:gemma4-0505-arm64-cu130`,內部 build `0.20.2rc1.dev49+g9b4e83934`)的 scheduler 偶爾會進入「Running: N reqs, generation throughput: 0.0」永久 hang 狀態 — 同 deadlock pattern 在 ZH regen 兩次,在 train 中段一次。Workaround:寫 watchdog 監測 `throughput=0.0 + running>0` 持續 3 分鐘自動 docker stop + relaunch。值得提 upstream issue,但 minimum reproducer 還在收斂。

~/blog/dgx-spark-eagle3-round2-null-result

DGX Spark · part 31

Round 2 EAGLE-3 retrain 沒打破天花板 — 60 小時訓練的 null result + 教訓

2026-05-219 分鐘閱讀#gemma-4 #abliteration #eagle-3 #speculative-decoding English

❯ cat --toc

TL;DR
Round 2 出發點
計畫 vs 實際
Final paired bench(2026-05-20,chat completions paired EN/ZH)
為什麼架構大小決定一切
副產品:vLLM Gemma 4 preview image 的 scheduler deadlock
Workaround:watchdog
給讀者的建議
HF repo 狀態
Pivot:從 training 轉到 harness
整個系列的 narrative arc
相關

TL;DR

Round 2 結論:沒打破天花板。

Part 30 的 endpoint correction 之後,我們知道 v1 retrain 在 chat workload 上沒有 2× speedup。Round 2 加 30K 中文 instruction(原計畫 50K,vLLM scheduler 卡死兩次只拿到 30K)+ huihui body 重生 response,Train B 跑 41 小時。

結果:Round 2 B drafter chat EN 45 tok/s / ZH 29 tok/s,跟 v1 基本相同(EN 46 / ZH 27),遠輸 vanilla MTP n=4 chat 的 EN 53 / ZH 45 tok/s。原計畫 Train C(ttt=4)收尾不跑,Round 2 整體宣告 null result。

收穫:

✓ 確認 EAGLE-3 small head 在 abliterated body 上的架構天花板(big drafter > more data,在這個 scale)
✓ Production recipe:vanilla MTP gemma4-26b-a4b-it-assistant + num_speculative_tokens=4,不用 retrain
✓ 找到 vLLM 0.20.2 scheduler deadlock in long-run hsext(三次踩到 + watchdog 補完)
✓ 更重要的教訓:單卡 GB10 訓練太慢(一個變因要 41 小時)。比起繼續加訓練,該把時間花在 paired bench、watchdog、refusal-rate 這種回饋快的工具上。Round 3 不開,後續資源轉到這類短迴圈 infra

TL;DR

目標:Round 2 試圖透過加中文 instruction data + huihui body 重生 response,訓出比 v1 強的 EAGLE-3 drafter,目標 chat workload 上 match / beat vanilla MTP n=4(EN 53 / ZH 45 tok/s)
結果:Train B(80K samples: 50K EN + 30K ZH,ttt=3)完成,inference 數字 chat EN 45 / ZH 29 tok/s = 跟 v1 平手,沒贏 vanilla MTP。Train C(ttt=4)不跑收尾
真兇:EAGLE-3 small head 架構天花板。Vanilla MTP gemma4-26b-a4b-it-assistant 是 full Gemma layer 大 drafter,我們的 EAGLE-3 head 在 chat workload 上沒辦法靠 more data 跨越這個結構差距
副產品:vLLM 0.20.2 scheduler 在 long-run extract_hidden_states 用法下會卡死(三個事件 + 寫了 watchdog),值得提 upstream issue

Round 2 出發點

Part 30 publish 兩天後,我們在重 bench 時發現原本 bench script 用 /v1/completions raw endpoint,跟 Part 28 baseline 用的 /v1/chat/completions 不一致 — 文章原本宣告的「2× speedup」實際只在 raw endpoint 上成立,production chat workload 上 v1 retrained drafter 只比 pure body 快 ~15%。完整 errata + 重新對照表都加在 Part 30 文首。

那篇修完後留下一個 open question:v1 沒贏 vanilla MTP n=4(chat EN 46 vs 53,ZH 27 vs 45)是因為 training data 不夠(尤其中文 OOD),還是因為 EAGLE-3 small head 架構就是輸給 full Gemma layer?

Round 2 設計來分離這兩個假設:加中文 + 更多 data,如果 chat 數字接近或超過 vanilla MTP → 是 data 問題。如果仍打不過 → 是架構問題。

計畫 vs 實際

階段	計畫	實際
ZH dataset 來源	Magpie-Qwen2-Pro-200K-Chinese	✓ 下載 462 MB / 200K 樣本
ZH response regen(用 huihui body)	50K	30K(vLLM scheduler 卡死兩次,part 1 跑出 25K ok / part 2 補 5K ok)
Train B(EN 50K + ZH 50K,ttt=3)	~20h	41h(含 6h44m validation;step 5000-6000 期間 hsext hang ~1000 步空轉)
Train C(EN 50K + ZH 50K,ttt=4)	~30h	不跑(B 結果已指向架構天花板,user 決定收尾)

Final paired bench(2026-05-20,chat completions paired EN/ZH)

同 vLLM container、同 prompt set、同 max_tokens=200 T=0.7 batch=1。全部走 /v1/chat/completions chat template,跟 Part 30 endpoint correction 同個 methodology。

Config	EN chat tok/s	EN pos-0 acc	EN acc/draft	ZH chat tok/s	ZH pos-0 acc	ZH acc/draft
Pure body Gemma 4 huihui FP8 (no spec)	40	—	—	~22	—	—
Vanilla MTP n=1 (`gemma4-it-assistant`)	51	70.6%	0.71/1	—	—	—
Vanilla MTP n=4 (`gemma4-it-assistant`)	53	71%	1.81/4	45	57%	1.27/4
v1 retrained EAGLE-3 n=4(Part 30 ship 版)	46	57%	1.09/4	27	12%	0.20/4
Round 2 B retrained EAGLE-3 n=4	45	56%	1.04/4	29	15%	0.22/4
Qwen 3.6 abliterated FP8 (no spec, ref)	50	—	—	50	—	—

讀法:

Round 2 B vs v1:完全平手。EN 微輸 1 tok/s,ZH 微贏 2 tok/s。加 30K 中文 data 帶來的真實改善幾乎為零。
Round 2 B vs vanilla MTP n=4:輸 8 tok/s(EN)/ 16 tok/s(ZH)。差距明顯。
兩個 EAGLE-3 drafter(v1 / Round 2 B)的 chat pos-0 acceptance 都卡在 ~57%,vanilla MTP 在 71%。14 pp 的 acceptance gap 是架構性的,training data 量補不回來。

為什麼架構大小決定一切

Speculative decoding 的速度大致是:每次 verify forward pass 產出 (1 + 平均接受的 draft token 數) 個 token。num_speculative_tokens=4 下平均接受數最多是 4:

Vanilla MTP n=4:平均每次 pass 接受 1.81 個 draft token → 每次 verify 產出 1 + 1.81 = 2.81 tokens
Round 2 B n=4:平均每次 pass 接受 1.04 個 → 每次 verify 產出 2.04 tokens

差距核心:vanilla MTP 的 pos-1/2/3 acceptance(49/35/26%)比我們 EAGLE-3 head 的(28/14/7%)高一倍。深 speculation 上,vanilla MTP 的大架構保留了更多 distribution 資訊,EAGLE-3 small head 在 pos > 0 之後就接不住。

這跟 Part 28 講的「deep speculation acceptance scatters on abliterated body」是同一個 mechanism — 只是當時以為 retrain drafter 可以救,Round 2 確認架構太小就救不回來。

副產品:vLLM Gemma 4 preview image 的 scheduler deadlock

Round 2 走完發現 vLLM Gemma 4 preview image(vllm/vllm-openai:gemma4-0505-arm64-cu130,內部 build 是 0.20.2rc1.dev49+g9b4e83934 — 2026-05-05 push,比 v0.20.2 release 還早 5 天)在 long-run concurrent extract_hidden_states 用法下會卡死。三次踩到:

時間	情況	症狀
ZH regen part 1(2026-05-18 ~02:30)	concurrency=32 跑 6h 後	engine log `generation throughput: 0.0, Running: 31 reqs`,GET /v1/models 仍 200 OK
ZH regen part 2(2026-05-18 ~05:00)	concurrency=16 跑 5h 後	同 pattern
Train B mid-run(2026-05-19 ~05:00)	trainer 持續 query hsext 14h+ 後	trainer 收 1000 步 empty_sample fallback(gradient=0 空轉)

KV cache 用 1.1%,不是 memory 問題,是 scheduler 內部的 deadlock。

Workaround:watchdog

寫了個 30-line 的 watchdog:

# 偵測 docker logs 中 "Avg generation throughput: 0.0" + "Running: N>0" 連續 3 分鐘
# → docker stop + bash run_hsext.sh 重起

實測在 Train B 後段穩定運作,沒再卡到底。完整腳本在 /tmp/hsext_watchdog.sh(將整理成 gist 補在系列 reference)。

這個問題值得去 upstream 開 issue,但要先整理出 minimum reproducer。現在這套流程太雜(speculators trainer + extract_hidden_states + hsext path 糾在一起),還不好切。未來實驗會繼續觀察能不能把條件 narrow 下來。

給讀者的建議

你現在的情況	建議
abliterated Gemma 4 production chat workload	vanilla MTP `gemma4-26b-a4b-it-assistant` + `num_speculative_tokens=4`。Chat EN 53 / ZH 45 tok/s,零訓練成本,Google 預訓 drafter
想 inference 加速但不需要 abliteration	直接用 vanilla `gemma-4-26B-A4B-it` + MTP n=4 → ~108 tok/s(Part 27 的數字)
仍想 fine-tune 自己 drafter	知道你會跑進 EAGLE-3 small head 天花板,別期待 chat 上贏過 vanilla MTP。空間是 ~10% 不是 ~100%
要跑 EAGLE-3 trainer	加 watchdog(本篇的或類似邏輯)

HF repo 狀態

v1 drafter coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft:README 已加 2026-05-17 endpoint correction
Round 2 B drafter:不另 publish 新 repo。跟 v1 chat 數字相同,沒實質差異。內部 checkpoint 留 /home/coolthor/data/eagle3_round2_B/0/,有興趣再 release。

Pivot:從 training 轉到 harness

Round 2 跑完最明顯的感覺是:單卡 GB10 訓練真的太慢。一輪 1 epoch / 80K samples 要 41 小時 + 6h44 validation,中間踩到 hsext deadlock 還會空轉 5%。Round 1 的 50K 樣本也跑 ~11 小時。改一個變因就要等兩天,節奏跟不上。

反過來看,跟量測有關的幾件事回饋快很多:

Paired chat bench harness(這篇對照表用的)— 15 分鐘出全套 baseline
Watchdog — 30 行 shell 直接修了一個 upstream bug 暴露的 production 問題
Endpoint methodology audit(Part 30 errata 抓出的)— 30 分鐘就抓到原本宣告 2× 其實是 measurement bug

Round 3 不開。後續這個系列的資源轉到回饋快的 infra 上:

Quick refusal-rate experiment:抽 hikari/kiriha 過去 1 個月 ~500 個 prompt,用 vanilla Gemma 4 跑統計真實 refuse 率。可以很快看出 abliteration -50% throughput tax 到底值不值得。1-2 小時就有答案,比再跑 30h 的 Train C 划算得多
換 base model 評估:剛實測 Qwen 3.6 abliterated MoE 35B-A3B 的 chat throughput 是 EN 50 / ZH 50 tok/s(沒 spec decode,跨語言一致),跟 Gemma 4 MTP n=4 的 53/45 平手。⚠️ 文章原本寫的 ~91 tok/s 是理論 bandwidth ceiling,不是實測 — 已更正。Qwen 3.6 的真正優勢在中文品質(TMMLU+ 75% vs Gemma 4 46%)、跨語言一致性、沒 spec decode 配置複雜度,不在 raw throughput。考慮把 hermes sib(hikari/kiriha)主力切到 Qwen 3.6 是合理的,但 justification 不該是「比較快」
把 bench harness 整理成 skill:這次 paired bench 寫成可重用的 skill,下次換新 drafter / 模型直接跑
vLLM upstream issue:整理 scheduler deadlock 的 minimum reproducer

Train C(ttt=4):預期最多再 +3-5 tok/s,跨不過 vanilla MTP 53/45 的線。Checkpoint 留著,有人想驗 ttt scaling 自己跑。我們不再花 30h 在這上面。

整個系列的 narrative arc

Part	主軸	結論
Part 28	Mechanism — vanilla draft 跟不上 abliterated body 的深 speculation	acceptance scatters at pos > 0,結構性問題
Part 29	Deploy recipe — n=1 開箱 +34%	n=1 是 safe sweet spot,deeper 不划算
Part 30	Round 1 retrain + 兩天後 errata	acceptance 拉平,但 throughput 在 chat 沒翻倍(measurement bug)
Part 31(本篇)	Round 2 retrain — null result + pivot 到 harness	vanilla MTP n=4 是 sweet spot;single GB10 訓練 cycle 太慢,leverage 在 measurement 而非 training

四篇加總:我們撞牆學到 abliterated body + spec decode 的真實 trade-off + GB10 上 training cycle 跟不上 iteration 需求。誠實寫下來給社群,省掉別人重複踩同樣 60+ 小時的時間。下篇起這個系列 pivot 到 measurement harness 跟 quick experiments。

常見問題

Round 2 跟 v1 差異在哪?: Data:v1 用 50K Magpie EN instruction + huihui body 重生 response;Round 2 用 50K EN + 30K ZH(Magpie-Qwen2-Pro-200K-Chinese 抽樣 + huihui body 重生)。Training:1 epoch,ttt_steps=3,跟 v1 同 hyperparams。原本還規劃 Train C(ttt=4)但收尾不跑。
Round 2 chat 數字到底多少?: Round 2 B drafter @ n=4 chat:EN 45 tok/s pos-0 56%,ZH 29 tok/s pos-0 15%。對照組 vanilla MTP n=4 chat EN 53 tok/s pos-0 71% / ZH 45 tok/s pos-0 57%。Round 2 沒贏,EN 還微輸 vanilla MTP 8 tok/s。
為什麼加 30K 中文 data 對 ZH 改善這麼小(+2 tok/s)?: EAGLE-3 small head 架構在 chat 上的天花板大概在 pos-0 60%,加更多 data 也很難破。Vanilla MTP `gemma4-26b-a4b-it-assistant` 是 Google 預訓的「full Gemma layer」big drafter,結構就比我們的 EAGLE-3 head 大很多 — 那是物理性差距,training data 量補不回來。
Train C(ttt=4)為什麼沒跑?: Train B 訓 41h + Train C 還要再 30h = daily 斷 70h 以上。Round 2 B 跑完已經很明顯:就是架構天花板。就算 Train C 把 ttt=2/3 那些長尾改善撈回來,也很難贏過 vanilla MTP 53/45。這輪先收在這裡,ttt=4 之後有資源再跑。
vLLM Gemma 4 preview image 的 scheduler deadlock 是什麼?: 在 long-run extract_hidden_states 用法下(trainer 持續 query vLLM 拿 hidden states),vLLM Gemma 4 preview image(`vllm/vllm-openai:gemma4-0505-arm64-cu130`,內部 build `0.20.2rc1.dev49+g9b4e83934`)的 scheduler 偶爾會進入「Running: N reqs, generation throughput: 0.0」永久 hang 狀態 — 同 deadlock pattern 在 ZH regen 兩次,在 train 中段一次。Workaround:寫 watchdog 監測 `throughput=0.0 + running>0` 持續 3 分鐘自動 docker stop + relaunch。值得提 upstream issue,但 minimum reproducer 還在收斂。

接著讀

2026-05-16
Fine-tune EAGLE-3 drafter 在 abliterated Gemma 4 上 — Round 1 拉平 acceptance 曲線(+ 一個 measurement lesson)
在 DGX Spark GB10 上把 RedHatAI EAGLE-3 drafter fine-tune 對齊 huihui Gemma 4 26B-A4B abliterated FP8 body 的 distribution。1 epoch / 50k Magpie samples / 11h 訓練。Inference bench(raw `/v1/completions`)pos 3 acceptance 從 vanilla 的 20.5% → 72.7%、n=4 throughput 從 50 → 100.36 tok/s aggregate。**後續 paired bench 發現原 throughput 比較 baseline 跟 retrain 用了不同 endpoint(chat vs raw)— production chat workload 上 retrain drafter 的真實提升遠小於 2×,詳見文首 endpoint correction**。Part 28 證實的「abliterated body deep speculation acceptance 散開」這個機制觀察仍成立。順帶找到 Speculators upstream create_empty_sample dtype bug + Phase 0 整理 6 個社群 prior art。
2026-05-09
想用 MTP 加速 abliterated Gemma 4?vanilla draft 對不上被改過的 body
自量化 huihui Gemma 4 26B-A4B abliterated 成 FP8 ship 上 HF。完整 n=1..4 sweep 後發現:abliterated body 跟 vanilla baseline 完全一樣快,n=1 上 MTP 加成也一樣;但 n=4 deep speculation 上 huihui 因為 per-position decay 陡(每 step 22pp)而被 vanilla 拉開兩倍。Tax 的真實樣貌是 conditional on num_speculative_tokens,不是固定百分比。
2026-05-14
在 DGX Spark 上 30 行 docker 拿 +34%:huihui Gemma 4 FP8 + vanilla MTP n=1 部署 recipe
Part 28 是 mechanism,這篇是 recipe:abliterated Gemma 4 26B-A4B FP8 跑在 GB10 上,搭官方 vanilla draft 開 num_speculative_tokens=1,baseline 39.3 → 52.6 tok/s (+34%),不用重訓 drafter。30 行 docker run + bind-mount PR #41745 head 的 gemma4_mtp.py 就能拿到。包含 sanity check 跟什麼時候 n=1 不夠用的判斷。
2026-05-06
火箭起飛:Gemma 4 在 DGX Spark 跑出 670 tok/s 總吞吐(單流 108 tok/s)
Google 2026-05-05 發 Multi-Token Prediction drafter,vLLM PR 同日開、官方 preview docker 同日有。DGX Spark 上實測 Gemma 4 26B-A4B-it FP8 + MTP γ=4:單流 108 tok/s(2.66× baseline)、8 路並行 674 tok/s 總吞吐。一個沒寫進文件的雷:drafter 不能配 base model,要配 -it。

ShareReddit LinkedIn X Facebook

不想錯過新文章?

訂閱我確保不漏接!

隨時一鍵退訂。

← 返回文章列表

TL;DR

Round 2 出發點

計畫 vs 實際

Final paired bench(2026-05-20,chat completions paired EN/ZH)

為什麼架構大小決定一切

副產品:vLLM Gemma 4 preview image 的 scheduler deadlock

Workaround:watchdog

給讀者的建議

HF repo 狀態

Pivot:從 training 轉到 harness

整個系列的 narrative arc

相關

常見問題

接著讀

不想錯過新文章?