How does Round 2 differ from v1?

Data: v1 used 50K Magpie EN instructions with responses regenerated through the huihui body. Round 2 used 50K EN + 30K ZH (sampled from Magpie-Qwen2-Pro-200K-Chinese, responses regenerated through the same body). Training: 1 epoch, ttt_steps=3, same hyperparameters as v1. A Train C run (ttt_steps=4) was planned but skipped during wrap-up.

What chat numbers did Round 2 actually deliver?

Round 2 B drafter at n=4 on chat: EN 45 tok/s with pos-0 56%, ZH 29 tok/s with pos-0 15%. The control point is vanilla MTP n=4 on chat: EN 53 tok/s / pos-0 71%, ZH 45 tok/s / pos-0 57%. Round 2 lost — EN by ~8 tok/s and ZH by ~16 tok/s.

Why did 30k Chinese samples barely move the ZH numbers (+2 tok/s)?

EAGLE-3 small head architecture appears to top out around pos-0 ~60% acceptance on chat, regardless of training data. Vanilla MTP `gemma4-26b-a4b-it-assistant` is essentially a full Gemma layer trained by Google as a drafter — it's a much bigger drafter than our EAGLE-3 head. That structural size gap isn't something more training data can close.

Why didn't you run Train C (ttt_steps=4)?

Train B took 41h, and Train C was projected at another ~30h, with daily inference offline the whole time. The Round 2 B result already pointed clearly at an architectural ceiling, and bumping ttt_steps from 3 to 4 isn't plausibly going to add the 8-16 tok/s needed to cross vanilla MTP's line. The checkpoint config is documented; anyone curious can run it. We chose to stop here and write up the result.

What's the vLLM scheduler deadlock in the Gemma 4 preview image?

Under sustained long-run concurrent `extract_hidden_states` workloads (trainer continuously querying vLLM for hidden states), the vLLM Gemma 4 preview image (`vllm/vllm-openai:gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`, pushed 2026-05-05) sometimes enters a state where the engine logs `Running: N reqs, generation throughput: 0.0` and never recovers. We hit this pattern three times — twice during ZH regen and once mid-training. The mitigation is a small watchdog that watches the docker logs for `throughput=0.0 + running>0` sustained beyond 3 minutes, then docker stop + relaunch. It's worth opening an upstream issue once the minimum reproducer is narrowed down.

~/blog/dgx-spark-eagle3-round2-null-result

DGX Spark · part 31

Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup

2026-05-2110 min read#gemma-4 #abliteration #eagle-3 #speculative-decoding 中文版

❯ cat --toc

TL;DR
Where Round 2 started
Plan vs. actual
Final paired bench (2026-05-20, chat completions, paired EN/ZH)
Why drafter size dominates
Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image
Mitigation: a small watchdog
Recommendations for readers
HF repo status
Pivot: from training to harness
What the whole series looks like
Related

TL;DR

Round 2 verdict: the ceiling didn't move.

After Part 30's endpoint correction made clear that Round 1's "2x speedup" framing was a measurement artifact on the raw endpoint, Round 2 was supposed to test whether adding 30K Chinese instructions (originally 50K — vLLM scheduler hung twice and only 30K samples cleared) plus body-regenerated responses could let an EAGLE-3 fine-tune match or beat vanilla MTP n=4 on chat workloads. Train B ran for 41 hours.

The result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45 tok/s. The planned Train C (ttt_steps=4) is shelved.

Takeaways:

✓ Confirmed an architectural ceiling on EAGLE-3 small-head drafters against an abliterated body — a bigger drafter beats more training data at this scale
✓ Production recipe: vanilla MTP gemma4-26b-a4b-it-assistant + num_speculative_tokens=4. No retrain needed
✓ Found a vLLM 0.20.2 scheduler deadlock in long-running extract_hidden_states (three hits + watchdog mitigation)
✓ The bigger meta-lesson: a 41h-per-experiment training cadence on a single GB10 doesn't keep up with research iteration. The leverage for this series is in measurement harness, not more training. Round 3 isn't on the roadmap; the remaining bandwidth goes into paired-bench tooling, watchdogs, and quick experiments

TL;DR

Goal: Round 2 tried to close the chat-workload gap between our retrained EAGLE-3 drafter and vanilla MTP n=4 (EN 53 / ZH 45 tok/s) by adding Chinese instruction data and body-regenerated responses
Result: Train B (80K samples — 50K EN + 30K ZH, ttt=3) completed; inference delivers chat EN 45 / ZH 29 tok/s — essentially identical to v1. Train C (ttt=4) is shelved
Cause: EAGLE-3 small head sits below vanilla MTP gemma4-26b-a4b-it-assistant, which is a full Gemma layer. More training data doesn't close a structural drafter-size gap at this scale
Side artifact: A vLLM 0.20.2 scheduler deadlock in long-running concurrent extract_hidden_states use (three hits + a watchdog mitigation), worth opening an upstream issue once the reproducer is narrowed

Where Round 2 started

Two days after Part 30 published, we re-ran the bench and discovered the original throughput script was calling /v1/completions (raw) while the Part 28 baseline was /v1/chat/completions — the headline "2x speedup" only existed on the raw endpoint. On production chat workloads, the v1 retrained drafter was only ~15% faster than the pure body. The full errata is in Part 30.

That left an open question: was v1 underperforming vanilla MTP n=4 (chat EN 46 vs 53, ZH 27 vs 45) because of insufficient training data (especially Chinese OOD), or because the EAGLE-3 small head is structurally weaker than the full Gemma layer used by vanilla MTP?

Round 2 was designed to disambiguate. Add Chinese, train more, and see what happens. A clear improvement would have indicted data. A flat result would point at architecture.

Plan vs. actual

Stage	Planned	Actual
ZH dataset	Magpie-Qwen2-Pro-200K-Chinese	✓ Downloaded, 462 MB / 200K samples
ZH response regeneration through huihui body	50K	30K. vLLM scheduler hung twice during the regen (part 1: 25K successful, part 2: ~5K additional successes)
Train B (EN 50K + ZH 50K, ttt=3)	~20h	41h — including 6h44m of validation, plus a stretch of ~1000 "empty sample fallback" steps during a mid-training hsext hang
Train C (EN 50K + ZH 50K, ttt=4)	~30h	Shelved. Train B's signal was clear enough; the user called it during wrap-up

Final paired bench (2026-05-20, chat completions, paired EN/ZH)

Same vLLM container, same prompt set, max_tokens=200, T=0.7, batch=1. All numbers are from /v1/chat/completions, matching the Part 30 endpoint correction methodology.

Config	EN chat tok/s	EN pos-0 acc	EN acc/draft	ZH chat tok/s	ZH pos-0 acc	ZH acc/draft
Pure body Gemma 4 huihui FP8 (no spec)	40	—	—	~22	—	—
Vanilla MTP n=1 (`gemma4-it-assistant`)	51	70.6%	0.71/1	—	—	—
Vanilla MTP n=4 (`gemma4-it-assistant`)	53	71%	1.81/4	45	57%	1.27/4
v1 retrained EAGLE-3 n=4 (Part 30 ship)	46	57%	1.09/4	27	12%	0.20/4
Round 2 B retrained EAGLE-3 n=4	45	56%	1.04/4	29	15%	0.22/4
Qwen 3.6 abliterated FP8 (no spec, ref)	50	—	—	50	—	—

How to read it:

Round 2 B vs v1: statistical tie. EN one tok/s slower, ZH two tok/s faster. The 30K Chinese samples produced essentially zero throughput improvement
Round 2 B vs vanilla MTP n=4: loses by 8 tok/s on EN, 16 tok/s on ZH. Not close
Both EAGLE-3 drafters' chat pos-0 acceptance sit around 57%, while vanilla MTP holds 71%. That 14 pp acceptance gap is structural, not something more training data can patch

Why drafter size dominates

The speculative-decoding speedup formula, roughly:

speedup ≈ 1 + (acc_per_draft × draft_token_cost_ratio)

In chat workloads, with num_speculative_tokens=4:

Vanilla MTP n=4: 1.81 accepted draft tokens per pass on average (out of a max of 4) → 1 + 1.81 = 2.81 tokens produced per verify pass
Round 2 B n=4: 1.04 accepted draft tokens per pass on average → 1 + 1.04 = 2.04 tokens per verify pass

The gap lives in the deeper positions: vanilla MTP's pos-1/2/3 acceptance (49/35/26%) is roughly double our EAGLE-3 head's (28/14/7%). At depth, the bigger drafter holds onto more of the body's distribution; the smaller EAGLE-3 head can't keep up past pos-0.

This is the same mechanism Part 28 identified — deep speculation acceptance scatters on an abliterated body — except Round 2 makes clear that a smaller drafter can't rescue it.

Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image

Three times across Round 2, the vLLM Gemma 4 preview image (vllm/vllm-openai:gemma4-0505-arm64-cu130, internal build 0.20.2rc1.dev49+g9b4e83934 — pushed 2026-05-05, predating the v0.20.2 release tag) got stuck under sustained concurrent extract_hidden_states use:

When	Where	Symptom
ZH regen part 1 (2026-05-18 ~02:30)	concurrency=32 after ~6h	Engine logs `generation throughput: 0.0, Running: 31 reqs`; `GET /v1/models` still 200 OK
ZH regen part 2 (2026-05-18 ~05:00)	concurrency=16 after ~5h	Same pattern
Train B mid-run (2026-05-19 ~05:00)	Trainer querying hsext continuously for 14h+	Trainer hit ~1000 `create_empty_sample` fallback steps (zero gradient, no model update)

KV cache stayed at ~1% during all three hangs — this isn't a memory issue, it's a scheduler deadlock.

Mitigation: a small watchdog

A ~30 line shell watchdog handles it:

# Watch docker logs for "Avg generation throughput: 0.0" + "Running: N>0"
# sustained 3 min → docker stop + relaunch via the hsext script

It ran reliably through the second half of Train B without further full-on hangs. The script lives in /tmp/hsext_watchdog.sh on our box; we'll consolidate it into the series reference appendix.

Worth opening upstream, but the minimum reproducer is currently entangled with speculators training + extract_hidden_states + a specific shared-storage path. We'll keep narrowing it on future experiments.

Recommendations for readers

Your scenario	What to do
abliterated Gemma 4 on production chat workload	Vanilla MTP `gemma4-26b-a4b-it-assistant` + `num_speculative_tokens=4`. Chat EN 53 / ZH 45 tok/s, no training needed
Want inference speedup but don't need abliteration	Run vanilla `gemma-4-26B-A4B-it` + MTP n=4 directly → ~108 tok/s (Part 27)
Still want to fine-tune your own drafter	Expect the EAGLE-3 small-head ceiling. Realistic gain over a vanilla MTP baseline is ~10%, not ~100%
Running speculators with extract_hidden_states	Add a watchdog (ours, or write your own)

HF repo status

v1 drafter coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft: README already carries the 2026-05-17 endpoint correction
Round 2 B drafter: not publishing a separate repo. Chat numbers are statistically indistinguishable from v1, and putting another drafter on HF with the same caveats would just confuse downloads. The internal checkpoint at /home/coolthor/data/eagle3_round2_B/0/ stays; ask if you want a copy

Pivot: from training to harness

The most important meta-lesson from Round 2 is about cadence: a single GB10 is too slow for training-driven research iteration. One ttt-step variable takes ~41h to run end-to-end (plus a 5% loss to the hsext deadlock). Round 1 was 11h on 50K samples. The cycle from "change one variable" to "see the answer" is two days minimum, which doesn't keep up with anything that's actually iterating.

Compare that to measurement-level leverage we've already gotten in this same series:

Paired chat bench harness (the one this article uses) — 15 minutes to produce the full comparison table
Watchdog — 30 lines of shell that fixed a production-blocking upstream issue
Endpoint methodology audit (the Part 30 errata) — 30 minutes to find that the original 2x speedup claim was a measurement bug

Round 3 is not on the roadmap. The remaining bandwidth in this thread goes into faster-iteration infrastructure:

Quick refusal-rate experiment: sample ~500 prompts from a month of hikari/kiriha traffic, run them through vanilla Gemma 4, count hard refusals. Decides whether the abliteration -50% throughput tax is actually worth paying. 1-2 hours of work vs 30h of Train C — 30x leverage
Different base model evaluation: Just measured Qwen 3.6 abliterated MoE 35B-A3B chat throughput at EN 50 / ZH 50 tok/s (no spec decode, consistent across languages), which is a tie with Gemma 4 MTP n=4 (53/45) — not a win on throughput. ⚠️ This post originally cited ~91 tok/s, but that was the theoretical bandwidth ceiling (3B active × FP8 ÷ 273 GB/s), not a measured number — corrected. The real reasons to consider switching the Hermes sibs to Qwen 3.6 are better Chinese quality (TMMLU+ 75% vs Gemma 4's 46%), cross-language consistency (no EN-vs-ZH drop), and simpler vLLM config (no spec decode tuning) — not raw throughput
Productize the bench harness: turn this paired bench into a reusable skill, so the next drafter or model swap takes 15 minutes rather than half a day
vLLM upstream issue: narrow the scheduler deadlock reproducer and file it

Train C (ttt=4): at the projected +3-5 tok/s, it doesn't close the gap to vanilla MTP 53/45. The checkpoint config is documented; anyone who wants to verify ttt scaling can run it themselves. We're not spending 30h of GPU time on it.

What the whole series looks like

Part	Subject	Conclusion
Part 28	Mechanism — vanilla draft can't track an abliterated body at depth	Acceptance scatters past pos-0 — structural problem
Part 29	Deploy recipe — n=1, +34% out of the box	n=1 is the safe sweet spot; deeper doesn't pay off
Part 30	Round 1 retrain + endpoint errata	Acceptance flattens, but throughput doesn't 2x on chat (measurement bug)
Part 31 (this post)	Round 2 retrain — null result + pivot to harness	Vanilla MTP n=4 is the sweet spot. Single-GB10 training cycle is too slow; the leverage is in measurement, not more training

Four parts together: we hit a wall and learned the real trade-offs of abliterated body + spec decode, and we learned that the GB10 training loop is the wrong tool for fast iteration. The honest writeup is worth more to the community than another claimed breakthrough — it saves the next person from repeating the same 60+ hours.

Part 30 — Round 1 retrain + endpoint errata
Part 29 — n=1 deploy recipe
Part 28 — Mechanism
Part 27 — Vanilla Gemma 4 + MTP at 108 tok/s baseline
HF drafter: coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft (v1 is the only published version; Round 2 B is not separately released)

FAQ

How does Round 2 differ from v1?: Data: v1 used 50K Magpie EN instructions with responses regenerated through the huihui body. Round 2 used 50K EN + 30K ZH (sampled from Magpie-Qwen2-Pro-200K-Chinese, responses regenerated through the same body). Training: 1 epoch, ttt_steps=3, same hyperparameters as v1. A Train C run (ttt_steps=4) was planned but skipped during wrap-up.
What chat numbers did Round 2 actually deliver?: Round 2 B drafter at n=4 on chat: EN 45 tok/s with pos-0 56%, ZH 29 tok/s with pos-0 15%. The control point is vanilla MTP n=4 on chat: EN 53 tok/s / pos-0 71%, ZH 45 tok/s / pos-0 57%. Round 2 lost — EN by ~8 tok/s and ZH by ~16 tok/s.
Why did 30k Chinese samples barely move the ZH numbers (+2 tok/s)?: EAGLE-3 small head architecture appears to top out around pos-0 ~60% acceptance on chat, regardless of training data. Vanilla MTP `gemma4-26b-a4b-it-assistant` is essentially a full Gemma layer trained by Google as a drafter — it's a much bigger drafter than our EAGLE-3 head. That structural size gap isn't something more training data can close.
Why didn't you run Train C (ttt_steps=4)?: Train B took 41h, and Train C was projected at another ~30h, with daily inference offline the whole time. The Round 2 B result already pointed clearly at an architectural ceiling, and bumping ttt_steps from 3 to 4 isn't plausibly going to add the 8-16 tok/s needed to cross vanilla MTP's line. The checkpoint config is documented; anyone curious can run it. We chose to stop here and write up the result.
What's the vLLM scheduler deadlock in the Gemma 4 preview image?: Under sustained long-run concurrent `extract_hidden_states` workloads (trainer continuously querying vLLM for hidden states), the vLLM Gemma 4 preview image (`vllm/vllm-openai:gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`, pushed 2026-05-05) sometimes enters a state where the engine logs `Running: N reqs, generation throughput: 0.0` and never recovers. We hit this pattern three times — twice during ZH regen and once mid-training. The mitigation is a small watchdog that watches the docker logs for `throughput=0.0 + running>0` sustained beyond 3 minutes, then docker stop + relaunch. It's worth opening an upstream issue once the minimum reproducer is narrowed down.

Don't miss the next one

Subscribe, and you won't.

One-click unsubscribe anytime.

← back to blog

TL;DR

Where Round 2 started

Plan vs. actual

Final paired bench (2026-05-20, chat completions, paired EN/ZH)

Why drafter size dominates

Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image

Mitigation: a small watchdog

Recommendations for readers

HF repo status

Pivot: from training to harness

What the whole series looks like

Related

FAQ

Read next

Don't miss the next one