DGX Spark · part 31
Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup
❯ cat --toc
- TL;DR
- Where Round 2 started
- Plan vs. actual
- Final paired bench (2026-05-20, chat completions, paired EN/ZH)
- Why drafter size dominates
- Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image
- Mitigation: a small watchdog
- Recommendations for readers
- HF repo status
- Pivot: from training to harness
- What the whole series looks like
- Related
TL;DR
Round 2 verdict: the ceiling didn't move.
After Part 30's endpoint correction made clear that Round 1's "2x speedup" framing was a measurement artifact on the raw endpoint, Round 2 was supposed to test whether adding 30K Chinese instructions (originally 50K — vLLM scheduler hung twice and only 30K samples cleared) plus body-regenerated responses could let an EAGLE-3 fine-tune match or beat vanilla MTP n=4 on chat workloads. Train B ran for 41 hours.
The result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45 tok/s. The planned Train C (ttt_steps=4) is shelved.
Takeaways:
- ✓ Confirmed an architectural ceiling on EAGLE-3 small-head drafters against an abliterated body — a bigger drafter beats more training data at this scale
- ✓ Production recipe: vanilla MTP
gemma4-26b-a4b-it-assistant+num_speculative_tokens=4. No retrain needed - ✓ Found a vLLM 0.20.2 scheduler deadlock in long-running extract_hidden_states (three hits + watchdog mitigation)
- ✓ The bigger meta-lesson: a 41h-per-experiment training cadence on a single GB10 doesn't keep up with research iteration. The leverage for this series is in measurement harness, not more training. Round 3 isn't on the roadmap; the remaining bandwidth goes into paired-bench tooling, watchdogs, and quick experiments
TL;DR
- Goal: Round 2 tried to close the chat-workload gap between our retrained EAGLE-3 drafter and vanilla MTP n=4 (EN 53 / ZH 45 tok/s) by adding Chinese instruction data and body-regenerated responses
- Result: Train B (80K samples — 50K EN + 30K ZH, ttt=3) completed; inference delivers chat EN 45 / ZH 29 tok/s — essentially identical to v1. Train C (ttt=4) is shelved
- Cause: EAGLE-3 small head sits below vanilla MTP
gemma4-26b-a4b-it-assistant, which is a full Gemma layer. More training data doesn't close a structural drafter-size gap at this scale - Side artifact: A vLLM 0.20.2 scheduler deadlock in long-running concurrent
extract_hidden_statesuse (three hits + a watchdog mitigation), worth opening an upstream issue once the reproducer is narrowed
Where Round 2 started
Two days after Part 30 published, we re-ran the bench and discovered the original throughput script was calling /v1/completions (raw) while the Part 28 baseline was /v1/chat/completions — the headline "2x speedup" only existed on the raw endpoint. On production chat workloads, the v1 retrained drafter was only ~15% faster than the pure body. The full errata is in Part 30.
That left an open question: was v1 underperforming vanilla MTP n=4 (chat EN 46 vs 53, ZH 27 vs 45) because of insufficient training data (especially Chinese OOD), or because the EAGLE-3 small head is structurally weaker than the full Gemma layer used by vanilla MTP?
Round 2 was designed to disambiguate. Add Chinese, train more, and see what happens. A clear improvement would have indicted data. A flat result would point at architecture.
Plan vs. actual
| Stage | Planned | Actual |
|---|---|---|
| ZH dataset | Magpie-Qwen2-Pro-200K-Chinese | ✓ Downloaded, 462 MB / 200K samples |
| ZH response regeneration through huihui body | 50K | 30K. vLLM scheduler hung twice during the regen (part 1: 25K successful, part 2: ~5K additional successes) |
| Train B (EN 50K + ZH 50K, ttt=3) | ~20h | 41h — including 6h44m of validation, plus a stretch of ~1000 "empty sample fallback" steps during a mid-training hsext hang |
| Train C (EN 50K + ZH 50K, ttt=4) | ~30h | Shelved. Train B's signal was clear enough; the user called it during wrap-up |
Final paired bench (2026-05-20, chat completions, paired EN/ZH)
Same vLLM container, same prompt set, max_tokens=200, T=0.7, batch=1. All numbers are from
/v1/chat/completions, matching the Part 30 endpoint correction methodology.
| Config | EN chat tok/s | EN pos-0 acc | EN acc/draft | ZH chat tok/s | ZH pos-0 acc | ZH acc/draft |
|---|---|---|---|---|---|---|
| Pure body Gemma 4 huihui FP8 (no spec) | 40 | — | — | ~22 | — | — |
Vanilla MTP n=1 (gemma4-it-assistant) | 51 | 70.6% | 0.71/1 | — | — | — |
Vanilla MTP n=4 (gemma4-it-assistant) | 53 | 71% | 1.81/4 | 45 | 57% | 1.27/4 |
| v1 retrained EAGLE-3 n=4 (Part 30 ship) | 46 | 57% | 1.09/4 | 27 | 12% | 0.20/4 |
| Round 2 B retrained EAGLE-3 n=4 | 45 | 56% | 1.04/4 | 29 | 15% | 0.22/4 |
| Qwen 3.6 abliterated FP8 (no spec, ref) | 50 | — | — | 50 | — | — |
How to read it:
- Round 2 B vs v1: statistical tie. EN one tok/s slower, ZH two tok/s faster. The 30K Chinese samples produced essentially zero throughput improvement
- Round 2 B vs vanilla MTP n=4: loses by 8 tok/s on EN, 16 tok/s on ZH. Not close
- Both EAGLE-3 drafters' chat pos-0 acceptance sit around 57%, while vanilla MTP holds 71%. That 14 pp acceptance gap is structural, not something more training data can patch
Why drafter size dominates
The speculative-decoding speedup formula, roughly:
speedup ≈ 1 + (acc_per_draft × draft_token_cost_ratio)
In chat workloads, with num_speculative_tokens=4:
- Vanilla MTP n=4: 1.81 accepted draft tokens per pass on average (out of a max of 4) → 1 + 1.81 = 2.81 tokens produced per verify pass
- Round 2 B n=4: 1.04 accepted draft tokens per pass on average → 1 + 1.04 = 2.04 tokens per verify pass
The gap lives in the deeper positions: vanilla MTP's pos-1/2/3 acceptance (49/35/26%) is roughly double our EAGLE-3 head's (28/14/7%). At depth, the bigger drafter holds onto more of the body's distribution; the smaller EAGLE-3 head can't keep up past pos-0.
This is the same mechanism Part 28 identified — deep speculation acceptance scatters on an abliterated body — except Round 2 makes clear that a smaller drafter can't rescue it.
Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image
Three times across Round 2, the vLLM Gemma 4 preview image (vllm/vllm-openai:gemma4-0505-arm64-cu130, internal build 0.20.2rc1.dev49+g9b4e83934 — pushed 2026-05-05, predating the v0.20.2 release tag) got stuck under sustained concurrent extract_hidden_states use:
| When | Where | Symptom |
|---|---|---|
| ZH regen part 1 (2026-05-18 ~02:30) | concurrency=32 after ~6h | Engine logs generation throughput: 0.0, Running: 31 reqs; GET /v1/models still 200 OK |
| ZH regen part 2 (2026-05-18 ~05:00) | concurrency=16 after ~5h | Same pattern |
| Train B mid-run (2026-05-19 ~05:00) | Trainer querying hsext continuously for 14h+ | Trainer hit ~1000 create_empty_sample fallback steps (zero gradient, no model update) |
KV cache stayed at ~1% during all three hangs — this isn't a memory issue, it's a scheduler deadlock.
Mitigation: a small watchdog
A ~30 line shell watchdog handles it:
# Watch docker logs for "Avg generation throughput: 0.0" + "Running: N>0"
# sustained 3 min → docker stop + relaunch via the hsext script
It ran reliably through the second half of Train B without further full-on hangs. The script lives in /tmp/hsext_watchdog.sh on our box; we'll consolidate it into the series reference appendix.
Worth opening upstream, but the minimum reproducer is currently entangled with speculators training + extract_hidden_states + a specific shared-storage path. We'll keep narrowing it on future experiments.
Recommendations for readers
| Your scenario | What to do |
|---|---|
| abliterated Gemma 4 on production chat workload | Vanilla MTP gemma4-26b-a4b-it-assistant + num_speculative_tokens=4. Chat EN 53 / ZH 45 tok/s, no training needed |
| Want inference speedup but don't need abliteration | Run vanilla gemma-4-26B-A4B-it + MTP n=4 directly → ~108 tok/s (Part 27) |
| Still want to fine-tune your own drafter | Expect the EAGLE-3 small-head ceiling. Realistic gain over a vanilla MTP baseline is ~10%, not ~100% |
| Running speculators with extract_hidden_states | Add a watchdog (ours, or write your own) |
HF repo status
- v1 drafter
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft: README already carries the 2026-05-17 endpoint correction - Round 2 B drafter: not publishing a separate repo. Chat numbers are statistically indistinguishable from v1, and putting another drafter on HF with the same caveats would just confuse downloads. The internal checkpoint at
/home/coolthor/data/eagle3_round2_B/0/stays; ask if you want a copy
Pivot: from training to harness
The most important meta-lesson from Round 2 is about cadence: a single GB10 is too slow for training-driven research iteration. One ttt-step variable takes ~41h to run end-to-end (plus a 5% loss to the hsext deadlock). Round 1 was 11h on 50K samples. The cycle from "change one variable" to "see the answer" is two days minimum, which doesn't keep up with anything that's actually iterating.
Compare that to measurement-level leverage we've already gotten in this same series:
- Paired chat bench harness (the one this article uses) — 15 minutes to produce the full comparison table
- Watchdog — 30 lines of shell that fixed a production-blocking upstream issue
- Endpoint methodology audit (the Part 30 errata) — 30 minutes to find that the original 2x speedup claim was a measurement bug
Round 3 is not on the roadmap. The remaining bandwidth in this thread goes into faster-iteration infrastructure:
- Quick refusal-rate experiment: sample ~500 prompts from a month of hikari/kiriha traffic, run them through vanilla Gemma 4, count hard refusals. Decides whether the abliteration -50% throughput tax is actually worth paying. 1-2 hours of work vs 30h of Train C — 30x leverage
- Different base model evaluation: Just measured Qwen 3.6 abliterated MoE 35B-A3B chat throughput at EN 50 / ZH 50 tok/s (no spec decode, consistent across languages), which is a tie with Gemma 4 MTP n=4 (53/45) — not a win on throughput. ⚠️ This post originally cited ~91 tok/s, but that was the theoretical bandwidth ceiling (3B active × FP8 ÷ 273 GB/s), not a measured number — corrected. The real reasons to consider switching the Hermes sibs to Qwen 3.6 are better Chinese quality (TMMLU+ 75% vs Gemma 4's 46%), cross-language consistency (no EN-vs-ZH drop), and simpler vLLM config (no spec decode tuning) — not raw throughput
- Productize the bench harness: turn this paired bench into a reusable skill, so the next drafter or model swap takes 15 minutes rather than half a day
- vLLM upstream issue: narrow the scheduler deadlock reproducer and file it
Train C (ttt=4): at the projected +3-5 tok/s, it doesn't close the gap to vanilla MTP 53/45. The checkpoint config is documented; anyone who wants to verify ttt scaling can run it themselves. We're not spending 30h of GPU time on it.
What the whole series looks like
| Part | Subject | Conclusion |
|---|---|---|
| Part 28 | Mechanism — vanilla draft can't track an abliterated body at depth | Acceptance scatters past pos-0 — structural problem |
| Part 29 | Deploy recipe — n=1, +34% out of the box | n=1 is the safe sweet spot; deeper doesn't pay off |
| Part 30 | Round 1 retrain + endpoint errata | Acceptance flattens, but throughput doesn't 2x on chat (measurement bug) |
| Part 31 (this post) | Round 2 retrain — null result + pivot to harness | Vanilla MTP n=4 is the sweet spot. Single-GB10 training cycle is too slow; the leverage is in measurement, not more training |
Four parts together: we hit a wall and learned the real trade-offs of abliterated body + spec decode, and we learned that the GB10 training loop is the wrong tool for fast iteration. The honest writeup is worth more to the community than another claimed breakthrough — it saves the next person from repeating the same 60+ hours.
Related
- Part 30 — Round 1 retrain + endpoint errata
- Part 29 — n=1 deploy recipe
- Part 28 — Mechanism
- Part 27 — Vanilla Gemma 4 + MTP at 108 tok/s baseline
- HF drafter:
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft(v1 is the only published version; Round 2 B is not separately released)
FAQ
- How does Round 2 differ from v1?
- Data: v1 used 50K Magpie EN instructions with responses regenerated through the huihui body. Round 2 used 50K EN + 30K ZH (sampled from Magpie-Qwen2-Pro-200K-Chinese, responses regenerated through the same body). Training: 1 epoch, ttt_steps=3, same hyperparameters as v1. A Train C run (ttt_steps=4) was planned but skipped during wrap-up.
- What chat numbers did Round 2 actually deliver?
- Round 2 B drafter at n=4 on chat: EN 45 tok/s with pos-0 56%, ZH 29 tok/s with pos-0 15%. The control point is vanilla MTP n=4 on chat: EN 53 tok/s / pos-0 71%, ZH 45 tok/s / pos-0 57%. Round 2 lost — EN by ~8 tok/s and ZH by ~16 tok/s.
- Why did 30k Chinese samples barely move the ZH numbers (+2 tok/s)?
- EAGLE-3 small head architecture appears to top out around pos-0 ~60% acceptance on chat, regardless of training data. Vanilla MTP `gemma4-26b-a4b-it-assistant` is essentially a full Gemma layer trained by Google as a drafter — it's a much bigger drafter than our EAGLE-3 head. That structural size gap isn't something more training data can close.
- Why didn't you run Train C (ttt_steps=4)?
- Train B took 41h, and Train C was projected at another ~30h, with daily inference offline the whole time. The Round 2 B result already pointed clearly at an architectural ceiling, and bumping ttt_steps from 3 to 4 isn't plausibly going to add the 8-16 tok/s needed to cross vanilla MTP's line. The checkpoint config is documented; anyone curious can run it. We chose to stop here and write up the result.
- What's the vLLM scheduler deadlock in the Gemma 4 preview image?
- Under sustained long-run concurrent `extract_hidden_states` workloads (trainer continuously querying vLLM for hidden states), the vLLM Gemma 4 preview image (`vllm/vllm-openai:gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`, pushed 2026-05-05) sometimes enters a state where the engine logs `Running: N reqs, generation throughput: 0.0` and never recovers. We hit this pattern three times — twice during ZH regen and once mid-training. The mitigation is a small watchdog that watches the docker logs for `throughput=0.0 + running>0` sustained beyond 3 minutes, then docker stop + relaunch. It's worth opening an upstream issue once the minimum reproducer is narrowed down.