~/blog/dgx-spark-eagle3-round2-null-result

DGX Spark · part 31

Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup

cat --toc

TL;DR

Round 2 verdict: the ceiling didn't move.

After Part 30's endpoint correction made clear that Round 1's "2x speedup" framing was a measurement artifact on the raw endpoint, Round 2 was supposed to test whether adding 30K Chinese instructions (originally 50K — vLLM scheduler hung twice and only 30K samples cleared) plus body-regenerated responses could let an EAGLE-3 fine-tune match or beat vanilla MTP n=4 on chat workloads. Train B ran for 41 hours.

The result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45 tok/s. The planned Train C (ttt_steps=4) is shelved.

Takeaways:

  • ✓ Confirmed an architectural ceiling on EAGLE-3 small-head drafters against an abliterated body — a bigger drafter beats more training data at this scale
  • ✓ Production recipe: vanilla MTP gemma4-26b-a4b-it-assistant + num_speculative_tokens=4. No retrain needed
  • ✓ Found a vLLM 0.20.2 scheduler deadlock in long-running extract_hidden_states (three hits + watchdog mitigation)
  • The bigger meta-lesson: a 41h-per-experiment training cadence on a single GB10 doesn't keep up with research iteration. The leverage for this series is in measurement harness, not more training. Round 3 isn't on the roadmap; the remaining bandwidth goes into paired-bench tooling, watchdogs, and quick experiments

TL;DR

  • Goal: Round 2 tried to close the chat-workload gap between our retrained EAGLE-3 drafter and vanilla MTP n=4 (EN 53 / ZH 45 tok/s) by adding Chinese instruction data and body-regenerated responses
  • Result: Train B (80K samples — 50K EN + 30K ZH, ttt=3) completed; inference delivers chat EN 45 / ZH 29 tok/s — essentially identical to v1. Train C (ttt=4) is shelved
  • Cause: EAGLE-3 small head sits below vanilla MTP gemma4-26b-a4b-it-assistant, which is a full Gemma layer. More training data doesn't close a structural drafter-size gap at this scale
  • Side artifact: A vLLM 0.20.2 scheduler deadlock in long-running concurrent extract_hidden_states use (three hits + a watchdog mitigation), worth opening an upstream issue once the reproducer is narrowed

Where Round 2 started

Two days after Part 30 published, we re-ran the bench and discovered the original throughput script was calling /v1/completions (raw) while the Part 28 baseline was /v1/chat/completions — the headline "2x speedup" only existed on the raw endpoint. On production chat workloads, the v1 retrained drafter was only ~15% faster than the pure body. The full errata is in Part 30.

That left an open question: was v1 underperforming vanilla MTP n=4 (chat EN 46 vs 53, ZH 27 vs 45) because of insufficient training data (especially Chinese OOD), or because the EAGLE-3 small head is structurally weaker than the full Gemma layer used by vanilla MTP?

Round 2 was designed to disambiguate. Add Chinese, train more, and see what happens. A clear improvement would have indicted data. A flat result would point at architecture.

Plan vs. actual

StagePlannedActual
ZH datasetMagpie-Qwen2-Pro-200K-Chinese✓ Downloaded, 462 MB / 200K samples
ZH response regeneration through huihui body50K30K. vLLM scheduler hung twice during the regen (part 1: 25K successful, part 2: ~5K additional successes)
Train B (EN 50K + ZH 50K, ttt=3)~20h41h — including 6h44m of validation, plus a stretch of ~1000 "empty sample fallback" steps during a mid-training hsext hang
Train C (EN 50K + ZH 50K, ttt=4)~30hShelved. Train B's signal was clear enough; the user called it during wrap-up

Final paired bench (2026-05-20, chat completions, paired EN/ZH)

Same vLLM container, same prompt set, max_tokens=200, T=0.7, batch=1. All numbers are from /v1/chat/completions, matching the Part 30 endpoint correction methodology.

ConfigEN chat tok/sEN pos-0 accEN acc/draftZH chat tok/sZH pos-0 accZH acc/draft
Pure body Gemma 4 huihui FP8 (no spec)40~22
Vanilla MTP n=1 (gemma4-it-assistant)5170.6%0.71/1
Vanilla MTP n=4 (gemma4-it-assistant)5371%1.81/44557%1.27/4
v1 retrained EAGLE-3 n=4 (Part 30 ship)4657%1.09/42712%0.20/4
Round 2 B retrained EAGLE-3 n=44556%1.04/42915%0.22/4
Qwen 3.6 abliterated FP8 (no spec, ref)5050

How to read it:

  • Round 2 B vs v1: statistical tie. EN one tok/s slower, ZH two tok/s faster. The 30K Chinese samples produced essentially zero throughput improvement
  • Round 2 B vs vanilla MTP n=4: loses by 8 tok/s on EN, 16 tok/s on ZH. Not close
  • Both EAGLE-3 drafters' chat pos-0 acceptance sit around 57%, while vanilla MTP holds 71%. That 14 pp acceptance gap is structural, not something more training data can patch

Why drafter size dominates

The speculative-decoding speedup formula, roughly:

speedup ≈ 1 + (acc_per_draft × draft_token_cost_ratio)

In chat workloads, with num_speculative_tokens=4:

  • Vanilla MTP n=4: 1.81 accepted draft tokens per pass on average (out of a max of 4) → 1 + 1.81 = 2.81 tokens produced per verify pass
  • Round 2 B n=4: 1.04 accepted draft tokens per pass on average → 1 + 1.04 = 2.04 tokens per verify pass

The gap lives in the deeper positions: vanilla MTP's pos-1/2/3 acceptance (49/35/26%) is roughly double our EAGLE-3 head's (28/14/7%). At depth, the bigger drafter holds onto more of the body's distribution; the smaller EAGLE-3 head can't keep up past pos-0.

This is the same mechanism Part 28 identified — deep speculation acceptance scatters on an abliterated body — except Round 2 makes clear that a smaller drafter can't rescue it.

Side artifact: vLLM scheduler deadlock in the Gemma 4 preview image

Three times across Round 2, the vLLM Gemma 4 preview image (vllm/vllm-openai:gemma4-0505-arm64-cu130, internal build 0.20.2rc1.dev49+g9b4e83934 — pushed 2026-05-05, predating the v0.20.2 release tag) got stuck under sustained concurrent extract_hidden_states use:

WhenWhereSymptom
ZH regen part 1 (2026-05-18 ~02:30)concurrency=32 after ~6hEngine logs generation throughput: 0.0, Running: 31 reqs; GET /v1/models still 200 OK
ZH regen part 2 (2026-05-18 ~05:00)concurrency=16 after ~5hSame pattern
Train B mid-run (2026-05-19 ~05:00)Trainer querying hsext continuously for 14h+Trainer hit ~1000 create_empty_sample fallback steps (zero gradient, no model update)

KV cache stayed at ~1% during all three hangs — this isn't a memory issue, it's a scheduler deadlock.

Mitigation: a small watchdog

A ~30 line shell watchdog handles it:

# Watch docker logs for "Avg generation throughput: 0.0" + "Running: N>0"
# sustained 3 min → docker stop + relaunch via the hsext script

It ran reliably through the second half of Train B without further full-on hangs. The script lives in /tmp/hsext_watchdog.sh on our box; we'll consolidate it into the series reference appendix.

Worth opening upstream, but the minimum reproducer is currently entangled with speculators training + extract_hidden_states + a specific shared-storage path. We'll keep narrowing it on future experiments.

Recommendations for readers

Your scenarioWhat to do
abliterated Gemma 4 on production chat workloadVanilla MTP gemma4-26b-a4b-it-assistant + num_speculative_tokens=4. Chat EN 53 / ZH 45 tok/s, no training needed
Want inference speedup but don't need abliterationRun vanilla gemma-4-26B-A4B-it + MTP n=4 directly → ~108 tok/s (Part 27)
Still want to fine-tune your own drafterExpect the EAGLE-3 small-head ceiling. Realistic gain over a vanilla MTP baseline is ~10%, not ~100%
Running speculators with extract_hidden_statesAdd a watchdog (ours, or write your own)

HF repo status

Pivot: from training to harness

The most important meta-lesson from Round 2 is about cadence: a single GB10 is too slow for training-driven research iteration. One ttt-step variable takes ~41h to run end-to-end (plus a 5% loss to the hsext deadlock). Round 1 was 11h on 50K samples. The cycle from "change one variable" to "see the answer" is two days minimum, which doesn't keep up with anything that's actually iterating.

Compare that to measurement-level leverage we've already gotten in this same series:

  • Paired chat bench harness (the one this article uses) — 15 minutes to produce the full comparison table
  • Watchdog — 30 lines of shell that fixed a production-blocking upstream issue
  • Endpoint methodology audit (the Part 30 errata) — 30 minutes to find that the original 2x speedup claim was a measurement bug

Round 3 is not on the roadmap. The remaining bandwidth in this thread goes into faster-iteration infrastructure:

  • Quick refusal-rate experiment: sample ~500 prompts from a month of hikari/kiriha traffic, run them through vanilla Gemma 4, count hard refusals. Decides whether the abliteration -50% throughput tax is actually worth paying. 1-2 hours of work vs 30h of Train C — 30x leverage
  • Different base model evaluation: Just measured Qwen 3.6 abliterated MoE 35B-A3B chat throughput at EN 50 / ZH 50 tok/s (no spec decode, consistent across languages), which is a tie with Gemma 4 MTP n=4 (53/45) — not a win on throughput. ⚠️ This post originally cited ~91 tok/s, but that was the theoretical bandwidth ceiling (3B active × FP8 ÷ 273 GB/s), not a measured number — corrected. The real reasons to consider switching the Hermes sibs to Qwen 3.6 are better Chinese quality (TMMLU+ 75% vs Gemma 4's 46%), cross-language consistency (no EN-vs-ZH drop), and simpler vLLM config (no spec decode tuning) — not raw throughput
  • Productize the bench harness: turn this paired bench into a reusable skill, so the next drafter or model swap takes 15 minutes rather than half a day
  • vLLM upstream issue: narrow the scheduler deadlock reproducer and file it

Train C (ttt=4): at the projected +3-5 tok/s, it doesn't close the gap to vanilla MTP 53/45. The checkpoint config is documented; anyone who wants to verify ttt scaling can run it themselves. We're not spending 30h of GPU time on it.

What the whole series looks like

PartSubjectConclusion
Part 28Mechanism — vanilla draft can't track an abliterated body at depthAcceptance scatters past pos-0 — structural problem
Part 29Deploy recipe — n=1, +34% out of the boxn=1 is the safe sweet spot; deeper doesn't pay off
Part 30Round 1 retrain + endpoint errataAcceptance flattens, but throughput doesn't 2x on chat (measurement bug)
Part 31 (this post)Round 2 retrain — null result + pivot to harnessVanilla MTP n=4 is the sweet spot. Single-GB10 training cycle is too slow; the leverage is in measurement, not more training

Four parts together: we hit a wall and learned the real trade-offs of abliterated body + spec decode, and we learned that the GB10 training loop is the wrong tool for fast iteration. The honest writeup is worth more to the community than another claimed breakthrough — it saves the next person from repeating the same 60+ hours.

FAQ

How does Round 2 differ from v1?
Data: v1 used 50K Magpie EN instructions with responses regenerated through the huihui body. Round 2 used 50K EN + 30K ZH (sampled from Magpie-Qwen2-Pro-200K-Chinese, responses regenerated through the same body). Training: 1 epoch, ttt_steps=3, same hyperparameters as v1. A Train C run (ttt_steps=4) was planned but skipped during wrap-up.
What chat numbers did Round 2 actually deliver?
Round 2 B drafter at n=4 on chat: EN 45 tok/s with pos-0 56%, ZH 29 tok/s with pos-0 15%. The control point is vanilla MTP n=4 on chat: EN 53 tok/s / pos-0 71%, ZH 45 tok/s / pos-0 57%. Round 2 lost — EN by ~8 tok/s and ZH by ~16 tok/s.
Why did 30k Chinese samples barely move the ZH numbers (+2 tok/s)?
EAGLE-3 small head architecture appears to top out around pos-0 ~60% acceptance on chat, regardless of training data. Vanilla MTP `gemma4-26b-a4b-it-assistant` is essentially a full Gemma layer trained by Google as a drafter — it's a much bigger drafter than our EAGLE-3 head. That structural size gap isn't something more training data can close.
Why didn't you run Train C (ttt_steps=4)?
Train B took 41h, and Train C was projected at another ~30h, with daily inference offline the whole time. The Round 2 B result already pointed clearly at an architectural ceiling, and bumping ttt_steps from 3 to 4 isn't plausibly going to add the 8-16 tok/s needed to cross vanilla MTP's line. The checkpoint config is documented; anyone curious can run it. We chose to stop here and write up the result.
What's the vLLM scheduler deadlock in the Gemma 4 preview image?
Under sustained long-run concurrent `extract_hidden_states` workloads (trainer continuously querying vLLM for hidden states), the vLLM Gemma 4 preview image (`vllm/vllm-openai:gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`, pushed 2026-05-05) sometimes enters a state where the engine logs `Running: N reqs, generation throughput: 0.0` and never recovers. We hit this pattern three times — twice during ZH regen and once mid-training. The mitigation is a small watchdog that watches the docker logs for `throughput=0.0 + running>0` sustained beyond 3 minutes, then docker stop + relaunch. It's worth opening an upstream issue once the minimum reproducer is narrowed down.