DGX Spark · part 30
EAGLE-3 fine-tune against an abliterated Gemma 4 body — Round 1 flattens the acceptance curve (plus a measurement lesson)
❯ cat --toc
- TL;DR
- Phase 0 prior art (and a Part 28 correction)
- Pipeline
- Training stack
- Data pipeline
- Speculators upstream bug (the side artifact)
- Training trajectory (v4 run)
- Loss + full_acc trajectory (single-step samples, unsmoothed)
- Convergence observations
- Inference bench (the real test)
- Setup
- Per-position acceptance (the headline numbers)
- Throughput sweep
- Verdict: mechanism WIN, production framing needs correction
- Round 2 plan (Part 31 preview)
- Today's recommendation for readers
- Related
⚠️ 2026-05-17 endpoint correction: the throughput numbers in this post (100.36 tok/s aggregate / 107.59 per-prompt mean / per-position acceptance 84/75/74/73%) come from a bench script that calls
/v1/completionswith raw prompts. Re-benching revealed that this endpoint skips the chat template, so the instruct-tuned body collapses into ASCII repetition; the drafter then trivially predicts that degenerate sequence and inflates both acceptance and tokens/sec. On a real production/v1/chat/completionsworkload, this retrained drafter delivers ~46 tok/s at pos-0 57%, roughly on par with (or slightly below) vanilla MTP n=1 at 51 tok/s / 70.6%. The mechanism the post describes (retraining flattens the acceptance curve and unlocks deep speculation when the target is predictable) still holds — but the headline "2x speedup" only applies to the raw endpoint, not to anyone calling an OpenAI-compatible chat API. Full paired bench and root-cause analysis will go in Part 31.
TL;DR
Good news — the Part 28 bottleneck is dealt with.
Part 28 mapped a structural failure: on the huihui Gemma 4 26B-A4B abliterated FP8 body, vanilla EAGLE-3 acceptance collapses past the first speculative position. For Round 1 we fine-tuned RedHatAI's pretrained drafter against 50k Magpie samples — instruction set kept, responses regenerated on the abliterated body — for one epoch, ~11h on a single DGX Spark GB10.
That bottleneck flattens out: pos 3 acceptance climbs from vanilla's 20.5% to 72.7% (+52pp), and n=4 throughput goes from ~50 tok/s to 100.36 tok/s aggregate (107.59 per-prompt mean), about a 2.0x speedup. Part 28's mechanism still holds — deep speculation really does scatter on an abliterated distribution — but retraining the drafter against that distribution closes the gap.
Along the way we hit a Speculators upstream bug: create_empty_sample() returns fp32 placeholders, which crash BF16 models on every vLLM extraction timeout fallback. Our fork patches it to torch.bfloat16; upstream PR is open at vllm-project/speculators#527.
TL;DR
- Goal: Part 28 showed that vanilla MTP draft structurally fails on an abliterated body — pos 0/1/2/3 acceptance collapses 65/43/29/20%. Part 30 retrains the drafter to realign with the abliterated body's distribution.
- Result: WIN. Inference acceptance becomes 84/75/74/73% (decay almost gone) and n=4 throughput goes 50 → 100.36 tok/s aggregate (~2.0×) with a 107.59 tok/s per-prompt mean. Drafter shipped to HF:
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft - Side artifact: A real upstream
create_empty_sampledtype bug found and patched in vllm-project/speculators; upstream PR #527 is open.
Phase 0 prior art (and a Part 28 correction)
Before drafting this post we ran a Phase 0 prior-art sweep. Codex turned up 6 public HF repos that predate Part 28 (which we published 2026-05-09):
| Repo | Created | Pattern |
|---|---|---|
OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4 | 2026-04-11 | heretic body + grafted MTP, ~190 tok/s |
AEON-7/DFlash-Qwen3.5-27B-Uncensored | 2026-04-12 | uncensored body + external z-lab DFlash drafter, 33.2 vs 12.2 tok/s |
guglxni/Qwen3.5-9B-abliterated-DFlash | 2026-04-15 | Most direct prior art — DFlash drafter explicitly fine-tuned on abliterated activations to restore acceptance. Same mechanism story as Part 28. |
AEON-7/supergemma4-26b-dflash-pilot | 2026-04-15 | DFlash 5K-sample pilot on SuperGemma abliterated; 5.79% top-1; readme explicitly admits "expect negative speedup" |
huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp | 2026-04-26 | heretic + MTP |
llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved | 2026-05-06 | Shipped 3 days before Part 28. Part 28 named-and-shamed llmfan46 as "hasn't tuned for spec decode" — they had, 3 days earlier. |
✏️ Correction to Part 28: at publish time, Part 28 wrote "all abliteration variants today optimize for 'lowest refusal rate' — nobody's optimizing for 'highest MTP acceptance'" and called out llmfan46 specifically as "hasn't tuned for spec decode". Both statements were already false at publish time. Phase 0 surfaced 6 repos predating Part 28, including llmfan46's own native-MTP-preserved release shipped 3 days before our article. Part 30 is built on the corrected version of those facts.
Narrow novelty (what we could find as of publish):
- We didn't find a public HF repo pairing EAGLE-3 with an abliterated body — adjacent work uses DFlash or native MTP. (Easy to disprove with a counter-example.)
- Part 28 was, to the best of our search, the first published per-position acceptance breakdown for spec decode against an abliterated body. Part 30 adds the retrained-drafter comparison column.
- huihui's Gemma 4 26B-A4B abliterated is the largest-distribution abliterated repo we found in this niche, which is partly why we targeted it.
Round 1's headline numbers are a single-config improvement against a re-used Part 28 baseline — promising rather than a clean causal proof. A paired same-prompt rerun + a no-abliteration control is queued for Round 2.
Pipeline
Training stack
| Component | Setting |
|---|---|
| Hardware | NVIDIA GB10 (DGX Spark), sm_12.1, 121 GB unified memory, 273 GB/s |
| Framework | vllm-project/speculators v0.5.0.dev0 |
| Verifier (body) | coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic (our self-quantized huihui base) |
| Drafter starting point | RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3 (vanilla-trained pretrained) |
| Training data | Magpie 50k (Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered, instruction split), regenerated through huihui FP8 to produce (input, response) pairs against the abliterated body |
| vLLM dual role | Serves extract_hidden_states speculative_config + ExampleHiddenStatesConnector as a hidden-states producer for the trainer. gpu_memory_utilization=0.5 leaves room for the trainer |
| Epochs / seq length | 1 epoch / 4096 packed |
Data pipeline
response_regeneration/script.pyreruns the 50k Magpie instructions through huihui FP8 to produce on-distribution targets (~24h)prepare_data.pyconverts the jsonl to an Arrow dataset with token_freq.pt and assistant-token loss maskstrain.py --on-missing generate --on-generate delete— trainer and vLLM run concurrently; on each batch the trainer requests hidden states from vLLM, vLLM writes a safetensors file to a shared directory, trainer reads it and deletes immediately
Speculators upstream bug (the side artifact)
At training step ~9485 (83% through epoch 0), the run died with a dtype mismatch. Root cause:
# data.py — create_empty_sample, original
return {
"hidden_states": torch.empty(0, 3 * hidden_size), # ← defaults to fp32!
"verifier_last_hidden_states": torch.empty(0, hidden_size), # ← also fp32
...
}
When it triggers: a vLLM extraction request times out (default 15s) → all retries fail → collate_fn falls back to create_empty_sample() → downstream eagle3/core.py:fc() and :verifier_lm_head() expect BF16 → RuntimeError: expected mat1 and mat2 to have the same dtype → entire run dies.
Fix:
def create_empty_sample(hidden_size: int, dtype: torch.dtype = torch.bfloat16):
return {
"hidden_states": torch.empty(0, 3 * hidden_size, dtype=dtype),
"input_ids": torch.empty(0, dtype=torch.long),
"verifier_last_hidden_states": torch.empty(0, hidden_size, dtype=dtype),
...
}
PR is open upstream: vllm-project/speculators#527. The patch itself has been validated — v4 (with the patch) ran through to completion, where v1 (without it) died at step 9485 from the dtype mismatch. The actual change in data.py:67 is four lines.
Related fragility worth flagging upstream: speculators' --checkpoint-freq default unit is epochs, not steps. A 1-epoch run that crashes mid-step throws away every step of progress. We lost 9 hours of training the first time. Worth a separate issue for step-level checkpointing.
Training trajectory (v4 run)
⚠️ Metric clarification: the trainer logs both
full_acc_Nandcond_acc_Nper step.full_accis the unconditional probability that position N is correct — this is what aligns with vLLM's runtime/metricsper-position acceptance counter.cond_accis conditional on positions 0..N-1 all being correct. To compare against Part 28's vanilla baseline (65.6 / 43.3 / 29.2 / 20.5), usefull_acc. All acceptance numbers in this post arefull_acc.
Loss + full_acc trajectory (single-step samples, unsmoothed)
| Step | Loss | full_acc_0 | full_acc_1 | full_acc_2 |
|---|---|---|---|---|
| 1k | 7.51 | 66.8% | 39.9% | 25.1% |
| 2k | 6.66 | 69.5% | 44.0% | 27.9% |
| 4k | 4.74 | 77.0% | 55.9% | 42.6% |
| 6k | 7.18 | 64.8% | 39.0% | 24.8% |
| 8k | 7.39 | 65.7% | 38.9% | 24.5% |
| Final val (N=1266 batches) | 6.94 | 66.8% | 41.4% | 26.4% |
Note: trainer defaults to ttt_steps=3, so training only evaluates pos 0/1/2 — there's no full_acc_3 during training. At inference time the drafter extrapolates to pos 3 (the bench in the next section measures it directly).
Convergence observations
- Loss trajectory is bouncy, not monotonic. It starts at 7.5 around step 1k, hits a low of 4.74 near step 4k, and drifts back to the 7 range by the end of the epoch. This is the cosine LR schedule: warmup completes by ~step 100 and the LR peaks around step 4k, where the model jumps around the loss landscape and learns the rough distribution shape. After that the LR decays into fine refinement and single-step samples look noisier.
- The training-time metrics look mediocre on their own. Final val
full_acc_2is 26.4% — slightly below Part 28's vanilla baseline of 29.2%. After looking at val we briefly thought Part 30 was going to be a "NO WIN" failure post. - But training acceptance ≠ inference acceptance. Validation
full_accis teacher-forced argmax matching against the Magpie ground-truth tokens (strict). Inference acceptance is rejection sampling against the body's actual sampling distribution at T=0.7 (looser). The next section's inference bench is the real verdict — and it's the opposite story.
Inference bench (the real test)
Once training finishes, we load the new drafter into vLLM (same config as Part 29 except we swap the draft model to our fine-tuned checkpoint) and measure per-position acceptance + throughput.
Setup
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
--speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.85 \
--max-model-len 8192 \
--enable-prefix-caching --trust-remote-code
- N=10 prompts × T=0.7 × batch=1, max_tokens=200
- Same vLLM container, same prompt pool — only the draft model swaps between vanilla MTP and our fine-tuned EAGLE-3
- Part 28's vanilla baseline numbers are reused from the previously published bench (same hardware, same body); a strictly paired re-run on the exact prompt set is not done here, but the 19-52pp acceptance gaps below cannot plausibly be explained by prompt-set variation alone.
Per-position acceptance (the headline numbers)
⚠️ Methodology note: in the table below, the "Part 28 vanilla draft" column is measured on chat completions while the "Fine-tuned drafter (this post)" column is measured on raw
/v1/completions— the endpoints don't match. The flattening of the acceptance curve is a real mechanism observation (within the same endpoint, retrain vs vanilla shows a substantial gap), but the precise per-position deltas are contaminated by the endpoint asymmetry. Round 2 paired bench (same endpoint, same prompt set) will appear in Part 31.
| Position | Part 28 vanilla draft (chat) | Fine-tuned drafter (this post, raw) | Δ |
|---|---|---|---|
| pos 0 | 65.6% | 84.4% | +18.8 pp |
| pos 1 | 43.3% | 74.9% | +31.6 pp |
| pos 2 | 29.2% | 74.1% | +44.9 pp |
| pos 3 | 20.5% | 72.7% | +52.2 pp 🚀 |
The vanilla draft on an abliterated body decays at roughly −22 / −14 / −9 pp per step. The fine-tuned drafter decays at −9 / −1 / −1 — essentially a flat curve, on the raw endpoint.
Throughput sweep
⚠️ Every "Fine-tuned EAGLE-3" number in this table is from
/v1/completionsraw endpoint; the Part 28 vanilla column is from chat completions. The corresponding numbers for production chat workloads are materially lower (retrained drafter chat n=4 ~ 46 tok/s vs pure body ~ 40 = ~1.15× real uplift). The table is preserved for readers who want to reproduce the raw-endpoint figures.
num_spec_tokens | Part 28 vanilla (chat) | Fine-tuned EAGLE-3 (raw) | Speedup |
|---|---|---|---|
| 0 (no spec) | 39.3 tok/s | 39.3 tok/s | 1.00× |
| 1 | 52.6 tok/s | 59.04 tok/s | 1.12× |
| 2 | 51.4 tok/s | 66.96 tok/s | 1.30× |
| 3 | 46.9 tok/s | 74.90 tok/s | 1.60× |
| 4 | 50.0 tok/s | 100.36 tok/s aggregate / 107.59 per-prompt mean | ~2.01× |
(Bench detail: gpu_memory_utilization=0.85, max-model-len=8192, kv_cache_dtype=fp8, temperature=0.7, N=10 prompts × max_tokens=200, batch=1.)
For the vanilla draft, throughput drops as soon as you push past n=1 — deeper speculation is rejected often enough that the verification overhead outweighs the speedup. For our fine-tuned drafter on the raw endpoint, throughput climbs monotonically from n=1 through n=4, with n=4 as the sweet spot. The mechanism holds — once the drafter is realigned with the body's distribution, deep speculation becomes useful again. The point that needed correction is that "useful on raw" doesn't translate to "2× on chat".
Verdict: mechanism WIN, production framing needs correction
The fine-tuned EAGLE-3 drafter restores deep speculation on the abliterated body at the raw endpoint; the "2× speedup" framing was over-claimed for chat workloads.
- Inference acceptance (raw endpoint): vanilla 65→20% steep decay flattens to 84→73%, a +52pp recovery at pos 3. The mechanism observation is sound.
- Throughput at n=4 on raw
/v1/completions: vanilla ~50 tok/s → 100.36 tok/s aggregate, ~2.01×. The matching chat-workload comparison is retrained ~46 vs pure body ~40 = ~1.15×, well below what the raw-endpoint table suggests. - Drafter shipped to HF as
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft— 1.86 GB safetensors, drop-in replacement for the vanilla MTP assistant. The HF README now carries the same endpoint caveat. - Part 28's mechanism argument is not refuted — vanilla draft genuinely does collapse on this body, and retraining genuinely does flatten the acceptance curve. The lesson is that "flat acceptance" does not automatically translate to "2× chat throughput".
Production recipe (daily-use config):
vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
--speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.65 \
--max-model-len 65536 \
--enable-auto-tool-choice --tool-call-parser gemma4 \
--enable-prefix-caching --trust-remote-code
Compared to the Part 29 n=1 recipe, four flags change: method from mtp to eagle3, model from the vanilla Gemma 4 assistant to our drafter, num_speculative_tokens from 1 to 4, and --max-model-len from 8192 to 65536 (to accommodate hermes-agent's 64K minimum for daily use). The rest of the vLLM serve flags are identical.
Round 2 plan (Part 31 preview)
With WIN in hand, the round-2 directions are:
- Cross-workload bench. The current N=10 prompts are English instruction-following. The plan is to measure acceptance on Traditional Chinese, code-heavy, podcast-summary, and image-gen-prompt-writing workloads — the real use cases driving hikari/kiriha — and confirm the acceptance numbers hold (or characterize where they don't).
- TurboQuant KV cache 3-bit on top of EAGLE-3. Phase 2 of the GB10 stack upgrade plan. KV budget grows ~4x; for single-user chat the expected throughput delta is zero (we benchmarked that at gpu_mem 0.60 vs 0.85 and saw no difference), but for batch ≥ 4 it should matter.
ttt_steps=4or 5 training. The current drafter trains positions 0/1/2 only; n=4 inference extrapolates to pos 3. Empirically that extrapolation works fine (72.7% acceptance), but native training to pos 3 should be more robust on unfamiliar workloads.- DFlash control. Compare a DFlash drafter (the
guglxniroute) trained the same way against this body, to see whether a different drafter architecture gives a different trade-off on Gemma 4 specifically.
Today's recommendation for readers
- Want faster inference on abliterated Gemma 4 for production chat workloads? The first thing to try is the vanilla Gemma 4 MTP assistant with
num_speculative_tokens=4— Round 2 paired bench shows it hits chat EN ~53 / ZH ~45 tok/s, and our small EAGLE-3 drafter doesn't beat it on chat. This post's retrained drafter does well on raw/v1/completions(~100 tok/s aggregate), but the production-chat uplift is much smaller. Detailed numbers will land in Part 31. - Want to fine-tune your own drafter? Follow this post's pipeline section and apply our Speculators patch — change
data.py:67to defaultdtype=torch.bfloat16. - Want to track the broader abliteration + spec-decode community? The five Phase 0 repos above are worth following. Round 2 lands in Part 31.
Related
- Part 28 — Mechanism: vanilla draft can't track the modified body
- Part 29 — Deploy recipe: n=1, +34% out of the box
- Part 27 — Vanilla Gemma 4 + MTP at 108 tok/s
- HF release (verifier):
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic - Drafter checkpoint (this post's output):
coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft - Speculators bug + our patch:
speculators/src/speculators/train/data.py:67. Upstream PR: vllm-project/speculators#527. - Phase 0 prior art: see table above
FAQ
- How is Part 30 different from Parts 28 and 29?
- Part 28 was the mechanism (why a vanilla draft can't track an abliterated body at depth). Part 29 was the deploy recipe for the part that already works (n=1, +34% out of the box). Part 30 is round 1 of the retraining attempt — we fine-tune an EAGLE-3 drafter against the abliterated body's distribution to try to unlock n=2/3/4 deep speculation.
- Did fine-tuning unlock n=4?
- The mechanism, yes — on raw `/v1/completions`, pos 3 acceptance climbs from vanilla's 20.5% to 72.7% (+52pp), and the acceptance decay curve flattens from 65→43→29→20% to 84→75→74→73%. But **the '2x throughput' framing does not hold on production chat workloads**: per the endpoint correction at the top of the post, the original baseline was measured on chat while the retrain was on raw — they aren't apples-to-apples. On chat: retrained drafter ~46 tok/s, vanilla MTP n=1 ~51 tok/s, pure body ~40 tok/s → real uplift +15%, not +100%. Round 2 paired bench will run in Part 31.
- Why doesn't the retrained drafter give 2x on chat workloads?
- Two reasons. (1) Chat output is real semantic content; for deep speculation, pos 1/2/3 acceptance compounds error in a hard-to-predict context, so only pos-0 actually helps. The raw `/v1/completions` endpoint collapses into ASCII-style repetition, which a small drafter can trivially predict. (2) Our retrained EAGLE-3 head is much smaller than Google's pretrained vanilla MTP `Gemma4MTPModel` (a full Gemma layer); on chat we observe pos-0 acceptance 57% vs vanilla MTP's 70.6%. This is an architecture × workload-predictability tradeoff. Adding Chinese data in Round 2 targets a different issue (v1's ZH OOD blind spot) and isn't expected to break through this architectural ceiling.
- Phase 0 found prior art that predates this work?
- Yes. 6 public HF repos shipped 'abliterated body + spec decode drafter' combinations before Part 28 published (guglxni, AEON-7 ×2, OptimizeLLM, huginnfork, llmfan46). But no public HF release pairs EAGLE-3 with huihui's Gemma 4 26B-A4B abliterated specifically, and we couldn't find published per-position acceptance numbers in this niche. That's the narrow novelty we're contributing.
- What's the Speculators bug?
- vllm-project/speculators' `create_empty_sample()` defaults `torch.empty()` with no dtype kwarg, which materializes fp32 placeholders. When a vLLM extraction request times out and the trainer falls back to that empty sample, the downstream BF16 layers (fc / verifier_lm_head) hit a dtype mismatch and crash the whole training run. Our patch defaults to `torch.bfloat16`; upstream PR is open at [vllm-project/speculators#527](https://github.com/vllm-project/speculators/pull/527).