~/blog/dgx-spark-eagle3-finetune-abliterated-round1

DGX Spark · part 30

EAGLE-3 fine-tune against an abliterated Gemma 4 body doubles n=4 throughput to 100 tok/s


TL;DR

The bottleneck from Part 28 is solved. Fine-tuned RedHatAI's pretrained EAGLE-3 drafter against the huihui Gemma 4 26B-A4B abliterated FP8 body for 1 epoch / 50k Magpie samples / ~11h on a single DGX Spark GB10. Inference bench headlines: pos 3 acceptance climbs from vanilla's 20.5% to 72.7% (+52pp); n=4 throughput goes from ~50 tok/s to 100.36 tok/s aggregate (107.59 per-prompt mean) = ~2.0x speedup. Part 28's mechanism is still correct — vanilla draft really does collapse against an abliterated body — but the bottleneck disappears once you retrain the drafter against the modified distribution.

Side artifact: an upstream Speculators bug. create_empty_sample() returns fp32 placeholders that crash BF16 training runs whenever a vLLM extraction request times out. The patch ships in our fork; an upstream PR is in preparation.


  • Goal: Part 28 showed that a vanilla MTP draft structurally fails on an abliterated body: pos 0/1/2/3 acceptance collapses to 65/43/29/20%. Part 30 retrains the drafter to realign with the abliterated body's distribution.
  • Result: WIN. Inference acceptance becomes 84/75/74/73% (decay almost gone) and n=4 throughput goes 50 → 100.36 tok/s aggregate (~2.0×) with a 107.59 tok/s per-prompt mean. Drafter shipped to HF: coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft
  • Side artifact: A real upstream create_empty_sample dtype bug found and patched in vllm-project/speculators; PR going upstream.

Phase 0 prior art (and a Part 28 correction)

Before drafting this post we ran a Phase 0 prior-art sweep. Codex turned up 6 public HF repos that predate Part 28 (which we published 2026-05-09):

  • OptimizeLLM/Qwen3.5-122B-A10B-heretic-MTP-NVFP4 (2026-04-11): heretic body + grafted MTP, ~190 tok/s
  • AEON-7/DFlash-Qwen3.5-27B-Uncensored (2026-04-12): uncensored body + external z-lab DFlash drafter, 33.2 vs 12.2 tok/s
  • guglxni/Qwen3.5-9B-abliterated-DFlash (2026-04-15): the most direct prior art; a DFlash drafter explicitly fine-tuned on abliterated activations to restore acceptance, i.e. the same mechanism story as Part 28
  • AEON-7/supergemma4-26b-dflash-pilot (2026-04-15): DFlash 5K-sample pilot on SuperGemma abliterated; 5.79% top-1; the readme explicitly admits "expect negative speedup"
  • huginnfork/Qwen3.6-27B-uncensored-heretic-v2-mtp (2026-04-26): heretic + MTP
  • llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved (2026-05-06): shipped 3 days before Part 28, which named-and-shamed llmfan46 as "hasn't tuned for spec decode"; they had, 3 days earlier

✏️ Correction to Part 28: at publish time, Part 28 wrote "all abliteration variants today optimize for 'lowest refusal rate' — nobody's optimizing for 'highest MTP acceptance'" and called out llmfan46 specifically as "hasn't tuned for spec decode". Both statements were already false at publish time. Phase 0 surfaced 6 repos predating Part 28, including llmfan46's own native-MTP-preserved release shipped 3 days before our article. Part 30 is built on the corrected version of those facts.

Narrow novelty (Codex-confirmed) we still own:

  • No public HF repo pairs EAGLE-3 with an abliterated body (others use DFlash or native MTP)
  • Nobody else has published per-position acceptance numbers for spec decode against an abliterated body (Part 28 was the first; Part 30 adds the retrained-drafter comparison column)
  • huihui's Gemma 4 26B-A4B abliterated family is the most-downloaded in this niche, so this is the high-leverage point to target

Pipeline

Training stack

  • Hardware: NVIDIA GB10 (DGX Spark), sm_12.1, 121 GB unified memory, 273 GB/s
  • Framework: vllm-project/speculators v0.5.0.dev0
  • Verifier (body): coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic (our self-quantized huihui base)
  • Drafter starting point: RedHatAI/gemma-4-26B-A4B-it-speculator.eagle3 (the pretrained drafter, trained against the vanilla body)
  • Training data: Magpie 50k (Magpie-Align/Magpie-Llama-3.1-Pro-300K-Filtered, instruction split), regenerated through huihui FP8 to produce (input, response) pairs against the abliterated body
  • vLLM dual role: serves the extract_hidden_states speculative_config + ExampleHiddenStatesConnector as a hidden-states producer for the trainer; gpu_memory_utilization=0.5 leaves room for the trainer
  • Epochs / seq length: 1 epoch / 4096-token packed sequences

Data pipeline

  1. response_regeneration/script.py reruns the 50k Magpie instructions through huihui FP8 to produce on-distribution targets (~24h)
  2. prepare_data.py converts the jsonl to an Arrow dataset with token_freq.pt and assistant-token loss masks
  3. train.py --on-missing generate --on-generate delete — trainer and vLLM run concurrently; on each batch the trainer requests hidden states from vLLM, vLLM writes a safetensors file to a shared directory, trainer reads it and deletes immediately
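The step 3 handshake is nothing fancier than files in a shared directory. Below is a minimal sketch of the trainer-side read-and-delete half (the request to vLLM is omitted); the directory layout and filename convention are illustrative, not speculators' actual internals:

import time
from pathlib import Path

from safetensors.torch import load_file

# Hypothetical shared directory that both the vLLM producer and the trainer mount.
HIDDEN_DIR = Path("/shared/hidden_states")

def fetch_hidden_states(batch_id: str, timeout_s: float = 15.0) -> dict:
    """Wait for vLLM to drop the hidden-states file for this batch, load it, delete it."""
    path = HIDDEN_DIR / f"{batch_id}.safetensors"
    deadline = time.monotonic() + timeout_s
    while not path.exists():
        if time.monotonic() > deadline:
            # This is the timeout path that, pre-patch, cascaded into the fp32
            # create_empty_sample() fallback described in the next section.
            raise TimeoutError(f"no hidden states for batch {batch_id} after {timeout_s}s")
        time.sleep(0.1)
    tensors = load_file(path)  # e.g. hidden_states, verifier_last_hidden_states
    path.unlink()              # --on-generate delete: read once, remove immediately
    return tensors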

Speculators upstream bug (the side artifact)

At training step ~9485 (83% through epoch 0), the run died with a dtype mismatch. Root cause:

# data.py — create_empty_sample, original
return {
    "hidden_states": torch.empty(0, 3 * hidden_size),  # ← defaults to fp32!
    "verifier_last_hidden_states": torch.empty(0, hidden_size),  # ← also fp32
    ...
}

When it triggers: a vLLM extraction request times out (default 15s) → all retries fail → collate_fn falls back to create_empty_sample() → downstream eagle3/core.py:fc() and :verifier_lm_head() expect BF16 → RuntimeError: expected mat1 and mat2 to have the same dtype → entire run dies.
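The failure is easy to reproduce outside the trainer. A minimal sketch, where hidden_size is arbitrary and the Linear is just a stand-in for eagle3's BF16 fc projection:

import torch

hidden_size = 2048  # arbitrary; the real value comes from the verifier config

# What the unpatched create_empty_sample() produces: torch.empty() with no
# dtype kwarg defaults to float32, even for a 0-row placeholder.
empty_hidden = torch.empty(0, 3 * hidden_size)

# Stand-in for the BF16 fc projection in eagle3/core.py.
fc = torch.nn.Linear(3 * hidden_size, hidden_size, dtype=torch.bfloat16)

fc(empty_hidden)  # raises the dtype-mismatch RuntimeError from the run

# With the patch, the placeholder is created with dtype=torch.bfloat16 and the
# forward over zero rows becomes a harmless no-op instead of a crash.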

Fix:

def create_empty_sample(hidden_size: int, dtype: torch.dtype = torch.bfloat16):
    return {
        "hidden_states": torch.empty(0, 3 * hidden_size, dtype=dtype),
        "input_ids": torch.empty(0, dtype=torch.long),
        "verifier_last_hidden_states": torch.empty(0, hidden_size, dtype=dtype),
        ...
    }

PR: opening upstream this week; this section will be updated with the link. The patch itself has been validated: v4 (with the patch) ran through to completion, whereas v1 (without it) died at step 9485 from the dtype mismatch. The actual change in data.py:67 is four lines.

Related fragility worth flagging upstream: speculators' --checkpoint-freq default unit is epochs, not steps. A 1-epoch run that crashes mid-epoch throws away every step of progress; we lost 9 hours of training the first time. Worth a separate issue for step-level checkpointing.

Training trajectory (v4 run)

⚠️ Metric clarification: the trainer logs both full_acc_N and cond_acc_N per step. full_acc is the unconditional probability that position N is correct — this is what aligns with vLLM's runtime /metrics per-position acceptance counter. cond_acc is conditional on positions 0..N-1 all being correct. To compare against Part 28's vanilla baseline (65.6 / 43.3 / 29.2 / 20.5), use full_acc. All acceptance numbers in this post are full_acc.

Loss + full_acc trajectory (single-step samples, unsmoothed)

Step                         Loss   full_acc_0   full_acc_1   full_acc_2
1k                           7.51   66.8%        39.9%        25.1%
2k                           6.66   69.5%        44.0%        27.9%
4k                           4.74   77.0%        55.9%        42.6%
6k                           7.18   64.8%        39.0%        24.8%
8k                           7.39   65.7%        38.9%        24.5%
Final val (N=1266 batches)   6.94   66.8%        41.4%        26.4%

Note: trainer defaults to ttt_steps=3, so training only evaluates pos 0/1/2 — there's no full_acc_3 during training. At inference time the drafter extrapolates to pos 3 (the bench in the next section measures it directly).

Convergence observations

  • Loss trajectory is bouncy, not monotonic. It starts at 7.5 around step 1k, hits a low of 4.74 near step 4k, and drifts back to the 7 range by the end of the epoch. This tracks the cosine LR schedule: warmup completes by ~step 100, the LR is still near its peak around step 4k while the model jumps around the loss landscape and learns the rough distribution shape, and after that the LR decays into fine refinement and single-step samples look noisier.
  • The training-time metrics look mediocre on their own. Final val full_acc_2 is 26.4% — slightly below Part 28's vanilla baseline of 29.2%. After looking at val we briefly thought Part 30 was going to be a "NO WIN" failure post.
  • But training acceptance ≠ inference acceptance. Validation full_acc is teacher-forced argmax matching against the Magpie ground-truth tokens (strict). Inference acceptance is rejection sampling against the body's actual sampling distribution at T=0.7 (looser). The next section's inference bench is the real verdict — and it's the opposite story.

Inference bench (the real test)

Once training finishes, we load the new drafter into vLLM (same config as Part 29 except we swap the draft model to our fine-tuned checkpoint) and measure per-position acceptance + throughput.

Setup

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --enable-prefix-caching --trust-remote-code

  • N=10 prompts × T=0.7 × batch=1, max_tokens=200
  • Same vLLM container, same prompt pool — only the draft model swaps between vanilla MTP and our fine-tuned EAGLE-3
  • Part 28's vanilla baseline numbers are reused from the previously published bench (same hardware, same body); a strictly paired re-run on the exact same prompt set was not done here, but the 19-52 pp acceptance gaps below cannot plausibly be explained by prompt-set variation alone.
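For reference, the measurement loop is an ordinary client against vLLM's OpenAI-compatible endpoint. A minimal sketch; the prompt list here is a stand-in for the real N=10 pool, and the port assumes vllm serve's default:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
prompts = ["Explain speculative decoding in two paragraphs."] * 10  # stand-in for the N=10 pool

per_prompt_tps, total_tokens, total_time = [], 0, 0.0
for prompt in prompts:
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=200,
    )
    dt = time.perf_counter() - t0
    tokens = resp.usage.completion_tokens
    per_prompt_tps.append(tokens / dt)
    total_tokens += tokens
    total_time += dt

print(f"aggregate:       {total_tokens / total_time:.2f} tok/s")                  # the 100.36 figure at n=4
print(f"per-prompt mean: {sum(per_prompt_tps) / len(per_prompt_tps):.2f} tok/s")  # the 107.59 figure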

Per-position acceptance (the headline numbers)

Position   Part 28 vanilla draft   Fine-tuned drafter (this post)   Δ
pos 0      65.6%                   84.4%                            +18.8 pp
pos 1      43.3%                   74.9%                            +31.6 pp
pos 2      29.2%                   74.1%                            +44.9 pp
pos 3      20.5%                   72.7%                            +52.2 pp 🚀

The vanilla draft on an abliterated body decays by roughly −22 / −14 / −9 pp per position. The fine-tuned drafter decays by −9 / −1 / −1, essentially a flat curve.

Throughput sweep

num_spec_tokens   Part 28 vanilla (huihui body)   Fine-tuned EAGLE-3 (this post)                    Speedup (fine-tuned / vanilla)
0 (no spec)       39.3 tok/s                      39.3 tok/s                                        1.00×
1                 52.6 tok/s                      59.04 tok/s                                       1.12×
2                 51.4 tok/s                      66.96 tok/s                                       1.30×
3                 46.9 tok/s                      74.90 tok/s                                       1.60×
4                 50.0 tok/s                      100.36 tok/s aggregate / 107.59 per-prompt mean   ~2.01×

(Bench detail: gpu_memory_utilization=0.85, max-model-len=8192, kv_cache_dtype=fp8, temperature=0.7, N=10 prompts × max_tokens=200, batch=1.)

For the vanilla draft, throughput drops as soon as you push past n=1 — deeper speculation is rejected often enough that the verification overhead outweighs the speedup. For our fine-tuned drafter, throughput climbs monotonically from n=1 through n=4, with n=4 as the sweet spot. This is exactly the inverse of Part 28's pattern, which is the whole point — once the drafter is realigned with the body's distribution, deep speculation becomes useful again.
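A back-of-envelope way to see why the curve flips: treat each per-position acceptance as the probability that that draft slot survives verification, so each verifier forward pass emits roughly 1 bonus token plus the sum of the per-position acceptances. This ignores drafter latency and scheduling effects, so take it as intuition rather than a throughput model:

# First-order estimate of tokens emitted per verifier forward pass at n=4,
# using the per-position acceptance rates from the table above.
vanilla   = [0.656, 0.433, 0.292, 0.205]
finetuned = [0.844, 0.749, 0.741, 0.727]

def expected_tokens_per_step(acceptance):
    # 1 token the verifier emits regardless, plus the expected number of accepted drafts.
    return 1.0 + sum(acceptance)

print(f"vanilla:    {expected_tokens_per_step(vanilla):.2f} of a possible 5")   # ≈ 2.59
print(f"fine-tuned: {expected_tokens_per_step(finetuned):.2f} of a possible 5") # ≈ 4.06

At ~4.1 useful tokens per verify pass the drafter's own cost amortizes easily; at ~2.6 the vanilla config spends much of every pass drafting tokens that get thrown away, which is why its throughput plateaus around 50 tok/s instead of scaling with n.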

Verdict: WIN

The fine-tuned EAGLE-3 drafter restores deep speculation on the abliterated body.

  • Inference acceptance: the vanilla draft's 65→20% steep decay on this body flattens to 84→73%, a +52pp recovery at pos 3.
  • Throughput at n=4: vanilla ~50 tok/s → 100.36 tok/s aggregate (2.01×), with a 107.59 tok/s per-prompt mean (~2.15×).
  • Drafter shipped to HF as coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft — 1.86 GB safetensors, drop-in replacement for the vanilla MTP assistant.
  • Part 28's mechanism argument is not refuted — vanilla draft genuinely does collapse on this body. The point is that the collapse is fixable by realigning the drafter against the modified distribution.

Production recipe (daily-use config):

vllm serve coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --speculative-config '{"method":"eagle3","model":"coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-eagle3-draft","num_speculative_tokens":4}' \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.65 \
  --max-model-len 65536 \
  --enable-auto-tool-choice --tool-call-parser gemma4 \
  --enable-prefix-caching --trust-remote-code

Compared to the Part 29 n=1 recipe, four settings change: method from mtp to eagle3, model from the vanilla Gemma 4 assistant to our drafter, num_speculative_tokens from 1 to 4, and --max-model-len from 8192 to 65536 (to accommodate hermes-agent's 64K minimum for daily use). The rest of the vLLM serve flags are identical.

Round 2 plan (Part 31 preview)

With WIN in hand, the round-2 directions are:

  • Cross-workload bench. The current N=10 prompts are English instruction-following. The plan is to measure acceptance on Traditional Chinese, code-heavy, podcast-summary, and image-gen-prompt-writing workloads — the real use cases driving hikari/kiriha — and confirm the acceptance numbers hold (or characterize where they don't).
  • TurboQuant KV cache 3-bit on top of EAGLE-3. Phase 2 of the GB10 stack upgrade plan. KV budget grows ~4x; for single-user chat the expected throughput delta is zero (we benchmarked that at gpu_mem 0.60 vs 0.85 and saw no difference), but for batch ≥ 4 it should matter.
  • ttt_steps=4 or 5 training. The current drafter trains positions 0/1/2 only; n=4 inference extrapolates to pos 3. Empirically that extrapolation works fine (72.7% acceptance), but native training to pos 3 should be more robust on unfamiliar workloads.
  • DFlash control. Compare a DFlash drafter (the guglxni route) trained the same way against this body, to see whether a different drafter architecture gives a different trade-off on Gemma 4 specifically.

Today's recommendation for readers

  • Want faster inference on abliterated Gemma 4? Use this post's production recipe to hit ~100 tok/s aggregate at n=4. The Part 29 n=1 recipe still works as a no-retraining fallback, but n=4 is the new best.
  • Want to fine-tune your own drafter? Follow this post's pipeline section and apply our Speculators patch — change data.py:67 to default dtype=torch.bfloat16.
  • Want to track the broader abliteration + spec-decode community? The six Phase 0 repos above are worth following. Round 2 lands in Part 31.

FAQ

How is Part 30 different from Parts 28 and 29?
Part 28 was the mechanism (why a vanilla draft can't track an abliterated body at depth). Part 29 was the deploy recipe for the part that already works (n=1, +34% out of the box). Part 30 is round 1 of the retraining attempt — we fine-tune an EAGLE-3 drafter against the abliterated body's distribution to try to unlock n=2/3/4 deep speculation.
Did fine-tuning unlock n=4?
Yes, decisively. Inference bench pos 3 acceptance went from vanilla's 20.5% to 72.7% (+52pp), and n=4 throughput went from ~50 tok/s to 100.36 tok/s aggregate (2.01x; per-prompt mean was 107.59 tok/s, ~2.15x). Part 28's 'deep speculation structurally fails on abliterated bodies' bottleneck — the acceptance decay curve that dropped 65→43→29→20% — flattens to 84→75→74→73% once the drafter is retrained against the modified distribution.
Phase 0 found prior art that predates this work?
Yes. 6 public HF repos shipped 'abliterated body + spec decode drafter' combinations before Part 28 published (guglxni, AEON-7 ×2, OptimizeLLM, huginnfork, llmfan46). But no public HF release pairs EAGLE-3 with huihui's Gemma 4 26B-A4B abliterated specifically, and we couldn't find published per-position acceptance numbers in this niche. That's the narrow novelty we're contributing.
What's the Speculators bug?
vllm-project/speculators' `create_empty_sample()` calls `torch.empty()` without a dtype kwarg, which materializes fp32 placeholders. When a vLLM extraction request times out and the trainer falls back to that empty sample, the downstream BF16 layers (fc / verifier_lm_head) hit a dtype mismatch and crash the whole training run. Our patch defaults to `torch.bfloat16`; an upstream PR is in preparation.