~/blog/dgx-spark-huihui-gemma4-mtp-n1-recipe

DGX Spark · part 29

30 lines of docker for +34% on DGX Spark: huihui Gemma 4 FP8 + vanilla MTP n=1 deployment recipe

cat --toc

TL;DR

Part 28 was the mechanism post. This is the recipe. Abliterated Gemma 4 26B-A4B FP8 + Google's official vanilla MTP draft at num_speculative_tokens=1 takes the baseline from 39.3 to 52.6 tok/s (+34%) on DGX Spark — no retraining, ~30 lines of docker.

The only real trap is that vLLM's stable release doesn't ship PR #41745 (Gemma 4 MTP integration) yet, so this recipe uses a preview image + a bind-mount of the PR head's gemma4_mtp.py. Verifier is my self-quantized coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic; draft is Google's official gemma-4-26B-A4B-it-assistant.

Why this is worth a separate post

Part 28 is the mechanism post — it explains why a draft trained against the vanilla body distribution fails to track an abliterated body once you push to n=2/3/4. The whole narrative is anchored on that bottleneck.

But the bottleneck only bites in deep speculation. At n=1, the draft predicts one token from the body's real hidden state — it eats exactly one mismatch tax, and that turns out to be tolerable. The first row of Part 28's sweep is already 52.6 tok/s.

If you just want faster inference and aren't planning to fine-tune your own drafter, that +34% is a free lunch. This post is the deploy guide for the part that already works — no mechanism required, just the working recipe.

What you actually get

Mode                            tok/s   vs baseline   pos-0 acceptance
huihui FP8 baseline (no spec)   39.3
+ vanilla MTP n=1               52.6    +34%          69%

Measured numbers; setup details and why n>1 doesn't help live in Part 28.

Prerequisites

  • DGX Spark or equivalent ARM64 + GB10 (sm_12.1, 121 GB unified memory, 273 GB/s bandwidth)
  • Docker + nvidia-container-runtime
  • huggingface-cli installed (both models are public and ungated as of 2026-05-14, so no token is strictly required; running huggingface-cli login once anyway is cheap insurance in case upstream flips either repo to gated later)
  • ~30 GB free disk: huihui FP8 is 27 GB, the Gemma 4 assistant draft is ~1.7 GB (the preflight sketch after this list checks this)
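
Before downloading anything, a quick preflight sketch. It borrows the preview image tag from the recipe below for the GPU check (assuming nvidia-smi is present in the image, which it is for CUDA-based images); any CUDA image you already have pulled works too:

# Preflight: the container runtime can see the GPU, and there is disk for both models.
docker run --rm --gpus all --entrypoint nvidia-smi \
  vllm/vllm-openai:gemma4-0505-arm64-cu130
df -h "$HOME"   # want >= 30 GB free wherever ~/models will live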

Why the stable vLLM release won't work yet

Gemma 4 MTP support landed in vLLM via PR #41745, merged into main on 2026-05-06. The latest stable as of writing is v0.20.2 (2026-05-10), which is a patch release cut from v0.20.1 (2026-05-04). GitHub's compare API confirms status: diverged, behind_by: 39 — v0.20.2 simply does not contain #41745. Its release notes are four bug-fix bullets across three categories: DeepSeek V4 sparse-attn ×2 (MTP=1 hang + KV cache alloc), gpt-oss MXFP4 + torch.compile ×1, and Qwen3-VL deepstack ×1. Nothing about Gemma 4 MTP.
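
You can reproduce that compare check yourself; the endpoint is public and the commit hash is the same merge commit pinned in step 2 of the recipe (assumes curl and python3, no auth token needed at this request rate):

# Is PR #41745's merge commit contained in the v0.20.2 tag?
curl -s "https://api.github.com/repos/vllm-project/vllm/compare/v0.20.2...27e0057aeda6bc443069c20fdf2f3cc95ed892f3" \
  | python3 -c 'import json,sys; d=json.load(sys.stdin); print("status:", d["status"], "behind_by:", d["behind_by"])'
# At time of writing: status: diverged behind_by: 39 ("diverged" or "ahead" means NOT contained).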

So the four real options are:

  • Build vLLM from main (clean, but you'll wait for a long build)
  • Use a nightly image (your stability rides whatever nightly ships)
  • Use eugr/spark-vllm-docker's preview image + bind-mount the PR head's gemma4_mtp.py (what this post does — fewest surprises)
  • Wait for the next minor release (v0.21.x, not out yet)

Recipe

1. Download both models

huggingface-cli download coolthor/Huihui-gemma-4-26B-A4B-it-abliterated-FP8-Dynamic \
  --local-dir ~/models/Huihui-gemma-4-26B-A4B-it-abliterated-fp8

huggingface-cli download google/gemma-4-26B-A4B-it-assistant \
  --local-dir ~/models/gemma-4-26B-A4B-it-assistant
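
An optional sanity check that both downloads completed (exact sizes vary slightly with shard layout):

du -sh ~/models/Huihui-gemma-4-26B-A4B-it-abliterated-fp8 \
  ~/models/gemma-4-26B-A4B-it-assistant
# Expect roughly 27G for the FP8 verifier and ~1.7G for the draft.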

2. Grab a pinned-commit copy of gemma4_mtp.py

mkdir -p ~/mods/gemma4-mtp-fix
# Pinned to PR #41745 merge commit 27e0057a so future drift on main can't change this file
curl -fsSL \
  https://raw.githubusercontent.com/vllm-project/vllm/27e0057aeda6bc443069c20fdf2f3cc95ed892f3/vllm/model_executor/models/gemma4_mtp.py \
  -o ~/mods/gemma4-mtp-fix/gemma4_mtp.py
md5sum ~/mods/gemma4-mtp-fix/gemma4_mtp.py
# Save this hash — the sanity check below compares against it.
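
Optionally stash the hash in a file next to the download so the step-4 check can diff against it instead of relying on your memory (the .md5 filename is just a convention here; nothing reads it automatically):

md5sum ~/mods/gemma4-mtp-fix/gemma4_mtp.py | awk '{print $1}' \
  > ~/mods/gemma4-mtp-fix/gemma4_mtp.py.md5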

3. Docker run with the bind-mount and serve config

docker run --gpus all --rm --network host --name vllm_huihui --privileged --ipc=host \
  -v $HOME/models:/models:ro \
  -v $HOME/mods/gemma4-mtp-fix/gemma4_mtp.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py:ro \
  --entrypoint /bin/bash \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  -c 'vllm serve /models/Huihui-gemma-4-26B-A4B-it-abliterated-fp8 \
    --served-model-name huihui-gemma4 \
    --speculative-config "{\"method\":\"mtp\",\"model\":\"/models/gemma-4-26B-A4B-it-assistant\",\"num_speculative_tokens\":1}" \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.65 \
    --max-model-len 8192 \
    --limit-mm-per-prompt "{\"image\":0,\"audio\":0,\"video\":0}" \
    --enable-auto-tool-choice --tool-call-parser gemma4 \
    --enable-prefix-caching \
    --trust-remote-code \
    --host 0.0.0.0 --port 8000'
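
Before launching, it's worth validating the speculative-config string in isolation, since (per the sanity checks below) a malformed string makes vLLM ignore spec decode with only a startup warning:

# The exact payload that ends up inside --speculative-config, minus the shell escaping.
echo '{"method":"mtp","model":"/models/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":1}' \
  | python3 -m json.tool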

Why n=1 — Part 28 swept n=1..4 against this body; only n=1 is faster than baseline. n=2/3/4 are all slower. Don't push deeper in prod.

Why gpu-memory-utilization=0.65 — leaves headroom for the draft model and KV cache buffers. 0.85 works too, but n=1 doesn't need that much KV budget.
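
To reproduce the 39.3 tok/s baseline for an A/B on your own box, serve the verifier alone: same image, no speculative config, and no gemma4_mtp.py bind-mount since nothing speculative loads. A trimmed sketch:

# Baseline A/B: drop --speculative-config and the gemma4_mtp.py mount; other flags trimmed to essentials.
docker run --gpus all --rm --network host --name vllm_huihui_base --privileged --ipc=host \
  -v $HOME/models:/models:ro \
  --entrypoint /bin/bash \
  vllm/vllm-openai:gemma4-0505-arm64-cu130 \
  -c 'vllm serve /models/Huihui-gemma-4-26B-A4B-it-abliterated-fp8 \
    --served-model-name huihui-gemma4 \
    --kv-cache-dtype fp8 --gpu-memory-utilization 0.65 --max-model-len 8192 \
    --host 0.0.0.0 --port 8000'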

4. Three-step sanity check

# (a) Confirm the bind-mount stuck — md5 should match what you downloaded.
docker exec vllm_huihui md5sum /usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/gemma4_mtp.py

# (b) Confirm the server is up and serving the right model.
curl -s http://localhost:8000/v1/models | python3 -m json.tool

# (c) Send one request and check that spec decode is actually firing.
curl -s -X POST http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"huihui-gemma4","prompt":"Write a haiku about an old API.","max_tokens":80,"temperature":0.7}' \
  | python3 -m json.tool

curl -s http://localhost:8000/metrics | grep spec_decode_num_accepted_tokens

The last command should show vllm:spec_decode_num_accepted_tokens_per_pos_total{position="0"} incrementing with each request. If that counter stays at zero, the usual culprits are (a) the bind-mount silently failed and the md5 won't match, or (b) the --speculative-config JSON is malformed and vLLM's startup log has a warning that the spec config was ignored.
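
To put a rough number on the speedup, here is a crude wall-clock smoke test against the completions endpoint. It is not the Part 28 harness, so expect a few tok/s of noise:

# Crude single-stream tok/s: wall clock over usage.completion_tokens.
START=$(date +%s.%N)
RESP=$(curl -s -X POST http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"huihui-gemma4","prompt":"Explain speculative decoding in one paragraph.","max_tokens":512,"temperature":0}')
END=$(date +%s.%N)
python3 - "$RESP" "$START" "$END" <<'PY'
import json, sys
resp = json.loads(sys.argv[1])
t0, t1 = float(sys.argv[2]), float(sys.argv[3])
toks = resp["usage"]["completion_tokens"]
print(f"{toks} tokens in {t1 - t0:.2f}s = {toks / (t1 - t0):.1f} tok/s")
PY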

When n=1 isn't enough — three signals

  1. You want >52 tok/s. That's the ceiling on this body with this drafter. To get past it you have to deepen speculation, and deepening on an abliterated body requires retraining the drafter (we're running an EAGLE-3 fine-tune — Part 30 forthcoming)
  2. Your workload is Traditional Chinese-heavy. Gemma 4 sits at 46.30% on TMMLU+; Qwen 3.6 35B-A3B scored 75.07% (see the paired benchmark). For TC-heavy tasks, keep Qwen 3.6 as the primary — Gemma 4 is more useful for English brainstorming and image-gen prompts
  3. You're optimizing for batch throughput, not single-stream latency. This recipe is batch=1 latency. Real prod with batch≥4 needs different gpu-memory-utilization and max-num-batched-tokens settings — out of scope here

License and attribution

FAQ

Why a separate recipe post — couldn't this go in Part 28?
Part 28 is the mechanism post — it explains why deeper speculation collapses against an abliterated body, and the narrative is anchored on that bottleneck. This post is pure deploy: you don't have to understand the mechanism to follow the recipe and get +34% on your own box. The two posts are complements, not substitutes.
Can I just upgrade to vLLM v0.20.2 (the 2026-05-10 release) and get +34%?
No. PR #41745 (Gemma 4 MTP integration) merged into main on 2026-05-06, but v0.20.2 is a patch release cut from v0.20.1 (2026-05-04). GitHub's compare shows `behind_by: 39` — v0.20.2 does NOT contain #41745. Its release notes are four bug-fix bullets: DeepSeek V4 sparse-attn ×2, gpt-oss MXFP4 + torch.compile ×1, Qwen3-VL deepstack ×1. Either build vLLM from main, use a nightly image, or use the preview image + bind-mount path documented above.
Why not just run num_speculative_tokens=4?
Deeper speculation structurally collapses against an abliterated body — Part 28 has the per-position acceptance receipts (22pp drop per position). huihui at n=2/3/4 measures slower than n=1. n=1 is the only sweet spot on this body; don't push past it in production.
What if I want to go faster than 52 tok/s?
Two paths: (a) drop abliteration and run vanilla Gemma 4 + MTP n=4 → 108 tok/s (Part 27), or (b) fine-tune an EAGLE-3 drafter against the abliterated body's distribution so deeper speculation actually pays off. We're doing (b) — results will land in Part 30.