~ /home/coolthor

ai-muninn

Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.

whoami

hardware enthusiast running 120B models at home on DGX Spark

building options trading infrastructure with AI agents

occasionally shipping iOS apps

cat ~/blog/concepts

Concepts & Methods

For those who want to understand how AI works

cat ~/blog/field-notes

Field Notes

For those who run models and debug the hard way

  • 2026-05-16
    EAGLE-3 fine-tune against an abliterated Gemma 4 body doubles n=4 throughput to 100 tok/s

    RedHatAI's EAGLE-3 drafter, fine-tuned to realign with huihui's abliterated Gemma 4 26B-A4B FP8 on a single DGX Spark GB10: 1 epoch, 50k Magpie samples, 11h. On the inference bench, position-3 acceptance climbs from the vanilla drafter's 20.5% to 72.7%, and n=4 throughput goes from ~50 to 100.36 tok/s aggregate (107.59 tok/s per-prompt mean), roughly 2.0x. Part 28's 'deep speculation collapses on abliterated bodies' bottleneck is solved by retraining the drafter against the modified distribution. Includes a patch for an upstream create_empty_sample dtype bug in Speculators and a Phase 0 catalog of 6 community prior-art repos.

  • 2026-05-14
    30 lines of docker for +34% on DGX Spark: huihui Gemma 4 FP8 + vanilla MTP n=1 deployment recipe

    Part 28 explained why deep speculation breaks on an abliterated body; this post is the recipe for the part that already works. huihui's Gemma 4 26B-A4B FP8 plus Google's vanilla MTP draft at num_speculative_tokens=1 takes the 39.3 tok/s baseline to 52.6 tok/s (+34%) on GB10, no retraining required. ~30 lines of docker plus a bind-mount of PR #41745's gemma4_mtp.py. Includes a 3-step sanity check (a rough single-stream tok/s probe in the same spirit is sketched after the post list) and a clear list of when n=1 stops being enough.

  • 2026-05-09
    Want MTP speedup on abliterated Gemma 4? Vanilla draft can't track the modified body

    I self-quantized huihui's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. Sweeping num_speculative_tokens from 1 to 4 shows the abliterated body is exactly as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent, but per-position acceptance decays so steeply that deeper speculation is wasted; the acceptance-math sketch after the post list shows why. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.

  • 2026-05-06
    Liftoff: Gemma 4 hits 670 tok/s aggregate on DGX Spark (108 tok/s single-stream)

    Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline) and 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with the -it checkpoint, not the base model.

  • 2026-05-05
    How a zh-TW Linter Found 128 Cases of Mainland-China Drift in My Own Writing

    I ran sysprog21/zhtw-mcp across 72 of my Traditional Chinese articles. Three sweeps, 128 cross-strait substitutions across 42 files. The real takeaway wasn't the count: it was discovering that my blind spot isn't 'I don't know the right Taiwanese term,' it's 'when a Mainland term shows up, I don't auto-doubt it.'
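
cat ~/snippets/spec_decode_math.py

A minimal sketch of the arithmetic behind the speculation entries above: why steep per-position acceptance decay (2026-05-09) caps what a deeper draft can buy, and why realigning the drafter (2026-05-16) restores it. The acceptance profiles below are illustrative placeholders, not measured values; only the position-3 figures (20.5% and 72.7%) are numbers the posts report, and those describe a single position, not a full profile.

# Expected tokens emitted per verification step under speculative decoding.
# acceptance[i] = probability the draft token at position i is accepted,
# given every earlier position was accepted.

def expected_accepted(acceptance):
    total, survive = 0.0, 1.0
    for a in acceptance:
        survive *= a          # chance the accepted chain is still alive here
        total += survive      # contributes one more expected accepted token
    return total

def tokens_per_step(acceptance):
    # Every verification step emits at least one token (the target model's own
    # corrected/bonus token) plus however many draft tokens were accepted.
    return 1.0 + expected_accepted(acceptance)

# Hypothetical profiles (placeholders, not measured): a drafter that tracks the
# target well vs. one whose acceptance decays steeply after position 1.
well_matched = [0.85, 0.80, 0.76, 0.72]
steep_decay  = [0.75, 0.45, 0.28, 0.20]

for name, profile in [("well matched", well_matched), ("steep decay", steep_decay)]:
    for n in range(1, len(profile) + 1):
        print(f"{name:12s} n={n}: ~{tokens_per_step(profile[:n]):.2f} tokens per verification step")

With the steep-decay profile, pushing n from 1 to 4 barely moves tokens-per-step while multiplying draft work; with a realigned profile the same n=4 pays off, which is the shape of the Part 28 problem and of the EAGLE-3 fix.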
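
cat ~/snippets/toks_per_sec_check.py

A rough single-stream tok/s probe in the spirit of the 2026-05-14 recipe's sanity check, assuming a vLLM (or any OpenAI-compatible) server is already listening locally. The base URL, port, model id, prompt, and file name are placeholders rather than the post's actual recipe, and the timing includes prefill and HTTP overhead, so treat the result as a ballpark, not a benchmark number.

import time
import requests

BASE_URL = "http://localhost:8000/v1"   # assumed local OpenAI-compatible endpoint
MODEL = "served-model-id"               # placeholder: whatever id the server was launched with

def single_stream_toks_per_sec(prompt: str, max_tokens: int = 512) -> float:
    t0 = time.perf_counter()
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": prompt,
              "max_tokens": max_tokens, "temperature": 0.0},
        timeout=600,
    )
    resp.raise_for_status()
    elapsed = time.perf_counter() - t0
    completion_tokens = resp.json()["usage"]["completion_tokens"]
    return completion_tokens / elapsed

if __name__ == "__main__":
    rate = single_stream_toks_per_sec("Explain speculative decoding in three sentences.")
    print(f"~{rate:.1f} tok/s single-stream (includes prefill and HTTP overhead)")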

72 posts total · view all posts →