~ /home/coolthor

ai-muninn

Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.

whoami

hardware enthusiast running 120B models at home on DGX Spark

building options trading infrastructure with AI agents

occasionally ships iOS apps

cat ~/blog/concepts

Concepts & Methods

For those who want to understand how AI works

cat ~/blog/field-notes

Field Notes

For those who run models and debug the hard way

  • 2026-05-30
    NVFP4 is 1.5× FP8 on a DGX Spark — but it's compression, not the FP4 cores

    On a GB10 DGX Spark, NVFP4 beats FP8 by ~1.5× for single-stream decode on a dense model. But the win is bandwidth (smaller weights), not the FP4 tensor cores — the fastest path never touches them.

  • 2026-05-21
    Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup

    After Part 30's endpoint correction showed that Round 1 didn't actually double chat throughput, Round 2 added 30k regenerated Chinese instruction samples and trained for 41 hours. Result: the Round 2 B drafter delivers 45 tok/s on English chat and 29 tok/s on Chinese chat, essentially the same as v1 (EN 46 / ZH 27) and well below vanilla MTP n=4 (EN 53 / ZH 45). The small EAGLE-3 head hits an architectural ceiling against the abliterated body; more data doesn't fix it. We also found a scheduler deadlock in the vLLM Gemma 4 preview image (`gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`) under long-running `extract_hidden_states` use; it hit three times and was mitigated with a watchdog.

  • 2026-05-19
    [Claude Code] Rules I'd Skip, Hooks I Can't — I Wrote a Hook That Blocks My Own Blog Commits

    I had a rule called 'fact-check before publishing.' I still shipped three fabrications. The problem wasn't the rule — it was where I put it. This is how I promoted it from skill to hook: a small script guarding the moment I press 'send,' so I can't even try without verification.

  • 2026-05-16
    EAGLE-3 fine-tune against an abliterated Gemma 4 body — Round 1 flattens the acceptance curve (plus a measurement lesson)

    RedHatAI's EAGLE-3 drafter, fine-tuned to realign with the huihui Gemma 4 26B-A4B abliterated FP8 body on a single DGX Spark GB10: 1 epoch / 50k Magpie samples / 11h. Inference bench on raw `/v1/completions`: pos 3 acceptance climbs from vanilla's 20.5% to 72.7%; n=4 throughput goes from ~50 to 100.36 tok/s aggregate. **A later paired bench revealed that the throughput comparison used different endpoints for baseline (chat) and retrain (raw). On production chat workloads the real uplift is far smaller than 2×; see the endpoint correction at the top of the post.** Part 28's mechanism observation (deep-speculation acceptance scatters on abliterated distributions) still holds. Includes a dtype bug in upstream Speculators' `create_empty_sample` plus a patch, and a Phase 0 catalog of 6 community prior-art repos.

  • 2026-05-14
    30 lines of docker for +34% on DGX Spark: huihui Gemma 4 FP8 + vanilla MTP n=1 deployment recipe

    Part 28 explained why deep speculation breaks on an abliterated body; this post is the recipe for the part that already works. huihui Gemma 4 26B-A4B FP8 plus Google's vanilla MTP draft at `num_speculative_tokens=1` takes the baseline from 39.3 tok/s to 52.6 tok/s (+34%) on GB10, with no retraining required. It takes ~30 lines of Docker plus a bind-mount of PR #41745's `gemma4_mtp.py`. Includes a 3-step sanity check and a clear list of when n=1 stops being enough.

76 posts total