~ /home/coolthor

ai-muninn

Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.

whoami

hardware enthusiast running 120B models at home on DGX Spark

building options trading infrastructure with AI agents

occasionally ships iOS apps

cat ~/blog/concepts

Concepts & Methods

For those who want to understand how AI works

cat ~/blog/field-notes

Field Notes

For those who run models and debug the hard way

  • 2026-05-09
    Want MTP speedup on abliterated Gemma 4? Vanilla draft can't track the modified body

    I self-quantized huihui's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. Sweeping num_speculative_tokens from 1 to 4 shows the abliterated body is exactly as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline), and the MTP boost at n=1 is equivalent. But per-position acceptance decays so steeply that deeper speculation is wasted (see the acceptance sketch after this list). Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.

  • 2026-05-06
    Liftoff: Gemma 4 hits 670 tok/s aggregate on DGX Spark (108 tok/s single-stream)

    Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline) and 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with -it, not base (config sketch after this list).

  • 2026-05-05
    How a zh-TW Linter Found 128 Cases of Mainland-China Drift in My Own Writing

    I ran sysprog21/zhtw-mcp across 72 of my Traditional Chinese articles. Three sweeps, 128 cross-strait substitutions across 42 files. The real takeaway wasn't the count; it was discovering that my blind spot isn't 'I don't know the right Taiwanese term,' it's 'when a Mainland term shows up, I don't automatically doubt it.'

  • 2026-05-04
    [Field Guide] Z-Image Turbo — choosing the right config (1.37× faster, 44% less RAM)

    I ran six Z-Image Turbo quantization configs on DGX Spark GB10: BF16 baseline, FP8 cast standard, FP8 cast fast, FP8 scaled (Kijai), NVFP4, and NVFP4+FP8 encoder. Over N=10 timed runs on an isolated GPU, the NVFP4 transformer hits 5.50s warm versus 7.55s for BF16 (1.37× faster), while all three FP8 paths are slower than BF16. The model working set drops from 20.6 GB (BF16) to 11.5 GB (NVFP4+FP8 encoder), 44% smaller. (Timing sketch after this list.)

  • 2026-05-04
    [Field Guide] Z-Image Turbo — does choosing a faster config hurt quality? LPIPS + CLIPScore answer

    Does Z-Image Turbo quantization break image quality? A two-axis benchmark, LPIPS (perceptual distance vs BF16) plus CLIPScore (image-text alignment), across 6 prompts × 4 configs × 3 seeds = 72 samples. Result: NVFP4 produces images that look different from BF16, but there's no measured regression in this sample; all 4 configs land within ±0.04 std on CLIPScore, smaller than the noise floor. Production users should re-verify with their own prompt set. (Metric sketch after this list.)
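
cat ~/blog/sketches

Acceptance-decay sketch (2026-05-09 entry). Why steep per-position decay makes deep speculation pointless: each verification step yields one token from the target model plus however many consecutive draft tokens survive, so position i contributes the running product of acceptance rates up to i. The rates below are made-up illustrative numbers, not measured values.

def expected_tokens(accept_rates: list[float]) -> float:
    # One token always comes from the target model (correction or bonus);
    # draft position i counts only if positions 1..i were all accepted.
    total, keep = 1.0, 1.0
    for rate in accept_rates:
        keep *= rate          # running product: chance positions 1..i all survive
        total += keep
    return total

# Hypothetical steep decay: n=1 captures most of the gain, n=2-4 add crumbs.
decay = [0.70, 0.35, 0.15, 0.05]
for n in range(1, 5):
    print(n, round(expected_tokens(decay[:n]), 2))   # 1.7, 1.94, 1.98, 1.98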
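
MTP config sketch (2026-05-06 entry). A minimal sketch of wiring an MTP drafter through vLLM's speculative_config. The HF model id and the "mtp" method string are assumptions for illustration (check your vLLM build for the exact keys); num_speculative_tokens is the real knob the post sweeps. The one undocumented trap is in the comment.

from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b-a4b-it",   # hypothetical id; the drafter pairs with -it, not base
    speculative_config={
        "method": "mtp",                 # assumed method name for the Gemma drafter
        "num_speculative_tokens": 4,     # gamma = 4, the post's winning depth
    },
)
out = llm.generate(["Explain MTP drafting in one sentence."], SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)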
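
Timing sketch (first 2026-05-04 entry). The shape of a warm-latency harness behind numbers like "5.50s warm, N=10": warm up first, then synchronize the GPU around each timed run so queued kernels don't flatter the clock. generate is a placeholder for one Z-Image Turbo inference call, not a real API.

import statistics
import time

import torch

def bench(generate, n: int = 10, warmup: int = 3):
    for _ in range(warmup):          # warm runs: compilation, allocator, caches
        generate()
    torch.cuda.synchronize()         # drain queued work before timing starts
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        generate()
        torch.cuda.synchronize()     # stop the clock only when the GPU is done
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), statistics.stdev(times)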
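
Metric sketch (second 2026-05-04 entry). The two-axis measurement, assuming the lpips and torchmetrics packages; tensors and prompts here are placeholders, not the post's 6 × 4 × 3 grid.

import torch
import lpips
from torchmetrics.multimodal.clip_score import CLIPScore

lpips_fn = lpips.LPIPS(net="alex")    # perceptual distance vs the BF16 reference
clip_fn = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def score_pair(ref_img: torch.Tensor, test_img: torch.Tensor, prompt: str):
    # ref_img / test_img: float tensors in [0, 1], shape (1, 3, H, W)
    dist = lpips_fn(ref_img * 2 - 1, test_img * 2 - 1).item()          # LPIPS wants [-1, 1]
    align = clip_fn((test_img * 255).to(torch.uint8), prompt).item()   # CLIPScore wants uint8
    return dist, align    # high dist = looks different; high align = matches the prompt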

70 posts total · view all posts →