~ /home/coolthor

ai-muninn

Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.

whoami

runs all kinds of models at home (LLMs, image gen, video gen), then writes down what he figures out

quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them

builds options-trading infrastructure with AI agents

had a spec-decode fix merged into vLLM's speculators

occasionally ships iOS apps

cat ~/blog/start-here

Start here

New here? These are a good way in.

cat ~/blog/concepts

Concepts & Methods

For those who want to understand how AI works

cat ~/blog/field-notes

Field Notes

For those who run models and debug the hard way

  • 2026-07-03
    [Just for Fun — Advanced #8] I Doubled My Agent's Decode Speed and It Got Slower: TTFT Is the Number You Actually Feel

    I swapped my home agent's brain for one that decodes 30-40 tok/s instead of 14, and it felt slower. The number I'd stared at for a year — tok/s — only measures how fast tokens come out, not how long before they start. On a hybrid model, a single cache miss re-prefills the entire prompt: same box, same brain, 2.6s warm vs 216s cold. Here's the live log.

  • 2026-07-02
    [Just for Fun — Advanced #7] Progressive Streaming on a Slow Model Got My Bot Rate-Limited by Telegram

    To ease the wait on a pokey local agent, I turned on Telegram streaming — which, the way this bot did it, means rewriting the same message every fraction of a second. On a 14 tok/s brain, a single 175-second reply works out to an estimated couple hundred edit requests, which slammed into Telegram's flood control and got the whole bot benched for four minutes — final answer included. The short, ugly lesson: slow models should not fake streaming with edits. Send the finished answer once. Live logs inside.

  • 2026-07-01
    [Agent 101 #15] Hermes /learn: I had a local 27B write its own reusable skill

    Hermes has a /learn command that turns 'something you just did' into a reusable skill — a SKILL.md. I wired it into my own fleet: one Kanban card, a local 27B running on a modded 2080 Ti, and about 3 minutes later it handed back a clean, spec-compliant skill — plus two implementation details the docs don't spell out (slash command vs. dispatch, and where skills actually live). A plain-language walkthrough of what /learn does, how to use it, and where its limits are.

  • 2026-06-29
    [Agent 101 #14] One spec, three assistants, three Tetris games: a Hermes Kanban dispatch test

    After raising a fleet of assistants, I gave them the same one-line 'make a Tetris game' spec — no details at all — one card each, and let them each write a web Tetris in a single shot. I touched zero lines of game code; I only published the result. The surprise: from that one line, the Hermes harness plus a local model I tuned myself (on a modded 2080 Ti) filled in things I never asked for — a ghost piece and wall-kick — in one shot. You can play all three.

  • 2026-06-29
    [Troubleshooting #1] HuggingFace download stuck at 0 bytes on Windows — Xet, Python 3.13, ai-toolkit

    Training with ai-toolkit on Windows + RTX 5090 hit three walls before it even started: Python 3.13 dependency hell, a Hugging Face download frozen at 0 bytes, and SSH killing the process. Each error message pointed the wrong way; this post gives the diagnosis and fix for all three.
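
The TTFT point in the 2026-07-03 entry above comes down to simple arithmetic: the time to a finished reply is roughly TTFT plus reply length divided by decode speed. Here is a minimal Python sketch using the figures quoted there (2.6s warm vs 216s cold TTFT, 14 tok/s vs 30-40 tok/s decode). The 350-token reply length, the 35 tok/s midpoint, and the idea that the 14 tok/s model also starts warm in 2.6s are illustrative assumptions, not measurements from the post.

```python
def total_latency(ttft_s: float, n_tokens: int, tok_per_s: float) -> float:
    """Rough wall-clock time for a full reply: wait for the first token, then decode."""
    return ttft_s + n_tokens / tok_per_s


# Figures quoted in the entry above.
WARM_TTFT_S = 2.6    # cache hit
COLD_TTFT_S = 216.0  # cache miss, whole prompt re-prefilled
SLOW_TOK_S = 14.0    # old decode speed
FAST_TOK_S = 35.0    # assumed midpoint of the quoted 30-40 tok/s

# Hypothetical reply length, chosen only for illustration.
REPLY_TOKENS = 350

if __name__ == "__main__":
    fast_warm = total_latency(WARM_TTFT_S, REPLY_TOKENS, FAST_TOK_S)
    fast_cold = total_latency(COLD_TTFT_S, REPLY_TOKENS, FAST_TOK_S)
    # Assumption: the slower model also gets a 2.6s warm start.
    slow_warm = total_latency(WARM_TTFT_S, REPLY_TOKENS, SLOW_TOK_S)

    print(f"fast decode, warm cache: {fast_warm:.1f}s")  # 12.6s
    print(f"fast decode, cold cache: {fast_cold:.1f}s")  # 226.0s
    print(f"slow decode, warm cache: {slow_warm:.1f}s")  # 27.6s
```

Under these assumptions, one cold prefill costs far more than the faster decode saves, which is why a model with better tok/s can still feel slower.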

ls ~/blog/series

Browse by series

Every thread, grouped

117 posts total
