~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
runs all kinds of models at home (LLMs, image gen, video gen), then writes down what he figures out
quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them
builds options-trading infrastructure with AI agents
had a spec-decode fix merged into vLLM's speculators
occasionally ships iOS apps
❯ cat ~/blog/start-here
Start here
New here? These are a good way in.
- 2026-06-16[Agent 101 #1] AI assistant vs ChatGPT: one answers you, one uses your tools to get things done
You mostly use ChatGPT one question at a time. A self-hosted AI assistant (agent) finishes the job with your own tools, runs on your side, and plugs into the apps you use daily. Lesson one of building your own assistant from zero.
- 2026-06-16[Agent 101 #4] How to install Hermes Agent Desktop: your first AI assistant, no terminal
Install the Hermes Agent desktop app — no terminal. Download it, let it auto-install dependencies, sign in with your ChatGPT account, and your first AI assistant is running in about 15 minutes.
- 2026-06-12[Local LLM #1] My first Q2 model looked broken on a 128GB box — the real culprit was a parser that couldn't read DSML, not the quantization
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
- 2026-06-11[Benchmark #2] Qwen3.5-122B on DGX Spark — 2× faster
Qwen3.5-122B-A10B tops out at 17 tok/s on a 128GB DGX Spark: the GDN wall in vLLM won't budge, not even with a merged perf PR. I swapped vLLM for the Atlas engine on the same abliterated NVFP4 weights and throughput doubled to 33.9 tok/s (36.5 with MTP), with uncensored behavior intact. The real lever was outside the quant toolbox.
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-05-23[LLM 101 #7] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-04-17[LLM 101 #6] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost runs about $1.20, against $20 for ChatGPT Plus. This guide includes a decision table for when to use which.
- 2026-04-16[Ask AI Right #7] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14[Ask AI Right #6] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14[LLM 101 #5] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-06-29[Agent 101 #14] One spec, three assistants, three Tetris games: a Hermes Kanban dispatch test
After raising a fleet of assistants, I gave each of them the same one-line spec, 'make a Tetris game', with no details at all: one card each, one shot each, to write a web Tetris. I touched zero lines of game code; I only published the results. The surprise: from that one line, the Hermes harness plus a local model I tuned myself (on a modded 2080 Ti) filled in things I never asked for, a ghost piece and wall kicks, in one shot. You can play all three.
- 2026-06-29[Troubleshooting #1] HuggingFace download stuck at 0 bytes on Windows — Xet, Python 3.13, ai-toolkit
Training with ai-toolkit on Windows + RTX 5090 hit three walls before it even started: Python 3.13 dependency hell, a HuggingFace download frozen at 0 bytes, and ssh killing the process. Each one's error pointed the wrong way — diagnosis and fix for all three.
- 2026-06-28[LoRA #2] The character-LoRA control panel: dialing in style, realism, and identity
Once your character LoRA is trained, how do you control it? Why lightning flattens style, when to spend full steps, how to stack a style LoRA, and why the trigger word alone won't hold the look.
- 2026-06-27[Just for Fun — Advanced #6] The Tool-Definition Tax: 17K Tokens Before I Say a Word, Re-Billed on Every Cache Miss
I added up what my home agent pays before it reads a single word from me: ~23K tokens of overhead, and 17K of that is just the instruction manuals for its tools. Worse, it runs a hybrid model — on a cache miss it re-processes all 17K from scratch, and a single user turn can do that a dozen-plus times. This is context economics, badly underestimated. The fix isn't cutting tools; it's loading them on demand, the way skills already do.
- 2026-06-26[Just for Fun — Advanced #5] llama.cpp won't persist KV cache to disk — so I put a 60-line proxy in front of it (7× faster restore)
In a long conversation, every message makes the model re-read the whole thing (re-prefill) before it answers, and it is worst right after a restart or a cache eviction. Stock llama.cpp can save the KV cache to disk (--slot-save-path) but won't do it on its own; the auto-persist feature request is closed as not planned. A tiny stdlib reverse proxy restores the cache instead of re-prefilling: 9.9s → 1.4s on a 5K chat (7×). Mechanism, proxy design, and why I haven't shipped it yet.
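For readers who haven't used the save/restore mechanism mentioned in the llama.cpp entry above, here is a minimal sketch of the two requests involved. It is not the 60-line proxy from that post. It assumes a llama.cpp server started with --slot-save-path and reachable at http://127.0.0.1:8080; the slot id 0, the filename chat-5k.bin, and the helper names slot_request and send are hypothetical, chosen only for illustration.

```python
import json
import urllib.request


def slot_request(base_url: str, slot_id: int, action: str, filename: str) -> urllib.request.Request:
    """Build a POST to the llama.cpp server's /slots/{id}?action=save|restore endpoint.

    The server resolves `filename` relative to the directory given by
    --slot-save-path, so only a bare file name is passed here.
    """
    if action not in ("save", "restore"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url.rstrip('/')}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 30.0) -> dict:
    """Send the request and decode the server's JSON reply."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


# Hypothetical values: adjust host, port, slot id, and file name to your setup.
save_req = slot_request("http://127.0.0.1:8080", 0, "save", "chat-5k.bin")
restore_req = slot_request("http://127.0.0.1:8080", 0, "restore", "chat-5k.bin")
```

Passing save_req to send() after a turn, and restore_req after a restart, is the general idea. A real proxy would also have to decide which slot and which file belong to which conversation, and that logic is not shown here.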
❯ ls ~/blog/series
Browse by series
Every thread, grouped