~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
runs all kinds of models at home — LLMs, image gen, video gen, then writes down what he figures out
quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them
builds options-trading infrastructure with AI agents
had a spec-decode fix merged into vLLM's speculators
occasionally ships iOS apps
❯ cat ~/blog/start-here
Start here
New here? These are a good way in.
- 2026-06-16 [Agent 101 #1] AI assistant vs ChatGPT: one answers you, one uses your tools to get things done
You mostly use ChatGPT one question at a time. A self-hosted AI assistant (agent) finishes the job with your own tools, runs on your side, and plugs into the apps you use daily. Lesson one of building your own assistant from zero.
- 2026-06-16 [Agent 101 #4] How to install Hermes Agent Desktop: your first AI assistant, no terminal
Install the Hermes Agent desktop app — no terminal. Download it, let it auto-install dependencies, sign in with your ChatGPT account, and your first AI assistant is running in about 15 minutes.
- 2026-06-12 [Local LLM #1] My first Q2 model looked broken on a 128GB box — the real culprit was a parser that couldn't read DSML, not the quantization
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
- 2026-06-11 [Benchmark #2] Qwen3.5-122B on DGX Spark — 2× faster
Qwen3.5-122B-A10B tops out at 17 tok/s on a 128GB DGX Spark: the GDN wall in vLLM won't budge, not even with a merged perf PR. I swapped vLLM for the Atlas engine on the same abliterated NVFP4 weights and throughput doubled to 33.9 tok/s (36.5 tok/s with MTP), uncensored behavior intact. The real lever was outside the quant toolbox.
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-05-23 [LLM 101 #7] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-04-17 [LLM 101 #6] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-16 [Ask AI Right #7] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14 [Ask AI Right #6] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14 [LLM 101 #5] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-06-28 [LoRA #2] The character-LoRA control panel: dialing in style, realism, and identity
Once your character LoRA is trained, how do you control it? Why lightning flattens style, when to spend full steps, how to stack a style LoRA, and why the trigger word alone won't hold the look.
- 2026-06-27 [Just for Fun — Advanced #6] The Tool-Definition Tax: 17K Tokens Before I Say a Word, Re-Billed on Every Cache Miss
I added up what my home agent pays before it reads a single word from me: ~23K tokens of overhead, and 17K of that is just the instruction manuals for its tools. Worse, it runs a hybrid model — on a cache miss it re-processes all 17K from scratch, and a single user turn can do that a dozen-plus times. This is context economics, badly underestimated. The fix isn't cutting tools; it's loading them on demand, the way skills already do.
- 2026-06-26 [Just for Fun — Advanced #5] llama.cpp won't persist KV cache to disk — so I put a 60-line proxy in front of it (7× faster restore)
On a long conversation, every message makes the model re-read the whole thing (re-prefill) before it answers — worst right after a restart or a cache eviction. Stock llama.cpp can save the KV cache to disk (--slot-save-path) but won't do it on its own — the auto-persist feature request is closed as not planned. A tiny stdlib reverse-proxy restores instead of re-prefilling: 9.9s → 1.4s on a 5K chat (7×). Mechanism, proxy design, and why I haven't shipped it yet.
- 2026-06-25 [Just for Fun — Advanced #4] Quantizing the Draft Cache Backfired — A Counterintuitive Look at Qwen MTP (f16 ran 34% faster than q4)
Quantizing the main KV cache to q4 to save memory is fine. So I quantized the MTP draft cache too — it's just a little draft, surely a free win. It wasn't: q4 draft cache ran 29.6 tok/s, the un-quantized f16 ran 39.7, and f16 used less VRAM on top of that. The draft cache is one of the few places where quantizing is a net loss — here's the triple penalty.
- 2026-06-24 [Agent 101 #13] See what your fleet of AI agents is doing — from your phone: Muninn adds a Kanban board
Hermes has a built-in Kanban, but on your phone all you get is Telegram's plain text. Muninn now pulls that board onto the phone: Running / Blocked / Done columns — who's working on what, which card got blocked — at a glance. Zero backend, pure P2P.
❯ ls ~/blog/series
Browse by series
Every thread, grouped