~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
runs all kinds of models at home (LLMs, image gen, video gen), then writes down what he figures out
quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them
builds options-trading infrastructure with AI agents
had a spec-decode fix merged into vLLM's speculators
occasionally ships iOS apps
❯ cat ~/blog/start-here
Start here
New here? These are a good way in.
- 2026-06-16[Agent 101 #1] AI assistant vs ChatGPT: one answers you, one uses your tools to get things done
You mostly use ChatGPT one question at a time. A self-hosted AI assistant (agent) finishes the job with your own tools, runs on your side, and plugs into the apps you use daily. Lesson one of building your own assistant from zero.
- 2026-06-16[Agent 101 #4] How to install Hermes Agent Desktop: your first AI assistant, no terminal
Install the Hermes Agent desktop app — no terminal. Download it, let it auto-install dependencies, sign in with your ChatGPT account, and your first AI assistant is running in about 15 minutes.
- 2026-06-12[Local LLM #1] My first Q2 model looked broken on a 128GB box — the real culprit was a parser that couldn't read DSML, not the quantization
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
- 2026-06-11[Benchmark #2] Qwen3.5-122B on DGX Spark — 2× faster
Qwen3.5-122B-A10B tops out at 17 tok/s on a 128GB DGX Spark: the GDN wall in vLLM won't budge, not even with a merged perf PR. I swapped vLLM for the Atlas engine on the same abliterated NVFP4 weights and throughput doubled to 33.9 tok/s (36.5 with MTP), with uncensored behavior intact. The real lever was outside the quant toolbox.
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-05-23[LLM 101 #7] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-04-17[LLM 101 #6] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Power costs about $1.20 a month, versus $20 for ChatGPT Plus. This guide has a decision table for when to use which.
- 2026-04-16[Ask AI Right #7] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14[Ask AI Right #6] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques, including adding constraints, asking for comparisons, and letting AI ask YOU questions, to get dramatically better results.
- 2026-04-14[LLM 101 #5] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-06-24[Agent 101 #13] See what your fleet of AI agents is doing — from your phone: Muninn adds a Kanban board
Hermes has a built-in Kanban, but on your phone all you get is Telegram's plain text. Muninn now pulls that board onto the phone: Running / Blocked / Done columns — who's working on what, which card got blocked — at a glance. Zero backend, pure P2P.
- 2026-06-24[Just for Fun — Advanced #3] I Maxed Context to 256K, It Loaded Fine — Then Crashed in Real Use: A VRAM Detective Story on a 22GB Frankencard
The model card says n_ctx_train=262144. The GPU has 22GB of VRAM. The 27B's Q4 weights are only 15.7GB. The math looks obvious: max it to 256K, plenty to spare. Launch with -c 262144 and it loads fine, no error. A few turns of real conversation later: 503, and the service restarts itself. No tidy out-of-memory in the log, just a lone 0xc0000409. nvidia-smi shows free VRAM down to ~170 MiB. Where did the gigabytes go? This is the hunt: I first blamed context checkpoints, but the llama.cpp source says they live in host RAM. The real VRAM eater is the KV cache, free VRAM versus context is nonlinear, and the one stable sweet spot isn't 256K. It's 128K.
- 2026-06-23[Agent 101 #12] Reach your home AI agent from anywhere: Muninn, a private iOS app over iroh P2P
Hermes runs at home, but you lose it the second you walk out. Bridging through Telegram works but it's fiddly and routes every message through someone else's server. Muninn is an iOS app built for Hermes: give your agent one command, scan a QR, and your phone connects straight home over an encrypted iroh tunnel — no cloud in the path.
- 2026-06-23[Just for Fun — Advanced #2] I Gave Up 100 tok/s for 30 — Fast Isn't the Same as Useful
Picking a local model, I looked at tok/s first too. Gemma 12B does 90-100 tok/s and it's great, until you put it on a kanban board, where it finishes the work and just walks away, never marking the card done. A Qwen 27B that's three times slower actually closes the loop. Why throughput is the wrong number for an agent, plus how grep almost lied to me about it.
- 2026-06-22[Agent 101 #11] Assistant gone haywire? Don't blame the engine — usually it's the car that broke, not the engine
When an AI assistant loops, wanders, freezes, or answers the wrong question, your first instinct is 'this model is dumb.' But from my own debugging, eight times out of ten it's not the model — it's the ring around it (tools, config, memory). The model is the engine; that ring is the car. A car that won't move usually doesn't have a broken engine — it has a flat tire or a clogged fuel line.
❯ ls ~/blog/series
Browse by series
Every thread, grouped