~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
runs all kinds of models at home (LLMs, image gen, video gen), then writes down what he figures out
quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them
builds options-trading infrastructure with AI agents
had a spec-decode fix merged into vLLM's speculators
occasionally ships iOS apps
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-05-23 [LLM 101 #7] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-04-17 [LLM 101 #6] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-16 [Ask AI Right #7] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14 [LLM 101 #5] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
- 2026-04-14 [Ask AI Right #6] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-06-12 [Local LLM #1] My first Q2 model looked broken on a 128GB box — the real culprit was a parser that couldn't read DSML, not the quantization
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
- 2026-06-12 [Local LLM #2] Running a 15 tok/s 284B as your daily agent brain — the settings that make it bearable
A 284B model at 15 tok/s, wired into a daily agent. Two sets of settings make it comfortable — server-side and agent-framework-side. --no-mmap cuts cold start to 57s, the KV disk cache halves prefill, and one missing context_length will crash the whole session.
- 2026-06-12 [Local LLM #3] Weights win: a 284B crushed to 2-bit still beats the small model that fits
DeepSeek-V4-Flash (284B) only fits a 128GB box at asymmetric Q2 (~80GB). Sounds like suicide quantization — but it's surgical: only the layers that barely affect quality get cut. As a daily agent it ran 280 turns with zero degradation. Big enough weights survive 2-bit.
- 2026-06-11 [Benchmark #2] Qwen3.5-122B on DGX Spark — 2× faster
Qwen3.5-122B-A10B tops out at 17 tok/s on a 128GB DGX Spark: the GDN wall in vLLM won't budge, not even with a merged perf PR. I swapped vLLM for the Atlas engine on the same abliterated NVFP4 weights and throughput doubled to 33.9 tok/s (36.5 with MTP), uncensored behavior intact. The real lever was outside the quant toolbox.
- 2026-06-09 [Just for Fun #4] Gemma 4 E2B on a GTX 970: the biggest quant runs fastest (47.6 tok/s)
Four Gemma 4 E2B quants on a 2014 GTX 970. The bigger 3.2GB QAT Q4_0 beats the 2.9GB Q2_K — 47.6 vs 32.8 tok/s — because a tensor-core-less Maxwell card is dequant-bound, not bandwidth-bound.