~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
hardware enthusiast running 120B models at home on DGX Spark
building options trading infrastructure with AI agents
occasionally ships iOS apps
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-04-17 [LLM 101] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost runs ~$1.20 vs. $20 for ChatGPT Plus; the back-of-envelope math follows this list. This guide has a decision table for when to use which.
- 2026-04-16 [Ask AI Right] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14 [Ask AI Right] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14 [LLM 101] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
- 2026-04-13 [Ask AI Right] Before You Build It, Ask: Does This Already Exist?
Your first question to AI shouldn't be 'help me do X.' It should be 'is there something that already does X?' This article teaches you how to use AI as a research assistant — finding tools, comparing alternatives, and verifying they're still alive.
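The ~$1.20 power figure above is easy to sanity-check. Here is a back-of-envelope sketch; the wattage, daily hours, and electricity rate are illustrative assumptions, not measurements from the post:

```python
# Back-of-envelope monthly power cost. All three inputs are assumptions
# chosen to illustrate the method -- the post only states ~$1.20/month.
watts = 100          # assumed average draw during inference sessions
hours_per_day = 4    # assumed daily usage
usd_per_kwh = 0.10   # assumed electricity rate

kwh_per_month = watts / 1000 * hours_per_day * 30  # 12.0 kWh
usd_per_month = kwh_per_month * usd_per_kwh        # $1.20
print(f"{kwh_per_month:.1f} kWh/month -> ${usd_per_month:.2f}/month")
```

Any plausible combination of these inputs stays an order of magnitude below a $20/month subscription, which is the point of the comparison.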
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-04-28 [llm-compressor] Self-Quantizing a 35B Abliterated MoE to FP8 on DGX Spark: 4 OOMs, 3 Prefix Bugs, and Why the First Success Wasn't Actually FP8
Quantizing huihui-ai's abliterated Qwen3.6-35B-A3B to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why a streaming save_pretrained returns BF16, not FP8. Final result: 51.72 tok/s, 1.68× over BF16. A minimal recipe sketch follows this list.
- 2026-04-28 [SWE-bench] Where Qwen 3.6 35B Loses on SWE-bench Lite: Anatomy of 155 Unresolved Tasks
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive into Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
- 2026-04-26 [Benchmark] Abliteration Costs 1.85pp on Traditional Chinese — and 7.7pp on Trust Law
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects (trust law 信託 −7.7pp, administrative law 行政法 −7.1pp) lose the most, while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 2026-04-25 [Benchmark] TMMLU+ Paired Eval: Qwen 3.6 35B Sweeps Gemma 4 26B 51-of-51 on Traditional Chinese
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 2026-04-22 [Hands-On] Making NVFP4 17% Faster on GB10 with a Triton FP8 Bypass
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequantizes NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code in the post; a conceptual sketch of the dequant step follows this list.
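For the FP8 post, the core llm-compressor flow looks roughly like this. A minimal sketch, not the post's exact recipe: the repo id is a placeholder, and the MoE gate ignore pattern follows llm-compressor's published Qwen-MoE examples rather than anything confirmed in the post:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Qwen3.6-35B-A3B-abliterated"  # placeholder repo id

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8-dynamic (W8A8) quantizes all Linear layers with no calibration data;
# skip the LM head and the MoE router gates, which are precision-sensitive.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*mlp.gate$"],
)

oneshot(model=model, recipe=recipe)

# save_compressed=True writes real FP8 weights. Per the post, a streaming
# save path can silently fall back to BF16, so verify tensor dtypes on disk.
model.save_pretrained("Qwen3.6-35B-A3B-FP8-dynamic", save_compressed=True)
tokenizer.save_pretrained("Qwen3.6-35B-A3B-FP8-dynamic")
```

FP8-dynamic computes activation scales at runtime, so no calibration pass is needed, which matters on a 128 GB UMA box where every extra forward pass is another OOM opportunity.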
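And for the NVFP4 bypass, here is the shape of the trick in Triton. This is a conceptual sketch, not the kernel from the post: the low-nibble-first packing, the one-FP8-scale-per-16-values layout, and the omission of NVFP4's per-tensor second-level scale are all simplifying assumptions.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def nvfp4_to_fp8_kernel(packed_ptr, scale_ptr, out_ptr, n,
                        GROUP: tl.constexpr, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    # Two 4-bit E2M1 codes per byte; ASSUMED low-nibble-first packing.
    byte = tl.load(packed_ptr + offs // 2, mask=mask, other=0)
    nib = tl.where(offs % 2 == 0, byte & 0xF, (byte >> 4) & 0xF)
    sign = tl.where((nib & 0x8) != 0, -1.0, 1.0)
    m = nib & 0x7
    # E2M1 magnitude table: 0, 0.5, 1, 1.5, 2, 3, 4, 6
    mag = tl.where(m == 0, 0.0, tl.where(m == 1, 0.5, tl.where(m == 2, 1.0,
          tl.where(m == 3, 1.5, tl.where(m == 4, 2.0, tl.where(m == 5, 3.0,
          tl.where(m == 6, 4.0, 6.0)))))))
    # ASSUMED one E4M3 micro-scale per GROUP consecutive values; the
    # per-tensor FP32 scale NVFP4 also carries is omitted for brevity.
    scale = tl.load(scale_ptr + offs // GROUP, mask=mask).to(tl.float32)
    tl.store(out_ptr + offs, (sign * mag * scale).to(tl.float8e4nv), mask=mask)

def nvfp4_to_fp8(packed: torch.Tensor, scales: torch.Tensor, n: int) -> torch.Tensor:
    """packed: uint8, two FP4 codes per byte; scales: float8_e4m3fn, one per 16 values."""
    out = torch.empty(n, dtype=torch.float8_e4m3fn, device=packed.device)
    grid = (triton.cdiv(n, 1024),)
    nvfp4_to_fp8_kernel[grid](packed, scales, out, n, GROUP=16, BLOCK=1024)
    return out
```

The win the post describes comes from paying for the E2M1 decode once in a cheap elementwise pass, then running the matmul on the FP8 tensor cores that GB10 executes natively.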