~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
hardware enthusiast running 120B models at home on DGX Spark
building options trading infrastructure with AI agents
occasionally ships iOS apps
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-05-23 [LLM 101] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-04-17 [LLM 101] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-16 [Ask AI Right] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-14 [Ask AI Right] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques, including adding constraints, asking for comparisons, and letting AI ask YOU questions, to get dramatically better results.
- 2026-04-14 [LLM 101] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-05-21 Round 2 EAGLE-3 retrain didn't break the ceiling — a 60-hour null-result writeup
After Part 30's endpoint correction showed Round 1 didn't actually 2x chat throughput, Round 2 added 30k regenerated Chinese instruction samples and trained for 41 hours. Result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45. The EAGLE-3 small head hits an architectural ceiling against the abliterated body; more data doesn't fix it. Plus we found a scheduler deadlock in the vLLM Gemma 4 preview image (`gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`) under long-running extract_hidden_states use (hit three times, mitigated with a watchdog).
- 2026-05-19 [Claude Code] Rules I'd Skip, Hooks I Can't — I Wrote a Hook That Blocks My Own Blog Commits
I had a rule called 'fact-check before publishing.' I still shipped three fabrications. The problem wasn't the rule — it was where I put it. This is how I promoted it from skill to hook: a small script guarding the moment I press 'send,' so I can't even try without verification.
- 2026-05-16 EAGLE-3 fine-tune against an abliterated Gemma 4 body — Round 1 flattens the acceptance curve (plus a measurement lesson)
RedHatAI's EAGLE-3 drafter fine-tuned to realign with huihui Gemma 4 26B-A4B abliterated FP8 on a single DGX Spark GB10 — 1 epoch / 50k Magpie samples / 11h. Inference bench on raw `/v1/completions`: pos 3 acceptance climbs from vanilla's 20.5% to 72.7%; n=4 throughput goes from ~50 to 100.36 tok/s aggregate. **A later paired bench revealed the throughput comparison used different endpoints for baseline (chat) and retrain (raw) — on production chat workloads the real uplift is far smaller than 2×; see the endpoint correction at the top of the post**. Part 28's mechanism observation (deep speculation acceptance scatters on abliterated distributions) still holds. Includes a Speculators upstream create_empty_sample dtype bug + patch and a Phase 0 catalog of 6 community prior-art repos.
- 2026-05-14 30 lines of docker for +34% on DGX Spark: huihui Gemma 4 FP8 + vanilla MTP n=1 deployment recipe
Part 28 explained why deep speculation breaks on an abliterated body; this post is the recipe for the part that already works. huihui Gemma 4 26B-A4B FP8 + Google's vanilla MTP draft at num_speculative_tokens=1 takes baseline 39.3 tok/s to 52.6 tok/s (+34%) on GB10, no retraining required. ~30 lines of docker plus a bind-mount of PR #41745's gemma4_mtp.py. Includes a 3-step sanity check and a clear list of when n=1 stops being enough.
- 2026-05-09 Want MTP speedup on abliterated Gemma 4? Vanilla draft can't track the modified body
I self-quantized huihui's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. After sweeping num_speculative_tokens 1→4, the abliterated body is exactly as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent — but per-position acceptance decays so steeply that deeper speculation is wasted. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.
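
The throughput figures quoted in the field notes above are easy to sanity-check by hand. A minimal sketch, assuming the tok/s numbers are exactly as quoted in the summaries; the helper name `uplift_pct` is hypothetical and not taken from any of the posts:

```python
def uplift_pct(baseline: float, improved: float) -> float:
    """Relative throughput change in percent (positive means faster)."""
    return (improved / baseline - 1.0) * 100.0


# tok/s figures as quoted in the summaries above
mtp_n1 = uplift_pct(39.3, 52.6)     # baseline vs vanilla MTP n=1 on GB10
round2_en = uplift_pct(53.0, 45.0)  # vanilla MTP n=4 vs Round 2 B drafter, chat EN
round2_zh = uplift_pct(45.0, 29.0)  # vanilla MTP n=4 vs Round 2 B drafter, chat ZH

print(f"MTP n=1 vs baseline:        {mtp_n1:+.1f}%")
print(f"Round 2 vs MTP n=4 (EN):    {round2_en:+.1f}%")
print(f"Round 2 vs MTP n=4 (ZH):    {round2_zh:+.1f}%")
```

The first line reproduces the rounded +34% from the deployment recipe. The other two put a number on "well below vanilla MTP n=4" from the Round 2 writeup.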