~ /home/coolthor

ai-muninn

Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.

❯ whoami

runs all kinds of models at home (LLMs, image gen, video gen), then writes down what he figures out

quantizes models to FP8 / NVFP4 and ships them on Hugging Face — people actually run them

builds options-trading infrastructure with AI agents

had a spec-decode fix merged into vLLM's speculators

occasionally ships iOS apps

❯ cat ~/blog/concepts

Concepts & Methods

For those who want to understand how AI works

2026-05-23
[LLM 101 #7] How to spot AI hallucinations — three red flags before you verify
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
2026-04-17
[LLM 101 #6] Why Run AI on Your Own Computer? It's Not a Cheaper ChatGPT — It's a Different Tool
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
2026-04-16
[Ask AI Right #7] What AI Does Poorly — Four Landmines to Know Before Using ChatGPT or Claude in 2026
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
2026-04-14
[LLM 101 #5] Context Window — How Much Can AI Read at Once?
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
2026-04-14
[Ask AI Right #6] The Art of Follow-Up Questions — What to Do When the First Answer Is Too Shallow
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.

❯ cat ~/blog/field-notes

Field Notes

For those who run models and debug the hard way

2026-06-05
[Benchmark] NVFP4 Weight-Only Quantization Taxes Chinese ~2x Harder Than English (gemma-4-12B)
I benchmarked BF16 vs FP8 vs NVFP4 weight-only on gemma-4-12B across English (MMLU) and Traditional Chinese (TMMLU+) on a DGX Spark. FP8 is near-lossless on both; NVFP4 drops Chinese ~6pp but English only ~3pp.
2026-06-04
[Benchmark] Gemma 4 12B Omni on DGX Spark: Weight-Only NVFP4 Beats W4A4 (and Keeps Multimodal)
I quantized Google's new omni Gemma 4 12B on a DGX Spark GB10. Weight-only NVFP4 hits 24.9 tok/s in 7.7 GB and keeps image/audio/video working — full W4A4 is slower AND breaks multimodal.
2026-06-02
[AI Agent] My Local Agent Flailed at Image Gen — It Was the Harness, Not the Weights
My local 35B agent went haywire generating images until I read its tool-call logs: 0% malformed calls. The model was fine — a broken ComfyUI tool was making it improvise. The fix was a clean ACI skill, not fine-tuning.
2026-06-01
[Benchmark] NVFP4 W4A4 beats FP8 on a DGX Spark MoE: 67 vs 52 tok/s once CUDA graphs fire
On a GB10 DGX Spark, NVFP4 W4A4 went from 23 to 67 tok/s the moment I dropped --enforce-eager, beating FP8 by 29% and saving 16 GB. The catch from Part 32 was real, just dense-only.
2026-06-01
[Benchmark] NVFP4 shrinks a video model 33% on a DGX Spark — with zero speed gain
NVFP4 took a distilled Sulphur 2 (LTX-2.3) video model from 29 to 19.5 GB on a GB10 DGX Spark with no quality loss and — since video is compute-bound — no speed gain (if anything a hair slower).

81 posts total