~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
hardware enthusiast running 120B models at home on DGX Spark
building options trading infrastructure with AI agents
occasionally ships iOS apps
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-04-11 [Ask AI Right] Why AI Feels Useless to You — Answer Machine vs Collaboration Tool
Same AI, same question, different results. The people who find ChatGPT life-changing and the people who think it's useless are doing completely different things — and the difference is a single mindset shift.
- 2026-04-10 [Ask AI Right] You Don't Know What You Need — Let AI Find It
Most people don't struggle with using AI — they struggle with knowing what to use it for. This article teaches you a simple method to let AI identify the repetitive parts of your workday you've stopped noticing.
- 2026-04-10 [LLM 101] So Many Models — Which One Should You Download?
Gemma, Llama, Qwen, Mistral — the model list is overwhelming. This guide uses car-buying logic to help you pick the right AI model based on size, speed, and quality.
- 2026-04-10 [LLM 101] What Is Quantization? Q4, Q8, FP16 Explained
Q4_K_M, Q8_0, FP16 — the same model comes in a dozen versions and the names look like hieroglyphs. This guide explains what quantization actually does, why it doesn't ruin the model, and which level to pick.
- 2026-04-09 [Ask AI Right] You Opened AI — Now What Do You Say?
AI isn't Google — you're not searching, you're having a conversation. This article teaches you what to say when you first open ChatGPT, five things you can try right now, and how to adjust when the answer isn't quite right.
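The quantization levels covered in the [LLM 101] post above come down to simple arithmetic: file size is roughly parameter count times bits per weight. A back-of-envelope sketch in Python — the bits-per-weight figures are rough averages I'm assuming for illustration (K-quants mix precisions), not exact GGUF spec values:

```python
def quantized_size_gb(params_billions, bits_per_weight):
    """Back-of-envelope model file size: parameters x bits per weight.
    Real GGUF files add metadata and keep some layers at higher
    precision, so treat these as estimates, not exact sizes."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Assumed effective bits per weight for common quant levels
# (Q4_K_M and Q8_0 averages are approximations, not spec values)
QUANTS = {"Q4_K_M": 4.8, "Q8_0": 8.5, "FP16": 16.0}

for name, bits in QUANTS.items():
    print(f"7B model at {name}: ~{quantized_size_gb(7, bits):.1f} GB")
```

The same 7B model lands around 4 GB at Q4 versus 14 GB at FP16, which is why the quant level, not the parameter count alone, decides whether a model fits in your RAM.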
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-04-08 [Benchmark] Rescuing Gemma 4 31B on a 32GB MacBook Pro: From 1.5 to 12.8 tok/s
Gemma 4 31B runs at 1.5 tok/s on an MBP M1 Max with Ollama because it spills into swap. The fix: shrink the context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size.
- 2026-04-08 [Benchmark] 4 Machines, 4 Models, 1 Answer: Memory Decides Everything
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 2026-04-07 [Benchmark] Gemma 4 E2B vs E4B: 81 tok/s vs 52 on Three Machines — Bandwidth Is Everything
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 2026-04-07 [Benchmark] From 19 to 50 tok/s: We Quantized Gemma 4 E4B to NVFP4 Before Anyone Else
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6x faster than BF16. The first NVFP4 checkpoint of this model on Hugging Face. Covers the PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 2026-04-05 [vLLM] Gemma 4 26B-A4B NVFP4 on DGX Spark: 52 tok/s with 16 GB of Weights
Deploying the Gemma 4 26B-A4B MoE NVFP4 on GB10 via vLLM 0.19 — 52 tok/s decode, a 16.5 GB model, and 82 GB left over for KV cache. Includes the Phase 0 decision that killed the 31B variant.
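The "bandwidth is everything" thread running through these benchmarks can be sanity-checked with a roofline-style estimate: dense decode streams every active weight from memory once per generated token, so tok/s is capped by bandwidth divided by weight bytes. A sketch — the bandwidth, weight-size, and efficiency numbers below are illustrative assumptions, not measurements from the posts:

```python
def decode_tok_s(bandwidth_gb_s, active_weights_gb, efficiency=0.7):
    """Roofline-style decode estimate: each generated token streams
    the active weights from memory once, so tok/s is bounded by
    bandwidth / weight size. `efficiency` is an assumed fudge factor
    for kernel overhead and KV-cache traffic."""
    return efficiency * bandwidth_gb_s / active_weights_gb

# Illustrative inputs: ~400 GB/s of bandwidth, ~18 GB of 4-bit weights
est = decode_tok_s(400, 18)
print(f"estimated decode speed: ~{est:.0f} tok/s")
```

For MoE models like the 26B-A4B above, only the active experts' weights stream per token, which is the usual explanation for why a large MoE decodes far faster than a dense model of the same total size. The estimate also shows why swap is fatal: once weights spill to disk, the effective "bandwidth" drops by orders of magnitude.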