~ /home/coolthor
ai-muninn
Research notes on AI infrastructure, LLM serving, and autonomous agents. Things that took too long to figure out, written down so you don't have to.
❯ whoami
hardware enthusiast running 120B models at home on DGX Spark
building options trading infrastructure with AI agents
occasionally ships iOS apps
❯ cat ~/blog/concepts
Concepts & Methods
For those who want to understand how AI works
- 2026-04-10 [Ask AI Right] You Don't Know What You Need — Let AI Find It
Most people don't struggle with using AI — they struggle with knowing what to use it for. This article teaches you a simple method to let AI identify the repetitive parts of your workday you've stopped noticing.
- 2026-04-10 [LLM 101] So Many Models — Which One Should You Download?
Gemma, Llama, Qwen, Mistral — the model list is overwhelming. This guide uses car-buying logic to help you pick the right AI model based on size, speed, and quality.
- 2026-04-09 [Ask AI Right] You Opened AI — Now What Do You Say?
AI isn't Google — you're not searching, you're having a conversation. This article teaches you what to say when you first open ChatGPT, five things you can try right now, and how to adjust when the answer isn't quite right.
- 2026-04-09 [Ask AI Right] Which AI Should You Use in 2026?
ChatGPT, Claude, and Gemini — the three AI assistants you can start using right now. A no-jargon guide to what each one does best, how much they cost, and how to get started.
- 2026-04-08 [LLM 101] Dense, MoE, PLE, SSM — Four AI Model Architectures Explained Simply
Dense is everyone working. MoE is expert rotation. PLE is a dictionary on every floor. SSM is a speed reader. A zero-jargon guide to the four main AI model architectures and how to pick between them.
❯ cat ~/blog/field-notes
Field Notes
For those who run models and debug the hard way
- 2026-04-08 [Benchmark] Rescuing Gemma 4 31B on a 32GB MacBook Pro: From 1.5 to 12.8 tok/s
Gemma 4 31B runs at 1.5 tok/s on an M1 Max MBP with Ollama because it spills into swap. The fix: shrink the context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size (see the context-window sketch after this list).
- 2026-04-08 [Benchmark] 4 Machines, 4 Models, 1 Answer: Memory Decides Everything
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 2026-04-07 [Benchmark] Gemma 4 E2B vs E4B: 81 tok/s vs 52 on Three Machines — Bandwidth Is Everything
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 2026-04-07 [Benchmark] From 19 to 50 tok/s: We Quantized Gemma 4 E4B to NVFP4 Before Anyone Else
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6x faster than BF16. The first NVFP4 checkpoint of this model on HuggingFace. Covers the PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 2026-04-05 [vLLM] Gemma 4 26B-A4B NVFP4 on DGX Spark: 52 tok/s with 16 GB of Weights
Deploying Gemma 4 26B-A4B MoE NVFP4 on GB10 via vLLM 0.19 — 52 tok/s decode, 16.5 GB of weights, 82 GB free for KV cache. Includes the Phase 0 decision that killed the 31B variant (see the vLLM loading sketch after this list).
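A quick sketch of the context-window fix from the 31B rescue post above, using Ollama's Python client. The model tag and the 4096 value are placeholders, not the exact settings from the post.

```python
# Minimal sketch, not the post's exact setup: cap Ollama's context window so the
# KV cache fits in RAM instead of spilling to swap. Model tag and num_ctx are placeholders.
import ollama  # pip install ollama; assumes a local Ollama server is running

response = ollama.chat(
    model="gemma4:31b",  # placeholder tag for the 31B build
    messages=[{"role": "user", "content": "Summarize KV cache sizing in one paragraph."}],
    options={"num_ctx": 4096},  # smaller context -> smaller KV cache -> no swap
)
print(response["message"]["content"])
```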
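And a sketch of what the vLLM deployment looks like through the offline LLM API. The HuggingFace repo id and the memory settings are assumptions, not the command the post actually ran on vLLM 0.19.

```python
# Minimal sketch, assuming an NVFP4-quantized checkpoint already published on HuggingFace.
# Repo id and memory numbers are placeholders, not the post's configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/gemma-4-26b-a4b-nvfp4",  # hypothetical HF repo id
    max_model_len=8192,            # assumption: cap context to leave headroom for KV cache
    gpu_memory_utilization=0.90,   # assumption: fraction of GB10 memory vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain what an MoE router does."], params)
print(outputs[0].outputs[0].text)
```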