❯ ls -la ~/blog
70 posts · 10 series
- date · title
- 2026-05-09
I self-quantized huihui-ai's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. Sweeping num_speculative_tokens from 1 to 4 shows the abliterated body is effectively as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent, but per-position acceptance decays so steeply that deeper speculation is wasted. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.
- 2026-05-06
Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline), 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with -it, not base.
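A minimal sketch of what launching Gemma 4 with an MTP drafter might look like through vLLM's offline API, assuming the speculative_config interface in recent vLLM builds. The "method" value and the model ID are assumptions for illustration, not taken from the post; the γ=4 setting maps to num_speculative_tokens.

```python
# Sketch only: speculative_config keys (notably "method") and the model ID
# are assumptions; check the preview image's docs for the exact names.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b-a4b-it-fp8",   # hypothetical HF ID; the drafter pairs with -it
    speculative_config={
        "method": "mtp",                      # multi-token-prediction drafter (assumed value)
        "num_speculative_tokens": 4,          # the gamma=4 setting from the post
    },
)

out = llm.generate(
    ["Summarize speculative decoding in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0),
)
print(out[0].outputs[0].text)
```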
- 2026-05-05
I ran sysprog21/zhtw-mcp across 72 of my Traditional Chinese articles. Three sweeps, 128 cross-strait substitutions across 42 files. The real takeaway wasn't the count — it was discovering my blind spot isn't 'I don't know the right Taiwanese term,' it's 'when a Mainland term shows up, I don't auto-doubt it.'
- 2026-05-04
I ran six Z-Image Turbo quantization configs on DGX Spark GB10 — BF16 baseline, FP8 cast standard, FP8 cast fast, FP8 scaled (Kijai), NVFP4, NVFP4+FP8 encoder. With N=10 runs on an isolated GPU, the NVFP4 transformer hits 5.50s warm versus 7.55s for BF16 (1.37× faster). All three FP8 paths are slower than BF16. Model working set drops from 20.6 GB (BF16) to 11.5 GB (NVFP4+FP8 encoder) — 44% smaller.
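For context, a generic warm-latency harness in the spirit of that benchmark: discard warm-up runs, then average N=10 timed runs with explicit CUDA synchronization. generate_image() is a hypothetical stand-in for whichever pipeline (BF16, FP8, NVFP4, ...) is being measured.

```python
import time
import torch

def time_warm(generate_image, warmup=2, runs=10):
    """Average warm latency of a callable that does GPU work."""
    for _ in range(warmup):            # warm-up: compile/caches/allocator settle
        generate_image()
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_image()
        torch.cuda.synchronize()       # wait for queued GPU work before stopping the clock
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```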
- 2026-05-04
Does Z-Image Turbo quantization break image quality? Two-axis benchmark — LPIPS (perceptual distance vs BF16) + CLIPScore (image-text alignment) — across 6 prompts × 4 configs × 3 seeds = 72 samples. Result: NVFP4 produces images that look different from BF16, but no measured regression in this sample — all 4 configs land within ±0.04 std on CLIPScore, smaller than the noise floor. Production users should re-verify with their own prompt set.
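A sketch of the two-axis measurement for a single image pair, assuming the `lpips` package and torchmetrics' CLIPScore; the backbone choices ("alex", openai/clip-vit-base-patch16) are my defaults, not necessarily the post's.

```python
import torch
import lpips
from torchmetrics.multimodal.clip_score import CLIPScore

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance vs the BF16 reference
clip_fn = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def score_pair(img_bf16, img_quant, prompt):
    # Assumes float NCHW tensors in [0, 1].
    # LPIPS expects inputs scaled to [-1, 1]; CLIPScore expects uint8 [0, 255].
    d = lpips_fn(img_bf16 * 2 - 1, img_quant * 2 - 1).item()
    s = clip_fn((img_quant * 255).to(torch.uint8), [prompt]).item()
    return d, s   # (perceptual distance, image-text alignment)
```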
- 2026-05-01
Ten days ago I called NVFP4 a trap on DGX Spark GB10. Today the same hardware hits 74.75 tok/s on Nemotron 3 Nano W4A16, beating my own FP8 ceiling and the public 67 tok/s forum number. The 4-layer patch stack, the quant variant choice, and the bandwidth math behind it.
- 2026-05-01
Same DGX Spark, different goal: watch a 3-minute Andrej Karpathy talk and output the spoken content + visual scene. 89 seconds wall, 53,842 prompt tokens, factually correct. The use_audio_in_video flag, the upstream-image gotcha, and the long-video knob math.
- 2026-04-30
ai-muninn.com burned through Vercel Hobby's 1M Edge Requests quota this month. It wasn't traffic, wasn't bots, wasn't large images. Next.js defaults /public/* to must-revalidate, which makes every conditional GET (even 304s) count as an edge request. Three lines of next.config.ts to fix. Three rounds of fact-check rewrites to publish.
- 2026-04-28
Quantizing huihui-ai's Qwen3.6-35B-A3B abliterated to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why streaming save_pretrained returns BF16 not FP8. Final result: 51.72 tok/s, 1.68× BF16.
- 2026-04-28
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive on Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
- 2026-04-26
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects lose the most (trust law 信託 −7.7, administrative law 行政法 −7.1), while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 2026-04-25
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 2026-04-22
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequants NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code.
- 2026-04-21
NVFP4 should be faster than FP8 — fewer bits, less bandwidth. On DGX Spark's GB10 (SM121), it's 32% slower. Root cause: missing hardware instruction. Dual-engine proof with vLLM and SGLang.
- 2026-04-20
One scaffold (backticks + edit-tool + budget prompt), three models (Gemma 4 E4B, Gemma 4 26B, Qwen 3.6 35B), zero code changes between runs. Qwen 3.6 hit 48.33% — beating SWE-agent + Claude 3.7 Sonnet. The scaffold is the fixed cost; the model is the variable.
- 2026-04-17
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-17
Gemma 4 26B-A4B FP8 scored 116/300 on SWE-bench Lite, ranking #16 globally. Zero API cost on a DGX Spark. The scaffold — not the model — was the differentiator.
- 2026-04-16
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-15
Two days running mini-swe-agent + vLLM on a GB10. From wrong doc conclusions to Gemma 4 self-submitting a clean patch in 38 steps — what actually unlocked it.
- 2026-04-15
How does Q4_K_M fit a 14B model into 4 bits without ruining it? Not by 'cutting off 75%' — but through three layers: K-quant super-blocks, TurboQuant random rotation, and a 1-bit JL sign sketch. A mechanism walkthrough without the equations.
- 2026-04-14
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
- 2026-04-13
Your first question to AI shouldn't be 'help me do X.' It should be 'is there something that already does X?' This article teaches you how to use AI as a research assistant — finding tools, comparing alternatives, and verifying they're still alive.
- 2026-04-13
Your CLAUDE.md and MEMORY.md grow silently until they eat 10K+ tokens per turn. I built a /slim skill that lets Claude diagnose and fix its own bloat — here's how.
- 2026-04-13
You just started using Claude Code and the context window keeps filling up. Here's where the tokens actually go, what you can do about it, and how to make Claude remember things without re-reading everything.
- 2026-04-13
Everything you need to go from a sealed DGX Spark box to serving your first local LLM. Hardware check, Ollama quickstart, vLLM production setup, model selection, and the 5 gotchas that cost hours.
- 2026-04-13
Gemma 4 E2B / E4B / 26B MoE / 31B Dense benchmarked on DGX Spark, RTX 5090, and MacBook Pro. One table with speed, memory, quantization format. Selection guide included.
- 2026-04-13
A feasibility test: can open-source models run SWE-Bench locally for free? Gemma 4 26B failed on OpenHands (40+ errors) but fixed a test bug in 9 steps on SWE-agent. Same model — the action format was the difference.
- 2026-04-11
Same AI, same question, different results. The people who find ChatGPT life-changing and the people who think it's useless are doing completely different things — and the difference is a single mindset shift.
- 2026-04-10
Most people don't struggle with using AI — they struggle with knowing what to use it for. This article teaches you a simple method to let AI identify the repetitive parts of your workday you've stopped noticing.
- 2026-04-10
Which local AI model should you download? Gemma, Llama, Qwen, Mistral compared by size, speed, and quality. Simple formula: parameters × 0.6 = GB needed. Beginner-friendly guide.
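The rule of thumb from that guide, written out as a tiny helper so the arithmetic is explicit; the 0.6 factor is the post's estimate for a typical ~4-bit quantized download, and the printed figures follow directly from it.

```python
def approx_gb_needed(params_b: float) -> float:
    """Rough memory estimate (GB) for a ~4-bit quantized model: params_in_B * 0.6."""
    return params_b * 0.6

for p in (4, 8, 14, 26, 35):
    print(f"{p}B -> ~{approx_gb_needed(p):.1f} GB")
# e.g. 14B -> ~8.4 GB, 35B -> ~21.0 GB; leave extra headroom for context (KV cache).
```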
- 2026-04-10
Q4_K_M, Q8_0, FP16 — the same model comes in a dozen versions and the names look like hieroglyphs. This guide explains what quantization actually does, why it doesn't ruin the model, and which level to pick.
- 2026-04-09
AI isn't Google — you're not searching, you're having a conversation. This article teaches you what to say when you first open ChatGPT, five things you can try right now, and how to adjust when the answer isn't quite right.
- 2026-04-09
ChatGPT, Claude, and Gemini — the three AI assistants you can start using right now. A no-jargon guide to what each one does best, how much they cost, and how to get started.
- 2026-04-08
Gemma 4 31B runs at 1.5 tok/s on MBP M1 Max with Ollama due to swap. The fix: reduce context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size.
- 2026-04-08
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 2026-04-08
Dense is everyone working. MoE is expert rotation. PLE is a dictionary on every floor. SSM is a speed reader. A zero-jargon guide to the four main AI model architectures and how to pick between them.
- 2026-04-07
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 2026-04-07
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6× faster than BF16. First NVFP4 checkpoint on HuggingFace. PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 2026-04-07
Ollama is a microwave — one command and you're chatting with AI. vLLM is a professional oven — 30% faster, handles multiple users, but takes real setup. A zero-jargon guide to choosing between them.
- 2026-04-05
Gemma 4 26B in NVFP4 hits 52 tok/s on DGX Spark — 16.5 GB used, 82 GB free for KV cache. Why MoE wins over the 31B dense variant on Blackwell GB10.
- 2026-04-05
Gemma 4 31B-IT NVFP4 on GB10 maxes out at 7.0 tok/s — bandwidth-bound at 273 GB/s. The math predicted 4.4 tok/s theoretical; NVFP4 compression buys 60% but can't escape the wall. Choose MoE.
- 2026-04-05
Same Gemma 4 26B-A4B, same GPU, 30% speed gap. vLLM NVFP4 hits 52 tok/s while Ollama Q4_K_M tops out at 40. Root cause: Marlin kernels, CUDA graphs, and an Ollama CPU/GPU split trap.
- 2026-04-02
DGX Spark power and thermal issues blew up after Carmack's criticism. This guide covers three distinct symptoms: 30W PD controller defect (needs RMA), 100W thermal throttling, and 5W driver bug (fixable). One command, 30 seconds to diagnose.
- 2026-03-30
Real benchmark numbers for Google's TurboQuant on a GB10/SM121 (DGX Spark) — actual compression ratios, Qwen2.5-3B accuracy validation, and why Qwen3.5-35B's hybrid attention architecture makes things complicated.
- 2026-03-24
How to point NemoClaw's inference backend to a local Ollama or vLLM endpoint. Config location, model swap, and what OpenShell still enforces when the cloud is gone.
- 2026-03-24
Your ChatGPT Plus subscription already includes GPT-5.4 with 1M context. openclaw's OAuth flow lets you use it for AI agents — zero API credits, one command. Full setup guide.
- 2026-03-23
NemoClaw's official installer fails on DGX Spark out of the box. This guide covers the 4 fixes — Node upgrade, npm link, OpenShell tar.gz, cgroupns — to get your first AI agent running in 30 minutes.
- 2026-03-23
NemoClaw bundles OpenClaw + OpenShell + NVIDIA Agent Toolkit into one installer for DGX Spark. Architecture breakdown, what it does, and whether it's worth your time.
- 2026-03-21
Building a multi-agent orchestrator with `claude -p` subprocess reveals a silent data loss problem. The SDK fix, session resume, parallel execution, and why setting_sources matters.
- 2026-03-21
Adding --kv-cache-dtype fp8 to a vLLM serve script on GB10 causes outputs to degrade into repetition after ~500 tokens. Root cause: missing calibration data, q_scale defaults to 1.0.
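A minimal reproduction sketch of that setup through vLLM's offline API; the model ID is a placeholder. The kv_cache_dtype argument is the Python-side equivalent of the --kv-cache-dtype flag, and without calibrated scales in the checkpoint the scale falls back to 1.0, which is the failure mode the post describes.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-35B",        # placeholder model ID
    kv_cache_dtype="fp8",            # same switch as --kv-cache-dtype fp8 on the CLI
)

out = llm.generate(
    ["Write a long, detailed explanation of transformer attention."],
    SamplingParams(max_tokens=800),
)
print(out[0].outputs[0].text)        # without calibrated scales, repetition tends to set in ~500 tokens
```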
- 2026-03-21
Connecting openclaw to a 131K context model and hitting a 400 error: 'max_tokens must be at least 1, got -1292'. The context budget math, the config key trap, and the fix.
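The budget arithmetic behind that 400, as a sketch: when the client computes max_tokens as context window minus prompt tokens and the prompt overshoots the 131K window, the result goes negative. The prompt count below is back-derived from the error value, not from the post.

```python
context_window = 131_072           # 131K-token model
prompt_tokens = 132_364            # hypothetical oversized prompt (back-derived)

max_tokens = context_window - prompt_tokens
print(max_tokens)                  # -1292 -> "max_tokens must be at least 1, got -1292"
```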
- 2026-03-21
Replacing choppy editMessageText polling with Telegram's sendMessageDraft for live animated output. The patch, the think-block filter, and the optional chaining trap in DM chats.
- 2026-03-19
The bot process is running. The token is valid. Messages are being consumed. Nobody is home. A systematic takedown of every wrong hypothesis — and the hidden causal chain that connects Tailscale routing tables to silent sendMessage failures in Node.js.
- 2026-03-19
How to get gpt-oss-120B running on a DGX Spark (GB10, SM121) with vLLM. The goal: a 120B model serving a local AI agent at zero API cost. The path: six bugs, one silent env var, and a startup log that tells you everything.
- 2026-03-19
After fixing the four SM121 NVFP4 bugs, Qwen3.5-122B boots cleanly and generates correct output. Then you check the speed. 14 tok/s. No flags to fix it. Here's why — and what to wait for.
- 2026-03-18
How to wire a callhelp tool into a local agent loop so it can spawn Codex CLI mid-reasoning. One permission flag you must set, and why Claude's quota stays mine.
- 2026-03-17
CUTLASS FP4 kernels target SM120 (GB200). On SM121 (GB10, DGX Spark) they run silently and produce garbage. Here's the full diagnostic story — 4 bugs, the row-identical failure signature, and the working fix.
- 2026-03-16
Why we stopped having the OpenClaw agent orchestrate multi-step tasks directly, and started spawning Codex subprocesses instead. The pattern that keeps agent context minimal and tasks reliable.
- 2026-03-13
Getting NVIDIA's Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (SM121, 128GB). Four SM121-specific pitfalls, the env-var-that-does-nothing, and a working docker command.
- 2026-03-07
vLLM OOMed on restart despite 128GB unified memory. Cause: Ollama's KEEP_ALIVE=2h was holding 19-51GB in GPU. Diagnosis command, manual unload fix, and why to set KEEP_ALIVE=0 once vLLM is your primary stack.
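One way to do the manual unload from Python, using Ollama's documented /api/generate endpoint: a request with keep_alive set to 0 and no prompt tells Ollama to release that model's GPU memory immediately. The host and model name are placeholders.

```python
import requests

# Unload a resident model right now instead of waiting out KEEP_ALIVE=2h.
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3.5:35b", "keep_alive": 0},   # no prompt -> just load/unload control
    timeout=30,
)
# Longer term: set OLLAMA_KEEP_ALIVE=0 in Ollama's environment so models are
# never pinned in memory while vLLM is the primary stack.
```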
- 2026-03-06
Adding --enable-chunked-prefill to a Qwen3.5-35B (SSM+MoE hybrid) dropped throughput from 47 tok/s to 5.7 tok/s. Why SSM recurrence and chunked prefill are fundamentally incompatible.
- 2026-03-05
Step-by-step guide: Ollama to vLLM on DGX Spark GB10. Qwen3.5-35B hits 47 tok/s with TTFT dropping from 3s to 0.12s. Covers 6 real gotchas including SSM + chunked prefill trap and GPU memory conflicts.
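Once vLLM is serving (for example via `vllm serve <model> --port 8000`), any OpenAI-compatible client can talk to it; this is the standard pattern, with the model name as a placeholder for whatever the server was launched with.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the key is ignored unless --api-key is set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B",        # placeholder: the model `vllm serve` was given
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```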
- 2026-03-05
Full stack local AI agent: Mac Mini M4 as the always-on gateway, GX10 for inference, Telegram as the UI. No subscriptions, no cloud APIs. Six deployment lessons from the trenches.
- 2026-03-01
GLM-4.7-Flash hits 57.8 tok/s on short context but drops to 42 tok/s at 8K. Qwen3.5-35B SSM hybrid: 56 tok/s at short, 56 tok/s at 8K. Why agents with long system prompts should care about this difference.
- 2026-02-26
A custom /debate command that pits Codex CLI against Gemini CLI on architecture, code, and decisions. Different training data, different blind spots — and the disagreements between them are usually the most useful output.
- 2026-02-26
How I replaced screenshot-heavy iOS test runs with ui_describe_all-first testing in Claude Code, cutting context usage by 81% for BPS Tracker. Plus Fastlane integration for App Store automation.
- 2026-02-25
Spent weeks restarting the OpenClaw gateway for every config change. Then discovered the file watcher. What hot-reloads instantly, what still needs a restart, and how to tell auth failures from transient network errors.
- 2026-02-19
A Claude Code config rule marked MANDATORY was skipped twice in one session. Here's the root cause — three architectural reasons why emphasis doesn't work — and three system-level solutions that do.
- 2026-02-19
Benchmarking 8 local LLMs on NVIDIA GB10 (128GB unified memory) across 7 task categories. Quantization surprises, a 120B model that fails at JSON, and thinking models that spend their entire budget thinking.
showing 70 posts