❯ ls -la ~/blog
112 posts · 15 series
- date · title
- 2026-06-28
Once your character LoRA is trained, how do you control it? Why lightning flattens style, when to spend full steps, how to stack a style LoRA, and why the trigger word alone won't hold the look.
- 2026-06-27
I added up what my home agent pays before it reads a single word from me: ~23K tokens of overhead, and 17K of that is just the instruction manuals for its tools. Worse, it runs a hybrid model — on a cache miss it re-processes all 17K from scratch, and a single user turn can do that a dozen-plus times. This is context economics, badly underestimated. The fix isn't cutting tools; it's loading them on demand, the way skills already do.
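The overhead arithmetic in this entry can be sketched roughly as follows. The 23K and 17K figures come from the summary itself; treating "a dozen-plus" as exactly 12 cache misses is an assumption made here for illustration, so the result is a lower bound rather than a measurement.

```python
# Figures quoted in the summary: ~23K tokens of fixed overhead,
# of which 17K are the tool instruction manuals.
TOTAL_OVERHEAD = 23_000
TOOL_MANUALS = 17_000


def reprocessed_tokens(misses_per_turn: int, tool_tokens: int = TOOL_MANUALS) -> int:
    """Tokens re-prefilled in one user turn if every cache miss re-reads the tool manuals."""
    return misses_per_turn * tool_tokens


# Assumption: "a dozen-plus" is taken as 12 misses (a floor, not a measured value).
print(reprocessed_tokens(12))
# Share of the fixed overhead that is tool manuals.
print(f"{TOOL_MANUALS / TOTAL_OVERHEAD:.0%}")
```

Under that assumption a single user turn re-processes roughly 200K tokens of tool descriptions, which is why loading tools on demand matters more than trimming the prompt.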
- 2026-06-26
On a long conversation, every message makes the model re-read the whole thing (re-prefill) before it answers — worst right after a restart or a cache eviction. Stock llama.cpp can save the KV cache to disk (--slot-save-path) but won't do it on its own — the auto-persist feature request is closed as not planned. A tiny stdlib reverse-proxy restores instead of re-prefilling: 9.9s → 1.4s on a 5K chat (7×). Mechanism, proxy design, and why I haven't shipped it yet.
- 2026-06-25
Quantizing the main KV cache to q4 to save memory is fine. So I quantized the MTP draft cache too — it's just a little draft, surely a free win. It wasn't: q4 draft cache ran 29.6 tok/s, the un-quantized f16 ran 39.7, and f16 used less VRAM on top of that. The draft cache is one of the few places where quantizing is a net loss — here's the triple penalty.
- 2026-06-24
Hermes has a built-in Kanban, but on your phone all you get is Telegram's plain text. Muninn now pulls that board onto the phone: Running / Blocked / Done columns — who's working on what, which card got blocked — at a glance. Zero backend, pure P2P.
- 2026-06-24
The model card says n_ctx_train=262144. The GPU has 22GB. The 27B's Q4 weights are only 15.7GB. The math looks obvious: max it to 256K, plenty to spare. -c 262144, launch — loads fine, no error. A few turns of real conversation later: 503, the service restarts itself. No tidy out-of-memory in the log, just a lone 0xc0000409. nvidia-smi: free VRAM down to ~170 MiB. Where did the gigabytes go? This is the hunt: I first blamed context checkpoints, but the llama.cpp source says they live in host RAM — the real VRAM eater is the KV cache; free-VRAM-vs-context is nonlinear, and the one stable sweet spot isn't 256K — it's 128K.
- 2026-06-23
Hermes runs at home, but you lose it the second you walk out. Bridging through Telegram works but it's fiddly and routes every message through someone else's server. Muninn is an iOS app built for Hermes: give your agent one command, scan a QR, and your phone connects straight home over an encrypted iroh tunnel — no cloud in the path.
- 2026-06-23
Picking a local model, I looked at tok/s first too. Gemma 12B does 90-100 and it's great — until you put it on a kanban board, where it finishes the work and just walks away, never marking the card done. A Qwen 27B that's three times slower actually closes the loop. Why throughput is the wrong number for an agent — plus how grep almost lied to me about it.
- 2026-06-22
When an AI assistant loops, wanders, freezes, or answers the wrong question, your first instinct is 'this model is dumb.' But from my own debugging, eight times out of ten it's not the model — it's the ring around it (tools, config, memory). The model is the engine; that ring is the car. A car that won't move usually doesn't have a broken engine — it has a flat tire or a clogged fuel line.
- 2026-06-22
I dug up a 22GB-modded RTX 2080 Ti for ~$340 all-in (¥2079 sticker + shipping) — just enough to keep a resident 27B agent brain running on the same cheap old desktop. What the mod changes, and the gotchas.
- 2026-06-21
ds4 ships directional steering — a runtime activation edit that nudges the model along a chosen direction, and the math is literally abliteration with a continuous, signed scale. I got it running on GB10/CUDA (the tooling looks Metal-only, but the activation dump fires on CUDA too) and pulled a verbosity vector from our abliterated Q2 model. The dial works, but it ignores the textbook: the sweep is non-monotonic and positive scales collapse the output to a four-word fragment. Two cuts from the same scalpel, fighting each other.
- 2026-06-20
Your assistant is installed, but right now it only talks — it's all mouth. This post gives it hands: connect tools so it actually checks your folders, runs your commands, calls services you wrote yourself. The key idea is MCP, the 'universal outlet' standard for tools — plug one in and the assistant can use it. All running on your side, connected to your own stuff.
- 2026-06-19
We used ChatGPT as the assistant's brain. This post does something bolder — swaps that brain from the cloud to a local model running on your own machine (e.g. ds4). The payoff is an autonomous brain: no cloud model provider, your conversations stay on your machine, no usage caps. The honest cost: local brains are usually slower (~10 tok/s on my ds4) and need a capable machine. Swap the brain, keep the body — Hermes doesn't change at all.
- 2026-06-18
Train a Wan 2.2 character LoRA on your own RTX 5090 from a single reference image. Then generate the same person from text — new outfits, scenes, art styles, even video. No cloud, no bill.
- 2026-06-17
Comfortable with one assistant and want a second and third? Hermes gives each one its own home (config, memory, personality), each able to run a different model and handle different tasks. Plain-language: why split them, how, and the three I actually run. Honest: most people only need one — this is for when you want to tinker.
- 2026-06-16
Your AI assistant only reads text? Give it eyes and ears — send a photo it understands, send a voice clip it understands. Not by swapping in a pricier model, but by bolting on a small vision model as a perception side-car. Hermes's built-in auxiliary.vision + faster-whisper, measured end to end.
- 2026-06-16
The last and most satisfying step: set up a task that runs itself. Tell it in plain words, and every day it researches what you care about, sums it up, and messages your Telegram. Set it once, close the laptop, and it pings you the next morning.
- 2026-06-16
Order your assistant around from your phone. Chat with one official Telegram bot, get a key (token), hand it to Hermes — done. No public URL, no webhook, no tunnel, because Hermes fetches messages from Telegram itself.
- 2026-06-16
Install the Hermes Agent desktop app — no terminal. Download it, let it auto-install dependencies, sign in with your ChatGPT account, and your first AI assistant is running in about 15 minutes.
- 2026-06-16
An AI assistant = a brain + a body. Use your ChatGPT account as the brain and Hermes as the body — one fixed combo, nothing to choose. Here's why it's set up this way, and what to have ready before you install.
- 2026-06-16
You don't need to write code to have your own AI assistant. An agent framework already packages the hard parts so you just install and go. Here's why you shouldn't wire it yourself — and why this series uses Hermes.
- 2026-06-16
You mostly use ChatGPT one question at a time. A self-hosted AI assistant (agent) finishes the job with your own tools, runs on your side, and plugs into the apps you use daily. Lesson one of building your own assistant from zero.
- 2026-06-14
A retrieval-augmented support bot for my blog, running on a 2014 GTX 970 and a ~600MB embedding model. llama.cpp embeddings on CPU, numpy brute-force cosine over 3,475 chunks, an embedding-score guardrail, and Cloudflare Tunnel.
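A minimal sketch of brute-force cosine retrieval with a score guardrail, for readers who want the shape of the idea. It uses plain Python lists instead of numpy so it runs anywhere, and the names `retrieve` and `min_score`, the toy vectors, and the threshold value are all hypothetical. The summary does not say how the guardrail is implemented; a minimum-similarity cutoff is assumed here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, top_k=3, min_score=0.5):
    """Score every chunk against the query (no index), keep the best top_k.

    Assumed guardrail: if even the best match scores below min_score,
    return nothing so the bot can decline instead of guessing.
    """
    scored = sorted(
        ((cosine(query_vec, vec), text) for text, vec in chunks),
        reverse=True,
    )
    top = scored[:top_k]
    if not top or top[0][0] < min_score:
        return []
    return top


# Toy 3-dimensional "embeddings"; real ones come from the embedding model.
chunks = [
    ("kv cache post", [1.0, 0.0, 0.0]),
    ("telegram bot post", [0.0, 1.0, 0.0]),
    ("lora post", [0.0, 0.0, 1.0]),
]
print(retrieve([0.9, 0.1, 0.0], chunks, top_k=1))
print(retrieve([1.0, 1.0, 1.0], chunks, min_score=0.9))  # nothing clears the bar
```

At a few thousand chunks a full scan like this stays cheap, which is presumably why the post skips a vector database.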
- 2026-06-14
On a tensor-core-less Maxwell GTX 970 running Gemma 4 E2B, Flash Attention nearly doubles long-context decode (24.3 → 42.5 tok/s) and saves ~430MB VRAM — while q8 KV cache barely saves memory and slows decode. The usual KV-cache advice flips.
- 2026-06-13
DiffusionGemma 26B-A4B runs on vLLM on a 128GB DGX Spark via an official prebuilt image — no PR-waiting, no cherry-picking. NVFP4 hits 158 tok/s single-stream and 257 aggregate. But a single tok/s number lies: diffusion speed is decided by whether the 256-token canvas fills.
- 2026-06-12
DeepSeek-V4-Flash (284B) only fits a 128GB box at asymmetric Q2 (~80GB). Sounds like suicide quantization — but it's surgical: only the layers that barely affect quality get cut. As a daily agent it ran 280 turns with zero degradation. Big enough weights survive 2-bit.
- 2026-06-12
A 284B model at 15 tok/s, wired into a daily agent. Two sets of settings make it comfortable — server-side and agent-framework-side. --no-mmap cuts cold start to 57s, the KV disk cache halves prefill, and one missing context_length will crash the whole session.
- 2026-06-12
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
- 2026-06-11
Qwen3.5-122B-A10B tops out at 17 tok/s on a 128GB DGX Spark — the GDN wall in vLLM won't budge, not even with a merged perf PR. I swapped vLLM for the Atlas engine on the same abliterated NVFP4 weights and the throughput doubled to 33.9 tok/s (36.5 with MTP, ~2×), uncensored behavior intact. The real lever was outside the quant toolbox.
- 2026-06-09
A 2014 GTX 970 running Gemma 4 E2B (vision + audio) plus Piper TTS — a full offline voice assistant that sees, listens, talks back, and writes code. ~2.8s end-to-end, ~$15 of hardware.
- 2026-06-09
Four Gemma 4 E2B quants on a 2014 GTX 970. The bigger 3.2GB QAT Q4_0 beats the 2.9GB Q2_K — 47.6 vs 32.8 tok/s — because a tensor-core-less Maxwell card is dequant-bound, not bandwidth-bound.
- 2026-06-05
I benchmarked BF16 vs FP8 vs NVFP4 weight-only on gemma-4-12B across English (MMLU) and Traditional Chinese (TMMLU+) on a DGX Spark. FP8 is near-lossless on both; NVFP4 drops Chinese ~6pp but English only ~3pp.
- 2026-06-04
I quantized Google's new omni Gemma 4 12B on a DGX Spark GB10. Weight-only NVFP4 hits 24.9 tok/s in 7.7 GB and keeps image/audio/video working — full W4A4 is slower AND breaks multimodal.
- 2026-06-02
My local 35B agent went haywire generating images until I read its tool-call logs: 0% malformed calls. The model was fine — a broken ComfyUI tool was making it improvise. The fix was a clean ACI skill, not fine-tuning.
- 2026-06-01
NVFP4 took a distilled Sulphur 2 (LTX-2.3) video model from 29 to 19.5 GB on a GB10 DGX Spark with no quality loss and — since video is compute-bound — no speed gain (if anything a hair slower).
- 2026-06-01
On a GB10 DGX Spark, NVFP4 W4A4 went from 23 to 67 tok/s the moment I dropped --enforce-eager — beating FP8 by 29% and saving 16GB. The catch from Part 32 was real, just dense-only.
- 2026-05-30
On a GB10 DGX Spark, NVFP4 beats FP8 by ~1.5× for single-stream decode on a dense model. But the win is bandwidth (smaller weights), not the FP4 tensor cores — the fastest path never touches them.
- 2026-05-23
AI delivers wrong answers in the same confident tone as right ones. Three red flags to catch it early — impossible numbers, suspiciously specific details, answers that shift on a re-ask — plus a case where ChatGPT gave me a +205% P&L that can't exist.
- 2026-05-21
After Part 30's endpoint correction showed Round 1 didn't actually 2x chat throughput, Round 2 added 30k regenerated Chinese instruction samples and trained for 41 hours. Result: Round 2 B drafter delivers chat EN 45 tok/s / ZH 29 tok/s — essentially the same as v1 (EN 46 / ZH 27), and well below vanilla MTP n=4's EN 53 / ZH 45. The EAGLE-3 small head hits an architectural ceiling against the abliterated body; more data doesn't fix it. Plus we found a scheduler deadlock in the vLLM Gemma 4 preview image (`gemma4-0505-arm64-cu130`, internal build `0.20.2rc1.dev49+g9b4e83934`) under long-running extract_hidden_states use (hit three times, mitigated with a watchdog).
- 2026-05-19
I had a rule called 'fact-check before publishing.' I still shipped three fabrications. The problem wasn't the rule — it was where I put it. This is how I promoted it from skill to hook: a small script guarding the moment I press 'send,' so I can't even try without verification.
- 2026-05-16
RedHatAI's EAGLE-3 drafter fine-tuned to realign with huihui Gemma 4 26B-A4B abliterated FP8 on a single DGX Spark GB10 — 1 epoch / 50k Magpie samples / 11h. Inference bench on raw `/v1/completions`: pos 3 acceptance climbs from vanilla's 20.5% to 72.7%; n=4 throughput goes from ~50 to 100.36 tok/s aggregate. **A later paired bench revealed the throughput comparison used different endpoints for baseline (chat) and retrain (raw) — on production chat workloads the real uplift is far smaller than 2×; see the endpoint correction at the top of the post**. Part 28's mechanism observation (deep speculation acceptance scatters on abliterated distributions) still holds. Includes a Speculators upstream create_empty_sample dtype bug + patch and a Phase 0 catalog of 6 community prior-art repos.
- 2026-05-14
Part 28 explained why deep speculation breaks on an abliterated body; this post is the recipe for the part that already works. huihui Gemma 4 26B-A4B FP8 + Google's vanilla MTP draft at num_speculative_tokens=1 takes baseline 39.3 tok/s to 52.6 tok/s (+34%) on GB10, no retraining required. ~30 lines of docker plus a bind-mount of PR #41745's gemma4_mtp.py. Includes a 3-step sanity check and a clear list of when n=1 stops being enough.
- 2026-05-09
I self-quantized huihui's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. After sweeping num_speculative_tokens 1→4, the abliterated body is exactly as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent — but per-position acceptance decays so steeply that deeper speculation is wasted. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.
- 2026-05-06
Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline), 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with -it, not base.
- 2026-05-05
I ran sysprog21/zhtw-mcp across 72 of my Traditional Chinese articles. Three sweeps, 128 cross-strait substitutions across 42 files. The real takeaway wasn't the count — it was discovering my blind spot isn't 'I don't know the right Taiwanese term,' it's 'when a Mainland term shows up I don't auto-doubt it.'
- 2026-05-04
Does Z-Image Turbo quantization break image quality? Two-axis benchmark — LPIPS (perceptual distance vs BF16) + CLIPScore (image-text alignment) — across 6 prompts × 4 configs × 3 seeds = 72 samples. Result: NVFP4 produces images that look different from BF16, but no measured regression in this sample — all 4 configs land within ±0.04 std on CLIPScore, smaller than the noise floor. Production users should re-verify with their own prompt set.
- 2026-05-04
I ran six Z-Image Turbo quantization configs on DGX Spark GB10 — BF16 baseline, FP8 cast standard, FP8 cast fast, FP8 scaled (Kijai), NVFP4, NVFP4+FP8 encoder. With N=10 isolated GPU, NVFP4 transformer hits 5.50s warm versus BF16 7.55s (1.37× faster). All three FP8 paths are slower than BF16. Model working set drops from 20.6 GB (BF16) to 11.5 GB (NVFP4+FP8 encoder) — 44% smaller.
- 2026-05-01
Same DGX Spark, different goal: watch a 3-minute Andrej Karpathy talk and output the spoken content + visual scene. 89 seconds wall, 53,842 prompt tokens, factually correct. The use_audio_in_video flag, the upstream-image gotcha, and the long-video knob math.
- 2026-05-01
Ten days ago I called NVFP4 a trap on DGX Spark GB10. Today the same hardware hits 74.75 tok/s on Nemotron 3 Nano W4A16, beating my own FP8 ceiling and the public 67 tok/s forum number. The 4-layer patch stack, the quant variant choice, and the bandwidth math behind it.
- 2026-04-30
ai-muninn.com burned through Vercel Hobby's 1M Edge Requests quota this month. It wasn't traffic, wasn't bots, wasn't large images. Next.js defaults /public/* to must-revalidate, which makes every conditional GET (even 304s) count as an edge request. Three lines of next.config.ts to fix. Three rounds of fact-check rewrites to publish.
- 2026-04-28
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive on Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
- 2026-04-28
Quantizing huihui-ai's Qwen3.6-35B-A3B abliterated to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why streaming save_pretrained returns BF16 not FP8. Final result: 51.72 tok/s, 1.68× BF16.
- 2026-04-26
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects (信託 −7.7, 行政法 −7.1) lose the most, while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 2026-04-25
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 2026-04-22
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequants NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code.
- 2026-04-21
NVFP4 should be faster than FP8 — fewer bits, less bandwidth. On DGX Spark's GB10 (SM121), it's 32% slower. Root cause: missing hardware instruction. Dual-engine proof with vLLM and SGLang.
- 2026-04-20
One scaffold (backticks + edit-tool + budget prompt), three models (Gemma 4 E4B, Gemma 4 26B, Qwen 3.6 35B), zero code changes between runs. Qwen 3.6 hit 48.33% — beating SWE-agent + Claude 3.7 Sonnet. The scaffold is the fixed cost; the model is the variable.
- 2026-04-17
Gemma 4 26B-A4B FP8 scored 116/300 on SWE-bench Lite, ranking #16 globally. Zero API cost on a DGX Spark. The scaffold — not the model — was the differentiator.
- 2026-04-17
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-16
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-15
Two days running mini-swe-agent + vLLM on a GB10. From wrong doc conclusions to Gemma 4 self-submitting a clean patch in 38 steps — what actually unlocked it.
- 2026-04-15
How does Q4_K_M squeeze a 14B model down to roughly 4 bits per weight without ruining it? Not by 'cutting off 75%' — but through three layers: K-quant super-blocks, TurboQuant random rotation, and a 1-bit JL sign sketch. A mechanism walkthrough without the equations.
- 2026-04-14
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
- 2026-04-13
A feasibility test: can open-source models run SWE-Bench locally for free? Gemma 4 26B failed on OpenHands (40+ errors) but fixed a test bug in 9 steps on SWE-agent. Same model — the action format was the difference.
- 2026-04-13
Gemma 4 E2B / E4B / 26B MoE / 31B Dense benchmarked on DGX Spark, RTX 5090, and MacBook Pro. One table with speed, memory, quantization format. Selection guide included.
- 2026-04-13
You just started using Claude Code and the context window keeps filling up. Here's where the tokens actually go, what you can do about it, and how to make Claude remember things without re-reading everything.
- 2026-04-13
Your first question to AI shouldn't be 'help me do X.' It should be 'is there something that already does X?' This article teaches you how to use AI as a research assistant — finding tools, comparing alternatives, and verifying they're still alive.
- 2026-04-13
Your CLAUDE.md and MEMORY.md grow silently until they eat 10K+ tokens per turn. I built a /slim skill that lets Claude diagnose and fix its own bloat — here's how.
- 2026-04-13
Everything you need to go from a sealed DGX Spark box to serving your first local LLM. Hardware check, Ollama quickstart, vLLM production setup, model selection, and the 5 gotchas that cost hours.
- 2026-04-11
Same AI, same question, different results. The people who find ChatGPT life-changing and the people who think it's useless are doing completely different things — and the difference is a single mindset shift.
- 2026-04-10
Q4_K_M, Q8_0, FP16 — the same model comes in a dozen versions and the names look like hieroglyphs. This guide explains what quantization actually does, why it doesn't ruin the model, and which level to pick.
- 2026-04-10
Most people don't struggle with using AI — they struggle with knowing what to use it for. This article teaches you a simple method to let AI identify the repetitive parts of your workday you've stopped noticing.
- 2026-04-10
Which local AI model should you download? Gemma, Llama, Qwen, Mistral compared by size, speed, and quality. Simple formula: parameters (in billions) × 0.6 ≈ GB needed. Beginner-friendly guide.
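The sizing rule can be written as a one-line helper. This is only a sketch of the rule of thumb as stated; reading "parameters" as billions of parameters is an assumption, and the function name `estimate_gb` is made up for illustration.

```python
def estimate_gb(params_billions: float, gb_per_billion: float = 0.6) -> float:
    """Rule-of-thumb memory estimate: parameters (in billions) x 0.6 = GB."""
    return params_billions * gb_per_billion


for size in (7, 14, 27):
    print(f"{size}B -> ~{estimate_gb(size):.1f} GB")
```

As a sanity check against another entry in this index, the rule gives about 16.2 GB for a 27B model, close to the 15.7GB quoted for the 27B's Q4 weights.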
- 2026-04-09
AI isn't Google — you're not searching, you're having a conversation. This article teaches you what to say when you first open ChatGPT, five things you can try right now, and how to adjust when the answer isn't quite right.
- 2026-04-09
ChatGPT, Claude, and Gemini — the three AI assistants you can start using right now. A no-jargon guide to what each one does best, how much they cost, and how to get started.
- 2026-04-08
Gemma 4 31B runs at 1.5 tok/s on MBP M1 Max with Ollama due to swap. The fix: reduce context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size.
- 2026-04-08
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 2026-04-08
Dense is everyone working. MoE is expert rotation. PLE is a dictionary on every floor. SSM is a speed reader. A zero-jargon guide to the four main AI model architectures and how to pick between them.
- 2026-04-07
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 2026-04-07
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6× faster than BF16. First NVFP4 checkpoint on HuggingFace. PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 2026-04-07
Ollama is a microwave — one command and you're chatting with AI. vLLM is a professional oven — 30% faster, handles multiple users, but takes real setup. A zero-jargon guide to choosing between them.
- 2026-04-05
Gemma 4 31B-IT NVFP4 on GB10 maxes out at 7.0 tok/s — bandwidth-bound at 273 GB/s. The math predicted 4.4 tok/s theoretical; NVFP4 compression buys 60% but can't escape the wall. Choose MoE.
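The bandwidth wall can be reproduced with a back-of-envelope calculation. The sketch below assumes a dense model must read all of its weights once per generated token, and that the 4.4 tok/s figure corresponds to BF16 weights at 2 bytes per parameter. The summary does not state that precision; it is inferred here because it reproduces the quoted number.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, params_billions: float,
                         bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Assumes every token requires streaming all weights from memory once,
    so tok/s <= memory bandwidth / weight size.
    """
    weight_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / weight_gb


# 273 GB/s and 31B come from the summary; 2 bytes/param (BF16) is an assumption.
bf16_ceiling = decode_ceiling_tok_s(273, 31, 2.0)
print(f"BF16 ceiling: {bf16_ceiling:.1f} tok/s")

# Measured NVFP4 speed from the summary, relative to that ceiling.
measured_nvfp4 = 7.0
print(f"NVFP4 gain over BF16 ceiling: {measured_nvfp4 / bf16_ceiling - 1:.0%}")
```

The same function explains the advice to choose MoE: with fewer active parameters read per token, the denominator shrinks while the bandwidth stays fixed.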
- 2026-04-05
Same Gemma 4 26B-A4B, same GPU, 30% speed gap. vLLM NVFP4 hits 52 tok/s while Ollama Q4_K_M tops at 40. Root cause: Marlin kernels, CUDA graphs, and an Ollama CPU/GPU split trap.
- 2026-04-05
Gemma 4 26B-A4B + NVFP4 hits 52 tok/s on DGX Spark (GB10) — 7.5× faster than the 31B dense, in 16.5 GB with 82 GB free for KV cache. Plus the vLLM 0.19 marlin/patch gotchas that make it work.
- 2026-04-02
DGX Spark power and thermal issues blew up after Carmack's criticism. This guide covers three distinct symptoms: 30W PD controller defect (needs RMA), 100W thermal throttling, and 5W driver bug (fixable). One command, 30 seconds to diagnose.
- 2026-03-30
Real benchmark numbers for Google's TurboQuant on a GB10/SM121 (DGX Spark) — actual compression ratios, Qwen2.5-3B accuracy validation, and why Qwen3.5-35B's hybrid attention architecture makes things complicated.
- 2026-03-24
Your ChatGPT Plus subscription already includes GPT-5.4 with 1M context. openclaw's OAuth flow lets you use it for AI agents — zero API credits, one command. Full setup guide.
- 2026-03-24
How to point NemoClaw's inference backend to a local Ollama or vLLM endpoint. Config location, model swap, and what OpenShell still enforces when the cloud is gone.
- 2026-03-23
NemoClaw's official installer fails on DGX Spark out of the box. This guide covers the 4 fixes — Node upgrade, npm link, OpenShell tar.gz, cgroupns — to get your first AI agent running in 30 minutes.
- 2026-03-23
NemoClaw bundles OpenClaw + OpenShell + NVIDIA Agent Toolkit into one installer for DGX Spark. Architecture breakdown, what it does, and whether it's worth your time.
- 2026-03-21
Replacing choppy editMessageText polling with Telegram's sendMessageDraft for live animated output. The patch, the think-block filter, and the optional chaining trap in DM chats.
- 2026-03-21
Connecting openclaw to a 131K context model and hitting `400 max_tokens must be at least 1, got -1292`. The context budget math, the config key trap, and the fix.
- 2026-03-21
Adding --kv-cache-dtype fp8 to a vLLM serve script on GB10 causes outputs to degrade into repetition after ~500 tokens. Root cause: missing calibration data, q_scale defaults to 1.0.
- 2026-03-21
Building a multi-agent orchestrator with `claude -p` subprocess reveals a silent data loss problem. The SDK fix, session resume, parallel execution, and why setting_sources matters.
- 2026-03-19
The bot process is running. The token is valid. Messages are being consumed. Nobody is home. A systematic takedown of every wrong hypothesis — and the hidden causal chain that connects Tailscale routing tables to silent sendMessage failures in Node.js.
- 2026-03-19
How to get gpt-oss-120B running on a DGX Spark (GB10, SM121) with vLLM. The goal: a 120B model serving a local AI agent at zero API cost. The path: six bugs, one silent env var, and a startup log that tells you everything.
- 2026-03-19
After fixing the four SM121 NVFP4 bugs, Qwen3.5-122B boots cleanly and generates correct output. Then you check the speed. 14 tok/s. No flags to fix it. Here's why — and what to wait for.
- 2026-03-18
How to wire a callhelp tool into a local agent loop so it can spawn Codex CLI mid-reasoning. One permission flag you must set, and why Claude's quota stays mine.
- 2026-03-17
CUTLASS FP4 kernels target SM120 (GB200). On SM121 (GB10, DGX Spark) they run silently and produce garbage. Here's the full diagnostic story — 4 bugs, the row-identical failure signature, and the working fix.
- 2026-03-16
Why we stopped having the OpenClaw agent orchestrate multi-step tasks directly, and started spawning Codex subprocesses instead. The pattern that keeps agent context minimal and tasks reliable.
- 2026-03-13
Getting NVIDIA's Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (SM121, 128GB). Four SM121-specific pitfalls, the env-var-that-does-nothing, and a working docker command.
- 2026-03-07
vLLM OOMed on restart despite 128GB unified memory. Cause: Ollama's KEEP_ALIVE=2h was holding 19-51GB of GPU memory. Diagnosis command, manual unload fix, and why to set KEEP_ALIVE=0 once vLLM is your primary stack.
- 2026-03-06
Adding --enable-chunked-prefill to a Qwen3.5-35B (SSM+MoE hybrid) dropped throughput from 47 tok/s to 5.7 tok/s. Why SSM recurrence and chunked prefill are fundamentally incompatible.
- 2026-03-05
Step-by-step guide: Ollama to vLLM on DGX Spark GB10. Qwen3.5-35B hits 47 tok/s with TTFT dropping from 3s to 0.12s. Covers 6 real gotchas including SSM + chunked prefill trap and GPU memory conflicts.
- 2026-03-05
Full stack local AI agent: Mac Mini M4 as the always-on gateway, GX10 for inference, Telegram as the UI. No subscriptions, no cloud APIs. Six deployment lessons from the trenches.
- 2026-03-01
GLM-4.7-Flash hits 57.8 tok/s on short context but drops to 42 tok/s at 8K. Qwen3.5-35B SSM hybrid: 56 tok/s at short, 56 tok/s at 8K. Why agents with long system prompts should care about this difference.
- 2026-02-26
A custom /debate command that pits Codex CLI against Gemini CLI on architecture, code, and decisions. Different training data, different blind spots — and the disagreements between them are usually the most useful output.
- 2026-02-26
How I replaced screenshot-heavy iOS test runs with ui_describe_all-first testing in Claude Code, cutting context usage by 81% for BPS Tracker. Plus Fastlane integration for App Store automation.
- 2026-02-25
Spent weeks restarting the OpenClaw gateway for every config change. Then discovered the file watcher. What hot-reloads instantly, what still needs a restart, and how to tell auth failures from transient network errors.
- 2026-02-19
A Claude Code config rule marked MANDATORY was skipped twice in one session. Here's the root cause — three architectural reasons why emphasis doesn't work — and three system-level solutions that do.
- 2026-02-19
Benchmarking 8 local LLMs on NVIDIA GB10 (128GB unified memory) across 7 task categories. Quantization surprises, a 120B model that fails at JSON, and thinking models that spend their entire budget thinking.