❯ ls -la ~/blog
70 posts · 10 series
- date · title
- 2026-05-09
I self-quantized huihui-ai's abliterated Gemma 4 26B-A4B to FP8-Dynamic and shipped it to HF. Sweeping num_speculative_tokens from 1 to 4 shows the abliterated body is effectively as fast as vanilla on the same stack (39.4 vs 39.3 tok/s baseline) and the MTP boost at n=1 is equivalent, but per-position acceptance decays so steeply that deeper speculation is wasted. Three drafts of this article each smuggled in a different fabrication that Codex caught; this is the corrected version.
- 2026-05-06
Google announced Multi-Token Prediction drafters for Gemma 4 on 2026-05-05. The vLLM PR was opened and approved the same day; a preview Docker image shipped hours later. I tested it on DGX Spark: Gemma 4 26B-A4B-it FP8 + MTP γ=4 hits 108.78 tok/s single-stream (2.66× baseline), 674.28 tok/s aggregate at concurrency=8. One undocumented trap: the drafter pairs with -it, not base.
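A minimal sketch of what launching Gemma 4 with an MTP drafter might look like through vLLM's offline API, assuming the speculative_config interface in recent vLLM builds. The "method" value and the model ID are assumptions for illustration, not taken from the post; the γ=4 setting maps to num_speculative_tokens.

```python
# Sketch only: speculative_config keys (notably "method") and the model ID
# are assumptions; check the preview image's docs for the exact names.
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-26b-a4b-it-fp8",   # hypothetical HF ID; the drafter pairs with -it
    speculative_config={
        "method": "mtp",                      # multi-token-prediction drafter (assumed value)
        "num_speculative_tokens": 4,          # the gamma=4 setting from the post
    },
)

out = llm.generate(
    ["Summarize speculative decoding in one paragraph."],
    SamplingParams(max_tokens=256, temperature=0),
)
print(out[0].outputs[0].text)
```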
- 2026-05-05
I ran sysprog21/zhtw-mcp across 72 of my Traditional Chinese articles. Three sweeps, 128 cross-strait substitutions across 42 files. The real takeaway wasn't the count — it was discovering my blind spot isn't 'I don't know the right Taiwanese term,' it's 'when a Mainland term shows up, I don't auto-doubt it.'
- 2026-05-04
I ran six Z-Image Turbo quantization configs on DGX Spark GB10 — BF16 baseline, FP8 cast standard, FP8 cast fast, FP8 scaled (Kijai), NVFP4, NVFP4+FP8 encoder. With N=10 runs on an isolated GPU, the NVFP4 transformer hits 5.50s warm versus 7.55s for BF16 (1.37× faster). All three FP8 paths are slower than BF16. Model working set drops from 20.6 GB (BF16) to 11.5 GB (NVFP4+FP8 encoder) — 44% smaller.
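For context, a generic warm-latency harness in the spirit of that benchmark: discard warm-up runs, then average N=10 timed runs with explicit CUDA synchronization. generate_image() is a hypothetical stand-in for whichever pipeline (BF16, FP8, NVFP4, ...) is being measured.

```python
import time
import torch

def time_warm(generate_image, warmup=2, runs=10):
    """Average warm latency of a callable that does GPU work."""
    for _ in range(warmup):            # warm-up: compile/caches/allocator settle
        generate_image()
    torch.cuda.synchronize()
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_image()
        torch.cuda.synchronize()       # wait for queued GPU work before stopping the clock
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)
```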
- 2026-05-04
Does Z-Image Turbo quantization break image quality? Two-axis benchmark — LPIPS (perceptual distance vs BF16) + CLIPScore (image-text alignment) — across 6 prompts × 4 configs × 3 seeds = 72 samples. Result: NVFP4 produces images that look different from BF16, but no measured regression in this sample — all 4 configs land within ±0.04 std on CLIPScore, smaller than the noise floor. Production users should re-verify with their own prompt set.
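A sketch of the two-axis measurement for a single image pair, assuming the `lpips` package and torchmetrics' CLIPScore; the backbone choices ("alex", openai/clip-vit-base-patch16) are my defaults, not necessarily the post's.

```python
import torch
import lpips
from torchmetrics.multimodal.clip_score import CLIPScore

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance vs the BF16 reference
clip_fn = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def score_pair(img_bf16, img_quant, prompt):
    # Assumes float NCHW tensors in [0, 1].
    # LPIPS expects inputs scaled to [-1, 1]; CLIPScore expects uint8 [0, 255].
    d = lpips_fn(img_bf16 * 2 - 1, img_quant * 2 - 1).item()
    s = clip_fn((img_quant * 255).to(torch.uint8), [prompt]).item()
    return d, s   # (perceptual distance, image-text alignment)
```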
- 2026-05-01
Ten days ago I called NVFP4 a trap on DGX Spark GB10. Today the same hardware hits 74.75 tok/s on Nemotron 3 Nano W4A16, beating my own FP8 ceiling and the public 67 tok/s forum number. The 4-layer patch stack, the quant variant choice, and the bandwidth math behind it.
- 2026-05-01
Same DGX Spark, different goal: watch a 3-minute Andrej Karpathy talk and output the spoken content + visual scene. 89 seconds wall, 53,842 prompt tokens, factually correct. The use_audio_in_video flag, the upstream-image gotcha, and the long-video knob math.
- 2026-04-30
ai-muninn.com burned through Vercel Hobby's 1M Edge Requests quota this month. It wasn't traffic, wasn't bots, wasn't large images. Next.js defaults /public/* to must-revalidate, which makes every conditional GET (even 304s) count as an edge request. Three lines of next.config.ts to fix. Three rounds of fact-check rewrites to publish.
- 2026-04-28
Quantizing huihui-ai's Qwen3.6-35B-A3B abliterated to FP8 for vLLM on a 128 GB UMA box. Seven attempts, two distinct OOM modes, a model class that silently breaks vLLM's loader, and why streaming save_pretrained returns BF16 not FP8. Final result: 51.72 tok/s, 1.68× BF16.
- 2026-04-28
Qwen 3.6 35B-A3B FP8 hits 48.33% (145/300) on SWE-bench Lite with the same scaffold that gets Gemma 4 26B to 38.67%. The 9.66-point gap deserves an explanation. This is a deep dive on Qwen 3.6's 155 failures: 76% are wrong-logic patches, 14% are incomplete fixes, 10% never submit. The categorization is asymmetric — Gemma 4's failures haven't been classified the same way yet — so the cross-model comparison is part hypothesis, part data.
- 2026-04-26
Ran huihui-ai's abliterated Qwen 3.6 35B through the same TMMLU+ harness as Part 21. Aggregate dropped 75.07% → 73.22%. The cost isn't uniform: regulatory subjects lose the most (trust law 信託 −7.7, administrative law 行政法 −7.1), while pure logic and math actually improve. Hokkien also got worse — abliteration doesn't fix data scarcity.
- 2026-04-25
Two MoE models on the same DGX Spark, same harness, same 22,690 questions. Qwen 3.6 35B-A3B scored 75.07%, Gemma 4 26B-A4B scored 46.30%. Qwen won every single one of the 51 subjects — including Taiwan-specific topics where I expected Gemma to win.
- 2026-04-22
Part 19 proved NVFP4 is a trap on DGX Spark. This time we fight back: a Triton kernel that dequants NVFP4 to FP8 and feeds the FP8 tensor cores. 40.8 → 47.6 tok/s, with full code.
- 2026-04-21
NVFP4 should be faster than FP8 — fewer bits, less bandwidth. On DGX Spark's GB10 (SM121), it's 32% slower. Root cause: missing hardware instruction. Dual-engine proof with vLLM and SGLang.
- 2026-04-20
One scaffold (backticks + edit-tool + budget prompt), three models (Gemma 4 E4B, Gemma 4 26B, Qwen 3.6 35B), zero code changes between runs. Qwen 3.6 hit 48.33% — beating SWE-agent + Claude 3.7 Sonnet. The scaffold is the fixed cost; the model is the variable.
- 2026-04-17
Local AI isn't a budget ChatGPT. It's a knowledge extractor, private code assistant, and offline tool. Monthly power cost ~$1.20 vs ChatGPT Plus $20. This guide has a decision table for when to use which.
- 2026-04-17
Gemma 4 26B-A4B FP8 scored 116/300 on SWE-bench Lite, ranking #16 globally. Zero API cost on a DGX Spark. The scaffold — not the model — was the differentiator.
- 2026-04-16
AI is strong, but four things still trip it up in 2026: hallucinations, stale knowledge, short memory, and privacy defaults. Even Anthropic's own lawyers got caught by the first one.
- 2026-04-15
Two days running mini-swe-agent + vLLM on a GB10. From wrong doc conclusions to Gemma 4 self-submitting a clean patch in 38 steps — what actually unlocked it.
- 2026-04-15
How does Q4_K_M fit a 14B model into 4 bits without ruining it? Not by 'cutting off 75%' — but through three layers: K-quant super-blocks, TurboQuant random rotation, and a 1-bit JL sign sketch. A mechanism walkthrough without the equations.
- 2026-04-14
The first answer AI gives you is a rough draft, not the final answer. Learn 5 follow-up techniques — adding constraints, asking for comparisons, and letting AI ask YOU questions — to get dramatically better results.
- 2026-04-14
AI forgets what you said 20 messages ago. It's not broken — its desk is full. This guide explains context windows, why conversations go stale, and how to work around the limit.
- 2026-04-13
Your first question to AI shouldn't be 'help me do X.' It should be 'is there something that already does X?' This article teaches you how to use AI as a research assistant — finding tools, comparing alternatives, and verifying they're still alive.
- 2026-04-13
Your CLAUDE.md and MEMORY.md grow silently until they eat 10K+ tokens per turn. I built a /slim skill that lets Claude diagnose and fix its own bloat — here's how.
- 2026-04-13
You just started using Claude Code and the context window keeps filling up. Here's where the tokens actually go, what you can do about it, and how to make Claude remember things without re-reading everything.
- 2026-04-13
Everything you need to go from a sealed DGX Spark box to serving your first local LLM. Hardware check, Ollama quickstart, vLLM production setup, model selection, and the 5 gotchas that cost hours.
- 2026-04-13
Gemma 4 E2B / E4B / 26B MoE / 31B Dense benchmarked on DGX Spark, RTX 5090, and MacBook Pro. One table with speed, memory, quantization format. Selection guide included.
- 2026-04-13
A feasibility test: can open-source models run SWE-Bench locally for free? Gemma 4 26B failed on OpenHands (40+ errors) but fixed a test bug in 9 steps on SWE-agent. Same model — the action format was the difference.
- 2026-04-11
Same AI, same question, different results. The people who find ChatGPT life-changing and the people who think it's useless are doing completely different things — and the difference is a single mindset shift.
- 2026-04-10
Most people don't struggle with using AI — they struggle with knowing what to use it for. This article teaches you a simple method to let AI identify the repetitive parts of your workday you've stopped noticing.
- 2026-04-10
Which local AI model should you download? Gemma, Llama, Qwen, Mistral compared by size, speed, and quality. Simple formula: parameters × 0.6 = GB needed. Beginner-friendly guide.
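The rule of thumb from that guide, written out as a tiny helper so the arithmetic is explicit; the 0.6 factor is the post's estimate for a typical ~4-bit quantized download, and the printed figures follow directly from it.

```python
def approx_gb_needed(params_b: float) -> float:
    """Rough memory estimate (GB) for a ~4-bit quantized model: params_in_B * 0.6."""
    return params_b * 0.6

for p in (4, 8, 14, 26, 35):
    print(f"{p}B -> ~{approx_gb_needed(p):.1f} GB")
# e.g. 14B -> ~8.4 GB, 35B -> ~21.0 GB; leave extra headroom for context (KV cache).
```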
- 2026-04-10
Q4_K_M, Q8_0, FP16 — the same model comes in a dozen versions and the names look like hieroglyphs. This guide explains what quantization actually does, why it doesn't ruin the model, and which level to pick.
- 2026-04-09
AI isn't Google — you're not searching, you're having a conversation. This article teaches you what to say when you first open ChatGPT, five things you can try right now, and how to adjust when the answer isn't quite right.
- 2026-04-09
ChatGPT, Claude, and Gemini — the three AI assistants you can start using right now. A no-jargon guide to what each one does best, how much they cost, and how to get started.
- 2026-04-08
Gemma 4 31B runs at 1.5 tok/s on MBP M1 Max with Ollama due to swap. The fix: reduce context window (9 tok/s) or switch to oMLX (12.8 tok/s). The real culprit is KV cache allocation, not model size.
- 2026-04-08
Gemma 4 E2B through 31B benchmarked on RTX 5090, M1 Max, DGX Spark, and M4 with Ollama. E2B hits 310 tok/s on 5090. 31B hits 1.5 tok/s on MBP — swap kills faster hardware. Memory capacity > bandwidth.
- 2026-04-08
Dense is everyone working. MoE is expert rotation. PLE is a dictionary on every floor. SSM is a speed reader. A zero-jargon guide to the four main AI model architectures and how to pick between them.
- 2026-04-07
Gemma 4 E2B is 44-82% faster than E4B across M1 Max, GB10, and M4. We benchmarked both on Ollama with 3 runs per scenario, unique prompts, and proper warm-up. Memory bandwidth predicts generation speed better than anything else.
- 2026-04-07
Gemma 4 E4B NVFP4A16 hits 49.9 tok/s on DGX Spark — 2.6× faster than BF16. First NVFP4 checkpoint on HuggingFace. PLE architecture, FP8 vs NVFP4, and the llm-compressor version hell that almost stopped us.
- 2026-04-07
Ollama is a microwave — one command and you're chatting with AI. vLLM is a professional oven — 30% faster, handles multiple users, but takes real setup. A zero-jargon guide to choosing between them.
- 2026-04-05
Gemma 4 26B in NVFP4 hits 52 tok/s on DGX Spark — 16.5 GB used, 82 GB free for KV cache. Why MoE wins over the 31B dense variant on Blackwell GB10.
- 2026-04-05
Gemma 4 31B-IT NVFP4 on GB10 maxes out at 7.0 tok/s — bandwidth-bound at 273 GB/s. The math predicted 4.4 tok/s theoretical; NVFP4 compression buys 60% but can't escape the wall. Choose MoE.
- 2026-04-05
Same Gemma 4 26B-A4B, same GPU, 30% speed gap. vLLM NVFP4 hits 52 tok/s while Ollama Q4_K_M tops out at 40. Root cause: Marlin kernels, CUDA graphs, and an Ollama CPU/GPU split trap.
- 2026-04-02
DGX Spark power and thermal issues blew up after Carmack's criticism. This guide covers three distinct symptoms: 30W PD controller defect (needs RMA), 100W thermal throttling, and 5W driver bug (fixable). One command, 30 seconds to diagnose.
- 2026-03-30
Real benchmark numbers for Google's TurboQuant on a GB10/SM121 (DGX Spark) — actual compression ratios, Qwen2.5-3B accuracy validation, and why Qwen3.5-35B's hybrid attention architecture makes things complicated.
- 2026-03-24
How to point NemoClaw's inference backend to a local Ollama or vLLM endpoint. Config location, model swap, and what OpenShell still enforces when the cloud is gone.
- 2026-03-24
Your ChatGPT Plus subscription already includes GPT-5.4 with 1M context. openclaw's OAuth flow lets you use it for AI agents — zero API credits, one command. Full setup guide.
- 2026-03-23
NemoClaw's official installer fails on DGX Spark out of the box. This guide covers the 4 fixes — Node upgrade, npm link, OpenShell tar.gz, cgroupns — to get your first AI agent running in 30 minutes.
- 2026-03-23
NemoClaw bundles OpenClaw + OpenShell + NVIDIA Agent Toolkit into one installer for DGX Spark. Architecture breakdown, what it does, and whether it's worth your time.
- 2026-03-21
Building a multi-agent orchestrator with `claude -p` subprocess reveals a silent data loss problem. The SDK fix, session resume, parallel execution, and why setting_sources matters.
- 2026-03-21
Adding --kv-cache-dtype fp8 to a vLLM serve script on GB10 causes outputs to degrade into repetition after ~500 tokens. Root cause: missing calibration data, q_scale defaults to 1.0.
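A minimal reproduction sketch of that setup through vLLM's offline API; the model ID is a placeholder. The kv_cache_dtype argument is the Python-side equivalent of the --kv-cache-dtype flag, and without calibrated scales in the checkpoint the scale falls back to 1.0, which is the failure mode the post describes.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-35B",        # placeholder model ID
    kv_cache_dtype="fp8",            # same switch as --kv-cache-dtype fp8 on the CLI
)

out = llm.generate(
    ["Write a long, detailed explanation of transformer attention."],
    SamplingParams(max_tokens=800),
)
print(out[0].outputs[0].text)        # without calibrated scales, repetition tends to set in ~500 tokens
```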
- 2026-03-21
Connecting openclaw to a 131K context model and hitting a 400 error: 'max_tokens must be at least 1, got -1292'. The context budget math, the config key trap, and the fix.
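The budget arithmetic behind that 400, as a sketch: when the client computes max_tokens as context window minus prompt tokens and the prompt overshoots the 131K window, the result goes negative. The prompt count below is back-derived from the error value, not from the post.

```python
context_window = 131_072           # 131K-token model
prompt_tokens = 132_364            # hypothetical oversized prompt (back-derived)

max_tokens = context_window - prompt_tokens
print(max_tokens)                  # -1292 -> "max_tokens must be at least 1, got -1292"
```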
- 2026-03-21
Replacing choppy editMessageText polling with Telegram's sendMessageDraft for live animated output. The patch, the think-block filter, and the optional chaining trap in DM chats.
- 2026-03-19
The bot process is running. The token is valid. Messages are being consumed. Nobody is home. A systematic takedown of every wrong hypothesis — and the hidden causal chain that connects Tailscale routing tables to silent sendMessage failures in Node.js.
- 2026-03-19
How to get gpt-oss-120B running on a DGX Spark (GB10, SM121) with vLLM. The goal: a 120B model serving a local AI agent at zero API cost. The path: six bugs, one silent env var, and a startup log that tells you everything.
- 2026-03-19
After fixing the four SM121 NVFP4 bugs, Qwen3.5-122B boots cleanly and generates correct output. Then you check the speed. 14 tok/s. No flags to fix it. Here's why — and what to wait for.
- 2026-03-18
How to wire a callhelp tool into a local agent loop so it can spawn Codex CLI mid-reasoning. One permission flag you must set, and why Claude's quota stays mine.
- 2026-03-17
CUTLASS FP4 kernels target SM120 (GB200). On SM121 (GB10, DGX Spark) they run silently and produce garbage. Here's the full diagnostic story — 4 bugs, the row-identical failure signature, and the working fix.
- 2026-03-16
Why we stopped having the OpenClaw agent orchestrate multi-step tasks directly, and started spawning Codex subprocesses instead. The pattern that keeps agent context minimal and tasks reliable.
- 2026-03-13
Getting NVIDIA's Nemotron-3-Super-120B-NVFP4 running on an ASUS GX10 (SM121, 128GB). Four SM121-specific pitfalls, the env-var-that-does-nothing, and a working docker command.
- 2026-03-07
vLLM OOMed on restart despite 128GB unified memory. Cause: Ollama's KEEP_ALIVE=2h was holding 19-51GB in GPU. Diagnosis command, manual unload fix, and why to set KEEP_ALIVE=0 once vLLM is your primary stack.
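One way to do the manual unload from Python, using Ollama's documented /api/generate endpoint: a request with keep_alive set to 0 and no prompt tells Ollama to release that model's GPU memory immediately. The host and model name are placeholders.

```python
import requests

# Unload a resident model right now instead of waiting out KEEP_ALIVE=2h.
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3.5:35b", "keep_alive": 0},   # no prompt -> just load/unload control
    timeout=30,
)
# Longer term: set OLLAMA_KEEP_ALIVE=0 in Ollama's environment so models are
# never pinned in memory while vLLM is the primary stack.
```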
- 2026-03-06
Adding --enable-chunked-prefill to a Qwen3.5-35B (SSM+MoE hybrid) dropped throughput from 47 tok/s to 5.7 tok/s. Why SSM recurrence and chunked prefill are fundamentally incompatible.
- 2026-03-05
Step-by-step guide: Ollama to vLLM on DGX Spark GB10. Qwen3.5-35B hits 47 tok/s with TTFT dropping from 3s to 0.12s. Covers 6 real gotchas including SSM + chunked prefill trap and GPU memory conflicts.
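Once vLLM is serving (for example via `vllm serve <model> --port 8000`), any OpenAI-compatible client can talk to it; this is the standard pattern, with the model name as a placeholder for whatever the server was launched with.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the key is ignored unless --api-key is set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-35B",        # placeholder: the model `vllm serve` was given
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```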
- 2026-03-05
Full stack local AI agent: Mac Mini M4 as the always-on gateway, GX10 for inference, Telegram as the UI. No subscriptions, no cloud APIs. Six deployment lessons from the trenches.
- 2026-03-01
GLM-4.7-Flash hits 57.8 tok/s on short context but drops to 42 tok/s at 8K. Qwen3.5-35B SSM hybrid: 56 tok/s at short, 56 tok/s at 8K. Why agents with long system prompts should care about this difference.
- 2026-02-26
A custom /debate command that pits Codex CLI against Gemini CLI on architecture, code, and decisions. Different training data, different blind spots — and the disagreements between them are usually the most useful output.
- 2026-02-26
How I replaced screenshot-heavy iOS test runs with ui_describe_all-first testing in Claude Code, cutting context usage by 81% for BPS Tracker. Plus Fastlane integration for App Store automation.
- 2026-02-25
Spent weeks restarting the OpenClaw gateway for every config change. Then discovered the file watcher. What hot-reloads instantly, what still needs a restart, and how to tell auth failures from transient network errors.
- 2026-02-19
A Claude Code config rule marked MANDATORY was skipped twice in one session. Here's the root cause — three architectural reasons why emphasis doesn't work — and three system-level solutions that do.
- 2026-02-19
Benchmarking 8 local LLMs on NVIDIA GB10 (128GB unified memory) across 7 task categories. Quantization surprises, a 120B model that fails at JSON, and thinking models that spend their entire budget thinking.
showing 70 posts