DeepSeek-V4-Flash on DGX Spark — Series


❯ ls ~/blog/series/deepseek-v4-flash-on-dgx-spark

10 posts

part | date | read | title
1 | 2026-06-12 | 10m
[Local LLM] My first Q2 model looked broken on a 128GB box — the real culprit was a parser that couldn't read DSML, not the quantization
DeepSeek-V4-Flash is 284B. I got it onto a single 128GB GB10 with antirez's ds4 engine and an asymmetric Q2 GGUF at 15.6 tok/s. The fun part: the broken tool calls weren't the 2-bit quant's fault. The runtime just couldn't parse DSML.
2 | 2026-06-12 | 14m
[Local LLM] Running a 15 tok/s 284B as your daily agent brain — the settings that make it bearable
A 284B model at 15 tok/s, wired into a daily agent. Two sets of settings make it comfortable — server-side and agent-framework-side. --no-mmap cuts cold start to 57s, the KV disk cache halves prefill, and one missing context_length will crash the whole session.
3 | 2026-06-12 | 7m
[Local LLM] Weights win: a 284B crushed to 2-bit still beats the small model that fits
DeepSeek-V4-Flash (284B) only fits a 128GB box at asymmetric Q2 (~80GB). Sounds like suicide quantization — but it's surgical: only the layers that barely affect quality get cut. As a daily agent it ran 280 turns with zero degradation. Big enough weights survive 2-bit.
4 | 2026-07-05 | 12m
[Local LLM] My 284B agent quietly stopped reusing its KV cache — the ds4 evict storm that re-paid prefill every turn
A month after wiring DeepSeek-V4-Flash into a daily agent, it felt slow again. Two log lines explained it: the disk-KV cache was evicting live prefixes (hits=0), and tool-call turns never saved a checkpoint — so common=268 out of 14209 and every turn re-paid full prefill. The fix: 256G KV budget + PR #489.
5 | 2026-07-06 | 10m
[Local LLM] Depth-1 MTP on V4-Flash: +9% on agent turns, −4% on prose — route speculative decode by workload
Depth-1 MTP speculative decode on DeepSeek-V4-Flash is lossless but workload-dependent on a GB10: +9.4% on agent turns and +6.5% on code, −3.6% on prose and −3.7% on Chinese chat. The sign follows the acceptance rate because decode here is verify-bound (108ms verify vs 4ms draft). It's not a global speed-up switch; route it by workload.
6 | 2026-07-07 | 12m
[Local LLM] Why a 284B fits a 128GB GB10 at long context: DeepSeek-V4-Flash attacks the KV cache, not the parameter count
DeepSeek-V4-Flash is a 284B MoE that stays fast at long context on a 128GB GB10 because its hybrid CSA/HCA attention and lightning indexer shrink the KV cache to ~871MiB at 64K and read only a few hundred compressed rows per step. What I found reading ds4's code and DeepSeek's V4 paper.
7 | 2026-07-09 | 12m
[Local LLM] FlashMemory can't improve DeepSeek-V4-Flash's own lightning indexer — I retrained it on my exact Q2 and it still lost
V4-Flash already ships a native lightning indexer tracking true attention at 93–96%. FlashMemory pre-filters candidate chunks, but it makes near-random selections on my Q2 and reaches only 89–92% when retrained — still a NO-GO on GB10.
8 | 2026-07-10 | 11m
[Local LLM] How to tell if a hyped LLM optimization is real on your hardware: read the source, find the ceiling, run one experiment
Most hyped LLM optimizations don't survive contact with your own model and hardware. Three cheap checks — read the source, find the true ceiling, run one discriminating experiment — that tell you which are real, anchored in the FlashMemory investigation on DeepSeek-V4-Flash.
9 | 2026-07-21 | 12m
[DeepSeek-V4-Flash] Swapping the ds4 Engine for a Free Half-Generation Speedup on One DGX Spark
I asked Codex whether my ds4 repo had a single-DGX-Spark optimization and found Entrpi. I swapped the engine, not the model: decode went from 14-16 tok/s to about 20 tok/s, and prefill got about 2× faster.
10 | 2026-07-25 | 15m
[DeepSeek-V4-Flash] The Engine Upgrade That OOM'd My DGX Spark — and Why I Rolled It Back
Part 9's Entrpi engine hit 20 tok/s. Days later it OOM-crashed under real agent traffic. Root cause: two copies of the weights in memory. I rolled back.
