Does DeepSeek-V4-Flash's decode speed drop as context grows?

Yes, noticeably. ~15 tok/s on short chats, ~9 at 16K, ~5 at 64K. Even with DSv4's compressed KV, per-token attention cost climbs with real context length. The allocated context (-c) doesn't affect decode; what's actually fed in does.

Why does ds4/llama.cpp cold-start take 7 minutes, and how do you fix it?

mmap page-faults a ~75GB CUDA buffer at ~200 MB/s. Add --no-mmap for a sequential ~1.4 GB/s read and cold start drops from ~7 min to ~57s. The cost: that 80GB becomes a hard CUDA buffer (non-reclaimable page cache), so memory headroom gets tight when you co-run other things.

Can you quantize DeepSeek-V4-Flash's KV cache?

No. -ctk/-ctv set to q8_0 or q4_0 both crash in this DSv4 fork's CUDA concat kernel (it won't take quantized KV types). Doesn't matter though — DSv4's compressed KV is tiny already (64K of f16 KV is just 1.7GB). ds4 instead gives you a KV disk cache that writes computed prefixes to NVMe and reuses them.

Is ds4's KV disk cache actually useful?

Yes. Send the same 4011-token prompt twice: first cold prefill is 11.4s; the second hits the on-disk prefix checkpoint (47ms to load) and only prefills the remainder, down to 6.5s. It keys on a prefix hash, writes to real NVMe, and even reuses across quant. It's why long chats decode slowly but reconnecting doesn't re-explode TTFT every time.

Decode is only 5–12 tok/s — isn't DeepSeek-V4-Flash painful as an agent?

Better than you'd expect, because it's bimodal. From a day of real logs: 155 LLM calls averaging 46K tokens of context, but 48% returned in 5–15s — because the disk KV cache loads a big prefix in ~87ms and only the new tail gets recomputed. Hot path (cache hit) is 5–15s; cold path (restart / post-compaction / new topic) is 60–160s. As an agent, it spends most of its time on the fast one.

~/blog/dgx-spark-deepseek-v4-flash-context-memory-engineering

DeepSeek-V4-Flash on DGX Spark · part 2

[Local LLM] Running a 15 tok/s 284B as your daily agent brain — the settings that make it bearable

2026-06-1214 min read#dgx-spark #gb10 #deepseek-v4-flash #kv-cache 中文版

❯ cat --toc

The short version
The setup
Just give me the settings: ds4 server side + agent framework side
Decode isn't a constant — it collapses with context
--no-mmap: same model, same disk, 7 minutes to 57 seconds
KV disk cache: write what it thought to disk, grab it next time
Decode's slow, but as an agent it's surprisingly smooth
KV quantization is a dead end: I went to optimize and found nothing to optimize
I chased a non-existent watchdog for half an hour — the watchdog was my grep
The real compression culprit: Hermes defaults to letting ds4 summarize itself, and it times out at 6 minutes
The coexistence wall: one 128GB box can't comfortably run a frontier LLM and media generation
Takeaways
Where the time actually went
Debugging takeaways
The one-liner
TL;DR

TL;DR

Getting DeepSeek-V4-Flash to run on a GB10 is the entry ticket. To use it as a daily agent, you do memory and context work: decode falls from 15 tok/s to 5 (at 64K), --no-mmap cuts cold start from 7 min to 57s, and KV quantization is a dead end (the concat kernel won't take it). But slow decode doesn't mean slow in practice: from a day of logs, average context is 46K tokens yet 48% of replies finish in 5–15s, because ds4's KV disk cache loads a 50K-token prefix in 87ms (cold prefill would be 150s). In practice it's bimodal — the hot path is smooth, only the cold path drags.

The whole post in one image

The short version

Part 1 got this 284B model running on my small box. But "runs" and "comfortable to use every day" are separated by a pile of engineering. Imagine an assistant with a great memory and two quirks: the longer you talk, the slower he replies (he re-reads the whole conversation every time he speaks), and every "shift" starts with a 7-minute warm-up. This post is about fixing those quirks — and about the half-hour I spent chasing a problem that didn't exist.

The setup

A model's benchmark number and its feel as "the brain you use every day" are two different things. The first is one prompt on an idle box. The second is whether it's still pleasant after a full day of use, with context bloated and fighting your image generator for memory.

This is part 2 of the DeepSeek-V4-Flash on DGX Spark series. Part 1 proved it runs at 15.6 tok/s with working tool calls. This post is every memory and context wall I hit wiring it into a daily agent — and what you have to set, on the server and the agent framework, to get around them.

Just give me the settings: ds4 server side + agent framework side

Here's the conclusion up front. To run this as a daily brain, the settings split in two. I crashed my own agent today by missing one of them.

ds4 server side:

--ctx 131072, not 256K — decode is the same, but TTFT triples (numbers below).
--kv-disk-dir is mandatory — it's what keeps long chats from recomputing prefill every turn, and the whole reason this thing is bearable as an agent.
--power 100 — the default is already 100, but don't copy a recipe that sets 75 (that's a flat 25% off).
--no-mmap buys a 57s cold start but tightens memory headroom; keep mmap if you co-run something like an image generator.
single user → --max-batch-size 1, no batching.

Agent framework side (the half people miss):

Tell the framework the server's real context limit (131072 for ds4). Leave it out and the framework falls back to some larger default → context overgrows → the server returns 400 Prompt too long → the whole session is wedged. That's exactly what I broke today: every new message crashed on entry.
Compress before you hit the limit, and summarize with an off-box small model (I hand this to a Gemma 4 E2B running on a separate GTX 970 — fully local, never leaves my network). Don't let ds4 summarize itself — it's single-worker, so it's slow and times out.
Cap single-tool output. One terminal command dumping 50K characters (~29K tokens) can blow the window in one shot — that's where my one real overflow came from.
On the client, use model=deepseek-chat for non-thinking — it dodges the bug where thinking swallows DSML, and it's more direct for everyday use.

Every setting below has a pothole behind it.

Decode isn't a constant — it collapses with context

The first counterintuitive thing: decode speed isn't constant; it falls with real context length.

Prompt actually fed in	TTFT (to first token)	decode tok/s
64 tokens	1.9s	14.4
4K tokens	18.4s	11.1
16K tokens	73.2s	9.2
64K tokens	426.7s (7 min!)	5.2

14.4 tok/s on a short chat, down to 5.2 at 64K — a third of the speed. Even with DSv4's compressed KV, per-token attention cost climbs with context. People online claim "DSv4 TPS is ~14 no matter the length"; my measurements say that's wrong.

TTFT is the worse one. A 64K prompt spends 7 minutes just in prefill. 1.9s on a short chat is invisible, but once context bloats, you wait every turn.

And one easy thing to misread: this has nothing to do with the -c setting. On a clean idle bench (warmup discarded, 3 runs), -c 64K and -c 256K both decode at 14.3 on the same short prompt. 256K's real cost is TTFT — 1.36s to first token at 256K vs 0.41s at 64K, 3x slower. 256K doesn't make it think faster; it just makes it wait an extra second before it speaks. So I set 128K for daily use.

--no-mmap: same model, same disk, 7 minutes to 57 seconds

That 7-minute cold start always bugged me. An 80GB file off NVMe shouldn't take 7 minutes by any sequential-read math.

The culprit is mmap. While -ngl 99 fills a 75GB CUDA buffer, mmap page-faults it in randomly at ~200 MB/s. Add --no-mmap for a sequential read at ~1.4 GB/s:

./ds4-server --cuda -m ./ds4flash.gguf --ctx 131072 --no-mmap ...

Cold start drops from ~7 min to 57s, 6–7x. Same model, same disk, one flag.

The tradeoff: --no-mmap makes that 80GB a hard CUDA buffer, not reclaimable page cache. Co-run something like ComfyUI and your allocatable headroom gets tight. The speed isn't free.

KV disk cache: write what it thought to disk, grab it next time

ds4 has something I liked more the longer I used it: a KV disk cache. Start the server with:

./ds4-server --cuda -m ./ds4flash.gguf \
  --kv-disk-dir ./ds4-kv-cache --kv-disk-space-mb 32768 ...

It writes computed prefix KV to NVMe, keyed on a prefix hash. I measured the hit — same 4011-token prompt sent twice:

First (cold): full 4011-token prefill, 11.4s.
Second (warm): hits the on-disk checkpoint, kv cache hit tokens=2048 load=47.5ms, only prefills the remaining 1963 tokens — 6.5s.

Half the prefill. Loading those 2048 tokens of KV takes 47ms, ~100x faster than recomputing (~6s). And it reuses across quant — change quantization and it can still pull the old KV.

This is why long chats decode at 5–9 tok/s but reconnecting doesn't re-explode TTFT to 7 minutes every time: the prefix comes off disk, and only the new tail gets recomputed. It writes what it already thought to disk. Ask the same opening again and it doesn't think it through twice.

Decode's slow, but as an agent it's surprisingly smooth

That decode-collapses-to-5 table reads like "this thing can't be used interactively." I assumed the same. Then I ran it as my agent's brain for a full day, and the logs tell a different story — slow decode, surprisingly smooth use.

The key is that disk KV cache. From a day of logs: 155 calls, averaging 46,507 tokens of context (heavily agentic, context grows fat). By the math, a cold 46K prefill is ~130s — every message should be a two-minute wait. In practice: 48% finished in 5–15s.

The difference is the cache hit. Those 15 hits loaded a 50K-token prefix in 87ms each — a cold prefill of that size is 150s, ~1700x slower. So each turn, almost all of that 46K of history comes straight off disk; the only real work is the new tail plus the generated output. Decode is still 9–12 tok/s, but that cost only hits the new tokens — you don't re-pay for the whole context every turn. That's where the disk KV earns its keep.

So it's really two speeds:

Hot (cache hit): 5–15s. Smooth.
Cold (post-compaction rebuild, first message after restart, brand-new topic): 60–160s. Drags. Every "ds4 is so slow" complaint is this path.

That headline 15 tok/s is misleading because it only describes the cold half. As an agent, it spends most of its time on the hot one.

Tool use checked out too: 108 tool completions, framework layer barely ever dropped a call. The 37 "failures" all turn out to be command-level — 300s timeouts, errors in Playwright/python scripts the model wrote itself, things I ctrl-C'd — i.e. an agent doing normal trial-and-error, not a broken tool.

KV quantization is a dead end: I went to optimize and found nothing to optimize

Since context is the bottleneck, I figured I'd quantize the KV. The whole path dies:

ggml-cuda/concat.cu:244: ggml_cuda_op_concat: unsupported type

-ctk q8_0 crashes, -ctv q4_0 crashes, same wall — this DSv4 llama.cpp fork's CUDA concat kernel won't take quantized KV types; q8_0/q4_0 just abort (a kernel limit of the fork, not the DSv4 model itself). It loads fine and dies on the first inference.

The funny part: even if it worked, there's nothing to save. DSv4's compressed KV is tiny already — 64K of f16 KV is 1.7GB. I tried to trim the KV of a model that already barely uses any, and it wouldn't even let me concat the quantized version. A nice "went to optimize, found nothing to optimize" ending.

I chased a non-existent watchdog for half an hour — the watchdog was my grep

This one's the honesty-series staple.

After making --no-mmap my default, context once grew to 56K, a big prefill's working memory blew past the ceiling, and global_oom killed the server (dmesg: total-vm:339GB). I restarted, then noticed something strange: roughly every 60 seconds, something seemed to relaunch the server with --no-mmap, fighting me.

I went looking. cron? No. systemd timer? No. launchd? No. Half an hour in, I was getting paranoid — where was this watchdog coming from?

The truth made me want to headbutt the desk: I'd been using pgrep -f llama-server to check server state, and that pattern matched my own diagnostic shell (its command line contains "llama-server"). The phantom process even showed "--no-mmap" — because that was a word in my grep pattern. Worse, I once ran pkill -9 -f and killed my own ssh session.

I spent half an hour chasing a watchdog that didn't exist — the watchdog was my own grep. Lesson: check with pgrep -xc (exact comm, not the whole command line), kill with pkill -x. I've hit this pothole more than once in this series.

The real compression culprit: Hermes defaults to letting ds4 summarize itself, and it times out at 6 minutes

Compression belongs in the honesty series too — it was broken at first, and broken quietly.

First, how it works. An agent framework's context compression (I run Hermes) doesn't just drop old messages — it asks an LLM to summarize them into something short. The catch is: which LLM? If you don't point it at a dedicated summarizer, the framework falls back to your main model — ds4 itself.

That's the trap. ds4 is 284B, single-worker, ~15 tok/s. Asking it to summarize a 60–70K-token conversation means asking it to decode another wall of text — and it's already slow with one worker. In the logs, every compression looked like this:

context compression started: tokens=~71,628 model=deepseek-chat
Auxiliary compression: using auto (deepseek-chat) at http://…:8000/v1/
  …(six minutes later)…
Auxiliary compression: connection error (Request timed out.)
Failed to generate context summary → Inserted a fallback context marker.

Every time, it burned the full 6-minute timeout before giving up, then dropped in a "fallback marker" — meaning compression never actually happened. The context kept growing and the next turn got slower. (The funny part: I thought I'd configured a small model for this. The logs showed that setting never took effect — it silently fell back to ds4 every time.)

The fix is simple. The real lesson isn't whether to compress — it's: don't summarize with your single-worker big model. Point it at a separate small model instead.

I sent mine to a Gemma 4 E2B on a spare GTX 970:

Summarizer	Result
ds4 itself (284B, single worker)	6-minute timeout → failed, dropped a fallback marker
Gemma 4 E2B (separate server)	completes → succeeded (a `/compress` smoke test ran in ~76s, messages 44→18, actually summarized)

To be honest this isn't apples-to-apples — the ds4 timeouts were on 60–90K-token contexts, the E2B run was a small test, and a real 64K compress would take longer. But the takeaway is obvious: one always times out and never finishes; the other finishes and actually compresses — locally, for free, never leaving the network.

And you don't need a second machine. The point is "a separate small model," not "a separate box." The same E2B runs fine as another llama-server (on a different port) on the DGX Spark itself — the E2B Q4 weights plus a 64K KV (q8) come to ~3–4GB, which fits in the dozen-plus GB ds4 leaves free. It's a separate process, so it doesn't tie up ds4's one worker — compression runs alongside serving. And the GB10 is faster than the GTX 970, so it'd only be quicker. Point the framework's compression at that port and you're done.

(One gotcha: Hermes requires the summarizer's context window to be ≥64K — it has to swallow the whole chunk it's compressing in one shot. So launch that small server with -c 65536; if VRAM is tight, drop the vision encoder you don't need and quantize the KV with q8_0 to halve it.)

The coexistence wall: one 128GB box can't comfortably run a frontier LLM and media generation

One last engineering reality. DSv4's 80GB of weights are fixed, and once context bloats the prefill checkpoints (~14GB), available memory drops to single-digit GB fast. Ask ComfyUI to make an image then and it has no room to stage the model — the same image jumps from 17s to 265s, thrashing through stage/unstage.

I tried various flags (--disable-pinned-memory did help — another story), but the conclusion holds: on one 128GB box, a frontier LLM and media generation can't comfortably coexist. Either separate them in time (the 57s --no-mmap restart makes this tolerable — stop the LLM when you want images) or separate them by hardware — move media generation to another machine entirely.

I went with hardware separation: image and video generation moved to an RTX 5090, GB10 dedicated to DSv4. Coexistence solved for good, no more switching.

Takeaways

Where the time actually went

I spent half an hour chasing the watchdog that didn't exist. Not because it was hard, but because I asked the wrong question from the start — "what's restarting the server" — when the right one was "how do I confirm anything is restarting it at all." pgrep -f's false positive turned my own diagnostic tool into the ghost I was chasing.

Debugging takeaways

Two of them. First, pgrep -f / pkill -f will match its own shell command when that command contains the target string — diagnose with -x (exact comm). Second, when an OOM kills something unexpected, don't assume "memory's just tight" — check dmesg for the real cause (another time in this series it was ComfyUI pinning 112GB of host memory; the model was innocent).

The one-liner

Context shift manages token count (the logical layer); the OOM killer manages physical RAM (the hardware layer) — two different layers. Clearing KV tokens doesn't save you from a physical memory spike. "Doesn't it auto-compress and drop old stuff?" — it does, but that's a context-window token budget, unrelated to the physical-memory spike during prefill.

TL;DR

decode collapses with context: 14.4 (short) → 5.2 (@64K); allocated context doesn't matter, real length does.
--no-mmap cuts cold start from 7 min to 57s, at the cost of tighter memory headroom.
KV disk cache halves prefill on a hit (prefix-hash + NVMe + cross-quant); long chats stop re-exploding TTFT.
KV quantization is dead (the concat kernel won't take it), but DSv4's KV is tiny anyway, so it doesn't matter.
The three agent-side settings people miss: tell it the real context limit (or it crashes with 400 Prompt too long), hand compression to a separate small model (don't let ds4 summarize itself — it can even live on the same box), and cap single-tool output.
One 128GB box can't comfortably run a frontier LLM + media generation — splitting by space (media on another box) is the clean fix.

Also in this series: Part 1 — Running a 284B on a 128GB box, then blaming the wrong thing for broken tool calls · Part 3 — Weights win: a 284B crushed to 2-bit still beats the small model that fits

FAQ

Does DeepSeek-V4-Flash's decode speed drop as context grows?: Yes, noticeably. ~15 tok/s on short chats, ~9 at 16K, ~5 at 64K. Even with DSv4's compressed KV, per-token attention cost climbs with real context length. The allocated context (-c) doesn't affect decode; what's actually fed in does.
Why does ds4/llama.cpp cold-start take 7 minutes, and how do you fix it?: mmap page-faults a ~75GB CUDA buffer at ~200 MB/s. Add --no-mmap for a sequential ~1.4 GB/s read and cold start drops from ~7 min to ~57s. The cost: that 80GB becomes a hard CUDA buffer (non-reclaimable page cache), so memory headroom gets tight when you co-run other things.
Can you quantize DeepSeek-V4-Flash's KV cache?: No. -ctk/-ctv set to q8_0 or q4_0 both crash in this DSv4 fork's CUDA concat kernel (it won't take quantized KV types). Doesn't matter though — DSv4's compressed KV is tiny already (64K of f16 KV is just 1.7GB). ds4 instead gives you a KV disk cache that writes computed prefixes to NVMe and reuses them.
Is ds4's KV disk cache actually useful?: Yes. Send the same 4011-token prompt twice: first cold prefill is 11.4s; the second hits the on-disk prefix checkpoint (47ms to load) and only prefills the remainder, down to 6.5s. It keys on a prefix hash, writes to real NVMe, and even reuses across quant. It's why long chats decode slowly but reconnecting doesn't re-explode TTFT every time.
Decode is only 5–12 tok/s — isn't DeepSeek-V4-Flash painful as an agent?: Better than you'd expect, because it's bimodal. From a day of real logs: 155 LLM calls averaging 46K tokens of context, but 48% returned in 5–15s — because the disk KV cache loads a big prefix in ~87ms and only the new tail gets recomputed. Hot path (cache hit) is 5–15s; cold path (restart / post-compaction / new topic) is 60–160s. As an agent, it spends most of its time on the fast one.

Don't miss the next one

Subscribe, and you won't.

One-click unsubscribe anytime.

← back to blog

The short version

The setup

Just give me the settings: ds4 server side + agent framework side

Decode isn't a constant — it collapses with context

--no-mmap: same model, same disk, 7 minutes to 57 seconds

KV disk cache: write what it thought to disk, grab it next time

Decode's slow, but as an agent it's surprisingly smooth

KV quantization is a dead end: I went to optimize and found nothing to optimize

I chased a non-existent watchdog for half an hour — the watchdog was my grep

The real compression culprit: Hermes defaults to letting ds4 summarize itself, and it times out at 6 minutes

The coexistence wall: one 128GB box can't comfortably run a frontier LLM and media generation

Takeaways

Where the time actually went

Debugging takeaways

The one-liner

TL;DR

FAQ

Read next

Don't miss the next one