~/blog/dgx-spark-deepseek-v4-flash-284b-ds4-engine

DeepSeek-V4-Flash on DGX Spark · part 1

[Local LLM] Running a 284B DeepSeek-V4-Flash on a 128GB DGX Spark — and blaming the wrong thing for broken tool calls

cat --toc

TL;DR

DeepSeek-V4-Flash is a 284B model. I got it onto a single 128GB DGX Spark (GB10) with antirez's hand-written ds4 engine and an asymmetric Q2 GGUF (~80GB), decoding at 15.3–15.6 tok/s. The interesting part wasn't the number — it was a conclusion I got wrong. I thought the model was "faking tool calls and making up results" because 2-bit was too lossy. The real culprit: the runtime had no DSML parser. DeepSeek-V4 calls tools in DSML (XML-like), not JSON. The model emitted it correctly; nothing parsed it. Swap engines and 2-bit calls tools just fine.

The whole post in one image

The short version

DeepSeek-V4-Flash is a 284B-parameter model — the kind of thing that normally lives in a server rack. I only have a desktop box: a DGX Spark with 128GB of unified memory. Quantization (squeezing every weight to fewer bits) gets it small enough to fit, and the result is a genuinely capable model running entirely on my desk. This is one evening: getting it running, benchmarking it, and wiring it into my daily agent. The detour worth reading is the part where I decided the model "couldn't use tools" — and was wrong about why.


The setup

Spec sheets lie. You see "128GB unified memory" and assume you can finally run big models; you actually run one and learn that "fits" and "runs well" are two different things.

This is part 1 of the DeepSeek-V4-Flash on DGX Spark series. It started one tired night with "is DSv4 even worth squeezing onto a GB10?" and ended with the thing wired into production. This post covers whether it runs and the tool-call bug I misdiagnosed. Part 2 covers the memory and context engineering to make it a daily driver; Part 3 covers why a 284B model crushed to 2-bit still holds up.

Phase 0: I didn't have to write five CUDA ops — the community already did

My first instinct was "this needs hand-written kernels." DeepSeek-V4 has a compressed KV cache and sparse routing; not every runtime can load it.

Turned out I didn't have to write any of it:

  • antirez (yes, the Redis guy) wrote an engine specifically for DeepSeek-V4-Flash, called ds4, with a CUDA backend that natively understands DSv4's tool format.
  • huihui already shipped an abliterated (refusal-direction removed) DSv4 GGUF in the asymmetric Q2 layout ds4 wants.

The hardest part — porting a frontier architecture to a single card and squeezing it to fit — was already done. My only job was to compile it, run it, and troubleshoot the rough edges. That's the value of a Phase 0 pass: survey before you build, and you often find the wheel already exists.

First GB10 numbers, stuck behind a stale nvcc

The build command is clean:

git clone https://github.com/antirez/ds4 && cd ds4
make cuda-spark            # don't pass -arch yourself; ds4 says this is fastest

You don't quantize anything yourself — ds4 ships download_model.sh. On my 128GB box, q2-imatrix:

./download_model.sh q2-imatrix    # 96/128GB boxes; lands in ./gguf/ and points ./ds4flash.gguf at it
./download_model.sh mtp           # MTP draft, if you want speculative decoding

That's antirez's own stock q2-imatrix (~80GB, from antirez/deepseek-v4-gguf). The one I run daily is huihui's abliterated build — same asymmetric Q2 layout, refusal direction removed — which is a separate download from huihui-ai/Huihui-DeepSeek-V4-Flash-abliterated-ds4-GGUF (Huihui-DeepSeek-V4-Flash-BF16-abliterated-ds4-Q2.gguf, 86.7GB); just point -m at that file. The speed numbers below cover both.

The first build hit a wall. A stale /usr/bin/nvcc (12.0) was shadowing the 13.2 I'd installed, and the compile died on math-vector.h __SVFloat32_t undefined. Make sure which nvcc resolves to 13.2, not the stale 12.0. Then starting the server hit a second one: the binary built by make cuda-spark (nvcc 13.2) on the stock 13.0 driver throws the provided PTX was compiled with an unsupported toolchain.

The fix is cuda-compat-13-2 (NVIDIA's official forward-compat — one package, doesn't touch the kernel driver, zero reflash risk), with the right env on launch:

sudo apt install cuda-compat-13-2          # just this package, no driver changes
LD_LIBRARY_PATH=/usr/local/cuda-13.2/compat ./ds4-server --cuda \
  -m ./ds4flash.gguf --ctx 131072 --power 100 --warm-weights \
  --host 0.0.0.0 --port 8000

# ds4 has no /health route (returns 404); watch the log for "listening on", or hit /v1/models:
curl -s http://localhost:8000/v1/models

This trick works for anything built with a 13.2+ toolchain that has to run on the stock 580 driver — no risky driver upgrade (on DGX Spark a driver upgrade can boot-loop you into a USB rescue).

One knob people miss: --power. ds4 already defaults to 100 (full speed), so you don't normally set it — but some community recipes copy --power 75, which costs you 25% (11.5 vs 15.3 tok/s). Make sure you're at 100. --warm-weights loads the weights into the GPU before serving, so the first request doesn't eat a cold-start spike.

15.6 tok/s, and six configs all pinned to the same number

Once it ran, I swept every knob. The result is a little funny — nothing moved the needle:

Configdecode tok/s
stock Q215.56
stock + MTP draft 115.58
stock + MTP draft 214.48
huihui abliterated Q215.3
huihui + MTP draft 115.63
huihui + MTP draft 214.44

Six configs — stock, abliterated, speculative on, speculative off — all parked around 15.6. On ds4, that's the ceiling: GB10's 273 GB/s feeding 80GB of weights only lets you emit so many tokens per second. (Strictly, it's ds4's ceiling — a different engine with better kernels moves the number, which I cover with the Atlas example in the Qwen3.5-122B post.)

Two results stood out. First, speculative decoding (MTP) doesn't pay off here. Draft 1 is roughly break-even; draft 2 actually drops to 14.5 — and that slowdown is the proof the mechanism is real (you only pay overhead if it's actually engaged), which makes draft 1's flat result a genuine break-even, not a no-op. On a bandwidth-bound box, the extra bandwidth speculation burns is about what it earns back.

Second, abliteration doesn't cost speed here. Stock is only ~1.7% faster than abliterated (noise). I'd measured abliteration costing -50% chat throughput on Gemma; what's the difference? The bottleneck. You pay the abliteration tax when you're compute-bound, not when you're bandwidth-bound.

The tool-call detective story: I blamed 2-bit

This is the part worth writing about.

I sent DSv4 a weather-tool request and got back something that looked like a tool call but wasn't really one — tool_turns=0, parameters that didn't line up, like it was acting out a result. I jumped to the obvious conclusion: the 2-bit quant had wrecked it so badly it couldn't even emit valid tool calls.

That conclusion was half wrong. The direction was right (the output wasn't being treated as a tool call); the attribution was wrong (not 2-bit).

The truth: DeepSeek-V4 doesn't call tools in JSON/ChatML — it uses DSML, an XML-like markup that looks like this:

<|DSML|tool_calls><|DSML|invoke name="get_weather">
<|DSML|parameter name="city" string="true">Taipei</|DSML|parameter>
</|DSML|invoke></|DSML|tool_calls>

The llama.cpp fork I was running had no V4 DSML parser (upstream only knows V3.2's function_calls, not V4's tool_calls; the relevant PR was closed unmerged). The model emitted correct DSML — nothing parsed it, so it became plain text → tool_turns=0 → looks like the model faking a call.

The proof is in the quant table. huihui's asymmetric Q2 allocates precision like this:

LayerPrecision
routed ffn gate/up expertsIQ2_XXS
routed ffn down expertsQ2_K
shared experts / attention / output headQ8_0
embedding / router / indexerF16

Everything at 2-bit is a routed expert — huge in volume, but each only touching a tiny slice of tokens. Everything tied to tool format, routing, and attention stays at Q8/F16. 2-bit never touches the layers that emit DSML.

Swap in the ds4 engine (which renders and parses DSML natively) and tool calls just work: get_weather(Taipei), structured tool_calls, first try. The takeaway: I blamed 2-bit, but the real culprit was a runtime with no DSML parser. Put it on the right engine and 2-bit uses tools fine.

Shipping it: wiring it into my agent

The last step was wiring it into the agent I use every day. The serve command is the one above; the client uses model=deepseek-chat for non-thinking mode — partly to dodge a known bug where thinking mode swallows the DSML, partly because everyday chat is snappier that way.

ds4 has one more thing I ended up loving: a KV disk cache. It writes computed prefix KV to NVMe, so the next conversation with the same opening loads straight off disk instead of recomputing. That's the load-bearing piece for "frontier model as a daily driver," so it gets its own treatment in Part 2.

Takeaways

Where the time actually went

Not the build — the tool-call misdiagnosis. I stared at "faking a tool call" output and my brain auto-filled "2-bit is too lossy," and I almost wrote that as the conclusion. What actually cracked it was reading DeepSeek-V4's encoding docs, finding DSML, and realizing the problem lived in the runtime, not the model. I jumped from an output symptom straight to a precision verdict and skipped the layer in between: did the runtime even parse this format?

Debugging takeaways

When a model "talks but can't use tools," don't reach for precision first. Check the wire format: what does this model use for tool calls, and does your runtime know it? DSML, function_calls, and OpenAI tool_calls are different wire formats — a parser mismatch shows up as "the model can't use tools," even when the model emitted everything correctly.

The one-liner

Misdiagnosis costs more than the bug. A "2-bit is too lossy" verdict sends you swapping models, raising precision, and burning download bandwidth; a "runtime is missing a parser" verdict just swaps engines. Before you call the model dumb, verify that every layer of the toolchain actually understands its output.

TL;DR

  • DeepSeek-V4-Flash (284B) runs on a single GB10: ds4 engine + asymmetric Q2 (~80GB), 15.3–15.6 tok/s.
  • The speed is a bandwidth ceiling: six configs converge on 15.6, MTP doesn't pay off under bandwidth limits, abliteration costs no speed.
  • The "acting" tool calls were a misattribution to 2-bit: the real culprit was a runtime with no DSML parser. On ds4, tool calls work.
  • Daily-driver details (KV disk cache, context engineering) are in Part 2; the Q2 quality argument is in Part 3.

Also in this series: Part 2 — Running a 15 tok/s 284B as a daily agent brain: the settings that make it comfortable · Part 3 — Weights win: a 284B crushed to 2-bit still beats the small model that fits

FAQ

Can DeepSeek-V4-Flash run on a single DGX Spark (GB10)?
Yes, with antirez's purpose-built ds4 engine and an asymmetric Q2 GGUF (routed experts at IQ2_XXS/Q2_K, attention and output kept at Q8/F16), ~80GB total. It decodes at ~15.3–15.6 tok/s on GB10. Upstream llama.cpp won't load it — no V4 architecture loader — and vLLM only takes FP4/FP8 safetensors that don't fit on one box.
Why does DeepSeek-V4-Flash look like it can't use tools at first?
Its tool calls use DSML (XML-like markup), not JSON. The llama.cpp fork I first ran had no V4 DSML parser, so the model emitted correct DSML but the runtime treated it as plain text — which looks exactly like faking a tool call and making up the result. Swap in an engine with a DSML parser and tool calls work. It's not the 2-bit quantization.
Does Q2 quantization make the model dumb?
Depends how you quantize. DeepSeek-V4-Flash uses asymmetric Q2 — only the routed experts (huge in volume, but each handling a tiny slice of tokens) drop to 2-bit; the quality-critical layers (attention, shared experts, output head) stay at Q8/F16. I ran it as a daily agent for ~30 hours / 280 turns with zero degradation.
How fast can a 100B+ model run on GB10?
Depends on the engine. DeepSeek-V4-Flash does ~15.6 tok/s on ds4, bottlenecked by GB10's 273 GB/s feeding tens of GB of weights. But half of that 13–17 tok/s band is immature kernels, not pure bandwidth physics — an engine with native sm_121 kernels can more than double it. For interactive speed, pick a small-active ~30B MoE (~67 tok/s on NVFP4).