DeepSeek-V4-Flash on DGX Spark · part 3
[Local LLM] Weights win: a 284B crushed to 2-bit still beats the small model that fits
❯ cat --toc
TL;DR
Weights win. DeepSeek-V4-Flash is a 284B model; to fit a 128GB box it has to drop to asymmetric Q2 (~80GB). Sounds like suicide quantization — but it's surgical: it only quantizes the high-volume, low-impact routed experts and keeps attention / shared experts / output / embedding at Q8/F16. So "2-bit" only touches the layers that barely affect quality. As a daily agent it ran 280 turns with zero degradation: fluent output, correct technical reasoning, well-formed tool calls. The weights are big enough that 2-bit doesn't crush them. (It's also uncensored, by the way.)

The whole post in one image
The short version
DeepSeek-V4-Flash is a 284B model that normally needs a server rack. To fit it on the 128GB box on my desk, you quantize — shave the precision of every weight. Dropping to 2-bit (Q2) sounds terrifying, like squeezing a book's text past legibility. But this quantization is smart: it only crushes the parts that are bulky but barely matter, and keeps the parts that decide how smart it is (attention, reasoning, output) at high precision. Result: it fits the small box, but the brain is still the 284B brain. I ran it as my daily assistant for 280 heavy turns and it never dropped the ball. This is the case for why a big enough model wins even when it's quantized hard.
Fitting 284B into 128GB: is 2-bit still usable?
antirez (the Redis one), who wrote the ds4 engine, opens his README with why: "Very capable open weight models finally exist. DeepSeek v4 Flash feels quasi-frontier." Capable open-weight models finally exist, and DSv4-Flash is close to frontier-level.
But quasi-frontier means big. On a single 128GB DGX Spark, a 284B has exactly one path: quantize hard enough to fit. Asymmetric Q2, ~80GB.
This is part 3 of the series. Part 1 got it running, Part 2 tuned it into a comfortable agent. This post answers the question I doubted from the start: cut this hard to 2-bit, how much is left? More than I expected.
Why 2-bit didn't make it dumb: a scalpel, not an axe
"2-bit" is a scary number, but it doesn't tell you which layers. DeepSeek-V4-Flash's quantization is asymmetric:
| Layer | Precision |
|---|---|
| routed ffn gate/up experts | IQ2_XXS |
| routed ffn down experts | Q2_K |
| shared experts / attention / output head | Q8_0 |
| embedding / router / indexer | F16 |
Everything at 2-bit is a routed expert — they're most of the model's volume, but each only processes a tiny slice of tokens the router sends its way, so each one's impact on overall quality is limited.
The layers that actually decide quality — attention (how it reads context), shared experts (every token passes through), output head (how it picks words), embedding — all stay at Q8 or F16. This is a scalpel, not an axe. It precisely quantizes only the layers that barely affect quality and keeps the ones that decide it at high precision, so the quality hit stays as small as possible. That's why a model that sounds like it got cut in half doesn't feel like it. (The same precision table explains in Part 1 why tool calls survive — the format layers are on the high-precision side too.)
The evidence: 280 turns, zero degradation, no benchmark
Mechanism isn't enough; I want evidence. I didn't run a synthetic benchmark like TMMLU+. I did something else: wired it into the agent I use every day and read the actual production logs.
I read nearly a week of logs end to end — about 30 hours, 280 heavy turns, averaging 6.7 tool calls each:
- Fluent, natural output in its target language, zero leakage from the base model's other locale.
- Correct technical reasoning — it correctly judged whether a GTX 970's VRAM could run a given model, what unified memory is, estimated tok/s — all right.
- Well-formed tool calls, escaping multi-line Python parameters inside nested structures correctly.
- Zero degradation: 280 heavy turns, no runaways, no repeat loops, no broken JSON.
Despite being a "2-bit" model, it ran 280 heavy agentic turns without dropping the ball. I trust a week of 280 real working turns over one synthetic score.
A side note: it's also an unlocked model
This build is huihui's abliterated one — the refusal direction removed. Ask it something sensitive and it doesn't hard-refuse; it clears the censorship filter.
I'm not going to dig into politics here (that's a separate rabbit hole). The point is only this: when you combine a big enough architecture, quantization that fits, and an uncensored base, you've got a quasi-frontier model on your own desk that's entirely yours. No API, no quota, no one on the other end deciding what it's allowed to say.
Weights win
Back to the title. If both have to fit the same box, you're picking between a high-precision small model and a big one cut to 2-bit.
I take the latter. A 284B keeps the world it's seen and the reasoning depth it learned even when quantized hard; a small model at high precision still has a small ceiling. I'll take the slower decode (~15 tok/s) for a much higher quality floor.
Weights win — on a small box, scale is the cushion you can't crush out. antirez bet that open weights finally got good enough, and he was right; "good enough" means even cut to 2-bit, it still holds.
Takeaways
Where the time actually went
Not the tech — it was convincing myself that 2-bit isn't garbage. I shared the common prior: Q2 = halved = probably dumb. Overturning that prior didn't come from a benchmark score; it came from going back through those 280 turns of production logs, line by line, confirming it hadn't actually broken. The expensive part is rarely running the experiment — it's dropping the assumption.
Debugging takeaways
Don't judge a quant by its bit count — look at which layers got cut. Labels like "Q2" or "4-bit" only report the average, not the distribution. Asymmetric quantization can keep the core layers high-precision and only cut the redundant ones; same "Q2," but cutting the right place versus the wrong place is night and day. To judge whether a quant is usable, ask what it kept.
The one-liner
Bit counts lie. "2-bit" sounds like halving the model, but cutting routed experts ≠ cutting its reasoning. A big enough model has plenty of redundancy it can afford to cut; cut the redundancy, keep the core, and you get a model that fits the small box without losing its mind. Scale gives you capability — and the room to be compressed.
TL;DR
- Weights win: 284B at asymmetric Q2 (~80GB) fits 128GB, and quality holds up.
- "2-bit" is asymmetric: it surgically quantizes only the low-impact routed experts and keeps attention / shared / output / embedding at Q8/F16.
- Evidence is production logs: 280 turns, zero degradation (fluent / correct reasoning / clean tool calls), not a synthetic benchmark.
- Side note: abliterated, uncensored, clears the filter.
- Picking a model: given it fits, a heavily-quantized big model beats a high-precision small one. Scale is the cushion you can't crush out.
Also in this series: Part 1 — Running a 284B on a 128GB box, then blaming the wrong thing for broken tool calls · Part 2 — Running a 15 tok/s 284B as a daily agent brain: the settings that make it comfortable
FAQ
- Can you quantize a 284B model to 2-bit and still have usable quality?
- Yes, surprisingly well — the key is asymmetric Q2. Only the routed experts (huge in volume, each handling a tiny slice of tokens) drop to 2-bit; the layers that decide quality (attention, shared experts, output head) stay at Q8, with embedding/router at F16. It's a smart quantization that only touches the layers that barely affect quality. I ran it as a daily agent for 280 turns with zero degradation.
- Why quantize DeepSeek-V4-Flash to Q2 at all?
- To fit a single 128GB DGX Spark. A 284B model at full precision needs a server rack; asymmetric Q2 squeezes it to ~80GB, which just fits in 128GB of unified memory. The cost is slow decode (~15 tok/s), but the quality holds up better than the bit count suggests.
- A Q2 284B vs a higher-precision small model — which is better?
- My experience says the big one wins. A large enough model (284B) keeps what it has seen and the reasoning depth it learned even at 2-bit; a small model at high precision still has a small ceiling. Given both fit the same box, I'd rather have a heavily-quantized big model. That's the 'weights win' argument — scale is the cushion you can't crush out.
- Is this abliterated DeepSeek-V4-Flash censored?
- No. huihui's abliterated build removes the refusal direction, so it doesn't hard-refuse sensitive prompts and clears the censorship filter. This post isn't about that — just noting it's an unlocked model.