DGX Spark · part 22
[Benchmark] Abliteration Costs 1.85pp on Traditional Chinese — and 7.7pp on Trust Law
❯ cat --toc
- Plain-Language Version: What does removing the safety filter cost?
- Preface
- Setup: same as Part 21, with one substitution
- Headline: −1.85pp aggregate, abliterated still beats Gemma by +26.92pp
- Where it hurts: regulatory and authority subjects bleed 5–8pp
- Where it doesn't hurt: pure reasoning improves slightly
- Hokkien blind spot: still there, slightly worse
- What Was Gained
- What cost the most time
- Transferable diagnostic: per-subject Δ tells you what was removed
- Universal pattern
- Conclusion
TL;DR
Same harness, same DGX Spark, same 22,690 questions as Part 21 — this time on the abliterated Qwen 3.6 35B. Aggregate 75.07% → 73.22%, −1.85pp. The cost isn't uniform: regulatory subjects bleed (信託 −7.7, 行政法 −7.1, 反洗錢 −6.7), pure logic actually improves (+2.9 logic_reasoning, +1.7 junior_math). Hokkien got worse, not better — abliteration doesn't fix data scarcity. Even after the loss, the abliterated Qwen still beats Gemma by +26.92pp on Traditional Chinese.
Plain-Language Version: What does removing the safety filter cost?
Part 21 showed that Qwen 3.6 35B beats Gemma 4 26B by 28 points on a Traditional Chinese benchmark. But the original Qwen still hedges and refuses on certain prompts — useful for daily writing, annoying for fiction, security research, or just having a frank conversation.
The community workaround is abliteration: a surgical weight modification that removes the model's tendency to refuse, without retraining. A widely used variant is from a team called huihui-ai.
Question: how much capability does abliteration cost? I ran the same Traditional Chinese benchmark on the abliterated version. Result: only 2 points lost overall — but the loss is not uniform. Subjects about trust law, administrative law, and money laundering lost 6–8 points. Subjects about pure reasoning actually gained 1–3 points. One way to read this: abliteration removes a "default to the cautious / regulated answer" tendency, which hurts in compliance subjects but is harmless or helpful in pure logic. (The benchmark pattern is consistent with that reading; it doesn't prove it.)
The one thing abliteration didn't fix: Taiwanese Hokkien. It got worse, not better. The Hokkien blind spot was never about safety filtering — it's about training data not existing.
Preface
Part 21 compared Gemma 4 26B and Qwen 3.6 35B on TMMLU+. Qwen swept 51-of-51 subjects with a 28.77pp aggregate lead. Conclusion: pick Qwen for any Traditional Chinese work.
But "Qwen" in that context was the original Qwen, with all its alignment hedging intact. For practical local-LLM use — fiction, sharp feedback, security research, frank chat — most people reach for an abliterated variant. The natural follow-up question: what does abliteration cost in measurable accuracy?
Same harness, same machine, same 22,690 questions. One model swap.
Setup: same as Part 21, with one substitution
Hardware: NVIDIA GB10 (DGX Spark)
Harness: lm-evaluation-harness
Backend: local-completions API hitting vLLM
Few-shot: 5
Concurrency: 8
Dataset: ikala/tmmluplus
Model: huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated (BF16, 67 GB)
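For reference, a sketch of the invocation through the harness's Python entry point, assuming a vLLM OpenAI-compatible server is already serving the checkpoint on localhost:8000. The `tmmluplus` task-group name and the `local-completions` model_args follow lm-evaluation-harness conventions and may not match the exact Part 21 run config byte for byte.

```python
import lm_eval

# Assumes vLLM is already serving the abliterated checkpoint at http://localhost:8000/v1.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated,"
        "base_url=http://localhost:8000/v1/completions,"
        "num_concurrent=8"                 # matches the concurrency used in this run
    ),
    tasks=["tmmluplus"],                   # TMMLU+ task group name in the harness (assumed)
    num_fewshot=5,                         # 5-shot, same as Part 21
)

# Per-subject accuracies land under results["results"]; the aggregate is computed from these.
for task, metrics in sorted(results["results"].items()):
    print(task, metrics.get("acc,none"))
```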
The abliterated checkpoint ships in BF16 — no FP8 release from huihui-ai. I tried to convert it to FP8 myself for an apples-to-apples precision comparison with the Part 21 Qwen FP8 baseline, but llmcompressor pins transformers ≤ 4.57.6, while qwen3_5_moe first appears in transformers v5.2.0 (absent in v5.0.0 and v5.1.0). Newer than llmcompressor allows, in other words. I documented the toolchain conflict so the next run can monkey-patch around it; for this run, BF16 it is.
This means the −1.85pp delta below may include both abliteration impact and a precision difference (FP8 → BF16) I haven't separately measured on this model. I'd estimate the precision component is small relative to 1.85pp, but I'm not citing a number.
Headline: −1.85pp aggregate, abliterated still beats Gemma by +26.92pp
| Model | TMMLU+ Aggregate |
|---|---|
| Gemma 4 26B-A4B FP8 | 46.30% |
| Qwen 3.6 35B FP8 (original) | 75.07% |
| Qwen 3.6 35B abliterated BF16 | 73.22% |
| Δ from abliteration | −1.85pp |
| Abliterated vs Gemma | +26.92pp |
If you were planning to deploy abliterated Qwen as a daily Traditional Chinese model, this is the answer: you give up under 2 percentage points of TMMLU+ accuracy in exchange for a model that doesn't refuse. The trade is favorable for most non-benchmark workloads; the per-subject breakdown below shows where it isn't.
Where it hurts: regulatory and authority subjects bleed 5–8pp
Of 51 paired subjects, 38 got worse, 8 got better, 5 unchanged. Mean Δ = −1.97pp, stdev 2.29.
Top 5 most degraded by abliteration:
| Subject | Original | Abliterated | Δ |
|---|---|---|---|
| trust_practice (信託實務) | 62.1 | 54.4 | −7.7 |
| administrative_law (行政法) | 71.2 | 64.0 | −7.1 |
| anti_money_laundering (反洗錢) | 83.6 | 76.9 | −6.7 |
| jce_humanities (人文國考) | 83.3 | 77.8 | −5.6 |
| real_estate (不動產實務) | 58.7 | 53.3 | −5.4 |
The pattern is consistent: subjects where "the correct answer is the conservative / regulated / legal one" lose disproportionately. Abliteration removes the model's tendency to default toward the safe-and-rule-following answer. In a multiple-choice exam where the regulator-approved answer is the right one, that tendency was actually load-bearing.
This is a real cost, not a curiosity. If you're using abliterated Qwen for compliance, legal, or financial advice tasks, expect lower reliability than the original.
Where it doesn't hurt: pure reasoning improves slightly
Top 5 subjects that improved:
| Subject | Original | Abliterated | Δ |
|---|---|---|---|
| logic_reasoning | 46.8 | 49.6 | +2.9 |
| junior_math_exam | 56.0 | 57.7 | +1.7 |
| junior_chinese_exam | 90.9 | 92.0 | +1.1 |
| advance_chemistry | 71.5 | 72.4 | +0.8 |
| physical_education | 73.2 | 73.7 | +0.6 |
These aren't huge gains — within stderr — but the direction is consistent across five subjects. One speculative reading: abliteration also strips a small amount of mid-thought hedging ("on the other hand…", "however, one should consider…"), and pure reasoning subjects benefit from cleaner inference paths. Less defensible than the regulatory-bleed finding — treat it as a hypothesis, not a result.
For pure-logic agent work or math tutoring, abliterated Qwen is roughly comparable to the original in this run.
Hokkien blind spot: still there, slightly worse
| Model | Hokkien acc |
|---|---|
| Gemma 4 26B FP8 | 32.56% |
| Qwen 3.6 35B FP8 (original) | 41.86% |
| Qwen 3.6 35B abliterated | 37.21% (−4.65pp) |
Random baseline = 25%.
Abliteration didn't help here; it slightly hurt. This is consistent with the Part 21 conclusion: the Hokkien gap is a data scarcity problem, not a safety-filter problem. Abliteration perturbs the weights to remove refusals, but on a domain the model never learned (because the public Hokkien corpus is hundreds of MB total), any perturbation is just noise on top of guessing.
Takeaway: if "uncensored" was your hope for unlocking Hokkien capability, this run says no. Use Yentinglin's Llama-3-Taiwan-70B-Instruct for Hokkien, or accept the gap.
What Was Gained
Three things, mainly: a measured number for the abliteration tax (−1.85pp aggregate), a per-subject map of where that tax lands (regulatory subjects bleed, pure logic doesn't), and a documented llmcompressor/transformers version conflict to work around on the next FP8 attempt.
What cost the most time
The FP8 quantization toolchain. I tried to convert huihui-ai's BF16 abliterated weights to FP8 so the comparison would be apples-to-apples with the Part 21 Qwen FP8 baseline. llmcompressor (the standard tool) pins transformers ≤ 4.57.6 in setup.py. But qwen3_5_moe config first lands in transformers v5.2.0 (absent in v5.0.0 / v5.1.0). On top of that, transformers v5 removed the use_auth_token argument that llmcompressor's entrypoints/utils.py still passes (PR #41666). Three different version combinations failed for two distinct reasons. The fix path: fork llmcompressor, swap use_auth_token for token, install editable, then install transformers ≥ 5.2.0. About an hour of toolchain debugging that produced no FP8 weights this time.
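For the record, here is a minimal sketch of the monkey-patch route (the alternative to forking): rename the legacy use_auth_token kwarg to token at the from_pretrained boundary, so an unmodified llmcompressor can run against transformers ≥ 5.2.0. It assumes the kwarg rename is the only v5 incompatibility on the load path, which this run did not get far enough to verify.

```python
import functools
import transformers

def _tolerate_use_auth_token(bound_from_pretrained):
    """Wrap a from_pretrained classmethod so a legacy use_auth_token kwarg becomes token."""
    orig = bound_from_pretrained.__func__  # unwrap the bound classmethod

    @functools.wraps(orig)
    def wrapper(cls, *args, **kwargs):
        if "use_auth_token" in kwargs:
            kwargs.setdefault("token", kwargs.pop("use_auth_token"))
        return orig(cls, *args, **kwargs)

    return classmethod(wrapper)

# Patch the auto classes that get hit when the checkpoint is loaded.
for auto_cls in (transformers.AutoModelForCausalLM,
                 transformers.AutoTokenizer,
                 transformers.AutoConfig):
    auto_cls.from_pretrained = _tolerate_use_auth_token(auto_cls.from_pretrained)
```

Whether this shim is enough depends on what else in llmcompressor touches removed v5 APIs; the fork-and-edit route above sidesteps that question.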
The pragmatic call: run BF16, accept the precision confound, get the data. Whatever the precision cost is, the per-subject pattern (regulatory loss, logic gain) is too uneven to be explained by a uniform precision drop.
Transferable diagnostic: per-subject Δ tells you what was removed
Aggregate scores tell you "is this still a useful model" (yes, 73% vs 75%). Per-subject Δ tells you what kind of capability the modification correlates with. Regulatory subjects bleeding 6–8pp while pure logic gains 1–3pp is consistent with abliteration touching a deference-to-rules direction more than a general-intelligence direction — but this is one paired benchmark, not proof. The framing lines up with the refusal-direction paper commonly cited by abliteration implementations, which models refusal as a single direction in the residual stream.
If I'd only reported aggregate, I'd have lost the most interesting finding.
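For concreteness, a toy sketch of the operation that refusal-direction framing implies: estimate a direction from mean activation differences between refusal-triggering and neutral prompts, then project it out of the matrices that write into the residual stream. This illustrates the published technique in general, not huihui-ai's actual recipe, whose layer choices and direction estimation aren't documented here.

```python
import numpy as np

def estimate_refusal_direction(refused_acts: np.ndarray, neutral_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction over (n_prompts, d_model) residual-stream activations."""
    r = refused_acts.mean(axis=0) - neutral_acts.mean(axis=0)
    return r / np.linalg.norm(r)

def ablate_direction(W_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Orthogonalize a (d_model, d_in) output-projection matrix against r_hat.

    After W' = (I - r_hat r_hat^T) W_out, the layer can no longer write along the
    refusal direction, which is the mechanism abliteration relies on.
    """
    return W_out - np.outer(r_hat, r_hat @ W_out)
```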
Universal pattern
When you modify a model, diff at the per-subject level, not just aggregate. The mean is a summary; the distribution is the story. The 1.85pp aggregate looks like noise. The 7.7pp drop on trust_practice is a signal.
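A minimal sketch of that per-subject diff, assuming both runs were saved as lm-eval results JSON. The file names are placeholders, and the "acc,none" metric key follows recent harness output, so adjust for your own files.

```python
import json

def per_subject_acc(path: str) -> dict[str, float]:
    """Pull per-task accuracy out of an lm-eval results file (metric key layout assumed)."""
    with open(path) as f:
        results = json.load(f)["results"]
    return {task: m["acc,none"] for task, m in results.items() if "acc,none" in m}

base = per_subject_acc("results_original_fp8.json")       # placeholder file names
abl  = per_subject_acc("results_abliterated_bf16.json")

# Sort by delta so the bleeders and the gainers both sit at the ends of the list.
deltas = {t: (abl[t] - base[t]) * 100 for t in base.keys() & abl.keys()}
for task, d in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{task:45s} {d:+6.2f} pp")
```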
Conclusion
If you're deploying abliterated Qwen 3.6 35B for daily Traditional Chinese work in April 2026:
- For general writing, chat, fiction, blunt feedback — go ahead. The 1.85pp aggregate cost is invisible in practice and you get a model that doesn't hedge.
- For compliance, legal, financial advice — use the original Qwen. The −7.7pp on trust law and −7.1pp on administrative law are real reliability hits.
- For pure reasoning / math — abliterated is roughly comparable to the original in this run, possibly a hair better. Don't expect a big gap either way.
- For Hokkien — neither helps. Use Yentinglin's Llama-3-Taiwan-70B.
- If you want a clean apples-to-apples comparison — quantize abliterated to FP8 yourself (monkey-patch llmcompressor; ~1 hour of toolchain work) and re-run. That removes the precision confound from this analysis.
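Following up on that last bullet, here is roughly what the FP8 conversion looks like once the transformers conflict is out of the way. It follows llm-compressor's documented dynamic-FP8 flow; the import path, scheme name, and ignore list vary by llmcompressor version, and the MoE gate regex here is a guess for this architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot                  # `from llmcompressor import oneshot` on newer versions
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Dynamic FP8: weights quantized ahead of time, activation scales computed at runtime, no calibration set.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*gate$"],   # keep the LM head and (guessing) the MoE router gates unquantized
)

oneshot(model=model, recipe=recipe)

SAVE_DIR = "Huihui-Qwen3.6-35B-A3B-abliterated-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```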
Run config and the full 51-subject Δ table are in ~/tmmlu-runs/qwen-abl-bf16-full/ on my GX10. The whole abliterated run took about 5 hours wall-clock at BF16; an FP8 run would be ~3 hours.
FAQ
- What is abliteration and why use it?
- Abliteration is a weight-modification technique that suppresses a model's refusal behavior by neutralizing the projection direction associated with refusing. It's done after training, requires no fine-tune dataset, and produces a model that follows instructions instead of declining them. Used for fiction writing, blunt feedback, security research, and chat where alignment hedging gets in the way.
- Did abliteration tank the model's Traditional Chinese ability?
- No. Aggregate TMMLU+ dropped from 75.07% to 73.22%, a 1.85 percentage-point loss. Even after abliteration, the model still beats Gemma 4 26B-A4B by 26.92 percentage points on Traditional Chinese. The cost is real but smaller than the difference between models.
- Where does abliteration cost the most?
- Regulatory and authority-laden subjects: trust_practice (−7.7pp), administrative_law (−7.1pp), anti_money_laundering (−6.7pp). These are subjects where 'the right answer is the conservative legal one,' so removing the model's tendency to default to the conservative answer hurts disproportionately.
- Did anything improve after abliteration?
- Yes — pure reasoning and math subjects went up slightly: logic_reasoning (+2.9pp), junior_math_exam (+1.7pp), junior_chinese_exam (+1.1pp). One interpretation: abliteration removes mid-thought hedging, so subjects that benefit from confident reasoning gain a tiny bit.
- Why did Taiwanese Hokkien get worse, not better?
- Hokkien dropped from 41.86% to 37.21% (−4.65pp). The blind spot has nothing to do with safety filtering — it's data scarcity (no standardized written form, tiny public corpus). Abliteration just slightly perturbs the weights, and on a domain the model never learned, perturbation is pure noise. There's no 'unlock' to find.