~/blog/tmmluplus-qwen-abliterated-cost

DGX Spark · part 22

[Benchmark] Abliteration Costs 1.85pp on Traditional Chinese — and 7.7pp on Trust Law

cat --toc

TL;DR

Same harness, same DGX Spark, same 22,690 questions as Part 21, this time on the abliterated Qwen 3.6 35B. Aggregate 75.07% → 73.22%, a −1.85pp drop. The cost isn't uniform: regulatory subjects bleed (trust_practice −7.7, administrative_law −7.1, anti_money_laundering −6.7) while pure logic actually improves (logic_reasoning +2.9, junior_math_exam +1.7). Hokkien got worse, not better; abliteration doesn't fix data scarcity. Even after the loss, the abliterated Qwen still beats Gemma by +26.92pp on Traditional Chinese.

Plain-Language Version: What does removing the safety filter cost?

Part 21 showed that Qwen 3.6 35B beats Gemma 4 26B by 28 points on a Traditional Chinese benchmark. But the original Qwen still hedges and refuses on certain prompts — useful for daily writing, annoying for fiction, security research, or just having a frank conversation.

The community workaround is abliteration: a surgical weight modification that removes the model's tendency to refuse, without retraining. A widely used variant is from a team called huihui-ai.

Question: how much capability does abliteration cost? I ran the same Traditional Chinese benchmark on the abliterated version. Result: only about 2 points lost overall, but the loss is not uniform. Subjects about trust law, administrative law, and money laundering lost 6–8 points. Subjects about pure reasoning actually gained 1–3 points. One way to read this: abliteration removes a "default to the cautious, regulated answer" tendency, which hurts in compliance subjects but is harmless or helpful in pure logic. (The benchmark pattern is consistent with that reading; it doesn't prove it.)

The one thing abliteration didn't fix: Taiwanese Hokkien. It got worse, not better. The Hokkien blind spot was never about safety filtering — it's about training data not existing.


Preface

Part 21 compared Gemma 4 26B and Qwen 3.6 35B on TMMLU+. Qwen swept 51-of-51 subjects with a 28.77pp aggregate lead. Conclusion: pick Qwen for any Traditional Chinese work.

But "Qwen" in that context was the original Qwen, with all its alignment hedging intact. For practical local-LLM use — fiction, sharp feedback, security research, frank chat — most people reach for an abliterated variant. The natural follow-up question: what does abliteration cost in measurable accuracy?

Same harness, same machine, same 22,690 questions. One model swap.


Setup: same as Part 21, with one substitution

Hardware:    NVIDIA GB10 (DGX Spark)
Harness:     lm-evaluation-harness
Backend:     local-completions API hitting vLLM
Few-shot:    5
Concurrency: 8
Dataset:     ikala/tmmluplus
Model:       huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated (BF16, 67 GB)
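
For anyone reproducing this, the invocation looks roughly like the sketch below, using lm-eval's Python API. The base_url and the task/group name are assumptions about a local setup, not copied from my run config; adjust them to your vLLM server and harness checkout.

    # Minimal sketch of the run via lm-eval's Python API. Assumes a vLLM
    # OpenAI-compatible server is already serving the model; base_url and
    # the exact task/group name are assumptions, not my saved run config.
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="local-completions",
        model_args=(
            "model=huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated,"
            "base_url=http://localhost:8000/v1/completions,"
            "num_concurrent=8"
        ),
        tasks=["tmmluplus"],  # TMMLU+ group name as registered in the harness
        num_fewshot=5,
    )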

The abliterated checkpoint ships in BF16 only; huihui-ai publishes no FP8 release. I tried to convert it to FP8 myself for an apples-to-apples precision comparison with the Part 21 Qwen FP8 baseline, but llmcompressor pins transformers ≤ 4.57.6, while the qwen3_5_moe architecture first appears in transformers v5.2.0 (it's absent in v5.0.0 and v5.1.0). In other words, the model requires a transformers newer than llmcompressor allows. I documented the toolchain conflict so the next run can monkey-patch around it; for this run, BF16 it is.

This means the −1.85pp delta below may include both abliteration impact and a precision difference (FP8 → BF16) I haven't separately measured on this model. I'd estimate the precision component is small relative to 1.85pp, but I'm not citing a number.


Headline: −1.85pp aggregate, abliterated still beats Gemma by +26.92pp

Model                            TMMLU+ Aggregate
Gemma 4 26B-A4B FP8              46.30%
Qwen 3.6 35B FP8 (original)      75.07%
Qwen 3.6 35B abliterated BF16    73.22%
Δ from abliteration              −1.85pp
Abliterated vs Gemma             +26.92pp

If you were planning to deploy abliterated Qwen as a daily Traditional Chinese model, this is the answer: you give up under 2 percentage points of TMMLU+ accuracy in exchange for a model that doesn't refuse. The trade is favorable for any non-benchmark workload.


Where it hurts: regulatory and authority subjects bleed 5–8pp

Of 51 paired subjects, 38 got worse, 8 got better, and 5 were unchanged. Mean Δ = −1.97pp, stdev 2.29pp.
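
Those counts come from diffing the two harness result files. A minimal sketch of that diff, assuming lm-eval's usual results JSON layout (the "acc,none" metric key) and illustrative file names:

    import json
    from statistics import mean, stdev

    # Load per-subject accuracy (as percentages) from an lm-eval results file.
    def subject_accs(path):
        with open(path) as f:
            results = json.load(f)["results"]
        return {task: v["acc,none"] * 100
                for task, v in results.items() if "acc,none" in v}

    base = subject_accs("qwen-fp8-original.json")
    abl  = subject_accs("qwen-bf16-abliterated.json")

    deltas = {t: abl[t] - base[t] for t in base if t in abl}
    worse  = sum(d < 0 for d in deltas.values())
    better = sum(d > 0 for d in deltas.values())
    same   = sum(d == 0 for d in deltas.values())

    print(f"{worse} worse, {better} better, {same} unchanged")
    print(f"mean Δ = {mean(deltas.values()):+.2f}pp, stdev {stdev(deltas.values()):.2f}")
    for task, d in sorted(deltas.items(), key=lambda kv: kv[1])[:5]:
        print(f"  {task}: {d:+.1f}pp")  # the five most degraded subjects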

Top 5 most degraded by abliteration:

Subject                          Original    Abliterated    Δ
trust_practice (信託實務)        62.1        54.4           −7.7
administrative_law (行政法)      71.2        64.0           −7.1
anti_money_laundering (反洗錢)   83.6        76.9           −6.7
jce_humanities (人文國考)        83.3        77.8           −5.6
real_estate (不動產實務)         58.7        53.3           −5.4

The pattern is consistent: subjects where "the correct answer is the conservative / regulated / legal one" lose disproportionately. Abliteration removes the model's tendency to default toward the safe-and-rule-following answer. In a multiple-choice exam where the regulator-approved answer is the right one, that tendency was actually load-bearing.

This is a real cost, not a curiosity. If you're using abliterated Qwen for compliance, legal, or financial advice tasks, expect lower reliability than the original.


Where it doesn't hurt: pure reasoning improves slightly

Top 5 subjects that improved (or didn't change):

Subject                Original    Abliterated    Δ
logic_reasoning        46.8        49.6           +2.9
junior_math_exam       56.0        57.7           +1.7
junior_chinese_exam    90.9        92.0           +1.1
advance_chemistry      71.5        72.4           +0.8
physical_education     73.2        73.7           +0.6

These aren't huge gains — within stderr — but the direction is consistent across five subjects. One speculative reading: abliteration also strips a small amount of mid-thought hedging ("on the other hand…", "however, one should consider…"), and pure reasoning subjects benefit from cleaner inference paths. Less defensible than the regulatory-bleed finding — treat it as a hypothesis, not a result.

For pure-logic agent work or math tutoring, abliterated Qwen is roughly comparable to the original in this run.


Hokkien blind spot: still there, slightly worse

Model                          Hokkien acc
Gemma 4 26B FP8                32.56%
Qwen 3.6 35B FP8 (original)    41.86%
Qwen 3.6 35B abliterated       37.21% (−4.65pp)

Random baseline = 25%.

Abliteration didn't help here; it slightly hurt. This is consistent with the Part 21 conclusion: the Hokkien gap is a data-scarcity problem, not a safety-filter problem. Abliteration perturbs the weights to remove refusals, but on a domain the model never learned (the public Hokkien corpus totals hundreds of MB), any perturbation is just noise on top of guessing.
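
A quick error-bar check supports the noise framing. The subject size below is hypothetical (I'm not quoting the actual Hokkien question count); the point is that small subjects carry wide error bars:

    from math import sqrt

    # Standard error of a binomial proportion: sqrt(p(1-p)/n).
    def acc_stderr(p, n):
        return sqrt(p * (1 - p) / n)

    n = 50  # hypothetical subject size, NOT the actual Hokkien question count
    for p in (0.4186, 0.3721):
        print(f"acc {p:.2%} ± {acc_stderr(p, n):.2%} at n={n}")

    # At n around 50, one stderr is roughly ±7pp, so a −4.65pp swing sits
    # comfortably inside the noise band for a subject this small.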

Takeaway: if "uncensored" was your hope for unlocking Hokkien capability, this run says no. Use Yentinglin's Llama-3-Taiwan-70B-Instruct for Hokkien, or accept the gap.


What Was Gained

What cost the most time

The FP8 quantization toolchain. I tried to convert huihui-ai's BF16 abliterated weights to FP8 so the comparison would be apples-to-apples with the Part 21 Qwen FP8 baseline. llmcompressor (the standard tool) pins transformers ≤ 4.57.6 in setup.py. But qwen3_5_moe config first lands in transformers v5.2.0 (absent in v5.0.0 / v5.1.0). On top of that, transformers v5 removed the use_auth_token argument that llmcompressor's entrypoints/utils.py still passes (PR #41666). Three different version combinations failed for two distinct reasons. The fix path: fork llmcompressor, swap use_auth_token for token, install editable, then install transformers ≥ 5.2.0. About an hour of toolchain debugging that produced no FP8 weights this time.
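
If you'd rather not fork, a runtime shim is the lighter path. This is an untested sketch under the assumptions noted in the comments:

    # Untested sketch of a runtime shim, as an alternative to the fork.
    # Assumes transformers >= 5.2.0 was installed over llmcompressor's pin
    # (e.g. with pip's --no-deps) and that this runs before any
    # from_pretrained call llmcompressor makes.
    import transformers

    def accept_use_auth_token(cls):
        original = cls.from_pretrained.__func__

        def patched(klass, *args, **kwargs):
            # transformers v5 renamed use_auth_token -> token
            if "use_auth_token" in kwargs:
                kwargs.setdefault("token", kwargs.pop("use_auth_token"))
            return original(klass, *args, **kwargs)

        cls.from_pretrained = classmethod(patched)

    for cls in (transformers.AutoModelForCausalLM, transformers.AutoTokenizer):
        accept_use_auth_token(cls)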

The pragmatic call: run BF16, accept the precision confound, get the data. Whatever the precision cost is, the per-subject pattern (regulatory loss, logic gain) is too uneven to be explained by a uniform precision drop.

Transferable diagnostic: per-subject Δ tells you what was removed

Aggregate scores tell you "is this still a useful model" (yes, 73% vs 75%). Per-subject Δ tells you what kind of capability the modification correlates with. Regulatory subjects bleeding 6–8pp while pure logic gains 1–3pp is consistent with abliteration touching a deference-to-rules direction more than a general-intelligence direction — but this is one paired benchmark, not proof. The framing lines up with the refusal-direction paper commonly cited by abliteration implementations, which models refusal as a single direction in the residual stream.
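
For intuition, here is a toy sketch of that single-direction model: take the difference of mean residual-stream activations between refused and complied prompts, then project that direction out of weights that write into the residual stream. Everything here is illustrative; it's the paper's framing, not huihui-ai's actual pipeline.

    import torch

    # Toy illustration of the single-direction framing. The activation
    # tensors are stand-ins (n_prompts x d_model); nothing here touches
    # the actual Qwen weights.
    def refusal_direction(h_refused, h_complied):
        # difference of means between the two activation sets, unit-normalized
        d = h_refused.mean(dim=0) - h_complied.mean(dim=0)
        return d / d.norm()

    def orthogonalize(W, d):
        # remove the component of W's output that writes along d:
        # W <- W - d d^T W, for a W that writes into the residual stream
        return W - torch.outer(d, d) @ W

    d_model = 8
    W = torch.randn(d_model, d_model)
    d = refusal_direction(torch.randn(16, d_model), torch.randn(16, d_model))
    W_abl = orthogonalize(W, d)
    print((d @ W_abl).norm())  # ~0: W_abl no longer writes along d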

If I'd only reported aggregate, I'd have lost the most interesting finding.

Universal pattern

When you modify a model, diff at the per-subject level, not just aggregate. The mean is a summary; the distribution is the story. The 1.85pp aggregate looks like noise. The 7.7pp drop on trust_practice is a signal.


Conclusion

If you're deploying abliterated Qwen 3.6 35B for daily Traditional Chinese work in April 2026:

  1. For general writing, chat, fiction, blunt feedback — go ahead. The 1.85pp aggregate cost is invisible in practice and you get a model that doesn't hedge.
  2. For compliance, legal, financial advice — use the original Qwen. The −7.7pp on trust law and −7.1pp on administrative law is a real reliability hit.
  3. For pure reasoning / math — abliterated is roughly comparable to the original in this run, possibly a hair better. Don't expect a big gap either way.
  4. For Hokkien — neither helps. Use Yentinglin's Llama-3-Taiwan-70B.
  5. If you want a clean apples-to-apples comparison — quantize abliterated to FP8 yourself (monkey-patch llmcompressor; ~1 hour of toolchain work) and re-run. That removes the precision confound from this analysis.

Run config and the full 51-subject Δ table are in ~/tmmlu-runs/qwen-abl-bf16-full/ on my GX10. The whole abliterated run took about 5 hours wall-clock at BF16; an FP8 run would be ~3 hours.


FAQ

What is abliteration and why use it?
Abliteration is a weight-modification technique that suppresses a model's refusal behavior by neutralizing the projection direction associated with refusing. It's applied after training, requires no fine-tuning dataset, and produces a model that follows instructions instead of declining them. It's used for fiction writing, blunt feedback, security research, and chat where alignment hedging gets in the way.

Did abliteration tank the model's Traditional Chinese ability?
No. Aggregate TMMLU+ dropped from 75.07% to 73.22%, a 1.85-percentage-point loss. Even after abliteration, the model still beats Gemma 4 26B-A4B by 26.92 percentage points on Traditional Chinese. The cost is real but smaller than the difference between models.

Where does abliteration cost the most?
Regulatory and authority-laden subjects: trust_practice (−7.7pp), administrative_law (−7.1pp), anti_money_laundering (−6.7pp). These are subjects where "the right answer is the conservative legal one," so removing the model's tendency to default to the conservative answer hurts disproportionately.

Did anything improve after abliteration?
Yes. Pure reasoning and math subjects went up slightly: logic_reasoning (+2.9pp), junior_math_exam (+1.7pp), junior_chinese_exam (+1.1pp). One interpretation: abliteration removes mid-thought hedging, so subjects that benefit from confident reasoning gain a tiny bit.

Why did Taiwanese Hokkien get worse, not better?
Hokkien dropped from 41.86% to 37.21% (−4.65pp). The blind spot has nothing to do with safety filtering; it's data scarcity (no standardized written form, tiny public corpus). Abliteration slightly perturbs the weights, and on a domain the model never learned, perturbation is pure noise. There's no "unlock" to find.