AI Workflow · part 7
How a zh-TW Linter Found 128 Cases of Mainland-Chinese Drift in My Own Writing
❯ cat --toc
- What this article is actually about
- The trigger: GSC can't see "Mainland flavor"
- The tool: zhtw-mcp
- ⚠️ The first trap: don't run `--fix` on Markdown with frontmatter
- Workflow design: Surgical Python beats batch --fix
- Round 1: high-frequency single-word drift (84 subs / 42 files)
- Round 2: the borderline tier (36 subs)
- Round 3: same-character-different-meaning (8 subs)
- The real insight: blindspot is asymmetric
- The fix: convert vigilance into workflow
- Pre-publish lint hook
- Memory file as Claude context
- Codex zh-TW review (already in place)
- Takeaways
- What I'm not doing
- P.S. This very article got flagged ~50 times
- Related reading
TL;DR
sysprog21/zhtw-mcp (a Rust-based Traditional Chinese linter by jserv) ran across all 72 zh-TW articles on ai-muninn. Three sweeps, 128 cross-strait substitutions across 42 unique files. cross_strait issues dropped from 207 to 135. But the real takeaway wasn't the count — it was the distribution. The semantically dangerous tier (same characters, different meaning across straits) was the cleanest in my writing. The boring high-frequency single-word translations were where I'd default to the Mainland form without ever doubting myself. The blindspot isn't "I don't know" — it's "I don't auto-check."
What this article is actually about
Most articles about Traditional Chinese vs Simplified Chinese focus on the political angle, or on character conversion. This article isn't about either. It's about a more concrete bias: even when you natively know the local term, your training data — which for any reader of online tech content is overwhelmingly Mainland-Chinese — pushes you toward the wrong form by default.
LLMs have this problem too. There's a FAccT 2025 paper showing most models default to Mainland terminology when asked to write Taiwan-style. CC-100 has a 2.6:1 zh-CN to zh-TW ratio. Models trained on web Chinese reproduce that ratio in their output.
What I didn't expect: as a native zh-TW writer who's been doing this for decades, I have the same bias. I just didn't have a tool to catch it until now.
If you write any kind of bilingual technical content, or care about how training-data distribution influences your own outputs, this is a case study you can adapt.
The trigger: GSC can't see "Mainland flavor"
A few days earlier (2026-05-02), I was writing a deep technical post in zh-TW about LLM quantization on DGX Spark. I used the term 並發 (the Mainland term for concurrency) several times. Gemini 2.5 Pro flagged it. I argued back: "but Taiwan also uses this." I was wrong. Taiwan does not use 並發 for concurrency; that's Mainland-Chinese drift. The Taiwan term is 並行.
That single article got fixed. But the experience nagged at me: how many other articles have similar drift? Google Search Console can't tell me — "Mainland flavor" doesn't show up as a metric. Real readers feel it as subtle wrongness. I needed a corpus-level scan.
A few days later, sysprog21/zhtw-mcp showed up in a Telegram group I'm in. By jserv — a notable Taiwanese kernel/embedded engineer, instant trust. Zero second-guessing the tool's direction.
The tool: zhtw-mcp
zhtw-mcp enforces three official Taiwan standards:
- Revised Punctuation Handbook (full-width punctuation rules)
- Standard Form of National Characters (variant forms, e.g. 裏→裡, 着→著)
- Cross-strait vocabulary normalization, built on OpenCC's TWPhrases / TWVariants plus its own ruleset
1,100+ vocabulary rules and 15 casing rules compile into a single Rust binary. CLI mode for local linting; MCP server mode for AI assistants.
There are no pre-built binaries on the GitHub releases yet, so build from source:
git clone https://github.com/sysprog21/zhtw-mcp ~/Projects/zhtw-mcp
cd ~/Projects/zhtw-mcp
make # python codegen downloads OpenCC tables → cargo build, ~50s
CLI usage:
zhtw-mcp lint file.md # list issues
zhtw-mcp lint file.md --content-type markdown # skip code blocks
zhtw-mcp lint file.md --relaxed # loosen punctuation/grammar
zhtw-mcp lint file.md --fix --dry-run # preview fixes
Tight CLI design, sensible defaults. The MCP integration is well thought out too. Recommendation tier: high.
⚠️ The first trap: don't run --fix on Markdown with frontmatter
My first instinct was "lint, then --fix everything in batch." I tested on one article. Disaster:
zhtw-mcp by default converts ASCII quotes ("...") inside YAML frontmatter to corner brackets (「...」):
# Before (parses fine)
description: "Full local AI agent stack..."
# After --fix (YAML parser dies)
description: 「Full local AI agent stack...」
YAML doesn't recognize 「」 as string delimiters, so the whole file fails to parse and the article doesn't render. Running this on all 72 files would have taken the entire site down.
--fix also force-converts half-width punctuation throughout body text. ai-muninn's voice deliberately keeps half-width punctuation (, : ? ( )) in mixed Chinese-English technical writing — it reads more like code commentary than formal prose. Auto-fix would erase that.
Workflow design: Surgical Python beats batch --fix
Final design:
- Scan: `zhtw-mcp --format compact --relaxed` lists issues per file
- Filter: a Python script picks only `cross_strait` rule types (skips punctuation, translationese, etc.)
- Apply: the same script does its own substitutions, scoped to body text only: it explicitly skips YAML frontmatter, fenced code blocks, and inline backticks
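The scan-and-filter step can be sketched as a small parser over the compact report. This is a sketch only: the sample lines follow the `file:line:severity:rule:` shape that my grep filter assumes, not necessarily zhtw-mcp's verbatim output format.

```python
import re
from collections import Counter

# Illustrative compact-format lines; the real zhtw-mcp output may differ
SAMPLE = """a.mdx:12:W:cross_strait:數據 → 資料
a.mdx:30:W:punctuation:half-width comma
b.mdx:41:E:cross_strait:並發 → 並行"""

LINE = re.compile(r"^(?P<file>[^:]+):(?P<line>\d+):(?P<sev>[WE]):(?P<rule>[^:]+):")

def cross_strait_hits(report: str) -> Counter:
    """Count cross_strait findings per file, skipping all other rule types."""
    hits = Counter()
    for raw in report.splitlines():
        m = LINE.match(raw)
        if m and m.group("rule") == "cross_strait":
            hits[m.group("file")] += 1
    return hits
```

The per-file counts then drive which files the apply step touches at all.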
def split_sections(text):
    # Yields (kind, lines, start_index) for frontmatter / code / body chunks
    ...
def fix_text(text):
    out_lines = []
    for kind, chunk, start in split_sections(text):
        for line in chunk:
            if kind == 'body':
                # Apply cross_strait substitutions, but only outside `inline code`
                ...
            out_lines.append(line)
    return '\n'.join(out_lines)
Three layers of protection (frontmatter / code block / inline backtick) and explicit substitution dict that I can iterate per round.
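The third layer (inline backticks) is the easiest to get wrong. A minimal runnable version of the body-line pass, with an illustrative `SUBS` subset rather than the full substitution dict:

```python
import re

# Illustrative subset of the Round 1 substitution dict
SUBS = {"數據": "資料", "用戶": "使用者", "連接": "連線"}

_INLINE_CODE = re.compile(r"(`[^`]+`)")  # capture inline code spans

def fix_body_line(line: str) -> str:
    """Apply cross-strait substitutions, leaving `inline code` untouched."""
    parts = _INLINE_CODE.split(line)
    for i, part in enumerate(parts):
        if part.startswith("`"):  # captured code span: keep verbatim
            continue
        for cn, tw in SUBS.items():
            part = part.replace(cn, tw)
        parts[i] = part
    return "".join(parts)
```

Splitting on the captured group keeps code spans as their own list items, so only the prose segments get rewritten.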
Round 1: high-frequency single-word drift (84 subs / 42 files)
This round handled "single technical terms I default to the Mainland form on" — the boring high-frequency drift that any zh-TW reader notices instantly:
| Mainland → Taiwan | Hits | English | Why I drifted |
|---|---|---|---|
| 數據 → 資料 | 36 | data | Almost every tech article uses this |
| 用戶 → 使用者 | 9 | user | Direct translation of "user" |
| 連接 → 連線 | 7 | connection | Network connection |
| 發送 → 傳送 | 5 | send | Email/message context |
| 模板 → 範本 | 5 | template | Software templates |
| 導航 → 導覽 | 4 | navigation | UI navigation |
| 只讀 → 唯讀 | 4 | readonly | File permissions |
| 設備 → 裝置 | 3 | device | Hardware |
| 性能 → 效能 | 2 | performance | Benchmarking |
| 卸載 → 解除安裝 | 2 | uninstall | |
| 溢出 → 溢位 | 2 | overflow | Stack overflow, etc. |
| 在線 → 線上 | 2 | online | |
| 擴展 → 擴充 | 2 | extension | Software context |
| 緩存 → 快取 | 1 | cache | KV cache, etc. |
These 14 terms are now in my calque blindspot memory file — to be loaded as context whenever I open a Claude Code session for blog writing. The memory file previously had 5 terms (並發 family, 複現, 矽片, 內存, 視頻); after this sweep it's 19.
Round 2: the borderline tier (36 subs)
Two terms where both forms exist in zh-TW software writing, and which form to use is author preference rather than a correctness call:
- 優化 → 最佳化 (30 hits, "optimize")
- 算法 → 演算法 (6 hits, "algorithm")
The Ministry of Education prefers 最佳化 and 演算法, but 優化 and 算法 are widely used in Taiwan software circles too. I opted in because once I'd fixed the harder Round 1 drift, consistency with MoE forms reads cleaner.
算法 → 演算法 has a regex trap: a naive replace produces 演演算法, because 演算法 itself contains 算法 as a substring. You need a negative lookbehind:
pattern = re.compile(r'(?<!演)算法')
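A quick standalone check of both the trap and the guard (the sample string is illustrative):

```python
import re

text = "優化算法與演算法"  # one bare 算法, one already-correct 演算法

# Naive replace also rewrites the 算法 inside 演算法, producing 演演算法
naive = text.replace("算法", "演算法")
assert naive == "優化演算法與演演算法"

# The negative lookbehind skips any 算法 already preceded by 演
pattern = re.compile(r"(?<!演)算法")
guarded = pattern.sub("演算法", text)
assert guarded == "優化演算法與演算法"
```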
Other borderline terms I left untouched:
- 通過 / 透過 (through/via): both common
- 前綴 / 字首 (prefix): technical context, both fine
- 分配 / 配置 (allocate): programming context, both fine
- 版本號 / 版本號碼 (version number): 版本號 is universal in tech writing
- 消息 / 訊息: high-risk false-positive trap. 好消息 ("good news") and 收到消息 ("received word") are idiomatic zh-TW for news/info, not "message"; auto-substituting these would corrupt meaning. Explicitly excluded.
Round 3: same-character-different-meaning (8 subs)
This round tackled the most semantically dangerous tier — same characters, but the meaning is different (or even reversed) across straits:
| English | zh-CN | zh-TW | Note |
|---|---|---|---|
| concurrency | 並發 | 並行 | In zh-CN, 並行 = "parallel" — opposite mapping |
| parallel | 並行 | 平行 | Same characters, swapped meaning |
| process (OS) | 進程 | 行程 / 程序 | In zh-TW, 進程 = "progress," not OS process |
| file | 文件 | 檔案 | 文件 in zh-TW = "document," not "file" |
| document | 文檔 | 文件 | Mirror conflict |
| render | 渲染 | 算繪 | In zh-TW, 渲染 traditionally names an ink-wash painting technique, figuratively "to embellish/exaggerate" |
| traverse | 遍歷 | 走訪 | In zh-TW, 遍歷 is reserved for ergodic theory |
Surprise: this tier was almost entirely correct in my corpus. I wrote:
- 並行, 9 times: all in the correct zh-TW "concurrent" sense
- 平行, 2 times: correct zh-TW "parallel"
- 文件, 57 times: all correct zh-TW "document" (I consistently use 檔案 for "file")
- 檔案, 115 times: correct zh-TW "file"
- 行程, 4 times: all itinerary context ("Kyoto itinerary", "travel itinerary") or substring false positives ("幾十行程式碼" = "dozens of lines of code")
- 進程, 1 time: false positive (substring of "鑽進程式碼裡" = "dive into the code")
Only two real failures:
- 並發, 7 hits (clustered on "concurrent load", "theoretical concurrency 7.48x"): fixed to 並行
- 遍歷, 1 hit ("traverse `request.messages`"): fixed to 走訪
I deliberately left 渲染 (4 hits) untouched — it's used in software context, where 渲染 is widely accepted in Taiwan tech writing (Apple's zh-TW documentation uses it, for example), even though MoE prefers 算繪.
The real insight: blindspot is asymmetric
After fixing 128 substitutions, I stared at the distribution for five minutes:
| Tier | Hits fixed | Risk profile |
|---|---|---|
| Tier 1 (high-freq drift) | 84 | Largest — single-word direct translations |
| Tier 2 (borderline) | 36 | Medium — both forms exist, author preference |
| Tier 3 (semantic conflict) | 8 | Smallest — I wrote correctly by default |
The most semantically dangerous tier — where mistranslation actively confuses meaning — was the cleanest in my writing. The least dangerous tier — where readers just feel "this isn't quite Taiwan-flavored" — was where I drifted most.
Why?
When I'm typing a word like 並行 (concurrency in zh-TW, parallel in zh-CN) or 文件 (document in zh-TW, file in zh-CN), my brain auto-fires a "wait, this is a conflict word" alert. I'm consciously aware it has different meanings across straits, so I self-check.
When I'm typing 數據 or 用戶 or 連接, no alert fires. These feel like "neutral terms" — and my default training data (years of reading Mainland-dominated tech content online) supplies the Mainland form.
So my blindspot isn't "I don't know the Taiwan term" — I correctly wrote 並行 9 times, 檔案 115 times, 文件 (as document) 57 times. It's "when the Mainland term surfaces, I don't auto-doubt it". Self-check fires for danger words, not for the boring high-frequency drift.
This isn't a vocabulary problem. It's a vigilance distribution problem.
The fix: convert vigilance into workflow
Once you know your blindspot is asymmetric, the solution is clear: don't try to self-check every word manually — let a lint tool catch the words your alarm doesn't fire on.
Pre-publish lint hook
I added zhtw-mcp to my pre-publish workflow:
# alias in ~/.zshrc
zhtw_check() {
~/Projects/zhtw-mcp/target/release/zhtw-mcp lint "$1" \
--content-type markdown --relaxed --format compact \
| grep -E ':(W|E):cross_strait:'
}
# Run before publishing a new article
zhtw_check content/blog/zh-TW/new-article.mdx
Filtered to cross_strait only — skips punctuation and translationese to match my style preferences.
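The grep stage can be sanity-checked on canned report lines. The lines below are illustrative, following the `:severity:rule:` fields the alias greps for, not verbatim zhtw-mcp output:

```shell
# Keep only cross_strait findings; drop punctuation/translationese noise
filter_cross_strait() {
    grep -E ':(W|E):cross_strait:'
}

printf '%s\n' \
  'post.mdx:12:W:cross_strait:數據 → 資料' \
  'post.mdx:30:W:punctuation:half-width comma' \
  'post.mdx:41:E:cross_strait:並發 → 並行' \
  | filter_cross_strait
# prints only the two cross_strait lines
```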
Memory file as Claude context
I updated my zh-TW calque blindspot memory file with all 14 high-frequency terms, 7 borderline terms, and 2 semantic-conflict terms. This file gets loaded into context whenever I start a Claude Code session for blog writing — so when I draft a paragraph, Claude already has the table of "if you see these terms, double-check."
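For concreteness, such a memory file can be as simple as a table the assistant reads at session start. This is an illustrative layout, not my actual file:

```markdown
<!-- zh-TW calque blindspots: double-check these on sight -->
| Drift form | zh-TW form | English | Tier |
|---|---|---|---|
| 數據 | 資料 | data | 1 (high-frequency) |
| 優化 | 最佳化 | optimize | 2 (borderline) |
| 並發 | 並行 | concurrency | 3 (semantic conflict) |
```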
Codex zh-TW review (already in place)
A separate zh-TW review gate using Codex catches narrative-level translationese. zhtw-mcp catches lexical-level drift. Different granularities, layered: Codex for sentence-level, zhtw-mcp for word-level.
Takeaways
If you write bilingual technical content (any language pair where the high-resource language dominates training data), here are five practical recommendations:
- Don't trust your "native sense" — I assumed my zh-TW writing was clean. The first lint pass found 207 issues across 72 articles. Discount your native intuition by some margin.
- Install zhtw-mcp if you write Traditional Chinese — single binary, no API keys, Apache 2.0. 5 minutes to build from source.
- Don't `--fix` blindly: at minimum on .mdx files, the default behavior corrupts YAML frontmatter. Write a surgical wrapper that scopes substitutions to body content.
- The semantic-conflict tier probably isn't your problem: if you write decent zh-TW at all, you self-check on words like 並行/平行 and 文件/檔案. Your real drift is on high-frequency single words (數據/用戶/連接/發送/模板...) where no alarm fires.
- Build a self-check trigger list — every term the linter catches goes into a memory file your AI assistant loads at session start. Vigilance becomes context-engineering, not willpower.
What I'm not doing
- Not fixing `translationese` (456 hits; high false-positive rate against individual style choices, needs per-article author judgment)
- Not converting half-width punctuation (deliberate ai-muninn style: half-width punctuation in mixed Chinese-English technical writing reads more like code commentary)
- Not auto-fixing terms like 渲染 (industry-standard in Taiwan software writing; Apple's zh-TW documentation uses it, even though MoE prefers 算繪)
P.S. This very article got flagged ~50 times
Right after publishing, I ran zhtw-mcp on this article. ~50 cross_strait hits: 數據 flagged 7 times, 用戶 6 times, 並發 6 times, 遍歷 4 times…
But every hit is a citation, not authorial drift. The Tier 1 table that lists "數據 → 資料 (36 hits)"? The linter flags 數據 in the row header. The line where I write "好消息 / 收到消息 are zh-TW idioms (news ≠ message)"? It flags 消息. The line citing "進程 in Taiwan means progress, not OS process"? It flags 進程.
If I had run --fix blindly on this very article — the one arguing against blind --fix — it would have destroyed itself:
- The Tier 1 table row 數據 → 資料 would become 資料 → 資料 (identical characters on both sides; the reader has no idea what's being demonstrated)
- The line "好消息 is an idiom (news, not message)" would become "好訊息 is an idiom": the explanation contradicts itself, changing the very idiom it argues shouldn't change
This lands the article's thesis cleanly: linter output always needs human review. Citation context is the most common source of legitimate false positives for vocabulary-rule linters. Surgical workflows beat batch --fix not just because they protect YAML frontmatter, but because lint rules operate at the vocabulary level with no understanding of semantic context. A human pass at the end is non-negotiable.
One more rule to add to the workflow: don't just count hits, look at the surrounding lines for each hit. I changed my zhtw-mcp grep to grep -B 1 -A 1 — citation context becomes obvious immediately.
Related reading
- Part 5 — Claude Code self-audit /slimskill
- Part 6 — Claude Code burning tokens? 8 fixes
- LLM Deep Dive Part 2 — TurboQuant KV cache benchmark — one of the technical articles that got linted
- sysprog21/zhtw-mcp on GitHub
- FAccT 2025 paper on LLM zh-CN/zh-TW bias
FAQ
- What is zhtw-mcp and how is it different from OpenCC?
- [sysprog21/zhtw-mcp](https://github.com/sysprog21/zhtw-mcp) is a Rust-based Traditional Chinese (zh-TW) linter by jserv. It compiles 1,100+ vocabulary rules, MoE standard character forms, and cross-strait disambiguation tables into a single binary. OpenCC handles character-level Simplified→Traditional conversion; zhtw-mcp goes further with context-sensitive same-character-different-meaning rules, translationese detection, and full-width/half-width punctuation enforcement. One sentence: OpenCC converts characters, zhtw-mcp converts how Taiwan actually writes.
- Why can't you just run zhtw-mcp's --fix flag?
- Default --fix converts YAML frontmatter quote marks ("...") to corner brackets (「...」), which breaks YAML parsing and crashes article rendering. It also force-converts half-width punctuation that I deliberately keep for the ASCII-friendly mixed-language voice ai-muninn uses. The safe approach is a surgical Python wrapper that scopes substitutions to body text only, skipping frontmatter and code blocks.
- What were the highest-frequency Mainland terms in your writing?
- From my 72-article corpus the top three are 數據 (36 hits → should be 資料 for 'data'), 用戶 (9 hits → 使用者 for 'user'), and 連接 (7 hits → 連線 for 'connection'). All three are common technical terms, and I'd default to the Mainland form when translating from English without realizing — even though I knew the right Taiwan form. That's the blindspot.
- Did the same-character-different-meaning tier surprise you?
- Yes — but in the opposite direction. The most semantically dangerous tier (where the same characters mean opposite things across straits — 並行/平行 swap meanings for parallel/concurrent, 文件/檔案 swap for file/document) was actually the cleanest in my corpus. I wrote 並行 9 times correctly as 'concurrent', 文件 57 times correctly as 'document'. My blindspot wasn't there. It was on the high-frequency single-word translations where there's no semantic conflict to trigger self-doubt.