~/blog/zhtw-mcp-calque-blindspot-sweep

AI Workflow · part 7

How a zh-TW Linter Found 128 Instances of Mainland-Chinese Drift in My Own Writing

cat --toc

TL;DR

I ran sysprog21/zhtw-mcp (a Rust-based Traditional Chinese linter by jserv) across all 72 zh-TW articles on ai-muninn. Three sweeps, 128 cross-strait substitutions across 42 unique files. cross_strait issues dropped from 207 to 135. But the real takeaway wasn't the count — it was the distribution. The semantically dangerous tier (same characters, different meaning across straits) was the cleanest in my writing. The boring high-frequency single-word translations were where I'd default to the Mainland form without ever doubting myself. The blindspot isn't "I don't know" — it's "I don't auto-check."

What this article is actually about

Most articles about Traditional Chinese vs Simplified Chinese focus on the political angle, or on character conversion. This article isn't about either. It's about a more concrete bias: even when you natively know the local term, your training data — which for any reader of online tech content is overwhelmingly Mainland-Chinese — pushes you toward the wrong form by default.

LLMs have this problem too. There's a FAccT 2025 paper showing most models default to Mainland terminology when asked to write Taiwan-style. CC-100 has a 2.6:1 zh-CN to zh-TW ratio. Models trained on web Chinese reproduce that ratio in their output.

What I didn't expect: as a native zh-TW writer who's been doing this for decades, I have the same bias. I just didn't have a tool to catch it until now.

If you write any kind of bilingual technical content, or care about how training-data distribution influences your own outputs, this is a case study you can adapt.


The trigger: GSC can't see "Mainland flavor"

A few days earlier (2026-05-02), I was writing a deep technical post in zh-TW about LLM quantization on DGX Spark. I used the term 並發 (the Mainland word for concurrency) several times. Gemini 2.5 Pro flagged it. I argued back — "but Taiwan also uses this." I was wrong: Taiwan does not use 並發 for concurrency; that's Mainland-Chinese drift. The Taiwan term is 並行.

That single article got fixed. But the experience nagged at me: how many other articles have similar drift? Google Search Console can't tell me — "Mainland flavor" doesn't show up as a metric. Real readers feel it as subtle wrongness. I needed a corpus-level scan.

A few days later, sysprog21/zhtw-mcp showed up in a Telegram group I'm in. By jserv — a notable Taiwanese kernel/embedded engineer, instant trust. Zero second-guessing the tool's direction.


The tool: zhtw-mcp

zhtw-mcp enforces three official Taiwan standards:

  1. Revised Punctuation Handbook (full-width punctuation rules)
  2. Standard Form of National Characters (standard vs. variant character forms)
  3. Cross-strait vocabulary normalization — built on OpenCC's TWPhrases / TWVariants plus its own ruleset

1,100+ vocabulary rules and 15 casing rules compile into a single Rust binary. CLI mode for local linting; MCP server mode for AI assistants.

There are no pre-built binaries on the GitHub releases yet, so build from source:

git clone https://github.com/sysprog21/zhtw-mcp ~/Projects/zhtw-mcp
cd ~/Projects/zhtw-mcp
make    # python codegen downloads OpenCC tables → cargo build, ~50s

CLI usage:

zhtw-mcp lint file.md                          # list issues
zhtw-mcp lint file.md --content-type markdown  # skip code blocks
zhtw-mcp lint file.md --relaxed                # loosen punctuation/grammar
zhtw-mcp lint file.md --fix --dry-run          # preview fixes

Quick CLI shape, sensible defaults. The MCP integration is well-thought-out too. Recommend tier: high.


⚠️ The first trap: don't run --fix on Markdown with frontmatter

My first instinct was "lint, then --fix everything in batch." I tested on one article. Disaster:

zhtw-mcp by default converts ASCII quotes ("...") inside YAML frontmatter to corner brackets (「...」):

# Before (parses fine)
description: "Full local AI agent stack..."

# After --fix (YAML parser dies)
description: 「Full local AI agent stack...」

YAML doesn't recognize 「」 as string delimiters. The whole file fails to parse, the article doesn't render. Run on all 72 files = entire site down.

--fix also force-converts half-width punctuation throughout body text. ai-muninn's voice deliberately keeps half-width punctuation marks (`,` `:` `?` `(` `)`) in mixed Chinese-English technical writing — it reads more like code commentary than formal prose. Auto-fix would erase that.


Workflow design: Surgical Python beats batch --fix

Final design:

  1. Scan: zhtw-mcp --format compact --relaxed lists issues per file
  2. Filter: Python script picks only cross_strait rule types (skips punctuation, translationese, etc.)
  3. Apply: Python script does its own substitutions, scoping to body text only — explicitly skipping YAML frontmatter, fenced code blocks, and inline backticks
def split_sections(text):
    # Yields (kind, lines, start_index) for frontmatter / code / body
    ...

def fix_text(text):
    out_lines = []
    for kind, chunk, start in split_sections(text):
        for line in chunk:
            if kind == 'body':
                # Apply cross_strait substitutions, but only outside `inline code`
                ...
            out_lines.append(line)
    return '\n'.join(out_lines)

Three layers of protection (frontmatter / code block / inline backtick), plus an explicit substitution dict I can iterate on per round.
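Fleshed out, the wrapper is small. A minimal sketch, assuming `---`-delimited YAML frontmatter and triple-backtick fences — the two-entry SUBS excerpt and the helper names here are illustrative, not the actual script:

```python
import re

SUBS = {'數據': '資料', '用戶': '使用者'}   # two-entry excerpt; the real dict grows per round

def split_sections(text):
    """Yield (kind, lines, start_index): 'frontmatter', 'code', or 'body'."""
    lines = text.split('\n')
    i, n = 0, len(lines)
    if lines and lines[0] == '---':            # YAML frontmatter at top of file
        j = 1
        while j < n and lines[j] != '---':
            j += 1
        yield ('frontmatter', lines[:j + 1], 0)
        i = j + 1
    while i < n:
        if lines[i].startswith('```'):         # fenced code block: copy verbatim
            j = i + 1
            while j < n and not lines[j].startswith('```'):
                j += 1
            yield ('code', lines[i:j + 1], i)
            i = j + 1
        else:                                  # run of body lines
            j = i
            while j < n and not lines[j].startswith('```'):
                j += 1
            yield ('body', lines[i:j], i)
            i = j

def fix_body_line(line):
    # Split on `inline code` spans; substitute only in the non-code parts
    parts = re.split(r'(`[^`]*`)', line)
    for k in range(0, len(parts), 2):          # even indices are outside backticks
        for cn, tw in SUBS.items():
            parts[k] = parts[k].replace(cn, tw)
    return ''.join(parts)

def fix_text(text):
    out = []
    for kind, chunk, _start in split_sections(text):
        out.extend(fix_body_line(l) if kind == 'body' else l for l in chunk)
    return '\n'.join(out)
```

Frontmatter and fenced code pass through verbatim; only body text outside inline backticks is touched.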


Round 1: high-frequency single-word drift (84 subs / 42 files)

This round handled "single technical terms I default to the Mainland form on" — the boring high-frequency drift that any zh-TW reader notices instantly:

| Mainland → Taiwan | Hits | English | Why I drifted |
|---|---|---|---|
| 數據 → 資料 | 36 | data | Almost every tech article uses this |
| 用戶 → 使用者 | 9 | user | Direct translation of "user" |
| 連接 → 連線 | 7 | connection | Network connection |
| 發送 → 傳送 | 5 | send | Email/message context |
| 模板 → 範本 | 5 | template | Software templates |
| 導航 → 導覽 | 4 | navigation | UI navigation |
| 只讀 → 唯讀 | 4 | readonly | File permissions |
| 設備 → 裝置 | 3 | device | Hardware |
| 性能 → 效能 | 2 | performance | Benchmarking |
| 卸載 → 解除安裝 | 2 | uninstall | |
| 溢出 → 溢位 | 2 | overflow | Stack overflow, etc. |
| 在線 → 線上 | 2 | online | |
| 擴展 → 擴充 | 2 | extension | Software context |
| 緩存 → 快取 | 1 | cache | KV cache, etc. |

These 14 terms are now in my calque blindspot memory file — to be loaded as context whenever I open a Claude Code session for blog writing. The memory file previously had 5 terms (並發 family, 複現, 矽片, 內存, 視頻); after this sweep it's 19.


Round 2: the borderline tier (36 subs)

Two terms where both forms exist in zh-TW software writing, and which form to use is author preference rather than a correctness call:

  • 優化 → 最佳化 (30 hits) — "optimize"
  • 算法 → 演算法 (6 hits) — "algorithm"

The Ministry of Education prefers 最佳化 and 演算法, but 優化 and 算法 are widely used in Taiwan software circles too. I opted in because once I'd fixed the harder Round 1 drift, consistency with MoE forms reads cleaner.

算法 → 演算法 has a regex trap: a naive replace creates 演演算法, because 演算法 itself contains 算法 as a substring. You need a negative lookbehind:

pattern = re.compile(r'(?<!演)算法')
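A quick sanity check of the trap and the fix (standalone snippet, not from the actual script):

```python
import re

text = '這個演算法和那個算法'   # one correct 演算法, one drifted 算法
# Naive replace corrupts the term that was already correct:
assert text.replace('算法', '演算法') == '這個演演算法和那個演算法'
# The negative lookbehind skips 算法 when it is already preceded by 演:
pattern = re.compile(r'(?<!演)算法')
assert pattern.sub('演算法', text) == '這個演算法和那個演算法'
```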

Other borderline terms I left untouched:

  • 通過 / 透過 (through/via) — both common
  • 前綴 / 字首 (prefix) — technical context, both fine
  • 分配 / 配置 (allocate) — programming context, both fine
  • 版本號 / 版本號碼 (version number) — 版本號 is universal in tech writing
  • 消息 / 訊息 — high-risk false-positive trap: 好消息 ("good news") and 收到消息 ("received word") are idiomatic zh-TW for news/info, not "message". Auto-substituting these would corrupt meaning. Explicitly excluded.
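The 消息 trap is easy to see in one line. A sketch of what a blind 消息 → 訊息 rule would do (hypothetical rule, not part of my script):

```python
naive = lambda s: s.replace('消息', '訊息')

# Plausible "message" context — the substitution looks right here:
assert naive('收到一則消息') == '收到一則訊息'
# But it also corrupts the idiom 好消息 ("good news"), which is correct zh-TW:
assert naive('有個好消息') == '有個好訊息'
```

Since there is no safe mechanical rule for this pair, 消息 stays out of the substitution dict and gets a human pass instead.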

Round 3: same-character-different-meaning (8 subs)

This round tackled the most semantically dangerous tier — same characters, but the meaning is different (or even reversed) across straits:

| English | zh-CN | zh-TW | Note |
|---|---|---|---|
| concurrency | 並發 | 並行 | In zh-CN, 並行 = "parallel" — opposite mapping |
| parallel | 並行 | 平行 | Same characters, swapped meaning |
| process (OS) | 進程 | 行程 / 程序 | In zh-TW, 進程 = "progress," not OS process |
| file | 文件 | 檔案 | 文件 in zh-TW = "document," not "file" |
| document | 文檔 | 文件 | Mirror conflict |
| render | 渲染 | 算繪 | In zh-TW, 渲染 = "exaggerate" (a painting technique) |
| traverse | 遍歷 | 走訪 | In zh-TW, 遍歷 is reserved for ergodic theory |

Surprise: this tier was almost entirely correct in my corpus. I wrote:

  • 並行 9 times — all in correct zh-TW "concurrent" sense
  • 平行 2 times — correct zh-TW "parallel"
  • 文件 57 times — all correct zh-TW "document" (I consistently use 檔案 for "file")
  • 檔案 115 times — correct zh-TW "file"
  • 行程 4 times — all in itinerary context ("Kyoto itinerary," "travel itinerary") or substring false positive ("幾十行程式碼" = lines of code)
  • 進程 1 time — false positive (substring of "鑽進程式碼裡" = "delve into the code")
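Those substring false positives are easy to reproduce — Chinese text has no word boundaries, so a plain substring match fires across word seams (illustrative check):

```python
# 行程 matches across the seam 幾十行 | 程式碼 ("dozens of lines" | "code")
assert '行程' in '幾十行程式碼'
# 進程 matches inside 鑽進程式碼裡 ("burrow into the code"): 鑽進 | 程式碼
assert '進程' in '鑽進程式碼裡'
```

This is why counting hits isn't enough — each hit needs its surrounding characters checked before any substitution.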

Only two real failures:

  • 並發 7 hits (cluster on "concurrent load," "theoretical concurrency 7.48x") → fixed to 並行
  • 遍歷 1 hit ("traverse request.messages") → fixed to 走訪

I deliberately left 渲染 (4 hits) untouched — it's used in software context, where 渲染 is widely accepted in Taiwan tech writing (Apple's zh-TW documentation uses it, for example), even though MoE prefers 算繪.


The real insight: blindspot is asymmetric

After fixing 128 substitutions, I stared at the distribution for five minutes:

| Tier | Hits fixed | Risk profile |
|---|---|---|
| Tier 1 (high-freq drift) | 84 | Largest — single-word direct translations |
| Tier 2 (borderline) | 36 | Medium — both forms exist, author preference |
| Tier 3 (semantic conflict) | 8 | Smallest — I wrote correctly by default |

The most semantically dangerous tier — where mistranslation actively confuses meaning — was the cleanest in my writing. The least dangerous tier — where readers just feel "this isn't quite Taiwan-flavored" — was where I drifted most.

Why?

When I'm typing a word like 並行 (concurrency in zh-TW, parallel in zh-CN) or 文件 (document in zh-TW, file in zh-CN), my brain auto-fires a "wait, this is a conflict word" alert. I'm consciously aware it has different meanings across straits, so I self-check.

When I'm typing 數據 or 用戶 or 連接, no alert fires. These feel like "neutral terms" — and my default training data (years of reading Mainland-dominated tech content online) supplies the Mainland form.

So my blindspot isn't "I don't know the Taiwan term" — I correctly wrote 並行 9 times, 檔案 115 times, 文件 (as document) 57 times. It's "when the Mainland term surfaces, I don't auto-doubt it". Self-check fires for danger words, not for the boring high-frequency drift.

This isn't a vocabulary problem. It's a vigilance distribution problem.


The fix: convert vigilance into workflow

Once you know your blindspot is asymmetric, the solution is clear: don't try to self-check every word manually — let a lint tool catch the words your alarm doesn't fire on.

Pre-publish lint hook

I added zhtw-mcp to my pre-publish workflow:

# helper function in ~/.zshrc
zhtw_check() {
  ~/Projects/zhtw-mcp/target/release/zhtw-mcp lint "$1" \
    --content-type markdown --relaxed --format compact \
    | grep -E ':(W|E):cross_strait:'
}

# Run before publishing a new article
zhtw_check content/blog/zh-TW/new-article.mdx

Filtered to cross_strait only — skips punctuation and translationese to match my style preferences.

Memory file as Claude context

I updated my zh-TW calque blindspot memory file with all 14 high-frequency terms, 7 borderline terms, and 2 semantic-conflict terms. This file gets loaded into context whenever I start a Claude Code session for blog writing — so when I draft a paragraph, Claude already has the table of "if you see these terms, double-check."

Codex zh-TW review (already in place)

A separate zh-TW review gate using Codex catches narrative-level translationese. zhtw-mcp catches lexical-level drift. Different granularities, layered: Codex for sentence-level, zhtw-mcp for word-level.


Takeaways

If you write bilingual technical content (any language pair where the high-resource language dominates training data), here are five practical recommendations:

  1. Don't trust your "native sense" — I assumed my zh-TW writing was clean. The first lint pass found 207 issues across 72 articles. Discount your native intuition by some margin.
  2. Install zhtw-mcp if you write Traditional Chinese — single binary, no API keys, Apache 2.0. 5 minutes to build from source.
  3. Don't --fix blindly — at minimum on .mdx files, the default behavior corrupts YAML frontmatter. Write a surgical wrapper that scopes substitutions to body content.
  4. The semantic-conflict tier probably isn't your problem — if you write decent zh-TW at all, you self-check on words like 並行/平行 and 文件/檔案. Your real drift is on high-frequency single words (數據/用戶/連接/發送/模板...) where no alarm fires.
  5. Build a self-check trigger list — every term the linter catches goes into a memory file your AI assistant loads at session start. Vigilance becomes context-engineering, not willpower.

What I'm not doing

  • Not fixing translationese (456 hits — high false-positive rate for individual style choices, needs per-article author judgment)
  • Not converting half-width punctuation (deliberate ai-muninn style: half-width punctuation in mixed Chinese-English technical writing reads more like code commentary)
  • Not auto-fixing terms like 渲染 (industry-standard in Taiwan software writing — Apple's zh-TW documentation uses it — even though MoE prefers 算繪)

P.S. This very article got flagged ~50 times

Right after publishing, I ran zhtw-mcp on this article. ~50 cross_strait hits: 數據 flagged 7 times, 用戶 6 times, 並發 6 times, 遍歷 4 times…

But every hit is a citation, not authorial drift. The Tier 1 table that lists "數據 → 資料 (36 hits)"? The linter flags 數據 in the row header. The line where I write "好消息 / 收到消息 are zh-TW idioms (news ≠ message)"? It flags 消息. The line citing "進程 in Taiwan means progress, not OS process"? It flags 進程.

If I had run --fix blindly on this very article — the one arguing against blind --fix — it would have destroyed itself:

  • The Tier 1 table 數據 → 資料 would become 資料 → 資料 (identical character on both sides; reader has no idea what's being demonstrated)
  • "好消息 is an idiom (news, not message)" would become "好訊息 is an idiom" — the explanation contradicts itself: I'm explaining why the idiom shouldn't change, while changing it

This lands the article's thesis cleanly: linter output always needs human review. Citation context is the most common legitimate false positive for vocabulary-rule linters. Surgical workflows beat batch --fix not just because of YAML frontmatter protection, but because lint rules operate at vocabulary level without semantic context understanding. A human pass at the end is non-negotiable.

One more rule to add to the workflow: don't just count hits, look at the surrounding lines for each hit. I changed my zhtw-mcp grep to grep -B 1 -A 1 — citation context becomes obvious immediately.


FAQ

What is zhtw-mcp and how is it different from OpenCC?
[sysprog21/zhtw-mcp](https://github.com/sysprog21/zhtw-mcp) is a Rust-based Traditional Chinese (zh-TW) linter by jserv. It compiles 1,100+ vocabulary rules, MoE standard character forms, and cross-strait disambiguation tables into a single binary. OpenCC handles character-level Simplified→Traditional conversion; zhtw-mcp goes further with context-sensitive same-character-different-meaning rules, translationese detection, and full-width/half-width punctuation enforcement. One sentence: OpenCC converts characters, zhtw-mcp converts how Taiwan actually writes.
Why can't you just run zhtw-mcp's --fix flag?
Default --fix converts YAML frontmatter quote marks ("...") to corner brackets (「...」), which breaks YAML parsing and crashes article rendering. It also force-converts half-width punctuation that I deliberately keep for the ASCII-friendly mixed-language voice ai-muninn uses. The safe approach is a surgical Python wrapper that scopes substitutions to body text only, skipping frontmatter and code blocks.
What were the highest-frequency Mainland terms in your writing?
From my 72-article corpus the top three are 數據 (36 hits → should be 資料 for 'data'), 用戶 (9 hits → 使用者 for 'user'), and 連接 (7 hits → 連線 for 'connection'). All three are common technical terms, and I'd default to the Mainland form when translating from English without realizing — even though I knew the right Taiwan form. That's the blindspot.
Did the same-character-different-meaning tier surprise you?
Yes — but in the opposite direction. The most semantically dangerous tier (where the same characters mean opposite things across straits — 並行/平行 swap meanings for parallel/concurrent, 文件/檔案 swap for file/document) was actually the cleanest in my corpus. I wrote 並行 9 times correctly as 'concurrent', 文件 57 times correctly as 'document'. My blindspot wasn't there. It was on the high-frequency single-word translations where there's no semantic conflict to trigger self-doubt.