How do you find a bug in a kana-to-romaji table without knowing Japanese?

Use the round-trip as an oracle. Romanize each kana and compare it to the equivalent kana elsewhere in the same table — its hiragana or katakana twin, or its voiced pair. If ず romanizes as zu but its twin づ romanizes as u, the entry is inconsistent with its own sibling, which is a bug regardless of the language.

Why do じ and ぢ both romanize as ji in Hepburn?

In Hepburn the yotsugana じ and ぢ represent the same sound and both write as ji; likewise ず and づ both write as zu. Kunrei-shiki also merges each pair, but spells them zi and zu; only Nihon-shiki keeps them distinct (zi/di, zu/du). A table that gives ぢ the Kunrei reading zi in its Hepburn column has leaked a value across standards.

How do I test a Japanese transliteration library?

Property-test the round-trip. For every kana in the table, romanize it and assert the result is non-empty and matches the equivalent kana elsewhere (hiragana vs katakana of the same mora); dropped and empty mappings fall out immediately. Then diff the siblings — voiced and unvoiced pairs, hiragana and katakana halves, the three romanization standards — and a cell that disagrees with its row is the suspect.

Kana-to-romaji tables drift, and the sibling entry is the fix

A Japanese name goes into a romanizer and comes back subtly wrong: 鼻血 (hanaji, a nosebleed) prints as Hanazi. Nobody mistyped it. One row in the library’s kana-to-romaji table drifted, and ぢ picked up the reading of its Kunrei-shiki cousin instead of its Hepburn sibling じ.

I collect CJK and Unicode bugs from open-source libraries into a corpus of 97 entries. Eight are kana-romaji table bugs, and every one has the same shape: not a hard linguistics problem, just one cell out of sync with the cell right next to it.

The oracle: kana to romaji to kana should be stable

Romanization is lossy in one direction: じ and ぢ both write as ji, so ji back to kana is ambiguous. But the other direction is a property you can assert. Take a kana, romanize it, and the reading should match what the same library produces for the equivalent kana somewhere else in the table. If ず romanizes as zu but its Hepburn homophone づ romanizes as u, you have found a bug without knowing a word of Japanese. The sibling is the specification.
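A minimal sketch of that check in Python. The table and the sibling list here are made up for illustration (one cell is broken on purpose) and are not taken from any of the libraries discussed.

```python
# Hypothetical Hepburn-style table with one deliberately broken cell.
TABLE = {
    "じ": "ji", "ぢ": "ji",
    "ず": "zu", "づ": "u",  # bug: should be "zu"
}

# Pairs that must romanize identically under Hepburn (the yotsugana).
HEPBURN_SIBLINGS = [("じ", "ぢ"), ("ず", "づ")]


def find_sibling_mismatches(table, siblings):
    """Return (kana, reading, sibling, sibling_reading) for each disagreeing pair."""
    mismatches = []
    for a, b in siblings:
        reading_a, reading_b = table.get(a), table.get(b)
        if reading_a != reading_b:
            mismatches.append((a, reading_a, b, reading_b))
    return mismatches


print(find_sibling_mismatches(TABLE, HEPBURN_SIBLINGS))
# [('ず', 'zu', 'づ', 'u')]
```

The check never needs to know which reading is right, only that the two cells disagree. Deciding which one to change is still a human step.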

Receipt: cutlet romanized ぢ as zi, but じ was already ji

In cutlet, a Japanese-to-romaji library, the Hepburn table mapped ぢ to zi while じ mapped to ji. In Hepburn these two kana are the same sound (this is yotsugana) and both write as ji. The table already handled the parallel pair correctly — both ず and づ map to zu, not the Nihon-shiki du — so ぢ was the odd one out.

The cause was inheritance. The tables are built Hepburn, then Kunrei-shiki as a copy of Hepburn, then Nihon-shiki as a copy of that. ぢ never got an explicit Kunrei override, so the Kunrei reading zi leaked up into the Hepburn base. The fix set ぢ to ji in Hepburn and pinned KUNREISHIKI["ぢ"] = "zi" so Kunrei and Nihon stayed correct. Five lines.

はなぢ (鼻血, Hepburn)   before: Hanazi   after: Hanaji

Receipt: polm/cutlet#74 (merged)
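The copy-then-override pattern behind this bug can be sketched in a few lines. This is not cutlet's actual code: `build_tables` and its parameters are hypothetical, and the tables hold only the handful of rows needed to show why the fix had to touch both the Hepburn base and the Kunrei-shiki copy.

```python
def build_tables(hepburn_value, kunrei_override=None):
    """Build hypothetical Hepburn and Kunrei-shiki tables by copying.

    hepburn_value is what the Hepburn base stores for ぢ; kunrei_override,
    if given, is pinned on the Kunrei-shiki copy afterwards.
    """
    hepburn = {"し": "shi", "じ": "ji", "ぢ": hepburn_value, "ず": "zu", "づ": "zu"}
    kunrei = dict(hepburn)                   # Kunrei starts as a copy of Hepburn
    kunrei.update({"し": "si", "じ": "zi"})  # explicit overrides; none for ぢ
    if kunrei_override is not None:
        kunrei["ぢ"] = kunrei_override       # the pin added alongside the fix
    return hepburn, kunrei


for label, args in [
    ("drifted", ("zi",)),
    ("base fixed, no pin", ("ji",)),
    ("base fixed, pinned", ("ji", "zi")),
]:
    hepburn, kunrei = build_tables(*args)
    print(f"{label}: Hepburn ぢ={hepburn['ぢ']}  Kunrei ぢ={kunrei['ぢ']}")
```

The middle case is the trap: correcting only the base silently changes every table copied from it, which is why the pin matters as much as the corrected value.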

Receipt: encoding.js converted two of the four wa-row voiced kana

encoding.js decomposes and recomposes the wa-row voiced katakana. It already handled ヷ (U+30F7) to わ゛ and ヺ (U+30FA) to を゛. The other two, ヸ (U+30F8) and ヹ (U+30F9), were left out, so they passed through unconverted and broke the round-trip. The base letters ヰ/ヱ already convert to ゐ/ゑ through the normal katakana-to-hiragana offset, so the voiced forms just needed to follow the pattern their siblings used.

toHiraganaCase('ヷヸヹヺ')   before: 'わ゛ヸヹを゛'   after: 'わ゛ゐ゛ゑ゛を゛'

Receipt: polygonplanet/encoding.js#62 (merged)
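For readers who want to see the pattern rather than the patch, here is a rough Python equivalent. It is a sketch of the idea, not a port of encoding.js: `to_hiragana` is a hypothetical name, and I am assuming the spacing voiced sound mark U+309B as the decomposed mark, matching the before/after strings above.

```python
KATA_TO_HIRA_OFFSET = 0x60  # katakana block sits 0x60 above hiragana
DAKUTEN = "\u309B"          # spacing voiced sound mark (assumed, see lead-in)

# Voiced wa-row katakana have no precomposed hiragana form,
# so each one decomposes to a base hiragana plus the mark.
VOICED_WA_ROW = {
    "\u30F7": "わ",  # ヷ
    "\u30F8": "ゐ",  # ヸ
    "\u30F9": "ゑ",  # ヹ
    "\u30FA": "を",  # ヺ
}


def to_hiragana(text):
    """Convert katakana to hiragana, decomposing the voiced wa-row."""
    out = []
    for ch in text:
        if ch in VOICED_WA_ROW:
            out.append(VOICED_WA_ROW[ch] + DAKUTEN)
        elif "\u30A1" <= ch <= "\u30F6":  # ァ..ヶ map by fixed offset
            out.append(chr(ord(ch) - KATA_TO_HIRA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

With all four rows present, `to_hiragana('ヷヸヹヺ')` gives the same string as the "after" column above. Delete the two middle entries from `VOICED_WA_ROW` and ヸ and ヹ pass through untouched, which is the original bug.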

The same shape across the corpus

Those two fixes were mine. The corpus catalogues eight more of exactly this kind, and reading the last column is the whole method: in each case a neighbour in the same table was already correct.

Library	Lang	Kana	Wrong output	Should be	Sibling that was right
hepburn	JS	ン before vowel/Y	`SHINYOU`	`SHIN'YOU`	hiragana ん uses the apostrophe
kana-romaji	JS	づ	`u`	`zu`	ず maps to `zu`
jaco-js	JS	ヲ / ヺ	reversed dakuten	correct pair	the other voiced-mark rows
jaconv	Python	ヵ / ヶ	dropped	`ka` / `ke`	full-size カ / ケ convert
romaji-conv	JS	ゐ / ゑ	dropped	`wi` / `we`	modern kana all present
pykakasi	Python	っでぃ (ddi)	no entry	`ddi`	other sokuon entries exist
pykakasi	Python	half-width kana + U+FF9E	garbled	`ga` after NFKC	full-width ガ converts
unidecode	Python	half-width dakuten kana	artifacts	correct after NFKC	full-width katakana converts

The hepburn one is worth a second look: without the apostrophe, シンヨウ (SHINYOU) collides with シニョウ. The apostrophe is not decoration, it is what keeps n plus a vowel distinct from a ny mora.
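The rule is small enough to sketch. This assumes the text has already been split into per-mora readings, with a bare "n" standing for ん or ン; `join_morae` is a hypothetical helper, not the API of the hepburn package.

```python
VOWELS_AND_Y = set("aiueoy")


def join_morae(readings):
    """Concatenate per-mora romaji, adding an apostrophe after a moraic n
    when the next mora starts with a vowel or y."""
    out = []
    for i, reading in enumerate(readings):
        out.append(reading)
        nxt = readings[i + 1] if i + 1 < len(readings) else ""
        if reading == "n" and nxt and nxt[0] in VOWELS_AND_Y:
            out.append("'")
    return "".join(out)


print(join_morae(["shi", "n", "yo", "u"]))  # シンヨウ -> shin'you
print(join_morae(["shi", "nyo", "u"]))      # シニョウ -> shinyou
```

Drop the apostrophe branch and both calls print the same string, so the two words can no longer be told apart.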

Why these tables drift

Three recurring reasons, all boring, all real:

Standard inheritance. Hepburn, Kunrei-shiki, and Nihon-shiki share most of their table and differ in a handful of rows (shi/si, ji/zi, zu/du). Copying one standard from another and forgetting a single override is how the cutlet bug happened.
Rare and historical kana. ゐ/ゑ (wi/we), small ヵ/ヶ, and the wa-row voiced kana are uncommon, so they get skipped when the common table is written and no test ever exercises them.
Half-width and combining marks. Half-width katakana followed by a half-width voiced or semi-voiced sound mark (U+FF9E/U+FF9F) needs NFKC normalization before lookup, or it romanizes to garbage. Two of the eight, pykakasi and unidecode, are exactly this, and both were closed rather than merged, because the maintainers treated normalization as the caller's job. That is a defensible call, and worth stating plainly.
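The normalization step in the third reason is one standard-library call in Python. The sketch below uses a two-entry table that is hypothetical and holds full-width keys only, which is the situation that produces the garbled output.

```python
import unicodedata

# Hypothetical lookup table that only knows full-width katakana.
TABLE = {"ガ": "ga", "パ": "pa"}


def romanize_normalized(text, table=TABLE):
    """NFKC-normalize first, so half-width kana plus a sound mark
    composes into the single full-width kana the table knows."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(table.get(ch, ch) for ch in normalized)


half_ga = "\uFF76\uFF9E"  # half-width ka + half-width voiced sound mark
half_pa = "\uFF8A\uFF9F"  # half-width ha + half-width semi-voiced sound mark

print(romanize_normalized(half_ga))  # ga
print(romanize_normalized(half_pa))  # pa
```

Whether this call belongs inside the library or in the caller is exactly the judgment the two maintainers made. The code is the same either way.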

Honest limits

Not every difference here is a bug. Romanization has several valid standards, so si versus shi is a choice, not an error; a fix has to target one standard and leave the others intact, which is why the cutlet patch pinned the Kunrei value. Traditional Hepburn also writes ん as m before b, p, and m, so n is not universally correct either. The round-trip oracle tells you an entry is inconsistent with its own siblings. It does not tell you which romanization standard you should be using. That part is a human decision.

How to test it

Property-test the round-trip. For every kana in your table, romanize it and assert the result is non-empty and matches the equivalent kana elsewhere (hiragana vs katakana of the same mora). Dropped and empty mappings fall out immediately.
Diff the siblings. Line up the voiced and unvoiced pairs, the hiragana and katakana halves, and the three romanization standards. A cell that disagrees with its row is your suspect.
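The first step can be sketched as a plain loop, with no test framework needed. `check_kana_twins` is a hypothetical helper that takes whatever single-kana romanize function your library exposes. It relies on katakana sitting 0x60 code points above hiragana, and it compares case-insensitively on the assumption that some libraries upper-case katakana output.

```python
HIRAGANA_RANGE = range(0x3041, 0x3097)  # ぁ .. ゖ
KATAKANA_OFFSET = 0x60


def check_kana_twins(romanize, hiragana=None):
    """Romanize each hiragana and its katakana twin; report empty or
    mismatched readings as (hira, kata, kind, r_hira, r_kata)."""
    if hiragana is None:
        hiragana = [chr(cp) for cp in HIRAGANA_RANGE]
    failures = []
    for hira in hiragana:
        kata = chr(ord(hira) + KATAKANA_OFFSET)
        r_hira, r_kata = romanize(hira), romanize(kata)
        if not r_hira or not r_kata:
            failures.append((hira, kata, "empty", r_hira, r_kata))
        elif r_hira.lower() != r_kata.lower():
            failures.append((hira, kata, "mismatch", r_hira, r_kata))
    return failures


# Toy romanizer with one dropped and one wrong katakana entry.
TOY = {"か": "ka", "カ": "ka", "ゐ": "wi", "ヰ": "", "け": "ke", "ケ": "ki"}
print(check_kana_twins(lambda ch: TOY.get(ch, ""), hiragana="かゐけ"))
# [('ゐ', 'ヰ', 'empty', 'wi', ''), ('け', 'ケ', 'mismatch', 'ke', 'ki')]
```

Run against the full range, expect some noise from small kana and っ, which a library may reasonably romanize to nothing on their own. An allowlist for those keeps the report readable.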

Full repros, affected versions, and the sibling for each are in the CJK failure corpus. The same “correct in English, wrong the moment it’s Japanese” shape shows up in a word counter that reads a whole paragraph as one word; when the problem is the width of the characters rather than their reading, it becomes text that overflows a terminal because length is not display width; and the half-width combining marks two of these rows trip over are the same ones that make walking a string by code point tear a character apart.

