Kana-to-romaji tables drift, and the sibling entry is the fix
Transliteration libraries store kana-to-romaji as a lookup table, and a single row quietly drifts — a kana gets the wrong reading, or gets dropped. You rarely have to guess the right value, because a sibling in the same table (hiragana vs katakana, じ vs ぢ, the voiced pair beside it) already does it right. The oracle is the round-trip: kana to romaji and back should be stable. When it isn't, compare the broken entry to its sibling.
A Japanese name goes into a romanizer and comes back subtly wrong: 鼻血 (hanaji, a nosebleed) prints as Hanazi. Nobody mistyped it. One row in the library’s kana-to-romaji table drifted, and ぢ picked up the reading of its Kunrei-shiki cousin instead of its Hepburn sibling じ.
I collect CJK and Unicode bugs from open-source libraries into a corpus of 97 entries. Eight are kana-romaji table bugs, and every one has the same shape: not a hard linguistics problem, just one cell out of sync with the cell right next to it.
The oracle: kana to romaji to kana should be stable
Romanization is lossy in one direction: じ and ぢ both write as ji in Hepburn, so ji back to kana is ambiguous. But there is a weaker property you can still assert: take a kana, romanize it, and the reading should match what the same library produces for the equivalent kana elsewhere in the table. If ず romanizes as zu but its sibling づ romanizes as u, you have found a bug without knowing a word of Japanese. The sibling is the specification.
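As a minimal sketch of that check, assuming a made-up table fragment and a hand-written list of pairs that Hepburn is expected to romanize identically (`HEPBURN`, `SIBLINGS`, and `find_drift` are illustrative names, not taken from any library):

```python
# Hypothetical table fragment; "づ" deliberately carries a drifted value.
HEPBURN = {"じ": "ji", "ぢ": "ji", "ず": "zu", "づ": "u"}

# Pairs that Hepburn is expected to romanize identically (the yotsugana).
SIBLINGS = [("じ", "ぢ"), ("ず", "づ")]


def find_drift(table, siblings):
    """Return (kana_a, kana_b, reading_a, reading_b) for every pair
    whose readings are missing, empty, or different."""
    suspects = []
    for a, b in siblings:
        ra, rb = table.get(a), table.get(b)
        if not ra or not rb or ra != rb:
            suspects.append((a, b, ra, rb))
    return suspects


print(find_drift(HEPBURN, SIBLINGS))  # [('ず', 'づ', 'zu', 'u')]
```

The check says only that the pair disagrees. Deciding which of the two readings is the drifted one still means looking at the standard.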
Receipt: cutlet romanized ぢ as zi, but じ was already ji
In cutlet, a Japanese-to-romaji library, the Hepburn table mapped ぢ to zi while じ mapped to ji. In Hepburn these two kana are the same sound (this is yotsugana) and both write as ji. The table already handled the parallel pair correctly — both ず and づ map to zu, not the Nihon-shiki du — so ぢ was the odd one out.
The cause was inheritance. The tables are built Hepburn, then Kunrei-shiki as a copy of Hepburn, then Nihon-shiki as a copy of that. ぢ never got an explicit Kunrei override, so the Kunrei reading zi leaked up into the Hepburn base. The fix set ぢ to ji in Hepburn and pinned KUNREISHIKI["ぢ"] = "zi" so Kunrei and Nihon stayed correct. Five lines.
はなぢ (鼻血, Hepburn)
before: Hanazi
after: Hanaji
- Receipt: polm/cutlet#74 (merged)
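The copy-then-override pattern can be sketched as below. This is an illustration of the shape only, not cutlet's source: the dict names follow the article, the tables are cut down to a few rows, and `overrides` is a hypothetical helper.

```python
# Illustrative only: a few rows, not cutlet's real tables.
HEPBURN = {"し": "shi", "じ": "ji", "ぢ": "ji", "ず": "zu", "づ": "zu"}

# Kunrei-shiki starts as a copy of Hepburn, then overrides the rows that differ.
KUNREISHIKI = dict(HEPBURN)
KUNREISHIKI.update({"し": "si", "じ": "zi", "ぢ": "zi"})

# Nihon-shiki starts as a copy of Kunrei-shiki and overrides again.
NIHONSHIKI = dict(KUNREISHIKI)
NIHONSHIKI.update({"ぢ": "di", "づ": "du"})


def overrides(child, parent):
    """Rows where a derived table differs from the table it was copied from."""
    return {k: (parent.get(k), v) for k, v in child.items() if parent.get(k) != v}


# Listing the overrides makes a missing one visible: if "ぢ" were absent here,
# Kunrei-shiki would only be right for as long as the Hepburn base was wrong.
print(overrides(KUNREISHIKI, HEPBURN))
# {'し': ('shi', 'si'), 'じ': ('ji', 'zi'), 'ぢ': ('ji', 'zi')}
```

Pinning every differing row in the derived table means a later correction to the base cannot silently change the standards built on top of it.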
Receipt: encoding.js converted two of the four wa-row voiced kana
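encoding.js decomposes and recomposes the wa-row voiced katakana. It already handled ヷ (U+30F7) to わ゛ and ヺ (U+30FA) to を゛, but left the two between them, ヸ (U+30F8) and ヹ (U+30F9), unconverted. The fix maps them to ゐ゛ and ゑ゛ like their siblings.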
toHiraganaCase('ヷヸヹヺ')
before: 'わ゛ヸヹを゛'
after: 'わ゛ゐ゛ゑ゛を゛'
- Receipt: polygonplanet/encoding.js#62 (merged)
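One way to keep all four siblings in step is to derive the mapping from Unicode's own decomposition instead of listing rows by hand. The sketch below is Python and is not how encoding.js implements it; `wa_row_to_hiragana` is a hypothetical name, and the choice of the spacing mark U+309B for the output is an assumption based on the expected string above.

```python
import unicodedata

VOICED_MARK = "\u309B"  # spacing voiced sound mark, assumed from the output above


def wa_row_to_hiragana(text):
    """Convert ヷ ヸ ヹ ヺ (U+30F7..U+30FA) to hiragana base + voiced mark."""
    out = []
    for ch in text:
        if "\u30F7" <= ch <= "\u30FA":
            # NFD splits e.g. ヷ into ワ + U+3099; keep only the base katakana.
            base = unicodedata.normalize("NFD", ch)[0]
            # Katakana and hiragana blocks are offset by 0x60 (ワ -> わ).
            out.append(chr(ord(base) - 0x60) + VOICED_MARK)
        else:
            out.append(ch)
    return "".join(out)


print(wa_row_to_hiragana("\u30F7\u30F8\u30F9\u30FA"))
```

Because the range check covers all four code points, there is no per-row entry for anyone to forget.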
The same shape across the corpus
Those two fixes were mine. The corpus catalogues eight more of exactly this kind, and reading the last column is the whole method: in each case a neighbour in the same table was already correct.
| Library | Lang | Kana | Wrong output | Should be | Sibling that was right |
|---|---|---|---|---|---|
| hepburn | JS | ン before vowel/Y | SHINYOU | SHIN'YOU | hiragana ん uses the apostrophe |
| kana-romaji | JS | づ | u | zu | ず maps to zu |
| jaco-js | JS | ヲ / ヺ | reversed dakuten | correct pair | the other voiced-mark rows |
| jaconv | Python | ヵ / ヶ | dropped | ka / ke | full-size カ / ケ convert |
| romaji-conv | JS | ゐ / ゑ | dropped | wi / we | modern kana all present |
| pykakasi | Python | っで | no entry | ddi | other sokuon entries exist |
| pykakasi | Python | half-width kana + U+FF9E | garbled | ga after NFKC | full-width ガ converts |
| unidecode | Python | half-width dakuten kana | artifacts | correct after NFKC | full-width katakana converts |
The hepburn one is worth a second look: without the apostrophe, シンヨウ (SHINYOU) collides with シニョウ. The apostrophe is not decoration, it is what keeps n plus a vowel distinct from a ny mora.
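The collision is easy to reproduce with a toy romanizer. This is a sketch under stated assumptions: `MORA` is a made-up table holding only the morae these two words need, and `romanize` is a hypothetical function, not the hepburn package's API.

```python
# Hypothetical mini-table: just enough morae for the two colliding words.
MORA = {"シ": "shi", "ン": "n", "ヨ": "yo", "ウ": "u", "ニョ": "nyo"}


def romanize(kana, apostrophe=True):
    """Greedy two-character match; optionally mark ン before a vowel or y.
    Raises KeyError for kana outside the toy table."""
    out, prev, i = [], "", 0
    while i < len(kana):
        chunk = kana[i:i + 2] if kana[i:i + 2] in MORA else kana[i]
        roma = MORA[chunk]
        if apostrophe and prev == "ン" and roma[0] in "aiueoy":
            out.append("'")
        out.append(roma)
        prev = chunk
        i += len(chunk)
    return "".join(out).upper()


print(romanize("シンヨウ", apostrophe=False))  # SHINYOU
print(romanize("シニョウ", apostrophe=False))  # SHINYOU, same string
print(romanize("シンヨウ"))                    # SHIN'YOU
print(romanize("シニョウ"))                    # SHINYOU
```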
Why these tables drift
Three recurring reasons, all boring, all real:
- Standard inheritance. Hepburn, Kunrei-shiki, and Nihon-shiki share most of their table and differ in a handful of rows (shi/si, ji/zi, du/zu). Copying one standard from another and forgetting a single override is how the cutlet bug happened.
- Rare and historical kana. ゐ/ゑ (wi/we), small ヵ/ヶ, and the wa-row voiced kana are uncommon, so they get skipped when the common table is written and no test ever exercises them.
- Half-width kana and sound marks. Half-width katakana followed by a voiced or semi-voiced sound mark (U+FF9E/U+FF9F) needs NFKC normalization before lookup, or it romanizes to garbage. Two of the eight, pykakasi and unidecode, are exactly this, and both were closed rather than merged, because the maintainers treated normalization as the caller’s job. That is a defensible call, and worth stating plainly.
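If normalization is the caller's job, the caller's side is short. A minimal sketch using Python's standard `unicodedata` module; `TABLE` and `romanize_halfwidth` are hypothetical stand-ins for whatever romanizer is being called:

```python
import unicodedata

# Hypothetical full-width table; a real library's table would be far larger.
TABLE = {"ガ": "ga", "パ": "pa"}


def romanize_halfwidth(text):
    """Normalize to NFKC first so half-width kana + sound mark becomes
    one full-width character, then look it up."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(TABLE.get(ch, ch) for ch in normalized)


half = "\uFF76\uFF9E"  # half-width ｶ followed by the half-width voiced mark
print(len(half), len(unicodedata.normalize("NFKC", half)))  # 2 1
print(romanize_halfwidth(half))                             # ga
```

Without the NFKC step, the lookup sees two separate code points, and neither of them is a key in the full-width table.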
Honest limits
Not every difference here is a bug. Romanization has several valid standards, so si versus shi is a choice, not an error; a fix has to target one standard and leave the others intact, which is why the cutlet patch pinned the Kunrei value. Traditional Hepburn also writes ん as m before b, p, and m, so n is not universally correct either. The round-trip oracle tells you an entry is inconsistent with its own siblings. It does not tell you which romanization standard you should be using. That part is a human decision.
How to test it
- Property-test the round-trip. For every kana in your table, romanize it and assert the result is non-empty and matches the equivalent kana elsewhere (hiragana vs katakana of the same mora). Dropped and empty mappings fall out immediately.
- Diff the siblings. Line up the voiced and unvoiced pairs, the hiragana and katakana halves, and the three romanization standards. A cell that disagrees with its row is your suspect.
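The hiragana-versus-katakana half of those checks can be sketched as a single pass over a table. It assumes a flat dict from single kana to romaji; `TABLE` and `check_pairs` are illustrative names, and the sample rows are invented to show one dropped entry and one mismatch.

```python
# Hypothetical table: ヰ is missing and ヅ has drifted.
TABLE = {"か": "ka", "カ": "ka", "ゐ": "wi", "づ": "zu", "ヅ": "u"}


def check_pairs(table):
    """Flag every single kana whose reading is empty or differs from its
    hiragana/katakana twin (the two blocks are offset by 0x60)."""
    problems = []
    for kana, roma in table.items():
        if len(kana) != 1:
            continue
        code = ord(kana)
        if 0x3041 <= code <= 0x3096:      # hiragana
            twin = chr(code + 0x60)
        elif 0x30A1 <= code <= 0x30F6:    # katakana
            twin = chr(code - 0x60)
        else:
            continue
        if not roma or table.get(twin) != roma:
            problems.append((kana, twin, roma, table.get(twin)))
    return problems


for row in check_pairs(TABLE):
    print(row)
```

A mismatch is reported from both sides of the pair, which is noisy but keeps the check symmetric. The same loop can be pointed at voiced and unvoiced pairs or at two romanization standards by swapping how `twin` is computed.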
Full repros, affected versions, and the sibling for each are in the CJK failure corpus. The same “correct in English, wrong the moment it’s Japanese” shape shows up in a word counter that reads a whole paragraph as one word; when the problem is the width of the characters rather than their reading, it becomes text that overflows a terminal because length is not display width; and the half-width combining marks two of these rows trip over are the same ones that make walking a string by code point tear a character apart.