String length is not display width: the CJK bug hiding in tables and terminals
A CJK character like 生 is one code point but fills two columns in a monospace terminal. Code that budgets, truncates, or aligns text by character count instead of display width overflows the moment the text is Japanese, Chinese, or Korean. Measure with a width function — runewidth in Go, unicode-width in Rust, wcwidth in Python, string-width in JS — not len, .length, or rune count.
A command-line tool prints a table. The ASCII rows line up. The row with a Japanese name pushes its right border two columns past the rest and the box tears open. Nobody typed anything wrong. The layout measured that name by counting characters, and a kanji counts as one character but fills two columns.
This one keeps coming back. In a corpus of 97 real CJK and Unicode bugs I’ve catalogued from open-source libraries, six are width and normalization, and the display-width variant turns up in table renderers, CLI progress bars, and editor autocomplete alike. Same mistake, different repo.
A character can be one code point and two columns wide
Terminals are grids. A Latin letter fills one cell. A wide character — most kanji, hanzi, hangul, full-width forms — fills two. Unicode calls this East Asian Width. So there are at least four different numbers you can get back when you “measure” a string, and they are not the same:
| What you measure | JS | Go | Rust | Python | When it is wrong |
|---|---|---|---|---|---|
| UTF-16 code units | s.length | n/a | n/a | n/a | splits astral chars (𠮷 counts as 2) |
| Code points / runes | [...s].length | utf8.RuneCountInString | s.chars().count() | len(s) | ignores width (生 counts as 1) |
| Grapheme clusters | Intl.Segmenter | rivo/uniseg | unicode-segmentation | regex \X | still not columns |
| Display columns | string-width | mattn/go-runewidth | unicode-width | wcwidth | this is the one terminals use |
For alignment and truncation in a fixed-width grid you want the last row. Everything above it undercounts wide text.
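To make the gap between those numbers concrete, here is a small Python sketch that measures the same strings three ways. The helper names are mine, and display_width is a rough stand-in built on the standard library's unicodedata.east_asian_width rather than a real width library such as wcwidth, so treat it as an approximation of what a terminal does.

```python
import unicodedata


def utf16_units(s: str) -> int:
    # What JavaScript's s.length reports: UTF-16 code units.
    return len(s.encode("utf-16-le")) // 2


def code_points(s: str) -> int:
    # What Python's len(s) reports: Unicode code points.
    return len(s)


def display_width(s: str) -> int:
    # Approximation only: East Asian Wide/Fullwidth = 2 columns,
    # combining marks = 0, everything else = 1.
    total = 0
    for ch in s:
        if unicodedata.combining(ch):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


for s in ("abc", "生命", "𠮷"):
    print(s, utf16_units(s), code_points(s), display_width(s))
# abc 3 3 3
# 生命 2 2 4
# 𠮷 2 1 2
```

Only the last column matches what the grid actually has to reserve.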
The bug: budgeting by rune count instead of width
I hit this in go-pretty, the Go table and progress-bar library. Its text.Trim kept maxLen runes regardless of how wide they were. But every caller passes a display-width budget: the table trims a rendered line only after its measured width exceeds WidthMax, and the progress renderer trims to terminal width. So a two-column character was charged as one column, and wide East Asian rows overflowed the box.
The before:
text.Trim("生命生命", 4) // => "生命生命" display width 8, budget was 4
text.Snip("生命生命生命", 5, "~") // => "生命生命~" overflows the 5-column budget
The fix adds RuneWidth per rune and stops once a rune no longer fits, while still copying any trailing escape sequences so color codes stay closed. Twenty-six lines, with wide-character cases added to the existing TestTrim and TestSnip. It merged.
- Receipt: jedib0t/go-pretty#410 (merged)
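The shape of that fix can be sketched in a few lines of Python. This is not go-pretty's code: trim_by_count and trim_to_width are hypothetical names, char_width is a simplified stand-in for RuneWidth built on unicodedata, and the escape-sequence handling the real fix keeps is left out.

```python
import unicodedata


def char_width(ch: str) -> int:
    # Simplified stand-in for a width library:
    # combining marks = 0, East Asian Wide/Fullwidth = 2, everything else = 1.
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def trim_by_count(s: str, max_len: int) -> str:
    # The buggy shape: the budget is spent one unit per character.
    return s[:max_len]


def trim_to_width(s: str, max_width: int) -> str:
    # The fixed shape: charge each character its column width and
    # stop as soon as the next one would exceed the budget.
    used = 0
    kept = []
    for ch in s:
        w = char_width(ch)
        if used + w > max_width:
            break
        kept.append(ch)
        used += w
    return "".join(kept)


print(trim_by_count("生命生命", 4))  # 生命生命 (8 columns, budget was 4)
print(trim_to_width("生命生命", 4))  # 生命
print(trim_to_width("生命生命", 5))  # 生命 (a third kanji would need column 6)
```

Note the odd-budget case: with 5 columns and only two-column characters, the result uses 4, and the caller decides whether to pad the spare cell.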
The same bug, one repo over
The reason I call this a pattern and not an incident: the identical mistake sits open in the micro editor. Its command-bar autocomplete scrolls by CharacterCountInString while the renderer draws by runewidth, so a suggestion containing full-width CJK gets pushed partly or fully off screen. Different codebase, the same two functions disagreeing about what a “length” is.
- Same pattern: micro-editor/micro#4135 (open) — measures scroll by character count, draws by display width.
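A toy Python version of that disagreement, with hypothetical function names and no relation to micro's actual code: one helper places the cursor by character count, the other by display width, and they diverge as soon as a wide character sits to the left of the cursor.

```python
import unicodedata


def display_width(s: str) -> int:
    # Simplified stand-in for a width library: Wide/Fullwidth = 2, else 1.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s
    )


def column_by_count(text: str, index: int) -> int:
    # What a scroll calculation sees if it counts characters.
    return len(text[:index])


def column_by_width(text: str, index: int) -> int:
    # What the renderer sees if it draws by display width.
    return display_width(text[:index])


suggestion = "生命.txt"
print(column_by_count(suggestion, 2))  # 2
print(column_by_width(suggestion, 2))  # 4
```

Two columns of drift per wide character is enough to push the tail of a long suggestion off screen.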
If you maintain anything that lays text out in cells, this is the first place to look.
Where display width gets genuinely hard
I don’t want to oversell the fix, because the easy 90% (kanji is 2, ASCII is 1) hides a messy tail where “width” is not even well defined:
- Ambiguous width. East Asian Width category A (some symbols, box-drawing, Greek in a CJK context) is one column or two depending on locale and font. go-runewidth exposes an EastAsianWidth flag for exactly this. I filed a related one in Zed, where the block cursor was misaligned over ambiguous-width characters because it used the glyph’s intrinsic width, not the rendered cell width (zed-industries/zed#60017, open).
- Combining marks. A base plus a combining mark is two code points, one grapheme, and the width of the base. Width-aware code that iterates code points instead of grapheme clusters can split a dakuten off its kana. That is what tabled did when wrapping table cells (zhiburt/tabled#585, open).
- ZWJ emoji. A family emoji is many code points joined by zero-width joiners, one grapheme, usually width 2. Counting code points here is nonsense.
- Terminals disagree. Emulators don’t all implement the same width tables, so a “correct” width can still render one column off in some terminal. This part is genuinely underspecified, not a library bug.
So: use a display-width library for the budget, iterate by grapheme cluster when you slice, and accept that ambiguous width has no single right answer.
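As a rough Python illustration of that advice, the sketch below wraps text by display width while keeping combining marks attached to their base, and takes the ambiguous-width decision as a parameter instead of pretending there is one answer. Everything here is an assumption for illustration: clusters only glues combining marks onto the preceding character, which is far short of real grapheme segmentation (it does nothing for ZWJ emoji), and ambiguous_wide is a hypothetical flag name.

```python
import unicodedata


def clusters(s: str) -> list:
    # Simplified clustering: attach each combining mark to the
    # character before it. A real segmenter follows UAX #29.
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def cluster_width(cluster: str, ambiguous_wide: bool = False) -> int:
    # A cluster takes the width of its base character.
    eaw = unicodedata.east_asian_width(cluster[0])
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2  # caller opted into the CJK-context reading
    return 1


def wrap(s: str, max_width: int, ambiguous_wide: bool = False) -> list:
    lines, current, used = [], "", 0
    for c in clusters(s):
        w = cluster_width(c, ambiguous_wide)
        if current and used + w > max_width:
            lines.append(current)
            current, used = "", 0
        current += c
        used += w
    if current:
        lines.append(current)
    return lines


text = "かか\u3099"  # か, then か followed by a combining dakuten
print(wrap(text, 2) == ["か", "か\u3099"])  # True: the mark stays with its kana
print(text[:2] == "かか")  # True: slicing by code point orphans the mark
print(cluster_width("α"), cluster_width("α", ambiguous_wide=True))  # 1 2
```

In real code the clustering and the width table should both come from maintained libraries; the point of the sketch is only the order of operations: segment first, then measure each cluster.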
How to test it in five minutes
Feed the layout one wide string and assert on the rendered columns, not the character count.
- Trim and truncate: assert width(trim("生命生命", 4)) <= 4. Rune-count code returns the whole string here.
- Alignment and padding: render a table with one ASCII row and one CJK row of the same character count; the borders should still line up.
- Wrapping: wrap a string with a combining mark at the boundary and check the mark stays with its base.
None of this needs a Japanese keyboard. 生命 and one width assertion catch the common case.
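The alignment check might look like this in Python. It is a sketch under assumptions: render_row and pad_right are hypothetical helpers standing in for whatever your layout code does, and display_width is a simplified unicodedata-based measure, where a real test would call the same width library the renderer uses.

```python
import unicodedata


def display_width(s: str) -> int:
    # Simplified stand-in for a width library: Wide/Fullwidth = 2, else 1.
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s
    )


def pad_right(s: str, width: int) -> str:
    # Pad with spaces up to a display-width target, not a character count.
    return s + " " * max(0, width - display_width(s))


def render_row(cell: str, col_width: int) -> str:
    return "| " + pad_right(cell, col_width) + " |"


names = ("abcd", "生命生命")  # same character count, different widths

# Width-aware padding: every row ends in the same column.
rows = [render_row(n, 8) for n in names]
assert len({display_width(r) for r in rows}) == 1

# str.ljust pads by code points, so the CJK row runs four columns long.
naive = ["| " + n.ljust(8) + " |" for n in names]
assert [display_width(r) for r in naive] == [12, 16]
```

The second assertion documents the failure; in a real suite you would keep only the first and point it at your own renderer.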
The full width and normalization set, with repros and the sibling that already did it right, is in the CJK failure corpus. The nearest neighbours are measuring problems too: walking a string by the wrong unit is where a slice cuts a character in half, and a word counter that reads a Japanese paragraph as one word is the same “counted the wrong thing” mistake one level up. When the text is Japanese specifically, a romaji table that drifts one row is the pattern moved into transliteration.