The substring that cuts a character in half
Slice a string that holds an emoji or a rare kanji and you can split one character into two, because JavaScript indexes by UTF-16 code unit, not by character. `[...str]` and `Array.from` fix surrogate pairs but still tear ZWJ emoji and combining marks apart; the only walk that respects every visible character is `Intl.Segmenter` at grapheme granularity.
Truncate a username to fit a card. The name ends in 𠮷 — a real kanji, the one in the sushi chain 𠮷野家, from the far end of Unicode. Your slice lands between its two halves and the cell renders a replacement box, or worse, silent garbage. You cut a character in half and nothing threw.
If you only ever handle ASCII you will not reproduce this, because every character you test with is exactly one code unit. That is the trap. The code looks correct until someone types an emoji or a name from outside the Basic Multilingual Plane, and by then it is in production.
Of 97 CJK and Unicode bugs I’ve catalogued in open-source libraries, 11 are this class — text walked by the wrong unit. It shows up as a table cell splitting a wide character (cli-table3), a UI truncation slicing raw code units (Clerk), styled-text offsets drifting after one astral kanji (kaplay), and a rainbow-flag emoji coming apart into its pieces (grapheme-splitter, lodash). The whole list, with repros and fixes, is the CJK failure corpus.
String.prototype.length counts code units, not characters
A JavaScript string is UTF-16. Characters in the Basic Multilingual Plane are one code unit; everything above it (most emoji, and CJK extensions like 𠮷, U+20BB7) is a surrogate pair, two code units standing in for one character.
"a".length; // 1
"あ".length; // 1 (BMP)
"😀".length; // 2 (surrogate pair)
"𠮷".length; // 2 (supplementary-plane kanji)
So str.slice(0, 8) does not keep the first 8 characters. It keeps the first 8 code units, and if a surrogate pair straddles that boundary, it keeps only the first half. slice, substring, the deprecated substr, and a bare for (let i = 0; i < str.length; i++) all share this: they index by code unit.
[...str] and Array.from walk by code point
Spread and Array.from iterate the string with its built-in iterator, which yields code points, so surrogate pairs stay whole:
[..."😀"]; // ["😀"] length 1
[..."𠮷野家"]; // ["𠮷", "野", "家"] length 3
Array.from("😀").length; // 1
This is the fix for most truncation and length bugs, and it is what closed several of the corpus entries. [...str].slice(0, 8).join("") truncates without splitting a surrogate pair. Reach for this first.
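Side by side, the broken cut and the fix look roughly like this. It is a minimal sketch, not code from any of the libraries above: truncateCodePoints and username are hypothetical names introduced for illustration, and the astral kanji is written as an escape so the code units are unambiguous.

```javascript
// "abc𠮷de": 𠮷 (U+20BB7) is stored as the surrogate pair \uD842 \uDFB7.
const username = "abc\u{20BB7}de";

// Hypothetical helper: keep at most `max` code points,
// so a surrogate pair is either kept whole or dropped whole.
function truncateCodePoints(str, max) {
  const points = Array.from(str); // the string iterator yields whole code points
  return points.length <= max ? str : points.slice(0, max).join("");
}

// Code-unit slice: keeps units 0..3, which is "abc" plus a lone high surrogate.
const broken = username.slice(0, 4);

// Code-point truncation: keeps "abc" plus the whole kanji.
const safe = truncateCodePoints(username, 4);

console.log(broken.charCodeAt(3).toString(16)); // "d842", the orphaned first half
console.log(safe === "abc\u{20BB7}"); // true
console.log(safe.length); // 5 code units for 4 code points
```

Note that the safe result is longer in code units than the limit you asked for. That is expected: the limit is now counted in code points.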
Where code points still aren’t enough: ZWJ emoji and combining marks
Here is the second cliff. A code point is not always a character as a reader means it. Some visible characters are built from several code points glued together:
- a family emoji is several people joined by zero-width joiners (U+200D),
- a flag is two regional-indicator letters,
- é can be e + a combining acute accent,
- Indic scripts join consonants into conjunct clusters.
[...str] splits all of those, because it walks code points:
[..."👨\u200D👩\u200D👧"]; // ["👨", "\u200D", "👩", "\u200D", "👧"], five pieces, ZWJ and all
[..."🇯🇵"]; // ["🇯", "🇵"], the flag comes apart
That is a real bug class, not a curiosity: Slate put the cursor mid-cluster in Hindi and Bengali, grapheme-splitter tore the rainbow flag apart, and wenmode (the one merged fix among these) mistook combining marks around CJK for punctuation.
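If you want to see what a cluster is made of, listing its code points helps. The sketch below assumes nothing beyond standard String methods; codePoints is a hypothetical helper name, and the inputs use escapes because zero-width joiners and combining marks are invisible in most editors.

```javascript
// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
function codePoints(str) {
  return Array.from(str, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Family emoji: man, ZWJ, woman, ZWJ, girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
// Flag of Japan: regional indicators J and P.
const flag = "\u{1F1EF}\u{1F1F5}";
// e followed by a combining acute accent.
const accented = "e\u0301";

console.log(codePoints(family));
// ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467"]
console.log(codePoints(flag)); // ["U+1F1EF", "U+1F1F5"]
console.log(codePoints(accented)); // ["U+0065", "U+0301"]
console.log(family.length); // 8 code units for one visible character
```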
Intl.Segmenter walks by grapheme cluster
The unit a reader actually perceives as one character is a grapheme cluster (Unicode UAX #29). Intl.Segmenter with granularity: "grapheme" is the standard, built-in way to walk it:
function chars(str) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(seg.segment(str), (s) => s.segment);
}
chars("👨\u200D👩\u200D👧"); // one element: the whole family emoji
chars("🇯🇵"); // ["🇯🇵"], one flag
chars("e\u0301"); // ["é"], base + combining mark, kept together
Intl.Segmenter ships in Node 16+ and every current browser, and grapheme segmentation does not depend on the locale in practice, so you can pass undefined for the locale. Use it when the count has to match what a human sees: cursor movement, character limits shown to a user, truncation of display text with mixed emoji.
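The same idea applied to truncation might look like the following. It is a sketch under the assumption that Intl.Segmenter is available in your runtime; truncateGraphemes is a hypothetical name, and the test strings use escapes so the joiners and combining marks are visible.

```javascript
// One segmenter can be reused across calls.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: keep at most `max` grapheme clusters.
function truncateGraphemes(str, max) {
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count === max) break; // stop before starting cluster number max + 1
    out += segment;
    count += 1;
  }
  return out;
}

const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";

// The family emoji counts as one character, so it survives a limit of 1 intact.
console.log(truncateGraphemes(familyEmoji + "ab", 1) === familyEmoji); // true
// A base letter and its combining accent stay together.
console.log(truncateGraphemes("e\u0301x", 1) === "e\u0301"); // true
```

Stopping inside the loop means a long string is only segmented up to the cut, not to the end.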
Three units, and which one you want
The whole bug is picking the wrong unit for the job. There are three:
| You walk by | With | Keeps surrogate pairs (😀, 𠮷) whole | Keeps ZWJ emoji / flags / combining marks whole |
|---|---|---|---|
| code unit | .length, .slice, .substring, str[i] | no | no |
| code point | [...str], Array.from(str), for…of | yes | no |
| grapheme cluster | new Intl.Segmenter(locale, { granularity: "grapheme" }) | yes | yes |
Most “length is wrong” and “truncation makes mojibake” bugs are the jump from row one to row two, and [...str] is the whole fix. The jump to row three matters when you count or cut display text with emoji and accents in it. Match the unit to the question; do not reach for the heaviest tool by default.
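To check which row a given string falls into, you can measure it in all three units at once. This is an illustrative sketch; lengths is a hypothetical helper, not part of any library mentioned here.

```javascript
// Hypothetical helper: measure one string in all three units.
function lengths(str) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length, // row one
    codePoints: Array.from(str).length, // row two
    graphemes: Array.from(seg.segment(str)).length, // row three
  };
}

console.log(lengths("a"));
// { codeUnits: 1, codePoints: 1, graphemes: 1 }
console.log(lengths("\u{20BB7}")); // 𠮷
// { codeUnits: 2, codePoints: 1, graphemes: 1 }
console.log(lengths("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}")); // family emoji
// { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

When the first two numbers differ, [...str] is enough. When the last two differ, only the grapheme walk gives the count a reader would.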
Where it stops
Counting characters correctly is not the same as laying them out correctly. A wide CJK character or an emoji occupies two terminal columns even though it is one grapheme, which is a separate axis — the cli-table3 fix had to be code-point-safe and width-aware, and those are two different problems. And Intl.Segmenter follows the Unicode rules of its runtime; a very old engine without it needs a polyfill. Knowing where a character ends is step one. Whether your layout survives a two-column character is the next question, and the answer, as usual, is to actually render some Japanese and look.
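For engines that may lack Intl.Segmenter, one option short of a polyfill is to feature-detect and degrade to code points. The sketch below assumes that degraded behaviour is acceptable for your use case: the fallback keeps surrogate pairs whole but will still split ZWJ emoji, flags, and combining marks. splitChars is a hypothetical name.

```javascript
// Hypothetical helper: split by grapheme cluster when the runtime can,
// otherwise fall back to code points.
function splitChars(str) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return Array.from(seg.segment(str), (s) => s.segment);
  }
  // Fallback: surrogate pairs stay whole, multi-code-point clusters do not.
  return Array.from(str);
}

console.log(splitChars("𠮷野家").length); // 3 on either path
```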