Why is "😀".length equal to 2 in JavaScript?

A string's `length` counts UTF-16 code units, not characters. An emoji or a supplementary-plane character (like the kanji 𠮷) is stored as a surrogate pair, two code units, so `length` reports 2. To count code points instead, use `[...str].length` or `Array.from(str).length`.

What is the difference between a code point and a grapheme cluster?

A code point is one numbered entry in the Unicode code space (U+0000 to U+10FFFF). A grapheme cluster is what a reader sees as one character, which can be several code points: a ZWJ emoji like a family, or a base letter plus combining marks. `[...str]` splits by code point; `Intl.Segmenter` at grapheme granularity splits by grapheme cluster.

How do I split a string into characters without breaking emoji?

For most cases `[...str]` or `Array.from(str)` is enough — it keeps surrogate pairs whole. If you need ZWJ emoji, flag sequences, and combining marks to stay together too, walk it with `Intl.Segmenter` at grapheme granularity and collect each segment back into a string.

The substring that cuts a character in half

Truncate a username to fit a card. The name ends in 𠮷, a real kanji (the one in the beef-bowl chain 𠮷野家) that lives outside the Basic Multilingual Plane. Your slice lands between its two halves and the card renders a replacement box, or worse, silent garbage. You cut a character in half and nothing threw.

If you only ever handle ASCII you will not reproduce this, because every character you test with is exactly one code unit. That is the trap. The code looks correct until someone types an emoji or a name from outside the Basic Multilingual Plane, and by then it is in production.

Of 97 CJK and Unicode bugs I’ve catalogued in open-source libraries, 11 are this class — text walked by the wrong unit. It shows up as a table cell splitting a wide character (cli-table3), a UI truncation slicing raw code units (Clerk), styled-text offsets drifting after one astral kanji (kaplay), and a rainbow-flag emoji coming apart into its pieces (grapheme-splitter, lodash). The whole list, with repros and fixes, is the CJK failure corpus.

`String.prototype.length` counts code units, not characters

A JavaScript string is UTF-16. Characters in the Basic Multilingual Plane are one code unit; everything above it — most emoji, and CJK extensions like 𠮷 (U+20BB7) — is a surrogate pair, two code units standing in for one character.

JavaScript

"a".length;   // 1
"あ".length;  // 1  (BMP)
"😀".length;  // 2  (surrogate pair)
"𠮷".length;  // 2  (supplementary-plane kanji)

So `str.slice(0, 8)` does not keep the first 8 characters. It keeps the first 8 code units, and if the cut falls between the two halves of a surrogate pair, it keeps only the first half. `slice`, `substring`, the deprecated `substr`, and a bare `for (let i = 0; i < str.length; i++)` all share this: they index by code unit.
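A minimal sketch of that failure, so it can be reproduced without pasting an emoji into a terminal. The helper name `endsWithLoneHighSurrogate` is hypothetical, written for this illustration only; it checks whether a string ends in the first half of a surrogate pair.

```javascript
// "ab" followed by 𠮷 (U+20BB7), written as an escape so the code units are explicit.
const name = "ab\u{20BB7}";

// Hypothetical helper: true when the string ends in a high surrogate
// (0xD800 to 0xDBFF) that has lost its low half.
function endsWithLoneHighSurrogate(str) {
  if (str.length === 0) return false;
  const last = str.charCodeAt(str.length - 1);
  return last >= 0xd800 && last <= 0xdbff;
}

// Three code units: "a", "b", and the first half of 𠮷.
const cut = name.slice(0, 3);

console.log(name.length);                     // 4
console.log(cut.length);                      // 3
console.log(endsWithLoneHighSurrogate(cut));  // true
console.log(endsWithLoneHighSurrogate(name)); // false
```

Nothing throws at any point; the damage only shows up when the string is rendered or encoded.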

`[...str]` and `Array.from` walk by code point

Spread and `Array.from` iterate the string with its built-in iterator, which yields code points, so surrogate pairs stay whole:

JavaScript

[..."😀"];            // ["😀"]        length 1
[..."𠮷野家"];         // ["𠮷", "野", "家"]  length 3
Array.from("😀").length; // 1

This is the fix for most truncation and length bugs, and it is what closed several of the corpus entries. `[...str].slice(0, 8).join("")` truncates without cutting a character in half. Reach for this first.
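Wrapped up as a function, the one-liner above might look like the sketch below. `truncateByCodePoint` is a hypothetical name, not something from any of the libraries mentioned, and it still splits ZWJ emoji and combining marks.

```javascript
// Hypothetical helper: keep at most `max` code points of `str`.
// Array.from uses the string iterator, so surrogate pairs are never split.
function truncateByCodePoint(str, max) {
  return Array.from(str).slice(0, max).join("");
}

const shop = "ab\u{20BB7}野家"; // "ab𠮷野家": 5 code points, 6 code units

console.log(truncateByCodePoint(shop, 3));        // "ab𠮷"
console.log(truncateByCodePoint(shop, 3).length); // 4 (𠮷 is two code units)
console.log(truncateByCodePoint(shop, 10));       // "ab𠮷野家" (shorter than the limit)
```

It builds an array of every code point before slicing, which is fine for usernames and labels; for very long strings a `for...of` loop that stops early would avoid the extra allocation.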

Where code points still aren’t enough: ZWJ emoji and combining marks

Here is the second cliff. A code point is not always a character as a reader means it. Some visible characters are built from several code points glued together:

a family emoji is several people joined by zero-width joiners (U+200D),
a flag is two regional-indicator letters,
é can be e + a combining acute accent,
Indic scripts join consonants into conjunct clusters.

`[...str]` splits all of those, because it walks code points:

JavaScript

[..."👨‍👩‍👧"];  // ["👨", "‍", "👩", "‍", "👧"]  — five pieces, ZWJ and all
[..."🇯🇵"];      // ["🇯", "🇵"]  — the flag comes apart

That is a real bug class, not a curiosity: Slate put the cursor mid-cluster in Hindi and Bengali, grapheme-splitter tore the rainbow flag apart, and wenmode — the one merged fix among these — mistook combining marks around CJK for punctuation.
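The combining-mark case from the list above is the easiest one to reproduce, because it needs no emoji font. A small sketch, with both spellings of é written as escapes so the difference is visible in the source:

```javascript
const composed = "\u00E9";    // é as a single precomposed code point
const decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

console.log([...composed].length);    // 1
console.log([...decomposed].length);  // 2
console.log([...decomposed][0]);      // "e" (the accent has been separated from its base)
console.log(composed === decomposed); // false, although both render as é
```

Truncating the decomposed form by code point can therefore drop the accent and leave a bare `e` behind.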

`Intl.Segmenter` walks by grapheme cluster

The unit a reader actually perceives as one character is a grapheme cluster (defined in Unicode Standard Annex #29). `Intl.Segmenter` with `granularity: "grapheme"` is the standard, built-in way to walk it:

JavaScript

function chars(str) {
  const seg = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return Array.from(seg.segment(str), (s) => s.segment);
}

chars("👨‍👩‍👧");  // ["👨‍👩‍👧"]  — one character
chars("🇯🇵");      // ["🇯🇵"]      — one flag
chars("é");        // ["é"]        — base + combining mark, kept together

`Intl.Segmenter` ships in Node 16+ and every current browser. The default grapheme rules do not depend on the locale, so you can pass `undefined` for it. Use it when the count has to match what a human sees: cursor movement, character limits shown to a user, truncation of display text with mixed emoji.
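For the truncation case, a sketch along these lines stops walking as soon as the limit is reached. `truncateGraphemes` is a hypothetical name, and the fallback to code points when `Intl.Segmenter` is missing is an assumption made for this example; a real project might prefer a polyfill.

```javascript
// Hypothetical helper: keep at most `max` grapheme clusters of `str`.
function truncateGraphemes(str, max) {
  // Assumed fallback for engines without Intl.Segmenter: code points only,
  // which keeps surrogate pairs whole but can still split ZWJ emoji.
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return Array.from(str).slice(0, max).join("");
  }

  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let out = "";
  let count = 0;
  for (const { segment } of segmenter.segment(str)) {
    if (count >= max) break; // stop early instead of segmenting the whole string
    out += segment;
    count += 1;
  }
  return out;
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 👨‍👩‍👧
const flag = "\u{1F1EF}\u{1F1F5}";                         // 🇯🇵

console.log(truncateGraphemes(family + flag + "abc", 2) === family + flag); // true
console.log(truncateGraphemes("e\u0301tude", 1) === "e\u0301");             // true
```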

Three units, and which one you want

The whole bug is picking the wrong unit for the job. There are three:

| You walk by | With | Keeps surrogate pairs (😀, 𠮷) whole | Keeps ZWJ emoji / flags / combining marks whole |
| --- | --- | --- | --- |
| code unit | `.length`, `.slice`, `.substring`, `str[i]` | no | no |
| code point | `[...str]`, `Array.from(str)`, `for...of` | yes | no |
| grapheme cluster | `new Intl.Segmenter(locale, { granularity: "grapheme" })` | yes | yes |

Most “length is wrong” and “truncation makes mojibake” bugs are the jump from row one to row two, and `[...str]` is the whole fix. The jump to row three matters when you count or cut display text with emoji and accents in it. Match the unit to the question; do not reach for the heaviest tool by default.
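To see the three rows side by side, one string can be measured in all three units. This is a sketch only: `countUnits` is a hypothetical helper, and it assumes a runtime where `Intl.Segmenter` is available.

```javascript
// Hypothetical helper: measure one string in code units, code points, and graphemes.
function countUnits(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                                // row one
    codePoints: Array.from(str).length,                   // row two
    graphemes: Array.from(segmenter.segment(str)).length, // row three
  };
}

const familyEmoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // 👨‍👩‍👧

console.log(countUnits("abc"));            // { codeUnits: 3, codePoints: 3, graphemes: 3 }
console.log(countUnits("\u{20BB7}野家"));   // { codeUnits: 4, codePoints: 3, graphemes: 3 }
console.log(countUnits(familyEmoji));      // { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

When all three numbers agree, any of the tools will do; when they diverge, the gap shows which row the text needs.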

Where it stops

Counting characters correctly is not the same as laying them out correctly. A wide CJK character or an emoji occupies two terminal columns even though it is one grapheme, which is a separate axis: the cli-table3 fix had to be code-point-safe and width-aware, and those are two different problems. And `Intl.Segmenter` follows the Unicode rules of its runtime; a very old engine without it needs a polyfill. Knowing where a character ends is step one. Whether your layout survives a two-column character is the next question, and the answer, as usual, is to actually render some Japanese and look.
