Why does splitting on spaces count a Japanese sentence as one word?

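Japanese and Chinese text is written with no spaces between words, so `text.split(/\s+/)` returns the whole sentence as a single token. The count is 1 no matter how long the passage is. Korean does use spaces, but between phrase-like units rather than individual words, so a whitespace count there is only a rough guide.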

How do I count words in Japanese or Chinese in JavaScript?

For a reading-time estimate, count CJK characters separately from space-delimited words — for example Latin words at ~220 per minute plus CJK characters at ~500 per minute. For actual word segmentation, use `Intl.Segmenter` at word granularity and keep the segments where `isWordLike` is true.

Does Intl.Segmenter work for Japanese word counting?

Yes. `Intl.Segmenter` at word granularity with the "ja" locale segments Japanese into word-like units using the runtime's dictionary. It is approximate, because Japanese has no single agreed word boundary, but it is far better than splitting on spaces, which does not split Japanese text at all.

Your word counter thinks a Japanese paragraph is one word

An onboarding step asks the user to say a little about themselves and won’t advance until the answer has at least two words. A Japanese user types 東京に住んでいます — a complete sentence, “I live in Tokyo” — and the step rejects it. It counted one word. The user did nothing wrong; the gate did.

This is the omi bug, almost verbatim: `len(transcript.split()) >= 2` in Python. `split()` with no arguments splits on whitespace, so it returns a single element for text that has no whitespace. A real Japanese answer never reaches the check behind it, and the question stays marked unanswered. Change the language and a working feature quietly stops working.

It is a small class in my corpus — 2 of 97 catalogued CJK bugs — but each one silently breaks a real feature for every CJK user at once. The other is emdash: an editor footer whose word count and reading time came from a space-splitting counter, so a 2,000-character Japanese draft displayed “1 word · 1 min read.” The full list is the CJK failure corpus.

Splitting on whitespace does not split CJK

Word counters almost always start the same way:

JavaScript

const words = text.trim().split(/\s+/).filter(Boolean).length;

That works for English because English puts a space between words. Japanese and Chinese do not. (Korean does use spaces, though between phrase-like units rather than individual words.) With no delimiter to split on, the whole run is one token:

JavaScript

"I live in Tokyo".split(/\s+/).length;   // 4
"東京に住んでいます".split(/\s+/).length;  // 1  ← the entire sentence

The failure is invisible in English review and total in Japanese: every CJK passage, of any length, counts as one word. Reading-time labels collapse to “1 min,” minimum-length validators pass empty-feeling answers and reject full ones, and content gates lock out the exact users they were meant to serve.
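To make the collapse concrete, here is a minimal sketch of a footer built on the whitespace counter. The function name `naiveFooter` and the label format are hypothetical, chosen only to mirror the “1 word · 1 min read” symptom described earlier. It is not the code of any particular editor.

```javascript
// Hypothetical footer built on a whitespace-only word count (illustration only).
function naiveFooter(text) {
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  const minutes = Math.max(1, Math.round(words / 220));
  return `${words} ${words === 1 ? "word" : "words"} · ${minutes} min read`;
}

// A 10-character sentence repeated 200 times: 2,000 characters, no whitespace.
const draft = "東京に住んでいます。".repeat(200);

console.log(draft.length);       // 2000
console.log(naiveFooter(draft)); // "1 word · 1 min read"
```

The draft could be ten times longer and the label would not change, because the counter never sees a delimiter.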

Fix one: count CJK characters separately

For a reading-time estimate you do not need real word segmentation. You need a number that tracks how long the text takes to read, and CJK reads at a roughly steady characters-per-minute rate. So count the two systems on their own terms — Latin words by whitespace, CJK by character — and add them:

JavaScript

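function estimateMinutes(text) {
  // Han, kana, and Hangul are counted per character.
  const cjk = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;
  const cjkChars = (text.match(cjk) ?? []).length;
  // Blank out the CJK, then count whitespace-delimited tokens that contain a
  // letter or digit, so leftover punctuation such as 。 is not counted as a word.
  const latinWords = text.replace(cjk, " ").split(/\s+/)
    .filter((t) => /[\p{L}\p{N}]/u.test(t)).length;
  // Latin at ~220 wpm plus CJK at ~500 cpm, never less than one minute.
  return Math.max(1, Math.round(latinWords / 220 + cjkChars / 500));
}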

This is what runs on the page you’re reading — the reading time in the header comes from exactly this shape, Latin words at ~220 wpm plus CJK characters at ~500 cpm. It is not a “word” count and it does not pretend to be; it is an honest estimate of reading effort, which is what the label actually promises.
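When tuning the two rates it can help to see the counts before they are rounded into a label. The sketch below follows the same approach as `estimateMinutes`, and it only counts tokens that contain a letter or digit, so punctuation left between CJK runs (such as 。) is not mistaken for a word. The name `readingBreakdown` and the shape of the returned object are assumptions for illustration.

```javascript
// Same script classes as estimateMinutes; name and return shape are illustrative.
const CJK_CHARS = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;

function readingBreakdown(text) {
  const cjkChars = (text.match(CJK_CHARS) ?? []).length;
  const latinWords = text
    .replace(CJK_CHARS, " ")
    .split(/\s+/)
    .filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
  // Unrounded minutes: Latin words at ~220 per minute, CJK characters at ~500.
  const rawMinutes = latinWords / 220 + cjkChars / 500;
  return { latinWords, cjkChars, rawMinutes };
}

const english = readingBreakdown("I live in Tokyo");
console.log(english.latinWords, english.cjkChars);   // 4 0

const japanese = readingBreakdown("東京に住んでいます。");
console.log(japanese.latinWords, japanese.cjkChars); // 0 9

const mixed = readingBreakdown("React で東京の地図を描く");
console.log(mixed.latinWords, mixed.cjkChars);       // 1 9

console.log(readingBreakdown("あ".repeat(2000)).rawMinutes); // 4
```

Mixed text is the case worth checking by hand: the Latin word and the CJK characters are each counted once, and neither count leaks into the other.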

Fix two: segment with `Intl.Segmenter` at word granularity

When you genuinely need words — a real count, a tokenizer, search — Intl.Segmenter at word granularity uses the runtime’s dictionary to split CJK into word-like units:

JavaScript

function countWords(text, locale) {
  const seg = new Intl.Segmenter(locale, { granularity: "word" });
  let n = 0;
  for (const s of seg.segment(text)) if (s.isWordLike) n += 1;
  return n;
}

countWords("東京に住んでいます", "ja"); // ~4 word-like units, not 1

Keep only the segments where `isWordLike` is true, so punctuation and spaces drop out. `Intl.Segmenter` ships in Node 16+ and current browsers. The cost is that it is locale-aware and dictionary-based, so it is heavier than a character count, and its Japanese output is approximate. That limit is covered under “Where it stops.”
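Two practical details are worth a sketch: checking that the API exists before calling it, and looking at the segments themselves, not just the count. The helper name `wordSegments` and the choice to return `null` when the API is missing are assumptions for illustration. What you fall back to (for example, the character count from fix one) is up to the caller.

```javascript
// Returns the word-like segments, or null when Intl.Segmenter is unavailable.
function wordSegments(text, locale) {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null; // older runtime: let the caller pick a fallback
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike)  // drops spaces and punctuation
    .map((s) => s.segment);
}

console.log(wordSegments("I live in Tokyo.", "en")); // [ 'I', 'live', 'in', 'Tokyo' ]

// The Japanese split depends on the runtime's dictionary data,
// so inspect it on your own target instead of hard-coding an expectation.
console.log(wordSegments("東京に住んでいます", "ja"));
```

Printing the segments on each runtime you support is a cheap way to see how much the boundaries vary before you depend on the count.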

Which fix for which job

Approach	Counts CJK at all	Needs a locale	Cost	Good for
`split(/\s+/)`	no — whole passage is 1	no	trivial	English-only text, and nothing else
CJK-char + Latin-word count	yes (by character)	no	trivial	reading time, length gates, “min read”
`Intl.Segmenter` word granularity	yes (by word)	yes	moderate	real word counts, tokenizing, search

For reading time and minimum-length gates — the two places these bugs actually bite — the character count is not a compromise, it is the right tool. Reach for Intl.Segmenter only when you need words as words.
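For the length-gate case, a minimal sketch might look like the following. The function name `hasEnoughContent` and both thresholds (two Latin words, four CJK characters) are assumptions picked for illustration, not values from any of the projects mentioned. Pick numbers that match what your prompt actually asks for.

```javascript
const CJK_RE = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;

// Passes when either script contributes enough material on its own.
// minWords and minCjkChars are hypothetical defaults.
function hasEnoughContent(text, { minWords = 2, minCjkChars = 4 } = {}) {
  const cjkChars = (text.match(CJK_RE) ?? []).length;
  const latinWords = text
    .replace(CJK_RE, " ")
    .split(/\s+/)
    .filter((token) => /[\p{L}\p{N}]/u.test(token)).length;
  return latinWords >= minWords || cjkChars >= minCjkChars;
}

console.log(hasEnoughContent("I live in Tokyo"));    // true
console.log(hasEnoughContent("東京に住んでいます"));   // true  (9 CJK characters)
console.log(hasEnoughContent("Hello"));              // false (1 word)
console.log(hasEnoughContent("はい"));                // false (2 CJK characters)
```

The gate asks “is there enough here” in whichever unit the text is written in, so the Japanese answer from the opening example passes without any segmentation.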

Where it stops

“Word” is a fuzzy idea in Japanese. Even Intl.Segmenter disagrees with human intuition and with other tokenizers about where 住んでいます breaks, because there is no single correct answer — segmentation is a model, not a fact. So for a reading-time estimate, counting characters is not just simpler than tokenizing; it sidesteps a question that has no clean answer. Know which one your feature actually needs. A length gate needs “is there enough here,” which characters answer directly; only a true word count needs the segmenter, and even then the number is an estimate wearing a confident face. The related cases — an Enter that fires mid-conversion and lines that break in the wrong place — are the same shape: code that was correct in English and never got read with a Japanese keyboard in mind.
