Your word counter thinks a Japanese paragraph is one word
Count words with `text.split(/\s+/)` and a whole Japanese or Chinese paragraph comes back as one word, because CJK is written with no spaces between words. Reading-time estimates read "1 min" and length gates reject valid answers. The fix is to count CJK characters separately, or segment with `Intl.Segmenter` at word granularity.
An onboarding step asks the user to say a little about themselves and won’t advance until the answer has at least two words. A Japanese user types 東京に
This is the omi bug, almost verbatim: `len(transcript.split()) >= 2`. `split()` on whitespace returns a single element for text that has no whitespace, so a real Japanese answer never reaches the check behind it and the question stays marked unanswered. Change the language and a working feature quietly stops working.
It is a small class in my corpus — 2 of 97 catalogued CJK bugs — but each one silently breaks a real feature for every CJK user at once. The other is emdash: an editor footer whose word count and reading time came from a space-splitting counter, so a 2,000-character Japanese draft displayed “1 word · 1 min read.” The full list is the CJK failure corpus.
Splitting on whitespace does not split CJK
Word counters almost always start the same way:
const words = text.trim().split(/\s+/).filter(Boolean).length;
That works for English because English puts a space between words. Japanese and Chinese do not (Korean does use spaces, though between phrases rather than individual words). With no delimiter to split on, the whole run is one token:
"I live in Tokyo".split(/\s+/).length; // 4
"東京に住んでいます".split(/\s+/).length; // 1 ← the entire sentence
The failure is invisible in English review and total in Japanese: every unspaced CJK passage, of any length, counts as one word. Reading-time labels collapse to “1 min,” minimum-length validators pass empty-feeling answers and reject full ones, and content gates lock out the exact users they were meant to serve.
Fix one: count CJK characters separately
For a reading-time estimate you do not need real word segmentation. You need a number that tracks how long the text takes to read, and CJK reads at a roughly steady characters-per-minute rate. So count the two systems on their own terms — Latin words by whitespace, CJK by character — and add them:
function estimateMinutes(text) {
  // Han, kana, and Hangul characters are counted one by one.
  const cjk = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;
  const cjkChars = (text.match(cjk) ?? []).length;
  // Blank out the CJK characters, then count what is left by whitespace.
  const latinWords = text.replace(cjk, " ").split(/\s+/).filter(Boolean).length;
  // ~220 words per minute plus ~500 CJK characters per minute, never below 1.
  return Math.max(1, Math.round(latinWords / 220 + cjkChars / 500));
}
This is what runs on the page you’re reading: the reading time in the header comes from exactly this shape, Latin words at ~220 wpm plus CJK characters at ~500 cpm. It is not a “word” count and it does not pretend to be; it is an honest estimate of reading effort, which is what the label actually promises.
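One edge of this approach is worth a hedged sketch. The `\p{Script=...}` classes match CJK letters but not CJK punctuation such as 。 and 、, whose Script property is Common, so once the letters are blanked out those marks can survive as standalone tokens and be counted as Latin words. The variant below is an illustration, not the code that runs on this page: `readingStats` is a hypothetical name, the ~220 wpm and ~500 cpm rates from the text become overridable defaults, and only tokens containing a letter or digit are counted.

```javascript
// Hypothetical helper (not from the article): same counting rule as the
// reading-time estimate, but it skips punctuation-only tokens and returns
// the intermediate counts so the estimate can be inspected.
const CJK_CHAR =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;
const HAS_LETTER_OR_DIGIT = /[\p{L}\p{N}]/u;

function readingStats(text, wpm = 220, cpm = 500) {
  const cjkChars = (text.match(CJK_CHAR) ?? []).length;
  const latinWords = text
    .replace(CJK_CHAR, " ") // blank out CJK letters
    .split(/\s+/)
    .filter((token) => HAS_LETTER_OR_DIGIT.test(token)) // drops "" and "。"
    .length;
  const minutes = Math.max(1, Math.round(latinWords / wpm + cjkChars / cpm));
  return { latinWords, cjkChars, minutes };
}

console.log(readingStats("I live in 東京。"));
// { latinWords: 3, cjkChars: 2, minutes: 1 }
console.log(readingStats("あ".repeat(2000)).minutes); // 4
```

Returning the parts rather than a single number makes it easier to check which side of the sum a surprising estimate came from.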
Fix two: segment with Intl.Segmenter at word granularity
When you genuinely need words — a real count, a tokenizer, search — Intl.Segmenter at word granularity uses the runtime’s dictionary to split CJK into word-like units:
function countWords(text, locale) {
  const seg = new Intl.Segmenter(locale, { granularity: "word" });
  let n = 0;
  // isWordLike is false for whitespace and punctuation segments.
  for (const s of seg.segment(text)) if (s.isWordLike) n += 1;
  return n;
}
countWords("東京に住んでいます", "ja"); // ~4 word-like units, not 1
Keep only the segments where isWordLike is true, so punctuation and spaces drop out. It ships in Node 16+ and current browsers. The cost is that it is locale-aware and dictionary-based, so it is heavier than a character count and its Japanese output is approximate, which is the honest limit discussed below.
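Before trusting a count, it can help to look at the segments themselves. The sketch below is illustrative only: `wordLikeSegments` is a hypothetical helper name, and returning `null` when `Intl.Segmenter` is missing is one assumed convention that lets the caller fall back to a character count. No expected output is shown for the Japanese sentence, because the exact boundaries depend on the runtime's segmentation data.

```javascript
// Hypothetical helper: list the word-like segments instead of just counting
// them, so the segmenter's choices can be inspected.
function wordLikeSegments(text, locale) {
  // Feature-detect: older runtimes may not provide Intl.Segmenter.
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return null; // caller decides on a fallback, e.g. a character count
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop spaces and punctuation
    .map((s) => s.segment);
}

console.log(wordLikeSegments("I live in Tokyo", "en"));
// [ 'I', 'live', 'in', 'Tokyo' ]
console.log(wordLikeSegments("東京に住んでいます", "ja"));
```

Printing the Japanese result in two different runtimes is a quick way to see how much the boundaries can vary.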
Which fix for which job
| Approach | Counts CJK at all | Needs a locale | Cost | Good for |
|---|---|---|---|---|
| `split(/\s+/)` | no (whole passage counts as 1) | no | trivial | English-only text, and nothing else |
| CJK-char + Latin-word count | yes (by character) | no | trivial | reading time, length gates, “min read” |
| `Intl.Segmenter` word granularity | yes (by word) | yes | moderate | real word counts, tokenizing, search |
For reading time and minimum-length gates — the two places these bugs actually bite — the character count is not a compromise, it is the right tool. Reach for Intl.Segmenter only when you need words as words.
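As a rough sketch of what a character-aware length gate could look like: the function name `hasEnoughContent` and both thresholds are assumptions chosen for illustration, not values from the omi or emdash fixes. The answer passes if it has enough space-separated words or enough CJK characters.

```javascript
// Hypothetical length gate: names and thresholds are illustrative only.
const CJK_LETTER =
  /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}]/gu;
const WORDISH = /[\p{L}\p{N}]/u;

function hasEnoughContent(text, { minWords = 2, minCjkChars = 4 } = {}) {
  const cjkChars = (text.match(CJK_LETTER) ?? []).length;
  const spacedWords = text
    .replace(CJK_LETTER, " ")
    .split(/\s+/)
    .filter((token) => WORDISH.test(token)) // ignore punctuation-only tokens
    .length;
  // Either writing system can satisfy the gate on its own terms.
  return spacedWords >= minWords || cjkChars >= minCjkChars;
}

console.log(hasEnoughContent("I live in Tokyo")); // true
console.log(hasEnoughContent("東京に住んでいます")); // true
console.log(hasEnoughContent("Hi")); // false
```

The right CJK threshold depends on the question being asked; the default here is a placeholder to tune against real answers.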
Where it stops
“Word” is a fuzzy idea in Japanese. Even Intl.Segmenter disagrees with human intuition and with other tokenizers about where 住んでいます breaks, because there is no single correct answer — segmentation is a model, not a fact. So for a reading-time estimate, counting characters is not just simpler than tokenizing; it sidesteps a question that has no clean answer. Know which one your feature actually needs. A length gate needs “is there enough here,” which characters answer directly; only a true word count needs the segmenter, and even then the number is an estimate wearing a confident face. The related cases — an Enter that fires mid-conversion and lines that break in the wrong place — are the same shape: code that was correct in English and never got read with a Japanese keyboard in mind.