@scriptin/is-han
v1.0.1
Unicode-aware Han characters (hanzi, kanji, hanja) detection
is-han
npm i @scriptin/is-han
Usage
Note: You need to use Unicode-aware methods and operators in JavaScript, such as Array.from(str) and for/of loops, in order to process all Han characters. Some of them have code points that don't fit into 16 bits, and JavaScript strings are UTF-16, so such characters occupy two code units (a surrogate pair).
Examples of correct usage:
import { isHan } from "@scriptin/is-han";

for (const char of "漢字") {
  console.log(isHan(char));
}

// or
Array.from("漢字").filter(isHan);
Incorrect usage:
'𠀋'.split('').filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so split('') breaks it into the two halves of a surrogate pair
console.log('𠀋'.split('')); // -> ['\uD840', '\uDC0B']
// Compare to:
console.log(Array.from('𠀋')); // -> ['𠀋']
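The difference between code units and code points can be checked without the library at all. The following is a minimal sketch using only built-in string methods; the variable names are illustrative, and the second character is written as the escape \u{2000B} so that it is unambiguous.

```javascript
// Compare UTF-16 code units with Unicode code points for the same string.
// "\u{2000B}" is the character '𠀋' from the example above.
const text = "漢\u{2000B}";

// split("") works on 16-bit code units, so the second character is cut in two.
const codeUnits = text.split("");

// Array.from() uses the string iterator, which yields whole code points.
const codePoints = Array.from(text);

// Format each code point as U+XXXX for inspection.
const labels = codePoints.map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codeUnits.length);  // 3
console.log(codePoints.length); // 2
console.log(labels);            // ["U+6F22", "U+2000B"]
```

Any character whose label has more than four hex digits lies outside the 16-bit range and needs the Unicode-aware iteration shown in the correct usage examples.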
API
isHan(char: string): boolean
- Checks if a character is a Han script character: hanzi, kanji, hanja
isHanExt(char: string): boolean
- Checks if a character is an "extended" Han script character. Useful when you're looking for obscure characters which contain Han script, e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc. "Extended" means all Unicode characters which:
  - contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
  - contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
  - contain parts of Han characters, such as CJK strokes
  - are 々 IDEOGRAPHIC ITERATION MARK (see below)
  - are 〆 IDEOGRAPHIC CLOSING MARK (see below)
isIterationMark(char: string): boolean
- Checks if a character is 々 IDEOGRAPHIC ITERATION MARK. This mark means "repeat previous character". Can be useful if you want to replace this mark with the character it repeats/represents. See Wiktionary article about 々
isClosingMark(char: string): boolean
- Checks if a character is 〆 IDEOGRAPHIC CLOSING MARK. This mark is used in place of another Han character. See Wiktionary article about 〆

Some constants are also exported in case you need to extend the functionality.
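As an illustration of replacing the iteration mark with the character it repeats, here is a minimal sketch. The helper expandIterationMarks is hypothetical and not part of this library. To keep the snippet self-contained it uses a stand-in predicate that compares against U+3005; in real code you would presumably pass the library's isIterationMark instead.

```javascript
// Stand-in for the library's isIterationMark (an assumption made so that this
// snippet runs on its own): true only for U+3005, the mark 々.
const isIterationMarkStandIn = (ch) => ch === "\u3005";

// Hypothetical helper: replaces each iteration mark with the character
// immediately before it. A mark at the very start of the text is kept as-is.
function expandIterationMarks(text, isMark = isIterationMarkStandIn) {
  const out = [];
  // for/of iterates by code point, so characters outside the 16-bit range survive.
  for (const ch of text) {
    if (isMark(ch) && out.length > 0) {
      out.push(out[out.length - 1]);
    } else {
      out.push(ch);
    }
  }
  return out.join("");
}

console.log(expandIterationMarks("人々")); // "人人"
console.log(expandIterationMarks("時々")); // "時時"
```

This sketch only handles the simple case where the mark repeats a single preceding character; it makes no attempt to resolve marks that stand for longer sequences.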
FAQ
❓ Why do I have to use Array.from(str) and for/of?
Because JavaScript (and TypeScript) strings are UTF-16, and some of the more recent additions to Unicode don't fit into 16 bits. Such characters are represented with surrogate pairs, i.e. two 16-bit code units per character.
Array.from() and for/of were added in ECMAScript 2015 (ES6) and iterate over whole code points, so they are Unicode-aware.
This library cannot change how JavaScript strings work, so you have to use these two methods and avoid code-unit-based ones such as String.prototype.split('') and String.prototype.charCodeAt(), as well as stepping through a string index by index with String.prototype.codePointAt(), etc.
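One way to make the safe pattern hard to get wrong is to wrap it in a small helper that always iterates by code point. The sketch below is an assumption-laden illustration: selectChars is a hypothetical helper, and isHanStandIn is a stand-in predicate built on the Unicode property escape \p{Script=Han}. It is not this library's implementation and may classify some characters differently from isHan.

```javascript
// Stand-in predicate (NOT the library's isHan): matches a single character
// whose Unicode Script property is Han.
const isHanStandIn = (ch) => /^\p{Script=Han}$/u.test(ch);

// Hypothetical helper: returns the characters of `str` accepted by `predicate`,
// iterating by code point so surrogate pairs are never split.
function selectChars(str, predicate) {
  return Array.from(str).filter((ch) => predicate(ch));
}

const mixed = "abc漢字\u{2000B}!";
console.log(selectChars(mixed, isHanStandIn).length); // 3

// The code-unit-based version loses the character outside the 16-bit range:
console.log(mixed.split("").filter(isHanStandIn).length); // 2
```

With the library installed, the same helper should work with its exported predicates passed as the second argument.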
❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?
No. Because of Han unification, most CJK characters are represented with shared code points. Each code point can be associated with multiple variants of the same character, including regional, stylistic, and other variations. In order to determine a language, you need to know some context. For example, the language can be set as an attribute of a web page or a PDF document, or as a setting in an operating system.
This library doesn't provide methods to distinguish between languages.
❓ Can I distinguish between Traditional and Simplified Chinese characters?
In some cases, yes. In others, traditional and simplified variants share the same code points. See this article. For a sufficiently large text, you can determine whether it's traditional or simplified by looking for code points that are specific to one of the two scripts.
This library doesn't provide methods to distinguish between traditional and simplified scripts.
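If you want to try the "look for specific code points" idea yourself, a rough sketch could look like the following. Everything here is hypothetical: the function name, the return values, and especially the two character sets, which are tiny hand-picked samples for illustration and nowhere near a complete or authoritative list.

```javascript
// Tiny illustrative samples (assumptions, not a real dataset): pairs of
// characters whose simplified and traditional forms have different code points.
const SIMPLIFIED_FORMS = new Set(["们", "这", "说", "书", "车"]);
const TRADITIONAL_FORMS = new Set(["們", "這", "說", "書", "車"]);

// Hypothetical helper: counts hits from each sample set and reports which
// script has more. Returns "unknown" on a tie, including when nothing matches.
function guessChineseVariant(text) {
  let simplified = 0;
  let traditional = 0;
  for (const ch of text) {
    if (SIMPLIFIED_FORMS.has(ch)) simplified += 1;
    else if (TRADITIONAL_FORMS.has(ch)) traditional += 1;
  }
  if (simplified === traditional) return "unknown";
  return simplified > traditional ? "simplified" : "traditional";
}

console.log(guessChineseVariant("这是我们的书")); // "simplified"
console.log(guessChineseVariant("這是我們的書")); // "traditional"
console.log(guessChineseVariant("山水"));         // "unknown"
```

A usable detector would need a much larger mapping and a longer input text; short strings made only of shared characters cannot be classified this way.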