@scriptin/is-han

v1.0.1

is-han

Unicode-aware Han characters (hanzi, kanji, hanja) detection

npm i @scriptin/is-han

Usage

Note: You need to use Unicode-aware methods/operators in JavaScript, such as Array.from(str) and for/of loops, in order to process all Han characters. Some of them have code points that don't fit into 16 bits, and JavaScript strings are encoded as UTF-16, so those characters take two code units (a surrogate pair).

Examples of correct usage:

import { isHan } from "@scriptin/is-han";

for (const char of "漢字") {
  console.log(isHan(char));
}

// or

Array.from("漢字").filter(isHan)

Incorrect usage:

'𠀋'.split('').filter(isHan); // -> empty array
// because the code point of '𠀋' is U+2000B, which does not fit into 16 bits,
// so it is stored as a surrogate pair and split('') separates the two halves
console.log('𠀋'.split('')); // -> ['\uD840', '\uDC0B']

// Compare to:
console.log(Array.from('𠀋')); // -> ['𠀋']
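
The split shown above follows from how UTF-16 encodes code points beyond U+FFFF. As a rough sketch of that arithmetic (toSurrogatePair is a hypothetical helper written for this illustration, not something the library exports):

```javascript
// Hypothetical helper: computes the UTF-16 surrogate pair for a code point
// above U+FFFF. Illustration only; not part of @scriptin/is-han.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("code point is outside the supplementary planes");
  }
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10); // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const char = "\u{2000B}"; // '𠀋'
const [high, low] = toSurrogatePair(char.codePointAt(0));

console.log(high.toString(16), low.toString(16)); // d840 dc0b
console.log(char.length); // 2 (UTF-16 code units)
console.log(Array.from(char).length); // 1 (code point)
```

The two values match the halves produced by split(''), which is why neither half is recognized as a Han character on its own.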

API

  • isHan(char: string): boolean - Checks if a character is a Han script character: hanzi, kanji, hanja

  • isHanExt(char: string): boolean - Checks if a character is an "extended" Han script character. Useful when you're looking for obscure characters which contain Han script, e.g. symbols like 🈲, 🈯, 🈳, 🉐, 🉑, ㊄, ㋋, ㏾, ㍰, etc. "Extended" means all Unicode characters which:

    • contain Han characters with additional wrappers, such as characters inside brackets, circles, etc.
    • contain multiple "compacted" Han characters, such as Japanese "square era names", etc.
    • contain parts of Han characters, such as CJK strokes
    • 々 IDEOGRAPHIC ITERATION MARK (see below)
    • 〆 IDEOGRAPHIC CLOSING MARK (see below)
  • isIterationMark(char: string): boolean - Checks if a character is 々 IDEOGRAPHIC ITERATION MARK. This mark means "repeat the previous character". It can be useful if you want to replace the mark with the character it repeats/represents. See the Wiktionary article about 々

  • isClosingMark(char: string): boolean - Checks if a character is 〆 IDEOGRAPHIC CLOSING MARK. This mark is used in place of another Han character. See the Wiktionary article about 〆

  • Some constants are also exported in case you need to extend the functionality.
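
As an illustration of the iteration-mark use case, here is a minimal sketch that replaces each 々 with the character before it. To keep the snippet runnable on its own, it defines a local stand-in for isIterationMark (an assumption made for this example). With the package installed, you would import the real function instead. expandIterationMarks is a hypothetical helper, not part of the library.

```javascript
// Stand-in for the library's isIterationMark, so the sketch runs on its own.
// It only checks for U+3005 (々).
const isIterationMark = (char) => char === "\u3005";

// Hypothetical helper: replaces each 々 with the character that precedes it.
function expandIterationMarks(text) {
  const result = [];
  // for/of iterates by code point, so surrogate pairs stay intact.
  for (const char of text) {
    const previous = result[result.length - 1];
    if (isIterationMark(char) && previous !== undefined) {
      result.push(previous);
    } else {
      result.push(char); // a leading 々 has nothing to repeat, so keep it
    }
  }
  return result.join("");
}

console.log(expandIterationMarks("人々")); // 人人
console.log(expandIterationMarks("時々")); // 時時
```

The sketch does not check that the preceding character is a Han character. Combining it with isHan would be a reasonable refinement.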

FAQ

❓ Why do I have to use Array.from(str) and for/of?

Because JavaScript (and TypeScript) strings use UTF-16, and some of the more recent additions to Unicode don't fit into 16 bits. In such cases, a character is represented with a surrogate pair, that is, two 16-bit code units. Array.from() and for/of were added in ES2015 and iterate over strings by code point, so they are Unicode-aware.

This library cannot change this JavaScript feature, so you have to use these two methods and avoid operations that work on individual UTF-16 code units, such as String.prototype.split(''), String.prototype.charCodeAt(), and String.prototype.codePointAt() called with arbitrary code-unit indexes.
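
To make the difference concrete, here is a small sketch that uses only built-in string methods (no library calls). The sample string is an assumption chosen for this example. It contains one character outside the 16-bit range and one inside it.

```javascript
// '𠀋字': one supplementary-plane character (U+2000B), one BMP character (U+5B57)
const text = "\u{2000B}\u5B57";

// Code-unit-based views: the first character shows up as two halves.
console.log(text.length); // 3
console.log(text.charCodeAt(0).toString(16)); // d840 (high surrogate)
console.log(text.codePointAt(1).toString(16)); // dc0b (index 1 lands inside the pair)

// Code-point-based views: each character stays whole.
const chars = Array.from(text);
console.log(chars.length); // 2
console.log(chars.map((ch) => ch.codePointAt(0).toString(16)).join(" ")); // 2000b 5b57
```

Note that codePointAt(0) on a whole character is fine. The problem only appears when the index comes from code-unit positions.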

❓ Can I detect language (Chinese/Japanese/Korean) for a given Han character?

No. Because of Han unification, most CJK characters are represented with shared code points. Each code point can be associated with multiple versions/variants of the same character, including regional, stylistic, and other variations. To determine the language, you need some context. For example, the language can be set as an attribute of a web page or a PDF document, or as a setting in an operating system.

This library doesn't provide methods to distinguish between languages.

❓ Can I distinguish between Traditional and Simplified Chinese characters?

In some cases, yes. In others, traditional and simplified variants share the same code points. See this article. For a sufficiently long text, you can usually tell whether it is traditional or simplified by looking for code points that occur in only one of the two scripts.

This library doesn't provide methods to distinguish between traditional and simplified scripts.
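
If you need such a check yourself, one possible approach is to count characters that are known to occur in only one of the two scripts. The sketch below is an assumption-laden illustration, not a feature of this library. The two character sets are tiny hand-picked samples, and guessChineseScript is a hypothetical helper. A usable classifier would need a much larger table, for example one derived from the variant data in the Unicode Han database.

```javascript
// Tiny illustrative samples chosen for this sketch only:
// simplified forms and their traditional counterparts.
const SIMPLIFIED_ONLY = new Set(Array.from("们这国说书"));
const TRADITIONAL_ONLY = new Set(Array.from("們這國說書"));

// Hypothetical helper: guesses the script by counting marker characters.
function guessChineseScript(text) {
  let simplified = 0;
  let traditional = 0;
  for (const char of text) { // iterate by code point
    if (SIMPLIFIED_ONLY.has(char)) simplified += 1;
    else if (TRADITIONAL_ONLY.has(char)) traditional += 1;
  }
  if (simplified === traditional) return "unknown";
  return simplified > traditional ? "simplified" : "traditional";
}

console.log(guessChineseScript("我们说这本书")); // simplified
console.log(guessChineseScript("我們說這本書")); // traditional
console.log(guessChineseScript("人口")); // unknown
```

This only makes sense for text already known to be Chinese, since some simplified forms (国, for instance) are also standard in Japanese.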