chars

v2.3.0

Split strings into an array of characters in a “semantically correct” way.

This module is inspired by esrever.

Why not just use string.split('')!?!?!?!?

Well, esrever's README explains why in detail, but the short answer is that a “character” in JavaScript is not the same thing as a character in Unicode.

In Unicode, a pair of surrogates together expresses a single character. In JavaScript, thanks to its UCS-2-like behavior, the pair is treated as two characters.

> '𝟙𝟚𝟛'.split('')
[ '�', '�', '�', '�', '�', '�' ]
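
To see where those six pieces come from, here is a small sketch that uses only built-in string methods (not this module). The variable names are just for illustration. It compares the UTF-16 code units that split('') cuts between with the code points that the string iterator yields.

```javascript
// Three double-struck digits, written as escapes so the code points are explicit.
const digits = '\u{1D7D9}\u{1D7DA}\u{1D7DB}';

// split('') cuts between UTF-16 code units, so each digit becomes two lone surrogates.
const codeUnits = digits.split('').map((unit) => unit.charCodeAt(0).toString(16));

// The string iterator (used by Array.from) walks code points instead.
const codePoints = Array.from(digits, (ch) => ch.codePointAt(0).toString(16));

console.log(digits.length); // 6
console.log(codeUnits);     // [ 'd835', 'dfd9', 'd835', 'dfda', 'd835', 'dfdb' ]
console.log(codePoints);    // [ '1d7d9', '1d7da', '1d7db' ]
```

All three digits share the same high surrogate (d835), which is why reversing the code units can glue a high surrogate onto the wrong low surrogate and still produce a valid, but different, character.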

Even worse, if you reverse a string with the famous .split('').reverse().join('') trick, you can end up with the wrong characters, because the reversal joins halves of different surrogate pairs.

> '𝟙𝟚𝟛'.split('').reverse().join('')
'�𝟚𝟙�' // Huh?

OK, now “chars” is here for you. Instead of string.split(''), you can simply use chars(string).

> chars('𝟙𝟚𝟛').reverse().join('')
'𝟛𝟚𝟙' // Gotcha!

“chars” recognizes surrogate pairs and other Unicode mechanisms when splitting strings into an array of characters. Be like a boss, y'all!

But sorry!

If you only need surrogate pairs to be recognized when splitting strings into an array of characters, you already have that as an ES6 feature.

> Array.from('𝟙𝟚𝟛').reverse().join('')
'𝟛𝟚𝟙'
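
The flip side is that Array.from stops at code points. A quick sketch with built-ins only (the variable names are illustrative) shows what happens to a combining mark, which is one of the cases this module is meant to cover:

```javascript
// "café" spelled with a plain "e" followed by U+0301 COMBINING ACUTE ACCENT.
const word = 'cafe\u0301';

// Array.from splits by code point, so the accent becomes its own element.
const byCodePoint = Array.from(word);
console.log(byCodePoint.length); // 5

// Reversing those elements detaches the accent from its "e".
const reversed = byCodePoint.reverse().join('');
console.log(reversed === '\u0301efac'); // true
```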

Usage

$ npm i chars
const chars = require('chars');

chars('cafe\u0301'); // -> ['c', 'a', 'f', 'e\u0301']

// All features are on by default.
// You can switch off specific features with options.
chars('cafe\u0301', {combiningMark: false}); // -> ['c', 'a', 'f', 'e', '\u0301']

// You can also get detailed information about each character.
chars('cafe\u0301', {detailed: true});
/* -> [ { type: [], char: 'c', broken: false },
        { type: [], char: 'a', broken: false },
        { type: [], char: 'f', broken: false },
        { type: [ 'combiningMark' ], char: 'é', broken: false } ] */

Mechanisms

Currently this module recognizes the following mechanisms of Unicode.

  • Surrogate Pairs
  • Combining Marks
  • IDS (Ideographic Description Sequences)
  • Kharoshthi Virama
  • Regional Indicator Symbols

Upcoming:

  • Zero-Width Joiner
  • Prepended Concatenation Marks
  • Emoji Sequence
  • Emoji Tag Sequence
  • Adeg Adeg
  • Generic virama such as Devanagari (Hard way...!)
  • Buginese Ligature (Includes “iya” only)

Surrogate Pairs

Read the section above.

Parsing surrogate pairs is the basic functionality of this module, and you cannot turn this feature off with an option. If you don't need even this feature, use string.split('').
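
For the curious, surrogate pair parsing can be sketched by hand as below. This is only an illustration of the idea, not this module's actual code. The function and helper names are made up for the example, and keeping an unpaired surrogate as its own element is an assumption of the sketch.

```javascript
// UTF-16 surrogate ranges: high (leading) and low (trailing) code units.
const isHigh = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLow = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

// Hypothetical helper: split a string, keeping valid surrogate pairs together.
function splitBySurrogatePairs(str) {
  const result = [];
  for (let i = 0; i < str.length; i += 1) {
    const unit = str.charCodeAt(i);
    if (isHigh(unit) && i + 1 < str.length && isLow(str.charCodeAt(i + 1))) {
      // A high surrogate followed by a low surrogate forms one character.
      result.push(str.slice(i, i + 2));
      i += 1;
    } else {
      // BMP characters and unpaired surrogates are emitted as they are.
      result.push(str[i]);
    }
  }
  return result;
}

console.log(splitBySurrogatePairs('a\u{1D7D9}b').length); // 3
```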

Combining Marks

Some Unicode characters work as “modifiers” and append additional parts to the preceding character. Such a sequence should be semantically interpreted as one combined character.

These characters contain:

Example:

> chars('dépôt')
[ 'd', 'é', 'p', 'ô', 't' ]

You can turn this feature off by {combiningMark: false}.
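
A rough way to get similar grouping without this module is a Unicode property escape. The sketch below assumes that “combining mark” means the general category Mark (\p{M}), which may not match the exact set of characters this module uses. The function name is made up for the example.

```javascript
// Hypothetical helper: attach every run of Mark characters to the
// preceding non-mark code point. Marks at the very start of the string
// have nothing to attach to and are returned as their own element.
function splitWithCombiningMarks(str) {
  return str.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

console.log(splitWithCombiningMarks('cafe\u0301').length);       // 4
console.log(splitWithCombiningMarks('de\u0301po\u0302t').length); // 5
```

Because of the u flag, \P{M} consumes a whole code point, so surrogate pairs stay intact here as well.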

IDS (Ideographic Description Sequences)

Ideographs used in East Asia are very complex, and most of them can be described as a composition of other ideographs.

Unicode supports describing these compositions with special meta-characters called “IDC (Ideographic Description Characters).” Following a simple algorithm, they form character sequences that express a single character whose parts are represented by other characters.

Semantically, such a sequence is one “character” made of several parts. Unicode Standard 8.0.0 describes how these sequences should be interpreted programmatically.

Ideographic Description characters are not combining characters, and there is no requirement that they affect character or word boundaries. Thus U+2FF1 U+4E95 U+86D9 may be treated as a sequence of three characters or even three words.

Implementations of the Unicode Standard may choose to parse Ideographic Description Sequences when calculating word and character boundaries. Note that such a decision will make the algorithms involved significantly more complicated and slower.

This module chooses to parse an IDS as a single character.

Example:

> chars('⿱女⿰女女しい')
[ '⿱女⿰女女', 'し', 'い' ]

You can turn this feature off by {ids: false}.
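
The “simple algorithm” behind IDS can be sketched as prefix-notation parsing: each IDC announces how many operands follow it. The code below is an illustration under assumptions, not this module's implementation. It handles only the IDCs U+2FF0 to U+2FFB, treats U+2FF2 and U+2FF3 as three-operand and the rest as two-operand, and lets a truncated sequence simply end at the end of the string. All names are made up for the example.

```javascript
// Number of operands an Ideographic Description Character takes (0 if not an IDC).
// Assumption: only the U+2FF0..U+2FFB range is handled.
function idcArity(codePoint) {
  if (codePoint < 0x2ff0 || codePoint > 0x2ffb) return 0;
  return codePoint === 0x2ff2 || codePoint === 0x2ff3 ? 3 : 2;
}

// Returns the index just past the sequence that starts at `start`.
function endOfSequence(points, start) {
  let pending = 1; // operands still to be read
  let i = start;
  while (pending > 0 && i < points.length) {
    // Each code point fills one slot; an IDC opens `arity` new slots.
    pending += idcArity(points[i].codePointAt(0)) - 1;
    i += 1;
  }
  return i;
}

// Hypothetical helper: split a string, keeping each IDS together.
function splitIds(str) {
  const points = Array.from(str);
  const result = [];
  for (let i = 0; i < points.length; ) {
    const end = endOfSequence(points, i);
    result.push(points.slice(i, end).join(''));
    i = end;
  }
  return result;
}

// Same input as the example above: ⿱女⿰女女しい
console.log(splitIds('\u2FF1\u5973\u2FF0\u5973\u5973\u3057\u3044').length); // 3
```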

Kharoshthi Virama

Kharoshthi (aka Kharosthi) is an ancient script used in ancient India (Wikipedia). In this script, we have to handle a strange modifier called “Kharoshthi Virama.” It behaves like ZWJ when both sides of the Virama are Kharoshthi consonants; otherwise it turns the preceding character into a modifier, which is written at the bottom-left of the character before it. This means the Virama may affect not just the preceding character but the one before that!

> chars('𐨫𐨿𐨤𐨑𐨿𐨐𐨿𐨮𐨨𐨿𐨪𐨢𐨁𐨐𐨿')
[ '𐨫𐨿𐨤', '𐨑𐨿𐨐𐨿𐨮', '𐨨𐨿𐨪', '𐨢𐨁𐨐𐨿' ]

Note: If you have trouble reading this script, just install the Noto Sans Kharoshthi font.

You can turn this feature off by {kharoshthiVirama: false}.

Further readings

Regional Indicator Symbols

Regional Indicator Symbols are part of the emoji specification and encode a country as a combination of two letters that form a country code. Each successive pair of Regional Indicator Symbols should be treated as a country code and therefore as one character.

> chars('FREEDOM🇺🇸')
[ 'F', 'R', 'E', 'E', 'D', 'O', 'M', '🇺🇸' ]
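
The pairing rule can be sketched with built-ins as follows. This is an illustration only, not this module's code. The names are made up, and leaving an unpaired indicator as its own element is an assumption of the sketch.

```javascript
// Regional Indicator Symbols occupy U+1F1E6 (letter A) to U+1F1FF (letter Z).
const isRegionalIndicator = (ch) => {
  const codePoint = ch.codePointAt(0);
  return codePoint >= 0x1f1e6 && codePoint <= 0x1f1ff;
};

// Hypothetical helper: group successive pairs of regional indicators.
function splitFlags(str) {
  const points = Array.from(str);
  const result = [];
  for (let i = 0; i < points.length; i += 1) {
    const next = points[i + 1];
    if (isRegionalIndicator(points[i]) && next !== undefined && isRegionalIndicator(next)) {
      // Two indicators in a row form one flag.
      result.push(points[i] + next);
      i += 1;
    } else {
      result.push(points[i]);
    }
  }
  return result;
}

// "US" as regional indicators: U+1F1FA U+1F1F8
console.log(splitFlags('FREEDOM\u{1F1FA}\u{1F1F8}').length); // 8
```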