smnormalize

v2.0.1

String normalization utilities for Unicode strings

SMNormalize

String normalization utilities for Unicode strings and IDs.

In a world where everyone types in Unicode (including emojis!), there are many things to consider when you accept input from users and plan to use those strings as identifiers, for example as tags, ids, labels, or titles. When developers face these situations, there are a few common issues:

  • Form: while the characters è and è might look identical, they might in fact be in two separate byte sequences, and need to be normalized or string comparisons will fail (learn more)
  • Diacritics (accents): sometimes you'll want to remove accents and other diacritics from characters, for example turning über into uber, and papà into papa
  • Remove non-letter characters: SMNormalize allows you to remove all characters that are not letters or numbers, in any alphabet used around the world, or only in the Latin one
  • Keep emojis: you can optionally keep emojis, because who doesn't love emojis as identifiers? 🙃
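
The "Form" issue above can be reproduced with nothing but the JavaScript standard library. The snippet below is a minimal illustration and does not use SMNormalize; the variable names are only for this example.

```javascript
// Two ways to encode "è": a single precomposed code point,
// or the letter "e" followed by a combining grave accent.
const precomposed = '\u00E8'
const decomposed = 'e\u0300'

// They render identically but are different sequences, so === fails.
const equalBefore = precomposed === decomposed

// Normalizing both sides to the same form (NFC here) makes them comparable.
const equalAfter = precomposed.normalize('NFC') === decomposed.normalize('NFC')

console.log(equalBefore, equalAfter) // false true
```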

Data used by this module is based on Unicode 12.1.0, released in May 2019.

This module is written in TypeScript and transpiled to JavaScript. All typings are available alongside the code.

This code is licensed under the terms of the MIT license (see LICENSE.md).

Full documentation

Full documentation is available on GitHub pages.

Add to your project

Install from NPM:

npm install smnormalize

API Guide

The module exports symbols as named exports.

Normalize(str, options)

const {Normalize} = require('smnormalize')

Normalize(str, options)

The method accepts an input string str and normalizes it with three steps:

  1. Decomposing the Unicode string using the compatibility form (NFKD)
  2. Removing all diacritics/accents
  3. Re-composing the string in NFC (canonical composition) form
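
As a rough sketch of these three steps, the same pipeline can be approximated with the built-in `String.prototype.normalize` and a Unicode property escape. The helper name `stripDiacritics` is hypothetical, and this is not SMNormalize's actual implementation: it removes every combining mark (`\p{M}`), which may be broader than what the module removes.

```javascript
// Hypothetical helper approximating the three steps; not part of smnormalize.
function stripDiacritics(str) {
  return str
    .normalize('NFKD')        // 1. compatibility decomposition
    .replace(/\p{M}/gu, '')   // 2. drop combining marks (accents, diacritics)
    .normalize('NFC')         // 3. canonical re-composition
}

console.log(stripDiacritics('über')) // uber
console.log(stripDiacritics('papà')) // papa
```

Because step 1 uses the compatibility form, characters such as the ligature "ﬁ" (U+FB01) are also expanded to plain "fi" in this sketch.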

In addition to that, you can perform other operations depending on the mode of operation.

The options argument is an object with the following properties:

  • options.mode is the mode of operation, and could be one of the following:
    • 'basic' (this is the default value): in this mode, all diacritics/accents are removed from the string, and the string is normalized in the NFKC form. Whitespace characters, including newlines, tabs, etc., are removed; spaces are converted to the character defined in options.convertSpaces. All control characters (unprintable characters) are removed too.
    • 'alphabetic': in addition to what basic mode does, all characters that are not letters (in any script/alphabet) are removed, including symbols, spaces, etc.
    • 'latin': similar to the alphabetic mode, but only allows letters that are part of the Latin alphabet.
  • options.removeNumbers (boolean, default: false) when false, numbers are always allowed. In alphabetic mode, every kind of number is preserved, while in latin mode only latin numbers are allowed (0-9). This option has no effect in basic mode.
  • options.allowEmoji (boolean, default: false) if true, does not remove emojis from identifiers. Note that the characters 0-9 (latin numbers) are considered valid emojis, and so are preserved regardless of the value of options.removeNumbers. This option has no effect in basic mode.
  • options.convertSpaces (string, default: -) character to replace space characters (codepoints U+0020 and U+00A0) with. To preserve spaces as is, set this to ' ' (a single space character); note that non-breaking spaces (U+00A0) will be converted to normal spaces regardless. You can set it to null or to an empty string to remove spaces entirely. Note that other whitespace characters, such as newlines, tabs, etc, are removed as part of the basic normalization.
  • options.preserveCharacters (string, default: -_.) optional list of individual characters that should not be removed, regardless of the mode of operation. By default, this includes the dash (-), the underscore (_), and the dot (.). You can disable this by setting it to an empty string.
  • options.lowercase (boolean, default: false) optionally lowercases the string before returning it.
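
To make the convertSpaces rules concrete, here is a small standalone sketch of the behavior described above. The function `replaceSpaces` is a hypothetical stand-in written for this example, not the module's code; it only assumes what the option description states.

```javascript
// Hypothetical stand-in for the convertSpaces behavior; not part of smnormalize.
// `replacement` defaults to '-'; null or '' removes spaces entirely.
function replaceSpaces(str, replacement = '-') {
  const target = replacement == null ? '' : replacement
  // Both the regular space (U+0020) and the non-breaking space (U+00A0) are replaced.
  return str.replace(/[\u0020\u00A0]/g, target)
}

console.log(replaceSpaces('a b\u00A0c'))      // a-b-c
console.log(replaceSpaces('a b\u00A0c', ' ')) // a b c
console.log(replaceSpaces('a b', null))       // ab
```

The second call shows why a non-breaking space ends up as a normal space even when spaces are preserved. In the Unicode data, U+00A0 has a compatibility decomposition to U+0020, so a compatibility normalization step would produce the same result on its own.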

To show the difference between multiple modes of operation and options, consider this string as example: Hello Шѻrld_!1߁🤗

| | "basic" mode | "alphabetic" mode | "latin" mode |
|-------------------------------------------|------------------|-------------------|--------------|
| removeNumbers = false, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁ | Hello-rld_1 |
| removeNumbers = true, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_ | Hello-rld_ |
| removeNumbers = false, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁🤗 | Hello-rld_1🤗 |
| removeNumbers = true, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1🤗 | Hello-rld_1🤗 |

Note that in basic mode the removeNumbers and keepEmojis options have no effect, because no characters (aside from whitespace and control characters) are removed. In alphabetic and latin mode, Latin numbers (0-9) are always kept when emojis are allowed, but numbers in other scripts are not. Also note that the exclamation mark was removed, while the underscore was kept because it is in the default preserveCharacters list.
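
The filtering in the table can be approximated with Unicode property escapes. The sketch below is an assumption-laden illustration, not SMNormalize's implementation: the function `filterByMode` and its `emojis` and `preserve` parameters are hypothetical names, and it starts from the basic-mode output (spaces already converted). It uses `\p{Emoji}`, a Unicode property that includes the digits 0-9, which is one way the "Latin numbers survive when emojis are allowed" behavior can arise.

```javascript
// Hypothetical sketch of the mode filtering; not part of smnormalize.
function filterByMode(str, { mode = 'basic', removeNumbers = false, emojis = false, preserve = '-_.' } = {}) {
  if (mode === 'basic') return str // basic mode removes nothing further

  const isLetter = (ch) => mode === 'latin'
    ? /\p{L}/u.test(ch) && /\p{Script=Latin}/u.test(ch) // Latin letters only
    : /\p{L}/u.test(ch)                                 // letters in any script
  const isNumber = (ch) => mode === 'latin'
    ? /[0-9]/.test(ch)   // only 0-9
    : /\p{N}/u.test(ch)  // numbers in any script
  const isEmoji = (ch) => /\p{Emoji}/u.test(ch) // note: also matches 0-9

  // Array.from iterates by code point, so astral emojis stay intact.
  return Array.from(str).filter((ch) =>
    preserve.includes(ch) ||
    isLetter(ch) ||
    (!removeNumbers && isNumber(ch)) ||
    (emojis && isEmoji(ch))
  ).join('')
}

// Basic-mode output of the example string; \u07C1 is the NKo digit one, \u{1F917} is the hugging face emoji.
const sample = 'Hello-Шѻrld_!1\u07C1\u{1F917}'

console.log(filterByMode(sample, { mode: 'alphabetic' }))
console.log(filterByMode(sample, { mode: 'latin' }))
console.log(filterByMode(sample, { mode: 'alphabetic', removeNumbers: true, emojis: true }))
```

On the example string this sketch reproduces the table rows, including the row where removeNumbers is true but the digit 1 is kept because emojis are allowed.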