smnormalize
v2.0.1
SMNormalize
String normalization utilities for Unicode strings and IDs.
In a world where everyone types in Unicode (including emojis!), there are many things to consider when you accept input from users and plan to use those strings as identifiers: tags, IDs, labels, titles, and so on. Developers facing these situations run into a few common issues:
- Form: while the characters è and è might look identical, they may in fact be encoded as two different byte sequences, and they need to be normalized or string comparisons will fail
- Diacritics (accents): sometimes you'll want to remove accents and other diacritics from characters, for example turning über into uber, and papà into papa
- Remove non-letter characters: SMNormalize allows you to remove all characters that are not letters or numbers, in any alphabet used around the world, or just in the Latin one
- Keep emojis: you can optionally keep emojis, because who doesn't love emojis as identifiers? 🙃
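The "form" issue in the first bullet can be reproduced with nothing but the JavaScript standard library. The sketch below does not use SMNormalize at all; the variable names and the `sameText` helper are made up for illustration. It builds the same visible character from two different code point sequences and shows that a plain comparison fails until both sides are normalized.

```javascript
// Two ways to encode the same visible character "è"
const composed = '\u00e8'     // single code point: LATIN SMALL LETTER E WITH GRAVE
const decomposed = 'e\u0300'  // "e" followed by COMBINING GRAVE ACCENT

console.log(composed === decomposed)            // false
console.log(composed.length, decomposed.length) // 1 2

// Hypothetical helper: compare two strings after bringing both to NFC
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC')
}

console.log(sameText(composed, decomposed))     // true
```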
Data used by this module is based on Unicode 12.1.0, released in May 2019.
This module is written in TypeScript and transpiled to JavaScript. All typings are available alongside the code.
This code is licensed under the terms of the MIT license (see LICENSE.md).
Full documentation
Full documentation is available on GitHub pages.
Add to your project
Install from NPM:
npm install smnormalize
API Guide
The module exports symbols as named exports.
Normalize(str, options)
const {Normalize} = require('smnormalize')
Normalize(str, options)
The method accepts an input string `str` and normalizes it in three steps:
- Decomposing the Unicode string using the compatibility form (NFKD)
- Removing all diacritics/accents
- Re-composing the string in NFC (canonical composition) form
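The three steps above can be approximated with built-in JavaScript features. This is a minimal sketch, not the module's actual implementation: `stripDiacritics` is a hypothetical name, and it assumes that "diacritics" can be treated as every code point in the Unicode Mark category (`\p{M}`), which is broader than what a purpose-built data table might remove.

```javascript
// Hypothetical helper, not part of smnormalize
function stripDiacritics(str) {
  return str
    // 1. Decompose using the compatibility form (NFKD)
    .normalize('NFKD')
    // 2. Drop combining marks (assumption: all of \p{M} counts as a diacritic)
    .replace(/\p{M}/gu, '')
    // 3. Re-compose in NFC
    .normalize('NFC')
}

console.log(stripDiacritics('über')) // uber
console.log(stripDiacritics('papà')) // papa
```

Because the first step uses the compatibility decomposition, ligatures and similar characters are also expanded; for example the single character `ﬁ` (U+FB01) becomes the two letters `fi`.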
In addition to that, you can perform other operations depending on the mode of operation.
The `options` argument is an object with the following properties:
- `options.mode` is the mode of operation, and can be one of the following:
  - `'basic'` (this is the default value): in this mode, all diacritics/accents are removed from the string, and the string is normalized in the NFKC form. Whitespace characters such as newlines and tabs are removed; spaces are converted to the character defined in `options.convertSpaces`. All control characters (unprintable characters) are removed too.
  - `'alphabetic'`: in addition to what basic mode does, all characters that are not letters (in any script/alphabet) are removed, including symbols, spaces, etc.
  - `'latin'`: similar to the alphabetic mode, but only allows letters that are part of the Latin alphabet.
- `options.removeNumbers` (boolean, default: `false`): when `false`, numbers are always allowed: in alphabetic mode every kind of number is preserved, while in latin mode only Latin numbers (0-9) are allowed. When `true`, numbers are removed (see `options.allowEmoji` for an exception). This option has no effect in basic mode.
- `options.allowEmoji` (boolean, default: `false`): if `true`, emojis are not removed from identifiers. Note that the characters 0-9 (Latin numbers) are considered valid emojis, and so are preserved regardless of the value of `options.removeNumbers`. This option has no effect in basic mode.
- `options.convertSpaces` (string, default: `-`): character to replace space characters (codepoints U+0020 and U+00A0) with. To preserve spaces as is, set this to `' '` (a single space character); note that non-breaking spaces (U+00A0) will be converted to normal spaces regardless. You can set it to `null` or to an empty string to remove spaces entirely. Other whitespace characters, such as newlines and tabs, are removed as part of the basic normalization.
- `options.preserveCharacters` (string, default: `-_.`): optional list of individual characters that should not be removed, regardless of the mode of operation. By default, this includes the dash `-`, the underscore `_` and the dot `.`. You can disable this by setting it to an empty string.
- `options.lowercase` (boolean, default: `false`): optionally lowercases the string before returning it.
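To make the `options.convertSpaces` rules concrete, here is a rough sketch of the described behavior using only built-in string methods. The `convertSpaces` function below is a hypothetical stand-alone helper named after the option; it is not the module's code, and the order of the two steps is an assumption.

```javascript
// Hypothetical helper, not part of smnormalize
function convertSpaces(str, replacement = '-') {
  // null or an empty string means "remove spaces entirely"
  const rep = replacement == null ? '' : replacement
  return str
    // Drop whitespace other than U+0020 and U+00A0 (tabs, newlines, ...)
    .replace(/[^\S\u0020\u00A0]/g, '')
    // Replace regular and non-breaking spaces with the chosen character
    .replace(/[\u0020\u00A0]/g, () => rep)
}

console.log(convertSpaces('a b\u00A0c\td')) // a-b-cd
console.log(convertSpaces('a b', null))     // ab
```

Passing `' '` keeps regular spaces and turns non-breaking spaces into regular ones, matching the note in the option's description.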
To show the difference between the modes of operation and options, consider this string as an example: Hello Шѻrld_!1߁🤗
| | "basic" mode | "alphabetic" mode | "latin" mode |
|-------------------------------------------|------------------|-------------------|--------------|
| removeNumbers = false, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁ | Hello-rld_1 |
| removeNumbers = true, keepEmojis = false | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_ | Hello-rld_ |
| removeNumbers = false, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1߁🤗 | Hello-rld_1🤗 |
| removeNumbers = true, keepEmojis = true | Hello-Шѻrld_!1߁🤗 | Hello-Шѻrld_1🤗 | Hello-rld_1🤗 |
Note that in basic mode the `removeNumbers` and `keepEmojis` options have no effect, because no characters (aside from whitespace and control characters) are removed (`keepEmojis` here appears to refer to the `options.allowEmoji` option). In alphabetic and latin mode, Latin numbers (but not numbers in other scripts) are always kept when emojis are allowed. Also note that the exclamation mark was removed, but the underscore was kept because it's in the default `preserveCharacters` list.
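The letter and number filtering for the rows with `keepEmojis = false` can be imitated with Unicode property escapes in JavaScript regular expressions. This is only a sketch under assumptions: `filterCharacters` is a hypothetical helper, it expects spaces to have been converted already, it ignores emoji handling entirely, and the module itself relies on its own Unicode 12.1.0 data rather than on the regular expression engine.

```javascript
// Hypothetical helper, not part of smnormalize
function filterCharacters(str, { mode = 'alphabetic', removeNumbers = false, preserveCharacters = '-_.' } = {}) {
  // 'latin' keeps only letters in the Latin script; 'alphabetic' keeps any letter
  const isLetter = (ch) => mode === 'latin'
    ? /\p{L}/u.test(ch) && /\p{Script=Latin}/u.test(ch)
    : /\p{L}/u.test(ch)
  // 'latin' keeps only 0-9; 'alphabetic' keeps numbers from any script
  const isNumber = (ch) => mode === 'latin'
    ? /[0-9]/.test(ch)
    : /\p{N}/u.test(ch)

  let out = ''
  // for...of iterates by code point, so characters outside the BMP stay intact
  for (const ch of str) {
    const keep = preserveCharacters.includes(ch) ||
      isLetter(ch) ||
      (!removeNumbers && isNumber(ch))
    if (keep) out += ch
  }
  return out
}

const sample = 'Hello-Шѻrld_!1߁🤗'
console.log(filterCharacters(sample))                                        // Hello-Шѻrld_1߁
console.log(filterCharacters(sample, { mode: 'latin' }))                     // Hello-rld_1
console.log(filterCharacters(sample, { mode: 'latin', removeNumbers: true })) // Hello-rld_
```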