yads

v1.0.6

Yet-another-diacritic-stripper that also properly removes combining characters. Performance should be close to optimal.

Downloads

21

Readme

Yet Another Diacritic Remover

But better, I think, read on...

The Problem

This is a non-broken diacritic remover. There are a number of similar modules available, but they are all based on the same code and are all deficient in some cases, which is why I wrote this one. This implementation is also much faster than most of the diacritic removers I've tested and ever-so-slightly faster than the fastest, which may matter to you if you're grinding through a lot of strings. It is based on the same code as all the rest; the original is here: http://web.archive.org/web/20121231230126/http://lehelk.com:80/2011/05/06/script-to-remove-diacritics/

As far as I can tell, the version referenced above and all of the derived examples I have found suffer from not being able to strip all diacritics. Specifically, when they encounter a word like "Rügen", they can fail to remove the diaeresis forming the umlaut on the letter "u". This is because of the nature of the algorithm, which prevents it from detecting and removing combining characters, in this case one from the Combining Diacritical Marks (0300–036F) Unicode block. The letter in question in our example is actually a completely normal letter "u", which therefore does not get stripped (it doesn't need to be), followed by a combining diaeresis which modifies the preceding letter. The diaeresis does not register as a letter and will be ignored by the standard algorithm.
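The failure can be reproduced with a small sketch. It assumes the common approach of a lookup table that maps precomposed accented letters to their base letters. The table `TABLE` and the function `tableStrip` are hypothetical, heavily truncated stand-ins for illustration, not the code of any particular module.

```javascript
// Hypothetical, heavily truncated lookup table: precomposed letter -> base letter.
const TABLE = { '\u00FC': 'u', '\u00E8': 'e', '\u00E9': 'e' };

// Table-only strip: replaces characters found in the table and
// passes everything else through unchanged.
function tableStrip(str) {
  let out = '';
  for (const ch of str) {
    out += TABLE[ch] || ch;
  }
  return out;
}

const precomposedWord = 'R\u00FCgen';  // "ü" as a single code point (U+00FC)
const decomposedWord = 'Ru\u0308gen';  // "u" followed by combining diaeresis (U+0308)

console.log(tableStrip(precomposedWord));                    // Rugen
console.log(tableStrip(decomposedWord) === decomposedWord);  // true: the combining mark survives
```

Both inputs look identical on screen, but only the precomposed one is in the table, so the decomposed one comes back unchanged.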

To illustrate with our example word above:

// Word: Rügen
// Raw UTF-8 bytes (hex), before processing by V8: 0x52 0x75 0xCC 0x88 0x67 0x65 0x6E

Letter: R, code point: 0x52,   char code: 0x52
Letter: u, code point: 0x75,   char code: 0x75
Letter: ̈, code point:  0x0308, char code: 0x0308
Letter: g, code point: 0x67,   char code: 0x67
Letter: e, code point: 0x65,   char code: 0x65
Letter: n, code point: 0x6e,   char code: 0x6e

Notice how the five letters in the word are unaccented. Also, notice how the diaeresis attaches itself to a neighboring character in the listing rather than standing on its own. That's how a combining character works. This is why the normal strip routine fails: all it sees are five normal letters and one "letter" that isn't recognized as such.
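A listing like the one above can be produced with a few lines of plain JavaScript. This is a minimal sketch; the helper name `dumpCodePoints` is made up for illustration, and the word is written with explicit escapes so that the decomposed form is unambiguous.

```javascript
// Return the code point of every character in a string as "U+XXXX".
// Array.from iterates by code point, so the combining mark gets its own entry.
function dumpCodePoints(str) {
  return Array.from(
    str,
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const word = 'Ru\u0308gen'; // "u" followed by U+0308 COMBINING DIAERESIS

console.log(word.length);                     // 6
console.log(dumpCodePoints(word).join(' '));  // U+0052 U+0075 U+0308 U+0067 U+0065 U+006E
```

The string has six code points even though it renders as five letters, which is exactly the mismatch described above.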

I have modified the code to correctly strip out combining characters. The combining characters are those in the following Unicode blocks:

  • Combining Diacritical Marks (0300–036F), since version 1.0, with modifications in subsequent versions up to 4.1
  • Combining Diacritical Marks Extended (1AB0–1AFF), version 7.0
  • Combining Diacritical Marks Supplement (1DC0–1DFF), versions 4.1 to 5.2
  • Combining Diacritical Marks for Symbols (20D0–20FF), since version 1.0, with modifications in subsequent versions up to 5.1
  • Combining Half Marks (FE20–FE2F), since version 1.0, with modifications in subsequent versions up to 8.0
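One way to remove every character in the blocks listed above is a single regular expression character class. This is only a sketch of the idea, not necessarily how this module implements it internally; the names `COMBINING_MARKS` and `stripCombiningMarks` are assumptions made for the example.

```javascript
// Character class covering the five combining-mark blocks listed above.
const COMBINING_MARKS =
  /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/g;

// Remove combining marks only; precomposed letters are left untouched
// and would still need the usual table-based strip.
function stripCombiningMarks(str) {
  return str.replace(COMBINING_MARKS, '');
}

console.log(stripCombiningMarks('Ru\u0308gen'));                  // Rugen
console.log(stripCombiningMarks('R\u00FCgen') === 'R\u00FCgen');  // true
```

The second call shows why both steps are needed: a precomposed "ü" contains no combining mark, so this pass alone leaves it as it is.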

Other Options

You can perform Unicode normalization first and avoid the combining character strip, but that would actually involve more work. You can also check whether your string consists only of precomposed characters and can therefore use a simplified strip function. Of course, that test has a computational cost that probably makes it not worthwhile. Note that you'll find several modules on npm that claim to perform normalization of Unicode strings. They use the term in a misleading manner: all the ones I've seen do the same as this module, namely strip diacritics. A proper normalization function would take the combining character and the character it modifies and replace them with the precomposed version.
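For comparison, this is what proper normalization looks like with the built-in `String.prototype.normalize()`, which is part of standard JavaScript and independent of this module. The sketch shows composition (NFC) and, as an alternative route, decomposition (NFD) followed by a strip of the basic combining block. How either route compares in speed with this module is not measured here.

```javascript
const decomposed = 'Ru\u0308gen'; // "u" + combining diaeresis, 6 code points

// NFC: combine base letter and combining mark into the precomposed letter.
const composed = decomposed.normalize('NFC');
console.log(composed === 'R\u00FCgen'); // true
console.log(composed.length);           // 5

// NFD: split precomposed letters apart, then drop the combining marks.
// Only the basic block (0300–036F) is stripped in this sketch.
const stripped = composed.normalize('NFD').replace(/[\u0300-\u036F]/g, '');
console.log(stripped);                  // Rugen
```

Note that NFC by itself removes nothing: it only changes how the accented letter is represented.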

Usage

const
     strip = require( 'yads' ),
     testStr = `Rügen caractères spéciaux contrairement à la langue française`;

console.log( `${testStr} =>\n${strip.precomposed( testStr )}` );
// Rügen caractères spéciaux contrairement à la langue française =>
// Rügen caracteres speciaux contrairement a la langue francaise

console.log( `${testStr} =>\n${strip.combining( testStr )}` );
// Rügen caractères spéciaux contrairement à la langue française =>
// Rugen caracteres speciaux contrairement a la langue francaise

Note that calling precomposed() or combining() also switches the default function used by remove_diacritics():

console.log( `${testStr} =>\n${strip.remove_diacritics( testStr )}` );
// Rügen caractères spéciaux contrairement à la langue française =>
// Rugen caracteres speciaux contrairement a la langue francaise

strip.precomposed();

console.log( `${testStr} =>\n${strip.remove_diacritics( testStr )}` );
// Rügen caractères spéciaux contrairement à la langue française =>
// Rügen caracteres speciaux contrairement a la langue francaise

In addition, some utility functions are included that might be useful for various searches, especially typeaheads.

const simpleStr = `42 caractères spéciaux!`;

console.log( `keep letters only: ${strip.letters_only( simpleStr )}` );
// caracteres speciaux

console.log( `keep letters and numbers only: ${strip.alphanum_only( simpleStr )}` );
// 42 caracteres speciaux

console.log( `packed letters only: ${strip.packed( simpleStr )}` );
// caracteresspeciaux

console.log( `packed letters and numbers only: ${strip.packed_alphanum( simpleStr )}` );
// 42caracteresspeciaux