@datagica/fast-index

v0.1.0

Fast Index

Downloads

387

Readme

Datagica Fast-Index

A library to look up whether a word is inside an index, even if the spelling is a bit different.

Usage

Installation

$ npm install @datagica/fast-index --save

Building the index

import FastIndex from "@datagica/fast-index";

const index = new FastIndex({

  // fields to be indexed
  fields: [
    'label',
    'aliases'
  ],

  // a custom spelling generation function
  spellings: (map, word) => {
    // replace "le " or "el " by "the " with an arbitrary similarity score
    // of 0.5 (you can choose any value between 0 and 1)
    map.set(word.replace(/(?:le|el) /gi, 'the '), 0.5)
  }
})

// now load a sample dataset
index.loadSync([
  { label: 'the chef', type: 'movie' },

  // duplicate entries are supported and will be returned in the results
  { label: 'the chef', type: 'book' },

  // duplicates inside an entry are simply skipped
  { label: 'el chef', aliases: [ 'el chef' ] }
]);


// side-note: here is the internal representation of the data inside the index:
[ [ 'the chef',
    [ { value: { label: 'the chef', type: 'movie' }, score: 1 },
      { value: { label: 'the chef', type: 'book' }, score: 1 },
      { value: { label: 'el chef', type: 'unknow', aliases: [ 'el chef' ] },
        score: 0.5 } ] ],
  [ 'el chef',
    [ { value: { label: 'el chef', type: 'unknow', aliases: [ 'el chef' ] },
        score: 1 } ] ] ]

Querying the index

const matches = index.get("le chef");

// matches now contains:
[
  { value: { label: 'the chef', type: 'movie' },
    score: 0.5 },

  { value: { label: 'the chef', type: 'book' },
    score: 0.5 },

  // note how "el chef" has a lower score, although it would be closer using
  // a distance function. That's because we chose a naive spelling function
  // that converts everything into a single locale (English).
  // A better function would be more fine-tuned and store each locale
  // individually.
  { value: { label: 'el chef',  type: 'unknow', aliases: [ 'el chef' ] },
    score: 0.25 }
]

History

Problem

The original algorithm used for fuzzy matching entities in all Datagica projects (@datagica/fuzzy-index) was based on a lookup inside a tree of possible spellings, using a Finite State Levenshtein Transducer. This was nice because it allowed us to match a university name, for instance, even if it was spelled a bit differently (e.g. universidad instead of university).

However, for huge datasets this proved quite slow and incompatible with real-time lookup, because it searched more alternative spellings than necessary.

Solution

The new Fast Index simply converts its inputs into a simplified representation of the word, where punctuation, accents, useless spaces, etc. have been removed.
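To give a rough idea of what such a simplification can look like, here is a minimal sketch. The helper name normalizeKey and the exact set of transforms are assumptions made for illustration; this is not the library's actual implementation, which may apply different or additional rules.

```javascript
// Hypothetical helper, NOT part of the @datagica/fast-index API.
// It only illustrates the kind of simplification described above.
function normalizeKey(text) {
  return text
    .toLowerCase()
    .normalize("NFD")                     // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "")      // drop the combining marks (the accents)
    .replace(/[^\p{L}\p{N}\s]/gu, " ")    // treat punctuation and symbols as spaces
    .replace(/\s+/g, " ")                 // collapse runs of whitespace
    .trim();
}

console.log(normalizeKey("  Université   de Poudlard! "));
// prints "universite de poudlard"
```

Two inputs that normalize to the same key can then be matched with a plain map lookup, which is what makes this approach cheaper than exploring a tree of possible spellings.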

These transformations are quite opinionated, but some of them can be disabled.

In addition to these default transforms, Fast-Index also gives you a way to define custom alternative spellings.

For instance, you can tell Fast-Index to automatically convert:

  • de -> of
  • ité -> ity

Now, if the input word is université de Poudlard it will match university of Poudlard!
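The two conversions above could be written as a spelling function along these lines. This is only a sketch: it reuses the (map, word) signature and the 0.5 score from the usage example, and it assumes the function receives the word with its accents still present, which the readme does not state. If accents are stripped beforehand, the second pattern would need to match "ite" instead.

```javascript
// Sketch of a custom spelling function, using the same (map, word)
// signature as the `spellings` option shown in the usage section.
// Assumption: `word` still contains its accents when this is called.
function spellings(map, word) {
  const alternative = word
    // "de" as a standalone word -> "of"
    .replace(/(^|\s)de(?=\s|$)/gi, "$1of")
    // trailing "ité" -> "ity" (lookahead instead of \b, which is ASCII-only in JS)
    .replace(/ité(?=\s|$)/gi, "ity");

  // Only register the alternative when something actually changed
  // (a choice made for this sketch, the original example always calls map.set).
  if (alternative !== word) {
    map.set(alternative, 0.5);
  }
}

// Trying it out without the library, using a plain Map:
const generated = new Map();
spellings(generated, "université de Poudlard");
console.log(generated); // Map(1) { 'university of Poudlard' => 0.5 }
```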

When using the spelling generator, you decide which score is assigned to each alternative spelling. This helps Fast-Index pick the best match later.
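On the caller's side, the scores can be used to keep only the strongest results. The sketch below assumes the result shape shown in "Querying the index" (an array of { value, score } objects); bestMatches is a hypothetical helper, not something the library provides.

```javascript
// Sample results, copied from the shape shown in "Querying the index".
const matches = [
  { value: { label: "the chef", type: "movie" }, score: 0.5 },
  { value: { label: "the chef", type: "book" }, score: 0.5 },
  { value: { label: "el chef", type: "unknow", aliases: ["el chef"] }, score: 0.25 },
];

// Hypothetical helper: keep every result that shares the highest score.
function bestMatches(results) {
  if (results.length === 0) return [];
  const top = Math.max(...results.map((r) => r.score));
  return results.filter((r) => r.score === top);
}

console.log(bestMatches(matches).map((m) => m.value.type));
// prints [ 'movie', 'book' ]
```

Returning all results tied for the top score, rather than a single one, keeps duplicate entries such as the movie and the book visible to the caller.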