
futzy v0.1.1

A fuzzy string matching library

futzy

Futzy is a configurable fuzzy matching library for tools that search over structured input. Unlike many fuzzy matching libraries, which match arbitrary characters anywhere in a string, Futzy matches based on tokens. Tokenization unlocks meaningful performance improvements (especially for very large datasets) and provides more accurate search results (at the cost of a few extra keypresses).

Futzy breaks strings in the provided dataset into tokens. Tokens are case-insensitive and must be alphanumeric. Non-alphanumeric characters are not indexed. Results are sorted based on relevance, where "relevance" prioritizes strings which have matching tokens earlier in the string, and strings where matching tokens have the fewest possible tokens between them.
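The tokenization rules above (case-insensitive, alphanumeric-only, non-alphanumeric characters discarded) can be sketched in a few lines. This is an illustrative implementation, not Futzy's actual tokenizer.js:

```javascript
// Illustrative tokenizer sketch: lowercase the input, split on runs of
// non-alphanumeric characters, and drop any empty pieces.
function tokenize(str) {
  return str.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

console.log(tokenize("xxx.abc.yyy.zzz")); // ["xxx", "abc", "yyy", "zzz"]
console.log(tokenize("FOO BAR ZAP"));     // ["foo", "bar", "zap"]
```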

Performance is achieved in a few ways:

  • Minimizing allocations (search operations avoid allocating new objects or arrays), which reduces garbage-collection pressure
  • Using efficient data structures to reduce time complexity of insertions and lookups
  • Trading memory use for performance: Futzy uses extra memory in its index to reduce the time complexity of searches
  • Short-circuiting aggressively
  • Avoiding full scans of the provided dataset
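The memory-for-speed trade-off can be illustrated with a simple inverted index: a map from each token to the dataset strings that contain it, so a search never has to scan every string. This is a hypothetical sketch, not Futzy's actual internal data structure:

```javascript
// Illustrative sketch of a token -> string-index map. Extra memory is spent
// up front so that lookups by token avoid a full scan of the dataset.
function buildTokenIndex(dataset) {
  const index = new Map();
  dataset.forEach((str, i) => {
    for (const token of str.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)) {
      if (!index.has(token)) index.set(token, []);
      const postings = index.get(token);
      // Avoid recording the same string twice for a repeated token.
      if (postings[postings.length - 1] !== i) postings.push(i);
    }
  });
  return index;
}

const index = buildTokenIndex(["abc.def", "xxx.abc.yyy"]);
console.log(index.get("abc")); // [0, 1] -- both strings contain the token "abc"
```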

Usage

const {Index} = require('futzy');

const testCorpus = [
  "abc.def",
  "xxxx.yyyy",
  "xxx.abc.yyy.zzz.def",
  "xxx.abc.yyy.zzz",
  "xxx.abc.yyy",
  "xxx.abc.defg",
];
const index = new Index(testCorpus, {});
const results = index.search('a d');
/*
results ==
[
  "abc.def",
  "xxx.abc.defg",
  "xxx.abc.yyy.zzz.def",
]
*/

The Index class takes two parameters:

  • An Array<string>, which is the dataset
  • Optionally, an object with options.

The options object may have these members:

  • performRawSearch (Default false): If there are fewer results than resultLimit, strings in the dataset that contain the search query as a substring are appended to the results, up to resultLimit.
  • performRawSearchWhenNoResults (Default true): Only performs the performRawSearch behavior if there would otherwise be no results. This is useful as a backup to support weird queries.
  • resultLimit (Default 20): The maximum number of strings to return in the results. Keeping this value small improves performance for large datasets.
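Putting the options together, an index tuned for a large dataset might be constructed like this (the option values are illustrative, and `dataset` is assumed to be defined elsewhere):

```javascript
const {Index} = require('futzy');

const index = new Index(dataset, {
  performRawSearch: false,              // don't pad results with raw substring hits
  performRawSearchWhenNoResults: true,  // ...unless there would otherwise be none
  resultLimit: 10,                      // smaller result sets are cheaper to build
});
```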

Matching

Input is first tokenized: e.g., fo ba za becomes ['fo', 'ba', 'za']. Each string in the dataset is then tested against these tokens. To be a plausible result, the string must contain each of the provided tokens, in order. For example, the following input strings would match:

  • foo.bar.zap
  • fomo is bad but zalgo does not care
  • FOO BAR ZAP
  • fo ba za

The following input strings would not match:

  • football zap
  • foo zap bar
  • f oo bar zap
  • foobarzap
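The rule implied by the examples above can be sketched as follows: each query token must appear within a distinct token of the candidate string, and the matched tokens must occur in the same order as the query. This is an illustrative reading of the documented behavior, not Futzy's actual matching code:

```javascript
// Hedged sketch of the matching rule: consume candidate tokens left to right,
// requiring each query token to appear inside a distinct candidate token.
function tokenize(str) {
  return str.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

function matches(query, candidate) {
  const queryTokens = tokenize(query);
  const candidateTokens = tokenize(candidate);
  let pos = 0;
  for (const qt of queryTokens) {
    while (pos < candidateTokens.length && !candidateTokens[pos].includes(qt)) {
      pos++;
    }
    if (pos === candidateTokens.length) return false;
    pos++; // each query token consumes a distinct candidate token
  }
  return true;
}

console.log(matches("fo ba za", "foo.bar.zap"));  // true
console.log(matches("fo ba za", "football zap")); // false
```

Note why "football zap" fails: "football" satisfies the query token "fo", but no later token contains "ba", and tokens are never reused.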

Usefulness

This algorithm is useful for the following types of UIs:

  1. Database table search
  2. Metric name search
  3. Long lists of URLs
  4. File lookup (e.g., the output of git ls-files)
  5. Software namespaces or software package names

Any dataset which has a reasonable cardinality of tokens (relative to the number of strings), where each string is cleanly divided into tokens, is likely a good fit for Futzy.

Known limitations

  • Tokens cannot contain non-alphanumeric characters (e.g., accented characters). This is likely a straightforward fix in tokenizer.js, and contributions are welcome.
  • Constants used for scoring relevance are non-configurable.
  • Indexes cannot be updated.

Additionally, the following performance issues are known, though they are generally only problematic with large datasets:

  • Datasets with many strings that contain one-character or two-character tokens may exhibit poor search performance for short queries. You may wish to require a minimum length (e.g., 3-4 characters) when querying.
  • Datasets with very large numbers of unique tokens (e.g., a set of many distinct UUIDs) may have very high memory use.
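The minimum-length mitigation suggested above can be enforced with a small wrapper. This is a hypothetical helper, not part of Futzy's API; the 3-character floor is an arbitrary example:

```javascript
// Hypothetical wrapper: skip the search entirely until the query reaches a
// minimum length, so very short queries never hit the index.
function guardedSearch(index, query, minLength = 3) {
  if (query.trim().length < minLength) return []; // too short: no results yet
  return index.search(query);
}
```

Here `index` is assumed to expose Futzy's `search` method as shown in the Usage section.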