
paxx v1.1.10

inverted index query engine (full text search)


simple inverted index search engine

good for short corpora, such as city names or food ingredients, and especially powerful if you have an external popularity signal to combine into the scoring. not recommended for searching long documents (such as newspapers), where term frequency matters a lot.

example usage:

const {
  Index,
  AND,
  OR,
  TERM,
  CONSTANT,
  DISMAX,
  analyzers
} = require("paxx");

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keyword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

// iterate over all the matches
ix.forEach(
  new OR(
    ...ix.terms("name", "creme"),
    new AND(
      // matches the k9 document because spaceBetweenDigits splits 'k9' into 'k' and '9'
      ...ix.terms("name", "9k hell"),
    ),
    new AND(...ix.terms("name", "ja"), new CONSTANT(1, new OR(...ix.terms("type", "user")))),
    new DISMAX(...ix.terms("name", "doe"), ...ix.terms("type", "user"))
  ),
  // callback called with the document and its IDF score
  function(doc, score) {
    console.log({ doc, score });
  }
);

outputs:

{ doc: { name: 'john Crème Brulée', type: 'user' },
  score: 2.6931471805599454 }
{ doc: { name: 'hello world k777bb k9 bzz', type: 'user' },
  score: 7.673976433571672 }
{ doc: { name: 'doe' }, score: 2.6931471805599454 }

queries

NB: queries are stateful and cannot be reused; you must create a new query per request, as in the sketch below.
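
a minimal sketch of that pattern, using a hypothetical search helper that builds a fresh query on every call:

function search(ix, text) {
  // a new query object is created per request, never reused
  let q = new OR(...ix.terms("name", text));
  let results = [];
  ix.forEach(q, function(doc, score) {
    results.push({ doc, score });
  });
  return results;
}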

  • TERM

new TERM(numberOfDocumentsInIndex, postingsList)
e.g.

let t = new TERM(5, [1, 2, 34])

the term is the most primitive query; it uses binary search to advance its position (same as Lucene's TermQuery)

its score is (unnormalized) IDF, computed as follows (a worked example appears after this list):

this.idf = 1 + Math.log(nDocumentsInIndex / (postingsList.length + 1));

tf is not used or stored.

  • AND
new AND(queryA, queryB, queryC)

returns a (queryA AND queryB AND queryC) boolean query (similar to Lucene's bool MUST); its score is the sum of the scores of the matching subqueries (unnormalized)

  • OR
new OR(queryA, queryB, queryC)

returns a (queryA OR queryB OR queryC) boolean query (similar to Lucene's bool SHOULD); its score is the sum of the scores of the matching subqueries (unnormalized)

  • DISMAX
new DISMAX(tiebreaker, queryA, queryB, queryC)

e.g.
let q = new DISMAX(0.1, queryA, queryB, queryC)

returns a (queryA DISMAX queryB DISMAX queryC) boolean OR query (similar to Lucene's DisMax); its score is the max of the matching subqueries' scores plus the tiebreaker multiplied by the sum of the remaining matching scores.

  • CONSTANT
new CONSTANT(boost, query)

e.g.
let q = new CONSTANT(0.1, query)

returns a constant score query that scores with whatever boost you give it; in the example above, new CONSTANT(0.1, ...) will score with 0.1
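
to make the scoring concrete, here is a small worked sketch; the numbers follow directly from the formulas above, and the popularity boost is just a hypothetical use of CONSTANT:

// TERM: idf = 1 + Math.log(nDocumentsInIndex / (postingsList.length + 1))
// e.g. new TERM(5, [1, 2, 34]) scores 1 + Math.log(5 / 4) ≈ 1.223
//
// AND / OR: sum of the matching subqueries' scores,
// e.g. matching terms scoring 2.0 and 1.0 -> 3.0
//
// DISMAX: max + tiebreaker * sum of the remaining matching scores,
// e.g. scores 2.0 and 1.0 with tiebreaker 0.5 -> 2.0 + 0.5 * 1.0 = 2.5
//
// CONSTANT: fixed boost, handy for mixing in an external popularity signal
let popular = new CONSTANT(10, new OR(...ix.terms("type", "user"))); // scores 10 whenever it matches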

TOP N

let limit = 2 // -1 for all matches, sorted by idf score
let matches = ix.topN(
  new DISMAX(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  limit
)

outputs: 
[ { name: 'hello world k777bb k9 bzz', type: 'user' } ]

more examples

ix.forEach(
  new OR(...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

ix.forEach(
  new DISMAX(
    // tiebreaker
    0.5,
    // variable argument list of queries
    ...ix.terms("name", "hello"),
    new CONSTANT(1000, new OR(...ix.terms("name", "world")))
  ),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

Index

creates an inverted index (a handy way to store the postings lists and create term queries)

to create an index you pass a per-field analyzer, e.g. for the 'name' field you could use the autocomplete analyzer, while for the 'type' field you could use the keyword analyzer (a noop)

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keyword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

to create an array of term queries out of a field, use ix.terms("field", "token"), e.g. ix.terms("name", "john"); you can wrap those queries in AND/OR/DISMAX etc., as in the sketch below
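
for example, with the autocomplete analyzer a multi-word input is split into one token per word, so terms() returns one TERM query per token:

// "john creme" is tokenized into two tokens, so terms() returns two TERM queries
let queries = ix.terms("name", "john creme");

let all = new AND(...queries); // every token must match
let any = new OR(...queries);  // any token may match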

analyzers

an analyzer is a group of tokenizers and normalizers; the sketch after this list shows how the choice of analyzer affects matching

  • Autocomplete (analyzers.autocomplete)
tokenize at index: whitespace, edge
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits
  • Basic (analyzers.basic)
tokenize at index: whitespace
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits
  • Keyword (analyzers.keyword)
tokenize at index: noop
tokenize at search: noop
normalize: noop
  • Soundex (analyzers.soundex)
tokenize at index: whitespace, soundex
tokenize at search: whitespace, soundex
normalize: lowercase, unaccent, spaceBetweenDigits
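
a short sketch of how the analyzer choice affects matching, using only the built-in analyzers described above:

let ix2 = new Index({
  name: analyzers.autocomplete, // edge tokens at index time, so prefixes match
  type: analyzers.keyword       // noop, so the whole value must match exactly
});

ix2.doIndex([{ name: "John", type: "user" }], ["name", "type"]);

// matches: 'John' is lowercased and indexed as 'j', 'jo', 'joh', 'john'
ix2.forEach(new OR(...ix2.terms("name", "jo")), function(doc, score) {
  console.log({ doc, score });
});

// matches only the exact keyword 'user'
ix2.forEach(new OR(...ix2.terms("type", "user")), function(doc, score) {
  console.log({ doc, score });
});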

tokenizers

a tokenizer takes a string and produces tokens from it; at the moment these are available:

  • whitespace: 'a b c' -> ['a','b','c']
  • noop: 'a b c' -> ['a b c']
  • soundex: 'halo hello' -> ['H400', 'H400']
  • edge: 'hello' -> ['h','he','hel','hell','hello']

any object that has an apply([string]) -> [string] function can be used as a tokenizer, for example this hypothetical comma tokenizer:
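
// hypothetical custom tokenizer: splits every input string on commas
let commaTokenizer = {
  apply: function(strings) {
    let out = [];
    for (let s of strings) out.push(...s.split(","));
    return out;
  }
};

commaTokenizer.apply(["a,b", "c"]); // ['a', 'b', 'c']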

normalizers

normalizers apply transformations to the string, and are used both at search and index time

  • lowercase: 'ABC' -> 'abc'
  • unaccent: 'Crème' -> 'Creme'
  • removeNonAlphanumeric: 'a/b/c' -> 'a b c'
  • spaceBetweenDigits: 'k9' -> 'k 9'

any object that has an apply(string) -> string function can be used as a normalizer, for example this hypothetical whitespace collapser:
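
// hypothetical custom normalizer: collapses runs of whitespace into one space
let collapseSpaces = {
  apply: function(s) {
    return s.replace(/\s+/g, " ");
  }
};

collapseSpaces.apply("a   b"); // 'a b'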