paxx

v1.1.10

Published

2 years ago

inverted index query engine (full text search)

Downloads

0High
0Medium
0Low

jackdoe

full text search fts inverted index query

simple inverted index search engine

good for short corpuses, such as city names, food ingredients etc, and especially powerfull if you have some external popularity to be used in combination for the scoring, wouldnt recommend to be used for searching long corpuses (such as news papers), where the term frequency is of much importance.

example usage:

const {
  Index,
  AND,
  OR,
  TERM,
  CONSTANT,
  DISMAX,
  analyzers
} = require("paxx");

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keyword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

// iterate over all the matches
ix.forEach(
  new OR(
    ...ix.terms("name", "creme"),
    new AND(
      // matches on k9 because it splits k and 9
      ...ix.terms("name", "9k hell"),
    ),
    new AND(...ix.terms("name", "ja"), new CONSTANT(1, new OR(...ix.terms("type", "user")))),
    new DISMAX(...ix.terms("name", "doe"), ...ix.terms("type", "user"))
  ),
  // callback called with the document and its IDF score
  function(doc, score) {
    console.log({ doc, score });
  }
);

outputs:

{ doc: { name: 'john Crème Brulée', type: 'user' },
  score: 2.6931471805599454 }
{ doc: { name: 'hello world k777bb k9 bzz', type: 'user' },
  score: 7.673976433571672 }
{ doc: { name: 'doe' }, score: 2.6931471805599454 }

queries

NB: the queries is statefull, and can not be reused, you must create one query per request

TERM


new TERM(numberOfDocumentsInIndex, postingsList)
e.g.

let t = new Term(5, [1,2,34])

the term is the most primitive query, it uses binary search to advance its position. (same as lucens's TermQuery)

its score is (unnormalized) IDF as follows:

this.idf = 1 + Math.log(nDocumentsInIndex / (postingsList.length + 1));

tf is not used or stored.

new AND(queryA, queryB, queryC)

returns (queryA AND queryB AND queryC) boolean query (similar to lucene's Bool MUST), its score is the sum of the scores of the matching subqueries (unnormalized)

new OR(queryA, queryB, queryC)

returns (queryA OR queryB OR queryC) boolean query (similar to lucene's Bool SHOULD), its score is the sum of the scores of the matching subqueries (unnormalized)

DISMAX

new DISMAX(tiebreaker, queryA, queryB, queryC)

e.g.
let q = new DISMAX(0.1, queryA, queryB, queryC)

returns (queryA DISMAX queryB DISMAX queryC) boolean OR query (similar to lucene's DisMax), its score is the max of the matching subqueries plus the tiebreaker multiplier by the rest of the scores.

CONSTANT

new CONSTANT(boost, query)

e.g.
let q = new CONSTANT(0.1, query)

returns a constant score query, that will score with whatever boost you give it, in the example used new CONSTANT(0.1...) it will score with 0.1

TOP N

let limit = 2 // -1 for all matches, sorted by idf score
let matches = ix.topN(
  new DISMAX(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  limit
)

outputs: 
[ { name: 'hello world k777bb k9 bzz', type: 'user' } ]

more examples

ix.forEach(
  new OR(0.5, ...ix.terms("name", "hello"), ...ix.terms("name", "world")),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

ix.forEach(
  new DISMAX(
    // tiebreaker
    0.5,
    // variable argument list of queries
    ...ix.terms("name", "hello"),
    new CONSTANT(1000, new OR(...ix.terms("name", "world")))
  ),
  function(doc, score, docID) {
    console.log({ doc, score, docID });
  }
);

Index

create inverted index (which is a handy way to store the postings lists and create term queries)

to create an index you need to pass per-field analyzer, e.g. for the 'name' field you could use autocomplete analyzer, but for the 'type' field you could use ID analyzer (noop)

let ix = new Index({
  name: analyzers.autocomplete,
  type: analyzers.keword
});

ix.doIndex(
  // documents to be indexed
  [
    { name: "john Crème Brulée", type: "user" },
    { name: "hello world k777bb k9 bzz", type: "user" },
    { name: "jack", type: "admin" },
    { name: "doe" }
  ],
  // which fields to index (must be strings)
  ["name", "type"]
);

create an array term queries out of a field: ix.terms("field", "token") e.g. ix.terms("name","john"), you can wrap those queries in AND/OR/DISMAX etc

analyzers

analyzer is a group of tokenizers and normalizers

Autocomplete (analyzers.autocomplete)

tokenize at index: whitespace, edge
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits

Basic (analyzers.basic)

tokenize at index: whitespace
tokenize at search: whitespace
normalize: lowercase, unaccent, spaceBetweenDigits

Keyword (analyzers.keyword)

tokenize at index: noop
tokenize at search: noop
normalize: noop

SOUNDEX (analyzers.soundex)

tokenize at index: whitespace, soundex
tokenize at search: whitespace, soundex
normalize: lowercase, unaccent, spaceBetweenDigits

tokenizers

a tokenizer takes a string and produces tokens from that string, at the moment those are available:

whitespace: 'a b c' -> ['a','b','c']
noop: 'a b c' -> ['a b c']
soundex: 'halo hello' -> ['H400', 'H400']
edge: 'hello' -> ['h','he','hel','hell','hello']

any object that has apply([string]) -> [string] function can be used as tokenizer

normalizers

normalizes apply transformation to the string, used both at search and index time

lowercase: 'ABC' -> 'abc'
unaccent: 'Crème' - 'Creme'
removeNonAlphanumeric: 'a/b/c' -> 'a b c'
space between digits: k9 -> 'k 9'

any object that has apply(string) -> string function can be used as normalizer

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme