@datagica/fast-index
v0.1.0
Published
Fast Index
Downloads
387
Readme
Datagica Fast-Index
A library to lookup if a word is inside an index, even if the spelling is a bit different.
Usage
Installation
$ npm install @datagica/fast-index --save
Building the index
import "FastIndex" from "@datagica/fast-index";
const index = new FastIndex({
// fields to be indexed
fields: [
'label',
'aliases'
],
// a custom spelling generation function
spellings: (map, word) => {
// replace "le " or "el " by "the " with an arbitrary similarity score
// of 0.5 (you can choose any value between 0 and 1)
map.set(word.replace(/(?:le|el) /gi, 'the '), 0.5)
}
})
// now we load some dataset
index.loadSync([
{ label: 'the chef', type: 'movie' },
// duplicate entries are supported and will be returned in the results
{ label: 'the chef', type: 'book' },
// duplicates inside an entry are simply skipped
{ label: 'el chef', aliases: [ 'el chef' ] }
]);
// side-note: here is the internal representation of the data inside the index:
[ [ 'the chef',
[ { value: { label: 'the chef', type: 'movie' }, score: 1 },
{ value: { label: 'the chef', type: 'book' }, score: 1 },
{ value: { label: 'el chef', type: 'unknow', aliases: [ 'el chef' ] },
score: 0.5 } ] ],
[ 'el chef',
[ { value: { label: 'el chef', type: 'unknow', aliases: [ 'el chef' ] },
score: 1 } ] ] ]
Querying the index
const matches = index.get("le chef");
// this will output
[
{ value: { label: 'the chef', type: 'movie' },
score: 0.5 },
{ value: { label: 'the chef', type: 'book' },
score: 0.5 },
// note how "el chef" has a lower score, although it would be closer using
// a distance function. That's because we choose a naive spelling function
// that converts everything into a single locale (english).
// a better function would be more fine-tuned and store each locale
// individually
{ value: { label: 'el chef', type: 'unknow', aliases: [ 'el chef' ] },
score: 0.25 }
]
History
Problem
The original algorithm used for fuzzy matching entities in all Datagica projects
(@datagica/fuzzy-index
) was based on a lookup inside a tree of possible
spellings, using a Finite State Levenshtein Transducer.
This was nice because it allowed us to match a university name (for instance)
even if it was spelled a bit differently eg (universidad instead of university).
However for huge datasets this proved quite slow, incompatible with real-time lookup, as it searched more alternative spellings than necessary.
Solution
The new Fast Index simply converts its inputs into a simple representation of the word, where punctuation, accents, useless spaces etc.. have been removed.
These transformations are quite opinionated, but some can be opted-out.
In addition to these default transforms, Fast-Index also gives you a way to define custom alternative spellings.
For instance, you can tell Fast-Index to automatically convert:
de
->of
ité
->ity
Now, if the input word is université de Poudlard
it will match university of Poudlard
!
When using the spelling generator, you decide yourself which distance score should be allocated. This helps the Fast-Index picks the best match it find later.