@chcaa/text-search-lite
v0.16.2
Full-text search engine for node.js
Text Search Lite
A full-text search engine with support for phrase, prefix, and fuzzy searches using the bm25f scoring algorithm. A built-in mini query language is provided for advanced search features, along with a programmatic query interface. Aggregations and filters are supported as well.
Installation
npm install @chcaa/text-search-lite
Troubleshooting
If the build fails because of the node-snowball-package, try to install:
sudo apt-get install -y build-essential
// TODO what is the equivalent on Windows, some .NET build tools package?
Getting Started
Any POJO with an id property (an integer >= 1) can be indexed by text-search-lite. Documents (objects) are indexed in a SearchIndex instance, which provides the main interface for adding, updating, deleting, and searching the documents of the index. When creating a new SearchIndex, the fields to search, sort, filter, or aggregate on must be defined in a schema definition so the SearchIndex can handle them correctly. Every field must define the type it should be indexed/stored as, which determines how the values of the field are processed for searches, filtering, and aggregations. Depending on the type of the field, additional options can be configured to further control what is indexed and stored and how values are processed. This is discussed in detail in the Document Schema chapter.
import { SearchIndex } from '@chcaa/text-search-lite';
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD },
{ name: 'age', type: SearchIndex.fieldType.NUMBER },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true }
]);
personsIndex.addAll(persons);
Once the search index has been created and documents have been added, it can be searched using the search() method. The query can either be expressed in a string-based query-string language or as a combination of query objects. The query-string language is used in the following examples.
// find all persons named "Jane", case does not matter
let janes = personsIndex.search('jane');
// find all female persons who can swim, "+" means the term must be present
let femalesWhoCanSwim = personsIndex.search('+female +swimming');
// narrow the search to only target specific fields
let femalesWhoCanSwimPrecise = personsIndex.search('+gender:(female) +hobbies:(swimming)');
// prefix search, wildcard single character, fuzzy search
let proximitySearch = personsIndex.search('J* 3? cyclist~');
The result of the queries will include an array of matching results with the id of the document and the relevance score of the document in relation to the query. If there are more than 10 results, only the first 10 results will be included (this can be controlled using the pagination option).
{
results: [
{
id: 1,
score: 0.4458314786416938
}
],
sorting: {
field: "_score",
order: "desc"
},
pagination: {
offset: 0,
limit: 10,
total: 1
},
query: {
queryString: "jane",
errors: []
}
}
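How offset, limit, and total in the pagination part relate to each other can be sketched in plain JavaScript. This is only an illustration of the semantics described above, not the library's implementation; paginate is a hypothetical helper and the matches are made-up data.

```javascript
// Hypothetical helper (not part of text-search-lite): applies offset/limit
// to an already sorted list of matches and reports the total match count.
function paginate(matches, { offset = 0, limit = 10 } = {}) {
  return {
    results: matches.slice(offset, offset + limit),
    pagination: { offset, limit, total: matches.length }
  };
}

// 25 made-up matches with ids 1..25
const matches = Array.from({ length: 25 }, (_, i) => ({ id: i + 1, score: 0 }));

const firstPage = paginate(matches);                            // default: offset 0, limit 10
const thirdPage = paginate(matches, { offset: 20, limit: 10 }); // the remaining matches

console.log(firstPage.results.length, thirdPage.results.length, thirdPage.pagination.total);
```

The total always reflects all matches, so a client can derive the number of pages from total and limit.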
To include the source object and/or a highlighted version of the source object, set the includeSource and highlight query options. For includeSource or highlight to be able to resolve the source objects, source.store must be enabled (the default) or the idToSourceResolver function must be configured in the query options.
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsById = new Map(); // this could be from a db/repository
persons.forEach(p => personsById.set(p.id, p));
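// find all persons named "Jane" and highlight them
let janes = personsIndex.search('jane', {
  highlight: { enabled: true },
  // optional: resolve the source objects from the map instead of the stored source
  idToSourceResolver: ids => ids.map(id => personsById.get(id))
});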
For each document, the result will include a highlight.source property where the terms matching the search are enclosed in HTML <mark> elements.
{
results: [
{
id: 1,
score: 0.4458314786416938,
highlight: {
source: {
id: 1,
name: "<mark>Jane</mark>",
gender: "female",
age: 54,
hobbies: ["Cycling", "Swimming"]
}
}
}
],
// ...
}
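The effect of highlighting on a single string value can be approximated as below. This is a simplified sketch under the assumption of whole-word, case-insensitive matching; markTerm is a hypothetical helper, and the library's own highlighter works on analyzed terms and may behave differently.

```javascript
// Hypothetical helper (not part of text-search-lite): wraps case-insensitive
// whole-word occurrences of `term` in <mark> elements.
function markTerm(text, term) {
  // escape characters with a special meaning in regular expressions
  const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return text.replace(new RegExp(`\\b(${escaped})\\b`, 'gi'), '<mark>$1</mark>');
}

const highlightedName = markTerm('Jane', 'jane');
console.log(highlightedName);
```

The original casing of the matched text is kept; only the surrounding markup is added.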
Aggregations for all non-text fields can be collected using the aggregations part of the queryOptions (see the Aggregations chapter for more).
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
When aggregations are requested, the result includes an aggregations array with an entry for each requested aggregation. Term aggregations are sorted by docCount:DESC, term:ASC.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
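The docCount:DESC, term:ASC ordering of term buckets corresponds to a two-level comparator. The sketch below uses made-up buckets and a hypothetical sortTermBuckets helper; it only illustrates the ordering and is not taken from the library.

```javascript
// Hypothetical helper: orders buckets by docCount descending, then by key ascending.
function sortTermBuckets(buckets) {
  return [...buckets].sort((a, b) =>
    b.docCount - a.docCount || a.key.localeCompare(b.key)
  );
}

const sortedBuckets = sortTermBuckets([
  { key: 'swimming', docCount: 1 },
  { key: 'cycling', docCount: 1 },
  { key: 'running', docCount: 3 }
]);

console.log(sortedBuckets.map(b => b.key));
```

Buckets with the same docCount fall back to alphabetical order, which is why "cycling" precedes "swimming" when both have a count of 1.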
The aggregations are only collected for the documents matching the search query and filters (if applied), so if we search for "jane"
we only get aggregations for the documents matching this query.
// get aggregations about the documents matching the query
let all = personsIndex.search('jane', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'cycling', docCount: 1 },
{ key: 'swimming', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 0 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Filters, e.g., coming from user-selected facets (created from the aggregations), can be applied using the filter part of the queryOptions. Multiple filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. BooleanFilters can be nested in as many levels as needed.
import { greaterThanOrEqualFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35
let all = personsIndex.search('', {
filter: greaterThanOrEqualFilter('age', 35)
});
As we did a match-all query (empty string) and only narrowed the results using a filter, the score for all documents is 0: filters do not score results, only queries do. For the same reason, a match-all query also changes the default sorting to id instead of _score.
{
results: [
{ id: 1, score: 0 },
{ id: 3, score: 0 }
],
sorting: { field: 'id', order: 'asc' },
pagination: { offset: 0, limit: 10, total: 2 },
query: { queryString: '' }
}
The two filters below are combined using AND logic, meaning that a document must pass both filters to be included.
import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
filter: andFilter([
greaterThanOrEqualFilter('age', 35),
termFilter('hobbies', 'swimming')
])
});
Only a single person ("Jane") matches the filters.
{
results: [
{ id: 1, score: 0 }
],
// ...
}
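Conceptually, an AND filter keeps only the documents that pass every child filter. The plain JavaScript below mimics that logic on the example data with ordinary predicates; ageAtLeast, hasHobby, and and are hypothetical stand-ins, not the library's filter classes.

```javascript
const persons = [
  { id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
  { id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
  { id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];

// Hypothetical predicate factories standing in for the library's filters
const ageAtLeast = min => p => p.age >= min;
// tag values are matched case-insensitively, so compare lowercased values
const hasHobby = hobby => p => p.hobbies.some(h => h.toLowerCase() === hobby);
// AND: a document must pass all predicates
const and = predicates => p => predicates.every(pred => pred(p));

const matchingIds = persons
  .filter(and([ageAtLeast(35), hasHobby('swimming')]))
  .map(p => p.id);

console.log(matchingIds);
```

Rose passes the age filter but has no hobbies, and John swims but is younger than 35, so only Jane remains.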
Document Schema
Each document field to index for searching and/or to use for filtering, sorting, and aggregations must be defined as part of the document schema for the SearchIndex. A field is defined as an object which must always have a name and a type. Depending on the type, a set of additional properties can be set to further specify how the values of the field should be processed and stored.
The fields are passed to the SearchIndex as an array including all the fields the SearchIndex should know about. Additionally, a schema options object for advanced configuration of the search index can be passed as a second argument.
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT, },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD, index: false },
{ name: 'age', type: SearchIndex.fieldType.NUMBER, index: true },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true, docValues: false }
], {
schema: { // general config of the schema
analyzer: SearchIndex.analyzer.LANGUAGE_ENGLISH,
score: { // ADVANCED change the default settings of scoring algorithm
k1: 1.5,
}
}
});
Schema options
The schema part of the options-object can be used to change the default general settings of the schema. All properties are optional, so the argument can be left out if no change is required.
- analyzer: string - The name of the default analyzer to use for text fields. text-search-lite has a set of common analyzers built in for different languages, accessible through SearchIndex.analyzer, or a custom analyzer can be installed and used. Defaults to 'standard'.
- sorting: object - Sorting config.
  - locale: string - The default Unicode locale to use for sorting text-like fields. The language part is required, the region is optional. Defaults to en-US.
- score: object - Scoring config.
  - k1: number - k1 for the bm25f scoring algorithm. The default value to use for all analyzers not having k1 assigned specifically. Defaults to 1.2.
  - analyzerK1: object - k1 for individual analyzers. Register k1 for a specific analyzer by using the name of the analyzer as the property name and k1 as the value.
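To give an intuition for what k1 (and the per-field b described under Field settings) control, the sketch below implements the textbook BM25 term weight. It is an assumption-laden illustration: the library uses bm25f, whose exact formula may differ, and bm25TermScore with its parameter names is hypothetical.

```javascript
// Textbook BM25 term weight (illustrative only, not the library's bm25f code).
// tf: term frequency in the field, docLength/avgDocLength: field length statistics,
// docCount: number of documents, docFreq: number of documents containing the term.
function bm25TermScore({ tf, docLength, avgDocLength, docCount, docFreq, k1 = 1.2, b = 0.75 }) {
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  // b controls how strongly the field length normalizes the score
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  // k1 controls how quickly repeated occurrences of the term saturate
  return idf * (tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

const base = { tf: 3, avgDocLength: 10, docCount: 3, docFreq: 1 };
const shortDoc = bm25TermScore({ ...base, docLength: 5 });
const longDoc = bm25TermScore({ ...base, docLength: 20 });
const higherK1 = bm25TermScore({ ...base, docLength: 10, k1: 1.5 });
const defaultK1 = bm25TermScore({ ...base, docLength: 10 });

console.log({ shortDoc, longDoc, higherK1, defaultK1 });
```

In this model a shorter field scores higher than a longer one for the same term frequency, and a larger k1 gives repeated occurrences of a term more weight.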
Field settings
The following properties are available for all field types:
- name: string - The path name of the field, e.g. "author.name" (for arrays of values the brackets should be excluded, e.g. "authors.name").
- type: ('text'|'keyword'|'tag'|'number'|'date'|'boolean') - The type of the field.
- index?: boolean - Set to true if the field should be searchable.
- docValues?: boolean - Set to true if the field should be available in filters or be used for aggregations.
- array?: boolean - Set to true if the property is an array or a descendant of an array. Defaults to false.
- boost?: number - The relevance of the field when scoring it in a search. Must be >= 1. Defaults to 1.
- prefix?: object - Prefix config. Only relevant if index=true.
  - eagerLoad?: boolean - Set to true if prefix mappings should be eagerly loaded. If false, prefixes are first loaded when the field is queried. Defaults to false.
  - partitionDepth?: number - The maximum partition depth of the prefix tree. In most cases the default is fine; set it to a higher number only when, e.g., all analyzed terms start with the same prefix, such as "000-SomeValue". Defaults to 3.
- fuzzy?: object - Fuzzy config. Only relevant if index=true.
  - enabled?: boolean - true if fuzzy queries should be supported. Defaults to true.
- score?: object - The scoring parameters to use when calculating the score of the field. Only relevant if index=true.
  - b?: number - The hyperparameter b of bm25f. Defaults to 0.75.
- docStats?: object - Document statistics config. Only relevant if index=true.
  - length?: boolean - Should the doc field length be stored. Defaults to false.
  - termFrequencies?: boolean - Should the doc term frequencies be stored. Defaults to false.
  - termPositions?: boolean - Should the doc term positions be stored. Only relevant for text fields. This is required for doing phrase searches. Defaults to false.
- sorting: boolean|object - Primarily used for enabling sorting on fields of type text, as all other field types are sortable if they have docValues=true.
  - locale: string - The Unicode locale to use for sorting text-like fields. Overrides the default setting taken from the schema options.
  - transform: function - A custom transform function to convert the value to the value to sort on. Must return the same type as the input type.
Some properties are only available for specific field types or have another value than the default. The additional fields and different default values are described for each field below.
A note on docStats
Even though docStats can be enabled for all field types, it only makes sense for text fields, as no other field types are tokenized. The only exception is a field with array=true where the number of elements in the array should be taken into account when scoring the document.
E.g., if we have documents with a hobbies array where doc-1 has ["bicycling"] and doc-2 has ["bicycling", "climbing"], and documents with more hobbies should score lower than documents with fewer hobbies, then docStats.length could be set to true, as the number of hobbies will then be used when calculating how relevant the match is.
text Fields
A text field is the primary field to use for full-text searches. A text field is analyzed (normalized and tokenized) when indexed, which makes it ideal for efficient lookup of terms and phrases in the text of the field.
docValues not allowed
A text field cannot have docValues=true because it is tokenized. Therefore, a text field cannot be used for filtering or aggregations, and it can only be sorted on through the sorting setting (see Sorting below).
Default settings override
- index: true
- docStats
  - length: true
  - termFrequencies: true
  - termPositions: true
Text field-specific settings
- indexExact?: boolean - Set to true for text fields to enable phrase searches and more precise matching. Defaults to true.
- analyzer?: string - The name of the analyzer to use for this field. Defaults to undefined, which resolves to the default analyzer configured for the search index.
Sorting
If sorting=true, the first 50 characters of the original text are lowercased and used for sorting.
To change the default behavior, a custom transform function can be supplied instead, e.g. sorting: { transform: v => v.substring(0, 20).toUpperCase() }.
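The default sort key described above can be pictured as a small transform applied before comparing values. The sketch below is a plain JavaScript approximation with made-up titles; defaultSortKey is a hypothetical name and the comparison via localeCompare with en-US is an assumption based on the default locale mentioned in the schema options.

```javascript
// Hypothetical equivalent of the default transform: first 50 characters, lowercased
const defaultSortKey = v => v.substring(0, 50).toLowerCase();

const titles = ['banana split', 'Apple pie', 'cherry tart'];

// compare the transformed keys using the assumed default locale
const sortedTitles = [...titles].sort((a, b) =>
  defaultSortKey(a).localeCompare(defaultSortKey(b), 'en-US')
);

console.log(sortedTitles);
```

Because the keys are lowercased first, "Apple pie" sorts before "banana split" regardless of its capital letter.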
Sorting and memory usage
Enabling sorting for a text field creates a new hidden index which consumes extra memory, so only enable sorting for text fields where it is actually needed.
keyword Fields
A keyword field is indexed as is, without applying any form of analysis. To match the value of a keyword field, the exact same string as when the field was indexed must be used.
Default settings override
index: true
docValues: true
tag Fields
A tag field is indexed in the same way as a keyword field, except that the value is lowercased, making the field case-insensitive.
Default settings override
index: true
docValues: true
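The practical difference between keyword and tag matching can be sketched as follows. Both functions are hypothetical illustrations of the matching rules described above, not library code.

```javascript
// keyword: the query must equal the indexed value exactly
const matchesKeyword = (indexedValue, query) => indexedValue === query;

// tag: both sides are lowercased, so the match is case-insensitive
const matchesTag = (indexedValue, query) => indexedValue.toLowerCase() === query.toLowerCase();

console.log(matchesKeyword('Swimming', 'swimming')); // exact comparison
console.log(matchesTag('Swimming', 'swimming'));     // lowercased comparison
```

This is why the hobbies field in the Getting Started example, indexed as a tag, can be searched with "swimming" even though the source value is "Swimming".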
number Fields
A number field is used to store numeric values such as age, weight, length, and other measures. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
index: false
docValues: true
fuzzy
enabled: false
number fields should in most cases only be indexed if the vocabulary is relatively small and made up of integers. A large vocabulary, caused by either floating point numbers or large-scale integers, will be hard to match in a search and would furthermore result in a large inverted index.
date Fields
A date field is used to store date and date-time values. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
index: false
docValues: true
fuzzy
enabled: false
Date field-specific settings
- format: string - A format string in one of the formats yyyy, yyyy-MM-dd or yyyy-MM-dd'T'HH-mm-ssZ.
Document value types
The value of the document field can express a date in one of the following ways:
- number - An integer in epoch millis. Negative values are allowed.
- string - A date string if field.format is defined.
Dates in BC time can for all string formats be defined as negative values with six-digit years: -yyyyyy. E.g. -000001-01-01
When format is defined, both epoch millis and date strings in the defined format are allowed as values for the field. If docValues is enabled for the field, the date is converted to epoch millis before it is stored, and this value is then used for filtering, aggregations, and sorting.
If index is enabled for the field, the date is converted to the defined string format before it is indexed, so the date can be searched for using the given format.
A date field can only be indexed if the format precision is set to yyyy or yyyy-MM-dd. A large vocabulary caused by minute, second, or even millisecond precision will be hard to match in a search and would furthermore result in a large inverted index.
Regarding Time Zones
Dates will internally always be stored as UTC. If date inputs include time using the yyyy-MM-dd'T'HH-mm-ssZ format and no time zone is present, the date will be parsed as UTC.
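The conversion from epoch millis to an indexable UTC date string can be sketched with the built-in Date API. formatUtcDate is a hypothetical helper covering only the two indexable formats; it is an approximation of the behavior described above, not the library's converter.

```javascript
// Hypothetical helper: converts epoch millis to 'yyyy' or 'yyyy-MM-dd' in UTC.
function formatUtcDate(epochMillis, format) {
  // toISOString() always reports UTC, e.g. "1970-01-01T00:00:00.000Z"
  const iso = new Date(epochMillis).toISOString();
  const datePart = iso.slice(0, iso.indexOf('T'));
  if (format === 'yyyy-MM-dd') return datePart;
  // drop the trailing "-MM-dd" (6 characters) to keep only the year
  if (format === 'yyyy') return datePart.slice(0, datePart.length - 6);
  throw new Error(`Unsupported format in this sketch: ${format}`);
}

const epochStart = formatUtcDate(0, 'yyyy-MM-dd');
const leapDayYear = formatUtcDate(Date.UTC(2024, 1, 29), 'yyyy');

console.log(epochStart, leapDayYear);
```

Working in UTC keeps the indexed string independent of the time zone of the machine doing the indexing.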
boolean Fields
A boolean field is used to store the boolean values true|false. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
index: false
docValues: true
fuzzy
enabled: false
docId Fields
A special field type used for storing the id of a document in an optimized way. The field type cannot be configured on user-defined fields but can still be encountered, as the id field is publicly available.
Create, Update and Delete Documents
Creating, updating, and deleting documents can be done using the following methods.
All documents must have an id property with an integer value > 0.
Method: add(document)
Adds a document to the index. If the document already exists, an error will be thrown.
Parameters:
- document: object - The document to add.
Method: addAll(documents)
Adds all documents to the index. If one of the documents already exists, an error will be thrown.
This method is performance optimized for adding many documents at once.
Parameters:
- documents: object[] - The documents to add.
Method: update(document)
Updates an existing document. If the document does not exist, the document will be added.
Parameters:
- document: object - The document to update.
Method: remove(document)
Removes the document from the index.
Parameters:
- document: object - The document to remove.
Method: removeById(id)
Removes the document with the id from the index.
Parameters:
- id: number - The id of the document to remove.
Searching
Searching the index is done using the search() method. The query part of the search can be expressed in the built-in query string language or as a combination of query objects. Furthermore, filters, aggregations, sorting, and pagination can be applied/requested through the optional queryOptions object, which can be passed as a second argument to search().
Method: search(query, [queryOptions])
Searches the index.
Parameters:
- query: string|Query|Query[] - The query to search for.
- queryOptions?: object - Query options.
  - fields?: string[] - The names of the fields to search. Defaults to all user-created indexed fields if not defined.
  - pagination?: object - The pagination to apply.
    - offset?: number - The pagination offset. Defaults to 0.
    - limit?: number - The pagination limit. Defaults to 10.
  - sorting?: object - The sorting to apply.
    - field?: string - The field to sort by or "_score".
    - order?: ('asc'|'desc') - The sorting order.
  - filter?: Filter - The filter to apply.
  - aggregations?: Aggregation[] - The aggregations to generate.
  - highlight?: boolean|object - Highlight options. Defaults to false.
    - enabled: boolean - Should highlight be enabled.
    - escapeHtml?: boolean - Should HTML characters (<, >, &, ', ") be replaced by their equivalent HTML entity name. Fields with no highlight will be escaped as well. Defaults to false.
    - prefix?: object - Prefix highlight options.
      - expand?: boolean - Set to true if the matched term should be fully highlighted and false if only the prefix part should. Defaults to true.
  - includeSource?: boolean - Should the source object be included in the result. Defaults to false.
  - idToSourceResolver?: function(number[]):{id:number}[] - A function for resolving source objects from an array of ids. If configured, this function will be used to resolve source objects for the query instead of the source objects originally indexed.
  - queryString?: object - Query string options.
    - parseOptions: object - Query string parse options. Enable/disable which query-string expressions to parse.
    - defaultOccurrence: ('should'|'must'|'mustNot') - The default occurrence to use when no occurrence modifier is set. Defaults to 'should'.
See also the "search index options" section of the search index configuration chapter for configuring custom default query options.
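What the escapeHtml highlight option does to a value can be approximated as below. The entity choices (in particular &#39; for the apostrophe) are an assumption, and escapeHtml here is a standalone hypothetical function, not the library's implementation.

```javascript
// Assumed entity mapping for the five characters listed for the escapeHtml option
const HTML_ENTITIES = { '<': '&lt;', '>': '&gt;', '&': '&amp;', "'": '&#39;', '"': '&quot;' };

// Hypothetical helper: replaces each special character with its entity
function escapeHtml(text) {
  return text.replace(/[<>&'"]/g, ch => HTML_ENTITIES[ch]);
}

const escaped = escapeHtml('a < b & c');
console.log(escaped);
```

Escaping matters when highlighted values are inserted into a page as HTML, since the <mark> elements are meant to be the only markup in the output.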
Returns:
- object - The result of the search.
  - results: object[] - Information about each document matching the search and applied filters and pagination.
    - results[].id: number - The id of the document.
    - results[].score: number - The relevance score of the document. (This will be 0 when sorting on something other than _score or when a match-all query is performed.)
    - results[].source: object - The source object, if requested in the queryOptions.
    - results[].highlight: object - Highlight information, if requested in the queryOptions.
    - results[].highlight.source: object - A highlighted version of the source object.
  - sorting: object - The sorting applied to the result.
  - pagination: object - The pagination applied to the result and the total number of matches.
    - offset: number
    - limit: number
    - total: number - The total match count.
  - aggregations: object - The aggregation results. (See aggregations for the different result object structures).
  - query: object - The query-string and possible errors. This is only available if the query was performed using a query-string.
    - queryString: string - The query-string used for the search.
    - errors: object[] - The parse errors, if any, which occurred during parsing of the query-string.
Query String Language
Text-search-lite has a built-in query-string mini language for expressing text-based queries with support for expressing the same types of queries as the programmatic API does, such as boolean modifiers, phrases, wildcards, targeting specific fields, and grouping of statements. The query-string parser automatically converts any unparsable part of the query to regular "text" making it safe to expose the query-string language directly to the end-user.
The following modifiers and expressions are supported and can be turned on/off individually to limit what should be parsed and what should just be treated as regular text.
Phrases "A phrase"
A phrase is one or more terms in a specific form, which should be present in a particular order.
search for "a full sentence" or for a "single" specific spelling of a term
Must Operator +
The term, phrase, or group content must be present in the document for it to match.
+peace in the +world
Must Not Operator -
The term, phrase, or group content must not be present in the document for it to match.
peace not -war
Boost Operator ^NUMBER
Boost the relevance of the term, phrase or the content of a group.
peace^10 "love not war"^2
Prefix Operator *
The term must start with one or more characters, but the ending is undetermined. Prefix queries take the difference in length between the match and the prefix string into account when scoring is calculated.
love and pea*
Wildcard Operator ?, *
The term can have single and multiple character spans which are undetermined. The single-character wildcard is expressed by ? and the multiple-character wildcard is expressed by *. Wildcard queries take the difference in length between the match and the prefix string into account when scoring is calculated.
love and p?a*e
Fuzzy Operator ~, ~[0, 1, 2]
The term must match other terms within a maximum edit distance. When the edit distance is not defined specifically by one of [0, 1, 2], the edit distance is calculated based on the length of the term.
- length < 3: maxEdits = 0
- length < 6: maxEdits = 1
- length >= 6: maxEdits = 2
Fuzzy queries take the edit distance between the term and the result into account when scoring is calculated.
love~ and peace~2
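The automatic edit distance rule listed above maps directly to a small function. autoMaxEdits is a hypothetical name used for illustration; only the thresholds are taken from the list.

```javascript
// Hypothetical helper mirroring the documented thresholds for the auto edit distance
function autoMaxEdits(term) {
  if (term.length < 3) return 0; // very short terms must match exactly
  if (term.length < 6) return 1; // 3-5 characters allow one edit
  return 2;                      // 6 or more characters allow two edits
}

console.log(autoMaxEdits('ab'), autoMaxEdits('love'), autoMaxEdits('cyclist'));
```

So in the example query, love~ allows one edit, while peace~2 explicitly overrides the automatic value of 1 for a five-character term.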
Groups ()
Terms and phrases can be grouped together, and boolean operators and boost can be applied to a group, making it possible to express more complex queries.
+peace +(world earth) (love solidarity)^10
Field Groups FIELD1:FIELD2:()
Field groups offer the same possibilities as groups and additionally target one or more fields where the match must occur. Multiple fields must be separated by a colon (:). Field groups cannot be nested.
title:(world earth) title:description:(love solidarity)
Query String Options
The parsing of the query-string language can be configured in the queryOptions of search() and parseQueryStringToQueryObjects(), where each language feature can be enabled/disabled individually. All features are enabled by default.
- queryString: object - The query string options.
  - parseOptions: object - Enable/disable which query-string expressions to parse.
    - quote: boolean - Toggle parsing of "exact strings and phrases".
    - group: boolean - Toggle parsing of (terms in group).
    - fieldGroup: boolean - Toggle parsing of title:(terms in field group).
    - mustOperator: boolean - Toggle parsing of +mustOperator.
    - mustNotOperator: boolean - Toggle parsing of -mustNotOperator.
    - prefixOperator: boolean - Toggle parsing of prefix*.
    - wildcardOperator: boolean - Toggle parsing of wil?c*d.
    - fuzzyOperator: boolean - Toggle parsing of fuzzy~1.
    - boostOperator: boolean - Toggle parsing of boost^10.
  - defaultOccurrence: ('should'|'must'|'mustNot') - The default occurrence to use when no occurrence modifier is set. Defaults to 'should'.
Parse Errors
When using the query-string language, the results of the search() and parseQueryStringToQueryObjects() methods include information about any parse errors and their exact location in the string. The parse errors are structured in the following format.
- errors: object[] - An array of error objects.
  - errors[].type: string - The type of the error.
  - errors[].message: string - A user-friendly message.
  - errors[].startIndex: number - The start index in the source string where the reported error occurs.
  - errors[].spanSize: number - The character span of the reported error.
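The startIndex and spanSize properties make it straightforward to point at the offending part of a query. The sketch below renders a caret line under the query string; underlineError is a hypothetical helper, and the error object is a made-up sample in the documented shape, not actual output of the parser.

```javascript
// Hypothetical helper: renders the query, a caret line marking the error span,
// and the user-friendly message.
function underlineError(queryString, error) {
  const marker = ' '.repeat(error.startIndex) + '^'.repeat(Math.max(1, error.spanSize));
  return `${queryString}\n${marker}\n${error.message}`;
}

// Made-up sample error in the documented format (type and message are invented)
const sampleError = { type: 'sample', message: 'Sample parse error', startIndex: 7, spanSize: 6 };

const rendered = underlineError('+peace (world', sampleError);
console.log(rendered);
```

The same two properties could be used to wrap the span in an element of a web UI instead of drawing carets.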
The query-string language can also be validated directly using the validateQueryString() method, which could be used, e.g., for user feedback while typing a query.
Method: validateQueryString(queryString, [parseOptions])
Validates the query string. Any problems with the query string will be reported in the errors
array of the returned object.
Parameters:
- queryString: string - The query string to validate.
- parseOptions: object - Options for configuring which parts of the query string language should be enabled. (See Query String Options).
Returns:
- object - The result of the validation.
  - status: ('success'|'error') - The status of the validation.
  - errors: object[] - The parse errors which occurred during parsing. (See above).
  - queryString: string - The query-string which was validated.
Query Objects
The query language described in the previous chapter is converted into a combination of query objects, which can also be created programmatically.
A single query object or an array of query objects can be passed as the query to the search method of a SearchIndex instance.
import { prefixQuery, termQuery, fieldGroupQuery, Query } from "@chcaa/text-search-lite/query";
// get all persons with a term starting with "jo"
let query = prefixQuery('jo');
let startingWithJo = personsIndex.search(query);
let query2 = [termQuery('cycling'), termQuery('climbing')];
let withOneOfHobbies = personsIndex.search(query2);
let query3 = [termQuery('cycling', Query.occurrence.MUST), termQuery('climbing', Query.occurrence.MUST)];
let withBothHobbies = personsIndex.search(query3);
let query4 = fieldGroupQuery(['hobbies'], [termQuery('cycling'), termQuery('climbing')]);
let withOneOfHobbiesInField = personsIndex.search(query4);
To convert from the query-string language to the equivalent query objects, the SearchIndex exposes the method parseQueryStringToQueryObjects(). This makes it possible to express the initial part of a query in the query-string language and then further modify the query (add, replace, etc.) using query objects.
Factory Functions
Factory functions for creating the different kinds of query objects are exported from the @chcaa/text-search-lite/query package, along with the query classes the factory functions produce.
The factory functions are the suggested way of creating queries, whereas the classes can be used for type definitions.
Function: termQuery(term, [occurrence], [boost])
Creates a new TermQuery
for matching terms/tokens in a document.
The term (text) for the query will be analyzed using the analyzer of the field before performing the query. E.g., if the field to search is a tag field, the term will be lowercased; if the field is a text field, the term will be normalized and tokenized.
When searching text fields, all the tokens in the term need to be present in a document for the query to match. So a search for the term "They went for a walk" will be analyzed to something like ['they', 'went', 'for', 'a', 'walk'], and only documents with all of these tokens in the field will be included.
To search for documents containing only one or some of the tokens, the term should be split into smaller queries, typically by splitting on whitespace.
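Splitting a piece of user input into separate terms, as suggested above, could look like this. splitIntoTerms is a hypothetical helper; the final step of wrapping each term in a query is only indicated in a comment, since it depends on the library being installed.

```javascript
// Hypothetical helper: splits text on whitespace and drops empty entries
function splitIntoTerms(text) {
  return text.trim().split(/\s+/).filter(term => term.length > 0);
}

const terms = splitIntoTerms('They went for a walk');
console.log(terms);

// Each term could then be wrapped in its own query object, for example:
//   let queries = terms.map(term => termQuery(term));
//   personsIndex.search(queries);
```

With the default "should" occurrence, a document matching any one of the resulting term queries would then be included.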
Term queries can also be used on non-text fields such as number, boolean, and date fields and can be passed number and boolean type values.
If the passed-in value is a supported non-text type, it will be transformed to the correct indexed version of the value before querying, e.g., (boolean: true → "true"), (number: 1000 → "1000"), (date: 0 → "1970-01-01").
Parameters:
- term: string|number|boolean - The term to search for.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
- boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.
returns:
TermQuery
Function: phraseQuery(phrase, [occurrence], [boost])
Creates a new PhraseQuery
for matching documents containing a phrase.
Parameters:
- phrase: string - The phrase to search for.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence of the phrase. Defaults to "should".
- boost?: number - The boost to multiply the score of the phrase with when scoring the matching documents. Defaults to 1.
returns:
PhraseQuery
Function: prefixQuery(term, [occurrence], [boost])
Creates a new PrefixQuery
for matching documents with terms starting with a prefix.
Parameters:
- term: string - The prefix-term the matched terms should start with.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
- boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.
returns:
PrefixQuery
Function: wildcardQuery(term, [occurrence], [boost])
Creates a new WildcardQuery
for matching documents on a term with wildcards.
- ? matches a single character.
- * matches 0-n characters.
The wildcard-term cannot start with a wildcard.
Parameters:
- term: string - The wildcard-term the matched terms should match.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
- boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.
returns:
WildcardQuery
Function: fuzzyQuery(term, [maxEdits], [occurrence], [boost], [maxTopTermExpansionsPerField])
Creates a new FuzzyQuery for matching documents matching an expanded (fuzzy) term.
A fuzzy query expands the term up to a maximum Levenshtein edit distance of 2 and uses the expanded terms to perform an OR query on the fields to search.
If the limit of maxTopTermExpansionsPerField is exceeded, the top terms will be selected based on simple idf relevance (docCount/df) and edit distance.
The auto maxEdit distance (the default) has the following values:
- length < 3: maxEdits = 0
- length < 6: maxEdits = 1
- length >= 6: maxEdits = 2
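The auto rule and the expansion step can be sketched as below. This is an illustration under stated assumptions, not the library's implementation: `autoMaxEdits`, `levenshtein`, and `expandFuzzy` are hypothetical names, the vocabulary is a plain array rather than an index, and the idf-based top-term selection is left out.

```javascript
// Auto maxEdits as described above, based on the length of the term.
function autoMaxEdits(term) {
  if (term.length < 3) return 0;
  if (term.length < 6) return 1;
  return 2;
}

// Classic dynamic-programming Levenshtein distance (two rolling rows).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let curr = [i];
    for (let j = 1; j <= b.length; j++) {
      let cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Keeps the vocabulary terms within the allowed edit distance (-1 = auto).
function expandFuzzy(term, vocabulary, maxEdits = -1) {
  let limit = maxEdits === -1 ? autoMaxEdits(term) : maxEdits;
  return vocabulary.filter(candidate => levenshtein(term, candidate) <= limit);
}

let vocabulary = ['swimming', 'swimmer', 'cycling', 'swim', 'swam'];
console.log(expandFuzzy('swiming', vocabulary)); // length 7, so maxEdits = 2
console.log(expandFuzzy('swim', vocabulary)); // length 4, so maxEdits = 1
```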
Parameters:
- term: string - The term to search for variants of.
- maxEdits?: (-1|0|1|2) - The maximum edits allowed. Auto (-1) determines the max edits based on the length of the initial term. Defaults to -1.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
- boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.
- maxTopTermExpansionsPerField?: number - The maximum expansions to include per field. Defaults to 50.
returns: FuzzyQuery
Function: groupQuery(children, [occurrence], [boost])
Creates a new GroupQuery for matching documents fulfilling a group of queries.
Parameters:
- children: Query[] - The queries this query should combine based on their occurrence.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence requirement of the group (if e.g., included in a parent group). Defaults to "should".
- boost?: number - The boost to multiply the score of the children of the group with when scoring the matching documents. Defaults to 1.
returns: GroupQuery
Function: fieldGroupQuery(fieldNames, children, [occurrence], [boost])
Creates a new FieldGroupQuery for matching documents fulfilling a group of queries across one or more fields.
Parameters:
- fieldNames: string[] - The field names the children of this group should be matched against.
- children: Query[] - The queries this query should combine based on their occurrence.
- occurrence?: ("should"|"must"|"mustNot") - The occurrence requirement of the group (if e.g., included in a parent group). Defaults to "should".
- boost?: number - The boost to multiply the score of the children of the group with when scoring the matching documents. Defaults to 1.
returns: FieldGroupQuery
Function: matchAllQuery()
Creates a new MatchAllQuery matching all documents in the index.
returns: MatchAllQuery
Filters
Filters can be used to narrow down the search result (or the full dataset) on any field with docValues=true. Various filters are provided for filtering on the different field types, and user-defined filters can be defined as well if needed.
To apply multiple filters, the filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. Filters can be nested using BooleanFilters in as many levels as needed.
Filters are applied in the queryOptions object.
import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
filter: andFilter([
greaterThanOrEqualFilter('age', 35),
termFilter('hobbies', 'swimming')
])
});
Caching of Filters
Most predicate filters, such as TermFilter, RangeFilter, and PrefixFilter, are cacheable so their results can be reused, but they differ in whether caching is enabled by default, based on their presumed use case (consult each filter's documentation for its cache settings).
A rule of thumb:
- If the filter is expected to be reused, e.g., is a fixed range used for facets or another predefined range, turn caching on.
- If the filter values vary a lot, e.g., is user defined with many possibilities, turn caching off.
If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
Filter Types
The following filters are provided and described in detail in the next chapter.
Predicate filters: CustomFilter, DateRangeFilter, PrefixFilter, RangeFilter, RegexFilter, TermFilter.
Logical and special filters: BooleanFilter, ExistsFilter, IdsFilter.
Factory Functions
Factory functions for creating the different kinds of filters are exported from the @chcaa/text-search-lite/filter package along with the filter classes the factory functions produce.
The factory functions are the suggested way of creating filters, whereas the classes can be used for type definitions.
Function: termFilter(fieldName, term, [options])
Creates a new TermFilter for filtering on keyword, tag, number, date, and boolean fields.
Caching of the filter is disabled by default, as the search index can be used directly as cache and will be used instead. Caching should only be set to true in cases where index=false for the field and caching is required.
Parameters:
- fieldName: string - The name of the field to filter on.
- term: string|number|boolean - The term to filter on.
- options?: object - Config options.
  - cache?: boolean - Set to true if the filter should be cached and index=false for the field. Defaults to false.
returns: TermFilter
Function: rangeFilter(fieldName, minValue, maxValue, [options])
Creates a new RangeFilter for filtering documents on the presence of a range of numbers or characters.
Only one of minValue or maxValue is required, making it possible to express greater-than and less-than filters.
Caching of the filter is enabled by default. In cases where the ranges can vary a lot, e.g., with user-defined min and max values, the cache should be turned off, as there is an initial overhead in calculating the filter when caching is enabled: it is then calculated on all documents (for reusability) instead of just the documents matching the query and previous filters, which is the case when caching is turned off.
If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
Parameters:
- fieldName: string - The name of the field to filter on.
- minValue: number|string - The minimum value to accept (default inclusive).
- maxValue: number|string - The maximum value to accept (default exclusive).
- options?: object - Config options.
  - minValueInclusive: boolean - Set to true if the minValue should be inclusive. Defaults to true.
  - maxValueInclusive: boolean - Set to true if the maxValue should be inclusive. Defaults to false.
  - cache?: boolean - Set to true if the filter should be cached. Defaults to true.
returns: RangeFilter
Additionally, a set of convenience functions is supplied:
- rangeFilterMaxValueInclusive(fieldName, minValue, maxValue) - Creates a new RangeFilter with maxValueInclusive=true.
- greaterThanFilter(fieldName, minValue) - Creates a new RangeFilter with maxValue=undefined and minValueInclusive=false.
- greaterThanOrEqualFilter(fieldName, minValue) - Creates a new RangeFilter with maxValue=undefined and minValueInclusive=true.
- lessThanFilter(fieldName, maxValue) - Creates a new RangeFilter with minValue=undefined and maxValueInclusive=false.
- lessThanOrEqualFilter(fieldName, maxValue) - Creates a new RangeFilter with minValue=undefined and maxValueInclusive=true.
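The inclusive/exclusive defaults and the open-ended variants can be illustrated with a plain predicate. This is a sketch of the semantics only; `createRangePredicate` is a hypothetical helper and does not reflect how the library evaluates or caches a RangeFilter.

```javascript
// Hypothetical helper mirroring the documented defaults:
// minValue inclusive, maxValue exclusive, either bound may be undefined.
function createRangePredicate(minValue, maxValue, options = {}) {
  let { minValueInclusive = true, maxValueInclusive = false } = options;
  return value => {
    if (minValue !== undefined) {
      if (minValueInclusive ? value < minValue : value <= minValue) return false;
    }
    if (maxValue !== undefined) {
      if (maxValueInclusive ? value > maxValue : value >= maxValue) return false;
    }
    return true;
  };
}

let ages = [34, 35, 37, 54];
// 35 <= age < 54 (the defaults)
console.log(ages.filter(createRangePredicate(35, 54))); // [35, 37]
// comparable to greaterThanOrEqualFilter('age', 35)
console.log(ages.filter(createRangePredicate(35, undefined))); // [35, 37, 54]
// comparable to lessThanOrEqualFilter('age', 37)
console.log(ages.filter(createRangePredicate(undefined, 37, { maxValueInclusive: true }))); // [34, 35, 37]
```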
Function: dateRangeFilter(fieldName, format, minDate, maxDate, [options])
Creates a new DateRangeFilter for filtering documents on the presence of a range of dates.
Only one of minDate or maxDate is required, making it possible to express greater-than and less-than filters.
Caching of the filter is enabled by default. In cases where the ranges can vary a lot, e.g., with user-defined min and max dates, the cache should be turned off, as there is an initial overhead in calculating the filter when caching is enabled: it is then calculated on all documents (for reusability) instead of just the documents matching the query and previous filters, which is the case when caching is turned off.
If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
Parameters:
- fieldName: string - The name of the field to filter on.
- format: string - The format of minDate and maxDate. E.g. yyyy-MM-dd.
- minDate: number|string - The minimum date to accept (default inclusive).
- maxDate: number|string - The maximum date to accept (default exclusive).
- options?: object - Config options.
  - minDateInclusive: boolean - Set to true if the minDate should be inclusive. Defaults to true.
  - maxDateInclusive: boolean - Set to true if the maxDate should be inclusive. Defaults to false.
  - cache?: boolean - Set to true if the filter should be cached. Defaults to true.
returns: DateRangeFilter
Additionally, a set of convenience functions is supplied:
- dateRangeFilterMaxDateInclusive(fieldName, format, minDate, maxDate) - Creates a new DateRangeFilter with maxDateInclusive=true.
- greaterThanDateFilter(fieldName, format, minDate) - Creates a new DateRangeFilter with maxDate=undefined and minDateInclusive=false.
- greaterThanOrEqualDateFilter(fieldName, format, minDate) - Creates a new DateRangeFilter with maxDate=undefined and minDateInclusive=true.
- lessThanDateFilter(fieldName, format, maxDate) - Creates a new DateRangeFilter with minDate=undefined and maxDateInclusive=false.
- lessThanOrEqualDateFilter(fieldName, format, maxDate) - Creates a new DateRangeFilter with minDate=undefined and maxDateInclusive=true.
Function: prefixFilter(fieldName, prefix, [options])
Creates a new PrefixFilter for filtering documents on the presence of a given term prefix in the document.
Caching of the filter is disabled by default, as it is expected that the prefix will vary a lot and there is an initial overhead in calculating the filter when caching is enabled: it is then calculated on all documents (for reusability) instead of just the documents matching the query and previous filters, which is the case when caching is turned off.
If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
Parameters:
- fieldName: string - The name of the field to filter on.
- prefix: string - The prefix to filter on.
- options?: object - Config options.
  - cache?: boolean - Set to true if the filter should be cached. Defaults to false.
returns: PrefixFilter
Function: regexFilter(fieldName, regex, [options])
Creates a new RegexFilter for filtering documents on the presence of a given regex pattern in the document.
Caching of the filter is enabled by default. In cases where the regex can vary a lot, e.g., with a user-defined regex, the cache should be turned off, as there is an initial overhead in calculating the filter when caching is enabled: it is then calculated on all documents (for reusability) instead of just the documents matching the query and previous filters, which is the case when caching is turned off.
If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
Parameters:
- fieldName: string - The name of the field to filter on.
- regex: RegExp - The regex to filter on.
- options?: object - Config options.
  - cache?: boolean - Set to true if the filter should be cached. Defaults to true.
returns: RegexFilter
Function: customFilter(fieldName, predicate)
Creates a new CustomFilter for filtering documents on the result of a predicate function.
This filter is not cacheable, as the predicate function cannot be guaranteed to produce the same result for the same input: the predicate function's algorithm can rely on changing variables, time, etc.
Parameters:
- fieldName: string - The name of the field to filter on.
- predicate: function(value):boolean - A predicate function which is passed each value from the field and should return true or false depending on whether the document with the value should be included.
returns: CustomFilter
Function: existsFilter(fieldName)
Creates a new ExistsFilter which tests for existence of a value for the given field. The field exists if the value is not null, undefined, or [].
Parameters:
- fieldName: string - The name of the field to filter on.
returns: ExistsFilter
Function: idsFilter(ids)
Creates a new IdsFilter for filtering documents on their id. The filter accepts an Iterable of ids.
Parameters:
- ids: Iterable<number> - The ids of the documents to include.
returns: IdsFilter
Function: booleanFilter(filters, booleanOperator)
Creates a new BooleanFilter for combining the results of multiple Filter instances with one of AND, OR, or AND_NOT logic.
The AND_NOT filter subtracts the result of its filters from the parent filter's (or query's) results.
In most cases, using the convenience functions andFilter, orFilter, and andNotFilter is both easier and more expressive in terms of intent.
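The three operators can be thought of as set operations on document ids. The sketch below illustrates that reading; `combineFilters` is a hypothetical helper, and it assumes that AND_NOT subtracts the union of its child results from the parent result, which the text above does not spell out.

```javascript
// Hypothetical helper: combines per-filter result sets of document ids.
// parentResult is the set of ids coming from the query or the parent filter.
function combineFilters(filterResults, booleanOperator, parentResult) {
  switch (booleanOperator) {
    case 'and':
      // keep ids present in the parent result and in every filter result
      return filterResults.reduce(
        (acc, ids) => new Set([...acc].filter(id => ids.has(id))),
        parentResult
      );
    case 'or': {
      // keep parent ids present in at least one filter result
      let union = new Set(filterResults.flatMap(ids => [...ids]));
      return new Set([...parentResult].filter(id => union.has(id)));
    }
    case 'andNot': {
      // assumption: remove ids present in any of the filter results
      let excluded = new Set(filterResults.flatMap(ids => [...ids]));
      return new Set([...parentResult].filter(id => !excluded.has(id)));
    }
    default:
      throw new Error(`Unknown operator: ${booleanOperator}`);
  }
}

let allDocs = new Set([1, 2, 3]);
let female = new Set([1, 3]);
let swimmers = new Set([1, 2]);
console.log([...combineFilters([female, swimmers], 'and', allDocs)]); // [1]
console.log([...combineFilters([female, swimmers], 'or', allDocs)]); // [1, 2, 3]
console.log([...combineFilters([swimmers], 'andNot', allDocs)]); // [3]
```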
Parameters:
- filters: Filter[] - The filters to combine the results of with the passed in operator logic.
- booleanOperator: ("and"|"or"|"andNot") - The boolean operator logic to combine the filters with. A boolean operator enum is available at BooleanFilter.operator.
returns: BooleanFilter
Aggregations
Aggregations can be used to collect aggregated statistics about the result of a query. This could, e.g., be:
- the top 10 hobbies of documents
- number of documents grouped by age ranges
- number of documents grouped by birth-year decade
- etc.
Multiple aggregations can be requested at the same time, and aggregations can be nested to create drill-down detail hierarchies.
To request one or more aggregations, include them as part of the queryOptions object.
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies', 2),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
The results of the requested aggregations are included as an array on the result object from the query. All aggregation results have the same set of base properties where only the bucket objects differ depending on the type of aggregation requested.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
totalBucketCount: 4, // the total number of possible buckets (unique terms)
missingDocCount: 0,
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Peculiarities of aggregation buckets
As documents with array fields can occur more than once in the aggregated statistics, the sum of the counted document values may exceed the total number of documents in the query. This is expected.
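The effect can be reproduced with a small stand-alone count over the persons from Getting Started. This is only a sketch: `countTerms` is a hypothetical helper, and the lowercasing of keys is an assumption made to mirror the bucket keys shown in the example result.

```javascript
let persons = [
  { id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
  { id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
  { id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];

// Hypothetical helper: counts each distinct value once per document.
function countTerms(docs, fieldName) {
  let counts = new Map();
  let missingDocCount = 0;
  for (let doc of docs) {
    let values = [doc[fieldName]].flat().filter(v => v !== null && v !== undefined);
    if (values.length === 0) {
      missingDocCount++; // no value for the field in this document
      continue;
    }
    for (let value of new Set(values.map(v => String(v).toLowerCase()))) {
      counts.set(value, (counts.get(value) ?? 0) + 1);
    }
  }
  // sort by docCount descending, then key ascending
  let buckets = [...counts]
    .map(([key, docCount]) => ({ key, docCount }))
    .sort((a, b) => b.docCount - a.docCount || a.key.localeCompare(b.key));
  return { buckets, missingDocCount };
}

let hobbies = countTerms(persons, 'hobbies');
console.log(hobbies.buckets); // swimming: 2, cycling: 1
console.log(hobbies.missingDocCount); // 1
```

Jane is counted in both buckets, so the bucket counts add up to 3 although only two documents have any hobbies.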
Factory Functions
Factory functions for creating the different kinds of aggregations are exported from the @chcaa/text-search-lite/aggregation package along with the aggregation classes the factory functions produce. The factory functions are the suggested way of creating aggregation requests, whereas the classes can be used for type definitions.
Function: termAggregation(fieldName, [maxSize], [options])
Creates a new TermAggregation for collecting statistics about keyword, tag, number, date, and boolean fields. The occurrence of each distinct value will be counted once per document and returned in descending order, with the value with the most documents at the top.
Note: When a filter is set in the options-object, caching will be disabled for the aggregation results.
Parameters:
- fieldName: string - The name of the field to aggregate on.
- maxSize?: number - The maximum number of buckets. Defaults to 10.
- options?: object - See general config options in Config options.
  - filter: PredicateFilter - A predicate filter for filtering the terms to include in the aggregation.
returns: TermAggregation
Bucket results
Buckets are sorted by docCount:DESC, term:ASC.
{
// name, fieldName, etc...
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
totalBucketCount: 4, // the total number of possible buckets (unique terms)
missingDocCount: 0
}
Filter example
Filter bucket keys (terms) using a PrefixFilter so we only get buckets where the key starts with a "c".
import { termAggregation } from "@chcaa/text-search-lite/aggregation";
import { prefixFilter } from "@chcaa/text-search-lite/filter";
let all = personsIndex.search('', {
aggregations: [
termAggregation('hobbies', 10, {
filter: prefixFilter('hobbies', 'c')
})
]
});
Function: rangeAggregation(fieldName, ranges, [options])
Creates a new RangeAggregation for collecting statistics about number, keyword, and tag fields.
Parameters:
- fieldName: string - The name of the field to aggregate on.
- ranges: object[] - The ranges to create buckets for.
  - ranges[].from: number|string - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - ranges[].to: number|string - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- options?: object - Config options.
returns: RangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
],
missingDocCount: 0
}
Additionally, a set of convenience functions is supplied:
- rangeAggregationWithIntegerAutoBuckets(fieldName, bucketCount, min, max, [options]) - Creates a new RangeAggregation where the buckets are auto generated based on the input parameters.
- rangeAggregationWithIntegerAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options]) - Creates a new open-ended RangeAggregation where the buckets are auto generated based on the input parameters. The first bucket will only have to defined and the last bucket only from defined, and the bucket ranges are thus open-ended.
- rangeAggregationWithNumberAutoBuckets(fieldName, bucketCount, min, max, [options]) - Creates a new RangeAggregation where the buckets are auto generated based on the input parameters.
- rangeAggregationWithNumberAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options]) - Creates a new open-ended RangeAggregation where the buckets are auto generated based on the input parameters. The first bucket will only have to defined and the last bucket only from defined, and the bucket ranges are thus open-ended.
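How auto-generated buckets could be derived from bucketCount, min, and max is sketched below. `integerAutoRanges` is a hypothetical helper; in particular, rounding the boundaries when the interval does not divide evenly is an assumption, and the library may distribute the remainder differently.

```javascript
// Hypothetical helper: splits [min, max) into bucketCount integer ranges.
function integerAutoRanges(bucketCount, min, max, openEnded = false) {
  let step = (max - min) / bucketCount;
  let ranges = [];
  for (let i = 0; i < bucketCount; i++) {
    ranges.push({
      from: Math.round(min + i * step), // lower limit, inclusive
      to: Math.round(min + (i + 1) * step) // upper limit, exclusive
    });
  }
  if (openEnded) {
    delete ranges[0].from; // first bucket only has "to"
    delete ranges[ranges.length - 1].to; // last bucket only has "from"
  }
  return ranges;
}

// Same parameters as rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100)
console.log(integerAutoRanges(5, 0, 100));
// [{ from: 0, to: 20 }, { from: 20, to: 40 }, ..., { from: 80, to: 100 }]
console.log(integerAutoRanges(5, 0, 100, true));
// [{ to: 20 }, { from: 20, to: 40 }, ..., { from: 80 }]
```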
Function: dateRangeAggregation(fieldName, format, ranges, [options])
Creates a new DateRangeAggregation for collecting statistics about date fields. Date range aggregations work in the same way as range aggregations except that the bucket ranges can be expressed in a string date format.
Parameters:
- fieldName: string - The name of the field to aggregate on.
- format: string - The date format of the ranges. One of yyyy, yyyy-MM-dd or yyyy-MM-dd'T'HH-mm-ssZ.
- ranges: object[] - The ranges to create buckets for.
  - ranges[].from: number|string - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - ranges[].to: number|string - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- options?: object - Config options.
returns: DateRangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '1940-1950', from: '1940', to: '1950', fromMillis: -946771200000, toMillis: -631152000000, docCount: 0 },
{ key: '1990-2000', from: '1990', to: '2000', fromMillis: 631152000000, toMillis: 946684800000, docCount: 2 },
],
missingDocCount: 0
}
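The fromMillis and toMillis values in the buckets above correspond to the start of the given year in UTC, which can be checked with plain JavaScript. This is a sketch for the yyyy format only; `yearToMillis` and `toDateBucket` are hypothetical helpers, not library functions.

```javascript
// Hypothetical helper: start of a year (UTC) as epoch milliseconds.
function yearToMillis(year) {
  return Date.UTC(Number(year), 0, 1);
}

// Hypothetical helper: builds a bucket shaped like the example result above.
function toDateBucket(from, to) {
  return {
    key: `${from}-${to}`,
    from,
    to,
    fromMillis: yearToMillis(from),
    toMillis: yearToMillis(to)
  };
}

console.log(toDateBucket('1940', '1950'));
// fromMillis: -946771200000, toMillis: -631152000000
console.log(toDateBucket('1990', '2000'));
// fromMillis: 631152000000, toMillis: 946684800000
```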
Aggregation Options
All aggregations can additionally be configured to have a user-defined name and to include nested aggregations using the following options object structure.
- name?: string - The name of the aggregation, e.g., to distinguish two aggregations on the same field. If undefined, the field name will be used.
- aggregations?: Aggregation[] - Child aggregations to collect for each bucket of the aggregation.
Child aggregations can be requested as follows:
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get the top 2 hobbies for each gender across all (empty string = match all) documents
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender', 10, {
aggregations: [
termAggregation('hobbies', 2) // Top 2 hobbies for each gender
]
})
]
});
The result of the child aggregation will be attached to each parent bucket.
{
// name, fieldName, etc...
buckets: [
{
key: 'female', docCount: 2,
aggregations: [
{
// name, fieldName, etc...
buckets: [
{ key: 'swimming', docCount: 1 },
{ key: 'cycling', docCount: 1 }
]
}
]
},
{
key: 'male', docCount: 1,
aggregations: [
{
// name, fieldName, etc...
buckets: [
{ key: 'swimming', docCount: 1 }
]
}
]
}
],
missingDocCount: 0
}
SearchIndex Configuration
A SearchIndex can be further configured using the options argument, where the schema configuration can be customized as described in the Document Schema chapter, along with different default query and cache settings.
The options-object should be passed as the second argument to the SearchIndex constructor.
import { SearchIndex } from "@chcaa/text-search-lite";
let searchIndex = new SearchIndex([/* fields */], {
schema: { /* settings*/ },
query: { /* settings */ },
filter: { /* settings*/ },
aggregation: { /* settings*/ },
sorting: { /* settings*/ }
});
Search index options
- schema?: object - General schema configuration options. See the Document Schema chapter.
  - source?: object - Document source object storage configuration.
    - store?: true - true if the document source object should be stored. This improves updates and deletes and makes highlighting possible without supplying an idToSourceResolver. Defaults to true.
    - strategy?: ("memory"|"db") - The storage strategy to use for storing the source object. In both cases the storage is temporary and only spans the current program execution. Defaults to "memory". To use the "db" strategy, better-sqlite3 must be included as a dependency in the project's package.json.
- query?: object - Default configuration of query options.
  - options?: object - Custom overrides of the default queryOptions used in search(). Each possible query option can be configured to have a default fallback if not provided in the runtime queryOptions passed to search(). The overrides will be merged with the system defined default queryOptions.
- filter?: object - General filter configuration options.
  - cache?: object - Filter cache configuration.
    - maxSize?: number - The maximum size of the filter cache. Defaults to 100.
    - minDocCount?: number - The minimum number of document inputs to a filter before the filter is cached. Defaults to 100.
- aggregation?: object - General aggregation configuration options.
  - cache?: object - Aggregation cache configuration.
    - maxSize?: number - The maximum size of the aggregation cache. Defaults to 100.
    - minDocCount?: number - The minimum number of document inputs to an aggregation before the aggregation is cached. Defaults to 100.
- sorting?: object - General sorting configuration options.
  - cache?: object - Sorting cache configuration.
    - maxSize?: number - The maximum size of the sorting cache. Defaults to 100.
    - minDocCount?: number - The minimum number of document inputs to be sorted before the sorted result is cached. Defaults to 100.
Filters, aggregations, and sorting of search results each have their own cache which can be configured independently.
The cache works like a queue where the oldest elements are removed first when the limit of the cache is exceeded. The cached elements are stored using a key representing the content of the entry, ensuring that entries with the same content are only stored once.
To avoid unnecessary recalculations of "hot" cached entries and still only allow the same entry once in the cache, any existing entries are moved to the back of the queue each time they are requested.
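The queue behavior described above resembles a least-recently-used cache. A minimal sketch, assuming a string key per entry, could look as follows; `QueueCache` is a hypothetical class for illustration and not the library's cache implementation.

```javascript
// Hypothetical cache: oldest entries are evicted first, and requested
// entries are moved to the back of the queue.
class QueueCache {
  #maxSize;
  #entries = new Map(); // a Map iterates in insertion order: first key = oldest

  constructor(maxSize = 100) {
    this.#maxSize = maxSize;
  }

  get(key) {
    if (!this.#entries.has(key)) return undefined;
    let value = this.#entries.get(key);
    this.#entries.delete(key);
    this.#entries.set(key, value); // move the "hot" entry to the back
    return value;
  }

  set(key, value) {
    this.#entries.delete(key); // the same key is only stored once
    this.#entries.set(key, value);
    if (this.#entries.size > this.#maxSize) {
      let oldestKey = this.#entries.keys().next().value;
      this.#entries.delete(oldestKey); // evict the front of the queue
    }
  }

  keys() {
    return [...this.#entries.keys()];
  }
}

let cache = new QueueCache(2);
cache.set('age:[35,)', 'result A');
cache.set('hobbies:swimming', 'result B');
cache.get('age:[35,)'); // requested, so moved to the back
cache.set('gender:female', 'result C'); // evicts 'hobbies:swimming'
console.log(cache.keys()); // ['age:[35,)', 'gender:female']
```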
Bm25f Scoring Algorithm
The scoring algorithm builds on the bm25f algorithm as described in foundations of bm25 review and Okapi bm25. The algorithm groups all the fields of a document (included in the search) with the same analyzer into one virtual field before scoring the term against the virtual field.
This approach typically gives better results than scoring each field individually and then combining the results after scoring, as the importance of a term is considered across all fields instead of each field in isolation.
The boost of a field is integrated into the algorithm by using the boost as a multiplier for the term frequency in the given field, thereby making the term more important in that field.
Formula:
- streams/fields: s = 1, ..., S
- stream length: sl_s
- stream weight: v_s
- stream term frequency: tf_{s,i}
- avg. stream length across all docs: avsl_s
- term: i
- total docs with stream: n
- docs with i in stream: df_{n,i}
- stream length relevance: b
- term frequency relevance: k1
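For reference, a commonly cited form of bm25f written with the symbols listed above is given below. It follows the general bm25f literature and is an assumption about the shape of the formula, not a statement of the library's exact implementation; in particular, the idf term shown is the classic Robertson/Sparck Jones variant, and the library may use a different one.

```latex
% Length-normalized, weighted term frequency across all streams
\widetilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{s,i}}{(1 - b) + b \cdot \frac{sl_s}{avsl_s}}

% Score of a document for a query q: k_1 is applied once per term
\mathrm{score}(d, q) = \sum_{i \in q} \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i} \cdot \log \frac{n - df_{n,i} + 0.5}{df_{n,i} + 0.5}
```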
Tuning b and k1 parameters
b determines the impact of the field's length when calculating the score. It defaults to 0.75 and must be in the range [0–1]. Lower values mean a smaller length impact and vice versa. b can be configured on a per-field basis, and fields with only short text segments should be considered for a lower value so that a change in length of only a few terms doesn't affect the score too much. E.g., a title field could have a b of 0.25.
For fields like person.name even a b value of 0.0 should be considered, as a search for Andersen should probably yield the same score for both Gillian Andersen and Hans Christian Andersen and not include the length of the name in the score at all. Either the person has the name searched for or not; the length of the total name is not relevant.
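The effect of b on the two names can be sketched with the standard bm25 length normalization. This is an illustration only: `lengthNorm` and `normalizedTf` are hypothetical helpers, the average field length of 2.5 terms is made up for the example, and the library's exact formula is assumed to follow the standard form.

```javascript
// Standard bm25 length normalization: 1 when b = 0, length-proportional when b = 1.
function lengthNorm(fieldLength, avgFieldLength, b) {
  return (1 - b) + b * (fieldLength / avgFieldLength);
}

// Term frequency after length normalization.
function normalizedTf(tf, fieldLength, avgFieldLength, b) {
  return tf / lengthNorm(fieldLength, avgFieldLength, b);
}

let avgNameLength = 2.5; // assumed average number of terms in the name field

// "Andersen" occurs once in a 2-term and once in a 3-term name
let shortName = normalizedTf(1, 2, avgNameLength, 0.75);
let longName = normalizedTf(1, 3, avgNameLength, 0.75);
console.log(shortName > longName); // true, the shorter name is favored

// With b = 0 the length is ignored and both names get the same value
console.log(normalizedTf(1, 2, avgNameLength, 0) === normalizedTf(1, 3, avgNameLength, 0)); // true
```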
k1 determines the impact of the term frequency in matching fields and is in bm25f applied once per term to the score for all fields with the same analyzer (see formula above). k1 has a default of 1.2 but can be changed for the whole document index or for each analyzer individually.
It is also possible to change how the term frequency of a document affects the score by turning docStats.termFrequencies off for a field, which will result in the term count always being 1 if the term exists in the field, no matter the actual term frequency, and 0 if the term does not exist in the field.
docStats.termFrequencies is by default turned off for all fields other than text fields, as other fields are not tokenized, so counting term frequencies will in most cases not make any difference and just consume memory.
Method Summary SearchIndex
The SearchIndex
exposes the following properties and methods:
docCount
- The total number of documents in the index.indexedFields
- Name and type of all indexed fields.sortingFields
- Name and type of all fields that can be used for sorting.- `filte