@enterprise_search/indexer
v0.8.37
CLI for the indexer
Index
This is a tool that indexes content for Elasticsearch or other search engines, with support for Document Level Security (DLS). DLS means that a search result is only visible to people who can see the original document.
Installation
npm i -g @itsmworkspace/indexer # Needs to be run as an administrator or super user
Requirements:
- Node.js 16 or higher (it might work with lower versions, but it is not tested)
Example usage
Indexing
The main command is index, and we use the indexer subcommand.
index indexer index # Just indexes all the data
index indexer index --api --keep # indexes all the data and launches an api for looking at metrics and keeps running when finished
index indexer index --since 1d # Just index things that have changed in the last day
Api Key Management
Use --help to get more information on the commands, especially on command configuration (which user, the URL, the username and password...).
Also be aware that API keys are tied to 'an environment'. See --help for more information.
index apikey add <email> # Generates a DLS api key for the user, invalidating all other DLS api keys
index apikey api # Creates an api with endpoint /apikey/<email> that returns the api key for the user (invalidating all others)
index apikey remove <email> # Invalidates all the DLS api keys for the user
index apikey id <email> # Shows all the data that the user can access with their api key
Pushing data to Elasticsearch
index es push # Pushes all the data to Elasticsearch
index es push --help # Shows the options for the push command
index es push --elastic-search http://localhost:9200 # Pushes all the data to the Elasticsearch instance at the given url
Configuration
There is a file called indexer.yaml that is used to configure the indexer. The file is located in the same directory as the indexer executable.
The file is in YAML format and has the following structure:
Defaults
The first section of the file is the defaults
defaults:
  # Here we set default values, for example retry policies and throttles.
  query: # All the defaults for queries: the 'getting of data out of the source', for example out of Jira, Confluence or GitLab
    retryPolicy:
      initialInterval: 2000 # In milliseconds
      maximumInterval: 10000 # In milliseconds
      maximumAttempts: 5
      multiplier: 2 # Optional. For exponential backoff. Default is 2
      nonRecoverableErrors: [ 'Not Found' ] # Optional. List of errors that should not be retried. 'Not Found' is the one commonly used
    throttle:
      max: 100 # Imagine tokens in a jar. This is the size of the jar. To do a request you need to take a token from the jar.
      tokensPer100ms: 0.1 # This is how many tokens are added to the jar every 100ms
      throttlingDelay: 50 # Max random delay in ms before retrying when the jar has run out of tokens. Default is 50
      countOnTooManyErrors: -500 # If we get a 429 error we set the number of tokens to this and also reduce tokensPer100ms a bit
    auth: # See Authorisation below for other options
      method: ApiKey
      credentials:
        apiKey: "{source}_APIKEY"
  target: # The defaults for the target. Currently this just stores to the file system, but I expect to add more
    retry: { } # Same as above
    throttle: { } # Same as above. Probably not needed for the file system.
    file: "target/index/{index}/{name}_{num}.json" # The filename we write to
    max: 10000000 # The maximum number of documents in a file. When this is reached a new file is created
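To make the retry and throttle settings more concrete, here is a minimal TypeScript sketch of how such values could be interpreted. It is not the indexer's actual code: the names RetryPolicy, retryDelays, Throttle and TokenJar are hypothetical, and it assumes that maximumAttempts includes the first attempt, that the jar starts full, and that a 429 simply resets the token count (the reduction of tokensPer100ms is not modelled).

```typescript
// Hypothetical types mirroring the retryPolicy and throttle settings above.
type RetryPolicy = {
  initialInterval: number; // milliseconds
  maximumInterval: number; // milliseconds
  maximumAttempts: number;
  multiplier?: number; // defaults to 2
};

// Delay before each retry: initialInterval * multiplier^n, capped at maximumInterval.
// Assumption: maximumAttempts counts the first attempt, so there are maximumAttempts - 1 retries.
function retryDelays(policy: RetryPolicy): number[] {
  const multiplier = policy.multiplier ?? 2;
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const raw = policy.initialInterval * Math.pow(multiplier, retry);
    delays.push(Math.min(raw, policy.maximumInterval));
  }
  return delays;
}

type Throttle = { max: number; tokensPer100ms: number; countOnTooManyErrors?: number };

// The 'tokens in a jar' idea: a request needs one whole token.
class TokenJar {
  private tokens: number;
  private readonly config: Throttle;

  constructor(config: Throttle) {
    this.config = config;
    this.tokens = config.max; // assumption: the jar starts full
  }

  // In a real system a timer would call this every 100ms.
  refill(): void {
    this.tokens = Math.min(this.config.max, this.tokens + this.config.tokensPer100ms);
  }

  // Takes a token if one is available; returns false if the caller has to wait.
  tryTake(): boolean {
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }

  // On a 429 the count drops to countOnTooManyErrors, so requests pause until the jar refills.
  onTooManyRequests(): void {
    this.tokens = this.config.countOnTooManyErrors ?? 0;
  }

  available(): number {
    return this.tokens;
  }
}

const delays = retryDelays({ initialInterval: 2000, maximumInterval: 10000, maximumAttempts: 5 });
console.log(delays); // [ 2000, 4000, 8000, 10000 ]
```

With the defaults shown above (max 100, tokensPer100ms 0.1), a jar that has been emptied gains one token per second, and after a 429 with countOnTooManyErrors of -500 it would take a long time before requests resume.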
Sources
The next section is where we get data from. Here is a sample:
index: # Must be the word index
  jiraAcl: # This is the name of the source. Other things like index name and type default to this
    scan: # Where we get the data from
      groupMembersFile: 'group.members.csv'
      index: '.search-acl-filter-jira-prod'
  jiraProd:
    type: jira
    scan:
      auth:
        method: 'ApiKey'
        credentials:
          apiKey: 'JIRA_PROD'
      index: jira-prod
      projects: "*"
      baseurl: "https://jira.eon.com/"
      apiVersion: "2"
Authorisation
export type EntraIdAuthentication = {
  method: 'EntraId';
  credentials: {
    tenantId?: string;
    clientId: string; // Public identifier for the app
    clientSecret: string; // Secret used to authenticate the app and obtain tokens
    scope: string;
  };
};
export type BasicAuthentication = {
  method: 'Basic';
  credentials: {
    username: string;
    password: string;
  };
};
export type ApiKeyAuthentication = {
  method: 'ApiKey';
  credentials: {
    apiKey: string;
  };
};
export type PrivateTokenAuthentication = {
  method: 'PrivateToken';
  credentials: {
    token: string;
  };
};
export type NoAuthentication = {
  method: 'none';
};
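Because every type above carries a literal method field, they can be combined into a discriminated union and handled with a switch. The sketch below illustrates that idea only; it is not the indexer's implementation. The names SimpleAuthentication and authHeaders are hypothetical, the header formats are assumptions (the ApiKey scheme follows the Elasticsearch convention and PRIVATE-TOKEN the GitLab one), and credential values are treated as the secrets themselves. EntraId is left out because it needs a token request first.

```typescript
// Trimmed copies of the types above so that the sketch is self-contained.
type BasicAuthentication = { method: 'Basic'; credentials: { username: string; password: string } };
type ApiKeyAuthentication = { method: 'ApiKey'; credentials: { apiKey: string } };
type PrivateTokenAuthentication = { method: 'PrivateToken'; credentials: { token: string } };
type NoAuthentication = { method: 'none' };

// Hypothetical union of the methods that need no extra network call.
type SimpleAuthentication =
  | BasicAuthentication
  | ApiKeyAuthentication
  | PrivateTokenAuthentication
  | NoAuthentication;

// Hypothetical helper: turns an auth config into HTTP request headers.
function authHeaders(auth: SimpleAuthentication): Record<string, string> {
  switch (auth.method) {
    case 'Basic': {
      const { username, password } = auth.credentials;
      // btoa is a global in Node.js 16+; it only accepts Latin-1 characters.
      return { Authorization: `Basic ${btoa(`${username}:${password}`)}` };
    }
    case 'ApiKey':
      // Assumption: the target accepts the 'ApiKey' authorization scheme.
      return { Authorization: `ApiKey ${auth.credentials.apiKey}` };
    case 'PrivateToken':
      // Assumption: GitLab-style private token header.
      return { 'PRIVATE-TOKEN': auth.credentials.token };
    case 'none':
      return {};
  }
}

console.log(authHeaders({ method: 'Basic', credentials: { username: 'user', password: 'pass' } }));
// { Authorization: 'Basic dXNlcjpwYXNz' }
```

The compiler narrows auth.credentials in each case, so adding a new method to the union without handling it shows up as a type error.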
Tika
Apache Tika is used to process files such as PDFs, Word documents, etc. The configuration for Tika is as follows:
tika:
  jar: "../../../tika-server-standard-2.9.2.jar"
  protocol: http
  host: 127.0.0.1
  port: 9997
The settings should be self-explanatory. Note that during indexing and similar operations the index command launches the Tika server from the jar, and kills it at the end.
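As an illustration of how these settings fit together, the following sketch builds the server URL from the config and sends a file to the Tika server's /tika endpoint, which accepts a PUT and returns the extracted text. The names TikaConfig, tikaUrl and extractText are hypothetical and not part of the indexer; extractText assumes a runtime with a global fetch (Node.js 18 or higher) and a Tika server that is already running.

```typescript
// Hypothetical type mirroring the tika section above.
type TikaConfig = { jar: string; protocol: string; host: string; port: number };

// Builds the URL of an endpoint on the Tika server described by the config.
function tikaUrl(config: TikaConfig, path: string = '/tika'): string {
  return `${config.protocol}://${config.host}:${config.port}${path}`;
}

// Sends the raw bytes of a document to Tika and returns the extracted plain text.
async function extractText(config: TikaConfig, content: Uint8Array): Promise<string> {
  const response = await fetch(tikaUrl(config), {
    method: 'PUT',
    headers: { Accept: 'text/plain' },
    body: content,
  });
  if (!response.ok) throw new Error(`Tika returned ${response.status}`);
  return response.text();
}

const sampleConfig: TikaConfig = {
  jar: '../../../tika-server-standard-2.9.2.jar',
  protocol: 'http',
  host: '127.0.0.1',
  port: 9997,
};
console.log(tikaUrl(sampleConfig)); // http://127.0.0.1:9997/tika
```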
Pipelines.yaml
Currently this is a second config file. We may merge it with indexer.yaml in the near future. It is used by commands such as:
index es makeIndexes # Creates the indexes in Elasticsearch
index es pipeline # Creates the pipelines in Elasticsearch
index es remakeIndicies --all # Deletes all the indexes and recreates them
index es remakeIndicies --index jira-prod # Deletes the index jira-prod and recreates it
index es remakeIndicies --all --noPush # Deletes all the indexes and recreates them, but does not push the data
This file controls the ingest pipeline for Elasticsearch.
Here is a sample section:
jira-prod-pipeline: # Known issue: this name must be the name of the index with -pipeline on the end. Otherwise it doesn't work properly
  index: jira-prod
  fields: # These fields will be included in the searches
    - issue
    - comments
    - description
    - priority
    - status
  shorten: # Clip the length of these fields, so that you can store a summary but not all the data
    description: 200
  remove: # And remove these fields from the index
    - comments
    - full_text
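To show what shorten and remove mean for a single document, here is a small TypeScript sketch. In the real tool this work is done by the pipeline inside Elasticsearch, so this only imitates the effect; PipelineConfig, Doc and applyPipeline are hypothetical names, and the handling of fields is not modelled.

```typescript
// Hypothetical type mirroring one section of pipelines.yaml.
type PipelineConfig = {
  index: string;
  fields: string[];
  shorten?: Record<string, number>;
  remove?: string[];
};

type Doc = Record<string, unknown>;

// Returns a copy of the document with long fields clipped and unwanted fields dropped.
function applyPipeline(config: PipelineConfig, doc: Doc): Doc {
  const result: Doc = { ...doc };
  // shorten: clip string fields to the configured maximum length.
  for (const [field, maxLength] of Object.entries(config.shorten ?? {})) {
    const value = result[field];
    if (typeof value === 'string' && value.length > maxLength) {
      result[field] = value.slice(0, maxLength);
    }
  }
  // remove: drop the listed fields entirely.
  for (const field of config.remove ?? []) {
    delete result[field];
  }
  return result;
}

const jiraProdPipeline: PipelineConfig = {
  index: 'jira-prod',
  fields: ['issue', 'comments', 'description', 'priority', 'status'],
  shorten: { description: 200 },
  remove: ['comments', 'full_text'],
};

const stored = applyPipeline(jiraProdPipeline, {
  issue: 'ABC-1',
  description: 'x'.repeat(300),
  comments: 'a long discussion',
  status: 'Open',
});
console.log(Object.keys(stored)); // [ 'issue', 'description', 'status' ]
```

Note that comments appears under both fields and remove in the sample: it is used for searching but is not kept as a stored field.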