@enterprise_search/indexer

v0.8.37

Published

a day ago

cli for indexer


phil-rice

indexing

Index

This tool indexes content for Elasticsearch or other search engines, and supports Document Level Security (DLS): a search result is only shown to people who can see the original document.
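
To make the idea of DLS concrete, here is a minimal sketch of the check it implies. It is not the indexer's implementation: `IndexedDoc`, `allowedGroups` and `visibleTo` are hypothetical names, and it assumes each indexed document records the groups allowed to read the original.

```typescript
// Hypothetical shape: each indexed document records who may read the original.
type IndexedDoc = {
  id: string;
  title: string;
  allowedGroups: string[]; // assumption: access is expressed as group names
};

// Keep only the hits that share at least one group with the user.
function visibleTo(userGroups: string[], hits: IndexedDoc[]): IndexedDoc[] {
  const groups = new Set(userGroups);
  return hits.filter((hit) => hit.allowedGroups.some((g) => groups.has(g)));
}

const sampleHits: IndexedDoc[] = [
  { id: "JIRA-1", title: "Public roadmap", allowedGroups: ["everyone"] },
  { id: "JIRA-2", title: "Salary review", allowedGroups: ["hr"] },
];

// A user in "everyone" and "dev" only sees the first document.
const visibleIds = visibleTo(["everyone", "dev"], sampleHits).map((d) => d.id);
console.log(visibleIds); // [ 'JIRA-1' ]
```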

Installation

npm i -g @itsmworkspace/indexer # Needs to be run as an administrator or superuser

Requirements:

Node.js 16 or higher (it might work with lower versions, but it is not tested)

Example usage

Indexing

The main command is index, and indexing uses the indexer subcommand.

index indexer index  # Just indexes all the data
index indexer index --api --keep # Indexes all the data, launches an API for viewing metrics, and keeps running when finished
index indexer index --since 1d # Just indexes things that have changed in the last day
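
The README does not document the format accepted by --since beyond the 1d example. The sketch below shows one way such a value could be turned into a cutoff time. `sinceToCutoff` is a hypothetical helper, and the `m` and `h` suffixes are assumptions; only `d` appears above.

```typescript
// Assumed unit suffixes; only "d" appears in the examples above.
const unitMillis: Record<string, number> = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Turn a value such as "1d" into a cutoff date relative to `now`.
function sinceToCutoff(since: string, now: Date): Date {
  const match = /^(\d+)([mhd])$/.exec(since);
  if (!match) throw new Error(`Unrecognised --since value: ${since}`);
  const amount = Number(match[1]);
  return new Date(now.getTime() - amount * unitMillis[match[2]]);
}

const cutoff = sinceToCutoff("1d", new Date("2024-06-02T00:00:00Z"));
console.log(cutoff.toISOString()); // 2024-06-01T00:00:00.000Z
```

Anything modified after the cutoff would then be treated as changed and re-indexed.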

Api Key Management

Remember to use --help to get more information on the commands, especially their configuration (which user, the URL, the username and password...).

Also be aware that API keys are tied to 'an environment'. See --help for more information.

index apikey add [email protected] # Generates a DLS API key for the user, invalidating all other DLS API keys
index apikey api # Creates an API with endpoint /apikey/[email protected] that returns the API key for the user (invalidating all others)
index apikey remove [email protected] # Invalidates all the DLS API keys for the user
index apikey id [email protected]     # Shows all the data that the user can access with their API key

Pushing data to Elasticsearch

index es push # Pushes all the data to Elasticsearch
index es push --help # Shows the options for the push command
index es push --elastic-search http://localhost:9200 # Pushes all the data to the Elasticsearch instance at the given URL
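
The README does not say how the push is performed. Purely as an illustration, the sketch below builds a request body in the newline-delimited JSON format accepted by Elasticsearch's `_bulk` endpoint. `BulkDoc` and `toBulkBody` are hypothetical names, and whether `index es push` uses `_bulk` internally is an assumption.

```typescript
// Hypothetical document shape: an id plus arbitrary indexed fields.
type BulkDoc = { id: string; [field: string]: unknown };

// Build an NDJSON body: one action line and one source line per document.
// The _bulk format requires every line, including the last, to end with "\n".
function toBulkBody(indexName: string, docs: BulkDoc[]): string {
  const lines: string[] = [];
  for (const { id, ...source } of docs) {
    lines.push(JSON.stringify({ index: { _index: indexName, _id: id } }));
    lines.push(JSON.stringify(source));
  }
  return lines.map((line) => line + "\n").join("");
}

const bulkBody = toBulkBody("jira-prod", [{ id: "1", issue: "A" }]);
console.log(bulkBody);
```

Such a body would be sent with a POST to `<elastic-search-url>/_bulk` and the content type `application/x-ndjson`.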

Configuration

There is a file called indexer.yaml that is used to configure the indexer. The file is located in the same directory as the indexer executable.

The file is in YAML format and has the following structure:

Defaults

The first section of the file holds the defaults:

defaults:
  # Here we set default values, for example retry policies and throttles.
  query: # Defaults for queries, i.e. for getting data out of a source such as Jira, Confluence or GitLab
    retryPolicy:
      initialInterval: 2000 # In milliseconds
      maximumInterval: 10000 # In milliseconds
      maximumAttempts: 5
      multiplier: 2 # Optional. For exponential backoff. Default is 2
      nonRecoverableErrors: ['Not Found'] # Optional. List of errors that should not be retried. 'Not Found' is the one commonly used
    throttle:
      max: 100  # Imagine tokens in a jar. This is the size of the jar. To do a request you need to take a token from the jar.
      tokensPer100ms: 0.1 # this is how many tokens are added to the jar every 100ms
      throttlingDelay: 50 # Maximum random delay in ms before retrying when the jar has run out of tokens. Defaults to 50
      countOnTooManyErrors: -500 # On a 429 error the number of tokens is set to this, and tokensPer100ms is reduced a bit
    auth: # See below for other options
      method: ApiKey
      credentials:
        apiKey: "{source}_APIKEY"
  target: # Defaults for the target. Currently this just writes to the file system, but I expect to add more
    retry: { } # Same as above 
    throttle: { } # Same as above. Probably not needed for the file system.
    file: "target/index/{index}/{name}_{num}.json" # The filename we write to
    max: 10000000 # The maximum number of documents in a file. When this is reached a new file is created
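
The retry and throttle settings can be read as follows. This is a sketch of the general techniques (exponential backoff and a token jar), not the indexer's own code. `retryDelays` and `createTokenJar` are hypothetical names, and it assumes that maximumAttempts counts the first try as well as the retries.

```typescript
type RetryPolicy = {
  initialInterval: number; // ms
  maximumInterval: number; // ms
  maximumAttempts: number;
  multiplier?: number; // defaults to 2
};

// Delay before each retry: initialInterval * multiplier^n, capped at maximumInterval.
// Assumption: maximumAttempts includes the first try, so there are attempts - 1 retries.
function retryDelays(policy: RetryPolicy): number[] {
  const multiplier = policy.multiplier ?? 2;
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maximumAttempts - 1; retry++) {
    const delay = policy.initialInterval * Math.pow(multiplier, retry);
    delays.push(Math.min(delay, policy.maximumInterval));
  }
  return delays;
}

type Throttle = { max: number; tokensPer100ms: number };

// A token jar: starts full, refills over time, and each request takes one token.
function createTokenJar(throttle: Throttle, startMs: number) {
  let tokens = throttle.max;
  let lastRefillMs = startMs;

  function refill(nowMs: number): void {
    const added = ((nowMs - lastRefillMs) / 100) * throttle.tokensPer100ms;
    tokens = Math.min(throttle.max, tokens + added);
    lastRefillMs = nowMs;
  }

  return {
    // Take one token if available. A caller that gets false should wait and try again.
    tryTake(nowMs: number): boolean {
      refill(nowMs);
      if (tokens < 1) return false;
      tokens -= 1;
      return true;
    },
    // After a 429, set the count to a negative number so requests pause until it refills.
    penalise(countOnTooManyErrors: number): void {
      tokens = countOnTooManyErrors;
    },
  };
}

const delays = retryDelays({ initialInterval: 2000, maximumInterval: 10000, maximumAttempts: 5 });
console.log(delays); // [2000, 4000, 8000, 10000]
```

With the defaults shown above (max: 100, tokensPer100ms: 0.1), a full jar allows a burst of 100 requests, after which the sustained rate is one request per second.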

Sources

The next section defines where we get data from. Here is a sample:

index: # Must be the word index
  jiraAcl: # This is the name of the source. Other things like index name and type default to this
    scan: # Where we get the data from
      groupMembersFile: 'group.members.csv'
      index: '.search-acl-filter-jira-prod'
  jiraProd:
    type: jira
    scan:
      auth:
        method: 'ApiKey'
        credentials:
          apiKey: 'JIRA_PROD'
      index: jira-prod
      projects: "*"
      baseurl: "https://jira.eon.com/"
      apiVersion: "2"
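
The comment on jiraAcl says that values such as the index name and type default to the source name. A minimal sketch of that kind of fallback is shown below. The exact rules the indexer applies are not documented here, so `resolveSource`, its field names, and the fallback of auth to the query defaults are all assumptions.

```typescript
type AuthConfig = { method: string; credentials?: Record<string, string> };

type SourceConfig = {
  type?: string;
  scan: { auth?: AuthConfig; index?: string };
};

// Fill in missing values: type and index fall back to the source name,
// and auth falls back to the default from the defaults section (assumed behaviour).
function resolveSource(sourceName: string, source: SourceConfig, defaultAuth: AuthConfig) {
  return {
    name: sourceName,
    type: source.type ?? sourceName,
    index: source.scan.index ?? sourceName,
    auth: source.scan.auth ?? defaultAuth,
  };
}

const defaultAuth: AuthConfig = { method: "ApiKey", credentials: { apiKey: "{source}_APIKEY" } };
const resolvedAcl = resolveSource("jiraAcl", { scan: { index: ".search-acl-filter-jira-prod" } }, defaultAuth);
console.log(resolvedAcl.type, resolvedAcl.index); // jiraAcl .search-acl-filter-jira-prod
```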

Authorisation

export type EntraIdAuthentication = {
  method: 'EntraId';
  credentials: {
    tenantId?: string;
    clientId: string;            // Public identifier for the app
    clientSecret: string;        // Secret used to authenticate the app and obtain tokens
    scope: string;
  };
};
export type BasicAuthentication = {
  method: 'Basic';
  credentials: {
    username: string;
    password: string;
  };
};
export type ApiKeyAuthentication = {
  method: 'ApiKey';
  credentials: {
    apiKey: string;
  };
};
export type PrivateTokenAuthentication = {
  method: 'PrivateToken';
  credentials: {
    token: string;
  };
};
export type NoAuthentication = {
  method: 'none';
};
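
Because every variant carries a distinct `method`, the types form a discriminated union that can be handled with a `switch`. The sketch below lists the secret held by each variant and checks that it is available. It assumes, based on sample values such as 'JIRA_PROD', that credential values name environment variables; `Authentication`, `secretFields` and `missingSecrets` are hypothetical names.

```typescript
// A compact copy of the types above, combined into one union for the example.
type Authentication =
  | { method: "EntraId"; credentials: { tenantId?: string; clientId: string; clientSecret: string; scope: string } }
  | { method: "Basic"; credentials: { username: string; password: string } }
  | { method: "ApiKey"; credentials: { apiKey: string } }
  | { method: "PrivateToken"; credentials: { token: string } }
  | { method: "none" };

// Return the credential values that hold secrets. TypeScript narrows
// `auth.credentials` to the right shape inside each case.
function secretFields(auth: Authentication): string[] {
  switch (auth.method) {
    case "EntraId":
      return [auth.credentials.clientSecret];
    case "Basic":
      return [auth.credentials.password];
    case "ApiKey":
      return [auth.credentials.apiKey];
    case "PrivateToken":
      return [auth.credentials.token];
    case "none":
      return [];
  }
}

// Assumption: each secret value is the name of an environment variable.
// `env` is passed in so the function can be tested without touching process.env.
function missingSecrets(auth: Authentication, env: Record<string, string | undefined>): string[] {
  return secretFields(auth).filter((variable) => !env[variable]);
}

const jiraAuth: Authentication = { method: "ApiKey", credentials: { apiKey: "JIRA_PROD" } };
console.log(missingSecrets(jiraAuth, {})); // [ 'JIRA_PROD' ]
```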

Tika

Apache Tika is used to process files such as PDFs, Word documents, etc. The configuration for Tika is as follows:

tika:
  jar: "../../../tika-server-standard-2.9.2.jar"
  protocol: http
  host: 127.0.0.1
  port: 9997

I think their meaning is obvious. Note that during indexing and similar operations the indexer launches the Tika server from the jar, and kills it at the end.
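
As an illustration of how these four values fit together, the sketch below derives a launch command and an endpoint URL from the configuration. `tikaLaunchArgs` and `tikaEndpoint` are hypothetical helpers. The `--host` and `--port` flags and the `/tika` path are those of the standard tika-server as I understand them; how the indexer itself starts and calls Tika is an assumption.

```typescript
type TikaConfig = { jar: string; protocol: string; host: string; port: number };

// Arguments for `java` that would start the Tika server on the configured address.
function tikaLaunchArgs(config: TikaConfig): string[] {
  return ["-jar", config.jar, "--host", config.host, "--port", String(config.port)];
}

// URL of the text extraction endpoint; a file is sent to it with an HTTP PUT.
function tikaEndpoint(config: TikaConfig): string {
  return `${config.protocol}://${config.host}:${config.port}/tika`;
}

const tikaConfig: TikaConfig = {
  jar: "../../../tika-server-standard-2.9.2.jar",
  protocol: "http",
  host: "127.0.0.1",
  port: 9997,
};

console.log("java " + tikaLaunchArgs(tikaConfig).join(" "));
console.log(tikaEndpoint(tikaConfig)); // http://127.0.0.1:9997/tika
```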

Pipelines.yaml

Currently this is a second config file. We may merge it with indexer.yaml in the near future. It is used by commands like:

index es makeIndexes                      # Creates the indexes in Elasticsearch
index es pipeline                         # Creates the pipelines in Elasticsearch
index es remakeIndicies --all             # Deletes all the indexes and recreates them
index es remakeIndicies --index jira-prod # Deletes the index jira-prod and recreates it
index es remakeIndicies --all --noPush    # Deletes all the indexes and recreates them, but does not push the data

This file controls the ingest pipelines for Elasticsearch.

Here is a sample section:

jira-prod-pipeline: # Known issue: this name must be the name of the index with -pipeline on the end, otherwise it doesn't work properly
  index: jira-prod
  fields:           # These fields will be included in the searches
    - issue
    - comments
    - description
    - priority
    - status
  shorten:          # Clip the length of this field, so that you can store a summary but not all the data
    description: 200
  remove:          # And remove these fields from the index
    - comments
    - full_text
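
To show what shorten and remove mean for a single document, here is a plain TypeScript sketch of their effect. In practice the work is done by the pipeline inside the search engine. `applyPipeline` is a hypothetical function, and it deliberately ignores `fields`, whose exact use is not described above.

```typescript
type PipelineConfig = {
  index: string;
  fields: string[];
  shorten?: Record<string, number>;
  remove?: string[];
};

// Return a copy of the document with long fields clipped and unwanted fields dropped.
function applyPipeline(config: PipelineConfig, doc: Record<string, unknown>): Record<string, unknown> {
  const result: Record<string, unknown> = { ...doc };
  for (const [field, maxLength] of Object.entries(config.shorten ?? {})) {
    const value = result[field];
    if (typeof value === "string") result[field] = value.slice(0, maxLength);
  }
  for (const field of config.remove ?? []) delete result[field];
  return result;
}

const jiraPipeline: PipelineConfig = {
  index: "jira-prod",
  fields: ["issue", "comments", "description", "priority", "status"],
  shorten: { description: 200 },
  remove: ["comments", "full_text"],
};

const stored = applyPipeline(jiraPipeline, {
  issue: "X-1",
  description: "a".repeat(300),
  comments: "long discussion",
  full_text: "everything",
});
console.log(Object.keys(stored)); // [ 'issue', 'description' ]
```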