node-es-transformer

v1.0.0-beta2

A nodejs based library to (re)index and transform data from/to Elasticsearch.

Why another reindex/ingestion tool?

If you're looking for a Node.js-based tool that can ingest large CSV/JSON files in the gigabyte range, you've come to the right place. Everything else I've tried with larger files runs out of JS heap, hammers ES with too many single requests, times out, or tries to do everything in a single bulk request.

While I'd generally recommend using Logstash, Filebeat, Ingest Nodes, Elastic Agent or Elasticsearch Transforms for established use cases, this tool may help if you feel more at home in the JavaScript/Node.js universe and have use cases with customized ingestion and data transformation needs.

This is experimental code, use at your own risk. Nonetheless, I encourage you to give it a try so I can gather some feedback.

So why is this still alpha?

  • The API is not quite final and might change from release to release.
  • The code needs some more safety measures to avoid some possible accidental data loss scenarios.
  • No test coverage yet.

Now that we've talked about the caveats, let's have a look at what you actually get with this tool:

Features

  • Buffering/streaming for both reading and indexing. Files are read as streams, and Elasticsearch ingestion uses buffered bulk indexing. This is tailored towards ingesting large files and has been tested successfully so far with JSON and CSV files in the range of 20-30 GB. On a single machine running both node-es-transformer and Elasticsearch, ingestion rates of up to 20k documents/second were achieved (2.9 GHz Intel Core i7, 16 GB RAM, SSD), depending on document size.
  • Supports wildcards to ingest/transform a range of files in one go.
  • Supports fetching documents from existing indices using search/scroll. This allows you to reindex with custom data transformations just using JavaScript in the transform callback.
  • The transform callback gives you each source document, but you can split it up into multiple documents and return them as an array. An example use case: each source document is a Tweet and you want to transform it into an entity-centric index based on hashtags.
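
To illustrate the idea behind buffered bulk indexing, here is a minimal sketch of grouping a stream of documents into batches. This is not the library's internal code: the batch function is a hypothetical helper, and the assumption is simply that each yielded batch would be sent to Elasticsearch as one bulk request instead of one request per document.

```javascript
// Illustrative only: not the library's internal code.
// Groups documents from any iterable into arrays of at most `bufferSize`.
function* batch(documents, bufferSize = 1000) {
  if (!Number.isInteger(bufferSize) || bufferSize < 1) {
    throw new RangeError('bufferSize must be a positive integer');
  }
  let buffer = [];
  for (const doc of documents) {
    buffer.push(doc);
    if (buffer.length === bufferSize) {
      yield buffer;
      buffer = [];
    }
  }
  // Flush the remainder that did not fill a whole batch.
  if (buffer.length > 0) {
    yield buffer;
  }
}

const batches = [...batch([{ id: 1 }, { id: 2 }, { id: 3 }], 2)];
console.log(batches.map((b) => b.length)); // [ 2, 1 ]
```

Because only one batch is held in memory at a time, memory use depends on the batch size rather than on the size of the source file.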

Getting started

In your Node.js project, add node-es-transformer as a dependency (yarn add node-es-transformer or npm install node-es-transformer).

Use the library in your code like:

Read from a file

const transformer = require('node-es-transformer');

transformer({
  fileName: 'filename.json',
  targetIndexName: 'my-index',
  mappings: {
    properties: {
      '@timestamp': {
        type: 'date'
      },
      'first_name': {
        type: 'keyword'
      },
      'last_name': {
        type: 'keyword'
      },
      'full_name': {
        type: 'keyword'
      }
    }
  },
  // Called for each source line; adds a derived full_name field.
  transform(line) {
    return {
      ...line,
      full_name: `${line.first_name} ${line.last_name}`
    };
  }
});

Read from another index

const transformer = require('node-es-transformer');

transformer({
  sourceIndexName: 'my-source-index',
  targetIndexName: 'my-target-index',
  // optional, if you skip mappings, they will be fetched from the source index.
  mappings: {
    properties: {
      '@timestamp': {
        type: 'date'
      },
      'first_name': {
        type: 'keyword'
      },
      'last_name': {
        type: 'keyword'
      },
      'full_name': {
        type: 'keyword'
      }
    }
  },
  // Called for each source document; adds a derived full_name field.
  transform(doc) {
    return {
      ...doc,
      full_name: `${doc.first_name} ${doc.last_name}`
    };
  }
});
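
Split one document into several

As mentioned in the features, the transform callback may return an array of documents. The following is a minimal sketch of the Tweet/hashtag use case. The function name splitTweetByHashtag and the field names (id, text, hashtags) are assumptions made for illustration; adapt them to your actual source documents.

```javascript
// Hypothetical transform callback: one tweet in, one document per hashtag out.
// The field names (id, text, hashtags) are assumptions for illustration.
function splitTweetByHashtag(tweet) {
  const hashtags = Array.isArray(tweet.hashtags) ? tweet.hashtags : [];
  return hashtags.map((hashtag) => ({
    hashtag: hashtag.toLowerCase(),
    tweet_id: tweet.id,
    text: tweet.text
  }));
}

// Example input/output, independent of Elasticsearch:
const docs = splitTweetByHashtag({
  id: 1,
  text: 'Reindexing with #Elasticsearch and #nodejs',
  hashtags: ['Elasticsearch', 'nodejs']
});
console.log(docs.length); // 2
```

A function like this could then be passed as the transform option. Note that a tweet without hashtags yields an empty array here; check how the library handles that case before relying on it.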

Options

  • deleteIndex: If true, an existing index is deleted automatically. Defaults to false.
  • sourceClientConfig/targetClientConfig: Optional Elasticsearch client options, defaults to { node: 'http://localhost:9200' }.
  • bufferSize: The number of documents inserted with each Elasticsearch bulk insert request. Defaults to 1000.
  • fileName: Source filename to ingest, supports wildcards. If this is set, sourceIndexName is not allowed.
  • splitRegex: Custom line split regex, defaults to /\n/.
  • sourceIndexName: The source Elasticsearch index to reindex from. If this is set, fileName is not allowed.
  • targetIndexName: The target Elasticsearch index where documents will be indexed.
  • mappings: Optional Elasticsearch document mappings. If not set and you're reindexing from another index, the mappings from the existing index will be used.
  • mappingsOverride: If you're reindexing and this is set to true, mappings will be applied on top of the source index's mappings. Defaults to false.
  • indexMappingTotalFieldsLimit: Optional field limit for the target index to be created that will be passed on as the index.mapping.total_fields.limit setting.
  • populatedFields: If true, fetches a set of random documents to identify which fields are actually used by documents. Can be useful for indices with lots of field mappings to increase query/reindex performance. Defaults to false.
  • query: Optional Elasticsearch DSL query to filter documents from the source index.
  • skipHeader: If true, skips the first line of the source file. Defaults to false.
  • transform(line): A callback function which allows the transformation of a source line into one or several documents.
  • verbose: Logging verbosity. Defaults to true.
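
To make the defaults and the fileName/sourceIndexName rule above easier to see at a glance, here is a minimal sketch that mirrors them in code. The resolveOptions helper and the DOCUMENTED_DEFAULTS object are hypothetical and not part of node-es-transformer; the library's own validation may behave differently.

```javascript
// Hypothetical helper, not part of node-es-transformer: mirrors the
// documented defaults and the fileName/sourceIndexName exclusivity rule.
const DOCUMENTED_DEFAULTS = {
  deleteIndex: false,
  sourceClientConfig: { node: 'http://localhost:9200' },
  targetClientConfig: { node: 'http://localhost:9200' },
  bufferSize: 1000,
  splitRegex: /\n/,
  mappingsOverride: false,
  populatedFields: false,
  skipHeader: false,
  verbose: true
};

function resolveOptions(options) {
  const hasFile = options.fileName !== undefined;
  const hasIndex = options.sourceIndexName !== undefined;
  if (hasFile && hasIndex) {
    throw new Error('Set either fileName or sourceIndexName, not both.');
  }
  // Explicitly passed options win over the documented defaults.
  return { ...DOCUMENTED_DEFAULTS, ...options };
}

const resolved = resolveOptions({
  sourceIndexName: 'my-source-index',
  targetIndexName: 'my-target-index',
  bufferSize: 500
});
console.log(resolved.bufferSize, resolved.deleteIndex); // 500 false
```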

Development

Clone this repository and install its dependencies:

git clone https://github.com/walterra/node-es-transformer
cd node-es-transformer
yarn

yarn build builds the library to dist, generating two files:

  • dist/node-es-transformer.cjs.js: A CommonJS bundle, suitable for use in Node.js, that requires the external dependency. This corresponds to the "main" field in package.json.
  • dist/node-es-transformer.esm.js: An ES module bundle, suitable for use in other people's libraries and applications, that imports the external dependency. This corresponds to the "module" field in package.json.

yarn dev builds the library, then keeps rebuilding it whenever the source files change using rollup-watch.

yarn test runs the tests. The tests expect that you have an Elasticsearch instance running without security at http://localhost:9200. Using docker, you can set this up with:

# Download the docker image
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.10.4

# Create the network referenced by --net below (only needed once)
docker network create elastic

# Run the container
docker run --name es01 --net elastic -p 9200:9200 -it -m 1GB -e "discovery.type=single-node" -e "xpack.security.enabled=false" docker.elastic.co/elasticsearch/elasticsearch:8.10.4

To commit, use cz. To prepare a release, use e.g. yarn release -- --release-as 1.0.0-beta2.

License

Apache 2.0.