Kafka Pipelines for Flo.w

Introduction

Kafka Pipelines for Flo.w is a NodeJS library for building Kafka stream processing microservices. The library is built around the idea of a 'pipeline' that consumes messages from a source and can produce messages to a sink. A simple domain-specific language (DSL) is provided to build a pipeline by specifying a source, a sink, and intermediate processing steps such as map and filter operations.

The library builds heavily on NodeJS Readable and Writable streams and Transforms, and on KafkaJS, the underlying Kafka client library. Refer to the KafkaJS documentation to understand the available configuration options.
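
To give a flavour of the DSL, here is a minimal sketch of a consume-transform-produce pipeline. The Pipeline export, its constructor options, and the run() call are illustrative assumptions; the step names (fromTopic, map, toTopic) are those listed under Pipeline DSL below.

import { Pipeline } from 'flow-kafka-pipelines';

// Assumed constructor: connection options passed through to KafkaJS.
const pipeline = new Pipeline({ brokers: ['localhost:9092'] });

// Consume from 'topic1', upper-case each value, produce to 'topic2'.
pipeline
  .fromTopic('topic1')
  .map((value: string) => value.toUpperCase())
  .toTopic('topic2');

// Assumed start method; the real entry point may differ.
pipeline.run().catch(console.error);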

Installation

The repository for this library can be found at https://bitbucket.org/emu-analytics/flow-kafka-pipelines. You can install it as an NPM package into your project as shown below:

# Install NPM package
npm install flow-kafka-pipelines

API Documentation

To build the API documentation run:

npm run gen-doc

To view the documentation run:

npm run view-doc

Examples

See the examples directory for a set of example pipelines that work together (in two flavours: JSON format and AVRO format).

Producer Pipeline

The producer pipeline reads lines from STDIN and produces a Kafka message to the 'topic1' topic for each line.
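
A sketch of what this might look like with the DSL; fromReadable and toTopic are the documented steps, while the Pipeline constructor and run() are assumptions as in the sketch above.

import { Pipeline } from 'flow-kafka-pipelines';

const pipeline = new Pipeline({ brokers: ['localhost:9092'] }); // assumed options

// STDIN is a NodeJS Readable stream; one Kafka message is produced per line
// (line splitting is assumed to be handled by the source step or a transform).
pipeline
  .fromReadable(process.stdin)
  .toTopic('topic1');

pipeline.run().catch(console.error); // assumed start method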

Processor Pipeline

The processor pipeline consumes messages from the 'topic1' topic and performs some simple manipulation to demonstrate a typical consume-process-produce pipeline, producing the processed results to the 'topic2' topic. It also demonstrates the Kafka-backed in-memory cache provided by the Cache class, which is used to store the number of messages processed and the total line length. You should be able to stop and restart the processor pipeline without losing this ongoing state.
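
A sketch of the consume-process-produce shape with the cache. The Cache construction options, async map support, and run() are assumptions; get and set are the documented cache methods.

import { Pipeline, Cache } from 'flow-kafka-pipelines';

// Kafka-backed cache for the running totals (construction options assumed).
const cache = new Cache({ topic: 'processor-state', brokers: ['localhost:9092'] });

const pipeline = new Pipeline({ brokers: ['localhost:9092'] });

pipeline
  .fromTopic('topic1')
  .map(async (line: string) => {
    // Persist running totals so a restarted processor resumes where it left off.
    const count = ((await cache.get('count')) ?? 0) + 1;
    const totalLength = ((await cache.get('totalLength')) ?? 0) + line.length;
    await cache.set('count', count);
    await cache.set('totalLength', totalLength);
    return line.trim(); // some simple manipulation
  })
  .toTopic('topic2');

pipeline.run().catch(console.error);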

Consumer Pipeline

The consumer pipeline consumes messages from the 'topic2' topic and writes its output to STDOUT.
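
The consumer end of the chain is the mirror image of the producer sketch above, under the same assumptions:

import { Pipeline } from 'flow-kafka-pipelines';

const pipeline = new Pipeline({ brokers: ['localhost:9092'] }); // assumed options

// STDOUT is a NodeJS Writable stream.
pipeline
  .fromTopic('topic2')
  .toWritable(process.stdout);

pipeline.run().catch(console.error); // assumed start method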

Aggregator Pipeline

The aggregator pipeline consumes messages from the 'topic1' topic and counts the occurrences of messages, grouped by message content.
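
A sketch using the aggregate step. The aggregate signature below (window size, grouping key, reducer) is purely illustrative, so check the generated API documentation for the real one.

import { Pipeline } from 'flow-kafka-pipelines';

const pipeline = new Pipeline({ brokers: ['localhost:9092'] }); // assumed options

pipeline
  .fromTopic('topic1')
  .aggregate({
    windowMs: 60_000,                          // hypothetical: 1-minute window
    key: (message: string) => message,         // hypothetical: group by content
    reduce: (count: number = 0) => count + 1,  // hypothetical: count occurrences
  })
  .toTopic('counts');

pipeline.run().catch(console.error);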

Pipeline DSL

| Pipeline Step | Description |
|---------------|-------------|
| fromTopic | Consume messages from a Kafka topic |
| toTopic | Produce messages to a Kafka topic |
| fromReadable | Consume messages from a generic NodeJS Readable stream |
| toWritable | Produce messages to a generic NodeJS Writable stream |
| map | Transform a message via a map function |
| filter | Filter messages using a predicate function |
| aggregate | Perform windowed aggregation |
| pipe | Transform a message via a generic NodeJS Transform |

Stateful Cache

The Cache class provided by this library is a Kafka-backed in-memory cache. It is designed to allow a stateful processing microservice to be restarted and continue from where it left off. The cache provides the usual get and set methods for storing key/value pairs. Keys should be strings; values can be any JavaScript type.

Cache updates and deletions are persisted to a Kafka topic as a change stream. The topic is created automatically if it doesn't already exist. Log compaction is enabled by default so that, logically, only the latest value for each key is retained. On initialization, the persisted change stream is fully consumed to rebuild the state of the cache. You should wait for initialization to finish before starting your processing pipeline.
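
A sketch of the cache lifecycle; get and set are the documented methods, while the constructor options and the init() call are assumptions for illustration.

import { Cache } from 'flow-kafka-pipelines';

async function main() {
  // Backing topic for the change stream (construction options assumed).
  const cache = new Cache({ topic: 'my-service-state', brokers: ['localhost:9092'] });

  // Assumed init(): fully consumes the change-stream topic to rebuild state.
  // Wait for this before starting the processing pipeline.
  await cache.init();

  const count = ((await cache.get('count')) as number | undefined) ?? 0;
  await cache.set('count', count + 1); // persisted to the change-stream topic
}

main().catch(console.error);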