Kafka Pipelines for Flo.w
Introduction
Kafka Pipelines for Flo.w is a NodeJS library for building Kafka stream processing microservices. The library is built around the idea of a 'pipeline' that consumes messages from a source and can produce messages to a sink. A simple domain-specific language (DSL) is provided to build a pipeline by specifying a source, a sink, and intermediate processing steps such as map and filter operations.
The library builds heavily on NodeJS Readable and Writable streams and Transforms, as well as on KafkaJS (the underlying Kafka client library). You should refer to the KafkaJS documentation to understand the available configuration options.
Installation
The repository for this library can be found at https://bitbucket.org/emu-analytics/flow-kafka-pipelines. You can install it as an NPM package into your project as shown below:
```sh
# Install NPM package
npm install flow-kafka-pipelines
```
API Documentation
To build the API documentation run:

```sh
npm run gen-doc
```

To view the documentation run:

```sh
npm run view-doc
```
Examples
See the examples directory for a set of example pipelines that work together (there are two flavours: JSON format and Avro format).
Producer Pipeline
The producer pipeline reads lines from STDIN and produces a Kafka message to the 'topic1' topic for each line.
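A minimal sketch of what this might look like with the DSL is shown below. The `Pipeline` class name, its constructor options, and the `start()` call are assumptions for illustration (only the step names `fromReadable`, `map`, and `toTopic` come from this README); check the generated API documentation for the real signatures.

```ts
// Hypothetical sketch: Pipeline, its options, and start() are assumed names.
import { Pipeline } from 'flow-kafka-pipelines';

async function main() {
  const pipeline = new Pipeline({
    clientId: 'producer-example',  // assumed: KafkaJS-style connection options
    brokers: ['localhost:9092']
  });

  pipeline
    .fromReadable(process.stdin)                                 // source: STDIN
    .map((line: Buffer) => ({ value: line.toString().trim() }))  // one message per line
    .toTopic('topic1');                                          // sink: Kafka topic

  await pipeline.start();  // assumed: starts pumping messages through the pipeline
}

main().catch(console.error);
```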
Processor Pipeline
The processor pipeline consumes messages from the 'topic1' topic and does some simple manipulation to demonstrate a typical consume-process-produce pipeline. The pipeline produces processed results to the 'topic2' topic. It also demonstrates the use of the Kafka-backed in-memory cache provided by the Cache class to store the number of messages processed and the total line length. You should be able to stop and restart the processor pipeline while maintaining ongoing state.
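The sketch below illustrates the consume-process-produce shape together with the cache. As above, the `Pipeline` and `Cache` constructor options and the `init()`/`start()` calls are assumptions; only the DSL step names and the cache's `get`/`set` methods are documented in this README.

```ts
// Hypothetical sketch: constructor options, init() and start() are assumed.
import { Pipeline, Cache } from 'flow-kafka-pipelines';

async function main() {
  const cache = new Cache({ topic: 'processor-state' }); // assumed option shape
  await cache.init(); // wait for the persisted change stream to be replayed

  const pipeline = new Pipeline({
    clientId: 'processor-example',
    brokers: ['localhost:9092']
  });

  pipeline
    .fromTopic('topic1')
    .map(async (message: { value: string }) => {
      // Restart-safe counters: state survives a process restart because
      // every set() is persisted to the cache's backing Kafka topic.
      await cache.set('count', ((await cache.get('count')) ?? 0) + 1);
      await cache.set(
        'totalLength',
        ((await cache.get('totalLength')) ?? 0) + message.value.length
      );
      return { value: message.value.toUpperCase() }; // some simple manipulation
    })
    .toTopic('topic2');

  await pipeline.start();
}

main().catch(console.error);
```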
Consumer Pipeline
The consumer pipeline consumes messages from the 'topic2' topic and writes output to STDOUT.
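Sketched below under the same assumptions (hypothetical `Pipeline` construction and `start()` call); the `fromTopic`, `map`, and `toWritable` steps are from the DSL table.

```ts
// Hypothetical sketch: construction details assumed, DSL steps from this README.
import { Pipeline } from 'flow-kafka-pipelines';

const pipeline = new Pipeline({
  clientId: 'consumer-example',
  brokers: ['localhost:9092']
});

pipeline
  .fromTopic('topic2')                                        // source: processed results
  .map((message: { value: string }) => message.value + '\n')  // newline-delimited output
  .toWritable(process.stdout);                                // sink: STDOUT

pipeline.start().catch(console.error);
```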
Aggregator Pipeline
The aggregator pipeline consumes messages from the 'topic1' topic and counts the occurrences of messages (grouped by message content).
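The `aggregate` step's real signature is not documented in this README, so the sketch below assumes a (key function, reducer, window length) shape purely for illustration:

```ts
// Hypothetical sketch: the aggregate() signature shown here is an assumption.
import { Pipeline } from 'flow-kafka-pipelines';

const pipeline = new Pipeline({
  clientId: 'aggregator-example',
  brokers: ['localhost:9092']
});

pipeline
  .fromTopic('topic1')
  .aggregate(
    (message: { value: string }) => message.value, // group by message content
    (count: number = 0) => count + 1,              // count occurrences per group
    60_000                                         // assumed 1-minute window
  )
  .toWritable(process.stdout);

pipeline.start().catch(console.error);
```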
Pipeline DSL
| Pipeline Step | Description |
|---------------|-------------|
| fromTopic | Consume messages from a Kafka topic |
| toTopic | Produce messages to a Kafka topic |
| fromReadable | Consume messages from a generic NodeJS Readable stream |
| toWritable | Produce messages to a generic NodeJS Writable stream |
| map | Transform a message via a map function |
| filter | Filter messages using a predicate function |
| aggregate | Perform windowed aggregation |
| pipe | Transform a message via a generic NodeJS Transform |
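For the two steps not covered by the examples above, here is a hedged sketch combining `filter` with a generic NodeJS `Transform` via `pipe` (step names come from the table; everything else is an assumption):

```ts
// Hypothetical sketch: filter predicate and pipe(Transform) usage.
import { Transform } from 'stream';
import { Pipeline } from 'flow-kafka-pipelines';

// A generic object-mode Transform that redacts digits in message values.
const redact = new Transform({
  objectMode: true,
  transform(message: any, _encoding, callback) {
    callback(null, { ...message, value: message.value.replace(/\d/g, '#') });
  }
});

const pipeline = new Pipeline({
  clientId: 'dsl-example',
  brokers: ['localhost:9092']
});

pipeline
  .fromTopic('topic1')
  .filter((message: { value: string }) => message.value.length > 0) // drop empty messages
  .pipe(redact)                                                     // generic Transform step
  .toTopic('topic2');

pipeline.start().catch(console.error);
```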
Stateful Cache
The Cache class provided by this library is a Kafka-backed in-memory cache. It is designed to allow a stateful processing microservice to be restarted and continue from where it left off. The cache provides the usual get and set methods for storing key/value pairs. Keys should be strings; values can be any JavaScript type.
Cache updates and deletions are persisted to a Kafka topic as a change stream. The topic is created automatically if it doesn't already exist. Log compaction is enabled by default so that, logically, only the latest value for each key is retained. On initialization, the persisted change stream is fully consumed to rebuild the state of the cache. You should wait for initialization to finish before starting your processing pipeline.
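A short usage sketch follows. The constructor options and the `init()` call used to await initialization are assumptions for illustration; only `get`/`set` and the change-stream behaviour are described above.

```ts
// Hypothetical sketch: Cache options and init() are assumed names.
import { Cache } from 'flow-kafka-pipelines';

async function main() {
  const cache = new Cache({ topic: 'my-service-state' }); // assumed: backing (compacted) topic

  // Wait for the persisted change stream to be fully consumed so the
  // in-memory state is restored before any processing starts.
  await cache.init();

  await cache.set('processedCount', 42); // values can be any JavaScript type
  const processedCount = await cache.get('processedCount');
  console.log('restored state:', processedCount);
}

main().catch(console.error);
```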