@gregoriusrippenstein/node-red-streaming

v0.0.12

Published

7 months ago

Nodes to make streaming possible in Node-RED.

Downloads

0High
0Medium
0Low

gregoriusrippenstein

node-red streaming pipeline etl streamapi

Streaming

A set of Node-RED nodes for interacting with the NodeJS stream API.

Proof of concept, Work in progress.

This is not to be confused with Node-RED streaming: Node-RED is an event-driven streaming application with events being passed along wires. This streaming package aims to solve the handling large datasets using Node-RED.

Screencasts

For an indepth explanation of these nodes, checkout the screencasts.

Message passing

Node-RED uses msg objects to pass data between flows. These objects get cloned when wires split or as an object is passed out of a node. Cloning of objects in Javascript is fine provided the objects are small. If these objects get to big, then Node-RED suffers. Memory gets bloated and things slow down.

ETL pipelines on the other hand, work mostly with large datasets - gigabyte zip files are not unheard of. These datasets are fetched, stored and transformed before the data is passed to a data warehouse or some other data store. Hence ETL pipelines are not a good use-case for Node-RED.

Added to that, Node-RED assumes data is completely available in memory, i.e., attached to the msg object and passed through flows. Many nodes work with Buffers stored in memory containing kilobytes of data.

This package attempts to provide a solution to dealing with large datasets within Node-RED. Central to this is providing methods for streaming data into Node-RED as individual msg objects. Large datasets are handled outside of the Node-RED flow mechanism, with only individual messages being sent into Node-RED for further handling.

NodeJS Stream API

To understand how these nodes work, a review of the basic ideas of the Stream API:

Streams only deal with chunks of the original data: data is read in chunks and passed through streams until the chunk has been consumed. More chunks are passed through the streams as chunks get consumed.
Streams come in three flavours: Readable, Transformer and Writable. Data is read into the stream by readers, altered by transformers and streamed out in chunks by writers.
Pipelines are the grouping of streams to describe an action to be performed on a dataset. A pipeline always takes the same basic structure: first comes the reader streams, then come the transformers (as many as needed) and finally the writers store away the data chunks.

Putting it all together: pipelines describe a collection of streams that modify data in chunks, never storing an entire dataset in memory. The advantage of the Stream API is being able to consume large datasets with constant memory usage. For example the difference between loading a 600mb file into memory versus reading 600mb in chunks of 65kb.

Not all formats allow streaming of data chunks, hence streaming does not work with all data formats and sometimes it is unavoidable to load entire datasets into memory.

Node-RED and Stream API

Even though Node-RED is a streaming application, it has no "native" support for the NodeJS Stream API. Node-RED makes the assumption that data flows into Node-RED in small data chunks.

To support streaming large datasets through Node-RED, this package assumes this basic workflow:

streaming download of web content directly to disk
stream-reading of content from disk
transformation of stream content to single messages, messages are individually sent to Node-RED
archive files are not streamed but first extracted and stored on disk. Individual files are then be streamed into Node-RED

This workflow is a basic requirement when creating ETL pipelines: data is stored before being worked on.

Meta-Programming

These nodes are designed to be delimited by two nodes: PipeStart and PipeEnd. The Node-RED flow defined between these two nodes is a meta flow as the flow describes a stream pipeline.

Upon receiving the msg object, the PipeEnd node constructs and executes that pipeline - code for that. This implies that the msg objects goes through the flow twice: once for describing the pipeline and a second time for constructing the pipeline - the createStream function takes this msg as an argument.

Order within the flow is important so that reader-stream nodes must come before transformer nodes must come before writer nodes.

Extending

Work in Progress this might change.

To create nodes that can be included in a pipeline setup requires that these nodes:

Include themselves in the _streamPipeline array
Provide a createStream function for defining streams.

That sounds simple and examples of this can be found:

Any third-party node can implement this functionality and these nodes can be then be used in conjunction with the existing nodes.

Examples

An example ETL pipeline that uses these nodes is available. This pipeline can also be viewed in Node-RED (static, batteries not included).

Contributing

Even though this codebase is maintained as a Node-RED flow, contributions can be made using pull requests.

If you have an idea for a streaming node, then create it using the API described above and include it in your own package, there is no need to have third-party streaming nodes included here. A good naming convention would be @scope/node-red-streaming-XXXX but its not a requirement.

TODOs

These are maintained on the flows own homepage.

Discussion

There is a discussion around these nodes on the Node-RED forum.

License

Do not do evil.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme