
deduplicator

v1.0.0

Remove duplicates from real-time streams of up to 50,000 messages per second, per CPU core, with reasonable CPU and memory constraints.

Stream DeDuplicator

A glorified Redis hash for removing duplicate messages from streams of data. Works effectively in both high-speed, real-time scenarios (throughput capped at around 50,000 msg/sec per Node.js thread) and in lightweight settings (memory and CPU consumption stay low when the load is low).

Designed to avoid the need for more complex deduplication approaches (e.g. Bloom filters). By hashing messages efficiently and factoring in message lifetimes, this module can handle most deduplication scenarios on basic hardware.
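The underlying pattern can be sketched roughly as follows. This is a conceptual illustration of "hash plus lifetime" deduplication, not the module's actual internals; it assumes the node-redis v4 client, SHA-1 as the hash, and a hypothetical isDuplicate helper:

    // Conceptual sketch only (not the module's internals); assumes node-redis v4.
    const crypto = require('crypto');
    const { createClient } = require('redis');

    async function isDuplicate(client, message, lifetime) {
      // Hash the message so Redis stores a small, fixed-size key.
      const key = crypto.createHash('sha1').update(message).digest('hex');
      // SET ... NX EX only succeeds if the key is not already present,
      // and expires the key after `lifetime` seconds.
      const reply = await client.set(key, '1', { NX: true, EX: lifetime });
      return reply === null;   // null => key already existed => duplicate
    }

The first receipt of a message creates a key with a TTL; any repeat within the lifetime finds the key and is flagged as a duplicate.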

Used in production by Scalabull to eliminate duplicate patient records on-the-fly.

Installation

npm install deduplicator

Recommended Redis Configuration:

+ Redis must be configured for predictable behavior.
+ The settings below assume a single core; multiply memory by the number of instances when running more than one.
+ maxmemory-policy volatile-ttl
+ maxmemory 520mb
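For reference, the two settings above would look like this in redis.conf (values taken from the list above; adjust maxmemory if you run multiple instances):

    # redis.conf (relevant lines only)
    maxmemory 520mb
    maxmemory-policy volatile-ttl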

Example Usage

Deduplicator accepts one integer as input: the lifetime of messages, in seconds. The first receipt of a given message X is held in memory for 'lifetime' seconds, and during that window any duplicate of X will be detected. After the lifetime expires, a new message Y that is equal to X is not considered a duplicate (and receipt of Y then suppresses duplicates of Y for another 'lifetime' seconds).

If duplicates may arrive over long windows of time, the lifetime must reflect this. See below for example use cases and their respective memory consumption.

  1. Removing duplicate messages from low-level protocol communications, e.g. in telecom, healthcare, and finance, where many duplicate messages are sent at high speed. In this use case the lifetime can likely be low. In healthcare, for instance, it is uncommon for duplicates over the MLLP protocol to arrive more than a few seconds apart. A lifetime of 60 seconds is a safe window for preventing duplicates, peak memory consumption stays below 520mb, and the system can process 50,000 messages per second. For multiples of these figures, consider using PM2 to run multiple instances of deduplicator (effectively leveraging multiple CPU cores).

    var DeDuplicator = require('deduplicator');

    // Lifetime of 60 seconds: duplicates of a message are detected for 60 seconds.
    var instance = new DeDuplicator(60);

  2. Getting on-the-fly notifications of unique events. Stream your daily logs through the deduplicator to get a condensed view of which unique events occurred recently. Set the lifetime to 24 * 60 * 60 (one day). This works well for data sets with hundreds or thousands of unique daily events, each occurring many times. Capping memory at 512mb allows throughput of 3 million daily unique events, far more than the intended size for this use case. Contact me if you have questions about this.

    var DeDuplicator = require('deduplicator');

    // 86400 seconds = 24 hours: duplicates are detected across a full day.
    var instance = new DeDuplicator(86400);

Note: The Deduplicator extends EventEmitter. Users need to handle all of the following events:

+ ready
+ error
+ overflow
+ drained
+ output

Inbound messages are pushed to 'input'.
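A minimal sketch of that wiring, assuming unique messages come back on 'output' and that inbound messages are delivered by emitting 'input' on the instance (that emit call is an assumption; the template applications in /test are the authoritative reference):

    var DeDuplicator = require('deduplicator');
    var instance = new DeDuplicator(60);

    instance.on('ready', function () {
      // Safe to start pushing messages once the deduplicator is ready.
      instance.emit('input', 'example message');   // assumed wiring for inbound messages
    });

    instance.on('output', function (message) {
      // Fires only for messages that were not duplicates within the lifetime.
      console.log('unique:', message);
    });

    // Throttle: stop pushing on 'overflow', resume on 'drained'.
    instance.on('overflow', function () { /* pause your source */ });
    instance.on('drained', function () { /* resume your source */ });

    instance.on('error', function (err) {
      console.error(err);
    });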

See /test for template applications that fulfill these requirements and simulate fast & slow scenarios.

Notes

Actual memory use depends on the number of unique messages being stored, not on the throughput. In practice, if each message is received multiple times on average, memory use will generally stay well below the theoretical peak.

Node Streams aren't used because the application automatically batches operations that occur within small windows of time, and Streams don't easily support this model. See: http://blog.justonepixel.com/geek/2015/02/15/batch-operations-writable-streams . Instead, EventEmitter is used directly.
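As an illustration of what batching means here (a generic micro-batching pattern, not the module's internal code), operations that arrive within a small time window are collected and flushed together:

    // Generic micro-batching sketch (illustrative, not the module's code):
    // collect work for a few milliseconds, then flush it as one batch.
    var queue = [];
    var timer = null;

    function enqueue(message) {
      queue.push(message);
      if (!timer) {
        timer = setTimeout(flush, 5);   // 5ms window, illustrative value
      }
    }

    function flush() {
      var batch = queue;
      queue = [];
      timer = null;
      // The real module would check/insert all hashes in the batch
      // against Redis in a single round trip here.
      console.log('flushing', batch.length, 'messages');
    }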

Only one namespace is supported per Redis instance. To detect duplicates in multiple distinct data sets at the same time, use multiple Redis instances, each paired with its own deduplicator instance.

Contributions

Avoid attempting to speed up the module. With Node.js and Redis, throughput is limited to around 50,000-60,000 messages per second, per CPU core. A lower-level language with manual memory management is likely necessary for throughput improvements. At faster speeds, V8's garbage collector does not run frequently enough and Node eventually crashes.

If heap crashes occur, it's most likely because you aren't abiding by the overflow policy, and/or because you're pushing very large input messages through the system at very high speed. For the latter case, heap crashes may be avoidable by using Buffers to store the messages outside of V8's heap (Node v6.0 or newer only). This lifts the memory cap from 1.8gb to the full memory available on the system. This is a potential improvement, but in most cases adding overflow throttling will prevent heap overflows.
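A sketch of that idea (illustrative only, not part of the module today): keep the raw message bytes in a Buffer, which lives outside the V8 heap, and hold only the small hash string on the heap.

    // Illustrative only: store large payloads off the V8 heap as Buffers,
    // keeping only a small hash on the heap for duplicate checks.
    var crypto = require('crypto');

    function wrapMessage(message) {
      var payload = Buffer.from(message);          // backing store lives off the V8 heap (Node >= 6)
      var hash = crypto.createHash('sha1')
                       .update(payload)
                       .digest('hex');             // small string kept on the heap
      return { hash: hash, payload: payload };
    }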

Deduplicator can technically support alternate hashing algorithms, e.g. simhash. With simhash, approximate duplicates (messages that look similar but aren't exactly identical) can be removed on-the-fly. This could make the deduplicator more practical for use case 2.

This module can theoretically work with Redis partitions across multiple machines. With just a few machines it should be possible to effectively deduplicate millions of unique messages per second. Contact me if you are interested in pursuing this.