
deduplicator v1.0.0

Remove duplicates from real-time streams of up to 50,000 messages per second, per CPU core, with reasonable CPU and memory constraints.


Stream DeDuplicator

A glorified Redis hash for removing duplicate messages from streams of data. Works effectively both in high-speed, real-time scenarios (speed capped at around 50,000 msg/sec per Node.js thread) and in lightweight settings (memory and CPU consumption stay low when the load is low).

Designed to avoid the need for more complex deduplication approaches (e.g. Bloom filters). By hashing messages efficiently and factoring in message lifetimes, this module can handle most deduplication scenarios on basic hardware.

Used in production by Scalabull to eliminate duplicate patient records on-the-fly.

Installation

npm install deduplicator

Recommended Redis configuration:

Redis must be configured for predictable memory behavior. The settings below assume a single core (multiply maxmemory by the number of deduplicator instances you run):

+ maxmemory-policy volatile-ttl
+ maxmemory 520mb
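
To apply these settings to a running server instead of editing redis.conf, redis-cli commands along these lines should work (a sketch; scale the memory limit if you run multiple instances):

    redis-cli config set maxmemory 520mb
    redis-cli config set maxmemory-policy volatile-ttl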

Example Usage

Deduplicator accepts one integer as input: the lifetime of messages, in seconds. The first receipt of a given message X is held in memory for 'lifetime' seconds, and during that time any duplicates of X are detectable. After the lifetime expires, a new message Y that is equal to X is no longer considered a duplicate; instead, receipt of Y prevents any duplicates of Y for the next 'lifetime' seconds.

If duplicates are expected to arrive over long windows of time, the lifetime must reflect this. See below for example use cases and their respective memory consumption.

  1. Removing duplicate messages from low-level protocol communications, e.g. the high-speed, duplicate-prone message streams common in telecom, healthcare, and finance. In this use case the lifetime can likely be low. In healthcare, for instance, it is uncommon for duplicates to occur over the MLLP protocol outside the span of a few seconds. Setting the lifetime to 60 seconds gives a safe window for catching duplicates, peak memory consumption stays below 520mb, and the system can process 50,000 messages per second. For multiples of these figures, consider using PM2 to run multiple instances of deduplicator (effectively leveraging multiple CPU cores).

    var DeDuplicator = require('deduplicator');

    var instance = new DeDuplicator(60);

  2. Getting on-the-fly notifications of unique events. Stream your daily logs through the deduplicator to get a condensed view of which unique events recently occurred, setting the lifetime to 24 * 60 * 60 seconds. This works well for data sets with hundreds or thousands of unique daily events but many occurrences of each of those events. Capping memory at 512mb allows throughput of 3 million daily unique events, which is much higher than the intended scale for this use case. Contact me if you have questions about this.

    var DeDuplicator = require('deduplicator');

    var instance = new DeDuplicator(86400);

Note: The Deduplicator extends EventEmitter. Users need to handle all of the following events:

+ ready
+ error
+ overflow
+ drained
+ output

In addition, inbound messages are pushed to the instance via 'input'.

See /test for template applications that fulfill these requirements and simulate fast & slow scenarios.
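
For orientation, here is a minimal sketch of such an application, assuming the event model described in the note above (events are emitted on the instance, and inbound messages are pushed by emitting 'input'). The message format and exact event payloads are assumptions, so treat this as illustrative rather than as the module's canonical API:

    var DeDuplicator = require('deduplicator');

    // Messages are remembered for 60 seconds; assumes Redis is running
    // locally with the recommended configuration.
    var instance = new DeDuplicator(60);

    instance.on('ready', function () {
      // Only start pushing once the instance reports it is ready.
      instance.emit('input', 'patient-123|ADT^A01|2016-01-01T00:00:00Z');
      instance.emit('input', 'patient-123|ADT^A01|2016-01-01T00:00:00Z'); // duplicate
    });

    instance.on('output', function (message) {
      // Only the first occurrence of each message within its lifetime lands here.
      console.log('unique:', message);
    });

    instance.on('overflow', function () {
      // Back off: stop pushing to 'input' until 'drained' fires.
    });

    instance.on('drained', function () {
      // Safe to resume pushing messages.
    });

    instance.on('error', function (err) {
      console.error('deduplicator error:', err);
    });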

Notes

Actual memory use depends on the number of unique messages being stored, not on the throughput. In practice, if each message is received multiple times on average, memory use will generally stay at a fraction of the stated peak.

Node streams aren't used because the application automatically batches operations that occur within small windows of time, a model that streams don't easily support (see http://blog.justonepixel.com/geek/2015/02/15/batch-operations-writable-streams). EventEmitter is used directly instead.

Only one namespace is supported per Redis instance. To detect duplicates in multiple independent sets of data at the same time, run multiple Redis instances, each paired with its own deduplicator process.

Contributions

Avoid: attempting to speed up the module. With Node.js and Redis, throughput is going to be limited to around 50,000 to 60,000 messages per second, per CPU core. A lower-level language with manual memory management is likely necessary for throughput improvements: at higher speeds, V8's garbage collector does not run frequently enough, and Node eventually crashes.

If heap crashes occur, it's most likely either because you aren't abiding by the overflow policies, or because you're pushing very large input messages through the system at very high speeds. In the latter case, heap crashes can likely be avoided by using Buffers to store the messages outside of V8's heap (only works in Node v6.0 or newer). This lifts the memory cap from 1.8gb to the full available system memory. This is a potential improvement, but in most cases adding overflow throttling will prevent heap overflows.
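
As a rough sketch of that Buffer-based idea (not part of the module's current API; the helper names here are hypothetical): Buffer contents are allocated outside V8's managed heap, so only small object references count against the heap limit.

    // Hypothetical illustration: keep large message payloads off the V8 heap.
    // Buffer contents live outside V8's managed heap (Node >= 6 for Buffer.from).
    var payloads = new Map();

    function storePayload(id, message) {
      payloads.set(id, Buffer.from(message, 'utf8'));
    }

    function loadPayload(id) {
      var buf = payloads.get(id);
      return buf ? buf.toString('utf8') : null;
    }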

Deduplicator can technically support alternate hashing algorithms, e.g. the simhash. With the simhash, approximate duplicates can be removed on-the-fly (things that look similar in some way but aren't exactly the same). This could make the deduplicator more practical for use case 2.

This module can theoretically work with Redis partitions across multiple machines. With just a few machines it should be possible to effectively deduplicate millions of unique messages per second. Contact me if you are interested in pursuing this.