
neek

v1.1.2

A simple implementation of *nix uniq in NodeJS to take advantage of streaming

Neek

A simple way to filter duplicate lines from a list, à la uniq. Takes an input and filters to an output removing duplicates.

Compatibility

The current version of Neek relies on several ES6 features, most notably the Set interface. If Set is not available, Neek falls back to a library implementation, which is not as fast (but it's still pretty good). As such, performance is best on Node >= v4.0.0, and all numbers in this README refer to that version unless stated otherwise.
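To illustrate the kind of feature detection described above, here is a minimal sketch. It is not Neek's actual code: the helper name createSeen is hypothetical, and the fallback here is a plain prototype-less object rather than whichever library Neek uses.

```javascript
// Hypothetical helper (an assumption, not part of neek): returns a small
// membership tracker that uses the native Set when the runtime provides one.
function createSeen() {
  if (typeof Set === 'function') {
    var set = new Set();
    return {
      has: function (line) { return set.has(line); },
      add: function (line) { set.add(line); },
      size: function () { return set.size; }
    };
  }

  // Fallback: a prototype-less object used as a string-keyed map, so keys
  // such as "constructor" or "__proto__" are treated like any other line.
  var keys = Object.create(null);
  var count = 0;
  return {
    has: function (line) { return line in keys; },
    add: function (line) {
      if (!(line in keys)) {
        keys[line] = true;
        count++;
      }
    },
    size: function () { return count; }
  };
}
```

Both branches expose the same has/add/size surface, so the calling code does not need to know which one it received.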

Setup

Depending on your use case, there are two different ways you can install Neek. The first is as a global module, mostly for use when scripting in a shell.

$ npm install -g neek

You can also install it as a local module in case you wish to use it inside another tool:

$ npm install neek

Usage

As mentioned, there are two ways to use Neek. The first, and probably the most common, is to invoke it from a shell, either on an input file or at the end of a pipe, to remove duplicate lines:

$ neek --input dup_file.txt -o output.txt

$ cat dup_file.txt | neek -o output.txt

The shell version takes these parameters:

-i, --input         an input file to process
-o, --output        a file to output to
-q, --quiet         only output the processed data

The other use is from within a Node module that needs to output text without duplicates, although I expect this will be less common. Please note that input and output each accept either a String path or a Stream. Below is an example inside Node:

var neek = require('neek');

var readable = './test/resources/lines_with_dups.txt';
var writable = './test/resources/output_without_dups.txt';

neek.unique(readable, writable, function(result){
  console.log(result);
});

unique(input, output[, callback])

The unique method is the only method currently available on the neek module. You pass in an input, an output (each either a Stream or a String path) and an optional callback.

The output parameter can take the value 'string', which will pass the output to the callback in result.output, rather than piping it to a stream. The callback to unique is optional, but be careful when omitting it in case you're depending on the Stream being written.

If you pass a String type to either input or output (when output !== 'string') it will be wrapped up in a read/write stream, with the assumption that it is a file path.
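The path-wrapping behaviour described above could look something like the following. This is a sketch under assumptions, not Neek's internals: toReadable and toWritable are hypothetical names, and the special 'string' output value is deliberately not handled here.

```javascript
var fs = require('fs');

// Hypothetical helper (an assumption, not neek's code): a string is treated
// as a file path and opened for reading; anything else is assumed to
// already be a readable stream and is returned untouched.
function toReadable(input) {
  return typeof input === 'string' ? fs.createReadStream(input) : input;
}

// Same idea for the output side. A caller that supports the special
// 'string' output mode would need to check for it before calling this.
function toWritable(output) {
  return typeof output === 'string' ? fs.createWriteStream(output) : output;
}
```

Normalising both arguments up front means the rest of the pipeline only ever deals with streams.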

The result object passed to the callback contains three fields, which translate to the following:

output  - output of the process, if you chose a string output - otherwise null
total   - the number of lines processed
unique  - the final amount of lines (without duplicate data)
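To make the shape of that object concrete, here is a small in-memory stand-in that builds an object with the same three fields. The function name uniqueLines is hypothetical, and how Neek itself counts a trailing empty line is not stated in this README, so treat the counting rule below as an assumption.

```javascript
// Hypothetical stand-in (not neek's implementation): deduplicates the lines
// of an in-memory string and reports the same three fields listed above.
function uniqueLines(text) {
  var seen = new Set();
  var total = 0;

  text.split('\n').forEach(function (line) {
    total++;        // every line counts towards the total
    seen.add(line); // a Set silently ignores repeats
  });

  return {
    // A Set iterates in insertion order, so first occurrences keep their order.
    output: Array.from(seen).join('\n'),
    total: total,
    unique: seen.size
  };
}

console.log(uniqueLines('a\nb\na'));
```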

Comparison

Below is a comparison of the Unix tools uniq and sort against Neek, using a 527MB test file containing 1,071,367 total lines, of which 443,917 are unique. Note that uniq assumes your data is already sorted.

Uniq

$ time uniq test-set.txt > deduplicated.txt

real    0m38.922s
user    0m37.647s
sys     0m1.105s

In the unfortunate case that your data isn't sorted, you would have to use sort instead. Neek, however, behaves the same regardless of order.
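The difference comes down to what each approach compares a line against. The sketch below contrasts the two strategies on a small array. Both function names are hypothetical, and neither is the real uniq or Neek code.

```javascript
// uniq-style: a line is dropped only when it equals the line directly
// before it, which is why the input has to be sorted first.
function adjacentUnique(lines) {
  return lines.filter(function (line, i) {
    return i === 0 || line !== lines[i - 1];
  });
}

// Set-style: a line is dropped if it has been seen anywhere earlier,
// so the order of the input does not matter.
function setUnique(lines) {
  return Array.from(new Set(lines));
}

var unsorted = ['b', 'a', 'b', 'a'];
console.log(adjacentUnique(unsorted)); // [ 'b', 'a', 'b', 'a' ]
console.log(setUnique(unsorted));      // [ 'b', 'a' ]
```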

Sort

$ time sort -u test-set.txt > deduplicated.txt

real    2m16.459s
user    2m13.757s
sys     0m2.186s

Now let's look at Neek!

Neek

$ time bin/neek --input test-set.txt -o deduplicated.txt

real    0m9.581s
user    0m8.615s
sys     0m1.588s

As you can see, Neek is ~4.1x faster than uniq and ~14.2x faster than sort on this test set, which makes it invaluable for larger files. Aside from being far faster, Neek pipes its input through streams rather than loading the file up front, which is far better for memory usage. sort, by contrast, has to read the entire input before it can emit a single line, making it a bad choice for large files.
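As a rough picture of what streaming deduplication involves, here is a sketch built on Node's Transform stream. It is an illustration under assumptions rather than Neek's source: createDedupeStream is a hypothetical name, lines are assumed to be separated by "\n", and the input is assumed to be UTF-8. Memory use in this sketch grows with the number of unique lines, not with the total size of the input.

```javascript
var Transform = require('stream').Transform;
var StringDecoder = require('string_decoder').StringDecoder;

// Hypothetical helper (not part of neek): returns a Transform stream that
// forwards each distinct line once, in order of first appearance.
function createDedupeStream() {
  var seen = new Set();                    // lines already emitted
  var decoder = new StringDecoder('utf8'); // copes with characters split across chunks
  var leftover = '';                       // partial line carried between chunks

  return new Transform({
    transform: function (chunk, encoding, callback) {
      var lines = (leftover + decoder.write(chunk)).split('\n');
      leftover = lines.pop(); // the last piece may be an incomplete line

      for (var i = 0; i < lines.length; i++) {
        if (!seen.has(lines[i])) {
          seen.add(lines[i]);
          this.push(lines[i] + '\n');
        }
      }
      callback();
    },

    flush: function (callback) {
      // Emit a final line that had no trailing newline, if it is new.
      leftover += decoder.end();
      if (leftover !== '' && !seen.has(leftover)) {
        this.push(leftover + '\n');
      }
      callback();
    }
  });
}
```

A stream like this can sit in the middle of a pipeline, for example process.stdin.pipe(createDedupeStream()).pipe(process.stdout), so output starts flowing before the whole input has been read.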

Redirection

On versions prior to Node v4.x, one important thing to note is that a shell redirection is faster than using the --output flag. In the processing of the above file, the --output flag took roughly 10 extra seconds due to the overheads inside Node.

Where possible, I would recommend simply using a shell redirection. If you do use a redirection, make sure to pass -q. Here is a comparison:

$ time bin/neek --input test-set.txt -q > deduplicated.txt

real    0m19.928s
user    0m16.596s
sys     0m3.653s

$ time bin/neek --input test-set.txt --output deduplicated.txt

real    0m30.536s
user    0m22.242s
sys     0m10.883s

On Node v4.x and later, this is not an issue. In fact the situation is almost reversed: shell redirection is far slower than the --output flag.