node-simhash
v0.1.0
Command Line tool that compares two text files using simhash
A simple command line tool for comparing text files using the simhash algorithm and contrasting the result with the Jaccard index.
References
Near duplicate detection (moz.com)
Installation
If you have just cloned this repository, run the following:
npm install
npm link
Or, if you would like to install globally:
npm install https://github.com/sjhorn/node-simhash -g
Command line tool usage
Using node
simhash file1.txt file2.txt
simhash https://file.com/page1.html https://file.com/page2.html
Using lib
var simhash = require('node-simhash');
simhash.compare(string1, string2);
Methods
.summary(file1, file2)
Compare two text strings using both simhash and the Jaccard index, and print a summary.
.compare(file1, file2)
Compare two text strings using both simhash and the Jaccard index.
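To illustrate the simhash half of this comparison, here is a minimal sketch of a 32-bit simhash fingerprint and a similarity score derived from it. It is not the library's code: the function names, the regex tokenizer, the FNV-1a hash (swapped in for brevity instead of crc-32), and the "1 minus Hamming distance over 32" similarity formula are all assumptions made for illustration.

```javascript
// Hypothetical helper: 32-bit FNV-1a hash (an assumption; the library uses crc-32).
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (const byte of Buffer.from(text, 'utf8')) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0;
}

// Hypothetical sketch: build a 32-bit simhash from 2-word shingles.
function simhashSketch(text) {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const shingleSet = new Set();
  for (let i = 0; i + 2 <= tokens.length; i += 1) {
    shingleSet.add(tokens[i] + ' ' + tokens[i + 1]);
  }
  // Each shingle hash votes +1 or -1 on every bit position.
  const votes = new Array(32).fill(0);
  for (const shingle of shingleSet) {
    const hash = fnv1a32(shingle);
    for (let bit = 0; bit < 32; bit += 1) {
      votes[bit] += (hash >>> bit) & 1 ? 1 : -1;
    }
  }
  // A fingerprint bit is set when the positive votes win.
  let fingerprint = 0;
  for (let bit = 0; bit < 32; bit += 1) {
    if (votes[bit] > 0) fingerprint |= 1 << bit;
  }
  return fingerprint >>> 0;
}

// Hypothetical similarity: share of fingerprint bits that agree.
function simhashSimilaritySketch(textA, textB) {
  let diff = (simhashSketch(textA) ^ simhashSketch(textB)) >>> 0;
  let distance = 0;
  while (diff !== 0) {
    diff &= diff - 1; // clear the lowest set bit
    distance += 1;
  }
  return 1 - distance / 32;
}
```

Identical inputs produce identical fingerprints and therefore a similarity of 1; texts that share many shingles tend to differ in only a few bits.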
.hammingWeight(number)
Count the binary ones in a number.
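As a rough illustration of what counting the binary ones involves, the following sketch treats its input as an unsigned 32-bit integer. The name hammingWeightSketch is hypothetical and this is not the library's implementation.

```javascript
// Hypothetical stand-in for .hammingWeight(number); not the library's code.
function hammingWeightSketch(number) {
  let n = number >>> 0; // coerce to an unsigned 32-bit integer
  let count = 0;
  while (n !== 0) {
    n &= n - 1; // clear the lowest set bit
    count += 1;
  }
  return count;
}
```

Applied to the XOR of two 32-bit hashes, this count is the Hamming distance between them, which is how a bit count is typically used when comparing simhash fingerprints.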
.shingles(string, words_per_single=2)
Convert a string to a set of shingles, using a default of 2 words per shingle. The string is tokenized with the natural library's default tokenizer.
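A minimal sketch of word shingling follows. The function name is hypothetical, and the lowercase-and-split-on-non-alphanumerics tokenizer is an assumption standing in for the natural library's tokenizer, so results may differ from the library's on punctuation or non-ASCII text.

```javascript
// Hypothetical sketch of word shingling; the regex split below is an
// assumption standing in for the natural library's default tokenizer.
function shinglesSketch(text, wordsPerShingle = 2) {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const result = new Set();
  // Slide a window of wordsPerShingle tokens across the token list.
  for (let i = 0; i + wordsPerShingle <= tokens.length; i += 1) {
    result.add(tokens.slice(i, i + wordsPerShingle).join(' '));
  }
  return result;
}
```

For example, shinglesSketch('the quick brown fox') yields the three shingles 'the quick', 'quick brown', and 'brown fox'.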
.jaccardIndex(string1, string2)
Compare two strings by tokenizing them and then comparing the size of the intersection of their shingles to the size of the union of their shingles.
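The ratio described here can be sketched as follows. The names are hypothetical, the tokenizer is a simple regex split rather than the natural library's, and returning 0 when both inputs have no shingles is an assumed convention.

```javascript
// Hypothetical helper: 2-word shingles with a simple regex tokenizer
// (an assumption; the library tokenizes with the natural library).
function wordPairsSketch(text) {
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const pairs = new Set();
  for (let i = 0; i + 2 <= tokens.length; i += 1) {
    pairs.add(tokens[i] + ' ' + tokens[i + 1]);
  }
  return pairs;
}

// Hypothetical sketch: |A ∩ B| / |A ∪ B| over the two shingle sets.
function jaccardIndexSketch(textA, textB) {
  const setA = wordPairsSketch(textA);
  const setB = wordPairsSketch(textB);
  let intersection = 0;
  for (const shingle of setA) {
    if (setB.has(shingle)) intersection += 1;
  }
  const union = setA.size + setB.size - intersection;
  return union === 0 ? 0 : intersection / union; // assumed convention for empty input
}
```

With 'the quick brown fox' and 'the quick brown dog', two of the four distinct shingles are shared, so the index is 0.5.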
.createBinaryString(number)
Print a 32-bit number as a binary string of 32 characters
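One way to produce such a string in JavaScript is sketched below. The function name is hypothetical, and this version returns the string rather than printing it.

```javascript
// Hypothetical stand-in for .createBinaryString(number); not the library's code.
function createBinaryStringSketch(number) {
  // >>> 0 reinterprets the value as unsigned 32-bit so negative inputs
  // show their two's-complement bits; padStart adds the leading zeros.
  return (number >>> 0).toString(2).padStart(32, '0');
}
```

This is handy for eyeballing which bits differ between two fingerprints.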
.shingleHashList(set)
Convert a set of shingles to a set of crc-32 hashes.
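For illustration, the sketch below hashes each shingle with a bitwise CRC-32 (the common reflected polynomial 0xEDB88320). The function names are hypothetical, and whether the library uses this exact CRC-32 variant or byte encoding is an assumption.

```javascript
// Hypothetical bitwise CRC-32 over the UTF-8 bytes of a string.
function crc32Sketch(text) {
  let crc = 0xFFFFFFFF;
  for (const byte of Buffer.from(text, 'utf8')) {
    crc ^= byte;
    for (let bit = 0; bit < 8; bit += 1) {
      // Shift right; XOR in the polynomial when the shifted-out bit was 1.
      crc = (crc >>> 1) ^ (0xEDB88320 & -(crc & 1));
    }
  }
  return (crc ^ 0xFFFFFFFF) >>> 0;
}

// Hypothetical stand-in for .shingleHashList(set): map each shingle to its hash.
function shingleHashListSketch(shingleSet) {
  const hashes = new Set();
  for (const shingle of shingleSet) {
    hashes.add(crc32Sketch(shingle));
  }
  return hashes;
}
```

A table-driven CRC-32 would be faster for large inputs; the bitwise form is used here only to keep the sketch short.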