text-phash

v1.0.8

Published

a day ago

Compute and compare perceptual hashes for text strings to check similarity.

Downloads

465

0High
0Medium
0Low

mlefkon

phash hash text string perceptual similarity match

TextPHash

Perceptual Hash for text strings.

Source repository: Github: mlefkon/text-phash
NPM Package: NPM: text-phash

What it does

Computes a perceptual hash for a text string.
Compares perceptual hashes to give a percent similarity between two text strings.

Usage

const TextPHash = require('text-phash')
// OR
import TextPHash from 'text-phash'

let hashA = TextPHash.computePHash("The quick brown fox jumped over the black fence.")
let hashB = TextPHash.computePHash("Over the black fence, the quick brown fox jumped.")
let pctMatch = TextPHash.percentMatch(hashA, hashB)
console.log(hashA) // 00500000000000000000000500000000000F0050005000000000000000500000
console.log(hashB) // 00500005000000000000000500000000000F0000005000000000000000500000
console.log(pctMatch);  // 77.77777777777779

Methodology

Supply text (can be one word or a lengthy book)
Tokenize text into neighboring word-groups. Number of words in each group is set in options:NGRAM_WORDS.
Initialize a [hashHits] array with zeros, one 'counter' for each possible hash value. Number of hash values is set in options:WORD_HASH_BITS.
Hash each word-group.
For each hash encountered, increment it's 'counter' in the [hashHits] array
Normalize all [hashHits] counters between 0, for no hits, and a set maximum (set in options:HIT_VALUE_BITS) hits.
Convert [hashHits] array into a hexadecimal string.
Compare two hashes by converting hex back into [hashHits] array and comparing the difference in hits.

Functions

For optional options parameter {object}, supply one or more properties from the 'Default Options' object below.

computePHash()

TextPHash.computePHash(text)
TextPHash.computePHash(text, options)

Returns a hexadecimal number representing a binary string (2 ^ WORD_HASH_BITS x 2 ^ HIT_VALUE_BITS) bits long. Using the default options, this will be a 64 digit hexadecimal string.

percentMatch()

TextPHash.percentMatch(pHashA, pHashB)
TextPHash.percentMatch(pHashA, pHashB, options)

If options are supplied, they must be the same as those used to create the hashes.
Returns a number between zero and 100.

Default Options

Available on the static class object TextPHash.DefaultOptions:

NGRAM_WORDS: default = 2
Number of 'neighbor' words that will be hashed together.
For example, a value of 1: ABCDE=>[A,B,C,D,E], 2: ABCDE => [AB, BC, CD, DE], 3: ABCDE => [ABC,BCD,CDE]
WORD_HASH_FUNCTION: default = TextPHash.WordHashDJB
A function that does a non-unique hash on each word-group/ngram.
Select any TextPHash.WordHash... function in TextPHash class (DJB, FNV1a, Murmur3). Or provide your own with signature: (strText, intHashBitSize) => intHash
WORD_HASH_BITS: default = 6
The binary size of hash produced by WORD_HASH_FUNCTION.
Hashes are not meant to be unique, so this can be a low number. The hashes build a histogram of melded word frequencies. This is the 'x value' in the word-group-hash histogram. So if this is '6', there will be 2^6 possible hashes, or 64 'x values'.
HIT_VALUE_BITS: default = 4
Binary size of hit counter for a single hash. Actual hits are adjusted down to these discrete values.
So if this is '4' and hash counters range from 0 to a max of 140 hits, the 140 value will be adjusted to (2^4)-1, or a max value of 15. A hash counter with lower value, say 70 hits, would get an adjusted value of 8. This is the 'y value' in the word-group-hash histogram.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme