text-phash
v1.0.8
Compute and compare perceptual hashes for text strings to check similarity.
TextPHash
Perceptual Hash for text strings.
- Source repository: GitHub: mlefkon/text-phash
- NPM package: NPM: text-phash
What it does
- Computes a perceptual hash for a text string.
- Compares perceptual hashes to give a percent similarity between two text strings.
Usage
// CommonJS
const TextPHash = require('text-phash')
// OR, as an ES module
import TextPHash from 'text-phash'
// Two sentences with the same words in a different order
const hashA = TextPHash.computePHash("The quick brown fox jumped over the black fence.")
const hashB = TextPHash.computePHash("Over the black fence, the quick brown fox jumped.")
const pctMatch = TextPHash.percentMatch(hashA, hashB)
console.log(hashA) // 00500000000000000000000500000000000F0050005000000000000000500000
console.log(hashB) // 00500005000000000000000500000000000F0000005000000000000000500000
console.log(pctMatch) // 77.77777777777779
Methodology
- Supply text (can be one word or a lengthy book)
- Tokenize text into neighboring word-groups. Number of words in each group is set in options:NGRAM_WORDS.
- Initialize a [hashHits] array with zeros, one 'counter' for each possible hash value. Number of hash values is set in options:WORD_HASH_BITS.
- Hash each word-group.
- For each hash encountered, increment its 'counter' in the [hashHits] array.
- Normalize all [hashHits] counters between 0, for no hits, and a set maximum number of hits (set in options:HIT_VALUE_BITS).
- Convert the [hashHits] array into a hexadecimal string.
- Compare two hashes by converting each hex string back into a [hashHits] array and comparing the difference in hits.
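The steps above can be sketched in a few lines of JavaScript. This is a minimal illustration under stated assumptions, not the text-phash source: the helper names (wordGroups, djb2, sketchPHash), the tokenizing regex, the DJB2 variant, the rounding rule, and the handling of texts shorter than one word-group are all hypothetical. Only the option names come from this README, so the output is not expected to match computePHash().

```javascript
// Illustrative sketch only: hypothetical helpers, NOT the text-phash source.

// Split text into lowercase words, then join each run of `n` neighboring words.
// Assumption: a text shorter than `n` words becomes a single group.
function wordGroups(text, n) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  if (words.length < n) return words.length ? [words.join(' ')] : [];
  const groups = [];
  for (let i = 0; i + n <= words.length; i++) {
    groups.push(words.slice(i, i + n).join(' '));
  }
  return groups;
}

// DJB2-style string hash reduced to `bits` bits (an assumed variant).
function djb2(str, bits) {
  let h = 5381;
  for (let i = 0; i < str.length; i++) {
    h = ((h * 33) ^ str.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return h % (2 ** bits);
}

function sketchPHash(text, opts = {}) {
  const { NGRAM_WORDS = 2, WORD_HASH_BITS = 6, HIT_VALUE_BITS = 4 } = opts;

  // One zeroed counter per possible word-group hash value.
  const hashHits = new Array(2 ** WORD_HASH_BITS).fill(0);
  for (const group of wordGroups(text, NGRAM_WORDS)) {
    hashHits[djb2(group, WORD_HASH_BITS)] += 1;
  }

  // Scale counters so the busiest one maps to the largest storable value.
  const maxHits = hashHits.reduce((m, h) => Math.max(m, h), 0);
  const maxValue = 2 ** HIT_VALUE_BITS - 1;
  const digits = HIT_VALUE_BITS / 4; // assumes HIT_VALUE_BITS is a multiple of 4

  return hashHits
    .map((hits) => (maxHits === 0 ? 0 : Math.round((hits / maxHits) * maxValue)))
    .map((v) => v.toString(16).toUpperCase().padStart(digits, '0'))
    .join('');
}

console.log(wordGroups('A B C D E', 2)); // [ 'a b', 'b c', 'c d', 'd e' ]
console.log(sketchPHash('The quick brown fox jumped over the black fence.').length); // 64
```

Because word-groups are counted rather than positioned, reordering sentences changes only the groups that straddle the moved boundary, which is why reshuffled text can still score a high match.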
Functions
For the optional options parameter {object}, supply one or more properties from the 'Default Options' object below.
computePHash()
TextPHash.computePHash(text)
TextPHash.computePHash(text, options)
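- Returns a hexadecimal string encoding a binary string (2 ^ WORD_HASH_BITS) x HIT_VALUE_BITS bits long: one HIT_VALUE_BITS-wide counter for each of the 2 ^ WORD_HASH_BITS possible hash values. Using the default options (64 counters x 4 bits = 256 bits), this will be a 64 digit hexadecimal string.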
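As a quick check on the hash length, the arithmetic can be written out. The sketch assumes each counter is stored in HIT_VALUE_BITS bits, which is what the 64 digit default and the sample hashes in Usage imply. hashSize is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper, not part of text-phash: hash size implied by the options.
function hashSize({ WORD_HASH_BITS = 6, HIT_VALUE_BITS = 4 } = {}) {
  const counters = 2 ** WORD_HASH_BITS;   // one counter per possible word-group hash
  const bits = counters * HIT_VALUE_BITS; // assumed: each counter takes HIT_VALUE_BITS bits
  return { counters, bits, hexDigits: Math.ceil(bits / 4) };
}

console.log(hashSize());                      // { counters: 64, bits: 256, hexDigits: 64 }
console.log(hashSize({ WORD_HASH_BITS: 8 })); // { counters: 256, bits: 1024, hexDigits: 256 }
```

Each extra bit of WORD_HASH_BITS doubles the hash length, so it is the more expensive of the two knobs to raise.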
percentMatch()
TextPHash.percentMatch(pHashA, pHashB)
TextPHash.percentMatch(pHashA, pHashB, options)
- If options are supplied, they must be the same as those used to create the hashes.
- Returns a number between zero and 100.
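This README does not say how percentMatch() scores the difference between two hashes. As a rough illustration, the sketch below decodes each hash into counters and takes a weighted Jaccard ratio (sum of per-counter minimums over sum of maximums). That formula is an assumption. It happens to give about 77.78 for the two sample hashes in Usage, but that agreement does not confirm the library's actual method. decodeHits and sketchPercentMatch are hypothetical names.

```javascript
// Illustrative sketch only: hypothetical helpers, NOT the text-phash source.

// Decode a hex hash into counters. Assumes HIT_VALUE_BITS is a multiple of 4,
// so each counter occupies a whole number of hex digits.
function decodeHits(hex, hitValueBits = 4) {
  const digits = hitValueBits / 4;
  const hits = [];
  for (let i = 0; i < hex.length; i += digits) {
    hits.push(parseInt(hex.slice(i, i + digits), 16));
  }
  return hits;
}

// Assumed scoring: weighted Jaccard similarity, expressed as a percentage.
function sketchPercentMatch(hexA, hexB, hitValueBits = 4) {
  const a = decodeHits(hexA, hitValueBits);
  const b = decodeHits(hexB, hitValueBits);
  if (a.length !== b.length) {
    throw new Error('hashes were built with different options');
  }
  let shared = 0;
  let total = 0;
  for (let i = 0; i < a.length; i++) {
    shared += Math.min(a[i], b[i]);
    total += Math.max(a[i], b[i]);
  }
  // Assumption: two all-zero hashes count as a full match.
  return total === 0 ? 100 : (shared / total) * 100;
}

// The two sample hashes from the Usage section.
const hashA = '00500000000000000000000500000000000F0050005000000000000000500000';
const hashB = '00500005000000000000000500000000000F0000005000000000000000500000';
console.log(sketchPercentMatch(hashA, hashB).toFixed(2)); // 77.78
```

The length check mirrors the rule above: hashes built with different options have different layouts, so comparing them counter by counter would be meaningless.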
Default Options
Available on the static class object TextPHash.DefaultOptions:
- NGRAM_WORDS: default = 2. Number of 'neighbor' words that will be hashed together. For example, a value of 1: ABCDE => [A, B, C, D, E]; 2: ABCDE => [AB, BC, CD, DE]; 3: ABCDE => [ABC, BCD, CDE].
- WORD_HASH_FUNCTION: default = TextPHash.WordHashDJB. A function that does a non-unique hash on each word-group/ngram. Select any TextPHash.WordHash... function in the TextPHash class (DJB, FNV1a, Murmur3), or provide your own with signature: (strText, intHashBitSize) => intHash
- WORD_HASH_BITS: default = 6. The binary size of the hash produced by WORD_HASH_FUNCTION. Hashes are not meant to be unique, so this can be a low number. The hashes build a histogram of melded word frequencies, and this is the 'x value' in that word-group-hash histogram. So if this is 6, there will be 2^6 possible hashes, or 64 'x values'.
- HIT_VALUE_BITS: default = 4. Binary size of the hit counter for a single hash. Actual hits are adjusted down to these discrete values. So if this is 4 and hash counters range from 0 to a max of 140 hits, the 140 value will be adjusted to (2^4)-1, or a max value of 15. A hash counter with a lower value, say 70 hits, would get an adjusted value of 8 (70/140 x 15 = 7.5, rounded up). This is the 'y value' in the word-group-hash histogram.
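Two of these options lend themselves to a short example. The sketch below shows a custom function written to the (strText, intHashBitSize) => intHash signature given for WORD_HASH_FUNCTION, and the scaling arithmetic described for HIT_VALUE_BITS. Both myWordHash and scaleHits are hypothetical names. The hash is an FNV-1a variant over UTF-16 code units and is not the built-in FNV1a function, and the use of Math.round is an assumption that simply agrees with the 70 to 8 example.

```javascript
// Hypothetical custom hash following the documented signature:
// (strText, intHashBitSize) => intHash. Not the library's own FNV1a function.
function myWordHash(strText, intHashBitSize) {
  let h = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < strText.length; i++) {
    h ^= strText.charCodeAt(i);   // simplification: UTF-16 code units, not bytes
    h = Math.imul(h, 0x01000193); // FNV 32-bit prime, with 32-bit wraparound
  }
  return (h >>> 0) % (2 ** intHashBitSize); // reduce to intHashBitSize bits
}

// Assumed scaling rule for HIT_VALUE_BITS: proportional to the busiest counter.
function scaleHits(hits, maxHits, hitValueBits) {
  const maxValue = 2 ** hitValueBits - 1;
  return Math.round((hits / maxHits) * maxValue);
}

// Options object in the shape the README describes. With the package installed,
// it would be passed as: TextPHash.computePHash(text, options)
const options = { WORD_HASH_FUNCTION: myWordHash, WORD_HASH_BITS: 6 };

console.log(options.WORD_HASH_FUNCTION('quick brown', options.WORD_HASH_BITS)); // an integer from 0 to 63
console.log(scaleHits(140, 140, 4)); // 15
console.log(scaleHits(70, 140, 4));  // 8
```

Whatever options are used to compute a hash, the same ones need to be passed to percentMatch(), since they determine how the hex string is laid out.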