sai-bloom
v1.0.4
Published
Fast, efficient and easy to use Bloom filter.
Downloads
22
Readme
Bloom Filter
This package is usable in pure Javascript projects. SAI is only needed to recompile it.
Based on a Javascript implementation at https://github.com/jasondavies/bloomfilter.js.
Assumes platform support for Uint32Array, which is a pretty safe assumption in the node ecosystem.
Written by Sean M Puckett mailto:[email protected]. MIT License (appears at end of file).
Performance
The filter does not get slower (or take up more memory) when more things are added. The configuration size of the filter is the final size. Adding more items just gradually increases the probability of false positives.
Roughly 4M tests or insertions can be be performed per second, on a 2012 MacBook Pro Retina laptop, under node 9.3.0. This estimate is for items on the order of 1-10 characters, with an acceptable false positive rate of about 5%. Longer items and more accurate filters will take slightly longer.)
Configuration
You can autoconfigure the filter by initializing it with an estimated number of items and an allowed failure rate.
Pass in an object with these attributes:
- size: approximate number of items you'd like to store
- rate: minimum acceptable rate of failure, as a decimal fraction (e.g. 5% = 0.05).
The above configuration case would be create SaiBloom: size 1000000, rate 0.05
.
It's important to note that even if you ask for a zero false positive rate, there is no ironclad guarantee that you will not get a false positive eventually. Do not rely on this for important things.
Manual configuration is also available.
Preserving state
This filter offers a means to preserve and restore state using the .state
attribute. This allows you to e.g. precalculate a filter and save the state to disk for later use against live data. To regenerate the filter, pass the structure you got from .state
to the SaiBloom constructor.
The native bit array is encoded as a base-90 ASCII-compliant string with values from 35 to 124. This is done for space efficiency.
Examples
Javascript
Manual configuration:
var SaiBloom=require('sai-bloom').constructor // example
var bloom=new SaiBloom(1000*10,7)
bloom.Add(1)
console.log(bloom.Test(1)) // true
console.log(bloom.Test(2)) // false
Automatic configuration:
var bloom2=new SaiBloom({size: 10000, rate: 0.01}); // example
SAI
Automatic configuration:
set bloom to create "SaiBloom": size 10000, rate 0.01 // example
bloom.Add 1
debug bloom.Test(1) // true
debug bloom.Test(2) // false
Save and reload state:
set bloom to create "SaiBloom": size 100, rate 0.01 // example
count 50
bloom.Add counter
set filterstate to bloom.state
set bloom2 to create "SaiBloom" filterstate
debug bloom2.Test(25) // true
debug bloom2.Test(75) // false
Abstract
What's a Bloom filter?
From Wikipedia:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter); the more elements that are added to the set, the larger the probability of false positives.
Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if "conventional" error-free hashing techniques were applied. He gave the example of a hyphenation algorithm for a dictionary of 500,000 words, out of which 90% follow simple hyphenation rules, but the remaining 10% require expensive disk accesses to retrieve specific hyphenation patterns. With sufficient core memory, an error-free hash could be used to eliminate all unnecessary disk accesses; on the other hand, with limited core memory, Bloom's technique uses a smaller hash area but still eliminates most unnecessary accesses. For example, a hash area only 15% of the size needed by an ideal error-free hash still eliminates 85% of the disk accesses.
More generally, fewer than 10 bits per element are required for a 1% false positive probability, independent of the size or number of elements in the set.
Note that SaiBloom is not a 'counting' filter, as described above. One cannot remove items from it.
False positives characterization
I wanted to understand filter performance across bits-per-item and number of hash rounds, so I undertook a 2-dimensional characterization iterating from 2-32 bits-per-item, and 2-16 rounds.
For each test case:
- A filter was created with 100K * bits and the given number of rounds.
- Then 100K unique items were _Add_ed.
- The filter was then _Test_ed 1M times, each with a different unique item.
- The desired outcome is that each Test would return
false
. - The percentage of times it returned
true
instead, is the false positive rate.
The number of rounds affects the false positive rate. For each number of bits, there is an optimum number of rounds that produces the fewest false positives in this test suite. In the results below, only the number of rounds per bits producing the best results (i.e. the lowest false positive rate) is shown. A cursory examination suggests choosing a number of rounds that is about 2/3 the number of bits will produce the best results.
- 2 bits @ 2 hashes: 40.08% false positives
- 3 bits @ 2 hashes: 23.73%
- 4 bits @ 3 hashes: 14.75%
- 5 bits @ 4 hashes: 9.15%
- 6 bits @ 4 hashes: 5.60%
- 7 bits @ 5 hashes: 3.48%
- 8 bits @ 6 hashes: 2.16%
- 9 bits @ 6 hashes: 1.33%
- 10 bits @ 7 hashes: 0.82%
- 11 bits @ 7 hashes: 0.50%
- 12 bits @ 8 hashes: 0.32%
- 13 bits @ 9 hashes: 0.19%
- 14 bits @ 9 hashes: 0.11%
- 15 bits @ 12 hashes: 0.07%
- 16 bits @ 10 hashes: 0.04%
- 17 bits @ 11 hashes: 0.02%
- 18 bits @ 11 hashes: 0.01%
- 19 bits @ 15 hashes: 0.01%
- 20 bits @ 15 hashes: 0.00%
Beyond 20 bits per item to be stored, there were no false positives found in the best row, though this is just a statistical null result rather than a provable one -- it is always possible for a Bloom filter to fail, it just gets more and more unlikely the more bits have been set aside per item.
Implementation
SAI code for the Bloom filter follows (yes, this file can be compiled directly).
SaiBloom object
object SaiBloom
Instance variables:
instance:
buckets empty // this will be initialized with a Uint32Array
bits 0 // number of bits in the bloom filter (rounded to 32)
rounds 0 // number of hash rounds (bits to set per Add)
.state attribute
Recover / restore the filter's current state.
- On get: return the filter's state as a plain JS object.
- On set: set the filter's state to a value previously retrieved.
Implementation:
state get
set codes new ~Array buckets.length
ply buckets
set codes\key Encode32to5(it)
return:
bits bits
rounds rounds
encoded join'd codes ''
set
Configure $bits, $rounds
count 0, $encoded.length, 5
set buckets[counter/5] to Decode32to5( $encoded, counter )
Instantiate method
If passed a state-like object from an already made filter, will recreate that filter:
- state: (as p) a plain object previously retrieved from a bloom filter's
.state
.
If passed an autoconfiguration object, will choose suitable parameters
- p.size: number of objects you propose to store in the filter
- p.rate: acceptable rate of false positives, expressed as a decimal between 0.0 and 1.0, e.g. 0.05 = 5%.
Otherwise, to create an empty filter with manual configuration:
- bits: (as p) number of bits in the filter (will be rounded up to an even 32)
- rounds: number of hash rounds to run, e.g. number of bits to set per Add
Implemenation:
Instantiate task given p, rounds_
if p.bits // did we get a state object?
set state p
elsif p.size // did we get an autoconfiguration request?
assert p.rate<0.5, "SaiBloom autoconfiguration error, rate should be less than 0.5 (not ${p.rate}) -- percentages must be decimal, e.g. 5% is 0.05."
set rate to p.rate ?> 0.00001
set b to (Math.log(rate) * -2) + 0.5
set r to Math.ceil( b * 0.66 )
Configure b * p.size, r
else // assume we got manual configuration
Configure p, rounds_
Configure method
Set up arrays, inilialize state, filter empty.
- bits:: number of bits in the filter
- rounds: number of bits to set per add
Implementation:
Configure task given bits_, rounds_
set
bucketcount Math.ceil(bits_ / 32)
bits bucketcount * 32
rounds rounds_
buckets new ~Uint32Array bucketcount
Add method
Add a stringable item to the filter.
- item: must either be a string or have a
.toString()
method
The item is marked as having been added, though you cannot recover or remove it later. All you can do is validate for sure whether it has not been added, or has probably been added.
Add task given item
set
s to typeof item is 'string' ?? item :: item.toString()
a to FNV_1A(s)
b to FNV_1A_B(a)
x to a % bits
count rounds
set loc to x<0 ?? (x+bits) :: x
set buckets[loc rsh 5] orb (1 lsh (loc andb 0x1f))
set x to (self+b) % bits
Test method
Test the filter to see if a stringable has been added
- item: either a string or has a
.toString()
method
Returns
false
if item has not been addedtrue
if it probably has been added
Implementation:
Test task given item
set
s to typeof item is 'string' ?? item :: item.toString()
a to FNV_1A(s)
b to FNV_1A_B(a)
x to a % bits
count rounds
set loc to x<0 ?? (x+bits) :: x
unless buckets[loc rsh 5] andb (1 lsh (loc andb 0x1f))
return false
set x to (self+b) % bits
return true
Cardinality method
Return an estimate of how many things are stored in the filter.
This should not be appreciably lower than the actual number of things. If it is, then the bit count is too small for the number of items you're trying to store in the filter. The more items you insert for a given bitcount, the more risk of false positives you will get.
Cardinality task
set bitcount to buckets | total using Bitcount
return (-bits * Math.log(1-bitcount/bits)) / rounds
Internal functions
Count number of bits in a 32 bit integer. Pretty cool parallel addition from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
Bitcount unbound task given v
set v to self - ((self rsh 1) andb 0x55555555)
set v to (self andb 0x33333333) + ((self rsh 2) andb 0x33333333)
set v to (((self + (self rsh 4)) andb 0x0f0f0f0f) * 0x01010101) rsh 24
return v
Fowler/Noll/Vo string hash, the alternate (preferred) implementation.
FNV_1A unbound task given v
set a to 2166136261
count v's length as i
set c to v.charCodeAt(i)
set d to c andb 0xff00
if d .. set a to FNV_MULTIPLY( ( a xorb d ) rsh 8 )
set a to FNV_MULTIPLY( a xorb ( c andb 0xff ) )
return FNV_MIX( a )
Additional rounds of bit mixing to apply for Bloom bit distributions.
FNV_1A_B unbound task given a
return FNV_MIX( FNV_MULTIPLY( a ) )
FNV multiplication, essentially a * 16777619
but done in such a way it doesn't cascade into a float and lose precision.
FNV_MULTIPLY unbound task given a
return a + (a lsh 1) + (a lsh 4) + (a lsh 7) + (a lsh 8) + (a lsh 24)
FNV bit mixer, see https://web.archive.org/web/20131019013225/http://home.comcast.net/~bretm/hash/6.html
FNV_MIX unbound task given a
set
a + (self lsh 13)
a xorb (self ursh 7)
a + (self lsh 3)
a xorb (self ursh 17)
a + (self lsh 5)
a andb 0xffffffff
return a
Encode a 32 bit unsigned integer as a five character string. Characters from ASCII 35 to 124 are used.
Encode32to5 unbound task as i
set v ''
count 5
set p to i % 90
set v + ~String.fromCharCode(35+p)
set i to (self-p)/90
return v
Decode a 32 bit unsigned integer from five characters in a string, encoded as above.
Decode32to5 unbound task as s, pos
set pos ? 0
set i 0
set j 1
count 5
set i+(s.charCodeAt(pos+counter)-35)*j
set j*90
return i
License
MIT License.
Copyright 2018, Sean M Puckett
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.