cvm-lib
v0.1.1
Published
Estimate the number of distinct values in a set using the simple and space-efficient CVM algorithm
Downloads
21
Maintainers
Readme
CVM Library
Estimate the number of distinct values in a set using the simple and space-efficient CVM algorithm.
Getting Started
Install
NPM:
npm install cvm-lib
Yarn:
yarn add cvm-lib
JSR:
jsr add @rojas/cvm
Examples
See the examples/ directory for all examples.
Hamlet
Estimate unique words in Shakespeare's Hamlet:
node ./examples/hamlet/index.js
- Total words: 31991
- CVM capacity: 2161
- Expected uniques: 4762 ± 10.00%
- Estimated uniques: 4728 (-0.71%)
1M
Estimate unique integers among 1 million random integers.
node ./examples/hamlet/index.js
- Total values: 1000000
- CVM capacity: 10631
- Expected uniques: 994384 ± 5.00%
- Estimated uniques: 996480 (0.21%)
API
Functions
calculateCapacity(n, epsilon?, delta?)
Calculates the space required to estimate the number of distinct values in a set with a given accuracy and confidence.
n
: The total number of values in the set, or an estimate if unknown. Must be a positive number.epsilon
(optional): An estimate's relative error. Controls accuracy. Must be between 0 and 1. Defaults to0.05
.delta
(optional): The probability an estimate is not accurate. Controls confidence. Must be between 0 and 1. Defaults to0.01
.
Classes
Estimator<T>
Estimates the number of distinct values in a set using the CVM algorithm.
Constructors
new (capacity)
: Create an instance with a given capacity. Must be a positive integer.new (config)
: Create an instance using a config object.
Properties
capacity
: Gets the maximum number of samples in memory.randomFn
: Gets or sets the random number generator function (e.g.Math.random
).sampleRate
Gets the base sample rate (e.g.0.5
).size
: Gets the number of samples in memory.
Methods
add(value)
: Adds a value.clear()
: Clears/resets the instance.estimate()
: Gets the estimated number of distinct values.
Interfaces
EstimatorConfig<T>
A configuration object used to create Estimator
instances.
capacity
: The maximum number of samples in memory. Must be a positive integer.randomFn
(optional): The random number generator function. Should return random or pseudorandom values between 0 and 1.sampleRate
(optional): The sampling rate for managing samples. Must be between 0 and 1.- Note: Custom values may negatively affect accuracy. In general, the further from
0.5
, the more it's affected. Ifcapacity
was calculated viacalculateCapacity
, expected accuracy / confidence may be invalidated.
- Note: Custom values may negatively affect accuracy. In general, the further from
storage
(optional): An object that implementsSampleSet
for storing samples.
SampleSet<T>
Represents a generic set for storing samples.
size
: The number of values in the set.add(value)
: Adds a value to the set.clear()
: Clears all values from the set.delete(value)
: Removes a specified value from the set.[Symbol.iterator]()
: Iterates through the set's values.
Community and Support
Contributions are welcome!
Questions / Dicussions: Please contact us via GitHub discussions.
Bug Reports: Please use the GitHub issue tracker to report any bugs. Include a detailed description and any relevant code snippets or logs.
Feature Requests: Please submit feature requests as issues, clearly describing the feature and its potential benefits.
Pull Requests: Please ensure your code adheres to the existing style of the project and include any necessary tests and documentation.
For more information, check out the contributor's guide.
Build
- Clone the project from github
git clone [email protected]:havelessbemore/cvm-lib.git
cd cvm-lib
- Install dependencies
npm install
- Build the project
npm run build
This will output ECMAScript (.mjs
) and CommonJS (.cjs
) modules in the dist/
directory.
Format
To run the code linter:
npm run lint
To automatically fix linting issues, run:
npm run format
Test
To run tests:
npm test
To run tests with a coverage report:
npm run test:coverage
A coverage report is generated at ./coverage/index.html
.
References
- Source paper: Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book
- Notes by Donald Knuth: Knuth, D. E. (2023). The CVM Algorithm for Estimating Distinct Elements in Streams. Stanford Computer Science Department.
- Wikipedia: CVM Algorithm.
- High-level summary: Nadis, S. (2024, May 16). Computer Scientists Invent an Efficient New Way to Count. Quanta Magazine..