address-deduplicator-stream
v1.0.4
A stream for deduplicating a stream of address Documents.
address deduplicator stream
A stream that deduplicates addresses using the OpenVenues deduplicator, which must be installed and running separately.
API
address-deduplicator-stream exports a single function, createDeduplicateStream( requestBatchSize, maxLiveRequests, serverUrl ), which accepts three optional arguments:
- requestBatchSize (default: 100): The number of addresses to buffer into a batch before sending it to the deduplicator. Larger batches reduce the total time spent making requests but increase memory consumption.
- maxLiveRequests (default: 10): The maximum number of unresolved concurrent requests at any time. Since the deduper is implemented as a standalone server and processes data more slowly than the importer feeds it, the stream needs to rate-limit itself: when this number is reached, the stream pauses reading until the number of concurrent requests falls below it.
- serverUrl (default: 'http://localhost:5000'): The HTTP base URL of the address deduplicator server.
The function returns a Transform stream that accepts un-deduplicated addresses and filters out the duplicates. Because of the heavy lifting involved, it will likely be the slowest part of your data pipeline. The addresses themselves are expected to be pelias/model Document objects.
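The way requestBatchSize and maxLiveRequests work together can be illustrated with a minimal sketch. This is not the package's actual implementation: createSketchStream and makeFakeDeduper are hypothetical names, the deduplicator server is replaced by an in-memory stand-in, and plain strings stand in for Document objects.

```javascript
'use strict';
const { Transform, Readable } = require('stream');

// Hypothetical stand-in for the HTTP request to the deduplicator server.
// It resolves asynchronously with the items of a batch not seen before.
function makeFakeDeduper() {
  const seen = new Set();
  return function dedupeBatch(batch) {
    return new Promise((resolve) => {
      setImmediate(() => {
        const unique = batch.filter((address) => {
          if (seen.has(address)) return false;
          seen.add(address);
          return true;
        });
        resolve(unique);
      });
    });
  };
}

// Sketch of a batching, self-rate-limiting Transform stream (assumed design).
function createSketchStream(requestBatchSize, maxLiveRequests, dedupeBatch) {
  let batch = [];
  let liveRequests = 0;
  let resume = null; // transform callback held back while the stream is paused
  let finish = null; // flush callback waiting for outstanding requests

  function send(stream, items) {
    liveRequests++;
    dedupeBatch(items)
      .then((unique) => {
        unique.forEach((item) => stream.push(item));
        liveRequests--;
        if (resume) {
          const cb = resume;
          resume = null;
          cb(); // a request slot is free again: resume reading
        }
        if (finish && liveRequests === 0) {
          const cb = finish;
          finish = null;
          cb(); // all requests resolved: let the stream end
        }
      })
      .catch((err) => stream.destroy(err));
  }

  return new Transform({
    objectMode: true,
    transform(address, _encoding, callback) {
      batch.push(address);
      if (batch.length < requestBatchSize) return callback();
      send(this, batch);
      batch = [];
      // Withholding the callback stops further reads until a request resolves.
      if (liveRequests >= maxLiveRequests) resume = callback;
      else callback();
    },
    flush(callback) {
      // Send whatever is left in a final, possibly smaller, batch.
      if (batch.length > 0) {
        send(this, batch);
        batch = [];
      }
      if (liveRequests === 0) callback();
      else finish = callback;
    }
  });
}

// Example: batches of 2 addresses, at most 2 unresolved requests at a time.
Readable.from(['a', 'b', 'a', 'c', 'b', 'd'])
  .pipe(createSketchStream(2, 2, makeFakeDeduper()))
  .on('data', (address) => console.log(address));
```

With the real package, the corresponding call would be createDeduplicateStream( 100, 10, 'http://localhost:5000' ), piped between the stream that produces Document objects and the stream that consumes the deduplicated ones.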