@iebh/dedupe-sweep
v2.0.1
Published
Deduplicate reference libraries using the sweep method
Readme
IEBH/Dedupe-Sweep
Deduplicate reference libraries using the sweep method.
This library is intended to be used with Reflib compatible references.
// Simple example with an array of references
var Dedupe = require('@iebh/dedupe-sweep');
(new Dedupe())
.set('strategy', 'doiOnly')
.run([
{doi: 'https://doi.org/10.1000/182'},
{doi: '10.1000/182'},
])
.then(deduped => { /* ... */ })// More complex example reading in a reference library with RefLib, deduping it and saving as another file
var Dedupe = require('@iebh/dedupe-sweep');
var reflib = require('reflib');
// Read in the library
var refs = await reflib.promises.parseFile('my-large-reference-library.xml');
// Dedupe
var deduper = new Dedupe()
deduper.set('strategy', 'clark')
var dedupedRefs = await deduper.run(refs);
// Save the deduped library
await reflib.promises.outputFile('my-large-reference-library-deduped.xml', dedupedRefs);Testing
The various strategies within this project are tested using the Systematic Reviews Data Sets for Testing Automation Tools by Beller et. al and are available in the test/data directory.
Tests can be run via npm test or mocha. See the test directory for more information on specifics.
Testing statistics are based on the methodology from Evaluating automated deduplication tools: protocol by Hair et. al.
API
Constructor: Dedupe(options)
Returns a Dedupe class which extends a basic EventEmitter.
Dedupe.settings
Object storing all local settings for the class.
| Setting | Type | Default | Description |
|-------------------|-------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------|
| strategy | string | 'clark' | The strategy to use on the next run() |
| validateStrategy | boolean | true | Validate the strategy before beginning, only disable this if you are sure the strategy is valid |
| action | string | '0' | The action to take when detecting a duplicate. ENUM: ACTIONS |
| actionField | string | 'dedupe' | The field to use with actions |
| threshold | number | 0.1 | Floating value (between 0 and 1) when marking or deleting refs automatically |
| markOk | string / function | 'OK' | String value to set the action field to when actionField=='mark' and the ref is a non-dupe, if a function it is called as (ref) |
| markDupe | string / function | 'DUPE' | String value to set the action field to when actionField=='mark' and the ref is a dupe, if a function it is called as (ref) |
| dupeRef | string | 0 | How to refer to other refs when actionfield=='stats'. ENUM: DUPEREF |
| fieldWeight | number | 0 | How to calculate duplication score. ENUM: FIELDWEIGHT |
| markOriginal | boolean | false | Whether to mark the original as a duplicate or not |
Static: Dedupe.ACTIONS
Actions to take when detecting duplicates
| Value | Setting | Description |
|-------|------------|-------------------------------------------------------------------------------------------------------------------------------------------|
| 0 | 'STATS' | Add the field field in Dedupe.settings.actionField with the deduplicate chance to the input |
| 1 | 'MARK' | Set the field in Dedupe.settings.actionField to Dedupe.settings.mark{Ok,Dupe} depending on duplicate status but leave input unchanged |
| 2 | 'DELETE' | Remove duplicates from input and return sliced output |
Static: Dedupe.DUPEREF
How to refer to other references.
| Value | Setting | Description |
|-------|---------------|--------------------------------------------------------------|
| 0 | 'INDEX' | Refer to other references by their offset in the input array |
| 1 | 'RECNUMBER' | Refer to other references by their recnumber field |
Static: Dedupe.FIELDWEIGHT
How to refer to other references.
| Value | Setting | Description |
|-------|---------------|-----------------------------------------------------------|
| 0 | 'MINIMUM' | Calculate duplication score based on minimum field score |
| 1 | 'AVERAGE' | Calculate duplication score based on average field score |
Dedupe.comparisons
A lookup object of comparison functions used within strategies.
Each comparison is made up of:
| Setting | Type | Description |
|---------------|----------|--------------------------------------------------------------------------------------------------------|
| key | string | Internal short name of the comparison in camelCase |
| title | string | Human friendly title of the comparison |
| description | string | Longer description of what the comparison does |
| handler | function | Function, called as (a, b) for fields which is expected to return a floating value of duplicate-ness |
Dedupe.mutators
A lookup object of field mutators used within strategies.
Each mutator is made up of:
| Setting | Type | Description |
|---------------|----------|-----------------------------------------------------------------------------|
| key | string | Internal short name of the mutator in camelCase |
| title | string | Human friendly title of the mutator |
| description | string | Longer description of what the mutator does |
| handler | function | Function, called as (value) which is expected to return the mutated input |
Static: Dedupe.strategies
A lookup object of strategies.
Each strategy is made up of:
| Setting | Type | Description |
|-----------------|------------|-------------------------------------------------------------------------------|
| key | string | Internal short name of the strategy in camelCase |
| title | string | Human friendly title of the strategy |
| description | string | Longer description of the strategy |
| mutators | object | List of fields which will be mutated and how, prior to the strategy being run |
| steps | array | Array of steps to take when running the strategy |
Dedupe.set(option, value)
Convenience function to quickly set a single option, or merge an object of options. Returns the original Dedupe instance.
Dedupe.run(input)
Takes an array of input references applying the action specified in Dedupe.settings.action.
Returns a promise.
Strategies
This module includes a selection of deduplication strategies which are basic JavaScript objects which detail steps to take to detect reference duplication.
Each strategy should include a title, description, optional mutations and a collection of steps to perform.
A simple example of the DOI only strategy:
module.exports = {
title: 'DOI only',
description: 'Compare references against DOI fields only',
mutations: {
doi: 'doiRewrite',
},
steps: [
{
fields: ['doi'],
comparison: 'exact',
},
],
};Strategy format:
| Path | Type | Default | Description |
|----------------------|-----------|---------|-------------------------------------------------------------------------------------------|
| title | string | | The short human-readable title of the strategy |
| description | string | | A longer, HTML compatible description of the strategy |
| mutators | object | | An object of the reference properties to mutate prior to processing, each value should be a known mutator |
| steps | array | | A collection of steps for the deduplication process |
| steps.skipOmitted | boolean | true | Skip field comparison where either side is not specified |
| steps[].fields | array | | An array of strings, each value should correspond to a known reference field |
| steps[].comparison | string | | The comparison method to use in this step, should correspond to a known comparison method |
