npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

@iebh/dedupe-sweep

v2.0.0

Published

Deduplicate reference libraries using the sweep method

Downloads

11

Readme

IEBH/Dedupe-Sweep

Deduplicate reference libraries using the sweep method.

This library is intended to be used with Reflib compatible references.

// Simple example with an array of references
var Dedupe = require('@iebh/dedupe-sweep');

(new Dedupe())
	.set('strategy', 'doiOnly')
	.run([
		{doi: 'https://doi.org/10.1000/182'},
		{doi: '10.1000/182'},
	])
	.then(deduped => { /* ... */ })
// More complex example reading in a reference library with RefLib, deduping it and saving as another file
var Dedupe = require('@iebh/dedupe-sweep');
var reflib = require('reflib');

// Read in the library
var refs = await reflib.promises.parseFile('my-large-reference-library.xml');

// Dedupe
var deduper = new Dedupe()
deduper.set('strategy', 'clark')
var dedupedRefs = await deduper.run(refs);

// Save the deduped library
await reflib.promises.outputFile('my-large-reference-library-deduped.xml', dedupedRefs);

Testing

The various strategies within this project are tested using the Systematic Reviews Data Sets for Testing Automation Tools by Beller et. al and are available in the test/data directory.

Tests can be run via npm test or mocha. See the test directory for more information on specifics.

Testing statistics are based on the methodology from Evaluating automated deduplication tools: protocol by Hair et. al.

API

Constructor: Dedupe(options)

Returns a Dedupe class which extends a basic EventEmitter.

Dedupe.settings

Object storing all local settings for the class.

| Setting | Type | Default | Description | |-------------------|-------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------| | stratergy | string | 'clark' | The stratergy to use on the next run() | | validateStratergy | boolean | true | Validate the strategy before beginning, only disable this if you are sure the strategy is valid | | action | string | '0' | The action to take when detecting a duplicate. ENUM: ACTIONS | | actionField | string | 'dedupe' | The field to use with actions | | threshold | number | 0.1 | Floating value (between 0 and 1) when marking or deleting refs automatically | | markOk | string / function | 'OK' | String value to set the action field to when actionField=='mark' and the ref is a non-dupe, if a function it is called as (ref) | | markDupe | string / function | 'DUPE' | String value to set the action field to when actionField=='mark' and the ref is a dupe, if a function it is called as (ref) | | dupeRef | string | 0 | How to refer to other refs when actionfield=='stats'. ENUM: DUPEREF | | fieldWeight | number | 0 | How to calculate duplication score. ENUM: FIELDWEIGHT | | markOriginal | boolean | false | Whether to mark the original as a duplicate or not |

Static: Dedupe.ACTIONS

Actions to take when detecting duplicates

| Value | Setting | Description | |-------|------------|-------------------------------------------------------------------------------------------------------------------------------------------| | 0 | 'STATS' | Add the field field in Dedupe.settings.actionField with the deduplicate chance to the input | | 1 | 'MARK' | Set the field in Dedupe.settings.actionField to Dedupe.settings.mark{Ok,Dupe} depending on duplicate status but leave input unchanged | | 2 | 'DELETE' | Remove duplicates from input and return sliced output |

Static: Dedupe.DUPEREF

How to refer to other references.

| Value | Setting | Description | |-------|---------------|--------------------------------------------------------------| | 0 | 'INDEX' | Refer to other references by their offset in the input array | | 1 | 'RECNUMBER' | Refer to other references by their recnumber field |

Static: Dedupe.FIELDWEIGHT

How to refer to other references.

| Value | Setting | Description | |-------|---------------|-----------------------------------------------------------| | 0 | 'MINIMUM' | Calculate duplication score based on minumum field score | | 1 | 'AVERAGE' | Calculate duplication score based on average field score |

Dedupe.comparisons

A lookup object of comparison functions used within strategies.

Each comparison is made up of:

| Setting | Type | Description | |---------------|----------|--------------------------------------------------------------------------------------------------------| | key | string | Internal short name of the comparison in camelCase | | title | string | Human friendly title of the comparison | | description | string | Longer description of what the comparison does | | handler | function | Function, called as (a, b) for fields which is expected to return a floating value of duplicate-ness |

Dedupe.mutators

A lookup object of field mutators used within strategies.

Each mutator is made up of:

| Setting | Type | Description | |---------------|----------|-----------------------------------------------------------------------------| | key | string | Internal short name of the mutator in camelCase | | title | string | Human friendly title of the mutator | | description | string | Longer description of what the mutator does | | handler | function | Function, called as (value) which is expected to return the mutated input |

Static: Dedupe.strategies

A lookup object of strategies.

Each strategy is made up of:

| Setting | Type | Description | |-----------------|------------|-------------------------------------------------------------------------------| | key | string | Internal short name of the strategy in camelCase | | title | string | Human friendly title of the strategy | | description | string | Longer description of the strategy | | mutators | object | List of fields which will be mutated and how, prior to the strategy being run | | steps | array | Array of steps to take when running the strategy |

Dedupe.set(option, value)

Convenience function to quickly set a single option, or merge an object of options. Returns the original Dedupe instance.

Dedupe.run(input)

Takes an array of input references applying the action specified in Dedupe.settings.action. Returns a promise.

Strategies

This module includes a selection of deduplication strategies which are basic JavaScript objects which detail steps to take to detect reference duplication.

Each strategy should include a title, description, optional mutations and a collection of steps to perform.

A simple example of the DOI only strategy:

module.exports = {
	title: 'DOI only',
	description: 'Compare references against DOI fields only',
	mutations: {
		doi: 'doiRewrite',
	},
	steps: [
		{
			fields: ['doi'],
			comparison: 'exact',
		},
	],
};

Strategy format:

| Path | Type | Default | Description | |----------------------|-----------|---------|-------------------------------------------------------------------------------------------| | title | string | | The short human-readable title of the strategy | | description | string | | A longer, HTML compatible description of the strategy | | mutators | object | | An object of the reference properties to mutate prior to processing, each value should be a known mutator | | steps | array | | A collection of steps for the deduplication process | | steps.skipOmitted | boolean | true | Skip field comparison where either side is not specified | | steps[].fields | array | | An array of strings, each value should correspond to a known reference field | | steps[].comparison | string | | The comparison method to use in this step, should correspond to a known comparison method |