npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

string-comparison

v1.3.0

Published

A library implementing different string similarity

Downloads

113,804

Readme

string-comparison

npm bundle size npm GitHub stars GitHub license

JavaScript implementation of tdebatty/java-string-similarity

A library implementing different string similarity, distance and sortMatch measures. A dozen of algorithms (including Levenshtein edit distance and sibblings, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...

Download & Usage

download

npm install string-comparison --save
yarn add string-comparison
pnpm add string-comparison

usage

let stringComparison = require('string-comparison')
// or import stringComparison from 'string-comparison'

const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']

// use by cosine
let cos = stringComparison.Cosine

console.log(cos.similarity(Thanos, Rival))
console.log(cos.distance(Thanos, Rival))
console.log(cos.sortMatch(Thanos, Avengers))

OverView

The main characteristics of each implemented algorithm are presented below. The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length m and n respectively.

| | Measure(s) | Normalized? | Metric? | Type | Cost | Typical usage | | ------------------------------------------------------------------------------------------------------------------------------------ | --------------------------------------- | ----------- | ------- | ------- | ------ | --------------- | | Jaccard index | similaritydistancesortMatch | Yes | Yes | Set | O(m+n) | | | Cosine similarity | similaritydistancesortMatch | Yes | No | Profile | O(m+n) | | | Sorensen-Dice coefficient | similaritydistancesortMatch | Yes | No | Set | O(m+n) | | | Levenshtein | similaritydistancesortMatch | No | Yes | | O(mn) | | | Jaro-Winkler | similarity distancesortMatch | Yes | No | | O(mn) | typo correction |

Normalized, metric, similarity and distance

Although the topic might seem simple, a lot of different algorithms exist to measure text similarity or distance. Therefore the library defines some interfaces to categorize them.

(Normalized) similarity and distance

  • StringSimilarity : Implementing algorithms define a similarity between strings (0 means strings are completely different).
  • NormalizedStringSimilarity : Implementing algorithms define a similarity between 0.0 and 1.0, like Jaro-Winkler for example.
  • StringDistance : Implementing algorithms define a distance between strings (0 means strings are identical), like Levenshtein for example. The maximum distance value depends on the algorithm.
  • NormalizedStringDistance : This interface extends StringDistance. For implementing classes, the computed distance value is between 0.0 and 1.0. NormalizedLevenshtein is an example of NormalizedStringDistance.

Levenshtein

The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

It is a metric string distance. This implementation uses dynamic programming (Wagner–Fischer algorithm), with only 2 rows of data. The space requirement is thus O(m) and the algorithm runs in O(m.n).

import { levenshtein } from "string-comparison"
import type {SortMatchResultType} from "string-comparison"

const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']

console.log(levenshtein.similarity(Thanos, Rival))
console.log(levenshtein.distance(Thanos, Rival))
console.log(levenshtein.sortMatch(Thanos, Avengers) as SortMatchResultType)

// output
0.8333333333333334
1
[
  { member: 'edward', index: 0, rating: 0.16666666666666663 },
  { member: 'theatre', index: 2, rating: 0.4285714285714286 },
  { member: 'sealed', index: 1, rating: 0.8333333333333334 }
]

Longest Common Subsequence

The longest common subsequence (LCS) problem consists in finding the longest subsequence common to two (or more) sequences. It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.

It is used by the diff utility, by Git for reconciling multiple changes, etc.

The LCS distance between strings X (of length n) and Y (of length m) is n + m - 2 |LCS(X, Y)| min = 0 max = n + m

LCS distance is equivalent to Levenshtein distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion.

This class implements the dynamic programming approach, which has a space requirement O(m.n), and computation cost O(m.n).

In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m).log(n)). But the algorithm has a memory requirement O(m.n²) and was thus not implemented here.

import { longestCommonSubsequence } from "string-comparison"
or 
import { lcs } from "string-comparison"


const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']

console.log(lcs.similarity(Thanos, Rival))
console.log(lcs.distance(Thanos, Rival))
console.log(lcs.sortMatch(Thanos, Avengers))

// output
0.8333333333333334
2
[
  { member: 'edward', index: 0, rating: 0.5 },
  { member: 'theatre', index: 2, rating: 0.6153846153846154 },
  { member: 'sealed', index: 1, rating: 0.8333333333333334 }
]

Metric Longest Common Subsequence

Distance metric based on Longest Common Subsequence, from the notes "An LCS-based string metric" by Daniel Bakkelund. http://heim.ifi.uio.no/~danielry/StringMetric.pdf

The distance is computed as 1 - |LCS(s1, s2)| / max(|s1|, |s2|)

import { metricLcs } from "string-comparison"
or 
import { mlcs } from "string-comparison"

const Thanos = 'healed'
const Rival = 'sealed'
const Avengers = ['edward', 'sealed', 'theatre']

console.log(metricLcs.similarity(Thanos, Rival))
console.log(metricLcs.distance(Thanos, Rival))
console.log(metricLcs.sortMatch(Thanos, Avengers))

// output
0.8333333333333334
0.16666666666666663
[
  { member: 'edward', index: 0, rating: 0.5 },
  { member: 'theatre', index: 2, rating: 0.5714285714285714 },
  { member: 'sealed', index: 1, rating: 0.8333333333333334 }
]

Cosine similarity

Like Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the cardinality of each n-gram is not taken into account. Each input string is simply a set of n-grams. The Jaccard index is then computed as |V1 inter V2| / |V1 union V2|.

Distance is computed as 1 - similarity. Jaccard index is a metric distance.

import { cosine } from "string-comparison"

Sorensen-Dice coefficient

Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 inter V2| / (|V1| + |V2|).

Distance is computed as 1 - similarity.

import { diceCoefficient } from "string-comparison"

Jaro-Winkler similarity

The Jaro-Winkler similarity is a string metric measuring edit distance between two strings. Jaro – Winkler Similarity is much similar to Jaro Similarity. They both differ when the prefix of two string match. Jaro – Winkler Similarity uses a prefix scale ‘p’ which gives a more accurate answer when the strings have a common prefix up to a defined maximum length l.

import { jaroWinkler } from "string-comparison"

API

  • cosine
  • diceCoefficient
  • jaccardIndex
  • levenshtein
  • lcs = longestCommonSubsequence
  • mlcs = metricLcs
  • jaroWinkler

Methods

  • similarity.
  • distance.
  • sortMatch

similarity

Implementing algorithms define a similarity between strings

params

  1. thanos [String]
  2. rival [String]

return

Return a similarity between 0.0 and 1.0

distance

Implementing algorithms define a distance between strings (0 means strings are identical)

params

  1. thanos [String]
  2. rival [String]

return

Return a number

sortMatch

params

  1. thanos [String]
  2. avengers [...String]

return

Return an array of objects - SortMatchResultType ex:

[
  { member: 'edward', rating: 0.16666666666666663 },
  { member: 'theatre', rating: 0.4285714285714286 },
  { member: 'mailed', rating: 0.5 },
  { member: 'sealed', rating: 0.8333333333333334 }
]

CHANGELOG

CHANGELOG

MIT

MIT