hac

v1.0.7

Hierarchical agglomerative clustering

Downloads

35

HAC

HAC stands for Hierarchical Agglomerative Clustering, a common technique for unsupervised document clustering.

NOTICE: HAC depends on modules that are hosted on GitHub and not yet published on npm. It works fine with npm install, but it fails on Tonic (the "Try it out" feature on the npm website), which requires every module to be published on npm. A future release will publish these required modules on npm.

Installation

npm install hac --save

Usage

Instantiate

var HAC = require("hac");
var hac = new HAC();

Add documents

hac.addDocument(doc, id, class);

Arguments:

  • doc String/Array: the document to add, either a string of text or an array of terms
  • id String/int (optional): the id of the document. If omitted, a UUID is generated automatically
  • class String/int (optional): the class (or label) of this document. You probably won't need this, but if it is specified, you can use getMeasure() to get the F measure or Rand Index and see clustering performance.

Clustering

hac.cluster(clusterMethod);

Arguments:

  • clusterMethod Class Method: the clustering algorithm to use. Available options are as follows:
    • HAC.GA: group-average agglomerative clustering
    • HAC.SingleLink: single link clustering
    • HAC.CompleteLink: complete link clustering
    • HAC.Centroid: centroid clustering (not yet implemented)
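To make the difference between these options concrete, here is a minimal, self-contained sketch of the three implemented linkage criteria and a naive merge loop. It is an illustration under stated assumptions, not the package's internal code: the function names (cosine, singleLink, completeLink, groupAverage, agglomerate) are hypothetical, documents are assumed to be plain numeric vectors, and similarity is assumed to be cosine similarity. Group average is written using one common textbook definition (mean similarity over all pairs of distinct members of the merged cluster).

```javascript
// Illustrative sketch only; this is not the hac package's internal code.

// Cosine similarity between two equal-length numeric vectors.
function cosine(a, b) {
    var dot = 0, na = 0, nb = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return (na === 0 || nb === 0) ? 0 : dot / Math.sqrt(na * nb);
}

// Similarities of every pair (u, v) with u from c1 and v from c2.
function crossSims(c1, c2) {
    var sims = [];
    c1.forEach(function (u) {
        c2.forEach(function (v) { sims.push(cosine(u, v)); });
    });
    return sims;
}

// Single link: similarity of the two MOST similar members.
function singleLink(c1, c2) {
    return Math.max.apply(null, crossSims(c1, c2));
}

// Complete link: similarity of the two LEAST similar members.
function completeLink(c1, c2) {
    return Math.min.apply(null, crossSims(c1, c2));
}

// Group average: mean similarity over all pairs of distinct members
// of the merged cluster (within-cluster pairs included).
function groupAverage(c1, c2) {
    var all = c1.concat(c2), sum = 0, pairs = 0;
    for (var i = 0; i < all.length; i++) {
        for (var j = i + 1; j < all.length; j++) {
            sum += cosine(all[i], all[j]);
            pairs++;
        }
    }
    return sum / pairs;
}

// Naive agglomerative loop: start with one cluster per vector and keep
// merging the most similar pair of clusters until k clusters remain.
function agglomerate(vectors, k, linkage) {
    var clusters = vectors.map(function (v) { return [v]; });
    while (clusters.length > Math.max(k, 1)) {
        var best = { sim: -Infinity, i: -1, j: -1 };
        for (var i = 0; i < clusters.length; i++) {
            for (var j = i + 1; j < clusters.length; j++) {
                var sim = linkage(clusters[i], clusters[j]);
                if (sim > best.sim) { best = { sim: sim, i: i, j: j }; }
            }
        }
        var merged = clusters[best.i].concat(clusters[best.j]);
        clusters.splice(best.j, 1);         // remove j first (j > i) so index i stays valid
        clusters.splice(best.i, 1, merged); // replace cluster i with the merged cluster
    }
    return clusters;
}
```

Calling agglomerate(vectors, 2, singleLink) returns two arrays of vectors. Swapping in completeLink or groupAverage changes only how the similarity between two candidate clusters is scored, which is the choice the clusterMethod argument expresses.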

Get clustering result

var clusters = hac.getClusters(k, fields);

Arguments:

  • k int: the number of clusters
  • fields Array: array of document fields that you want in the final clustering result. Available fields are as follows:
    • "id": the id of the document
    • "class": the class (label) of the document, if specified when calling addDocument()
    • "content": string of document content
    • "terms": document content represented as array of terms
    • "tfs": array of term frequencies for this document
    • "vector": vector representation of this document
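As a rough illustration of how the "terms", "tfs" and "vector" fields relate to each other, the sketch below counts raw term frequencies and lays them out over a shared vocabulary. The helper names (termFrequencies, buildVocabulary, toVector) are hypothetical, and the package may weight or order its vectors differently (for example with tf-idf); treat this as one plausible reading, not the package's method.

```javascript
// Illustrative sketch only; the hac package may build its vectors differently.

// Count how often each term occurs in one document (given as an array of terms).
function termFrequencies(terms) {
    var tf = Object.create(null); // no prototype, so terms like "constructor" are safe keys
    terms.forEach(function (t) { tf[t] = (tf[t] || 0) + 1; });
    return tf;
}

// Sorted list of the distinct terms across all documents.
function buildVocabulary(docs) {
    var seen = Object.create(null);
    docs.forEach(function (terms) {
        terms.forEach(function (t) { seen[t] = true; });
    });
    return Object.keys(seen).sort();
}

// Represent one document as raw term counts, one slot per vocabulary term.
function toVector(terms, vocabulary) {
    var tf = termFrequencies(terms);
    return vocabulary.map(function (t) { return tf[t] || 0; });
}
```

With every document mapped onto the same vocabulary, all vectors have the same length and can be compared directly with a similarity measure.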

Alternatively, you can use the following method to get clusters with cluster labels:

var clusters = hac.getClustersWithLabels(k, fields, featureCount, featureMethod);

Cluster labeling relies on feature selection, which is provided by a module called FeatureSelector.

Arguments:

  • k int: number of clusters.
  • fields Array: array of fields. See the description of getClusters() above
  • featureCount int: the number of feature terms that you want for each cluster
  • featureMethod Class Method: the feature selection algorithm to use. Available options are as follows:
    • FeatureSelector.MI: Expected Mutual Information feature selection
    • FeatureSelector.LLR: Likelihood Ratio feature selection
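For readers unfamiliar with mutual-information feature selection, the sketch below scores how strongly one term is associated with one cluster, using the standard textbook formula over a 2x2 table of document counts. The function name mutualInformation and its parameter names are hypothetical, and FeatureSelector.MI may differ in its details; this only shows the general idea that a labeling step could rank terms by such a score and keep the top featureCount terms.

```javascript
// Illustrative sketch only; not the FeatureSelector module's actual code.
//
// n11: documents in the cluster that contain the term
// n10: documents outside the cluster that contain the term
// n01: documents in the cluster that do not contain the term
// n00: documents outside the cluster that do not contain the term
function mutualInformation(n11, n10, n01, n00) {
    var n = n11 + n10 + n01 + n00;

    // One summand of the formula: joint count plus its two marginals.
    function part(joint, termTotal, clusterTotal) {
        if (joint === 0) { return 0; } // 0 * log(0) is treated as 0
        return (joint / n) * Math.log(n * joint / (termTotal * clusterTotal)) / Math.LN2;
    }

    var withTerm = n11 + n10, withoutTerm = n01 + n00;
    var inCluster = n11 + n01, outCluster = n10 + n00;

    // Result is in bits because the logarithm is converted to base 2.
    return part(n11, withTerm, inCluster) +
           part(n10, withTerm, outCluster) +
           part(n01, withoutTerm, inCluster) +
           part(n00, withoutTerm, outCluster);
}
```

A term that occurs independently of cluster membership scores 0, while a term that appears in exactly the documents of one cluster scores highest, which makes it a good candidate label.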

Get performance measurement

You can get the F measure or Rand Index for the clustering result.

NOTE: if you want to see performance measurements, you must specify the class argument when calling addDocument(). Also, when calling getClusters() or getClustersWithLabels(), you must include the field "class" in the fields argument.

var measure = getMeasure(clusters, method, beta, showRawScore);

Arguments:

  • clusters Array: the clustering result that you get by calling getClusters() or getClustersWithLabels()
  • method Class Method: the measuring algorithm to use. Available options are as follows:
    • HAC.F: F measure
    • HAC.RI: Rand Index
  • beta int (optional): if you use HAC.F, you should provide a beta value, which should be an integer greater than or equal to 1
  • showRawScore boolean (optional): if set to true, prints the tp, fp, fn, tn, total negative and total positive counts to the console
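The tp, fp, fn and tn values are usually counted over pairs of documents: a pair is a true positive when both documents share a cluster and a class, a false positive when they share a cluster but not a class, and so on. The sketch below assumes those standard pair-counting definitions; the helper names (pairCounts, randIndex, fMeasure) and the input shape (each cluster given as an array of class labels) are hypothetical, and the package's own getMeasure() may compute its scores differently.

```javascript
// Illustrative sketch only; not the hac package's getMeasure() implementation.

// clusters: array of clusters, each an array of class labels (one per document).
function pairCounts(clusters) {
    var items = [];
    clusters.forEach(function (labels, clusterIndex) {
        labels.forEach(function (label) {
            items.push({ cluster: clusterIndex, label: label });
        });
    });
    var c = { tp: 0, fp: 0, fn: 0, tn: 0 };
    for (var i = 0; i < items.length; i++) {
        for (var j = i + 1; j < items.length; j++) {
            var sameCluster = items[i].cluster === items[j].cluster;
            var sameClass = items[i].label === items[j].label;
            if (sameCluster && sameClass) { c.tp++; }
            else if (sameCluster) { c.fp++; }
            else if (sameClass) { c.fn++; }
            else { c.tn++; }
        }
    }
    return c;
}

// Rand Index: fraction of document pairs on which clustering and classes agree.
function randIndex(c) {
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn);
}

// F measure: beta > 1 weights recall (penalizing false negatives) more heavily.
function fMeasure(c, beta) {
    if (c.tp === 0) { return 0; } // avoids 0/0 when no pair is a true positive
    var precision = c.tp / (c.tp + c.fp);
    var recall = c.tp / (c.tp + c.fn);
    var b2 = beta * beta;
    return (b2 + 1) * precision * recall / (b2 * precision + recall);
}
```

A call such as fMeasure(pairCounts([["a", "a", "b"], ["b", "b"]]), 1) scores a clustering of five labeled documents. Raising beta makes separating documents of the same class cost more than grouping documents of different classes.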

Complete example

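var HAC = require("hac");
// Assumption: `_` is lodash (underscore provides a compatible forEach).
var _ = require("lodash");

var hac = new HAC();
var docs = [];
// Documents can be pre-tokenized arrays of terms...
docs.push(["嗨", "你好"]); // "hi", "hello"
docs.push(["嗨", "很", "高興", "認識", "你"]); // "hi", "very", "glad", "to meet", "you"
// ...or plain strings of text.
docs.push("hello, how's everything today? is everything ok today?");
docs.push("let's test one more document!");
docs.push("documents are always not large enough");

// Use the array index as the document id.
for (var i = 0; i < docs.length; i++) {
    hac.addDocument(docs[i], i);
}
hac.cluster(HAC.GA);

// Ask for 2 clusters and include only the "id" and "content" fields.
var clusters = hac.getClusters(2, ["id", "content"]);
_.forEach(clusters, function(cluster) {
    console.log("cluster id: " + cluster.id);
    _.forEach(cluster.docs, function(doc) {
        console.log("doc id: " + doc.id);
        console.log("doc content: " + doc.content);
    });
    console.log();
});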

The result would be:

cluster id: 7
doc id: 0
doc content: 嗨,你好
doc id: 1
doc content: 嗨,很,高興,認識,你
doc id: 2
doc content: hello, how's everything today? is everything ok today?

cluster id: 6
doc id: 3
doc content: let's test one more document!
doc id: 4
doc content: documents are always not large enough

Release Notes

  • 1.0.7: update URLs of modules hosted on GitHub to a simpler form
  • 1.0.6: correct the require path of the heap module
  • 1.0.5: document the incompatibility with Tonic in the README
  • 1.0.4: require es6-shim to support older Node.js versions
  • 1.0.3: replace arrow functions with anonymous functions for backward compatibility
  • 1.0.2: minor changes to the README
  • 1.0.1: first release