
feature-scaler

feature-scaler is a utility that transforms a list of arbitrary JavaScript objects into a normalized format suitable for feeding into machine learning algorithms. It can also decode encoded data back into its original format.

Motivation: I use Andrej Karpathy's excellent convnetjs library to experiment with neural networks in JavaScript and often have to preprocess my data before training a network. This utility makes it easy to encode data in a format usable by convnetjs.

"Why JavaScript?" is a fair question - Python's scikit-learn has most of the data preprocessing features you may need. I wrote this mainly because I wanted an easy way to use convnetjs without communicating across languages. If your data is big enough that convnetjs or the performance of the V8 engine in node.js is the limiting factor in your workflow, don't use JavaScript!

Field types currently supported: ints, floats, bools, and strings.

Check out tests/main.spec.js for a demo of this library in action.

In the following documentation, I'll use planetList as the example data set we're transforming. It looks like this:

const planetList = [
  { planet: 'mars', isGasGiant: false, value: 10 },
  { planet: 'saturn', isGasGiant: true, value: 20 },
  { planet: 'jupiter', isGasGiant: true, value: 30 }
]

The independent variables are planet and isGasGiant. The dependent variable is value.

encode(data, opts = { dataKeys, labelKeys })

  • data: list of raw data you need encoded. Assumption: all entries in this list have the same structure as the first entry. For example, if the first element in data has a boolean key called isGasGiant, then isGasGiant should be a boolean on every object in the list.
  • opts
  • opts.labelKeys - list of keys you are predicting values for (value).
  • opts.dataKeys (optional) - list of independent keys (planet, isGasGiant). If not provided, defaults to all keys except those in opts.labelKeys.

Example usage:

const dataKeys = ['planet', 'isGasGiant'];
const labelKeys = ['value'];
const encodedInfo = encode(planetList, { dataKeys, labelKeys });

// encodedInfo.data
[ [ 1, 0, 0, 0, -1 ], [ 0, 1, 0, 1, 0 ], [ 0, 0, 1, 1, 1 ] ]
// Note: as is the norm with machine learning algorithms,
// "label" data is at the end of each row.
// encodedInfo.data[0][4] === -1; the scaled label value for Mars.

// encodedInfo.decoders - can be treated as a black box
[
  { key: 'planet', type: 'string', offset: 3, lookupTable: ['mars','saturn','jupiter'] },
  { key: 'isGasGiant', type: 'boolean' },
  { key: 'value', type: 'number', mean: 20, std: 10 }
]

Each entry in the "decoders" list is metadata about one field of the original dataset. It contains the information needed to transform an encoded row back into the original { key: value } pairs. Your code should not modify this list; the only thing you should do with it is feed it back into decode, described below.

Note: encodedInfo can safely be serialized to JSON and saved for later use with JSON.stringify(encodedInfo).
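
A sketch of what that round trip might look like (decode is documented below):

// Persist the encoder output...
const saved = JSON.stringify(encodedInfo);

// ...and restore it later, e.g. in another process.
const restored = JSON.parse(saved);
const decoded = decode(restored.data, restored.decoders);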

decode(encodedData, decoders)

  • encodedData - the data array returned by encode
  • decoders - the decoders list returned by encode

It returns the list of data in its original format.
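
Continuing the planets example:

const decoded = decode(encodedInfo.data, encodedInfo.decoders);

// decoded is deep-equal to the original planetList:
// [ { planet: 'mars', isGasGiant: false, value: 10 }, ... ]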

decodeRow(encodedRow, decoders)

Similar to decode, but operates on a single row, e.g.

decodeRow(encodedData[0], decoders)
// deep-equals decode(encodedData, decoders)[0]

Technical details

The short version is this library encodes data in the following ways:

  • Number fields: (n - mean) / stddev
  • Boolean fields: n ? 1 : 0
  • String fields: one-hot encoding (see below).
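
A minimal sketch of the numeric and boolean rules (illustrative only; these helper names are not part of the library's API):

// Illustrative helpers - not feature-scaler's actual internals.
const standardizeNumber = (n, mean, std) => (n - mean) / std;
const encodeBoolean = (b) => (b ? 1 : 0);

standardizeNumber(10, 20, 10); // -1, the scaled `value` for mars
encodeBoolean(false);          // 0, mars is not a gas giant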

One-hot encoding

Standardizing numbers and booleans is easy, but categorical string data is a little trickier. In the example above, transforming ['mars', 'jupiter', 'saturn'] into a single numeric value falsely implies* there is an ordering to the underlying value. Suppose you had a variable that represented the weather; there is no logical ordering to ['rain', 'sun', 'overcast']. If we naively used a single numeric "weather" column where rain=0, sun=1, overcast=2, some machine learning algorithms would treat that field as "ordered".

Instead, we map each string to a list of binary values, with a single 1 marking that string's position in the lookup table. In the planets example (matching the encoded output and lookupTable shown above), we see the following encodings:

  • mars == [1, 0, 0]
  • saturn == [0, 1, 0]
  • jupiter == [0, 0, 1]

We can feed this into an arbitrary machine learning algorithm without the possibility of it (incorrectly) inferring an ordering to our data.
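
In code, the one-hot mapping amounts to a membership test against the decoder's lookupTable (a sketch, using the table shown earlier):

// Illustrative only: one-hot encoding against a lookup table.
const lookupTable = ['mars', 'saturn', 'jupiter'];
const oneHot = (value) => lookupTable.map((entry) => (entry === value ? 1 : 0));

oneHot('mars');   // [1, 0, 0]
oneHot('saturn'); // [0, 1, 0]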

* In our example, there is indeed an ordering to the planets! If the ordering of categorical data is important, add a calculated field to the data before encoding, e.g. a numberOfPlanetFromSun integer field on each record, as sketched below.
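
A sketch of that approach (positionFromSun is a hypothetical helper table; the values are the planets' actual positions from the sun):

// Add an explicit ordering field before encoding (illustrative).
const positionFromSun = { mars: 4, jupiter: 5, saturn: 6 };
const planetListWithOrder = planetList.map((record) => ({
  ...record,
  numberOfPlanetFromSun: positionFromSun[record.planet]
}));

encode(planetListWithOrder, { labelKeys: ['value'] });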

Further Reading

  • https://github.com/karpathy/convnetjs
  • http://cs231n.stanford.edu/ - Stanford neural network intro class
  • http://sebastianraschka.com/Articles/2014_about_feature_scaling.html - general motivation for feature scaling, from Sebastian Raschka
  • https://code-factor.blogspot.com/2012/10/one-hotone-of-k-data-encoder-for.html - one-hot encoding

Todo

  • Add support for decoding a single value (currently only decoding a whole row is supported)
  • Add support for unrolling nested objects
  • Add support for missing data
  • Currently it standardizes numeric values; perhaps add support for scaling numeric values to [0, 1].

Contributions welcome! Please include unit tests, and ensure both npm run test and npm run lint pass without warnings.