@jsonstack/data
v2.0.1
Published
jsonstack data is a TypeScript module with simple and efficient tools for data mining and data analysis in TypeScript. When jsonstack data used with ML.js, pandas-js, and numjs, you're left with the equivalent R/Python tool set in TypeScript.
Downloads
784
Maintainers
Readme
@modelx/data
quickly generate UMDs and other module types with rollup and typescript
Getting started
Clone the repo and drop your module in the src directory.
# Install Prerequisites
$ npm install rollup typedoc jest sitedown --g
Basic Usage
$ npm run build #builds type declarations, created bundled artifacts with rollup and generates documenation
Description
ModelScript is a javascript module with simple and efficient tools for data mining and data analysis in JavaScript. ModelScript can be used with ML.js, pandas-js, and numjs, to approximate the equivalent R/Python tool chain in JavaScript.
In Python, data preparation is typically done in a DataFrame, ModelScript encourages a more R like workflow where the data preparation is in it's native structure.
Installation
$ npm i modelscript
Full Documentation
Usage (basic)
ModelScript is an EcmaScript module and designed to be imported in an ES2015+ environment. In order to use in older environment, please use const modelscript = require('modelscript/build/modelscript.cjs.js')
for older versions of node and <script type="text/javascript" src=".../path/to/.../modelscript/build/modelscript.umd.js"/>
"modelscript" : {
ml:{ //see https://github.com/mljs/ml
UpperConfidenceBound [Class: UpperConfidenceBound]{ // Implementation of the Upper Confidence Bound algorithm
predict(), //returns next action based off of the upper confidence bound
learn(), //single step training method
train(), //training method for upper confidence bound calculations
},
ThompsonSampling [Class: ThompsonSampling]{ //Implementation of the Thompson Sampling algorithm
predict(), //returns next action based off of the thompson sampling
learn(), //single step training method
train(), //training method for thompson sampling calculations
},
},
nlp:{ //see https://github.com/NaturalNode/natural
ColumnVectorizer [Class: ColumnVectorizer]{ //class creating sparse matrices from a corpus
get_tokens(), // Returns a distinct array of all tokens after fit_transform
get_vector_array(), //Returns array of arrays of strings for dependent features from sparse matrix word map
fit_transform(options), //Fits and transforms data by creating column vectors (a sparse matrix where each row has every word in the corpus as a column and the count of appearances in the corpus)
get_limited_features(options), //Returns limited sets of dependent features or all dependent features sorted by word count
evaluateString(testString), //returns word map with counts
evaluate(testString), //returns new matrix of words with counts in columns
}
},
csv:{
loadCSV: [Function: loadCSV], //asynchronously loads CSVs, either a filepath or a remote URI
loadTSV: [Function: loadTSV], //asynchronously loads TSVs, either a filepath or a remote URI
},
model_selection: {
train_test_split: [Function: train_test_split], // splits data into training and testing sets
cross_validation_split: [Function: kfolds], //splits data into k-folds
cross_validate_score: [Function: cross_validate_score],//test model variance and bias
grid_search: [Function: grid_search], // tune models with grid search for optimal performance
},
DataSet [Class: DataSet]: { //class for manipulating an array of objects (typically from CSV data)
columnMatrix(vectors), //returns a matrix of values by combining column arrays into a matrix
columnArray(columnName, options), // - returns a new array of a selected column from an array of objects, can filter, scale and replace values
columnReplace(columnName, options), // - returns a new array of a selected column from an array of objects and replaces empty values, encodes values and scales values
columnScale(columnName, options), // - returns a new array of scaled values which can be reverse (descaled). The scaling transformations are stored on the DataSet
columnDescale(columnName, options), // - Returns a new array of descaled values
selectColumns(columns, options), //returns a list of objects with only selected columns as properties
labelEncoder(columnName, options), // - returns a new array and label encodes a selected column
labelDecode(columnName, options), // - returns a new array and decodes an encoded column back to the original array values
oneHotEncoder(columnName, options), // - returns a new object of one hot encoded values
columnMatrix(columnName, options), // - returns a matrix of values from multiple columns
columnReducer(newColumnName, options), // - returns a new array of a selected column that is passed a reducer function, this is used to create new columns for aggregate statistics
columnMerge(name, data), // - returns a new column that is merged onto the data set
filterColumn(options), // - filtered rows of data,
fitColumns(options), // - mutates data property of DataSet by replacing multiple columns in a single command
static reverseColumnMatrix(options), // returns an array of objects by applying labels to matrix of columns
static reverseColumnVector(options), // returns an array of objects by applying labels to column vector
},
calc:{
getTransactions: [Function getTransactions], // Formats an array of transactions into a sparse matrix like format for Apriori/Eclat
assocationRuleLearning: [async Function assocationRuleLearning], // returns association rule learning results using apriori
},
util: {
range: [Function], // range helper function
rangeRight: [Function], //range right helper function
scale: [Function: scale], //scale / normalize data
avg: [Function: arithmeticMean], // aritmatic mean
mean: [Function: arithmeticMean], // aritmatic mean
sum: [Function: sum],
max: [Function: max],
min: [Function: min],
sd: [Function: standardDeviation], // standard deviation
StandardScalerTransforms: [Function: StandardScalerTransforms], // returns two functions that can standard scale new inputs and reverse scale new outputs
MinMaxScalerTransforms: [Function: MinMaxScalerTransforms], // returns two functions that can mix max scale new inputs and reverse scale new outputs
StandardScaler: [Function: StandardScaler], // standardization (z-scores)
MinMaxScaler: [Function: MinMaxScaler], // min-max scaling
ExpScaler: [Function: ExpScaler], // exponent scaling
LogScaler: [Function: LogScaler], // natual log scaling
squaredDifference: [Function: squaredDifference], // Returns an array of the squared different of two arrays
standardError: [Function: standardError], // The standard error of the estimate is a measure of the accuracy of predictions made with a regression line
coefficientOfDetermination: [Function: coefficientOfDetermination],
adjustedCoefficentOfDetermination: [Function: adjustedCoefficentOfDetermination],
adjustedRSquared: [Function: adjustedCoefficentOfDetermination],
rBarSquared: [Function: adjustedCoefficentOfDetermination],
r: [Function: coefficientOfCorrelation],
coefficientOfCorrelation: [Function: coefficientOfCorrelation],
rSquared: [Function: rSquared], //r^2
pivotVector: [Function: pivotVector], // returns an array of vectors as an array of arrays
pivotArrays: [Function: pivotArrays], // returns a matrix of values by combining arrays into a matrix
standardScore: [Function: standardScore], // Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
zScore: [Function: standardScore], // alias for standardScore.
approximateZPercentile: [Function: approximateZPercentile], // approximate the p value from a z score
},
preprocessing: {
DataSet: [Class DataSet],
},
}
Examples (JavaScript / Python / R)
Loading CSV Data
Javascript
import { default as jsk } from 'modelscript';
let dataset;
//In JavaScript, by default most I/O Operations are asynchronous, see the notes section for more
ms.loadCSV('/some/file/path.csv')
.then(csvData=>{
dataset = new ms.DataSet(csvData);
console.log({csvData});
/* csvData [{
'Country': 'Brazil',
'Age': '44',
'Salary': '72000',
'Purchased': 'N',
},
...
{
'Country': 'Mexico',
'Age': '27',
'Salary': '48000',
'Purchased': 'Yes',
}] */
})
.catch(console.error);
// or from URL
ms.loadCSV('https://example.com/some/file/path.csv')
Python
import pandas as pd
#Importing the dataset
dataset = pd.read_csv('/some/file/path.csv')
R
# Importingd the dataset
dataset = read.csv('Data.csv')
Handling Missing Data
Javascript
//column Array returns column of data by name
// [ '44','27','30','38','40','35','','48','50', '37' ]
const OringalAgeColumn = dataset.columnArray('Age');
//column Replace returns new Array with replaced missing data
//[ '44','27','30','38','40','35',38.77777777777778,'48','50','37' ]
const ReplacedAgeMeanColumn = dataset.columnReplace('Age',{strategy:'mean'});
//fit Columns, mutates dataset
dataset.fitColumns({
columns:[{name:'Age',strategy:'mean'}]
});
/*
dataset
class DataSet
data:[
{
'Country': 'Brazil',
'Age': '38.77777777777778',
'Salary': '72000',
'Purchased': 'N',
}
...
]
*/
Python
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of of missing data
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy = 'mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
R
# Taking care of the missing data
dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age,FUN = function(x) mean(x,na.rm =TRUE)),
dataset$Age)
One Hot Encoding and Label Encoding
Javascript
// [ 'Brazil','Mexico','Ghana','Mexico','Ghana','Brazil','Mexico','Brazil','Ghana', 'Brazil' ]
const originalCountry = dataset.columnArray('Country');
/*
{ originalCountry:
{ Country_Brazil: [ 1, 0, 0, 0, 0, 1, 0, 1, 0, 1 ],
Country_Mexico: [ 0, 1, 0, 1, 0, 0, 1, 0, 0, 0 ],
Country_Ghana: [ 0, 0, 1, 0, 1, 0, 0, 0, 1, 0 ] },
}
*/
const oneHotCountryColumn = dataset.oneHotEncoder('Country');
// [ 'N', 'Yes', 'No', 'f', 'Yes', 'Yes', 'false', 'Yes', 'No', 'Yes' ]
const originalPurchasedColumn = dataset.labelEncoder('Purchased');
// [ 0, 1, 0, 0, 1, 1, 1, 1, 0, 1 ]
const encodedBinaryPurchasedColumn = dataset.labelEncoder('Purchased',{ binary:true });
// [ 0, 1, 2, 3, 1, 1, 4, 1, 2, 1 ]
const encodedPurchasedColumn = dataset.labelEncoder('Purchased');
// [ 'N', 'Yes', 'No', 'f', 'Yes', 'Yes', 'false', 'Yes', 'No', 'Yes' ]
const decodedPurchased = dataset.labelDecode('Purchased', { data: encodedPurchasedColumn, });
//fit Columns, mutates dataset
dataset.fitColumns({
columns:[
{
name: 'Purchased',
options: {
strategy: 'label',
labelOptions: {
binary: true,
},
},
},
{
name: 'Country',
options: {
strategy: 'onehot',
},
},
]
});
Python
# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
onehotencoder = OneHotEncoder(categorical_features=[0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)
R
# Encoding categorical data
dataset$Country = factor(dataset$Country,
levels = c('Brazil', 'Mexico', 'Ghana'),
labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))
Cross Validation
Javascript
const testArray = [20, 25, 10, 33, 50, 42, 19, 34, 90, 23, ];
// { train: [ 50, 20, 34, 33, 10, 23, 90, 42 ], test: [ 25, 19 ] }
const trainTestSplit = ms.cross_validation.train_test_split(testArray,{ test_size:0.2, random_state: 0, });
// [ [ 50, 20, 34, 33, 10 ], [ 23, 90, 42, 19, 25 ] ]
const crossValidationArrayKFolds = ms.cross_validation.cross_validation_split(testArray, { folds: 2, random_state: 0, });
Python
#splitting the dataset into trnaing set and test set
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
R
# Splitting the dataset into the training set and test set
library(caTools)
set.seed(1)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
Scaling (z-score / min-mix)
Javascript
dataset.columnArray('Salary',{ scale:'standard'});
dataset.columnArray('Salary',{ scale:'minmax'});
Python
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
Notes
Check out https://repetere.github.io/modelscript for the full modelscript Documentation
A quick word about asynchronous JavaScript
Most machine learning tutorials in Python and R are not using their asynchronous equivalents; however, there is a bias in JavaScript to default to non-blocking operations.
With the advent of ES7 and Node.js 7+ there are syntax helpers with asynchronous functions. It may be easier to use async/await in JS if you want an approximation close to what a workflow would look like in R/Python
import * as fs from 'fs-extra';
import * as np from 'numjs';
import { default as ml } from 'ml';
import { default as pd } from 'pandas-js';
import { default as mpn } from 'matplotnode';
import { loadCSV, preprocessing } from 'modelscript';
const plt = mpn.plot;
void async () => {
const csvData = await loadCSV('../Data.csv');
const rawData = new preprocessing.DataSet(csvData);
const fittedData = rawData.fitColumns({
columns: [
{ name: 'Age' },
{ name: 'Salary' },
{
name: 'Purchased',
options: {
strategy: 'label',
labelOptions: {
binary: true,
},
}
},
]
});
const dataset = new pd.DataFrame(fittedData);
const X = dataset.iloc(
[ 0, dataset.length ],
[ 0, 3 ]).values;
const y = dataset.iloc(
[ 0, dataset.length ],
3).values;
console.log({
X,
y
});
}();