r-regression
v1.0.3
Installation
$ npm install r-regression --save
R-style regression models.
Example
The examples in this documentation use files from the [source data folder](https://bitbucket.org/cjansenson/r-regression/src/master/data).
const regression = require('r-regression');
const csvdata = require('csvdata');
csvdata.load('./data/mpg.csv').then(df => {
this.model = regression.lm('mpg ~ cyl + disp + hp', df);
console.log(this.model.summary.toString());
})
Output
Call:
lm(mpg ~ cyl + disp + hp)
Residuals:
| Min | 1Q | Median | 3Q | Max |
-------------------------------------------------------
| -11.699 | -3.1413 | -0.3221 | 2.33005 | 16.5796 |
Coefficients:
| | Estimate | Std. Err | t value | Pr(>|t| |
--------------------------------------------------------------------------------
| (inter.) | 39.2624 | 1.33290 | 29.4564 | <2e-16 |
| cyl | -0.7076 | 0.43636 | -1.6217 | 0.10566 |
| disp | -0.0292 | 0.00865 | -3.3838 | 0.00078 |
| hp | -0.0598 | 0.01352 | -4.4270 | 1.245e-5 |
---
Residual standard error: 4.53330 on 386 degrees of freedom
Multiple R-squared: 0.66616, Adjusted R-squared: 0.66356
F-statistic: 256.74662 on 3 and 386 DF, p-value: < 2.2e-16
Accepted data structures
The data passed as a parameter must be in one of the following forms:
A) An array of objects
[
{
"cyl": 8,
"name": "buick skylark 320",
"mpg": 15,
"disp": 350,
"hp": 165,
"wt": 3693,
"acc": 11.5,
"year": 70
},
{
"cyl": 8,
"name": "plymouth satellite",
"mpg": 18,
"disp": 318,
"hp": 150,
"wt": 3436,
"acc": 11,
"year": 70
},
...
...
{
"cyl": 8,
"name": "ford torino",
"mpg": 17,
"disp": 302,
"hp": 140,
"wt": 3449,
"acc": 10.5,
"year": 70
}
]
B) An object with arrays
{
"cyl": [ 8, 8, 8, 8],
"name": [
"buick skylark 320",
"plymouth satellite",
"amc rebel sst",
"ford torino"
],
"mpg": [ 15, 18, 16, 17],
"disp": [ 350, 318, 304, 302],
"hp": [ 165, 150, 150, 140],
"wt": [ 3693, 3436, 3433, 3449],
"acc": [ 11.5, 11, 12, 10.5],
"year": [ 70, 70, 70, 70]
}
C) A DataFrame, as provided by dataframe-js
Example with dataframe
const regression = require('r-regression');
const csvdata = require('csvdata');
const DataFrame = require("dataframe-js").DataFrame;
csvdata.load('./data/mpg.csv').then(df => {
let df1 = new DataFrame(df);
this.model = regression.lm('mpg ~ cyl + disp + hp', df1);
console.log(this.model.summary.toString());
})
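Forms A and B hold the same information in row-oriented and column-oriented layouts. The sketch below converts between them in plain JavaScript. `toColumns` and `toRows` are hypothetical helper names, not part of r-regression, and the code assumes every row has the same keys.

```javascript
// Hypothetical helpers (not part of r-regression) for switching between
// form A (array of objects) and form B (object with arrays).

// Form A -> form B: collect each key's values into one array.
function toColumns(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [key, value] of Object.entries(row)) {
      if (!(key in columns)) columns[key] = [];
      columns[key].push(value);
    }
  }
  return columns;
}

// Form B -> form A: assumes all arrays have the same length.
function toRows(columns) {
  const keys = Object.keys(columns);
  const length = keys.length > 0 ? columns[keys[0]].length : 0;
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const key of keys) row[key] = columns[key][i];
    rows.push(row);
  }
  return rows;
}

const rows = [
  { cyl: 8, name: 'buick skylark 320', mpg: 15 },
  { cyl: 8, name: 'plymouth satellite', mpg: 18 }
];
const columns = toColumns(rows);
console.log(columns);
console.log(toRows(columns));
```

Either layout can then be passed as the data argument, since both are listed as accepted forms.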
Accepted syntax
Currently only the symbols `~`, `.`, `+`, `-`, `:`, and `*` are accepted in the regression formulas.
Still to do: parentheses, polynomial, functions, exponential.
mpg modeled as a function of cyl, disp, and hp
regression.lm('mpg ~ cyl + disp + hp', df);
mpg modeled as a function of all the other variables
regression.lm('mpg ~ .', df);
mpg modeled as a function of hp, cyl, and the interaction of hp and cyl
regression.lm('mpg ~ hp + cyl + hp:cyl', df);
The formula above is equivalent to the following:
regression.lm('mpg ~ hp*cyl', df);
Combinations work with more than 2 variables
hp*cyl*disp is equivalent to: hp + cyl + disp + hp:cyl + cyl:disp + hp:disp + hp:cyl:disp
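The `*` expansion can be reproduced by listing every non-empty subset of the variables and joining each subset with `:`. The sketch below does that in plain JavaScript. `expandStar` is a hypothetical helper written for illustration, not the parser r-regression uses, and the order of the interaction terms may differ from the line above while the set of terms is the same.

```javascript
// Hypothetical helper (not part of r-regression): expand variables joined
// by `*` into main effects plus every interaction term.
function expandStar(variables) {
  const subsets = [];
  const n = variables.length;
  // Each bitmask from 1 to 2^n - 1 selects one non-empty subset.
  for (let mask = 1; mask < (1 << n); mask++) {
    subsets.push(variables.filter((_, i) => mask & (1 << i)));
  }
  // Main effects first, then two-way interactions, and so on.
  subsets.sort((a, b) => a.length - b.length);
  return subsets.map(subset => subset.join(':'));
}

const expanded = expandStar(['hp', 'cyl', 'disp']);
console.log(expanded.join(' + '));
```

For k variables the expansion has 2^k - 1 terms, so three variables give seven terms.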
Factor variables
The options object lets you specify which variables should be treated as factor (categorical) variables, so that they are handled differently from numeric ones.
In the example below, the number of cylinders is considered a factor variable, thus generating multiple regression lines.
const regression = require('r-regression');
const csvdata = require('csvdata');
csvdata.load('./data/mpg.csv').then(df => {
let options = {
factors: ['cyl']
};
this.model = regression.lm('mpg ~ cyl + wt', df, options);
console.log(this.model.summary.toString());
});
Output
Call:
lm(mpg ~ cyl + wt)
Residuals:
| Min | 1Q | Median | 3Q | Max |
-------------------------------------------------------
| -10.252 | -2.5394 | -0.2326 | 1.91863 | 16.8833 |
Coefficients:
| | Estimat | Std. Er | t value | Pr(>|t| |
--------------------------------------------------------------------------------
| (inter.) | 35.2026 | 2.47087 | 14.2470 | <2e-16 |
| cyl4 | 8.16212 | 2.08659 | 3.91170 | 0.00010 |
| cyl5 | 11.1225 | 3.17950 | 3.49819 | 0.00052 |
| cyl6 | 4.33286 | 2.16253 | 2.00360 | 0.04581 |
| cyl8 | 4.89760 | 2.31786 | 2.11298 | 0.03524 |
| wt | -0.0061 | 0.00056 | -10.799 | <2e-16 |
---
Residual standard error: 4.13009 on 384 degrees of freedom
Multiple R-squared: 0.72434, Adjusted R-squared: 0.72075
F-statistic: 201.80417 on 5 and 384 DF, p-value: < 2.2e-16
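The coefficient names in the output above (cyl4, cyl5, cyl6, cyl8) are consistent with treatment (dummy) coding, where each non-reference level of a factor gets its own 0/1 column. The sketch below illustrates that general technique. `dummyCode` is a hypothetical helper, the sample values are made up for illustration, and the source does not confirm that r-regression builds its design matrix this way.

```javascript
// Hypothetical helper (not part of r-regression): treatment (dummy) coding
// of a numeric-valued factor. The smallest level is the reference level
// and gets no column of its own.
function dummyCode(values, name) {
  const levels = [...new Set(values)].sort((a, b) => a - b);
  const [, ...others] = levels; // drop the reference level
  return values.map(value => {
    const row = {};
    for (const level of others) {
      row[name + level] = value === level ? 1 : 0;
    }
    return row;
  });
}

// Illustrative cylinder counts only, not taken from mpg.csv.
const coded = dummyCode([3, 4, 5, 6, 8, 4], 'cyl');
console.log(Object.keys(coded[0]));
console.log(coded[1]);
```

Under this coding each level's coefficient shifts the intercept relative to the reference level, which is one way to read the "multiple regression lines" mentioned above.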
A more complex example
const regression = require('r-regression');
const csvdata = require('csvdata');
csvdata.load('./data/epa2015.csv').then(df => {
let options = {
factors: ['type', 'drive', 'lockup']
};
this.model = regression.lm('CO ~ type:lockup + type:drive + lockup:drive', df, options);
console.log(this.model.summary.toString());
});
Output
Call:
lm(CO ~ type:lockup + type:drive + lockup:drive)
Residuals:
| Min | 1Q | Median | 3Q | Max |
-------------------------------------------------------
| -0.8530 | -0.2463 | -0.1348 | 0.07994 | 7.14032 |
Coefficients:
| Name | Estimat | Std. Er | t value | Pr(>|t| |
--------------------------------------------------------------------------------
| (inter.) | 0.34261 | 0.01972 | 17.3723 | <2e-16 |
| typeBoth:lockupN | 0.08496 | 0.06833 | 1.24340 | 0.21378 |
| typeBoth:lockupY | -0.0425 | 0.03851 | -1.1045 | 0.26942 |
| typeCar:lockupN | 0.05213 | 0.03174 | 1.64242 | 0.10057 |
| typeCar:lockupY | -0.0520 | 0.02439 | -2.1351 | 0.03280 |
| typeTruck:lockupN | 0.10354 | 0.05276 | 1.96233 | 0.04978 |
| typeBoth:drive4 | -0.1543 | 0.23703 | -0.6510 | 0.51506 |
| typeBoth:driveA | -0.0081 | 0.06325 | -0.1288 | 0.89750 |
| typeBoth:driveF | 0.07812 | 0.04117 | 1.89766 | 0.05780 |
| typeCar:drive4 | -0.0024 | 0.06214 | -0.0399 | 0.96813 |
| typeCar:driveA | 0.21531 | 0.03912 | 5.50316 | 3.943e-8 |
| typeCar:driveF | -0.0143 | 0.02246 | -0.6396 | 0.52240 |
| typeCar:driveP | 0.00068 | 0.08384 | 0.00818 | 0.99346 |
| typeTruck:drive4 | -0.1140 | 0.05887 | -1.9369 | 0.05281 |
| typeTruck:driveA | -0.0977 | 0.05934 | -1.6475 | 0.09952 |
| typeTruck:driveF | -0.0361 | 0.03397 | -1.0627 | 0.28793 |
| typeTruck:driveP | -0.0448 | 0.16712 | -0.2686 | 0.78823 |
| lockupN:drive4 | -0.0131 | 0.17486 | -0.0754 | 0.93984 |
| lockupN:driveA | 0.24999 | 0.08512 | 2.93671 | 0.00333 |
| lockupN:driveF | -0.0847 | 0.03626 | -2.3365 | 0.01950 |
| lockupN:driveP | 0.00068 | 0.08384 | 0.00818 | 0.99346 |
---
Residual standard error: 0.46939 on 4390 degrees of freedom
Multiple R-squared: 0.02406, Adjusted R-squared: 0.01961
F-statistic: 5.41028 on 20 and 4390 DF, p-value: < 2.2e-16
Accessing Model Summary results
The `model.summary` object contains all the summary information about the model.
this.model = regression.lm('mpg ~ cyl + wt', df);
let summary = this.model.summary;
console.log("R squared: " + summary.r_squared);
console.log("Adj R squared: " + summary.adj_r_squared);
console.log("F: " + summary.F);
console.log("Degrees of freedom: " + summary.degFreedom);
console.log("Residual standard error: " + summary.sigma);
console.log("\n\n");
console.log("Coefficients can be accessed as a dictionary of arrays");
console.log(summary.coefficients.toDict());
console.log("... or elements of a matrix");
console.log(summary.coefficients.mat);
console.log("\n\n");
console.log("Same as residual statistics");
console.log(summary.residuals.toDict());
Output (some of this output was formatted for better documentation)
R squared: 0.6968669371610983
Adj R squared: 0.6953003580249799
F: 444.8335363943274
Degrees of freedom: 387
Residual standard error: 4.314193982299181
Coefficients can be accessed as a dictionary of arrays
{
Name: [ '(inter.)', 'cyl', 'wt' ],
Estimate: [ 46.27984366893338, -0.7192694685962204, -0.0063479471345544635 ],
'Std. Error': [ 0.7975478288201534, 0.29040152555355486, 0.0005826830104224589 ],
't value': [ 58.027671816745, -2.4768102275811743, -10.89434052651038 ],
'Pr(>|t|)': [ 4.830747533920324e-193, 0.013682709406769245, 2.7540964642770183e-24 ]
}
... or elements of a matrix
[
[ '(inter.)', 46.27984366893338, 0.7975478288201534, 58.027671816745, 4.830747533920324e-193 ],
[ 'cyl', -0.7192694685962204, 0.29040152555355486, -2.4768102275811743, 0.013682709406769245 ],
[ 'wt', -0.0063479471345544635, 0.0005826830104224589, -10.89434052651038, 2.7540964642770183e-24 ]
]
Same as residual statistics
{
Min: [ -12.638995549351037 ],
'1Q': [ -2.8816383979474907 ],
Median: [ -0.28836806118915526 ],
'3Q': [ 2.195003911205035 ],
Max: [ 16.59140265936142 ]
}
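The figures reported above are tied together by the standard formulas for ordinary least squares with an intercept. As a sanity check, the sketch below recomputes the adjusted R squared and the F-statistic from the reported R squared and degrees of freedom. It uses textbook formulas and values copied from the output; it is not a description of how r-regression computes them internally.

```javascript
// Values copied from the summary output above.
const rSquared = 0.6968669371610983;
const degFreedom = 387; // residual degrees of freedom
const p = 2;            // number of predictors: cyl and wt

// Number of observations implied by the residual degrees of freedom.
const n = degFreedom + p + 1;

// Adjusted R squared penalizes R squared for the number of predictors.
const adjRSquared = 1 - (1 - rSquared) * (n - 1) / degFreedom;

// F-statistic for the overall regression on p and degFreedom DF.
const fStatistic = (rSquared / p) / ((1 - rSquared) / degFreedom);

console.log('Adj R squared: ' + adjRSquared);
console.log('F: ' + fStatistic);
```

Both results should agree with the reported values up to floating-point rounding.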
Accessing other model information
To access coefficients information
model.coefficients // Returns an object with the coefficients
Fitted values
model.fittedValues // Returns an array of fitted values
Residuals
model.residuals // Returns an array of residuals
Prediction
csvdata.load('./data/mpg.csv').then(df => {
this.model = regression.lm('mpg ~ cyl + wt', df);
const newValues = [
{cyl: 8, wt: 3500},
{cyl: 6, wt: 2000}
];
let fit = this.model.predict(newValues);
console.log(fit);
});
Output
[ 18.307872949222997, 29.268332588247134 ]
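For a model without factors, each fitted value is the intercept plus the sum of each coefficient times its variable. The sketch below reproduces the two predictions by hand, using the estimates copied from the 'mpg ~ cyl + wt' summary shown earlier. `predictByHand` is a hypothetical helper for illustration, not part of r-regression.

```javascript
// Estimates copied from the 'mpg ~ cyl + wt' summary shown earlier.
const coefficients = {
  '(inter.)': 46.27984366893338,
  cyl: -0.7192694685962204,
  wt: -0.0063479471345544635
};

// Hypothetical helper (not part of r-regression): intercept plus the
// sum of coefficient * value over the variables in the row.
function predictByHand(coef, row) {
  let fit = coef['(inter.)'];
  for (const [name, value] of Object.entries(row)) {
    fit += coef[name] * value;
  }
  return fit;
}

const newValues = [
  { cyl: 8, wt: 3500 },
  { cyl: 6, wt: 2000 }
];
const fits = newValues.map(row => predictByHand(coefficients, row));
console.log(fits);
```

The results should match the output of `predict` above up to floating-point rounding.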
Confidence intervals
The following code creates a 99% confidence interval around the fitted values.
csvdata.load('./data/mpg.csv').then(df => {
this.model = regression.lm('mpg ~ cyl + wt', df);
const newValues = [
{cyl: 8, wt: 3500},
{cyl: 6, wt: 2000},
{cyl: 6, wt: 3500},
{cyl: 4, wt: 3500},
{cyl: 2, wt: 3500}
];
let fit = this.model.predict(newValues, 'confidence', .99);
console.log(fit.toString());
console.log('The fitted values can be accessed as either an object of arrays');
console.log('fit:' + fit.toDict().fit);
console.log('lwr:' + fit.toDict().lwr);
console.log('upr:' + fit.toDict().upr);
console.log('Or an array of objects:');
console.log(fit.toCollection()[0]);
console.log(fit.toCollection()[1]);
console.log(fit.toCollection()[2]);
console.log(fit.toCollection()[3]);
console.log(fit.toCollection()[4]);
});
Output
| fit | lwr | upr |
---------------------------------
| 18.3078 | 16.9444 | 19.6712 |
| 29.2683 | 27.3453 | 31.1913 |
| 19.7464 | 19.0137 | 20.4790 |
| 21.1849 | 19.2521 | 23.1177 |
| 22.6234 | 19.2389 | 26.0080 |
The fitted values can be accessed as either an object of arrays
fit: [ 18.307872949222997, 29.268332588247134, 19.746411886415437, 21.184950823607878, 22.623489760800318 ]
lwr: [ 16.944478752252614, 27.345352164898884, 19.013740846471986, 19.252188710119217, 19.238946757671457 ]
upr: [ 19.67126714619338, 31.191313011595383, 20.47908292635889, 23.117712937096538, 26.00803276392918 ]
Or an array of objects:
{ fit: 18.307872949222997, lwr: 16.944478752252614, upr: 19.67126714619338 }
{ fit: 29.268332588247134, lwr: 27.345352164898884, upr: 31.191313011595383 }
{ fit: 19.746411886415437, lwr: 19.013740846471986, upr: 20.47908292635889 }
{ fit: 21.184950823607878, lwr: 19.252188710119217, upr: 23.117712937096538 }
{ fit: 22.623489760800318, lwr: 19.238946757671457, upr: 26.00803276392918 }
In the example above, `toDict()` and `toCollection()` are used to retrieve the results as either an object of arrays or an array of objects.
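The array-of-objects form is convenient for per-row calculations. As a small sketch, the code below takes two rows copied from the `toCollection()` output above and computes the half-width (margin) of each interval. The variable names are illustrative only.

```javascript
// Rows copied from the toCollection() output above:
// the first is { cyl: 8, wt: 3500 }, the second is { cyl: 2, wt: 3500 }.
const intervals = [
  { fit: 18.307872949222997, lwr: 16.944478752252614, upr: 19.67126714619338 },
  { fit: 22.623489760800318, lwr: 19.238946757671457, upr: 26.00803276392918 }
];

// Half-width of each interval: the distance from the fit to either bound.
const margins = intervals.map(({ lwr, upr }) => (upr - lwr) / 2);
console.log(margins);

// The intervals are symmetric around the fitted value.
const symmetric = intervals.every(
  ({ fit, lwr, upr }) => Math.abs((fit - lwr) - (upr - fit)) < 1e-9
);
console.log(symmetric);
```

The second row has a noticeably wider interval than the first, so the estimate for that combination of values is less certain.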
Prediction intervals
The following code creates a 99% prediction interval around the fitted values.
csvdata.load('./data/mpg.csv').then(df => {
this.model = regression.lm('mpg ~ cyl + wt', df);
const newValues = [
{cyl: 8, wt: 3500},
{cyl: 6, wt: 2000},
{cyl: 6, wt: 3500},
{cyl: 4, wt: 3500},
{cyl: 2, wt: 3500}
];
let fit = this.model.predict(newValues, 'prediction', .99);
console.log(fit.toString());
});
Output
| fit | lwr | upr |
---------------------------------
| 18.3078 | 7.05726 | 29.5584 |
| 29.2683 | 17.9362 | 40.6003 |
| 19.7464 | 8.55471 | 30.9381 |
| 21.1849 | 9.85124 | 32.5186 |
| 22.6234 | 10.9541 | 34.2927 |
Just as in the previous example, `toDict()` and `toCollection()` can be used to retrieve the results as either an object of arrays or an array of objects.
Options
| Option | Purpose |
|--------|---------|
| `automateFactors` | Automates the process of finding the columns with categorical values. |
| `factors` | List of columns to consider for categorical values. |
| `removeNA` | Automatically removes the rows with NA or NaN values. |
| `removeCollinearTerms` | Automatically removes collinear terms. Suggested for formulas containing factor variables. |
| `dropInvalidColumns` | Automatically drops invalid columns. (Could affect performance.) |
The system expects numerical columns, unless they are factors. The factor columns list should be passed in the options object, although that process can be automated by setting the automateFactors flag.
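Several options can be combined in a single options object. The fragment below is a sketch only: the option names come from the table above, but the source does not state their value types or defaults, so treating the flags as booleans is an assumption.

```javascript
// Assumption: the flag options take boolean values. The source does not
// state their types or defaults.
const options = {
  factors: ['cyl'],           // columns to treat as categorical
  removeNA: true,             // drop rows with NA or NaN values
  removeCollinearTerms: true  // suggested when the formula contains factors
};

// The object is passed as the third argument, as in the factor examples:
// regression.lm('mpg ~ cyl + wt', df, options);
console.log(options);
```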
License
MIT