node-dataset

v1.4.5

Published

6 months ago

A Node.js module for working with data sets created in code, loaded from files, or retrieved from a database.

Downloads

436

0High
0Medium
0Low

jamesbontempo

data dataset database SQL CSV JSON HTML XML MySQL MongoDB

node-dataset

node-dataset is a Node.js module for working with data sets created in code, loaded from files, or retrieved from a database. Its design is largely inspired by work with SQL databases, and its development was motivated by a desire to be able to manipulate data within a JavaScript application using similar features and functions, including the ability to join multiple sets of data together. It aims to provide an easy way to retrieve data of multiple types from multiple sources, and work with all of that data in a unified manner.

Introduction

Logically, a DataSet is analogous to a table in a database or a data file (e.g., a CSV file). A DataSet has a name, a set of fields, and data. You can create a new DataSet by supplying these three elements:

const ds = require("node-dataset");

const dataset = new ds.DataSet(
  "test",
  "fips, county, state",
  [
    [45001, "Abbeville", "South Carolina"],
    [22001, "Acadia", "Louisiana"],
    [51001, "Accomack", "Virginia"],
    [16001, "Ada", "Idaho"]
  ]
);

console.log(dataset) will produce the following:

DataSet {
  name: 'test',
  fields: [ 'fips', 'county', 'state' ],
  data: [
    [ 45001, 'Abbeville', 'South Carolina' ],
    [ 22001, 'Acadia', 'Louisiana' ],
    [ 51001, 'Accomack', 'Virginia' ],
    [ 16001, 'Ada', 'Idaho' ]
  ]
}

You can also create a DataSet from a file (CSV, JSON) or a database (MySQL, MongoDB):

const education = await new ds.DataSet().fromFile("./data/education.csv", "csv");

const population = await new ds.DataSet().fromFile("./data/population.json", "json");

const age = await new ds.DataSet("age").fromMySQL(
  {host: "localhost", user: "foo", password: "bar", database: "datasets"},
  "select convert(fips, double) as fips, age_group, total, male, female from age")
);

const fips = await new ds.DataSet("fips").fromMongoDB(
  "mongodb://localhost:27017", "test", "fips", {}, {"_id": 0}
 );

These examples also demonstrate a few important things:

Many methods return a DataSet allowing for "chainable" statements.
File and database retrieval is asynchronous.

Once a DataSet has been created, there are many ways that it can be manipulated.

Note: A DataSet is largely immutable and calling its data manipulation methods will return a new DataSet rather than change its underlying data.

For example, using a DataSet from the examples above, you could select just the data for a specified set of fields from one DataSet into a new DataSet, and rename the fields using the "as" keyword:

const college = education.select("fips as fips_code, college_or_higher as percent_college");

Or you could filter the data from one DataSet into a new DataSet:

const some_states = fips.filter("state = 'Maryland' or state like 'Cali%' or state in ('New York','Texas')");

Using chained methods, you could also perform the select and filter methods in sequence:

const some_other_states = fips
  .select("fips as fips_code, state, county")
  .filter("state in ('Illinois','Kentucky','Colorado')")
 );

A DataSet can also be joined with another DataSet to create new a DataSet:

const fips_education_ = fips
 .join("education", "inner", "fips", "fips")
 .join("population", "inner", "fips", "fips.fips");

This example demonstrates another important thing: when a DataSet is joined with another DataSet, the fields in the resulting DataSet are named by combining the name of each underlying DataSet and its fields (that's why the second join refers to fips.fips in the example above). This ensures that there aren't any problems if the same field name is used in more than one DataSet.

Aggregations can also be performed on a DataSet:

const ten_most_educated = fips
  .join("education", "inner", "fips", "fips")
  .aggregate("fips.state, count(fips.state), avg(education.college_or_higher), std(education.college_or_higher)", "fips.state")
  .sort("avg(education.college_or_higher) desc")
  .slice(0, 10)
);

Constructor

The DataSet constructor creates a new instance of a DataSet.

Parameters:

Name|Type|Description ----|----|----------- name|string|the name of the DataSet fields|string | array| a comma-separated list, or array, of fields in the DataSet (if a string is supplied, it will be split using /\s*,\s*/) data|array|an array of arrays containing the data in the DataSet (each subarray is essentially a record)

Example:

const dataset = new ds.DataSet(); // name will be null, fields will be [], and data will be []

const dataset = new ds.DataSet("test", "field1, field2", [[1, "a"], [2, "b"]]);

const dataset = new ds.DataSet("test", ["field1", "field2"], [[1, "a"], [2, "b"]]);

Basic methods

getName

The getName method returns a string containing the name of the current DataSet.

Example:

const name = dataset.getName();

setName

The setName method sets the name of the current DataSet and returns the DataSet.

Parameters:

Name|Type|Description ----|----|----------- name|string|the name

Example:

dataset.setName("new_name");

getFields

The getFields method returns an array containing the fields of the current DataSet.

Example:

const fields = dataset.getFields();

setFields

The setFields method sets the fields of the current DataSet and returns the DataSet.

Parameters:

Name|Type|Description ----|----|----------- fields|string | array|a comma-separated list, or array, of fields

Example:

dataset.setFields("field1, field2");

dataset.setFields(["field1", "field2"]);

getData

The getData method returns an array of arrays containing the data of the current DataSet.

Example:

const data = dataset.getData();

setData

The setData method sets the data of the current DataSet and returns the DataSet.

Parameters:

Name|Type|Description ----|----|----------- data|array|an array of arrays of the data

Example:

dataset.setData([[1, "a"], [2, "b"]]);

count

The count method returns the number of records in the current DataSet.

Example:

const count = dataset.count();

Data manipulation methods

select

The select method returns a new DataSet containing the data for a subset of fields from the original DataSet.

Parameters:

Name|Type|Description ----|----|----------- fieldList|string|a comma-separated list of fields to select, which can be renamed using the "as" keyword

Example:

const new_dataset = dataset.select("field1, field2, field3 as three");

Note: There is no equivalent to the SQL select * statement. If you wanted to create a new DataSet with all the same data as another DataSet you could do something like this:

const new_dataset = new ds.DataSet("clone", dataset.getFields(), dataset.getData());

join

The join method returns a new DataSet created by joining the current DataSet with another DataSet.

Parameters:

Name|Type|Description ----|----|----------- dataset|DataSet|the DataSet to join with the current DataSet with type|string|the type of join to perform: "inner", "left", "right" or "cross". fieldList1|string|a comma-separated list of fields from the current DataSet to use for the join fieldList2|string|a comma-separated list of fields from the joined to DataSet to use for the join

Example:

const new_dataset = dataset.join(d2, "left", "d1_field1, d1_field2", "d2_field1, d2_field2");

filter | where

The filter method returns a new DataSet created by filtering the current DataSet for a subset of data.

Parameters:

Name|Type|Description ----|----|----------- filterStatement|string|a statement describing the filter to be applied useEval|boolean|whether to use the new Function() constructor (similar to eval) to evaluate the filterStatement

Example:

const new_dataset = dataset.filter("field1 > 100 and field2 like '%something%' and field3 is not null", true);

Supported comparison operators and functions include: =, <, >, <=, >=, !=, <>, (not) in (e.g. id in (1, 2, 3) or code not in ('a', 'b, 'c')), (not) like (e.g., state like 'cali%' or state not like '*york'), (not) between (e.g., id between 0 and 10 or number not between 1000 and 2000), is (not) null (e.g., field1 is null or field2 is not null).

A note about useEval: All filter conditions, when matched against data, are ultimately reduced to a boolean statement; for example (true || (false && true)). At this point, if useEval is true, the boolean statment will be evaluated using the new Function() constructor, which is actually safer than using eval directly. However, if this is concerning, when useEval is set to false the "boolean-parser" module will be used to evaluate the statement. In testing, setting useEval to true regularly cuts the execution time in half.

For those who prefer SQL-style naming, the where method is a direct replacement for filter:

const new_dataset = dataset.where("field1 > 100 and field2 like '%something%' and field3 is not null");

sort | orderby

The sort method returns a new DataSet with the data sorted.

Parameters:

Name|Type|Description ----|----|----------- sortStatement|string|a statement describing how to sort the DataSet

Example:

const new_dataset = dataset.sort("field1 desc, field 2");

If no sort order—asc for ascending, desc for descending—is supplied, the default is asc.

For those who prefer SQL-style naming, the orderby method is a direct replacement for sort:

const new_dataset = dataset.orderby("field1 desc, field 2");

aggregate

The aggregate method returns a new DataSet with aggregate functions performed on fields in the current DataSet.

Parameters:

Name|Type|Description ----|----|----------- aggregationList|string|a comma-separated list of aggregate functions to perform on fields, which can be aliased using the "as" keyword groupList|string|a comma-separated list of fields to group by

Example:

const new_dataset = dataset.aggregate("field1, field2, min(field3) as min, max(field4)", "field1, field2");

Supported aggregate functions include: count, min (minimum), max (maximum), sum, avg (average), var (variance), and std (standard deviation).

slice | limit

The slice method returns a new DataSet using a slice of the data in the current DataSet (using zero-based array indexing).

Parameters:

Name|Type|Description ----|----|----------- begin|number|the array index indicating where the slice should begin end|number|the array index indicating up to where the slice should extend (the item at this index is not actually included in the results)

Example:

const new_dataset = dataset.slice(100, 125);

For those who prefer SQL-style naming, the limit method is a replacement for slice for which begin is always 0:

const new_dataset = dataset.limit(25);

Input/Output methods

fromFile

The asnynchronous fromFile method returns a Promise to populate the current DataSet using data from a file and returns the current DataSet when it's fulfilled.

Parameters:

Name|Type|Description ----|----|----------- filePath|string|the path to the file containing the data type|string|the type of file/format of the data ("json", "csv", or "xlsx") options|object|for CSV files, whether the first line contains the fields as a header, and the quote and delimiter characters (default is { header:true, quote: "\"", delimiter: "," }); for XLSX files, whether the first row of data contains the fields, the 1-based index of the worksheet containing the data, and, if the entire sheet does not contain data, the Excel-style range (e.g., "B2:E12") that contains the data (default is { header: true, worksheet: 1 })

Example:

const dataset = await new ds.DataSet().fromFile("./data/test.xlsx", "xlsx", { header: true, worksheet: 2, range: B2:C5 });

new ds.DataSet().fromFile("./data/test.xlsx", "xlsx", { header: true, worksheet: 2, range: B2:C5 }).then( ...do something with the DataSet... );

Note: the name of the new DataSet will be set to the name of the data file without its extension.

fromMySQL

The asynchronous fromMySQL method returns a Promise to populate the current DataSet using the results from a MySQL query and returns the current DataSet when it's fulfilled.

Paramters:

Name|Type|Description ----|----|----------- options|object|the options for the MySQL connection query|string|the query to retrieve the data

Example:

const dataset = await new ds.DataSet("test")
  .fromMySQL({host: "localhost", user: "foo", password: "bar", database: "test"}, "select * from table")

new ds.DataSet("test")
  .fromMySQL({host: "localhost", user: "foo", password: "bar", database: "test"}, "select * from table")
  .then( ...do something with the DataSet... );

Note: Unless set earlier, as in the example above, the name of a new DataSet created using the fromMySQL method will be null. The fields of the new DataSet will be set to the fields returned by the query.

fromMongoDB

The fromMongoDB method returns a Promise to populate the current DataSet using the results from a MongoDB query and returns the current DataSet when it's fulfilled.

Parameters:

Name|Type|Description ----|----|----------- url|string|the url of the database database|string|the name of the database collection|string|the name of the collection query|object|the query to execute to retrieve results projection|object|the projection to execute on the results

Example:

const dataset = await new ds.DataSet("test")
  .fromMongoDB("mongodb://localhost:27017", "test", "test", {}, {"_id": 0});

new ds.DataSet("test")
  .fromMongoDB("mongodb://localhost:27017", "test", "test", {}, {"_id": 0})
   .then( ...do something with the DataSet... );

Note: Unless set earlier, as in the example above, the name of a new DataSet created using the fromMongoDB method will be null.

fromJSON

The fromJSON method populates the current DataSet from an array of JSON objects.

Parameters:

Name|Type|Description ----|----|----------- json|array|an array of JSON objects where the keys are the fields and the values are the data

Example:

const dataset = new ds.DataSet("test").fromJSON([{field1: 1, field2: "a"}, {field1: 2, field2: "b"}]);

Note: Unless set earlier, as in the example above, the name of a new DataSet created using the fromJSON method will be null.

fromCSV

The fromCSV method populates the current DataSet from a CSV-formatted string.

Parameters:

Name|Type|Description ----|----|----------- csv|string|a CSV-formatted string where the first line contains the fields and the rest are the data options|object|whether the first line contains the fields as a header; the quote and delimiter characters; and whether to "coerce" strings that look like numbers into numbers, ones that look like dates into ISO-style date strings, and set "NA" values to null (default is { header: true, quote: "\"", delimiter: ",", coerce: true })

Example:

const dataset = new ds.DataSet("test").fromCSV("field1,field2\n1,'a'\n2,'b'", { quote: "'" });

Note: Unless set earlier, as in the example above, the name of a new DataSet created using the fromCSV method will be null.

fromHTML

The fromHTML method populates the current DataSet from an HTML table.

Parameters:

Name|Type|Description ----|----|----------- html|string|an HTML table options|object|whether the table contains column headers; an array of headers to use if it doesn't; and whether to "coerce" strings that look like numbers into numbers, ones that look like dates into ISO-style date strings, and set "NA" values to null (default is { header: true, coerce: true })

Example:

const dataset = new ds.DataSet("test").fromHTML("<table><tr><td>1</td><td>a</td><td>Product A</td><td>1/1/2022</td></tr></table>", { header: false, headers: ["id", "code","name"] });

Note: Unless set earlier, as in the example above, the name of a new DataSet created using the fromHTML method will be null.

fromArray

The fromArray method populates the current DataSet from an array.

Parameters:

Name|Type|Description ----|----|----------- array|array|an array containing the name, fields and data for the DataSet

Example:

const dataset = new ds.DataSet().fromArray(["test", "field1, field2", [[1, "a"], [2, "b"]]);

Note: this method is used extensively internally when creating a new DataSet, but may not be of significant use in an application.

toFile

The toFile method writes the current DataSet to a file.

Parameters:

Name|Type|Description ----|----|----------- filePath|string|the path to the output file type|string|the type of file/format of the data ("json" or "csv") options|object|options for the file (default for JSON { pretty: true, space: 2 }, for CSV { header: true, delimiter: ",", quote: "\"" }

Example:

dataset.toFile("./data/test.json", "json", { pretty: true, space: 2 });

dataset.toFile("./data/test.csv", "csv", { delimiter: "\t", quote: "'" });

toJSON

The toJSON method converts the current DataSet to an array of JSON objects where the fields are the keys and the data are the values.

Example:

const json = dataset.toJSON();

toCSV

The toCSV method converts the current DataSet to a CSV-style, multi-line string where the first line contains the fields and the remaining lines comprise the data.

Name|Type|Description ----|----|----------- options|object|whether to print the fields as a header, and the quote and delimiter characters (default is { header: true, quote: "\"", delimiter: "," })

Example:

const csv = dataset.toCSV({ header: false, delimiter: "\t" });

toHTML

The toHTML method converts the current DataSet to an HTML table where the fields are the headers and the data are the rows.

Example:

const json = dataset.toHTML();

toXML

The toXML method converts the current DataSet to an XML document.

Example:

const json = dataset.toXML();

SQL examples

For those used to working with SQL, it might be helpful to see some examples of how to map SQL queries to a series of DataSet calls.

Let's suppose you had this query:

select employee.id, employee.name, employee.department_id, department.id, department.name as department_name
from employee
    inner join department on department.id = employee.department_id

Assuming you have a DataSet named "employee" and another named "department," you could do this:

const joined_dataset = employee
    .join(department, "inner", "department_id", "id")
    .select("employee.id, employee.name, employee.department_id, department.id, department.name as department_name");

You'll notice that I've placed the select call after the join. I have to do this because the "employee" DataSet has to be joined to the "department" DataSet before I can select fields from across both of them. This will likely become a pattern as you use node-dataset—the last thing you specify is a select, whereas in SQL it's the first thing.

Similarly, aggregate will often be a final operation. Let's look at another SQL example, this time one using "group by," and see how it would be achieved with node-dataset.

select department.department_name, count(employee.id), avg(employee.age)
from department
    inner join employee on employee.department_id = department.id
group by department.department_name

const aggregate_dataset = department
    .join(employee, "inner", "id", "department_id")
    .aggregate("department.department_name, count(employee.id), avg(employee.age)", "department.department_name");

A note about (multiple) joins

As noted earlier, the names of the fields in a DataSet returned by a join always include their base DataSet name to ensure uniqueness, in case the joined DataSet objects have fields with the same name. However, the resulting DataSet from the join has its own name set to null. The simple explanation for this is that it's not clear what to name the new DataSet given that it is created from two distinct objects. The upside to setting the name to null is that fields will not be renamed, over and over, when multiple joins are "chained" together.

Imagine we have three DataSet objects, each with the fields "id" and "name". We could join all of them like this:

const triple_join = a
    .join(b, "inner", "id", "id")
    .join(c, "inner", "a.id", "id");

I have to specify "a.id" in the second join because the first join will rename the field in the DataSet passed to the second join. But in the end, the triple_join DataSet would have the fields "a.id", "a.name", "b.id", "b.name", "c.id", and "c.name." In particular, it's important to point out that the additional join with c would not prepend the a and b fields with anything because the DataSet from the first join that is passed to the second join has its name set to null. The end result of all of this is that successive joins in a chain will not do anything too funny with any fields other then prepend them with their respective DataSet name along the way.

Of course, that wouldn't keep me from doing something like this:

const triple_join = a
    .join(b, "inner", "id", "id")
    .select(a.id as id, a.name, b.id, b.name)
    .join(c, "inner", "id", "id");

By putting the select in between the first and second join and using the as keyword, I've renamed "a.id" as simply "id" so that I can refer to it as such in the next join. I'm not sure that makes anything easier to understand, but when performing a long join chain, renaming fields along the way, and maybe even using select to "whittle down" the DataSet, might make sense.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

node-dataset

Table of contents

Introduction

Constructor

Basic methods

getName

setName

getFields

setFields

getData

setData

count

Data manipulation methods

select

join

filter | where

sort | orderby

aggregate

slice | limit

Input/Output methods

fromFile

fromMySQL

fromMongoDB

fromJSON

fromCSV

fromHTML

fromArray

toFile

toJSON

toCSV

toHTML

toXML

SQL examples

A note about (multiple) joins