statistical-aggregation
v0.2.1
Published
A library for computing statistical aggregations which are composable and augmentable
Downloads
46
Maintainers
Readme
statistical-aggregation
This is a library for computing statistical aggregations which are composable and augmentable. Also see this blog post.
Installation
npm install statistical-aggregation
Example
const { aggregate } = require("statistical-aggregation");
const customersChunk1 = [
{
customerId: 1,
region: "midwest",
revenue: 10,
},
{
customerId: 2,
region: "midwest",
revenue: 20,
},
{
customerId: 4,
region: "northeast",
revenue: 40,
},
];
const customersChunk2 = [
{
customerId: 3,
region: "midwest",
revenue: 30,
},
{
customerId: 5,
region: "northeast",
revenue: 50,
},
{
customerId: 6,
region: "northeast",
revenue: 60,
},
];
const aggregationOptions = {
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
};
const aggregatedChunk1 = aggregate({
records: customersChunk1,
...aggregationOptions,
});
const aggregatedChunk2 = aggregate({
records: customersChunk2,
...aggregationOptions,
});
const finalResult = aggregate({
records: [
...aggregatedChunk1.aggregatedRecords,
...aggregatedChunk2.aggregatedRecords,
],
...aggregationOptions,
});
The value of finalResult
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"averageRevenue": 20
},
{
"region": "northeast",
"averageRevenue": 50
}
],
"totals": {
"averageRevenue": 35
}
}
This is a simplified example. In a possible real use case, each chunk of data would be much larger and would be retrieved from a database. Additionally, the three aggregations would be computed on three different servers.
Summary
This is a library for computing statistical aggregations which are composable and augmentable.
Composable means that the results of multiple aggregations can be combined. For example, aggregations which are calculated in parallel on separate machines can be combined into a single result.
Augmentable means that an aggregation can be continually added to as new data is available without having to recalculate from the original data.
This is all done through a single function, aggregate
.
Supported Aggregation Types
"minimum"
"maximum"
"count"
"sum"
"average"
"weightedAverage"
"standardDeviation"
Aggregation Metadata
Composability and augmentability is possible because there is a field called _aggregationMetadata
attached to each aggregated record.
For simplicity, this field is not shown in the examples.
You generally shouldn't need to worry about this field, but if the aggregated records are being
stored in a database, this field must be present on each record in order to retain composability
and augmentability.
Usage
An options object is passed to the function like this:
const result = aggregate({
records: customers,
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
totalRevenue: {
method: "sum",
sourceField: "revenue",
},
},
});
It will return an object containing the aggregated records as well as the total aggregation over all records. For example,
{
"aggregatedRecords": [
{
"region": "midwest",
"averageRevenue": 20,
"totalRevenue": 60
},
{
"region": "northeast",
"averageRevenue": 50,
"totalRevenue": 150
}
],
"totals": {
"averageRevenue": 35,
"totalRevenue": 210
}
}
Here is more detail on all possible options:
| Param | Description | | --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | options | | | options.records | The records to aggregate. These can include records which are the output of a previous aggregation. | | options.matchKeys | The fields to aggregate on. Nested fields can be accessed using path notation. Ex "a[0].b.c" | | options.fields | | | options.fields.<outputFieldName> | The output field returned on the aggregated records. Nested fields are allowed using path notation. Ex "a[0].b.c" | | options.fields.<outputFieldName>.method | The method of aggregation: either "sum", "min", "max", "average", "standardDeviation", "weightedAverage", or "count" | | options.fields.<outputFieldName>.sourceField | The field on the input records which the aggregation method acts on. Not needed for the "count" method. Nested fields can be accessed using path notation. Ex "a[0].b.c" | | options.fields.<outputFieldName>.weightField | Only used for "weightedAverage" method. The field on the input records which is the coefficient or weight for the weighted average. Nested fields can be accessed using path notation. Ex "a[0].b.c" | | options.buckets | | | options.buckets.<bucketFieldName> | An array of breakpoints defining the bucket ranges. The bucketFieldName must be one of the match keys. If a value is on a breakpoint it is put in the lower bucket. | | options.noAggregateMetadata | If true, aggregation metadata will not be output on aggregated records. This is not recommended since the aggregated records cannot be augmented with new records if there is no aggregation metadata present. | | options.sortBy | An array of fields to sort the output by. |
Examples
Simple Example
Say we have the following data and we want to find the average and total revenue by region.
const customers = [
{
customerId: 1,
region: "midwest",
revenue: 10,
},
{
customerId: 2,
region: "midwest",
revenue: 20,
},
{
customerId: 3,
region: "midwest",
revenue: 30,
},
{
customerId: 4,
region: "northeast",
revenue: 40,
},
{
customerId: 5,
region: "northeast",
revenue: 50,
},
{
customerId: 6,
region: "northeast",
revenue: 60,
},
];
const result = aggregate({
records: customers,
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
totalRevenue: {
method: "sum",
sourceField: "revenue",
},
},
});
The value of result
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"averageRevenue": 20,
"totalRevenue": 60
},
{
"region": "northeast",
"averageRevenue": 50,
"totalRevenue": 150
}
],
"totals": {
"averageRevenue": 35,
"totalRevenue": 210
}
}
As you can see above, the requested aggregated records are provided under aggregatedRecords
.
As a bonus, the totals across all records are also provided under totals
.
Example with Multiple Match Keys and Structured Data
If we want to find the average and total revenue by region and plan, we can specify multiple match keys. Additionally, if the data is structured, we can use path notation (like in lodash) to specify the keys. Path notation is also supported for the output fields.
const customers = [
{
customerId: 1,
region: "midwest",
other: { plan: "basic" },
revenue: 10,
},
{
customerId: 2,
region: "midwest",
other: { plan: "super" },
revenue: 20,
},
{
customerId: 3,
region: "midwest",
other: { plan: "basic" },
revenue: 30,
},
{
customerId: 4,
region: "northeast",
other: { plan: "super" },
revenue: 40,
},
{
customerId: 5,
region: "northeast",
other: { plan: "basic" },
revenue: 50,
},
{
customerId: 6,
region: "northeast",
other: { plan: "super" },
revenue: 60,
},
];
const result = aggregate({
records: customers,
matchKeys: ["region", "other.plan"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
totalRevenue: {
method: "sum",
sourceField: "revenue",
},
"extra.numberOfCustomers": {
method: "count",
},
},
});
The value of result
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"other": {
"plan": "basic"
},
"averageRevenue": 20,
"totalRevenue": 40,
"extra": {
"numberOfCustomers": 2
}
},
{
"region": "midwest",
"other": {
"plan": "super"
},
"averageRevenue": 20,
"totalRevenue": 20,
"extra": {
"numberOfCustomers": 1
}
},
{
"region": "northeast",
"other": {
"plan": "super"
},
"averageRevenue": 50,
"totalRevenue": 100,
"extra": {
"numberOfCustomers": 2
}
},
{
"region": "northeast",
"other": {
"plan": "basic"
},
"averageRevenue": 50,
"totalRevenue": 50,
"extra": {
"numberOfCustomers": 1
}
}
],
"totals": {
"averageRevenue": 35,
"totalRevenue": 210,
"extra": {
"numberOfCustomers": 6
}
}
}
Example with Buckets
If we want to find the average and total revenue by age range, we can use buckets.
const customers = [
{
customerId: 1,
age: 10,
revenue: 60,
},
{
customerId: 2,
age: 20,
revenue: 50,
},
{
customerId: 3,
age: 30,
revenue: 40,
},
{
customerId: 4,
age: 40,
revenue: 30,
},
{
customerId: 5,
age: 50,
revenue: 20,
},
{
customerId: 6,
age: 60,
revenue: 10,
},
];
const result = aggregate({
records: customers,
matchKeys: ["age"],
buckets: {
age: [0, 25, 50],
},
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
totalRevenue: {
method: "sum",
sourceField: "revenue",
},
numberOfCustomers: {
method: "count",
},
},
});
The value of result
will be:
{
"aggregatedRecords": [
{
"age": "0-25",
"averageRevenue": 55,
"totalRevenue": 110,
"numberOfCustomers": 2
},
{
"age": "25-50",
"averageRevenue": 30,
"totalRevenue": 90,
"numberOfCustomers": 3
},
{
"age": "50+",
"averageRevenue": 10,
"totalRevenue": 10,
"numberOfCustomers": 1
}
],
"totals": {
"averageRevenue": 35,
"totalRevenue": 210,
"numberOfCustomers": 6
}
}
Example of Weighted Average
Say we want to find the average customer revenue, but we want it to be weighted by the number of days a customer was active. We can use the "weightedAverage" method.
const customers = [
{
customerId: 1,
region: "midwest",
revenue: 10,
daysActive: 51,
},
{
customerId: 2,
region: "midwest",
revenue: 20,
daysActive: 52,
},
{
customerId: 3,
region: "midwest",
revenue: 30,
daysActive: 53,
},
{
customerId: 4,
region: "northeast",
revenue: 40,
daysActive: 54,
},
{
customerId: 5,
region: "northeast",
revenue: 50,
daysActive: 55,
},
{
customerId: 6,
region: "northeast",
revenue: 60,
daysActive: 56,
},
];
const result = aggregate({
records: customers,
matchKeys: ["region"],
fields: {
weightedAverageRevenue: {
method: "weightedAverage",
sourceField: "revenue",
weightField: "daysActive",
},
},
});
The value of result
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"weightedAverageRevenue": 20.128205128205128
},
{
"region": "northeast",
"weightedAverageRevenue": 50.121212121212125
}
],
"totals": {
"weightedAverageRevenue": 35.545171339563865
}
}
Example of Composition
We can put the results of two aggregations together.
const customersChunk1 = [
{
customerId: 1,
region: "midwest",
revenue: 10,
},
{
customerId: 2,
region: "midwest",
revenue: 20,
},
{
customerId: 3,
region: "midwest",
revenue: 30,
},
{
customerId: 4,
region: "northeast",
revenue: 40,
},
{
customerId: 5,
region: "northeast",
revenue: 50,
},
{
customerId: 6,
region: "northeast",
revenue: 60,
},
];
const customersChunk2 = [
{
customerId: 7,
region: "midwest",
revenue: 70,
},
{
customerId: 8,
region: "midwest",
revenue: 80,
},
{
customerId: 9,
region: "northeast",
revenue: 90,
},
{
customerId: 10,
region: "northeast",
revenue: 100,
},
];
const resultChunk1 = aggregate({
records: customersChunk1,
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
});
const resultChunk2 = aggregate({
records: customersChunk2,
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
});
const combinedResult = aggregate({
records: [
...resultChunk1.aggregatedRecords,
...resultChunk2.aggregatedRecords,
],
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
});
The value of combinedResult
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"averageRevenue": 42
},
{
"region": "northeast",
"averageRevenue": 68
}
],
"totals": {
"averageRevenue": 55
}
}
Example of Augmentation
We can add new records to an already calculated aggregation.
const customers = [
{
customerId: 1,
region: "midwest",
revenue: 10,
},
{
customerId: 2,
region: "midwest",
revenue: 20,
},
{
customerId: 3,
region: "midwest",
revenue: 30,
},
{
customerId: 4,
region: "northeast",
revenue: 40,
},
{
customerId: 5,
region: "northeast",
revenue: 50,
},
{
customerId: 6,
region: "northeast",
revenue: 60,
},
];
const result = aggregate({
records: customers,
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
});
const newCustomers = [
{
customerId: 7,
region: "midwest",
revenue: 70,
},
{
customerId: 8,
region: "midwest",
revenue: 80,
},
{
customerId: 9,
region: "northeast",
revenue: 90,
},
{
customerId: 10,
region: "northeast",
revenue: 100,
},
];
const updatedResult = aggregate({
records: [...result.aggregatedRecords, ...newCustomers],
matchKeys: ["region"],
fields: {
averageRevenue: {
method: "average",
sourceField: "revenue",
},
},
});
The value of updatedResult
will be:
{
"aggregatedRecords": [
{
"region": "midwest",
"averageRevenue": 42
},
{
"region": "northeast",
"averageRevenue": 68
}
],
"totals": {
"averageRevenue": 55
}
}