csv-scrubber
v1.2.1
Published
For node 4.2.4 and higher (uses ES6 functionality).
Downloads
52
Readme
csv-scrubber
For node 4.2.4 and higher (uses ES6 functionality).
csv-scrubber
is a library for stream transforming CSV files.
- Scrub field values.
- Add rows.
- Remove rows.
- Split one input file to multiple output files.
- Integrated logging.
npm install csv-scrubber
Usage
csv-scrubber
uses middleware kind of like http://expressjs.com
Middleware is executed in the order it is added to the scrubber.
Each middleware function operates on a single row from the input CSV. However since you can add or remove rows, middleware is passed an array of records. Initially there is a single record in the array (the original CSV row). To add or remove output rows just add or remove records in the array.
Each record is an array of strings -- one element for each CSV column. You can
access record columns by index, but it's generally easier to access
them by their column name (see addNameAccess
middleware below).
If you bind your middleware to the scrubber you have access to the scrubber
log
and properties like currentRow
.
scrubber.useHeader
runs middleware for header rows only.
scrubber.useAllRows
runs middleware for all rows including header.
scrubber.use
runs middleware for non-header rows only.
scrubber.js
#!/usr/bin/env node
const scrubber = require('csv-scrubber')();
// Accessing values by index here. Generally easier to access by name as shown
// in formatBirthDate. If prior middleware may have added or removed rows,
// you will want to loop over records as show here. If not records[0] is simpler.
function removeRowIfFirstColBlankOrInteger(records, cb) {
for (let i = 0, l = records.length; i < l; i++) {
if (isBlank(records[i][0]) || isInteger(records[i][0])) {
this.log.debug(`removed row: ${this.currentRow}`);
records.splice(i, 1);
}
}
cb(null);
}
// I know i am not going to add or remove rows, so just using records[0].
// Also accessing field by name instead of index.
function formatBirthDate(records, cb) {
records[0].birthdate = formatDate(records[0].birthdate);
}
scrubber.useAllRows(scrubber.addNameAccess.bind(scrubber));
scrubber.use(removeRowIfFirstColBlankOrInteger.bind(scrubber));
scrubber.use(formatBirthDate.bind(scrubber));
scrubber.scrub();
Default behavior is to pipe stdin to stdout
./scrubber.js < original.csv > some_rows_removed.csv
Included middleware
csv-scrubber
comes with two middleware functions included.
normalizeHeader
makes your header row all lowercase and replaces anything
other than alpha numerics with underscore and removes dup underscores.
It logs a warn message if you end up with duplicate column names.
addNameAccess
allows you to access fields by their column name instead
of index. You can do record.first_name
instead of record[3]
.
const scrubber = require('csv-scrubber')();
// You may want to normalize your header. If you do, it has to come before
// addNameAccess.
scrubber.useHeader(scrubber.normalizeHeader.bind(scrubber));
// If your CSV has a header row, addNameAccess is really nice.
scrubber.useAllRows(scrubber.addNameAccess.bind(scrubber));
Input and output streams
You have total control over the input and output streams.
If you specify nothing, we will take input from stdin
and output to
stdout
.
Here are examples of how to specify input and output streams.
const scrubber = require('csv-scrubber')({ instream: '/some/input.csv' });
const scrubber = require('csv-scrubber')({ instream: myInputStream });
const scrubber = require('csv-scrubber')({ outstream: '/some/output.csv' });
const scrubber = require('csv-scrubber')({ outstream: myOutputStream });
const scrubber = require('csv-scrubber')({ outstream: null });
If you specifically pass null
as outstream
we don't output anything. That
may be useful if you are sending rows to database or such. It's also useful with
splits where you are only interested in outputting the splits and not the source
row.
See splits below for how to output to multiple files.
Logging
scrubber.log
uses loglove
and defaults to logging warn
level to the file
scrub.log
. https://github.com/johndstein/loglove
Multiple output files (splits)
Splits allow you to split a single input file into multipe output files.
Imagine you have a CSV file that includes both contact and company info in the same row. You want to split them out to different output files.
const splits = [
{ name: 'contact',
header: ['name', 'title', 'email'],
outstream: 'contact.csv' },
{ name: 'company',
header: ['name', 'address', 'fax'],
outstream: 'company.csv' } ]
const scrubber = require('csv-scrubber')({ splits: splits });
Assuming we pass the above splits to the scrubber constructor, scrubber will add the following properties to the header record.
record.contact = ['name', 'title', 'email'];
record.company = ['name', 'address', 'fax'];
Scrubber will add the following blank row to each non-header record.
record.contact = ['', '', ''];
record.company = ['', '', ''];
You will need to add middleware functions to populate the contact
and
company
values from the record. Then scrubber will output contact
to
contact.csv
and company
to company.csv
.
If you are using the addNameAccess
middleware, you can access split field
values by name. So you can do record.contact.name = 'Harry'
.
TODO
maybe add error on name access if you reference a column that's not in the header.