npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

csv-scrubber

v1.2.1

Published

For node 4.2.4 and higher (uses ES6 functionality).

Downloads

52

Readme

csv-scrubber

For node 4.2.4 and higher (uses ES6 functionality).

csv-scrubber is a library for stream transforming CSV files.

  • Scrub field values.
  • Add rows.
  • Remove rows.
  • Split one input file to multiple output files.
  • Integrated logging.
npm install csv-scrubber

Usage

csv-scrubber uses middleware kind of like http://expressjs.com

Middleware is executed in the order it is added to the scrubber.

Each middleware function operates on a single row from the input CSV. However since you can add or remove rows, middleware is passed an array of records. Initially there is a single record in the array (the original CSV row). To add or remove output rows just add or remove records in the array.

Each record is an array of strings -- one element for each CSV column. You can access record columns by index, but it's generally easier to access them by their column name (see addNameAccess middleware below).

If you bind your middleware to the scrubber you have access to the scrubber log and properties like currentRow.

scrubber.useHeader runs middleware for header rows only.

scrubber.useAllRows runs middleware for all rows including header.

scrubber.use runs middleware for non-header rows only.

scrubber.js

#!/usr/bin/env node

const scrubber = require('csv-scrubber')();

// Accessing values by index here. Generally easier to access by name as shown
// in formatBirthDate. If prior middleware may have added or removed rows,
// you will want to loop over records as show here. If not records[0] is simpler.
function removeRowIfFirstColBlankOrInteger(records, cb) {
  for (let i = 0, l = records.length; i < l; i++) {
    if (isBlank(records[i][0]) || isInteger(records[i][0])) {
      this.log.debug(`removed row: ${this.currentRow}`);
      records.splice(i, 1);
    }
  }
  cb(null);
}

// I know i am not going to add or remove rows, so just using records[0].
// Also accessing field by name instead of index.
function formatBirthDate(records, cb) {
    records[0].birthdate = formatDate(records[0].birthdate);
}

scrubber.useAllRows(scrubber.addNameAccess.bind(scrubber));
scrubber.use(removeRowIfFirstColBlankOrInteger.bind(scrubber));
scrubber.use(formatBirthDate.bind(scrubber));

scrubber.scrub();

Default behavior is to pipe stdin to stdout

./scrubber.js < original.csv > some_rows_removed.csv

Included middleware

csv-scrubber comes with two middleware functions included.

normalizeHeader makes your header row all lowercase and replaces anything other than alpha numerics with underscore and removes dup underscores. It logs a warn message if you end up with duplicate column names.

addNameAccess allows you to access fields by their column name instead of index. You can do record.first_name instead of record[3].

const scrubber = require('csv-scrubber')();

// You may want to normalize your header. If you do, it has to come before
// addNameAccess.
scrubber.useHeader(scrubber.normalizeHeader.bind(scrubber));
// If your CSV has a header row, addNameAccess is really nice.
scrubber.useAllRows(scrubber.addNameAccess.bind(scrubber));

Input and output streams

You have total control over the input and output streams.

If you specify nothing, we will take input from stdin and output to stdout.

Here are examples of how to specify input and output streams.

const scrubber = require('csv-scrubber')({ instream: '/some/input.csv' });
const scrubber = require('csv-scrubber')({ instream: myInputStream });

const scrubber = require('csv-scrubber')({ outstream: '/some/output.csv' });
const scrubber = require('csv-scrubber')({ outstream: myOutputStream });
const scrubber = require('csv-scrubber')({ outstream: null });

If you specifically pass null as outstream we don't output anything. That may be useful if you are sending rows to database or such. It's also useful with splits where you are only interested in outputting the splits and not the source row.

See splits below for how to output to multiple files.

Logging

scrubber.log uses loglove and defaults to logging warn level to the file scrub.log. https://github.com/johndstein/loglove

Multiple output files (splits)

Splits allow you to split a single input file into multipe output files.

Imagine you have a CSV file that includes both contact and company info in the same row. You want to split them out to different output files.

const splits = [
  { name: 'contact',
    header: ['name', 'title', 'email'],
    outstream: 'contact.csv' },
  { name: 'company',
    header: ['name', 'address', 'fax'],
    outstream: 'company.csv' } ]

const scrubber = require('csv-scrubber')({ splits: splits });

Assuming we pass the above splits to the scrubber constructor, scrubber will add the following properties to the header record.

record.contact = ['name', 'title', 'email'];
record.company = ['name', 'address', 'fax'];

Scrubber will add the following blank row to each non-header record.

record.contact = ['', '', ''];
record.company = ['', '', ''];

You will need to add middleware functions to populate the contact and company values from the record. Then scrubber will output contact to contact.csv and company to company.csv.

If you are using the addNameAccess middleware, you can access split field values by name. So you can do record.contact.name = 'Harry'.

TODO

maybe add error on name access if you reference a column that's not in the header.