@stdlib/utils-dsv-base-parse

v0.2.2

Published

4 months ago

Incremental parser for delimiter-separated values (DSV).


DSV Parser

Incremental parser for delimiter-separated values (DSV).

Installation

npm install @stdlib/utils-dsv-base-parse

Usage

var Parser = require( '@stdlib/utils-dsv-base-parse' );

Parser( [options] )

Returns an incremental parser for delimiter-separated values (DSV).

var parse = new Parser();

// Parse a line of comma-separated values (CSV):
parse.next( '1,2,3,4\r\n' ); // => [ '1', '2', '3', '4' ]

// ...

// Parse multiple lines of CSV:
parse.next( '4,5,6\r\n7,8,9\r\n' ); // => [ '4', '5', '6' ], [ '7', '8', '9' ]

// ...

// Parse partial lines:
parse.next( 'a,b' );
parse.next( ',c,d\r\n' ); // => [ 'a', 'b', 'c', 'd' ]

// ...

// Chain together invocations:
parse.next( 'e,f' ).next( ',g,h' ).next( '\r\n' ); // => [ 'e', 'f', 'g', 'h' ]
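Because partial lines can be supplied across calls to next, a consumer may split input into smaller pieces before parsing. The following is a minimal sketch of feeding a string in fixed-size chunks. The feed helper and the recorder object are hypothetical and not part of this package; the recorder only mimics the chainable next method so that the snippet runs on its own. In practice, one would pass a Parser instance instead.

```javascript
// Hypothetical helper: feeds `str` to any object exposing a chainable `next` method, `size` characters at a time.
function feed( parser, str, size ) {
    var i;
    for ( i = 0; i < str.length; i += size ) {
        parser.next( str.substring( i, i + size ) );
    }
    return parser;
}

// Stand-in for a parser (assumption: records chunks only and performs no parsing):
var recorder = {
    'chunks': [],
    'next': function next( chunk ) {
        this.chunks.push( chunk );
        return this;
    }
};

feed( recorder, 'a,b,c,d\r\n', 3 );
console.log( recorder.chunks ); // => [ 'a,b', ',c,', 'd\r\n' ]
```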

The constructor accepts the following options:

comment: character sequence appearing at the beginning of a row which demarcates that the row content should be parsed as a commented line. A commented line ends upon encountering the first newline character sequence, regardless of whether that newline character sequence is preceded by an escape character sequence. Default: ''.
delimiter: character sequence separating record fields (e.g., ',' for comma-separated values (CSV) and '\t' for tab-separated values (TSV)). Default: ','.
doublequote: boolean flag indicating how quote sequences should be escaped within a quoted field. When true, a quote sequence must be escaped by another quote sequence. When false, a quote sequence must be escaped by the escape sequence. Default: true.
escape: character sequence for escaping character sequences having special meaning (i.e., the delimiter, newline, and escape sequences outside of quoted fields; the comment sequence at the beginning of a record and outside of a quoted field; and the quote sequence inside a quoted field when doublequote is false). Default: ''.
ltrim: boolean indicating whether to trim leading whitespace from field values. If false, the parser does not trim leading whitespace (e.g., a, b, c parses as [ 'a', ' b', ' c' ]). If true, the parser trims leading whitespace (e.g., a, b, c parses as [ 'a', 'b', 'c' ]). Default: false.
maxRows: maximum number of records to process (excluding skipped lines). By default, the maximum number of records is unlimited.
newline: character sequence separating rows. Default: '\r\n' (see RFC 4180).
onClose: callback to be invoked upon closing the parser. If a parser has partially processed a record upon close, the callback is invoked with the following arguments:
- value: unparsed partially processed field text.
Otherwise, the callback is invoked without any arguments.
onColumn: callback to be invoked upon processing a field. The callback is invoked with the following arguments:
- field: field value.
- row: row number (zero-based).
- col: field (column) number (zero-based).
- line: line number (zero-based).
onComment: callback to be invoked upon processing a commented line. The callback is invoked with the following arguments:
- comment: comment text.
- line: line number (zero-based).
onError: callback to be invoked upon encountering an unrecoverable parse error. By default, upon encountering a parse error, the parser throws an Error. When provided an error callback, the parser does not throw and, instead, invokes the provided callback. The callback is invoked with the following arguments:
- error: an Error object.
onRow: callback to be invoked upon processing a record. The callback is invoked with the following arguments:
- record: an array-like object containing field values. If provided a rowBuffer, the record argument will be the same array-like object for each invocation.
- row: row number (zero-based).
- ncols: number of fields (columns).
- line: line number (zero-based).
If a parser is closed before fully processing the last record, the callback is invoked with field data for all fields which have been parsed. Any remaining field data is provided to the onClose callback. For example, if a parser has processed two fields and closes while attempting to process a third field, the parser invokes the onRow callback with field data for the first two fields and invokes the onClose callback with the partially processed data for the third field.
onSkip: callback to be invoked upon processing a skipped line. The callback is invoked with the following arguments:
- record: unparsed record text.
- line: line number (zero-based).
onWarn: when strict is false, a callback to be invoked upon encountering invalid DSV. The callback is invoked with the following arguments:
- error: an Error object.
quote: character sequence demarcating the beginning and ending of a quoted field. When quoting is false, a quote character sequence has no special meaning and is processed as normal text. Default: '"'.
quoting: boolean flag indicating whether to enable special processing of quote character sequences (i.e., when a quote sequence should demarcate a quoted field). Default: true.
rowBuffer: array-like object for storing the field values of the most recently processed record. When provided, the row buffer is reused and is provided to the onRow callback for each processed record. If a provided row buffer is a generic array, the parser grows the buffer as needed. If a provided row buffer is a typed array, the buffer size is fixed and, thus, needs to be large enough to accommodate processed fields. Providing a fixed length array is appropriate when the number of fields is known prior to parsing. When the number of fields is unknown, providing a fixed length array may still be appropriate; however, one is advised to allocate a buffer having more elements than is reasonably expected in order to avoid buffer overflow.
rtrim: boolean indicating whether to trim trailing whitespace from field values. If false, the parser does not trim trailing whitespace (e.g., a ,b ,c parses as [ 'a ', 'b ', 'c' ]). If true, the parser trims trailing whitespace (e.g., a ,b ,c parses as [ 'a', 'b', 'c' ]). Default: false.
skip: character sequence appearing at the beginning of a row which demarcates that the row content should be parsed as a skipped record. Default: ''.
skipBlankRows: boolean flag indicating whether to skip over rows which are either empty or contain only whitespace. Default: false.
skipRow: callback whose return value indicates whether to skip over a row. The callback is invoked with the following arguments:
- nrows: number of processed rows (equivalent to the current row number).
- line: line number (zero-based).
If the callback returns a truthy value, the parser skips the row; otherwise, the parser attempts to process the row.
Note, however, that, even if the callback returns a falsy value, a row may still be skipped depending on the presence of a skip character sequence.
strict: boolean flag indicating whether to raise an exception upon encountering invalid DSV. When false, instead of throwing an Error or invoking the onError callback, the parser invokes an onWarn callback with an Error object specifying the encountered error. Default: true.
trimComment: boolean flag indicating whether to trim leading whitespace in commented lines. Default: true.
whitespace: list of characters to be interpreted as whitespace. Default: [ ' ' ].
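Since a provided rowBuffer is reused for every record, a consumer that needs to retain records has to copy them inside the onRow callback. The following is a minimal sketch in which the callback is invoked by hand to stand in for the parser; the buf and saved variables are illustrative names, not part of the package.

```javascript
// Stand-in for a reused row buffer (assumption: a generic array):
var buf = [];

// Records retained across invocations:
var saved = [];

function onRow( record, row, ncols ) {
    // Copy the first `ncols` fields, as `record` may be overwritten by the next row:
    saved.push( Array.prototype.slice.call( record, 0, ncols ) );
}

// Simulate two records written into the same buffer:
buf[ 0 ] = '1';
buf[ 1 ] = '2';
onRow( buf, 0, 2 );

buf[ 0 ] = '3';
buf[ 1 ] = '4';
onRow( buf, 1, 2 );

console.log( saved ); // => [ [ '1', '2' ], [ '3', '4' ] ]
```

Had the callback stored record directly, both entries would refer to the same array and hold the values of the last row.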

The parser does not perform field conversion/transformation and, instead, is solely responsible for incrementally identifying fields and records. Further processing of fields/records is the responsibility of parser consumers who are generally expected to provide either an onColumn callback, an onRow callback, or both.

var format = require( '@stdlib/string-format' );

function onColumn( field, row, col ) {
    console.log( format( 'Row: %d. Column: %d. Value: %s', row, col, field ) );
}

function onRow( record, row, ncols ) {
    console.log( format( 'Row: %d. nFields: %d. Value: | %s |', row, ncols, record.join( ' | ' ) ) );
}

var opts = {
    'onColumn': onColumn,
    'onRow': onRow
};
var parse = new Parser( opts );

parse.next( '1,2,3,4\r\n' ); // => [ '1', '2', '3', '4' ]
parse.next( '5,6,7,8\r\n' ); // => [ '5', '6', '7', '8' ]

// ...
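As field values are provided as strings, any type conversion belongs in the consumer's callbacks. The following is a minimal sketch of numeric conversion inside an onRow callback. The toNumbers helper is hypothetical, and the callback is invoked by hand here so that the snippet runs without the parser; with a parser, one would supply onRow via the options object as shown above.

```javascript
// Hypothetical helper: converts the first `ncols` fields of a record to numbers.
function toNumbers( record, ncols ) {
    var out = [];
    var i;
    for ( i = 0; i < ncols; i++ ) {
        out.push( Number( record[ i ] ) );
    }
    return out;
}

var rows = [];

function onRow( record, row, ncols ) {
    rows.push( toNumbers( record, ncols ) );
}

// Simulate the records from the example above:
onRow( [ '1', '2', '3', '4' ], 0, 4 );
onRow( [ '5', '6', '7', '8' ], 1, 4 );

console.log( rows ); // => [ [ 1, 2, 3, 4 ], [ 5, 6, 7, 8 ] ]
```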

Upon closing the parser, the parser invokes an onClose callback with any partially processed (i.e., incomplete) field data. Note, however, that the field data may not equal the original character sequence, as escape sequences may have already been removed.

var format = require( '@stdlib/string-format' );

function onClose( v ) {
    console.log( format( 'Incomplete: %s', v ) );
}

var opts = {
    'onClose': onClose
};
var parse = new Parser( opts );

parse.next( '1,2,3,4\r\n' ); // => [ '1', '2', '3', '4' ]

// ...

// Provide an incomplete record:
parse.next( '5,6,"foo' );

// Close the parser:
parse.close();

By default, the parser assumes RFC 4180-compliant newline-delimited comma-separated values (CSV). To parse input which uses different separators, set the delimiter and newline options.

var opts = {
    'delimiter': '--',
    'newline': '%%'
};
var parse = new Parser( opts );

parse.next( '1--2--3--4%%' ); // => [ '1', '2', '3', '4' ]
parse.next( '5--6--7--8%%' ); // => [ '5', '6', '7', '8' ]

// ...

By default, the parser treats a double (i.e., two consecutive) quote character sequence within a quoted field as a single escaped quote. To parse DSV in which quote character sequences within quoted fields are instead escaped by an escape character sequence, set doublequote to false and specify the escape character sequence.

// Default parser:
var parse = new Parser();

// Parse DSV using double quoting:
parse.next( '1,"""2""",3,4\r\n' ); // => [ '1', '"2"', '3', '4' ]

// ...

// Create a parser which uses a custom escape sequence within quoted fields:
var opts = {
    'doublequote': false,
    'escape': '\\'
};
parse = new Parser( opts );

parse.next( '1,"\\"2\\"",3,4\r\n' ); // => [ '1', '"2"', '3', '4' ]
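To make the two escaping conventions concrete, the following sketch goes in the opposite direction and produces a quoted field from a raw value. The quoteField helper is hypothetical and not part of this package; it is a simplification which does not handle values that themselves contain the escape sequence.

```javascript
// Hypothetical helper: wraps `value` in quotes, escaping embedded quotes either by doubling (default) or with `escape`.
function quoteField( value, quote, escape ) {
    var esc = ( escape === void 0 ) ? quote : escape;
    return quote + value.split( quote ).join( esc + quote ) + quote;
}

// Double quoting (corresponds to `doublequote: true`):
var a = quoteField( '"2"', '"' );
console.log( a ); // => '"""2"""'

// Escape sequence (corresponds to `doublequote: false` and `escape: '\\'`):
var b = quoteField( '"2"', '"', '\\' );
console.log( b ); // => '"\\"2\\""'
```

The two results match the second field in each of the inputs parsed above.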

When quoting is true, the parser identifies a quote character sequence at the beginning of a field as the start of a quoted field. To process quote character sequences as normal field text, set quoting to false.

// Default parser:
var parse = new Parser();

parse.next( '1,"2",3,4\r\n' ); // => [ '1', '2', '3', '4' ]

// ...

// Create a parser which treats quote sequences as normal field text:
var opts = {
    'quoting': false
};
parse = new Parser( opts );

parse.next( '1,"2",3,4\r\n' ); // => [ '1', '"2"', '3', '4' ]

To parse DSV containing commented lines, specify a comment character sequence which demarcates the beginning of a commented line.

var opts = {
    'comment': '#'
};
var parse = new Parser( opts );

parse.next( '1,2,3,4\r\n' ); // => [ '1', '2', '3', '4' ]
parse.next( '# This is a commented line.\r\n' ); // comment
parse.next( '9,10,11,12\r\n' ); // => [ '9', '10', '11', '12' ]

To parse DSV containing skipped lines, specify a skip character sequence which demarcates the beginning of a skipped line.

var opts = {
    'skip': '//'
};
var parse = new Parser( opts );

parse.next( '1,2,3,4\r\n' ); // => [ '1', '2', '3', '4' ]
parse.next( '//5,6,7,8\r\n' ); // skipped line
parse.next( '9,10,11,12\r\n' ); // => [ '9', '10', '11', '12' ]
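Rows can also be skipped programmatically via the skipRow option described above, which receives the number of processed rows and the zero-based line number and skips a row when the return value is truthy. The following is a minimal sketch of a callback factory for skipping leading lines, such as a header; skipFirst is a hypothetical helper name.

```javascript
// Hypothetical factory: returns a `skipRow` callback which skips the first `n` lines.
function skipFirst( n ) {
    return skipRow;

    function skipRow( nrows, line ) {
        return ( line < n );
    }
}

var skipRow = skipFirst( 1 );

console.log( skipRow( 0, 0 ) ); // => true
console.log( skipRow( 0, 1 ) ); // => false
```

The returned function would be supplied to the constructor as the skipRow option (e.g., { 'skipRow': skipFirst( 1 ) }).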

Properties

Parser.prototype.done

Read-only property indicating whether a parser has been closed and, thus, is no longer able to process new chunks.

var parse = new Parser();

parse.next( '1,2,3,4\r\n' );

// ...

var b = parse.done;
// returns false

// ...

parse.close();

// ...

b = parse.done;
// returns true

Methods

Parser.prototype.next( chunk )

Incrementally parses the next chunk.

var parse = new Parser();

parse.next( '1,2,3,4\r\n' );

// ...

parse.next( '5,6,7,8\r\n' );

// ...

Parser.prototype.close()

Closes the parser.

var parse = new Parser();

parse.next( '1,2,3,4\r\n' );

// ...

parse.next( '5,6,7,8\r\n' );

// ...

parse.close();

After a parser is closed, the parser raises an exception upon receiving any additional chunks.
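A consumer which cannot rule out late chunks may check the done property before calling next. The following is a minimal sketch of such a guard. The safeNext helper is hypothetical, and the stub object only mimics the done, next, and close members described above so that the snippet runs on its own.

```javascript
// Hypothetical helper: forwards a chunk only if the parser has not been closed.
function safeNext( parser, chunk ) {
    if ( parser.done ) {
        return false;
    }
    parser.next( chunk );
    return true;
}

// Stand-in for a parser (assumption: records chunks only and performs no parsing):
var stub = {
    'done': false,
    'received': [],
    'next': function next( chunk ) {
        if ( this.done ) {
            throw new Error( 'parser is closed' );
        }
        this.received.push( chunk );
        return this;
    },
    'close': function close() {
        this.done = true;
    }
};

var first = safeNext( stub, '1,2,3,4\r\n' );
stub.close();
var second = safeNext( stub, '5,6,7,8\r\n' );

console.log( first, second ); // => true false
```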

Notes

Special character sequences (i.e., delimiter, newline, quote, escape, skip, and comment sequences) must all be unique with respect to one another, and no special character sequence is allowed to be a subsequence of another special character sequence. Allowing common subsequences would lead to ambiguous parser states.
For example, given the chunk '1,,3,4,,', if delimiter is ',' and newline is ',,', is the first ',,' a field with no content or a newline? The parser cannot be certain, hence the prohibition.
As specified in RFC 4180, special character sequences must be consistent across all provided chunks. Hence, providing chunks in which, e.g., line breaks vary between \r, \n, and \r\n is not supported.
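The uniqueness constraint on special character sequences can be checked before constructing a parser. The following is a minimal sketch which interprets "subsequence" as a contiguous substring and treats empty strings as unset options; both are assumptions, and the findConflict helper is hypothetical rather than part of this package.

```javascript
// Hypothetical helper: returns the names of the first pair of sequences in which one is contained in the other, or null.
function findConflict( seqs ) {
    var keys = Object.keys( seqs );
    var a;
    var b;
    var i;
    var j;
    for ( i = 0; i < keys.length; i++ ) {
        a = seqs[ keys[ i ] ];
        if ( a === '' ) {
            continue; // assumption: an empty string means the option is unset
        }
        for ( j = 0; j < keys.length; j++ ) {
            b = seqs[ keys[ j ] ];
            if ( i !== j && b !== '' && b.indexOf( a ) !== -1 ) {
                return [ keys[ i ], keys[ j ] ];
            }
        }
    }
    return null;
}

var bad = findConflict({
    'delimiter': ',',
    'newline': ',,'
});
console.log( bad ); // => [ 'delimiter', 'newline' ]

var ok = findConflict({
    'delimiter': ',',
    'newline': '\r\n',
    'quote': '"',
    'escape': '',
    'comment': '#',
    'skip': '//'
});
console.log( ok ); // => null
```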

Examples

var format = require( '@stdlib/string-format' );
var Parser = require( '@stdlib/utils-dsv-base-parse' );

function onColumn( v, row, col ) {
    console.log( format( 'Row: %d. Column: %d. Value: %s', row, col, v ) );
}

function onRow( v, row, ncols ) {
    console.log( format( 'Row: %d. nFields: %d. Value: | %s |', row, ncols, v.join( ' | ' ) ) );
}

function onComment( str ) {
    console.log( format( 'Comment: %s', str ) );
}

function onSkip( str ) {
    console.log( format( 'Skipped line: %s', str ) );
}

function onWarn( err ) {
    console.log( format( 'Warning: %s', err.message ) );
}

function onError( err ) {
    console.log( format( 'Error: %s', err.message ) );
}

function onClose( v ) {
    console.log( format( 'End: %s', v || '(none)' ) );
}

var opts = {
    'strict': false,
    'newline': '\r\n',
    'delimiter': ',',
    'escape': '\\',
    'comment': '#',
    'skip': '//',
    'doublequote': true,
    'quoting': true,
    'onColumn': onColumn,
    'onRow': onRow,
    'onComment': onComment,
    'onSkip': onSkip,
    'onError': onError,
    'onWarn': onWarn,
    'onClose': onClose
};
var parse = new Parser( opts );

var str = [
    [ '1', '2', '3', '4' ],
    [ '5', '6', '7', '8' ],
    [ 'foo\\,', 'bar\\ ,', 'beep\\,', 'boop\\,' ],
    [ '""",1,"""', '""",2,"""', '""",3,"""', '""",4,"""' ],
    [ '# This is a "comment", including with commas.' ],
    [ '\\# Escaped comment', '# 2', '# 3', '# 4' ],
    [ '1', '2', '3', '4' ],
    [ '//A,Skipped,Line,!!!' ],
    [ '"foo"', '"bar\\ "', '"beep"', '"boop"' ],
    [ ' # 😃', ' # 🥳', ' # 😮', ' # 🤠' ]
];
var i;
for ( i = 0; i < str.length; i++ ) {
    str[ i ] = str[ i ].join( opts.delimiter );
}
str = str.join( opts.newline );

console.log( format( 'Input:\n\n%s\n', str ) );
parse.next( str ).close();

Notice

This package is part of stdlib, a standard library for JavaScript and Node.js, with an emphasis on numerical and scientific computing. The library provides a collection of robust, high performance libraries for mathematics, statistics, streams, utilities, and more.

For more information on the project, filing bug reports and feature requests, and guidance on how to develop stdlib, see the main project repository.

License

See LICENSE.