dezent

v1.0.8

Published

3 months ago

Powerful pattern matching and parsing that's readable, recursive, and structured.

Downloads

0High
0Medium
0Low

neilmix

parse parsing parser patterns pattern matching

Dezent

Powerful pattern matching and parsing that's readable, recursive, and structured.

Introduction

Dezent is an alternative to regular expressions for difficult use-cases, providing complex pattern matching that's easier to build, maintain, and use. It is a pattern matching engine similar to regular expressions, plus reusable rules that can be nested recursively, with whitespace-friendly syntax containing comments, plus malleable JSON output yielding data structures perfectly suited to your specific use-case. As an added bonus, Dezent is capable of partial parsing, making it great for situations where you are reading data in asynchronous chunks.

Dezent is ideal for:

Complex nested pattern matching
Converting text data into hierarchical data structures
Extracting text segments into structured output
Recursive descent parsing
Parsing asynchronously delivered input

Motivation

I built Dezent as a side-project to a side-project I was working on. I desired a custom configuration syntax that happened to be a small subset of Typescript - something that could be parsed into a simple JSON data structure (aka abstract syntax tree) while still being valid, compilable Typescript. This is a task that's too complex for regular expressions, but it wasn't really worth hand-crafting a recursive descent parser. All the parsing frameworks I'd seen were a bit complex, and I thought "why can't there be a regular-expression-like syntax that parses recursively and yields simple JSON data structures?" And thus Dezent was born.

Status

Dezent is fully functional - please try it out! I'd love your feedback. I am awaiting feedback to consider where to take it next.

License

Dezent is made available under the MIT License, see License.txt for details. Please contact [email protected] for support.

Quick Start

Dezent is inspired by Parsing Expression Grammars - if you understand the basic concepts of PEGs, Dezent will be easier to follow.

The Dezent javascript library and the dezent command-line tool are included in this distribution, see below for documentation. To parse a document, you must create a grammar that describes how your document is parsed. Reference documentation for grammars is included below.

The following examples are included in this distribution:

URLparser.js: A utility to interactively read and parse URLs. This example demonstrates general-purpose complex pattern matching.
randomJSONGenerator.js and randomJSONParser.js: a hypothentical advanced example of streamed log file parsing. This example demonstrates how Dezent can be used to parse asynchronous streams of data. The randomJSONGenerator.js script generates log files consisting of timestamps followed by a JSON object. The randomJSONParser.js parses and processes the JSON object on each line and displays how quickly the parser is able to consume data. To try out the example, run node examples/randomJSONGenerator.js | node examples/randomJSONParser.js from your command line.
calculator.js: A utility that interactively parses and calculates simple math expressions follow proper order of operations including left-associativity. This example demonstrates a powerful parsing concept called left recursion that's useful for parsing arithmentic expressions.

Reference - the Dezent object

The Dezent constructor parses and compiles your grammar:

> Dezent = require('dezent').Dezent;
[Function: Dezent]
> let d = new Dezent(`return {[a-zA-Z]+} ' '+ {[a-zA-Z]+} {[!.?]} -> [$1, $2, $3];`);

If your grammar has an error, you'll get an exception:

> d = new Dezent(`return {[a-zA-Z]+} ' '+ {[a-zA-Z]+} {[!.?]} -> [$1, $2, $3]`);
Error: Error parsing grammar: expected: ;
At line 1 char 60:
return {[a-zA-Z]+} ' '+ {[a-zA-Z]+} {[!.?]} -> [$1, $2, $3]
                                                           ^

The parse method parses an input string:

> d.parse('Hello world!');
[ 'Hello', 'world', '!' ]

If the text doesn't parse correctly, the parse method returns undefined. To see what went wrong, examine the error member of the Dezent object:

> d.parse('Hello to the whole world!');
undefined
> console.log(d.error.message);
Parse failed: expected one of the following: 
        a-z A-Z
        ! . ?
At line 1 char 9:
Hello to the whole world!
        ^

Reference - the DezentStream object

The DezentStream class provides partial or chunked parsing, which is useful for large files. It works much like the Dezent class, but uses the write and close methods in place of the parse method.

> DezentStream = require('dezent').DezentStream;
[Function: DezentStream]
> let ds = new DezentStream(`return {[a-zA-Z]+} ' '+ {[a-zA-Z]+} {[!.?]} -> [$1, $2, $3];`);
undefined
ds.write('Hello w');
undefined
ds.write('orld!');
undefined
ds.close();
[ 'Hello', 'world', '!' ]

The minBufferSizeInMB configuration controls when input text is released from memory. Take care to ensure that this value isn't too small, or you may trigger an exception by referencing input text that's already been released (for example, referencing a $0 value that's bigger than minBufferSizeInMB). The default is 1, for 1 megabyte.

> let ds = new DezentStream(grammar, { minBufferSizeInMB: 15 }); // minimum buffer size is 15 MB

Reference - performance profiling

If you're curious about the performance profile of your grammar, use the enableProfiling configuration flag to output a profiling summary of your parse to console.log. Consider the following silly grammar that performs very poorly:

> Dezent = require('dezent').Dezent;
[Function: Dezent]
> let d = new Dezent(`
    return { letter+ '-' | number | letter }+ -> $1[0];
    letter = [a-zA-Z] -> $0;
    number = [0-9] -> $0;`, { enableProfiling: true });
> d.parse('abcdefghijklmnopqrstuvwxyz0123456789');
      rule name  call count   call time  pass count   pass time  fail count   fail time   redundant  
      ---------  ----------   ---------  ----------   ---------  ----------   ---------   ---------  
         letter         415           1         377           1          38           0         378  
         number          37           0          10           0          27           0           0  

'a'

Reference - command-line util

% npm install -g dezent
% dezent
usage: dezent [grammar path] [text path]
% echo "return {[a-zA-Z]+} ' '+ {[a-zA-Z]+} {[\!.?]} [ \\\n]* -> [\$1, \$2, \$3];" > helloworld.dezent
% echo 'Hello world!' > input.txt
% dezent helloworld.dezent input.txt
["Hello","world","!"]
% echo 'Hello world!' | dezent helloworld.dezent -
["Hello","world","!"]

Reference - grammar - pattern matching

`return`

'return' whitespace _ rule _ ';'

Dezent uses top-down parsing starting with the return rule. It is expected to consume the entire document, and a parse error is triggered if it does not. There may be only one return rule per grammar.

> new Dezent(`return 'a' -> $0;`).parse('a');
`a`

> new Dezent(`return {'a'*} -> $1;`).parse('aaa');
'aaa'

`=` ruleset

{identifier} _ '=' _ {rule} ( _ ',' _ {rule} )* _ ';'

Rulesets provide abstraction, reusability, and recursion within Dezent grammars. A ruleset is given a name and one or more rules that match input. Rulesets are referred to within patterns via the ruleset's name. Its rules are evaluated in order, and the first rule to match will yield the ruleset's output, thereby termination matching for the ruleset. If no rules match, the ruleset will fail to match. Each ruleset name must be unique within the grammar.

> new Dezent(`
    three_chars = ... -> $0;
    return {three_chars} -> $1;
`).parse('abc');
'abc'

> new Dezent(`
    two_rules = 'a' -> 1, 'b' -> 2;
    return {two_rules}* -> $1;
`).parse('abab');
[1, 2, 1, 2]

Rules can be used recursively:

> new Dezent(`
    a_rule = {'a'} {a_rule}? -> [$1, $2];
    return {a_rule} -> $1;
`).parse('aaa');
[ 'a', [ 'a', [ 'a', null ] ] ]

Dezent also support left recursion:

> new Dezent(`
    number = [0-9]+ -> $0;
    add = {add} '+' {number} -> [$1, $2],
          {number} -> $1;
    return {add} -> $1;
`).parse('1+2+3');
[ [ '1', '2' ], '3' ]

`->` rule

pattern _ ( '|' _ pattern _ )* _ '->' _ value

Rules are the core of Dezent's pattern matching power. Each rule consists of one or more patterns that may match and produce output. Patterns are evaluated in order from first to last, and the first pattern that matches will trigger the rule to return its output, thereby terminating matching for this rule. If no patterns successfully match, the rule itself fails to match.

> new Dezent(`myRule = 'ab' | 'cd' -> $0; return myRule -> $0;`).parse('cd');
'cd'

> new Dezent(`myRule = 'a'|'b'|'c' -> $0; return myRule -> $0;`).parse('d'); // does not match
undefined

The yield of a rule is optional and defaults to $0. (The yield of a return is mandatory.)

> new Dezent(`myRule = 'ab' | 'cd'; return myRule -> $0;`).parse('cd');
'cd'

pattern

(predicate (capture|group|string|class|ruleref|any) modifier _)+

Patterns consist of a sequence of tokens that match (or fail to match) input text.

> new Dezent(`return 'a' 'b' 'c' -> $0;`).parse('abc');
'abc'
> new Dezent(`return 'a' 'b' 'c' -> $0;`).parse('xyz'); // does not match
undefined

Terminal tokens (string, class, or any) match a sequence of characters:

> new Dezent(`return 'a string' -> $0;`).parse('a string');
'a string'
> new Dezent(`return [a-zA-Z !]* -> $0;`).parse('Hello World!');
'Hello World!'
> new Dezent(`return .* -> $0;`).parse('I can match anything.');
'I can match anything.'

Grouping tokens (group or capture) match one of a series of patterns:

> new Dezent(`return ('a' . | 'b' .)+ -> $0;`).parse('acbx');
'acbx'
> new Dezent(`return {'a' . | 'b' .}+ -> $1;`).parse('acbx');
[ 'ac', 'bx' ]

A ruleref refers to a rule defined in the grammar:

> new Dezent(`
    match_a = 'a' -> $0;
    return match_a+ -> $0;
`).parse('aaaa');
'aaaa'

All tokens within the pattern must match in order for the pattern to match.

> let d = new Dezent(`return 'a' 'b' 'c' -> $0;`);
> d.parse('abx'); // does not match
undefined
> console.log(d.error.message);
Parse failed: expected: c
At line 1 char 3:
abx
  ^

`[]` character class

'[' ( classChar '-' classChar | classChar )* ']'

Character class terminals match one of a set of characters or range of characters. They match one character at a time - use the * or + repeat modifiers to match a repeating sequence of the class.

> new Dezent(`
    identifier = [a-zA-Z_][0-9a-zA-Z_]* -> $0;
    return identifier -> $0;
`).parse('myVarName_123');
'myVarName_123'

`.` any character

The dot matches any character.

> new Dezent(`return .{...}. -> $1;`).parse('abcde');
'bcd'

Watch out - matching .* will consume all the remaining input.

> new Dezent(`return 'a' .* 'b' -> $0;`).parse('axxxxxxb'); // does not match
undefined

A repeating dot character is best used in conjunction with a not predicate.

> new Dezent(`return 'a' (!'b' .)* 'b' -> $0;`).parse('axxxxxxb');
'axxxxxxb'

`''` string

['] (escape|stringText)* [']

Represents a specific sequence of characters to match. Escapes via backslash character are typical: The \u escape is followed by 4 hex digits and represents a unicode character. There are 5 whitespace escapes: \n (newline), \t (tab), \r (carriage return), \b (backsape), and \f (form feed). Any other character preceded by a backslash will be replaced with the character itself.

> new Dezent(`return 'abc\\txyz\\.' -> $0;`).parse('abc\txyz.');
'abc\txyz.'

`*` modifier - repeat 0 or more

Indicates that this token should repeat 0 or more times. Repeat modifiers are greedy, which means they consume as much input as possible.

> new Dezent(`return 'a'* -> $0;`).parse('aaa');
'aaa'
> new Dezent(`return [ab]* 'b' -> $0;`).parse('aaabb'); // does not match
undefined

To avoid consuming too much input, use a predicate (predicates don't consume input).

> new Dezent(`return '"' { (!'"' .)* } '"' -> $1;`).parse('"parsed string"');
'parsed string'

Any capture contained within a repeated token will return an array as output, regardless of how many times the token matches.

> new Dezent(`return 'a' {'b'}* 'a' -> $1;`).parse('abbba');
[ 'b', 'b', 'b' ]
> new Dezent(`return 'a' {'b'}* 'a' -> $1;`).parse('aa');
[]
> new Dezent(`return ( {[a-z]*} [0-9]* )* -> $1;`).parse('ab12cd34');
[ 'ab', 'cd' ]

`+` modifier - repeat 1 or more

Indicates that this token should repeat 1 or more times. Repeat modifiers are greedy, which means they consume as much input as possible.

> new Dezent(`return 'a'+ -> $0;`).parse('aaa');
'aaa'
> new Dezent(`return [ab]+ 'b' -> $0;`).parse('aaabb'); // does not match
undefined

To avoid consuming too much input, use a token with predicate as guard (predicates don't consume input). Any capture contained within a repeated token will return an array as output, regardless of how many times the token matches.

> new Dezent(`return 'a' {'b'}+ 'a' -> $1;`).parse('abbba');
[ 'b', 'b', 'b' ]
> new Dezent(`return 'a' {'b'}+ 'a' -> $1;`).parse('aba');
[ 'b' ]
> new Dezent(`return 'a' {'b'}+ 'a' -> $1;`).parse('aa'); // does not match
undefined
> new Dezent(`return ( {[a-z]+} [0-9]+ )+ -> $1;`).parse('ab12cd34');
[ 'ab', 'cd' ]

`?` modifier - maybe

Indicates that the given token is not required to match.

> new Dezent(`return 'a' 'b'? 'a' -> $0;`).parse('aba');
'aba'
> new Dezent(`return 'a' 'b'? 'a' -> $0;`).parse('aa');
'aa'

A capture within a maybe token will return either the captured value or null, depending on whether the token matched.

> new Dezent(`return 'a' {'b'}? 'a' -> $1;`).parse('aba');
'b'
> new Dezent(`return 'a' {'b'}? 'a' -> $1;`).parse('aa');
null

`()` group

'(' _ pattern _ ( '|' _ pattern _ )*  _ ')'

Groups a set of patterns into a single token. The first pattern to match wins, thereby terminating matching for this token.

> new Dezent(`return 'a' ('bc' | 'de') 'a' -> $0;`).parse('abca');
'abca'

`{}` capture

'{' _ capturePattern _ ( '|' _ capturePattern _ )* _ '}'

This is the mechanism by which matched rules in a pattern are made available to the enclosing rule's output. Each capture is referred to by back reference, numbering from 1 to n, left to right.

> new Dezent(`char = . -> $0; return {char} {char} -> [$1, $2];`).parse('ab');
[ 'a', 'b' ]

Captures yield the output of the matched rule, and support selection from any number of rules.

> new Dezent(`
    rule1 = 'a' -> { value: $0 };
    rule2 = 'b' -> { value: $0 };
    return {rule1 | rule2} -> $1;
`).parse('b');
{ value: 'b' }

`&` predicate - match

The 'and' predicate allows a pattern to match only if the predicate's token matches. Note that predicates do not consume input.

> new Dezent(`return &'foo' 'foobar' -> $0;`).parse('foobar');
'foobar'

`!` predicate - does not match

The 'not' predicate allows a pattern to match only if the predicate's token does not match. Note that predicates do not consume input.

> new Dezent(`return !'x' ... -> $0;`).parse('axx');
'axx'
> new Dezent(`return !'x' ... -> $0;`).parse('xyz'); // does not match
undefined

The 'not' predicate is particularly useful in conjunction with the 'any' token.

> new Dezent(`return '"' { (!'"' .)* } '"' -> $1;`).parse('"parsed string"');
'parsed string'

`=` constant

'$' identifier _ '=' _ value _ ';'

Declares constants within your grammar. Constants cannot be used within rules, only output.

> new Dezent(`
    $pi = 3.14;
    return .* -> $pi;
).parse('anything');
3.14
> new Dezent(`
    $myconst = { foo: 'bar' };
    return .* -> $myconst;
`).parse('anything');
{ foo: 'bar' }

Reference - grammar - output

JSON-like

Dezent rules yield output in JSON format. Keep in mind that Dezent grammar syntax uses single-quoted strings. Also, quotes around object member names are optional (as in javascript). The command-line tool yields standard JSON - with double-quote strings and quoted member names - when generating its output.

> new Dezent(`
    return .* -> {
        bool: true,
        number: 2.4e-5,
        string: 'a string',
        null: null,
        array: [1, 2, 'another string']
    };
`).parse('anything');
{ bool: true,
  number: 0.000024,
  string: 'a string',
  null: null,
  array: [ 1, 2, 'another string' ] }

`$0` back reference

The $0 back reference returns the entire string matched during rule processing, regardless of any captures.

> new Dezent(`return ... -> $0;`).parse('xyz');
'xyz'
> new Dezent(`return .{...}. -> $0;`).parse('abcde');
'abcde'
> new Dezent(`return ... -> 'foo\\$0baz';`).parse('bar');
'foobarbaz'

`$1...$n` back references

Back references refer to sequences captured during a rule's parse. Each reference is numbered from 1 to n, left to right.

> new Dezent(`return {.} {.} -> [$1, $2];`).parse('ab');
[ 'a', 'b' ]
> new Dezent(`return {.} {.} -> 'a\\$1c\\$2e';`).parse('bd');
'abcde'

A repeated capture will always return an array as output, regardless of how many times the token matches.

> new Dezent(`return 'a' {'b'}* 'a' -> $1;`).parse('abbba');
[ 'b', 'b', 'b' ]
> new Dezent(`return 'a' {'b'}* 'a' -> $1;`).parse('aa');
[]
> new Dezent(`return ( {[a-z]*} [0-9]* )* -> $1;`).parse('ab12cd34');
[ 'ab', 'cd' ]

Note the subtle but powerful difference between repeats inside the capture vs. outside:

> new Dezent(`return {'a'*} -> $1;`).parse('aaa');
'aaa'
> new Dezent(`return {'a'}* -> $1;`).parse('aaa');
[ 'a', 'a', 'a' ]

When token matching is optional (the '?' modifier), null may be returned

> new Dezent(`return 'a' {'b'}? 'a' -> $1;`).parse('aa');
null

By default, captures yield string output. However, if a capture matches (and only matches) a rule reference, it yields the output of that rule.

> new Dezent(`
    rule1 = 'a' -> { value: $0 };
    rule2 = 'b' -> { value: $0 };
    return {rule1 | rule2} -> $1;
`).parse('b');
{ value: 'b' }

If a capture matches multiple tokens, the matching string segment is always returned.

> new Dezent(`
    rule1 = 'a' -> { value: $0 };
    rule2 = 'b' -> { value: $0 };
    return {rule1 rule2} -> $1;
`).parse('ab');
'ab'

`$identifier` constant reference

You can define constants and refer to them in your output:

> new Dezent(`
    $pi = 3.14;
    return .* -> $pi;
`).parse('anything');
3.14
> new Dezent(`
    $myconst = { foo: 'bar' };
    return .* -> $myconst;
`).parse('anything');
{ foo: 'bar' }

`@identifier` meta reference

There are two meta-references available in output: the ordinal position of the rule's match, and its length.

> new Dezent(`
    rule = (!'x' .)* -> { match: $0, pos: @position, length: @length };
    return 'x'* {rule} 'x'* -> $1;
`).parse('xabcx');
{ match: 'abc', pos: 1, length: 3 }

Meta-references can be used within a constant definition and their values will be populated when the rule is processed.

> new Dezent(`
    $meta = { match: $0, pos: @position, length: @length };
    rule = (!'x' .)* -> { type: 'rule', ...$meta };
    return 'x'* {rule} 'x'* -> $1;
`).parse('xabcx');
{ match: 'abc', pos: 1, length: 3 }

`...` spread

The spread operator allows you to incorporate one array or object within another array or object.

> new Dezent(`
    $array = [1, 2, 3];
    return .* -> [ ...$array, 4, 5, 6];
`).parse('anything');
[ 1, 2, 3, 4, 5, 6 ]
> new Dezent(`
    $object = { foo: 1, bar: 2 };
    return .* -> [ ...$object ];
`).parse('anything');
[ [ 'foo', 1 ], [ 'bar', 2 ] ]
> new Dezent(`
    $array = [ [ 'foo', 1 ], [ 'bar', 2 ] ];
    return .* -> { ...$array };
`).parse('anything');
{ foo: 1, bar: 2 }
> new Dezent(`
    $object = { foo: 1, bar: 2 };
    return .* -> { ...$object, baz: 3 };
`).parse('anything');
{ foo: 1, bar: 2, baz: 3 }

pivot callback

The pivot callback swaps an array's rows with its columns. This is particularly useful for spreading a pair of captures inside an object:

> new Dezent(`
    return {[a-z]}* {[0-9]}* -> 
        { 
            '1': $1, 
            '2': $2, 
            pivot([$1, $2]), 
            spread: { ...pivot([$1, $2]) }
        };
`).parse('abc123');
{ '1': [ 'a', 'b', 'c' ],
  '2': [ '1', '2', '3' ],
  pivot: [ [ 'a', '1' ], [ 'b', '2' ], [ 'c', '3' ] ],
  spread: { a: '1', b: '2', c: '3' } }

`.` or `[]` property access

You can access object and array properties much like you would in javascript:

> new Dezent(`
    $myconst = { foo: 1, bar: [2, 3, 4] };
    return .* -> [$myconst.foo, $myconst['bar'][2]];
`).parse('anything');
[ 1, 4 ]

? collapse

You can cause null or empty backrefs to be collapsed in an array context.

> new Dezent(`
    return {.} {'b'}? {.} -> [ $1, $2?, $3 ];
`).parse('aa');
[ 'a', 'a' ]
> new Dezent(`
    letter = 
        { [a-d] | [f-i] } -> $1,
        'e' -> null;
    return {letter}* -> $1?;
`).parse('abcdefghi');
[ 'a', 'b', 'c', 'd', 'f', 'g', 'h', 'i' ]

callbacks

Provide your own custom callback functions to produce special output, or even SAX-style parsing.

> new Dezent(
    `return {.*} -> translate($1);`,
    { 
        callbacks: {
            translate: (txt) => txt.replace("foo", "bar") 
        }
    }
).parse('all foo all the time');
'all bar all the time'

Alternatively, callbacks can be late-bound and supplied at the time of parsing:

> new Dezent(`return {.*} -> translate($1);`).parse(
    'all foo all the time', 
    { translate: (txt) => txt.replace("foo", "bar") }
);
'all bar all the time'

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

Dezent

Table of contents

Introduction

Motivation

Status

License

Quick Start

Reference - the Dezent object

Reference - the DezentStream object

Reference - performance profiling

Reference - command-line util

Reference - grammar - pattern matching

return

= ruleset

-> rule

pattern

[] character class

. any character

'' string

* modifier - repeat 0 or more

+ modifier - repeat 1 or more

? modifier - maybe

() group

{} capture

& predicate - match

! predicate - does not match

= constant