heket
v1.4.2
Published
Heket - ABNF Parser Generator for Node.js
Downloads
31
Readme
Heket 🐸
An ABNF parser / unparser generator for Node.js
Table of Contents
- Overview
- Installation
- Basic parsing
- Rule matching
- Errors during parsing
- Practical example: IRC
- Impractical example: ABNF
- Unparsing
- Errors during unparsing
- Parsing + unparsing
- Technical details
- Why did you write this?
- Where did the name come from?
Overview
Heket is a parser generator for the ABNF specification, written in Node.js.
It allows you to create custom parsers for formal grammars written in ABNF, and then apply those parsers to input text to see if it matches your rules.
You can also use it to create unparsers, which do the opposite of a parser: given an ABNF grammar, the unparser will serialize a string using the rule values that you supply to it.
Installation
npm install heket
Basic parsing
Let's say you had a basic ABNF rule:
a = "foo" / "bar"
You can use Heket to produce a custom parser from this simple grammar via the following:
var Heket = require('heket');
var abnf_string = 'a = "foo" / "bar"';
var parser = Heket.createParser(abnf_string);
Then you can use that parser to parse input text, to see if it matches the ABNF
that you originally specified. If the input text adheres to your grammar, then
parse()
will return a match object with information about the matched result.
If the input text does not adhere to the grammar, parse()
will throw an error
indicating why (see Errors during parsing for more
info).
console.log(parser.parse('foo').getString());
console.log(parser.parse('bar').getString());
console.log(parser.parse('baz'));
The above snippet would print:
"foo"
"bar"
Error: no matching option for string: "baz"
a = "foo" / "bar"
----^
Rule matching
The above examples were pretty trivial; they just checked for matches against basic strings. But Heket also allows you to unpack rule values from more complex grammar definitions. The following snippet...
var parser = Heket.createParser(`
foo = *(baz / bar) wat
baz = nested-baz
nested-baz = "xxx"
bar = "yyy"
wat = [*"z"]
`);
var input = 'xxxyyyzz';
var match = parser.parse('xxxyyyzz');
console.log(match.get('baz'));
console.log(match.get('nested-baz'));
console.log(match.get('bar'));
console.log(match.get('wat'));
console.log(match.getRawResult());
...would print the following output:
"xxx"
"xxx"
"yyy"
"zz"
{
"string": "xxxyyyzz",
"rules": [
{
"rule_name": "baz",
"string": "xxx",
"rules": [
{
"rule_name": "nested-baz",
"string": "xxx",
"rules": [ ]
}
]
},
{
"rule_name": "bar",
"string": "yyy",
"rules": [ ]
},
{
"rule_name": "wat",
"string": "zz",
"rules": [ ]
}
]
}
Since all of the match values for each rule are stored internally as an array,
you can also use the convenience method match.getNext()
to retrieve the next
value for the specified rule. For instance:
var parser = Heket.createParser(`
foo = 1*bar
bar = "x"
`);
var match = parser.parse('xxxxx');
while (let next_result = match.getNext('bar')) {
console.log(next_result);
}
The above would print:
"x"
"x"
"x"
"x"
"x"
Errors during parsing
IMPORTANT: When you try to parse a string using a generated parser, and for whatever reason, that string does not match the specified ABNF, Heket will throw specific types of errors to indicate what type of mismatch occurred, and where.
If uncaught, the error will print to the console and provide more information about the mismatch, and the position in the current ABNF where the mismatch was determined to occur.
Each of the following error types are exposed on the Heket module itself, so you can type check thrown errors in order to determine what type of mismatch scenario you're dealing with, eg:
if (error instanceof Heket.RuleNotFoundError) { ... }
InvalidRuleValueError
var parser = Heket.createParser(`
foo = bar
bar = "baz"
`);
parser.parse('bam');
Error: Invalid value for rule <bar>: "bam"
foo = bar
------^
MissingRuleValueError
var parser = Heket.createParser(`
foo = bar baz
bar = "bar"
baz = "baz"
`);
parser.parse('bar');
Error: Must supply a value for rule <baz>
foo = bar baz
----------^
RuleNotFoundError
var parser = Heket.createParser(`
foo = bar baz
bar = "bar"
`);
parser.parse('barbaz');
Error: Rule not found: <baz>
foo = bar baz
----------^
InvalidQuotedStringError
var parser = Heket.createParser(`
foo = "bar"
`);
parser.parse('baz');
Error: Invalid value for quoted string (expected "bar" but got "baz")
foo = "bar"
------^
InputTooLongError
var parser = Heket.createParser(`
foo = "bar"
`);
parser.parse('barbaz');
Error: Too much text to match (expected "bar", got "barbaz")
foo = "bar"
------^
InputTooShortError
var parser = Heket.createParser(`
foo = "bar" "baz"
`);
parser.parse('bar');
Error: Not enough text to match
foo = "bar" "baz"
------------^
NoMatchingAlternativeError
var parser = Heket.createParser(`
foo = "bar" / "baz"
`);
parser.parse('bam');
Error: No matching option for string: "bam"
foo = "bar" / "baz"
------^
NotEnoughOccurrencesError
var parser = Heket.createParser(`
foo = 3"bar"
`);
parser.parse('barbar');
Error: Not enough occurrences of repeating clause (expected 3, got 2)
foo = 3"bar"
------^
NumericValueMismatchError
var parser = Heket.createParser(`
foo = %d65 ; A
`);
parser.parse('B');
Error: Numeric value did not match (expected 65 / A, got 66 / B)
foo = %d65
------^
NumericValueOutOfRangeError
var parser = Heket.createParser(`
foo = %d65 - 67 ; A - C
`);
parser.parse('D');
Error: Numeric value out of range (expected value within [65 - 67], got 68 / D)
foo = %d65-57
------^
parseSafe()
If you're not interested in error handling, you can use the parser.parseSafe()
convenience method. It basically just wraps parse()
in a try { }
block,
swallows the error, and returns null instead of a match instance.
Practical example: IRC
Let's imagine we were running an IRC server, and we wanted to parse an incoming message from a client. In that case, the following snippet...
var rules = Heket.createParser(`
message = [ ":" prefix " " ] command params CRLF
prefix = nick [ "!" user ] [ "@" host ]
nick = ( ALPHA / special-character) *(ALPHA / DIGIT / special-character / "-" )
host = "burninggarden.com"
; cheating for the purposes of this demonstration!
user = "pachet"
; cheating for the purposes of this demonstration!
command = 1*ALPHA / 3DIGIT
params = " " [ ( ":" trailing ) / ( middle params ) ]
middle = param-octet *( ":" / param-octet )
trailing = *( ":" / " " / param-octet )
param-octet = %x01-09 / %x0B-0C / %x0E-1F / %x21-39 / %x3B-FF
; any octet except NUL, CR, LF, ' ' and ':'
special-character = "-" / "[" / "]" / "\" / "\`" / "^" / "{" / "}"
`);
var input = `:[email protected] PRIVMSG #ops :Test message
`; // <-- Note the trailing CRLF here; this is required per the IRC spec.
var match = parser.parse(input);
console.log(match.get('prefix'));
console.log(match.get('nick'));
console.log(match.get('user'));
console.log(match.get('host'));
console.log(match.get('command'));
console.log(match.get('params'));
console.log(match.get('middle'));
console.log(match.get('trailing'));
console.log(match.getRawResult());
...would produce this output:
"[email protected]"
"pachet"
"pachet"
"burninggarden.com"
"PRIVMSG"
" #ops :Test message"
"#ops"
"Test message"
{
"string": ":[email protected] PRIVMSG #ops :Test message\n",
"rules": [
{
"rule_name": "prefix",
"string": "[email protected]",
"rules": [
{
"rule_name": "nick",
"string": "pachet",
"rules": []
},
{
"rule_name": "user",
"string": "pachet",
"rules": []
},
{
"rule_name": "host",
"string": "burninggarden.com",
"rules": []
}
]
},
{
"rule_name": "command",
"string": "PRIVMSG",
"rules": []
},
{
"rule_name": "params",
"string": " #ops :Test message",
"rules": [
{
"rule_name": "middle",
"string": "#ops",
"rules": [
]
},
{
"rule_name": "params",
"string": " :Test message",
"rules": [
{
"rule_name": "trailing",
"string": "Test message",
"rules": [
]
}
]
}
]
}
]
}
Notice: In the parsed result above, there are two instances of rule results
for the "params" rule. match.get('params')
will only return the first
occurring result. If you need an array of all the matching results, use
match.getAll()
instead:
console.log(match.getAll('params'));
The above snippet would print:
[
" #ops :Test message",
" :Test message"
]
Impractical example: ABNF
As it turns out, it's possible to embody the formal specification for ABNF grammars in ABNF syntax itself. And the authors of RFC-5234 actually set about doing it. Here's the ABNF specification for ABNF:
rulelist = 1*( rule / (*c-wsp c-nl) )
rule = rulename defined-as elements c-nl
; continues if next line starts
; with white space
rulename = ALPHA *(ALPHA / DIGIT / "-")
defined-as = *c-wsp ("=" / "=/") *c-wsp
; basic rules definition and
; incremental alternatives
elements = alternation *c-wsp
c-wsp = WSP / (c-nl WSP)
c-nl = comment / CRLF
; comment or newline
comment = ";" *(WSP / VCHAR) CRLF
alternation = concatenation
*(*c-wsp "/" *c-wsp concatenation)
concatenation = repetition *(1*c-wsp repetition)
repetition = [repeat] element
repeat = 1*DIGIT / (*DIGIT "*" *DIGIT)
element = rulename / group / option /
char-val / num-val / prose-val
group = "(" *c-wsp alternation *c-wsp ")"
option = "[" *c-wsp alternation *c-wsp "]"
char-val = DQUOTE *(%x20-21 / %x23-7E) DQUOTE
; quoted string of SP and VCHAR
; without DQUOTE
num-val = "%" (bin-val / dec-val / hex-val)
bin-val = "b" 1*BIT
[ 1*("." 1*BIT) / ("-" 1*BIT) ]
; series of concatenated bit values
; or single ONEOF range
dec-val = "d" 1*DIGIT
[ 1*("." 1*DIGIT) / ("-" 1*DIGIT) ]
hex-val = "x" 1*HEXDIG
[ 1*("." 1*HEXDIG) / ("-" 1*HEXDIG) ]
prose-val = "<" *(%x20-3D / %x3F-7E) ">"
; bracketed string of SP and VCHAR
; without angles
; prose description, to be used as
; last resort
What this means is that we can use Heket to parse the formal ABNF grammar specification upon which it is based. We can then use the resultant AST to determine whether the ABNF specification... written in ABNF... is actually valid ABNF.
var spec = Heket.getSpec();
var abnf_parser = Heket.createParser(spec);
console.log(parser.parseSafe(spec) !== null);
// Prints "true"; it's valid ABNF!
Now, honestly, we should hope that the authors of the ABNF specification would be capable of embodying their own specification reflexively, so this exercise is mostly useful as a demonstration of Heket's accurate implementation than anything else. Still, it's an interesting mind game.
Unparsing
In addition to allowing you to create generate custom parsers for ABNF-based grammars, Heket also lets you unparse a series of tokens (read: serialize a string based on the rules defined in your grammar).
The way this works is: Heket will walk over each node in the AST used to represent your grammar. Whenever it encounters a node linking to a named rule, it will prompt you to supply it with the value of that rule, like so:
// Like createParser(), createUnparser() accepts an ABNF string:
var unparser = Heket.createUnparser(`
foo = bar baz
bar = "bar"
baz = "baz"
`);
var string = unparser.unparse(function(rule_name, index) {
switch (rule_name) {
case 'bar':
return 'bar';
case 'baz':
return 'baz';
default:
return null;
}
});
console.log(string);
The above would print "barbaz"
.
Notice how the only argument passed to unparser.unparse()
is a function that
receives two arguments: the name of the rule the unparser is requesting a value
for, and a numeric index. The numeric index represents the number of times that
the unparser has requested a value for that same rule name.
The index argument makes it easy to hand values back to the parser if you're dealing with, say, an array of strings:
var unparser = Heket.createUnparser(`
foo = 1*bar [baz] wat
bar = "bar"
baz = "baz"
wat = "wat"
`);
var rule_values = {
bar: ['bar', 'bar', 'bar', 'bar', 'bar'],
baz: [ ],
wat: ['wat']
};
var string = unparser.unparse(function(rule_name, index) {
return rule_values[rule_name][index];
});
console.log(string);
The above would print "barbarbarbarbarwat"
.
NOTE: You can also pass an object to unparser.unparse()
, whose keys are
rule names, and whose values are arrays of strings to use as values for that
rule whenever the unparser encounters it. (You can also just pass single strings
directly, instead of wrapping them in arrays). To modify the above example:
var unparser = Heket.createUnparser(`
foo = 1*bar [baz] wat
bar = "bar"
baz = "baz"
wat = "wat"
`);
var rule_values = {
bar: ['bar', 'bar', 'bar', 'bar', 'bar'],
wat: 'wat'
};
var string = unparser.unparse(rule_values);
console.log(string);
Again, same result as before: "barbarbarbarbarwat"
.
Lastly, you can pass nested object structures to unparse
, and those child
objects will be passed to the corresponding unparse
call for the corresponding
rule. For example:
var unparser = Heket.createUnparser(`
foo = bar
bar = baz / wat
baz = "baz" / "biz"
wat = "wat" / "wit"
`);
unparser.unparse({
bar: {
baz: 'biz'
}
});
Errors during unparsing
Sometimes when you try to unparse something, the values you supply to the unparser won't match the rules they're meant to fill. Or, the unparser will expect a value for a rule, but you fail to give it one, so it will be unable to proceed. In each of these cases, Heket will throw a specific type of error to signal that the unparsing step failed.
These error types also have some additional convenience methods to help you track down the source of unparsing issues:
error.getRuleName()
error.getValue()
InvalidRuleValueError
Heket will throw an InvalidRuleValueError
when the value you return for
a specific rule during unparsing doesn't match the ABNF definition for that
rule. For instance:
var unparser = Heket.createUnparser(`
foo = bar
bar = "bar"
`);
var string;
try {
string = unparser.unparse(function(rule_name) {
switch (rule_name) {
case 'bar':
// Notice how this is invalid per the ABNF spec above:
return 'baz';
default:
return null;
}
});
} catch (error) {
console.log(error instanceof Heket.InvalidRuleValueError);
console.log(error.getRuleName());
console.log(error.getValue());
}
The above would print:
true
"bar"
"baz"
MissingRuleValueError
Heket will throw a MissingRuleValueError
when the unparser isn't given a
value for a required rule. For instance:
var unparser = Heket.createUnparser(`
foo = bar
bar = "bar"
`);
var string;
try {
string = unparser.unparse(function(rule_name) {
// Return null, no matter what, even though "bar" is required.
return null;
});
} catch (error) {
console.log(error instanceof Heket.MissingRuleValueError);
console.log(error.getRuleName());
console.log(error.getValue());
}
The above would print:
true
"bar"
null
Parsing + unparsing
Where things get fun is when you combine parsers with unparsers. By the way,
you can also obtain a reference to the unparser for an ABNF declaration from
the associated parser, via parser.getUnparser()
.
var parser = Heket.createParser(`
foo = 1*bar [baz] wat
bar = "bar"
baz = "baz"
wat = "wat"
`);
var match = parser.parse('barbarbarbarbarbazwat');
var unparser = parser.getUnparser();
var output = unparser.unparse(function(rule_name, index) {
return match.getNext(rule_name);
});
console.log(input === output);
// true
You can also just pass match
instances to unparser.unparse()
directly:
var output = unparser.unparse(match);
Technical details
Backtracking
Technically, Heket generates recursive descent parsers with support for backtracking. This means that a generated parser is able to walk backwards and re-parse previously matched portions of input strings if it realizes that it went down a blind alley. Take the following trivial example:
foo = 1*"foo" "foobar"
The above rule will greedily match the string "foo"
as many times as it can
before proceeding to check for the existence of a "foobar"
occurrence. Now,
without proper backtracking, the parser would never be able to match the string
"foobar"
at the end of the input, because it would already have consumed the
"foo"
portion when parsing the prior segment. In other words, this would fail:
parser.parse('foofoofoobar');
With backtracking, the parser algorithm is able to realize that it went one
step too far, and that it needs to relinquish one of the "foo"
substrings
in order to properly match the tail segment.
No parser combinator?
Heket doesn't use a proper parser combinator for a handful of technical reasons.
Performance
As you can imagine, recursively constructing parse result trees every time you want to pattern match a string gets pretty expensive, mostly due to the constant allocation / deallocation of throwaway objects to hold the results together (when you are throwing lots of short-lived objects on the heap, the GC quickly stops being your friend).
To combat this, Heket tries to create and cache regular expressions that can perform the equivalent of parsing input text via tree descent. One downside of this approach is that Heket is not very good at optimizing grammars with circular rule references (because those can't be embodied in a regex), but such grammars are fortunately uncommon in practice.
The performance implications of this regular expression caching are pretty dramatic. Take this simple test case using the IRC ABNF rules:
var irc_spec = FS.readFileSync('./abnf/irc.abnf', 'utf8');
var parser = Heket.createParser(irc_spec);
// A sample IRC message; I copied this straight from the RFC:
var input = ':[email protected] PART #playzone :I lost\r\n';
var index = 0;
var start = Date.now();
while (index < 100) {
parser.parse(input);
index++;
}
var elapsed_ms = Date.now() - start;
console.log(elapsed_ms + ' ms');
Without regex caching:
33719 ms
With regex caching:
31 ms
That's quite an improvement!
... note, however, that regex caching can be disastrous if your input ABNF
grammar produces certain dangerous regex matches. This phenomenon is known as
catastrophic backtracking,
and manifests as an infinite loop when calling parser.parse()
. If this happens
to you, you can disable regex caching via:
Heket.disableRegexCaching();
Why did you write this?
I needed a general-purpose ABNF parser in order to develop a better IRCD. The grammar for the IRC specification is embodied in ABNF. Turns out it's much less time consuming to just copy+paste the definitions from the spec as ABNF literals and let the parser determine whether input messages to the IRC server match the specified format, than to manually embody all of the logic of the specification in the code directly.
Where did the name come from?
Heket (or Heqet) was an Egyptian fertility goddess. She was represented as a woman with the head of a frog. I don't know what that has to do with parsing formal grammars.
+++++++++++++++++++
++++ +++ +++
++ +++++++
++++++ +++ +++++++
+++++++++++++ +++++++++
+++++++++++++ +++++++++
+++++++++++++ ++++++++
++++++++ +++ ++++++
+++++ ++++ ++++++
++ ++++ +++++
+++ +++ +++++
+++ ++++
+++ ++++
+++ +++ +++
++ +++++ ++++
++++ ++++ +++
+++ ++ ++++ +++
++++ +++ ++ +++ +++
++++++ ++ +++++++
++++ ++ +++++++++ ++++++
++ ++ ++++ +++++++ +++++
++ +++ ++ ++ ++++++ +++++
+++ ++++++ ++ ++++++ ++++++ +++++
+++ +++ +++++ + ++ ++++ +++++++ +++ ++++
+++++++++ +++ +++ ++++++++++++++++++++ +++++++ ++++
+++++++++++++++++++ +++ ++++++++++++++++++++++++++ +++++ +++
+++++++++++++++++++++++ +++++++++++++++++++++++++++ +++++
+++++++++++++++++++++ +++++++++++++++++++++++++++++
+++ ++++ +++++++++++