@indiebloom/lang-parse

v0.1.2

Published

6 months ago

Parse input strings against a custom grammar, with semantic state extraction and suggestions for powering autocompletion and other interactive UXs

Downloads

0High
0Medium
0Low

aheitzmann87

Summary

Lang-parse supports the definition of custom grammars, and parsing of input strings against those grammars. The terms of the grammar can be augmented with logic to extract semantic state from matching inputs, and logic to provide input suggestions when inputs do not match. With these features you can use lang-parse to interpret and act on your users' natural-language directives, and to power autocompletion and other interactive UXs for entry of such directives.

What lang-parse isn't:

An AI/ML library
Intended for processing of general, unstructured natural language input

Defining A Grammar

A lang-parse grammar is defined by the root node of an Expression graph. There are several Expression types that can be composed to form such a graph, as illustrated in figure 1. An input string is evaluated against every possible path through the expression graph, and the evaluation results from the path or paths that match the longest prefix of the input are combined to produce the result returned from the parser.

Figure 1: A simple planning grammar

There are 4 core expression types in lang-parse:

Literal Expressions
Sequence Expressions
Union Expressions
Dynamic Expresions

Several additional utility expressions are provided to simplify solutions for common scenarios which could be achieved less conveniently using the core expressions alone.

Optional Expressions
Conditional Expressions
Repeated Expressions
Permutations Expresssions

Every Expression is evaluated against its input to produce one or more evaluation results, each of which have the following attributes:

isMatch: true if the entire Expression matches some prefix of its input.
A division of the input into:
- matchingPart: The (possibly empty) prefix that fully or partially matched the Expression.
- remainder: The (possibly empty) suffix that did not match the Expression.
suggestions: A list of suggestions that correspond to phrases that could be appended to the matchingPart to make it longer.
state: The state extracted from the input by the Expression

Core Expression Types

Literal Expressions

Literal Expressions are the workhorses of your grammar. They are defined by a regular expression that is matched against the start of their input. The match is performed greedily, meaning that if the regex matches multiple prefixes of different lengths, the longest possible match is used. The Expression "consumes" the matching part, so that the remainder of the input becomes the input to the next node in the current branch of the expression graph.

Conceptually, a Literal Expression typically represents an input segment that has some semantic value in the grammar you are defining. As the parser explores a branch of the expression graph, it aggregates state extracted from all of the Literal Expressions in that branch. To contribute to this aggregate state object, each Literal Expression may have a stateUpdater function that takes the current state object and the regex match groups as inputs, and updates the state.

For example, given the input "6pm alarm", the following Literal Expression would consume "6pm" as the matching part and would set state.hour24 to 18:

literal(/(\d{1,2})(a|p)m\b\s*/, {
  stateUpdater: (state, [hourStr, meridiemStr]) => {
    // extract and store the 24-hour-clock hour
    const hour = parseInt(hourStr) % 12;
    const hour24 = meridiemStr === "a" ? hour : hour + 12;
    state.hour24 = hour24;
  },
});

[!NOTE] We typically use the regex term \b\s* at the end of literal expressions' regexes to ensure that the expression only matches full words, and consumes any whitespace that follows the last matching word. Without this suffix, a regex like (\d{1,2})(a|p)m would also match the input "6pmfoo".

When a Literal Expression does not match its input, it can contribute suggestions of inputs that would match it.

For example, given the input "6 A.M.", the following Literal Expression would not match, and would contribute "6pm" and "10am" as suggestions for input that it would match:

literal(/(\d{1,2})(a|p)m\b\s*/, {
  suggestions: ["6pm", "10am"],
});

These suggestions can be used in an application utilizing lang-parse to teach the user about the grammar, and to provide interactive user experiences like autocompletions.

State-dependent Literal Expressions

The regex and suggestions of a Literal Expression can also be provided by a function that takes the current state as input. This allows for some more intelligent constraints on the accepted input combinations without having to resort to more complex Union Expressions. For a somewhat contrived example, imagine the parser has extracted from earlier nodes in the current expression graph branch a count i, and we are now evaluating the following Literal Expression:

function possibleTimeUnitsForCount(i) {
    const possibleUnits = ["years"];
    if (i <= 12) {
        possibleUnits.append("months");
    }
    if (i <= 52) {
        possibleUnits.append("weeks");
    }
    if (i <= 365) {
        possibleUnits.append("days");
    }
    return possibleUnits;
}

literal(
    state => {
        const possibleUnits = possibleTimeUnitsForCount(state.i);
        return new RegExp(`(${possibleUnits.join("|")})\b\s*`);
    },
    {
        suggestions: state => possibleTimeUnitsForCount(state.i)
        stateUpdater: (state, [unit]) => state.unit = unit
    }        
)

If i is 4, then the expression will match (and suggest) "years", "months", "weeks", or "days", but if i is 80, then it will only match "years", "weeks" and "days".

Sequence Expressions

Sequence Expressions define an ordered sequence of child Expressions. A Sequence Expression with n children [E_1, E_2, ..., E_n] matches its input if a prefix of the input can be broken into n (possibly zero-length) parts [P_1, P_2, ..., P_n] such that E_i matches P_i for all 1 <= i <= n.

For example, for the expression:

sequence(
  literal(/tod(ay)?\b\s*/, { suggestions: ["today"] }),
  literal(/(at\b\s*)?/),
  literal(/(\d{1,2})(a|p)m\b\s*/, { suggestions: ["6pm"] }),
);

We would see evaluation results as follows:

| Input | isMatch | matchingPart | remainder | suggestions | | -------------- | ------- | -------------- | ---------- | ----------- | | "today at 6pm" | true | "today at 6pm" | "" | [] | | "tod 6pm" | true | "tod 6pm" | "" | [] | | "tod 6pm foo" | true | "tod 6pm" | "foo" | [] | | "tomorrow" | false | "" | "tomorrow" | ["today"] | | "today" | false | "today" | "" | ["6pm"] | | "today foo" | false | "today" | "foo" | ["6pm"] |

Union Expressions

Union Expressions define one or more child expression alternates. They are analogous to separating terms with | operator in a regex. They represent branch points in the expression graph, where several alternatives with different semantics exist. A Union Expression produces a result from each of its child expressions. The parser interprets the Union Expression as matching its input the results from any of its child expressions are matching.

Extending the Sequence Expression example above, for the expression:

sequence(
  union(
    literal(/tod(ay)?\b\s*/, { suggestions: ["today"] }),
    literal(/tom(orrow)?\b\s*/, { suggestions: ["tomorrow"] }),
  ),
  literal(/(at\b\s*)?/),
  literal(/(\d{1,2})(a|p)m\b\s*/, { suggestions: ["6pm"] }),
);

We would see two evaluation results for a given input, corresponding to the two union expression alternates, e.g.

| Input | Result # | isMatch | matchingPart | remainder | suggestions | | -------------- | -------- | ------- | -------------- | -------------- | ------------ | | "today at 6pm" | 1 | true | "today at 6pm" | "" | [] | | | 2 | false | "" | "today at 6pm" | ["tomorrow"] | | "tom 6pm foo" | 1 | false | "" | "tom 6pm foo" | ["today] | | | 2 | true | "tom 6pm" | "foo" | [] | | "yesterday" | 1 | false | "" | "yesterday" | ["today"] | | | 2 | false | "" | "yesterday" | ["tomorrow"] | | "tomorrow foo" | 1 | false | "" | "tomorrow foo" | ["today"] | | | 2 | true | "tomorrow" | "foo" | ["6pm"] |

Dynamic Expressions

Dynamic Expressions generate an expression at runtime when encountered by the parser. The choice of expression to generate is based on the state that has been extracted by the parser for previous nodes in the current path through the expression graph. This similar to the dynamicism of State-dependent Literal Expressions, but even more powerful, in that entirely different types of expressions may be generated based on the state, enabling complex grammars that would be difficult or impossible to create with only the three other core Expression types. Under the hood, the Conditional, Repeated and Permutations Utility Expressions are implemented using Dynamic Expressions.

Utility Expression Types

Optional Expressions

Optional Expressions are equivalent to a Union Expression with two alternates:

The Optional Expression's child expression
A no-op Literal Expression that vacuously matches the empty string.

This is somewhat analogous to applying a ? to a term in a regex, but there is an important difference in the parsers behavior between these two similar constructions. When an optional construct is used within a Literal Expression regex:

literal(/(foo)?/, { suggestions: ["foo"] });

it increases the flexibility of the regex match. Since suggestions are only generated by non-matching Literal Expressions, and in this example the Literal Expression will vacuously match the empty string at the start of any input that it is evaluated against that doesn't match the inner term foo, it will never generate suggestions. Moreover, there is only one result branch produced by this expression, representing the longest match of the regex against the Literal Expression's input. Therefore if this Literal Expression were a member of a Sequence Expression, the next member in the Sequence Expression would only be evaluated against the remainder of the input after that longest matching prefix was consumed. For instance:

sequence(literal(/(foo)?/), literal(/foobar/));

will not fully match the input "foobar" because the first Literal Expression consumes "foo" so the second Literal Expression is evaluated against the remainder, "bar".

In contrast, when an Optional Expression is used:

sequence(
  optional(literal(/foo/, { suggestions: ["foo"] })),
  literal(/foobar/, { suggestions: ["foobar"] }),
);

there will be two branches in the expression graph for the parser to explore: one branch as though the entire Optional Expression was omitted from the expression graph entirely, and one branch as though the child of the Optional expression was specified without the Optional wrapper. Hence for this Sequence Expression example, suggestions will be contributed when a non-matching input like "bar" is provided, and the expression will match the input "foobar".

| Input | Result # | isMatch | matchingPart | remainder | suggestions | | -------- | -------- | ------- | ------------ | --------- | ----------- | | "bar" | 1 | false | "" | "bar" | ["foobar"] | | | 2 | false | "" | "bar" | ["foo"] | | "foobar" | 1 | true | "foobar" | "" | [] | | | 2 | false | "foo" | "bar" | ["foobar"] |

Conditional Expressions

Conditional Expressions choose between two possible child expressions to evaluate, based on the state accumulated by the parser in the current branch of the expression graph.

Repeated Expressions

Repeated Expressions are analogous to the {min,max} construct in regexes. They match their input when their child expression matches sequential segments of their input between a min and max number of times.

Permutations Expressions

Permutations Expressions match any permutation of their member expressions. For a PermutationsExression with n required members, you can imagine a tree of n! possible paths representing all of the possible arrangements of those members. For example, given required members A, B, and C, we could construct this tree

ROOT | --> A | --> B --> C
     |       | --> C --> B
     |
     | --> B | --> A --> C
     |       | --> C --> A
     |
     | --> C | --> A --> B
             | --> B --> A

Every branch of this tree will be evaluated against the input, with evaluation of a branch ending as soon as a non-matching node is reached, or when a leaf is reached.

In practice, it is very likely that all but one branch will terminate on non-matching nodes very early in the evaluation, and the children of each node in this tree are generated lazily using Dynamic Expressions, so even if n is large, the number of nodes generated and evaluated will be much smaller than n!.

A Permutations Expression may also have optional members, in which case the expression will still match its input if some or all of the optional members do not match any segment of the input. Referring to the virtual tree depicted above, optional members are treated exactly the same as required members until a node in the tree is reached where only optional members have not yet been matched against an input segment. At that point, one additional branch is added that vacuously matches the empty string. For example, imagine that C was an optional instead of a required member. Then our tree would look like this:

ROOT | --> A | --> B --> C
     |       |     | --> ∅
     |       |
     |       | --> C --> B
     |
     | --> B | --> A --> C
     |       |     | --> ∅
     |       |
     |       | --> C --> A
     |
     | --> C | --> A --> B
             | --> B --> A

where ∅ is an expression that vacuously matches the empty string.