@iceylan/ast-generator

v0.0.4

Published

5 months ago

Creates tokenized nodes by performing logical and sequential reading operations on string arrays.

Downloads

0High
0Medium
0Low

diesel87

abstract syntax tree ast parser javascript markdown parse html parse css parse

Abstract Syntax Tree Generator

This JavaScript library creates tokenized nodes by performing logical and sequential reading operations on string arrays. HTML, Markdown, Math formulas or any other language (or pattern) can be parsed into AST by this library.

It provides a set of methods to remodel the structure of the complex structures step by step. It's pretty similar to the regular expression mechanism, but with this library, plenty of possibilities are unlocked.

Installation

npm install @iceylan/ast-generator

Usage

import { Parser, component, exact, not, until, match, text, consume, endline }
	from "@iceylan/ast-generator";

const components = 
[
	component( "italic", [
		exact( "*" ).as( "startDelimiter" ),
		not( "*" ),
		until( "*" ).as( "inner" ),
		exact( "*" ).as( "endDelimiter" ),
	]),

	component( "title", [
		match( "\n#" ),
		text( "\n" ),
		consume( "#" ).as( "level" ),
		exact( " " ).as( "space" ),
		until( endline ).as( "title" ),
	]),
];

const MarkdownParser = new Parser({ components });
const ast = MarkdownParser.parse( "Hello **World**" );

What Is Parser

A parser is like an engine that takes components and runs them in order. Under the hood, it is a recursive descent parser. It uses AsciiByteStream library to recursively loop through the given raw string or document (however you call it). Even if a component matches, it consumes bytes as long as it needs, and the parser will continue to read the stream from where the component left off and continue trying to find components.

What Is Component

A component represents any complex structure of the given document. It can be a Markdown component, HTML tag, for loop or a function in JavaScript, or a selector and its definition in CSS. Even further, argument definitions of a function in JavaScript can be a component.

Every component eventually ends up as an AST node if all of its sequences succeed.

Components have sequences to match and process complex structures as small pieces. This makes components sequence encapsulators.

Under the hood, it creates rollback points, AST nodes, and scope for the component and starts to run its sequences step by step to capture desired structures. Some sequences can consume bytes as long as needed, while others don't consume anything, just trying to match the current position or not match.

If any of the sequences fail, the component will roll back to the rollback points at the beginning, stop trying sequences, and exit. At this stage, the parser will continue from the next component.

On the other hand, if none of the sequences of the component fail, it doesn't roll back, and the stream cursor stays where it reached (the parser will continue from that point) and returns the final AST node to the parser. The parser will put it into the node stack and restart trying components.

If the parser never finds any component, it will put the current byte into the node stack as a text node, skip the next byte, and start all over again.

What Is Sequence

Sequences are named functions that create Sequence instances when called. They are facades of that class and perform operations on the stream, scope, and AST node of the component in specialized ways.

For example, match tries to match the given target. It doesn't consume bytes; it just tries to match. Another one, consume, will eat all the bytes as long as if it match the given target.

With sequences, we can remodel the steps of a structure.

Whether it consumes bytes or not, sequences always produce results. Sometimes only a boolean, sometimes a symbol, and sometimes strings. If we don't specifically guide, results will disappear, but only the state of the sequence will be monitored by its component, indicating whether it failed or passed.

Sequences have modifier methods to give them shape. For example, as, name, if, is, and optional are some of them, and we will see how to use them and what they do later.

What Is AST

AST stands for Abstract Syntax Tree. It is generated by the parser after the parsing process and is a data structure that tokenizes the structure of the document. By traversing the AST, we can access every section of the document, including where they start and end.

What Is Scope

Scope is a place where sequences can store their results. It is accessible sequentially by all the sequences in the component. This means that a sequence that comes before cannot access the result of the next one because it hasn't been evaluated yet. With modifier methods, we can access the scope and make a sequence dependent on the results of other sequences. Performing such operations with regular expressions can sometimes be difficult.

What Is Node Stack

Node stack is a class to manage sequentially created nodes. It extends the native JavaScript Array constructor, making it array-like. It can push, pop, and get nodes from the stack. Additionally, it can append any amount of bytes to the latest text node. If the latest node in the stack is not a text node, a new one will be created with the given bytes. It also supports rollback for text nodes, enabling cleanup for failed components. We never directly see or manipulate this object; it is completely managed by the parser, but knowing how it works can make a difference.

What Is ASCII Byte Stream

It is a library that allows us to handle raw strings as streams. It supports a cursor mechanism, allowing us to operate on the bytes of the raw strings without losing bytes. It easily adapts to any kind of looping, looking forward, searching targets, consuming conditionally, etc. Every sequence will have the common stream object to operate and consume bytes. The stream starts counting from 0, with zero pointing to the first byte of the stream. This is another part of the parser that we won't go into detail about, but again, knowing how it works is important.

Constants And Symbols

There are some constants and symbols that we can use with sequence methods. Let's briefly explain what they are and what they do.

beginning

Represents the start of the document. In our designed system, this corresponds to the zeroth character. This means that the cursor showing the first character also represents the beginning of the document. To tell a sequence to match the start of the document, we simply pass this symbol.

ending

Represents the end of the document, which is the last character. This means there are no more characters to process, and the last character has been processed. To tell a sequence to go to the end of the stream or check if it is at the end, we will simply pass this symbol.

newline

Matches the beginning of a line. In other words, the cursor points to a \n character. Giving this symbol or the \n character to a sequence serves the same purpose.

endline

Matches the end of a line. However, unlike the newline symbol, it does not match with \n. For example, the cursor is on the x character, and the next byte is a \n character. This situation is interpreted as the end of a line. If this irrelevant character is skipped, the cursor moves to the \n character. At this stage, if another sequence is targeting the newline symbol for matching, it will also succeed. In summary, it indicates that the cursor is pointing to the last character of a line.

space

A constant array holding all known whitespace characters. These characters are [ " ", "\t", "\n", "\r", "\0" ]. It serves the same function as the known \s expression in regular expressions.

failed

This symbol is returned and also placed on the scope when sequences fail to fulfill their committed tasks. For example, if a sequence declares that it will activate when the preceding sequence fails, we can use the expression until(target).is("beforeName", failed).

Sequence Methods

There are many sequence methods to handle different kinds of scenarios. Let's see them in detail.

match

match is a sequence that tries to match the given target. It doesn't consume bytes, meaning the stream cursor won't move. It just tries to match, similar to the LookAheadPositive concept in regular expressions. It accepts a one or multi-byte string or one of the situation symbols like newline, endline, beginning, or ending. It also accepts an array of these. With that, it will act like the "or" operator in regular expressions, being successful if it matches one of the items in the array.

match( 'ip' ) // true

//     v  <=  cursor was here
`Lorem ipsum dolor.`
//     ^  <=  cursor is still here after executing sequence

match( 'ip' ) // true, cursor doesn't moved, so it will match again

We can match newline or endline, beginning or ending.

match([ beginning, newline ]) // returns beginning instead of true

// v  <=  cursor was here
  `Lorem ipsum dolor.`
// ^  <=  it's still here

The match sequence always returns the failed symbol if it doesn't match. If it matches, it always returns true for primitive strings, and if the target is one of the situation symbols like newline, endline, beginning, or ending, then it returns the matched symbol. The return value of the sequence will be placed in the scope of the component, and conditional modifiers like if or as will consume these results. It is important to know what you will be dealing with in the future.

exact

This sequence will match exactly the given target and skip them. It's a form of literal character matching like /foo/ in regular expressions.

exact( "ip" ) // returns true

//     v  <= cursor was here
`Lorem ipsum dolor.`
//       ^  <= cursor moved here

exact( "sum" ) // returns true

//       v  <= cursor was here
`Lorem ipsum dolor.`
//          ^  <= cursor moved here

We can define multiple targets to match. In such a situation, matching will be successful when any target matches. It's similar to the "or" statement known from regular expressions like /foo|bar/. It matches only one time and stops trying to match other targets. It returns the matched target as a result.

exact([ "ip", "su" ]) // returns "ip"

//     v  <= cursor was here
`Lorem ipsum dolor.`
//       ^  <= cursor moved here

The exact sequence works a little bit differently than match when using situation symbols.

If we want to match the beginning of the document, it will check if the stream cursor is at 0. If it is, the result will be the beginning symbol, but the cursor won't be moved because the beginning is not a byte, its just a situation and can't be consumed. As you have already realized, match will work the same way. It's up to you to choose which one to use in such a situation.

exact( beginning ) // returns beginning

// v  <= cursor was here
  `Lorem`
// ^  <= cursor still here

match( beginning ) // returns beginning, cursor still at the same position

consume

The consume sequence consumes all the bytes as long as they match one of the given targets. It's similar to the /(token)*/ expression in regular expressions. This is a silent sequence, meaning it never fails. If it matches, it will collect the matched data until it no longer matches.

// v  <= cursor was here
 `Loooooo ipsum dolor.`

consume( "o" ) // returns "oooooo"

//      v  <= cursor moved here
`Loooooo ipsum dolor.`

consume( space ) // returns " "

//       v  <= cursor moved here
`Loooooo ipsum dolor.`

It supports multiple targets as well, making it similar to an or statement in regular expressions like /(target|target)*/.

Let's represent spacebars as \s, newlines as \n, and tabs as \t to make them visible. Their length is 1 byte.

//    v  <= cursor was here
`Lorem\s\t\s\t\s\t\n\nipsum dolor.`

consume( space ) // returns "\s\t\s\t\s\t\n\n"

`Lorem\s\t\s\t\s\t\n\nipsum dolor.`
//                    ^  <= cursor moved here

The space constant is an array of all known whitespace characters, as we mentioned before. With this, it consumes every kind of whitespace and stops at the first non-whitespace character.

Situation symbols like beginning, ending, newline, or endline are not meaningful for the consume sequence, so you shouldn't use them over this method.

until

The until sequence consumes all the bytes from the current position until it reaches something other than the given targets. It's similar to the LookAheadNegative concept in regular expressions.

// v  <= cursor is here
`Lorem ipsum dolor.`

until( " " )
// returns "rem"

It will move the cursor to the position where it stopped.

//    v  <= cursor moved here
`Lorem ipsum dolor.`

It supports multiple targets as well. For example, space is an array of known whitespace characters.

until( space ) // returns " ipsum"

With this, we can ensure that we capture all the data until any kind of space, including newline or tab.

//          v  <= cursor moved here
`Lorem ipsum dolor.`

We can also use multiple targets.

until([ " ", "\n", "." ]) // returns "dolor"

//                v  <= cursor moved here
`Lorem ipsum dolor.`

Situation symbols can also be used with this method, except for beginning.

until( ending ) // returns "."

//                 v  <= cursor at the end
`Lorem ipsum dolor.`

We can't use beginning because it's not meaningful with the until method, so you shouldn't use it.

It's also possible to combine symbols, literal strings, and grouped character arrays like space, which means we can use nested arrays as well.

// v  <= cursor is here
`Lorem ipsum dolor.`

until([ "m", space, [ "ip", [ "sum" ]], newline, endline, ending ]) // returns "re"

//   v  <= cursor is here
`Lorem ipsum dolor.`

The target array provided above will convert into a flattened array like ["m", "\n", "\r", ..., "ip", "sum", newline, endline, ending], and the until method will consume characters until it hits one of the targets.