@mazard/scanner
v1.2.0
Mazard Scanner
This scanner converts a Markdown document into an array of tokens. A parser can then interpret these tokens as an expression tree. Much inspiration has been taken from Robert Nystrom's [Crafting Interpreters](https://craftinginterpreters.com/) as well as Alfred Aho and Jeffrey Ullman's The Theory of Parsing, Translation, and Compiling.
Token types
| Type | Description | Example |
| ------------------ | ------------------------------------------------------------------------------------- | -------------------------------------- |
| SYMBOL | An alphanumeric string that closely resembles a variable name in other languages | Foo, foo, foo-bar, foo_bar |
| RUNE | Similar to a symbol, but these strings contain non-alphanumeric content | Foo#, -foo, _foo, 1foo, fo>o |
| NUMBER | An integer, decimal, or a number in exponential notation | 1, 1.0, +1, -1, 1.0e1 |
| SPACE | One or more space characters. The literal value is the number of spaces encountered. | |
| TAB | A "\t" or " " at the start of a line. | |
| BR | One or more line break characters. | |
| COLON | A, well, colon | : |
| COLON_COLON | Two colons in sequence, likely indicating an Obsidian metadata value | :: |
| FRONTMATTER_START | The triple-dash at the start of a frontmatter section | --- |
| FRONTMATTER_END | The triple-dash at the end of a frontmatter section | --- |
| FRONTMATTER_KEY | A frontmatter key | The `foo` in `foo: bar` |
| FRONTMATTER_VALUE | A frontmatter value | The `bar` in `foo: bar` |
| FRONTMATTER_BULLET | A dash at the beginning of a line | The `-` in `- bar` |
| CODE_START | The triple-backtick at the start of a code section | ``` |
| CODE_LANGUAGE | The language specified after the triple backticks of a CODE_START | The `typescript` in `` ```typescript `` |
| CODE_KEY | Similar to frontmatter, code blocks can have keys and values after the CODE_START | The `foo` in `foo: bar` |
| CODE_VALUE | A metadata code value | The `bar` in `foo: bar` |
| CODE_SOURCE | The source code inside of a code block | |
| CODE_END | The triple-backtick at the end of a code section | ``` |
| HHASH | One to six hash characters at the beginning of a line | The `###` in `### Foo` |
| HGTHAN | A `>` at the beginning of a line | The `>` in `> Foo` |
| L_BRACKET | A single left bracket | `[` |
| LL_BRACKET | Two left brackets | `[[` |
| R_BRACKET | A single right bracket | `]` |
| RR_BRACKET | Two right brackets | `]]` |
| LL_BRACE | Two left braces | `{{` |
| RR_BRACE | Two right braces | `}}` |
| ASTERISK | A single asterisk | `*` |
| ASTERISK_ASTERISK | Two asterisks | `**` |
| EQUALS_EQUALS | Two equals signs | `==` |
| ORDINAL | A number with an ordinal suffix | 1st, 2nd, 3rd, 4th |
| PIPE | A pipe (vertical bar) | \| |
| TAG | A symbol prefixed with a hashtag | #tag, #tag-foo, #tag1 |
| TILDE_TILDE | Two tildes | `~~` |
| ESCAPE | A backslash followed by any character | \| |
| L_PAREN | A left parenthesis | `(` |
| R_PAREN | A right parenthesis | `)` |
| BACKTICK | A single backtick | `` ` `` |
| DOLLAR | A dollar sign | `$` |
| DOLLAR_DOLLAR | Two dollar signs | `$$` |
| PERCENT_PERCENT | Two percent signs | `%%` |
| COMMENT | The content of a comment | The `A comment` in `%% A comment` |
| HTML_TAG | An HTML tag | |
| HR | A horizontal rule | `---`, `***`, `___` |
| BULLET | A dash or asterisk at the beginning of a line | The `-` in `- foo` |
| N_BULLET | A numbered bullet at the beginning of a line | The `1.` in `1. foo` |
| CHECKBOX | A checkbox at the beginning of a line | The `- [ ]` in `- [ ] foo` |
| URL | A URL | `https://www.google.com` |
| EOF | The very end of the string or file | |
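The difference between SYMBOL, RUNE, NUMBER, and ORDINAL is easiest to see in code. The following is a minimal sketch inferred only from the examples in the table above. The `classifyWord` function and its regular expressions are hypothetical and are not taken from the scanner's source, so the real rules may differ at the edges.

```typescript
// Hypothetical classifier inferred from the table's examples; not the library's code.
type WordType = "SYMBOL" | "RUNE" | "NUMBER" | "ORDINAL";

// Integers, decimals, and exponential notation, with an optional sign.
const NUMBER_RE = /^[+-]?\d+(\.\d+)?(e[+-]?\d+)?$/i;
// Digits followed by an ordinal suffix (deliberately loose: "1nd" also matches).
const ORDINAL_RE = /^\d+(st|nd|rd|th)$/;
// Starts with a letter, then letters, digits, hyphens, or underscores.
const SYMBOL_RE = /^[A-Za-z][A-Za-z0-9_-]*$/;

function classifyWord(lexeme: string): WordType {
  if (NUMBER_RE.test(lexeme)) return "NUMBER";
  if (ORDINAL_RE.test(lexeme)) return "ORDINAL";
  if (SYMBOL_RE.test(lexeme)) return "SYMBOL";
  // Anything else that is word-like falls through to RUNE.
  return "RUNE";
}

for (const word of ["foo-bar", "-foo", "1.0e1", "3rd", "1foo"]) {
  console.log(word, classifyWord(word));
}
```

Checking the most specific patterns first keeps `1st` from being treated as a RUNE, and leaving RUNE as the fallback mirrors its description as "similar to a symbol" with extra characters.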
Some examples
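```typescript
// Assumption: both helpers are named exports of the package.
import { scanTokens, printTokens } from "@mazard/scanner";

// Each array element is one line of the Markdown document.
const tokens = scanTokens([
  "# Mazard Scanner",
  "",
  "This scanner converts a Markdown document into an array of tokens.",
]);
printTokens(tokens);
```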
| No | Type | Lexeme | Literal | Line | Column |
| --- | ------ | ---------- | ---------- | ---- | ------ |
| 0 | HHASH | "#" | 1 | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 1 |
| 2 | SYMBOL | "Mazard" | "Mazard" | 0 | 2 |
| 3 | SPACE | " " | 1 | 0 | 8 |
| 4 | SYMBOL | "Scanner" | "Scanner" | 0 | 9 |
| 5 | BR | "\n\n" | 2 | 0 | 16 |
| 6 | SYMBOL | "This" | "This" | 2 | 0 |
| 7 | SPACE | " " | 1 | 2 | 4 |
| 8 | SYMBOL | "scanner" | "scanner" | 2 | 5 |
| 9 | SPACE | " " | 1 | 2 | 12 |
| 10 | SYMBOL | "converts" | "converts" | 2 | 13 |
| 11 | SPACE | " " | 1 | 2 | 21 |
| 12 | SYMBOL | "a" | "a" | 2 | 22 |
| 13 | SPACE | " " | 1 | 2 | 23 |
| 14 | SYMBOL | "Markdown" | "Markdown" | 2 | 24 |
| 15 | SPACE | " " | 1 | 2 | 32 |
| 16 | SYMBOL | "document" | "document" | 2 | 33 |
| 17 | SPACE | " " | 1 | 2 | 41 |
| 18 | SYMBOL | "into" | "into" | 2 | 42 |
| 19 | SPACE | " " | 1 | 2 | 46 |
| 20 | SYMBOL | "an" | "an" | 2 | 47 |
| 21 | SPACE | " " | 1 | 2 | 49 |
| 22 | SYMBOL | "array" | "array" | 2 | 50 |
| 23 | SPACE | " " | 1 | 2 | 55 |
| 24 | SYMBOL | "of" | "of" | 2 | 56 |
| 25 | SPACE | " " | 1 | 2 | 58 |
| 26 | RUNE | "tokens." | "tokens." | 2 | 59 |
| 27 | EOF | "" | "" | 2 | 66 |
```typescript
const tokens = scanTokens("here's a *line* with some ~~formatting~~.");
printTokens(tokens);
```
| No | Type | Lexeme | Literal | Line | Column |
| --- | ----------- | ------------ | ------------ | ---- | ------ |
| 0 | RUNE | "here's" | "here's" | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 6 |
| 2 | SYMBOL | "a" | "a" | 0 | 7 |
| 3 | SPACE | " " | 1 | 0 | 8 |
| 4 | ASTERISK | "*" | "*" | 0 | 9 |
| 5 | SYMBOL | "line" | "line" | 0 | 10 |
| 6 | ASTERISK | "*" | "*" | 0 | 14 |
| 7 | SPACE | " " | 1 | 0 | 15 |
| 8 | SYMBOL | "with" | "with" | 0 | 16 |
| 9 | SPACE | " " | 1 | 0 | 20 |
| 10 | SYMBOL | "some" | "some" | 0 | 21 |
| 11 | SPACE | " " | 1 | 0 | 25 |
| 12 | TILDE_TILDE | "~~" | "~~" | 0 | 26 |
| 13 | SYMBOL | "formatting" | "formatting" | 0 | 28 |
| 14 | TILDE_TILDE | "~~" | "~~" | 0 | 38 |
| 15 | RUNE | "." | "." | 0 | 40 |
| 16 | EOF | "" | "" | 0 | 41 |
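Because the scanner emits ASTERISK and TILDE_TILDE as standalone tokens, pairing them up is left to the parser. The sketch below shows one way a consumer might do that. The `Token` interface is an assumption that mirrors the columns printed above, `delimitedSpans` is a hypothetical helper, and the sample array is a hand-written excerpt of the output rather than a call into the library.

```typescript
// Assumed token shape, mirroring the columns that printTokens displays.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
  line: number;
  column: number;
}

// Collect the text between each opening delimiter and its next matching close.
function delimitedSpans(tokens: Token[], delimiter: string): string[] {
  const spans: string[] = [];
  let open = -1; // index of the unmatched opening delimiter, or -1 if none
  tokens.forEach((token, i) => {
    if (token.type !== delimiter) return;
    if (open < 0) {
      open = i;
    } else {
      spans.push(tokens.slice(open + 1, i).map((t) => t.lexeme).join(""));
      open = -1;
    }
  });
  return spans;
}

// Hand-written excerpt (tokens 4 to 14) of the output shown above.
const sample: Token[] = [
  { type: "ASTERISK", lexeme: "*", literal: "*", line: 0, column: 9 },
  { type: "SYMBOL", lexeme: "line", literal: "line", line: 0, column: 10 },
  { type: "ASTERISK", lexeme: "*", literal: "*", line: 0, column: 14 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 15 },
  { type: "SYMBOL", lexeme: "with", literal: "with", line: 0, column: 16 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 20 },
  { type: "SYMBOL", lexeme: "some", literal: "some", line: 0, column: 21 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 25 },
  { type: "TILDE_TILDE", lexeme: "~~", literal: "~~", line: 0, column: 26 },
  { type: "SYMBOL", lexeme: "formatting", literal: "formatting", line: 0, column: 28 },
  { type: "TILDE_TILDE", lexeme: "~~", literal: "~~", line: 0, column: 38 },
];

console.log(delimitedSpans(sample, "ASTERISK")); // [ 'line' ]
console.log(delimitedSpans(sample, "TILDE_TILDE")); // [ 'formatting' ]
```

An unmatched opening delimiter is simply ignored here; a real parser would more likely fall back to treating it as literal text.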
```typescript
const tokens = scanTokens([
  "- [x] Finish the scanner.",
  "- [ ] Write some reasonable documentation",
]);
printTokens(tokens);
```
| No | Type | Lexeme | Literal | Line | Column |
| --- | -------- | --------------- | --------------- | ---- | ------ |
| 0 | CHECKBOX | "- [x]" | true | 0 | 0 |
| 1 | SPACE | " " | 1 | 0 | 5 |
| 2 | SYMBOL | "Finish" | "Finish" | 0 | 6 |
| 3 | SPACE | " " | 1 | 0 | 12 |
| 4 | SYMBOL | "the" | "the" | 0 | 13 |
| 5 | SPACE | " " | 1 | 0 | 16 |
| 6 | RUNE | "scanner." | "scanner." | 0 | 17 |
| 7 | BR | "\n" | 1 | 0 | 25 |
| 8 | CHECKBOX | "- [ ]" | false | 1 | 0 |
| 9 | SPACE | " " | 1 | 1 | 5 |
| 10 | SYMBOL | "Write" | "Write" | 1 | 6 |
| 11 | SPACE | " " | 1 | 1 | 11 |
| 12 | SYMBOL | "some" | "some" | 1 | 12 |
| 13 | SPACE | " " | 1 | 1 | 16 |
| 14 | SYMBOL | "reasonable" | "reasonable" | 1 | 17 |
| 15 | SPACE | " " | 1 | 1 | 27 |
| 16 | SYMBOL | "documentation" | "documentation" | 1 | 28 |
| 17 | EOF | "" | "" | 1 | 41 |
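The CHECKBOX literal carries the checked state as a boolean, so a consumer can build a task list without re-reading the lexeme. Here is a rough sketch of that idea. The `Token` and `Task` interfaces and the `collectTasks` helper are assumptions made for illustration, and the sample tokens are hand-written in the shape of the output above for a shortened input.

```typescript
// Assumed token shape, mirroring the columns that printTokens displays.
interface Token {
  type: string;
  lexeme: string;
  literal: string | number | boolean;
  line: number;
  column: number;
}

interface Task {
  done: boolean;
  text: string;
}

// Gather the text that follows each CHECKBOX up to the next BR or EOF.
function collectTasks(tokens: Token[]): Task[] {
  const tasks: Task[] = [];
  let current: Task | null = null;
  for (const token of tokens) {
    if (token.type === "CHECKBOX") {
      current = { done: token.literal === true, text: "" };
      tasks.push(current);
    } else if (token.type === "BR" || token.type === "EOF") {
      current = null; // the task ends with its line
    } else if (current) {
      current.text += token.lexeme;
    }
  }
  // Drop the space that separates the checkbox from its label.
  return tasks.map((task) => ({ done: task.done, text: task.text.trim() }));
}

// Hand-written tokens for the shortened input "- [x] Finish\n- [ ] Write".
const sampleTokens: Token[] = [
  { type: "CHECKBOX", lexeme: "- [x]", literal: true, line: 0, column: 0 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 0, column: 5 },
  { type: "SYMBOL", lexeme: "Finish", literal: "Finish", line: 0, column: 6 },
  { type: "BR", lexeme: "\n", literal: 1, line: 0, column: 12 },
  { type: "CHECKBOX", lexeme: "- [ ]", literal: false, line: 1, column: 0 },
  { type: "SPACE", lexeme: " ", literal: 1, line: 1, column: 5 },
  { type: "SYMBOL", lexeme: "Write", literal: "Write", line: 1, column: 6 },
  { type: "EOF", lexeme: "", literal: "", line: 1, column: 11 },
];

console.log(collectTasks(sampleTokens));
```

Concatenating lexemes rather than literals keeps the original spacing and punctuation of each task label intact.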