tinylex
v0.7.4
A simple iterative lexer written in TypeScript
Under development
Install:
npm install tinylex
Import:
const TinyLex = require('tinylex')
Code:
const code = `
#
# Darklord source
#
summon "messenger"
forge harken(msg) {
messenger(msg || 'All shall flee before me!')
}
craft lieutenants = 12
craft message = "I have " + leutenants + " servants"
harken.wield(message)
`
Rules:
const KEYWORDS = [
'summon', 'forge', 'craft', 'wield',
'if', 'while', 'true', 'false', 'null'
]
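// The trailing \b keeps a keyword from matching the start of a longer
// identifier (e.g. 'iffy' would otherwise lex as the keyword 'if').
const KEYWORD = new RegExp(`^(?:${KEYWORDS.join('|')})\\b`)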
const COMMENT = /^\s*(#.*)\n/
const IDENTIFIER = /^[a-z]\w*/
const NUMBER = /^(?:\+|-)?(?:\.)?\d+\.?(?:\d+)?/
const STRING_SINGLE = /^'([^']*)'/
const STRING_DOUBLE = /^"([^"]*)"/
const LOGICAL = /^(?:\|\||&&|==|!=|<=|>=)/
const WHITESPACE = /^\s/
const rules = [
[COMMENT, 'COMMENT'],
[KEYWORD, 0],
[IDENTIFIER, 'IDENTIFIER'],
[NUMBER, 'NUMBER'],
[LOGICAL, 0],
[STRING_DOUBLE, 'STRING'],
[STRING_SINGLE, 'STRING'],
[WHITESPACE]
]
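Every pattern above is anchored with ^, which suggests that rules are tried in order against the input remaining at the cursor. The following is a minimal sketch of that idea using only plain RegExp calls. It does not use TinyLex, and the helper name firstMatch is hypothetical.

```javascript
// Plain-JavaScript illustration; this does not call TinyLex.
const IDENTIFIER = /^[a-z]\w*/
const NUMBER = /^(?:\+|-)?(?:\.)?\d+\.?(?:\d+)?/
const STRING_DOUBLE = /^"([^"]*)"/

// Hypothetical helper: return the first pattern that matches at the
// start of chunk, or null when none of them do.
function firstMatch(patterns, chunk) {
  for (const [name, regex] of patterns) {
    const match = regex.exec(chunk)
    if (match) {
      // match[0] is the lexeme; match.slice(1) holds any capture groups.
      return { name, lexeme: match[0], groups: match.slice(1) }
    }
  }
  return null
}

const patterns = [
  ['IDENTIFIER', IDENTIFIER],
  ['NUMBER', NUMBER],
  ['STRING', STRING_DOUBLE]
]

const numberHit = firstMatch(patterns, '12\ncraft message')
const stringHit = firstMatch(patterns, '"messenger"\nforge')
const noHit = firstMatch(patterns, '(msg)')

console.log(numberHit, stringHit, noHit)
```

Because the patterns are tried in order, placing a general pattern ahead of a specific one changes which one wins, which is presumably why KEYWORD is listed before IDENTIFIER in the rules.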
Instantiate:
const lexer = new TinyLex(code, rules)
Consume:
for (let token of lexer) {
console.log(token)
}
or
while(!lexer.done()) {
console.log(lexer.lex())
}
or
const tokens = [...lexer]
console.log(tokens)
or
const tokens = lexer.tokenize()
console.log(tokens)
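The four consumption styles above can all be backed by one cursor. The sketch below shows how a single object might support all of them. MiniLexer is a hypothetical stand-in that splits on whitespace, not TinyLex's implementation, and it omits the final EOF token.

```javascript
// Hypothetical stand-in for illustration; not TinyLex's implementation.
class MiniLexer {
  constructor(source) {
    this.words = source.split(/\s+/).filter(Boolean)
    this.position = 0
  }

  // True once every word has been consumed.
  done() {
    return this.position >= this.words.length
  }

  // Return the next token and advance the cursor.
  lex() {
    const word = this.words[this.position++]
    return ['WORD', word]
  }

  // Eagerly consume whatever input remains.
  tokenize() {
    const tokens = []
    while (!this.done()) tokens.push(this.lex())
    return tokens
  }

  // Implementing the iterator protocol enables for...of and spread.
  *[Symbol.iterator]() {
    while (!this.done()) yield this.lex()
  }
}

const viaSpread = [...new MiniLexer('craft lieutenants')]
const viaTokenize = new MiniLexer('craft lieutenants').tokenize()

console.log(viaSpread, viaTokenize)
```

In this sketch every style drains the same cursor, so the example creates a fresh instance for each one. Whether a TinyLex instance can be consumed more than once is not stated here.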
Result:
// ------------------------------------------------------------------
// generated tokens
//
[ 'COMMENT', '#' ]
[ 'COMMENT', '# Darklord source' ]
[ 'COMMENT', '#' ]
[ 'SUMMON', 'summon' ]
[ 'STRING', 'messenger' ]
[ 'FORGE', 'forge' ]
[ 'IDENTIFIER', 'harken' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'msg' ]
[ ')', ')' ]
[ '{', '{' ]
[ 'IDENTIFIER', 'messenger' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'msg' ]
[ '||', '||' ]
[ 'STRING', 'All shall flee before me!' ]
[ ')', ')' ]
[ '}', '}' ]
[ 'CRAFT', 'craft' ]
[ 'IDENTIFIER', 'lieutenants' ]
[ '=', '=' ]
[ 'NUMBER', '12' ]
[ 'CRAFT', 'craft' ]
[ 'IDENTIFIER', 'message' ]
[ '=', '=' ]
[ 'STRING', 'I have ' ]
[ '+', '+' ]
[ 'IDENTIFIER', 'leutenants' ]
[ '+', '+' ]
[ 'STRING', ' servants' ]
[ 'IDENTIFIER', 'harken' ]
[ '.', '.' ]
[ 'WIELD', 'wield' ]
[ '(', '(' ]
[ 'IDENTIFIER', 'message' ]
[ ')', ')' ]
[ 'EOF', 'EOF' ]
Rules
const rules = [
[COMMENT, 'COMMENT'], // ['COMMENT', '# Darklord source']
[KEYWORD, 0], // ['SUMMON', 'summon']
[IDENTIFIER, 'IDENTIFIER'], // ['IDENTIFIER', 'harken']
[NUMBER, 'NUMBER'], // ['NUMBER', '12']
[LOGICAL, 0], // ['||', '||']
[STRING_DOUBLE, 'STRING'], // ['STRING', 'messenger']
[STRING_SINGLE, 'STRING'], // ['STRING', 'All shall flee...']
[WHITESPACE]
]
Rules are specified in the form [RegExp, string|number|function|null|undefined].
RegExp: the match criteria, given as a regular expression object.
string: the name of the token, e.g., 'COMMENT' as in [COMMENT, 'COMMENT']. The token content is taken from match group 0 (the lexeme) unless the RegExp contains a capture group, in which case match group 1 is used. COMMENT, /^\s*(#.*)\n/, captures the comment text without the surrounding whitespace and produces the token ['COMMENT', '# Darklord source']; the string rules, e.g., /^"([^"]*)"/, capture the portion of the match between the quotes. Only match group 1 is consulted.
number: the number of the match group to use for both the token name and content, as in [KEYWORD, 0], which produces the token ['SUMMON', 'summon']. If your regular expression contains a capture group, you can use it to generate the name and value for the token: [SOME_REGEXP, 1].
null|undefined: no token is created from the match, which discards it, as in [WHITESPACE], which swallows whitespace with no other effect. The cursor is advanced by the length of the lexeme (match group 0).
function: a function used to create the token, discard the match, and/or advance the cursor by a positive integer amount (TinyLex always advances the cursor to avoid infinite loops). A function can also push multiple tokens. If it returns null or undefined, the cursor is advanced by the length of the lexeme (match group 0). If it returns a number <= 1, the cursor is advanced by one. The function's this context is set to the lexer instance.
// We could use a function to swallow whitespace.
[WHITESPACE, function (match, tokens, chunk) {
// Advance the cursor by one. If we don't return a number, the
// cursor is advanced by the size of the lexeme (match group 0),
// so in this case returning 1 is no different from returning
// null or undefined.
return 1
}]
// We could use a function to customize the token in some way.
[LOGICAL, function (match, tokens, chunk) {
const lexeme = match[0]
switch (lexeme) {
case '&&': tokens.push(['OPERATOR', '&&']); break
case '||': tokens.push(['OPERATOR', '||']); break
default: tokens.push([lexeme, lexeme])
}
// We don't actually need to do this because by default the
// cursor is advanced by the lexeme length (match group 0).
return lexeme.length
}]
Note: a rule function must push one or more tokens onto the tokens array unless you intend to discard the match. If it pushes nothing, no token is generated.
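The rule forms described above can be summarized as a small dispatcher. This is a hedged sketch based only on the descriptions in this section. applyRule is a hypothetical name, TinyLex's own source may differ, and upper-casing the token name for number rules is inferred from the ['SUMMON', 'summon'] example. The sketch also omits binding this to the lexer instance.

```javascript
// Hypothetical re-implementation for illustration; not TinyLex's source.
// Returns how far to advance the cursor, or null if the rule does not match.
function applyRule([regex, action], chunk, tokens) {
  const match = regex.exec(chunk)
  if (!match) return null
  const lexeme = match[0]

  if (typeof action === 'string') {
    // Named token: prefer capture group 1 when the RegExp defines one.
    tokens.push([action, match[1] !== undefined ? match[1] : lexeme])
  } else if (typeof action === 'number') {
    // The chosen group supplies both the token name and its content.
    // Upper-casing is an assumption drawn from ['SUMMON', 'summon'].
    tokens.push([match[action].toUpperCase(), match[action]])
  } else if (typeof action === 'function') {
    const advance = action(match, tokens, chunk)
    // A numeric return value is clamped so the cursor always moves.
    if (typeof advance === 'number') return Math.max(1, advance)
  }
  // null/undefined action, or no number returned: skip the whole lexeme.
  return lexeme.length
}

const tokens = []
const stringAdvance = applyRule([/^"([^"]*)"/, 'STRING'], '"messenger"\n', tokens)
const keywordAdvance = applyRule([/^(?:summon|forge)/, 0], 'summon "x"', tokens)
const discardAdvance = applyRule([/^\s/], ' forge', tokens)
const clampedAdvance = applyRule([/^\s/, () => 0], '  x', tokens)

console.log(tokens, stringAdvance, keywordAdvance, discardAdvance, clampedAdvance)
```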
The onToken Function
This function, if given, is called for every token. It can modify the contents of the token, return an entirely new token, or discard some or all tokens (except for the final EOF token, which can be transformed but not removed). Register it by calling lexer.onToken and passing a function. The function is called with its this context set to the lexer instance.
const lexer = new TinyLex(code, rules)
// The callback function will have its 'this' context set
// to the lexer instance.
lexer.onToken(function (token, match) {
// We can return a new token, the original token, a modified
// version of the given token, or nothing at all - in which case
// the token will be discarded except for the EOF token which can
// only be modified or set to null.
return token
})
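As a concrete case, a callback could filter out comments. The sketch below defines such a callback and exercises it against a few tokens from the Result listing without involving the lexer. The name stripComments is hypothetical, and the claim that returning nothing discards a token rests on the description above.

```javascript
// Hypothetical callback: discard COMMENT tokens, keep everything else.
function stripComments(token, match) {
  const [name] = token
  if (name === 'COMMENT') return undefined // returning nothing discards the token
  return token
}

// Simulate the lexer invoking the callback once per token.
const sample = [
  ['COMMENT', '# Darklord source'],
  ['SUMMON', 'summon'],
  ['STRING', 'messenger']
]
const kept = sample
  .map(token => stripComments(token, null))
  .filter(token => token !== undefined)

console.log(kept)
```

With a real lexer instance the callback would be registered as lexer.onToken(stripComments).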
Options
The option onError specifies what to do if a match is not found at the cursor.
tokenize: (default) Tokenize the next single character and advance the cursor by one.
ignore: Advance the cursor by one and do nothing else.
throw: Throw an error indicating that a match was not found.
// onError can be 'tokenize', 'throw', or 'ignore'.
const lexer = new TinyLex(code, rules, {onError: 'tokenize'})
Note: onError is the only configuration option.
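The three modes can be pictured as a fallback that runs when no rule matches. The following sketch mirrors the documented behavior under stated assumptions. handleNoMatch is a hypothetical helper, the error messages are made up, and the [char, char] token shape is inferred from entries such as ['(', '('] in the Result listing.

```javascript
// Hypothetical helper mirroring the three documented modes; not TinyLex's code.
// Returns how far to advance the cursor.
function handleNoMatch(mode, chunk, tokens) {
  const char = chunk[0]
  switch (mode) {
    case 'tokenize':
      // The character becomes both the token name and its content.
      tokens.push([char, char])
      return 1
    case 'ignore':
      // Skip the character without emitting anything.
      return 1
    case 'throw':
      throw new Error(`No rule matched at "${char}"`) // message text is invented
    default:
      throw new Error(`Unknown onError mode: ${mode}`)
  }
}

const emitted = []
const tokenizeAdvance = handleNoMatch('tokenize', '(msg)', emitted)
const ignoreAdvance = handleNoMatch('ignore', '(msg)', emitted)

let threw = false
try {
  handleNoMatch('throw', '(msg)', emitted)
} catch (error) {
  threw = true
}

console.log(emitted, tokenizeAdvance, ignoreAdvance, threw)
```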