universal-lexer
v2.0.6
Universal lexer, where you can pass your own rules for lexical analysis
Universal Lexer
A lexer that splits any text input into tokens, according to the regular expressions you provide.
In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
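To make the idea concrete, here is a minimal conceptual sketch of a regex-driven lexer. It is not universal-lexer's implementation: the function name `tokenizeSketch` and the simplified `{ type, regex }` rule shape are assumptions for illustration only.

```javascript
// Conceptual sketch only: this is NOT universal-lexer's implementation.
// `definitions` uses a simplified shape: { type, regex }, where regex is a RegExp source string.
function tokenizeSketch (definitions, input) {
  // Sticky ('y') regexes match only at lastIndex, i.e. at the current position
  const rules = definitions.map(d => ({ type: d.type, regex: new RegExp(d.regex, 'y') }))
  const tokens = []
  let index = 0

  while (index < input.length) {
    let matched = false
    for (const rule of rules) {
      rule.regex.lastIndex = index
      const match = rule.regex.exec(input)
      // Skip failed and empty matches, so the loop always advances
      if (match === null || match[0].length === 0) continue
      tokens.push({
        type: rule.type,
        // Named groups (if any) become the token's data
        data: Object.assign({}, match.groups),
        start: index,
        end: index + match[0].length
      })
      index += match[0].length
      matched = true
      break
    }
    // No rule matched at this position
    if (!matched) return { error: 'Unrecognized token', index }
  }
  return { tokens }
}

const result = tokenizeSketch([
  { type: 'Space', regex: '[ \\t]+' },
  { type: 'Word', regex: '(?<value>[a-z]+)' },
  { type: 'Colon', regex: ':' }
], 'color: black')

console.log(result.tokens.map(t => t.type).join(' ')) // Word Colon Space Word
```

The first rule that matches at the current position wins, so the order of definitions matters in this sketch.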
Features
- Supports named groups in regular expressions, so matched values are attached to tokens without extra work
- Supports post-processing of tokens, so you can extract any additional information you need
How to install
The package is available on NPM as universal-lexer, so you can add it to your project with:
npm install universal-lexer
or:
yarn add universal-lexer
What are the requirements?
The code itself is written in ES6 and should work in a Node.js 6+ environment.
If you would like to use it in a browser or in an older environment, a transpiled and bundled (UMD) version is also included.
You can use universal-lexer/browser in your requires, or UniversalLexer in the global environment (in a browser):
// Load library
const UniversalLexer = require('universal-lexer/browser')
// Create lexer
const lexer = UniversalLexer.compile(definitions)
// ...
How it works
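There are two sets of functions: build and buildFromFile produce the code of a lexer, while compile and compileFromFile produce a lexer function that can be used directly: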
// Load library
const UniversalLexer = require('universal-lexer')
// Build code for this lexer
const code1 = UniversalLexer.build([ { type: 'Colon', value: ':' } ])
const code2 = UniversalLexer.buildFromFile('json.yaml')
// Compile dynamically a function which can be used
const func1 = UniversalLexer.compile([ { type: 'Colon', value: ':' } ])
const func2 = UniversalLexer.compileFromFile('json.yaml')
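The difference between building code and compiling a function can be illustrated with a toy generator. This is only a sketch under assumptions: `buildSource` and `compileSketch` are hypothetical names, they handle only `value` definitions, and the code they generate is not what universal-lexer emits.

```javascript
// Hypothetical toy generator: NOT universal-lexer's real output.
// Returns the source code (a string) of a lexer body for 'value' definitions.
function buildSource (definitions) {
  const checks = definitions.map(d => {
    const type = JSON.stringify(d.type)
    const value = JSON.stringify(d.value)
    const length = d.value.length
    return `if (input.startsWith(${value}, i)) { tokens.push({ type: ${type}, data: { value: ${value} }, start: i, end: i + ${length} }); i += ${length}; continue }`
  }).join('\n    ')

  return `const tokens = []
  let i = 0
  while (i < input.length) {
    ${checks}
    return { error: 'Unrecognized token', index: i }
  }
  return { tokens }`
}

// Turns the generated source into a callable function
function compileSketch (definitions) {
  return new Function('input', buildSource(definitions))
}

const source = buildSource([ { type: 'Colon', value: ':' } ])
const lexer = compileSketch([ { type: 'Colon', value: ':' } ])

console.log(typeof source) // string
console.log(lexer('::').tokens.length) // 2
```

Generated source can be written to a file and shipped without the generator, while a compiled function is convenient when the rules are only known at runtime.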
There are two ways of passing rules to this lexer: from a file or as an array of definitions.
Pass as array of definitions
Simply pass the definitions to the lexer:
// Load library
const UniversalLexer = require('universal-lexer')
// Create token definition
const Colon = {
type: 'Colon',
value: ':'
}
// Build array of definitions
const definitions = [ Colon ]
// Create lexer
const lexer = UniversalLexer.compile(definitions)
A definition can be a more complex object:
// Required fields: 'type' and either 'regex' or 'value'
{
  // Token name
  type: 'String',
  // String value which should be matched at the beginning of the remaining input,
  // e.g. 'abc' or '('
  value: 'abc',
  // Regular expression to validate
  // whether the current token should be parsed as this token.
  // Useful e.g. when you require a separator after a sentence,
  // but you don't want to include it.
  valid: '"',
  // Regular expression flags for the 'valid' field
  validFlags: 'i',
  // Regular expression to find the current token.
  // You can use named groups as well, (?<name>expression);
  // their matches will be attached to the token.
  regex: '"(?<value>([^"]|\\.)+)"',
  // Regular expression flags for the 'regex' field
  regexFlags: 'i'
}
Pass YAML file
// Load library
const UniversalLexer = require('universal-lexer')
const lexer = UniversalLexer.compileFromFile('scss.yaml')
For now, the YAML file should contain only a Tokens property with definitions.
Later it may support more advanced features, such as macros (for simpler syntax).
Example:
Tokens:
  # Whitespaces
  - type: NewLine
    value: "\n"
  - type: Space
    regex: '[ \t]+'
  # Math
  - type: Operator
    regex: '[-+*/]'
  # Color
  # It has a 'valid' field, to be sure that it's not e.g. "blacker".
  # It checks that no word character follows.
  - type: Color
    regex: '(?<value>black|white)'
    valid: '(black|white)[^\w]'
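The effect of pairing 'regex' with 'valid' in the Color rule can be checked with plain regular expressions. This is a sketch under assumptions: `matchColor` is a hypothetical helper, and how universal-lexer evaluates 'valid' internally is not shown in this document.

```javascript
// Hypothetical illustration of the 'regex' + 'valid' pair from the Color rule.
// Assumption: 'valid' must match at the current position before 'regex' is applied.
const regex = /(?<value>black|white)/y
const valid = /(black|white)[^\w]/y

function matchColor (input, index) {
  // Validate first: the color name must be followed by a non-word character
  valid.lastIndex = index
  if (!valid.test(input)) return null

  // Then extract the token itself, without the trailing character
  regex.lastIndex = index
  const match = regex.exec(input)
  return match && {
    type: 'Color',
    data: { value: match.groups.value },
    start: index,
    end: index + match[0].length
  }
}

console.log(matchColor('black;', 0)) // Color token covering 'black' only
console.log(matchColor('blacker', 0)) // null
console.log(matchColor('black', 0)) // null, nothing follows the color name
```

Note that, taken literally, this 'valid' pattern requires at least one character after the color name, so a color at the very end of the input would not pass it in this sketch.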
Processing data
Once you have created a lexer, processing input data is straightforward: call the compiled function with the input string.
// Load library
const UniversalLexer = require('universal-lexer')
// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')
// Tokenize input
const tokens = tokenize('some { background: code }').tokens
Post-processing tokens
If you would like to do more advanced processing of parsed tokens, you can pass a processor function as the second argument of the lexer function:
// Load library
const UniversalLexer = require('universal-lexer')
// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')
// That's the 'Literal' definition
// (backslashes and the single quote are escaped for a JavaScript string):
const Literal = {
  type: 'Literal',
  regex: '(?<value>([^\\t \\n;"\',{}()\\[\\]#=:~&\\\\]|(\\\\.))+)'
}
// Create processor which will replace all '\X' to 'X' in value
function process (token) {
if (token.type === 'Literal') {
token.data.value = token.data.value.replace(/\\(.)/g, '$1')
}
return token
}
// Also, you can return a new token
function process2 (token) {
if (token.type !== 'Literal') {
return token
}
return {
type: 'Literal',
data: {
value: token.data.value.replace(/\\(.)/g, '$1')
},
start: token.start,
end: token.end
}
}
// Get all tokens...
const tokens = tokenize('some { background: code }', process).tokens
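The example above passes a single processor. If several independent processing steps are needed, they could be chained into one function first. The sketch below assumes only what the example shows (a processor takes a token and returns a token); `composeProcessors`, `unescapeLiteral`, `markBooleans`, and the Boolean token type are hypothetical names, not part of the library.

```javascript
// Hypothetical helper: chains several processors into a single function,
// assuming each processor takes a token and returns a token
function composeProcessors (...processors) {
  return token => processors.reduce((current, processor) => processor(current), token)
}

// Replaces all '\X' with 'X' in Literal values, returning a new token
function unescapeLiteral (token) {
  if (token.type !== 'Literal') return token
  return Object.assign({}, token, {
    data: { value: token.data.value.replace(/\\(.)/g, '$1') }
  })
}

// Re-types the literals 'true' and 'false' as a (hypothetical) Boolean token
function markBooleans (token) {
  if (token.type !== 'Literal') return token
  if (token.data.value !== 'true' && token.data.value !== 'false') return token
  return Object.assign({}, token, { type: 'Boolean' })
}

const processAll = composeProcessors(unescapeLiteral, markBooleans)

// Hand-written tokens, standing in for real lexer output
const escaped = { type: 'Literal', data: { value: 'a\\:b' }, start: 0, end: 4 }
const flag = { type: 'Literal', data: { value: 'true' }, start: 5, end: 9 }

console.log(processAll(escaped).data.value) // a:b
console.log(processAll(flag).type) // Boolean
```

Under that assumption, the composed function could be passed wherever a single processor is expected, for example as tokenize(input, processAll).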
Beautified code
If you would like to get the beautified code of the lexer, pass true as the second argument of the compile functions:
UniversalLexer.compile(definitions, true)
UniversalLexer.compileFromFile('scss.yaml', true)
Possible results
On success you will receive a simple object with an array of tokens:
{
  tokens: [
    { type: 'Whitespace', data: { value: '     ' }, start: 0, end: 5 },
    { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
  ]
}
When something goes wrong, you will get error information instead:
{
error: 'Unrecognized token',
index: 1,
line: 1,
column: 2
}
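Since both outcomes are plain objects, the caller has to check which one came back. A minimal sketch of such a check follows; `unwrapTokens` is a hypothetical helper, and the result shapes are taken from the two examples above.

```javascript
// Hypothetical helper: returns the tokens on success,
// or throws a SyntaxError built from the error fields shown above
function unwrapTokens (result) {
  if (result.error) {
    throw new SyntaxError(
      `${result.error} at line ${result.line}, column ${result.column} (index ${result.index})`
    )
  }
  return result.tokens
}

// Hand-written results, standing in for real lexer output
const success = { tokens: [ { type: 'Word', data: { value: 'some' }, start: 0, end: 4 } ] }
const failure = { error: 'Unrecognized token', index: 1, line: 1, column: 2 }

console.log(unwrapTokens(success).length) // 1

try {
  unwrapTokens(failure)
} catch (e) {
  console.log(e.message) // Unrecognized token at line 1, column 2 (index 1)
}
```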
Examples
For now, you can see an example of JSON semantics in the examples/json.yaml file.
CLI
After installing the package globally (or from within NPM scripts), the universal-lexer command is available:
Usage: universal-lexer [options] output.js
Options:
  --version       Show version number      [boolean]
  -s, --source    Semantics file           [required]
  -b, --beautify  Should beautify code?    [boolean] [default: true]
  -h, --help      Show help                [boolean]
Examples:
  universal-lexer -s json.yaml lexer.js    build lexer from semantics file
Changelog
Version 2
- 2.0.6 - bugfix for single characters
- 2.0.5 - fix mistake in README file (post-processing code)
- 2.0.4 - remove unneeded benchmark dependency
- 2.0.3 - add unit and E2E tests, fix small bugs
- 2.0.2 - added CLI command
- 2.0.1 - fix typo in README file
- 2.0.0 - optimize it (even 10x faster) by expression analysis and some other things
Version 1
- 1.0.8 - make the reported position in syntax errors always start from 1
- 1.0.7 - optimize definitions with "value", make syntax errors developer-friendly
- 1.0.6 - optimized Lexer performance (20% faster on average)
- 1.0.5 - fix browser version to be put into NPM package properly
- 1.0.4 - bugfix for debugging
- 1.0.3 - add proper sanitization for debug HTML
- 1.0.2 - small fixes for README file
- 1.0.1 - added Rollup.js support to build version for browser