universal-lexer

v2.0.6

Published

3 years ago

Universal lexer to which you can pass your own rules for lexical analysis

rangoo

tokenizer token lexer universal lexical analysis parser expression

Universal Lexer

A lexer that converts any text input into tokens, according to the regular expressions you provide.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of tokens (strings with an assigned and thus identified meaning). A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, though scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
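To make the idea concrete, here is a minimal hand-written sketch of lexing in plain JavaScript. It does not use this package and is not how the package is implemented; the rule set and the `sketchTokenize` name are made up for illustration.

```javascript
// Hypothetical rule set: each rule pairs a token type with an anchored regular expression
const rules = [
  { type: 'Number', regex: /^[0-9]+/ },
  { type: 'Operator', regex: /^[-+*/]/ },
  { type: 'Space', regex: /^[ \t]+/ }
]

// Walk the input from left to right and emit the first rule that matches at each position
function sketchTokenize (input) {
  const tokens = []
  let index = 0

  while (index < input.length) {
    const rest = input.slice(index)
    let matched = false

    for (const rule of rules) {
      const match = rule.regex.exec(rest)
      if (match) {
        tokens.push({
          type: rule.type,
          value: match[0],
          start: index,
          end: index + match[0].length
        })
        index += match[0].length
        matched = true
        break
      }
    }

    // No rule matched at this position, so the input cannot be tokenized
    if (!matched) {
      throw new Error('Unrecognized token at index ' + index)
    }
  }

  return tokens
}

// Produces five tokens: Number, Space, Operator, Space, Number
const exampleTokens = sketchTokenize('12 + 3')
```

A real lexer generator avoids re-slicing the input and trying rules one by one, but the overall shape (rules in, positioned tokens out) is the same.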

Features

Allows named groups in regular expressions, so captured data is attached to tokens without extra work
Allows post-processing of tokens, so you can extract any additional information you require

How to install

The package is available on NPM as universal-lexer, so you can add it to your project with npm install universal-lexer or yarn add universal-lexer

What are the requirements?

The code itself is written in ES6 and should work in a Node.js 6+ environment. If you would like to use it in a browser or an older environment, a transpiled and bundled (UMD) version is also included. You can use universal-lexer/browser in your requires, or UniversalLexer from the global scope (in a browser):

// Load library
const UniversalLexer = require('universal-lexer/browser')

// Create lexer
const lexer = UniversalLexer.compile(definitions)

// ...

How it works

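You've got two sets of functions: build and buildFromFile return the generated lexer code, while compile and compileFromFile return a function which is ready to use: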

// Load library
const UniversalLexer = require('universal-lexer')

// Build code for this lexer
const code1 = UniversalLexer.build([ { type: 'Colon', value: ':' } ])
const code2 = UniversalLexer.buildFromFile('json.yaml')

// Compile dynamically a function which can be used
const func1 = UniversalLexer.compile([ { type: 'Colon', value: ':' } ])
const func2 = UniversalLexer.compileFromFile('json.yaml')

There are two ways to pass rules to the lexer: from a file or as an array of definitions.

Pass as array of definitions

Simply pass the definitions to the lexer:

// Load library
const UniversalLexer = require('universal-lexer')

// Create token definition
const Colon = {
  type: 'Colon',
  value: ':'
}

// Build array of definitions
const definitions = [ Colon ]

// Create lexer
const lexer = UniversalLexer.compile(definitions)

A definition can be a more complex object:

// Required fields: 'type' and either `regex` or `value`
{
  // Token name
  type: 'String',

  // String value which should be searched for at the beginning of the string,
  // e.g. 'abc' or '('
  value: '(',

  // Regular expression to validate
  // whether the current token should be parsed as this token.
  // Useful e.g. when you require a separator after a sentence,
  // but you don't want to include it in the token.
  valid: '"',

  // Regular expression flags for 'valid' field
  validFlags: 'i',

  // Regular expression to find the current token.
  // You can use named groups as well, (?<name>expression):
  // the captured data will then be attached to the token.
  regex: '"(?<value>([^"]|\\.)+)"',

  // Regular expression flags for 'regex' field
  regexFlags: 'i'
}
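
The named-group syntax is the same one that native JavaScript regular expressions support since ES2018. As a rough illustration of how captured groups can end up as token data, here is a sketch using only the built-in RegExp. The `matchToToken` helper and the simplified pattern are hypothetical, and the package may process patterns differently internally.

```javascript
// Hypothetical helper: match a pattern at the start of the input
// and copy its named groups into a token-like object
function matchToToken (type, pattern, input) {
  const match = pattern.exec(input)
  if (!match) {
    return null
  }

  return {
    type: type,
    // match.groups holds the named groups (undefined when the pattern has none)
    data: Object.assign({}, match.groups),
    start: match.index,
    end: match.index + match[0].length
  }
}

// Simplified string pattern with a named group called 'value'
const stringToken = matchToToken('String', /^"(?<value>[^"]+)"/, '"hello" world')
// stringToken.data.value is 'hello'; the surrounding quotes are not part of the group
```

This needs an engine with native named groups (Node.js 10 or later).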

Pass YAML file

// Load library
const UniversalLexer = require('universal-lexer')

const lexer = UniversalLexer.compileFromFile('scss.yaml')

For now, the YAML file should contain only a Tokens property with the definitions. Later it may support more advanced features, such as macros (for simpler syntax).

Example:

Tokens:
  # Whitespaces

  - type: NewLine
    value: "\n"

  - type: Space
    regex: '[ \t]+'

  # Math

  - type: Operator
    regex: '[-+*/]'

  # Color
  # It has a 'valid' field, to make sure that it's not e.g. "blacker":
  # it checks that no word character follows the color name

  - type: Color
    regex: '(?<value>black|white)'
    valid: '(black|white)[^\w]'
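
For comparison, the same rules can be written as an array of definitions. This sketch assumes that each entry under Tokens maps one-to-one to a definition object. The `findBrokenPatterns` helper is hypothetical: it only checks that each pattern compiles with the native RegExp engine, which may differ from how the package itself reads patterns.

```javascript
// The YAML rules above as JavaScript definitions.
// Backslashes are doubled, because these are JavaScript strings.
const definitions = [
  { type: 'NewLine', value: '\n' },
  { type: 'Space', regex: '[ \\t]+' },
  // The hyphen comes first in the class, so it is read literally and not as a range
  { type: 'Operator', regex: '[-+*/]' },
  { type: 'Color', regex: '(?<value>black|white)', valid: '(black|white)[^\\w]' }
]

// Hypothetical sanity check: list every 'regex' or 'valid' pattern that fails to compile
function findBrokenPatterns (defs) {
  const broken = []

  for (const def of defs) {
    for (const field of ['regex', 'valid']) {
      if (def[field] === undefined) {
        continue
      }

      try {
        new RegExp(def[field])
      } catch (error) {
        broken.push(def.type + '.' + field)
      }
    }
  }

  return broken
}

const brokenPatterns = findBrokenPatterns(definitions)
```

A check like this catches mistakes such as an out-of-order range in a character class before the rules reach the lexer.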

Processing data

Once you have created a lexer, processing input data is straightforward: call the lexer function with the input string:

// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')

// Tokenize the input
const tokens = tokenize('some { background: code }').tokens

Post-processing tokens

If you would like to do more advanced processing of the parsed tokens, you can pass a processor function as the second argument of the lexer function:

// Load library
const UniversalLexer = require('universal-lexer')

// Create lexer
const tokenize = UniversalLexer.compileFromFile('scss.yaml')

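// That's the 'Literal' definition
// (backslashes are doubled and the quote is escaped, because this is a JavaScript string):
const Literal = {
  type: 'Literal',
  regex: '(?<value>([^\\t \\n;"\',{}()\\[\\]#=:~&\\\\]|(\\\\.))+)'
}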

// Create processor which will replace all '\X' to 'X' in value
function process (token) {
  if (token.type === 'Literal') {
    token.data.value = token.data.value.replace(/\\(.)/g, '$1')
  }

  return token
}

// Also, you can return a new token
function process2 (token) {
  if (token.type !== 'Literal') {
    return token
  }

  return {
    type: 'Literal',
    data: {
      value: token.data.value.replace(/\\(.)/g, '$1')
    },
    start: token.start,
    end: token.end
  }
}

// Get all tokens...
const tokens = tokenize('some { background: code }', process).tokens
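
Since a processor is a plain function from token to token, it can be tried out in isolation, without building a lexer. The sketch below assumes the token shape shown in this README (type, data, start, end); the `unescapeLiteral` name and the sample token are made up for illustration.

```javascript
// Hypothetical processor: returns a copy of a 'Literal' token with '\X' replaced by 'X'
function unescapeLiteral (token) {
  if (token.type !== 'Literal') {
    return token
  }

  const data = Object.assign({}, token.data, {
    value: token.data.value.replace(/\\(.)/g, '$1')
  })

  // Copy the token instead of mutating it, so the original stays untouched
  return Object.assign({}, token, { data: data })
}

// Hand-written sample token; the value contains an escaped colon (a\:b)
const sampleToken = { type: 'Literal', data: { value: 'a\\:b' }, start: 0, end: 4 }
const processedToken = unescapeLiteral(sampleToken)
// processedToken.data.value is 'a:b', while start and end still describe the original input
```

Returning a copy rather than mutating the token is a design choice of this sketch; both styles are shown in the processors above.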

Beautified code

If you would like to get beautified lexer code, pass true as the second argument of the compile functions:

UniversalLexer.compile(definitions, true)
UniversalLexer.compileFromFile('scss.yaml', true)

Possible results

On success, you will receive a simple object with an array of tokens:

{
  tokens: [
    { type: 'Whitespace', data: { value: '     ' }, start: 0, end: 5 },
    { type: 'Word', data: { value: 'some' }, start: 5, end: 9 }
  ]
}

When something goes wrong, you will get an object with error information instead:

{
  error: 'Unrecognized token',
  index: 1,
  line: 1,
  column: 2
}
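
Because both outcomes are plain objects, calling code has to check which one it received. A minimal sketch of such a check follows; the `unwrapTokens` helper is hypothetical, and it assumes that a successful result has no error property, as in the objects above.

```javascript
// Hypothetical helper: return the tokens, or throw a readable error
function unwrapTokens (result) {
  if (result.error) {
    throw new Error(
      result.error + ' at line ' + result.line + ', column ' + result.column
    )
  }

  return result.tokens
}

// Usage with hand-written results in the shapes shown above
const okTokens = unwrapTokens({
  tokens: [ { type: 'Word', data: { value: 'some' }, start: 0, end: 4 } ]
})

let failureMessage = null
try {
  unwrapTokens({ error: 'Unrecognized token', index: 1, line: 1, column: 2 })
} catch (error) {
  failureMessage = error.message
}
```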

Examples

For now, you can see an example of JSON semantics in the examples/json.yaml file.

CLI

After installing the package globally (or when running inside NPM scripts), the universal-lexer command is available:

Usage: universal-lexer [options] output.js

Options:
  --version       Show version number                                  [boolean]
  -s, --source    Semantics file                                      [required]
  -b, --beautify  Should beautify code?                [boolean] [default: true]
  -h, --help      Show help                                            [boolean]

Examples:
  universal-lexer -s json.yaml lexer.js  build lexer from semantics file

Changelog

Version 2

2.0.6 - bugfix for single characters
2.0.5 - fix mistake in README file (post-processing code)
2.0.4 - remove unneeded benchmark dependency
2.0.3 - add unit and E2E tests, fix small bugs
2.0.2 - added CLI command
2.0.1 - fix typo in README file
2.0.0 - optimize it (even 10x faster) by expression analysis and some other things

Version 1

1.0.8 - change that current position in syntax error starts from 1 always
1.0.7 - optimize definitions with "value", make syntax errors developer-friendly
1.0.6 - optimized Lexer performance (20% faster in average)
1.0.5 - fix browser version to be put into NPM package properly
1.0.4 - bugfix for debugging
1.0.3 - add proper sanitization for debug HTML
1.0.2 - small fixes for README file
1.0.1 - added Rollup.js support to build version for browser