Lexerific

WORK IN PROGRESS

A simple streaming tokenizer built on the awesome-automata finite state machine library. It implements the Node.js streams API to make working with files and pipelining easy (so multiple lexing passes can run in parallel with good efficiency in I/O-limited environments).

The lexer supports string-based tokenization (by processing patterns character-by-character). It also supports lexing its own output, so a second or even third pass can further reduce the number of tokens or build slightly more complex ones.
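
For example, a single string-mode pass over a file might look like this (a sketch; it assumes a configured Lexerific instance behaves as an object-mode Transform stream, per the streams API mentioned above):

    var fs = require('fs');
    var Lexerific = require('lexerific');

    var lexer = new Lexerific({
      mode: 'string',
      specialCharacters: ['_']
    });

    // ... lexer.addLexemes() and the default lexeme are configured here ...

    fs.createReadStream('document.md')
      .pipe(lexer)
      .on('data', function (token) {
        // Each chunk is a lexed token object,
        // e.g. {token: 'text', attributeValue: ' hello'}
        console.log(token);
      });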

This library is meant to do only the scanning/tokenization based on a lexicon and does not support syntactic analysis or the building of any Abstract Syntax Tree typical of a parser.

For character-based lexing, this library doesn't support regular expressions, but there are some helpful macros to cover the most common use cases. The primary reason is that in most cases (at least the ones I can think of), tokens that could be generated via a regex can also be generated with multiple, stratified scanning passes. If you need much more complex patterns, then this tool is probably not for you (check out the jison lexer or something else that implements a regex-based tokenizer).

What's so lexerific about it?

Lexerific focuses on being a solution for a very narrow problem set:

  1. Make it easy and intuitive to specify a string- or token-based lexicon that will map (and potentially reduce) an input stream into a series of domain-specific tokens
  2. Implement the scanner via the streams API so that pipelines of transformations from external streams (files, HTTP, binary-buffer based databases, etc.) are easy to build

To make the first point even narrower, it is only meant to work on finite character or token patterns (so no <.+> patterns), both to keep efficiency controllable and to be very strict about the allowable constructs. The only infinite pattern allowed is a single-lexeme loop (something like a+), which lets a character- or token-based finite state machine create a self-referencing node. The repeated element must be represented by a single input token so that it occupies one state within the FSM (in a raw character stream, every character in the pattern is a separate state within the machine, whereas in a stream of token objects, states are specified by token type).

That being said, it is possible to implement such a wild-card pattern approach via multiple scanning passes. For example, to identify an HTML opening tag:

'<div>'

The first pass might produce the following series of tokens:

  |<|   |d|   |i|   |v|   |>|
  tag-  text  text  text  tag-
  open                    close

Then the second pass would produce:

{token: 'open-tag', attributeValue: 'div'}

Because its pattern would be:


    {
      token: 'open-tag',
      pattern: [
        {token: 'tag-open'}, 
        {token: 'text', min: 1, max: false}, // max: false means no upper bound (repeat indefinitely)
        {token: 'tag-close'}
      ],
      attributeValue: function (inputTokens) {
        var attrVal = '';

        inputTokens.forEach(function (token) {
          if (token.type === 'text') {
            attrVal += token.getAttrVal();
          }
        });
        
        return attrVal;
      }
    }
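
For reference, the first pass that produced the tag-open / text / tag-close tokens could be configured in string mode along these lines (a sketch using the addLexemes / addDefaultLexeme API described below):

    var pass1 = new Lexerific({
      mode: 'string',
      specialCharacters: ['<', '>']
    });

    pass1.addLexemes([
      {token: 'tag-open', pattern: '<', attributeValue: null},
      {token: 'tag-close', pattern: '>', attributeValue: null}
    ]);

    // Everything that isn't a tag delimiter falls through to text
    pass1.addDefaultLexeme({
      token: 'text',
      attributeValue: true
    });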

Specifying a lexicon

When creating a lexerific instance, there are two options:

  • mode - string or token
  • specialCharacters - Only available in string mode

The mode determines what type of input the incoming pipe should expect. In string mode, it's the typical buffer of binary data that can be converted to strings. In token mode, it's a stream of previously lexed tokens of the format {token: 'some-token', attributeValue: {arbitraryValue: true}}. Lexerific is designed to be a multi-pass lexer that can take advantage of the Node.js streams API pipe() interface to create a transformation pipeline.

In string mode, you should specify an array of special characters, which is used to preprocess the incoming text via regular expressions for performance reasons. The preprocessing basically separates normal text from potentially special character sequences. Without this step, a markdown parser built on lexerific is about 300 times slower than the existing marked library; with it, it's about 50 times slower.


    // Supports character-based states and regex-word based states
    var lexer = new Lexerific({
      mode: 'string',
      specialCharacters: ['<', '>', '(', ')']
    }); 
    
    // A convenience / batch wrapper around lexer.addLexeme();
    
    lexer.addLexemes([
      {
        token: 'em-open',
        pattern: ' _',
        attributeValue: null
      },
      {
        token: 'strong-open',
        pattern: ' __',
        attributeValue: null
      }
    ]);
    
    // When nothing else matches - the default terminal state when all other options
    // are exhausted
    lexer.addDefaultLexeme({
      token: 'text',
      attributeValue: true
    });

A character-by-character lexer

The character mode breaks each of the patterns into a sequence of characters. It then maps a state diagram from one possible character sequence to the next range of possible sequences.

Below is a graphical representation of the simple state diagram for Markdown's em and strong tokens. The strong-open, em-open, and text labels represent tokens fired during the transition back to the root node.

At the root node, only a space qualifies as a valid transition forward; everything else results in a text token being fired and the state transitioning back to the root node.

                strong-open
       <-----------------------------*
      |      em-open                 |
      |<------------------*          |
      |  text             |          |
      |<-------*          |          |
      |        |          |          |
   -->R------ ' ' ------ '_' ------ '_'
  |   |
   ---
   text

Won't specifying a complex lexicon be a real pain?

We'll see...

Patterns with tokens!

If you've previously lexed a set of tokens, you can use token mode to perform multiple lexical analyses that aggregate and simplify tokens further.

For example:


    var lexer = new Lexerific({mode: 'token'}); 
    
    lexer.addLexemes([
      {
        token: 'em',
        pattern: [
          {token: 'em-open'},
    
          // Match an infinite number of text nodes - a self-referencing state
          // within the FSM
          {token: 'text', repeat: true},
          {token: 'em-close'}
        ]
      }
    ]);
    
    // When nothing else matches - the default terminal state when all other options
    // are exhausted
    lexer.addDefaultLexeme({
      token: 'text',
      attributeValue: true
    });

Note that this will probably require a third pass in order to group the tokens into paragraphs. The phases for a Markdown parser are as follows:

  • Pass 1: Differentiate between text and special tokens
  • Pass 2: Identify inline styles
  • Pass 3: Identify paragraph and block styles
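
In practice, the passes chain together with pipe(). A sketch, assuming pass1, pass2 and pass3 are Lexerific instances configured for each phase as described above:

    var fs = require('fs');

    fs.createReadStream('document.md')
      .pipe(pass1) // string mode: text vs. special tokens
      .pipe(pass2) // token mode: inline styles (em, strong, ...)
      .pipe(pass3) // token mode: paragraph and block styles
      .on('data', function (token) {
        console.log(token);
      });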

API Reference

Lexerific()

Construct a Lexerific instance. Lexerific extends the Node.js streams API and therefore implements all of its parent's APIs.

In addition, the constructor accepts the following options:

  • mode - string or token
  • specialCharacters - Only available in string mode

See the specifying a lexicon section for more details.


var lexer = new Lexerific({
  mode: 'string',
  specialCharacters: ['\\{', '\\}', '\\(', '\\)']
});

lexer.addLexeme()

Add a possible token to the lexer. Configurations have the following format:


lexer.addLexeme({
  token: 'my-output-token',
  pattern: [{token: 'em'}, {token: 'text', repeat: true}, {token: 'em'}],
  attributeValue: function (history, input) {
    history.splice(0, 2); // remove the root and first em tag
    history.pop(); // remove the closing em tag
    
    // Aggregate the text into one string value
    return history.reduce(function (prev, current) {
      return prev + current.input.attributeValue;
    }, '');
  }
});

Lexeme Patterns

Patterns are arrays of objects of the format:


[
  {token: 'some-token', repeat: true, inverted: false, optional: true},
  {token: 'other-token', repeat: false, inverted: false, optional: false}
]

The token is the name of the token that this pattern unit matches against. The repeat option sets a one-or-more constraint, the optional flag sets a zero-or-one constraint, and the inverted flag means match everything EXCEPT this token. An inverted unit MUST be followed by a unit for the token it inverts, otherwise it will match everything.
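
For example, a lexeme that captures everything between two quote tokens might look like this (a sketch; the token names are illustrative):


lexer.addLexeme({
  token: 'quoted',
  pattern: [
    {token: 'quote'},
    // Match one or more of anything EXCEPT a quote token...
    {token: 'quote', inverted: true, repeat: true},
    // ...followed by the token the inverted unit inverts
    {token: 'quote'}
  ],
  attributeValue: function (history, input) {
    history.splice(0, 2); // drop the root state and the opening quote
    history.pop();        // drop the closing quote

    // Aggregate the attribute values of everything in between
    return history.reduce(function (prev, current) {
      return prev + current.input.attributeValue;
    }, '');
  }
});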

Attribute Value Generators

Attribute values for lexemes can either be a static value or a function that generates the attribute value. The inputs to any attribute-value-generating function are (history, input), where history is an array of the states that have been visited to match this token (especially important when there are a bunch of repeats). Whatever the function returns is used as the attribute value for the resulting lexer token.

  attributeValue: function (history, input) {
    return input.attributeValue;
  }

lexer.addLexemes()

A helper wrapper around addLexeme() that accepts an array of lexeme options instead of a single lexeme.

lexer.addDefaultLexeme()

The default lexeme is the token that gets generated whenever no other token patterns are satisfied; because it is the fallback, it requires no pattern of its own.

In addition to the normal attribute value generator, default lexemes must also specify what to do in the case of a partial match (i.e., the input got halfway through some other token's pattern but then failed to complete it). That handler takes the same inputs as the normal attribute-generating function.
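
A sketch of what that might look like (note: the partialMatch option name below is hypothetical; check the source for the actual key):


lexer.addDefaultLexeme({
  token: 'text',
  attributeValue: function (history, input) {
    return input.attributeValue;
  },
  // Hypothetical option name: invoked when some other pattern was partially
  // matched and then failed, so the consumed input is emitted as text instead
  partialMatch: function (history, input) {
    return history.reduce(function (prev, current) {
      return prev + current.input.attributeValue;
    }, '');
  }
});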