@bablr/regex-vm

v0.9.0

A fully streaming regular expression engine

@bablr/regex-vm is a fully-featured regex engine written in JavaScript. The engine's implementation is non-backtracking, which makes it ideal for matching against streaming inputs of any kind. It is expected to be used most commonly in the building of streaming parsers, especially in conjunction with @bablr/parserate (coming soon!).
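
To give a feel for why a non-backtracking design suits streams, here is a minimal sketch of a state-set simulation. It is not this library's implementation: the hand-built `nfa` table for /ab*c/, `streamingTest`, and `counted` are all hypothetical names made up for illustration. The point is only that each character is looked at once and never revisited, so the input can be any iterator.

```javascript
// Hypothetical hand-built NFA for the pattern /ab*c/ (illustration only).
// transitions[state][char] lists the states reachable from `state` on `char`.
const nfa = {
  start: 0,
  accept: 2,
  transitions: [
    { a: [1] }, // state 0: 'a' moves to state 1
    { b: [1], c: [2] }, // state 1: loop on 'b', or 'c' moves to the accept state
    {}, // state 2: accepting state, no outgoing transitions
  ],
};

// Advances every active state in lockstep, one character at a time.
function streamingTest(nfa, chars) {
  let active = new Set();
  for (const ch of chars) {
    active.add(nfa.start); // a match may begin at any position
    const next = new Set();
    for (const state of active) {
      for (const target of nfa.transitions[state][ch] ?? []) {
        if (target === nfa.accept) return true; // stop pulling input early
        next.add(target);
      }
    }
    active = next;
  }
  return false;
}

// A generator that counts how many characters were actually pulled.
function* counted(text, counter) {
  for (const ch of text) {
    counter.n++;
    yield ch;
  }
}

const counter = { n: 0 };
const found = streamingTest(nfa, counted('acZZZ', counter));
console.log(found, counter.n);
```

Because the matcher returns as soon as the accept state is reached, the trailing characters of the generator are never requested.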

Not everyone needs a streaming regex engine. If you are matching a static regex against string data, it is very likely that you should be using the native regex implementation. However, if you are working with data that is fundamentally a stream, this engine may save you from having to load all the data into a string first. If perf is your only reason to use this engine, make sure to do some tests to see that you are actually gaining perf. The engine is still quite slow!
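
The following sketch shows the problem with streamed data using only the native RegExp (the chunk contents are made up). A match that straddles a chunk boundary is invisible when each chunk is tested on its own, which is why the native engine needs the whole input joined into one string first.

```javascript
// Two chunks as a stream might deliver them; "foobar" straddles the boundary.
const chunks = ['...foo', 'bar...'];
const pattern = /foobar/;

// Testing chunk by chunk misses the match.
const matchedPerChunk = chunks.some((chunk) => pattern.test(chunk));

// The native engine only finds it once everything is buffered into one string.
const matchedJoined = pattern.test(chunks.join(''));

console.log(matchedPerChunk, matchedJoined); // false true
```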

Performance

The non-backtracking design also means the engine is not vulnerable to the phenomenon known as catastrophic backtracking, which can make some not-uncommon naively written patterns have essentially infinite time cost to evaluate. This makes the engine more suitable for use with user-supplied patterns, especially when combined with tools like glob syntaxes which can offer users some of the power of regex (and compile to regex) but without the steep learning curve of regex syntax.
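
As an illustration of catastrophic backtracking, here is a small sketch using the native (backtracking) RegExp. The pattern /^(a+)+$/ is a commonly cited problem case and `timeMatch` is a hypothetical helper. The input sizes are deliberately small so the script finishes quickly. Timings vary by machine, so none are claimed here.

```javascript
// Nested quantifiers give a backtracking engine exponentially many ways to
// split a run of "a"s once the trailing "b" makes the overall match fail.
const pattern = /^(a+)+$/;

function timeMatch(n) {
  const input = 'a'.repeat(n) + 'b';
  const start = Date.now();
  const matched = pattern.test(input);
  return { n, matched, elapsedMs: Date.now() - start };
}

// Elapsed time tends to grow sharply with each increase in n.
for (const n of [10, 15, 20]) {
  console.log(timeMatch(n));
}
```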

While the engine is not vulnerable to catastrophic backtracking, it can still be attacked or misused. Bad patterns will tend to cause the engine's match state to balloon in size, consuming lots of memory.

In terms of raw performance, this library is still extremely slow -- 50x - 80x slower than native regex for normal patterns, and currently up to 2000x slower for certain patterns that do not contain any branches. Current work is on closing this perf gap, and there is reason to think it can be narrowed significantly.
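
If you want to measure the gap on your own patterns, a rough timing harness along these lines may be enough to start with. `timeIt` is a hypothetical helper, and only the native baseline is measured here. To compare, you would pass a second callback that calls this library's `test` on the same pattern and input.

```javascript
// Runs `fn` repeatedly and reports the total wall-clock time.
function timeIt(label, fn, iterations = 1000) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) fn();
  return { label, iterations, elapsedMs: Date.now() - start };
}

const input = 'x'.repeat(1000) + 'needle';

// Baseline: the native engine. Add a second timeIt call wrapping the
// library's test(pattern, input) to compare the two on your own data.
const nativeResult = timeIt('native RegExp', () => /needle/.test(input));
console.log(nativeResult);
```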

API

test(pattern, input)
exec(pattern, input)
execGlobal(pattern, input)

Note that this API is exported as three separate submodules, each with a slightly different purpose!

The modules are:

  • / (@bablr/regex-vm) is the base module, for use when input is a sync iterator.
  • /async (@bablr/regex-vm/async) is for use with async iterables of characters, such as bablr might produce.
  • /async/chunked (@bablr/regex-vm/async/chunked) is meant to optimize performance when used with streams (iterables of strings, that is) such as those returned by fs.createReadStream(path, 'utf-8').
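
To make the difference between the two async input shapes concrete, here is a sketch that turns an async iterable of strings (the shape /async/chunked accepts) into an async iterable of characters (the shape /async accepts). `toCharacters`, `fakeStream`, and `collect` are hypothetical helpers and are not part of the library.

```javascript
// Flattens an async iterable of string chunks into single characters.
async function* toCharacters(chunks) {
  for await (const chunk of chunks) {
    yield* chunk; // a string is itself an iterable of characters
  }
}

// Stand-in for a real stream: yields two string chunks.
async function* fakeStream() {
  yield 'ab';
  yield 'c';
}

// Drains an async iterable into an array.
async function collect(iterable) {
  const out = [];
  for await (const item of iterable) out.push(item);
  return out;
}

collect(toCharacters(fakeStream())).then((chars) => console.log(chars));
```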

test

import { test } from '@bablr/regex-vm';
import { test as testAsync } from '@bablr/regex-vm/async';
import { test as testChunked } from '@bablr/regex-vm/async/chunked';

// Pick the variant that matches the type of input you have.
const didMatch = test(pattern, input);
const didMatchAsync = await testAsync(pattern, input);
const didMatchChunked = await testChunked(pattern, input);

didMatch will be true if pattern matches at some location in input.

exec

import { exec } from '@bablr/regex-vm';
import { exec as execAsync } from '@bablr/regex-vm/async';
import { exec as execChunked } from '@bablr/regex-vm/async/chunked';

// Pick the variant that matches the type of input you have.
const captures = exec(pattern, input);
const capturesAsync = await execAsync(pattern, input);
const capturesChunked = await execChunked(pattern, input);
// Captures are 1-indexed by the lexical order of `(` ($2 is b)
const [match, $1, $2, $3] = exec(/(a(b))(c)/, input);

captures will be the array of [match, ...captures] from the first location where pattern matches in input. This method differs from the spec in that it returns [] (NOT null) when pattern is not found in input. This is so that it may be used more nicely with destructuring. If you need to check if the match was present, you can still do it nicely with destructuring syntax:

const [match = null, $1] = exec(/.*(a)/, input);

if (match !== null) console.log(`match: '${match}'`);
if ($1 !== undefined) console.log(`$1: '${$1}'`);
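
The return convention described above can be tried out without the library. In this sketch `execOrEmpty` is a hypothetical stand-in built on the native RegExp that returns [] instead of null. It only mimics the documented behavior and is not the library's exec.

```javascript
// Hypothetical stand-in: native exec, but [] instead of null on no match.
function execOrEmpty(pattern, input) {
  return pattern.exec(input) ?? [];
}

// Captures are numbered by the position of their opening parenthesis.
const [match, $1, $2, $3] = execOrEmpty(/(a(b))(c)/, 'abc');
console.log(match, $1, $2, $3); // abc ab b c

// On a miss, destructuring still works and the default kicks in.
const [miss = null, missCapture] = execOrEmpty(/(z)/, 'abc');
console.log(miss, missCapture); // null undefined

// The native result is null, which cannot be destructured as an array.
const nativeResult = /(z)/.exec('abc');
console.log(nativeResult); // null
```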

execGlobal

import { execGlobal } from '@bablr/regex-vm';
import { execGlobal as execGlobalAsync } from '@bablr/regex-vm/async';
import { execGlobal as execGlobalChunked } from '@bablr/regex-vm/async/chunked';

const [...matches] = execGlobal(pattern, input);
for await (const match of execGlobalAsync(pattern, input)) { }
for await (const match of execGlobalChunked(pattern, input)) { }

matches is an iterable of match arrays (Iterable[[match, ...captures], ...matches]). If pattern is not found in input the iterable of matches will be empty. execGlobal interacts with the global (/g) flag. If the /g flag is not present the matches iterable will never contain more than one match.
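
The interaction with the /g flag can be sketched with the native RegExp. `execGlobalLike` below is a hypothetical stand-in that mimics the behavior described above (at most one match without /g, an empty iterable on a miss). It is not the library's implementation.

```javascript
// Hypothetical stand-in: yields every match with /g, at most one without it.
function* execGlobalLike(pattern, input) {
  if (pattern.global) {
    yield* input.matchAll(pattern);
  } else {
    const match = pattern.exec(input);
    if (match !== null) yield match;
  }
}

// Copy each match into a plain [match, capture] pair for easy printing.
const toPairs = (iterable) => [...iterable].map(([m, c]) => [m, c]);

const withG = toPairs(execGlobalLike(/a(.)/g, 'abac'));
const withoutG = toPairs(execGlobalLike(/a(.)/, 'abac'));
const none = [...execGlobalLike(/z/g, 'abac')];

console.log(withG); // [ [ 'ab', 'b' ], [ 'ac', 'c' ] ]
console.log(withoutG); // [ [ 'ab', 'b' ] ]
console.log(none); // []
```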

Patterns and flags

Some syntaxes are unsupported. Unsupported syntaxes are still parsed as long as the well-supported regexpp parser accepts them, so you will not be allowed to write expressions that would be incompatible with other engines.

  • Patterns use "unicode mode" escaping rules. Only valid escapes are syntactically legal.

  • Patterns do not support lookbehind ((?<=abc) and (?<!abc)).

  • Patterns do not (and will not) support backreferences ((.)\1).

  • Patterns do not support lookahead (yet) ((?=abc) and (?!abc)). See #11.

  • Patterns do not support named capture groups ((?<name>)) (yet).

  • The unicode flag (/u) is not supported yet. Supporting it is a top priority. See #33.

  • The sticky flag (/y) is partially supported. It restricts matching to only attempt to match pattern at the start of input or at the end of a global match (when /g is also present). This is not the same as putting a ^ in the pattern, which may be affected by the multiline flag (/m).
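
The difference between /y and ^ is easy to see with the native RegExp, as in this sketch. It shows native semantics only. Since sticky support in this engine is described as partial, treat it as a reference point rather than a guarantee about the library.

```javascript
const input = 'a\nb';

// With /m, ^ also matches after a line break, so "b" on line two is found.
const caretMultiline = /^b/m.test(input);

// With /y, matching is only attempted at the very start of the input.
const sticky = /b/y.test(input);

// With /g and /y together, each match must begin where the previous one ended.
const contiguous = 'aab'.match(/a/gy);
const interrupted = 'aba'.match(/a/gy);

console.log(caretMultiline, sticky); // true false
console.log(contiguous, interrupted); // [ 'a', 'a' ] [ 'a' ]
```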

Credits

Thanks Jason Priestley! Without your blog post I would not have known where to start. Also thanks to my friends and family, who have heard me talk about this more than any of them could possibly want to.