re2-wasm
v1.0.2
Published
Google's RE2 library distributed as a WASM module
Downloads
129,465
Readme
re2-wasm
This is not an officially supported Google product.
This README is modified from the node-re2 README, licensed under The "New" BSD License
This project provides bindings for RE2: fast, safe alternative to backtracking regular expression engines written by Russ Cox. To learn more about RE2, start with an overview Regular Expression Matching in the Wild. More resources can be found at his Implementing Regular Expressions page.
RE2
's regular expression language is almost a superset of what is provided by RegExp
(see Syntax),
but it lacks two features: backreferences and lookahead assertions. See below for more details.
RE2
object emulates standard RegExp
making it a practical drop-in replacement in most cases.
RE2
is extended to provide String
-based regular expression methods as well. To help to convert
RegExp
objects to RE2
its constructor can take RegExp
directly honoring all properties.
Why use re2-wasm?
The built-in Node.js regular expression engine can run in exponential time with a special combination:
- A vulnerable regular expression
- "Evil input"
This can lead to what is known as a Regular Expression Denial of Service (ReDoS). To tell if your regular expressions are vulnerable, you might try the one of these projects:
However, neither project is perfect.
re2-wasm can protect your Node.js application from ReDoS.
re2-wasm makes vulnerable regular expression patterns safe by evaluating them in RE2
instead of the built-in Node.js regex engine.
Standard features
RE2
object can be created just like RegExp
:
Supported properties:
re2.lastIndex
re2.global
re2.ignoreCase
re2.multiline
re2.unicode
RE2
engine always works in the Unicode mode. See details below.
re2.sticky
re2.source
re2.flags
Supported methods:
The following well-known symbol-based methods are supported (see Symbols):
re2[Symbol.match](str)
re2[Symbol.search](str)
re2[Symbol.replace](str, newSubStr|function)
re2[Symbol.split](str[, limit])
It allows to use RE2
instances on strings directly, just like RegExp
instances:
var re = new RE2("1", 'u');
"213".match(re); // [ '1', index: 1, input: '213' ]
"213".search(re); // 1
"213".replace(re, "+"); // 2+3
"213".split(re); // [ '2', '3' ]
Named groups are supported.
Extensions
Shortcut construction
RE2
object can be created from a regular expression:
var re1 = new RE2(/ab*/igu); // from a RegExp object
var re2 = new RE2(re1); // from another RE2 object
String
methods
Standard String
defines four more methods that can use regular expressions. RE2
provides them as methods
exchanging positions of a string, and a regular expression:
re2.match(str)
re2.replace(str, newSubStr|function)
re2.search(str)
re2.split(str[, limit])
Property: internalSource
Starting 1.8.0 property source
emulates the same property of RegExp
, meaning that it can be used to create an identical RE2
or RegExp
instance. Sometimes, for troubleshooting purposes, a user wants to inspect a RE2
translated source. It is available as a read-only property called internalSource
.
Unicode Mode
The RE2
engine only works in Unicode mode, so the RE2
class must always be constructed with the u
flag to enable unicode mode.
How to install
Installation:
npm install --save re2-wasm
How to use
It is used just like a RegExp
object.
var { RE2 } = require("re2-wasm");
// with default flags
var re = new RE2("a(b*)", 'u');
var result = re.exec("abbc");
console.log(result[0]); // "abb"
console.log(result[1]); // "bb"
result = re.exec("aBbC");
console.log(result[0]); // "a"
console.log(result[1]); // ""
// with explicit flags
re = new RE2("a(b*)", "iu");
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"
// from regular expression object
var regexp = new RegExp("a(b*)", "iu");
re = new RE2(regexp);
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"
// from regular expression literal
re = new RE2(/a(b*)/iu);
result = re.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"
// from another RE2 object
var rex = new RE2(re);
result = rex.exec("aBbC");
console.log(result[0]); // "aBb"
console.log(result[1]); // "Bb"
// shortcut
result = new RE2("ab*", 'u').exec("abba");
Limitations (things RE2 does not support)
RE2
consciously avoids any regular expression features that require worst-case exponential time to evaluate.
These features are essentially those that describe a Context-Free Language (CFL) rather than a Regular Expression,
and are extensions to the traditional regular expression language because some people don't know when enough is enough.
The most noteworthy missing features are backreferences and lookahead assertions.
If your application uses these features, you should continue to use RegExp
.
But since these features are fundamentally vulnerable to
ReDoS,
you should strongly consider replacing them.
RE2
will throw a SyntaxError
if you try to declare a regular expression using these features.
If you are evaluating an externally-provided regular expression, wrap your RE2
declarations in a try-catch block. It allows to use RegExp
, when RE2
misses a feature:
var re = /(a)+(b)*/u;
try {
re = new RE2(re);
// use RE2 as a drop-in replacement
} catch (e) {
// suppress an error, and use
// the original RegExp
}
var result = re.exec(sample);
In addition to these missing features, RE2
also behaves somewhat differently from the built-in regular expression engine in corner cases.
Backreferences
RE2
doesn't support backreferences, which are numbered references to previously
matched groups, like so: \1
, \2
, and so on. Example of backrefrences:
/(cat|dog)\1/.test("catcat"); // true
/(cat|dog)\1/.test("dogdog"); // true
/(cat|dog)\1/.test("catdog"); // false
/(cat|dog)\1/.test("dogcat"); // false
Lookahead assertions
RE2
doesn't support lookahead assertions, which are ways to allow a matching dependent on subsequent contents.
/abc(?=def)/; // match abc only if it is followed by def
/abc(?!def)/; // match abc only if it is not followed by def
Mismatched behavior
RE2
and the built-in regex engines disagree a bit. Before you switch to RE2
, verify that your regular expressions continue to work as expected. They should do so in the vast majority of cases.
Here is an example of a case where they may not:
var { RE2 } = require("re2-wasm");
var pattern = '(?:(a)|(b)|(c))+';
var built_in = new RegExp(pattern);
var re2 = new RE2(pattern, 'u');
var input = 'abc';
var bi_res = built_in.exec(input);
var re2_res = re2.exec(input);
console.log('bi_res: ' + bi_res); // prints: bi_res: abc,,,c
console.log('re2_res : ' + re2_res); // prints: re2_res : abc,a,b,c
Unicode
RE2
only works in the Unicode mode. The u
flag must be passed to the RE2
constructor.
License
Apache 2.0