compose-regexp
v0.7.1
Published
A set of functions to build and compose complex regular expressions
Downloads
4,823
Maintainers
Readme
- Build and compose maintainable regular exprssions in JavaScript.
- ReDOS, begone!
Highlights
Fixing ReDOS
import {atomic, sequence} from 'compose-regexp'
// classic ReDOS-vulnerable RegExp:
const ReDOS = /^(([a-z])+.)+[A-Z]([a-z])+$/
// fixed with compose-regexp, this does not backtrack
const fixed = sequence(/^/, atomic(/(([a-z])+.)+/), /[A-Z]([a-z])+$/)
You can test it on flems.io.
Combining character sets
import {bound, charSet, flags, suffix} from 'compose-regexp'
const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
LcGrekLetter.test("Γ") // false
LcGrekLetter.test("γ") // true
LcGrekLetter.test("x") // false
// like /\b/, but for Greek rather than /\w/
const b = bound(/\p{Script=Greek}/u)
const LcGrekWords = flags.add('g', b, suffix("+", LcGrekLetter), b)
for (lc of `Θεωρείται ως ο σημαντικότερος θεμελιωτής ...`.matchAll(LcGrekWords)) {
console.log(lc) //'ως', 'ο', 'σημαντικότερος', 'θεμελιωτής'
}
You can test it live on flems.io.
TOC
Why you may need this lib
Regular expressions don't do justice to regular grammars.
Complex RegExps hard to read, debug and modify. For code reuse, we often resort to source string concatenation which is error prone and even less readable than RegExp literals.
Also, the matching engines are, by spec, required to backtrack on match failures. This can enable surprising behavior, and even the ReDOS attack where maliciously crafted input triggers exponential backtracking, a heavy computation that can freeze programs for hours if not for years.
compose-regexp
to the rescue!
compose-regexp
will let you take advantage of the finely honed RegExp engines shipped in Browsers while making RegExps readable, testable, and ReDOS-proof witout deep knowledge of the engines.
It doesn't make regular grammars more powerful, they are still fundamentally limited, but since they are ubiquitous, we may as well have better tooling to put them to use...
compose-regexp
is reasonably small, and doesn't have dependencies. You can use it as a plain dependency, or, for client-side apps, in a server-side script that generates the RegExps that you ship to the browsers.
Usage
Installation
$ npm install --save compose-regexp
API
In a nutshell
// Core combinators
either(...exprs) // /a|b/
sequence(...exprs) // /ab/
suffix(quantifier, ...exprs) // /a+/, /(?:a|b){1,3}/
maybe(...exprs) // shortcut for `suffix('?', ...)`
flags.add(flags, ...exprs) // add flags
// captures and references
capture(...exprs) // /(...)/
namedCapture(label, ...exprs) // /(?<label>...)/
ref(nuberOrLabel) // /\1/ or /\k<label>/
ref(num, depth) // /\1/
// predicates
lookAhead(...exprs) // positive look ahead: /(?=...)/
notAhead(...exprs) // negative look ahead: /(?!...)/
lookBehind(...exprs) // positive look behind: /(?<=...)/
notBehind(...exprs) // negative look behind: /(?<!...)/
// other functions
atomic(...exprs) // helper to prevent backtracking
bound(...exprs) // like /\b/, but for arbitrary character classes rather than just \w
noBound(...exprs) // like /\B/ for arbitrary character classes
// Operating on character classes.
// The result of these function will match ...
charSet.difference(a, b) // ... characters that match charSet `a` and don't match charSet `b`
charSet.intersection(a, b) // ... characters that match both charSet `a` and charSet `b`
charSet.complement(a) // ... characters that don't match `a`
charSet.union(...cs) // ... character that match any of the provided charsets
// The lack of non-capturing group API is deliberate. We insert them
// automatically where needed, they are not a user concern when using
// this lib.
Here's a flems.io sandbox with the full API pre-loaded, and the string example above.
General notes
The
...exprs
parameters of these functions can be either RegExps, strings, or arrays ofexpr
s. Arrays of exprs are treated as nested sequences.Special characters in strings are escaped, so that
'.*'
is equivalent to/\.\*/
. Therefore:> sequence('.', '*').source === '\\.\\*'
whereas:
> sequence(/./', /a/).source === '.a'
compose-regexp
understand RegExp syntax, and will add non-capturing groups automatically where relevant. e.g.suffix('*', '.', /\w+/)
will turn into/(?:\.\w+)*/
When composing RegExps with mixed flags:
- The
u
flag is contagious, and non-u
. RegExps will be upgraded if possible. - The other flags of regexps passed as parameters are ignored, and always reset to false on the result unless set by
flags()
. This is obviously suboptimal, and will be improved in time.
- The
Back references (
\1
, etc...) are automatically upgraded suc thatsequence(/(\w)\1/, /(\d)\1/)
becomes/(\w)\1(\d)\2/
. Theref()
function lets one create refs programmatically:> sequence(capture(), ref(1)) /()\1/ // whereas this wouldn't work > sequence(capture, /\1/) /()\2/ //
Basic combinators
either(...exprs) : RegExp
> either(/a/, /b/, /c/)
/a|b|c/
// arrays help cut on boilerplate sequence() calls
> either(
[/a/, /b/]
[/c/, /d/]
)
/ab|cd/
sequence(...exprs) : RegExp
> sequence(/a/, /b/, /c/)
/abc/
compose-regexp
inserts non-capturing groups where needed:
> sequence(/a/, /b|c/)
/a(?:b|c)/
suffix(quantifier, ...exprs) : RegExp
suffix(quantifier)(...exprs) : RegExp
Valid quantifiers:
| greedy | non-greedy |
|----------|------------|
| ?
| ??
|
| *
| *?
|
| +
| +?
|
| {n}
| {n}?
|
| {n,}
| {n,}?
|
| {m,n}
| {m,n}?
|
non-string quantifiers are converted to String and wrapped in braces such that
suffix(3)
is equivalent tosuffix('{3}')
suffix([1,3])
is equivalent tosuffix('{1,3}')
suffix([2,,])
is equivalent tosuffix('{2,}')
> suffix("*", either(/a/, /b/, /c/))
/(?:a|b|c)*/
// it is a curried function:
> zeroOrMore = suffix('*')
> zeroOrMore('a')
/a*/
maybe(...exprs) : RegExp
shorcut for the ?
quantifier
> maybe('a')
/a?/
flags(opts, ...exprs) : RegExp
flags(opts)(...exprs) : RegExp
> flags('gm', /a/)
/a/gm
> global = flags(g); global('a')
/a/g
Captures and References
capture(...exprs) : RegExp
Create an anonymous capturing group.
> capture(/a/, /b/, /c/)
/(abc)/
namedCapture(...exprs) : RegExp
Create an named capturing group.
> capture(/a/, /b/, /c/)
/(abc)/
ref(label: string) : RegExp
ref(n: number) : RegExp (thunk)
ref(n: number, depth: number) : RegExp (thunk)
When passed a number, returns a specially crafted RegExp that can't match anything as is but will be turned into a back reference when composed. See the back references section below for a detailed description
> ref("label")
/\k<label>/
> ref(1)
/$d:0,n:1^/
> sequence(capture(/./), ref(1))
/(.)\1/
// a ref to a capture two levels up the call stack
> sequence(capture(/\w/), either(capture(ref(1, 2)), /./u))
/(\w)(?:(\1)|.)/u
Assertions
lookAhead(...exprs) : RegExp
Positive look ahead.
> lookAhead(/a/, /b/, /c/)
/(?=abc)/
notAhead(...exprs) : RegExp
Negative look ahead.
> notAhead(/a/, /b/, /c/)
/(?!abc)/
lookBehind(...exprs) : RegExp
Requires an engine that supports look behind assertions
Positive look behind.
> lookBehind(/a/, /b/, /c/)
/(?<=abc)/
notBehind(...exprs) : RegExp
Requires an engine that supports look behind assertions
Negative look behind.
> notBehind(/a/, /b/, /c/)
/(?<!abc)/
Derived Utilites
atomic(...exprs) : RegExp
Returns a RegExp that will match sequence(...exprs)
, but into which the engine won't backtrack once it has succeeded.
> atomic(/\w+?/)
/(?=(\w+?))\1/
> lookBehind(atomic(/\w+?/))
// Error, but...
> lookBehind(()=>atomic(/\w+?/))
/(?<=\1(?<=(\w+?)))/
atomic()
adds an unnamed capturing group. There's no way around it as of until JS adds support for atomic groups. You'd be better off using named capturing groups if you want to extract sub-matches, they are easier the handle than match indices which go all over the place anyway when you compose RegExps. See the atomic matching section for more detals.
RegExps with captures have custom behavior when using someString.split(regexp)
, which probably isn't what you want (the captures are inserted into the resulting arrays, with undefined
for the capturing groups that didn't match). Avoid atomic()
for that purpose.
bound(...exprs) : RegExp
Requires an engine that supports look behind assertions
Returns a RegExp that works like /\b/
does, but for an arbitrary char set.
const numBound = flags.add('y', bound(/[0-9]/))
numBound.test("q88p") // false (before q)
numBound.lastIndex = 1
numBound.test("q88p") // true (between 'q' and '8')
numBound.lastIndex = 2
numBound.test("q88p") // false (between both '8')
numBound.lastIndex = 3
numBound.test("q88p") // true (between '8' and 'p')
numBound.lastIndex = 4
numBound.test("q88p") // false (after the 'p')
noBound(...exprs) : RegExp
Requires an engine that supports look behind assertions
noBound(x)
returns a RegExp that succeeds where bounds(x)
fails, and vice-versa.
charSet.difference(a, b) : RegExp
charSet.intersection(a, b) : RegExp
charSet.complement(cs) : RegExp
charSet.union(...cs) : RegExp
Set operations on charSets... well, operations on arbitrary RegExps, actually. They can be fed anything but are probably most useful when used with CharSets, CharClasses, and Unicode properties.
charSet.difference(a, b)
: returns a RegExp that matches the characters matched bya
and don't match those ofb
const ab = charSet.difference(/[a-d]/, /[cd]/)
ab.test(a) // true
ab.test(b) // true
ab.test(c) // false
ab.test(d) // false
charSet.intersection(a, b)
: returns a RegExp that matches characters matched by botha
andb
.
const bc = charSet.intersection(/[a-c]/, /[b-d]/)
bc.test(a) // false
bc.test(b) // true
bc.test(c) // true
bc.test(d) // false
This is especially useful when combined with Unicode properties:
const LcCyrl = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Cyrillic}/u)
LcCyrl.test("б") // true
LcCyrl.test("Б") // false
LcCyrl.test("b") // false
const UcGrek = = charSet.intersection(/\p{Uppuercase}/u, /\p{Script=Greek}/u)
UcGrek.test("Γ") // true
UcGrek.test("γ") // false
UcGrek.test("W") // false
// another example
const asciiNonLetter = charSet.difference(/[\0\x7f]/, /[A-Za-z]/)
asciiNonLetter.test(":") // true
asciiNonLetter.test("a") // false
The full list of supported Unicode properties is listed in the ECMAScript spec.
charSet.complement(cs)
: Returns a RegExp that matches when the argument doesn't
const notDF = charSet.complement(/[D-F]/)
notDF.test("C") // true
notDF.test("DEF") // false
notDF.test("G") // true
charSet.union(...cs)
: returns a RegExp that matches any of the arguments
const abcd = charSet.union(/[ab]/, /c/, /d/)
abcd.test(a) // true
abcd.test(b) // true
abcd.test(c) // true
abcd.test(d) // true
abcd.test(e) // false
Atomic matching
Atomic groups prevent the RegExp engine from backtracking into them, aleviating the infamous ReDOS attack. JavaScript doesn't support them out of the box as of 2022, but they can be emulated using the /(?=(your stuff here))\1/
pattern. We provide a convenient atomic()
helper that wraps regexps in such a way, making them atomic at the boundary. Putting an atomic()
call around an exprssion is not enough to prevent backtracking inside of it, you'll have to put them around every exprssion that could backtrack problematically. (//TODO: give examples. Feel free to open a PR or an issue do discuss this).
Also, the atomic()
helper creates a capturing group, offsetting the indices of nested and further captures. It is better to rely on named captures for extracting values from a match, as numbered captures go all over the place when composing.
In look behind assertions (lookBehind(...)
and notBehind(...)
a.k.a. /(?<=...)/
and /(?<!...)/
) matching happens backwards. For atomic matching in lookBehind assertions, wrap the arguments inside a function, in that context, atomic('x')
produces /\1(?<=(x))/
.
// nope:
lookBehind(atomic(/.*?/)) // throws
lookBehind(()=>atomic(/.*?/)) // => /(?<=/\1(?<=(.*?))/)/
To better undestand the behavior of back-references in compound regexps, see the next section.
Let's talk about flags
The g
, d
and y
flags of input RegExps will be ignored by the combinators. The resulting RegExp will not have them (unless added manually with flags.add()
).
The u
flag is contagious when possible. E.G. sequence(/./u, /a/)
returns /.a/u
. However the meaning of sequence(/\p{Any}/u, /./)
is ambiguous. We don't know if you want /\p{Any}./u
, or /\p{Any}(?![\10000-\10ffff])./u
, avoiding matches in the Astral planes, as /./
would. In scenarios like this one, and in cases where a non-u
RegExp whose source is not allowed in u
mode is mixed with one that has a u
flag, an error is thrown.
RegExps with the m
and the s
flags are converted to flagless regexps that match the same input. So for example /./s
becomes /[^]/
. The pattern is a bit more involved for /^/
and /$/
with the m
flag. If your RegExp engine doesn't support look behind assertions, the m
flag is preserved and is handled like the i
flag (see below).
RegExps with the i
flag can't be mixed with i
-less RegExps, and vice-versa. You need an "all-i
" or an "all-non-i
" cast for a given composition (strings are fine in both cases, they are flag-agnostic).
Back References
Regular exprssions let one reference the value of a previous group by either numbered or labeled back-references. Labeled back-references refer to named groups, whereas the index of a numbered back references can map to either a named or an unnamed group.
/(.)\1/.test("aa") // true
/(.)\1/.test("ab") // false
/(?<x>.)\1/.test("aa") // true
/(?<x>.)\k<x>/.test("aa") // true
compose-regexp
takes back references into account when composing RegExps, and will update the indices of numbered refs.
sequence(/(.)\1/, /(.)\1/) // => /(.)\1(.)\2/
In look behind assertions, be they positive or negative, patterns are applied backward (i.e. right-to-left) to the input string. In that scenario, back references must appear before the capture they reference:
/(?<=\1(.))./.exec("abccd") //=> matches "d"
Patterns with back reference intended for general, forward usage will be useless in look behind context, and compose-regexp
will reject them. If you want to use back references in patterns that are interpreted backwards, you must use a function:
// this will not work:
const errorInTheMaking = sequence(capture(/./), ref(1))
const bw = lookBehind(errorInTheMaking) // throws, thankfully
// this works, the ref is evaluated in backward context thanks to the arrow function
const bw = lookBehind(()=>[ref(1), capture(/./)])
// => /(?<=\1(.))/
The atomic()
helper is direction sensitive. When called in backward context, it produces a result that will be atomic when interpreted backwards.
atomic("a") // => /(?=(a))\1/
// this will not work:
lookBehind(atomic(a)) // throws
// this will:
lookBehind(()=>atomic(a)) // => /(?<=\1(?<=(a)))
For scenarios where you'd like to use a back reference in a nested context, you can use the second optional depth
parameter to ref(n, depth)
sequence(capture(/./), either(capture("a", ref(1, 2), "b"), /./u)
// => /(.)(?:(a\1b)|.)/u
// without depth, not what you'd want:
sequence(capture(/./), either(capture("a", ref(1), "b"), /./)
// => /(.)(?:(a\2b)|.)/
The depth is 2
, for the levels in the call stack(one for capture()
, one for either()
).
Browser support
This library is meant to work in ES5 environments, provided you use minimal polyfills (Object.assign()
, essentially), but it hasn't been tested in that environment. If you find bugs please send them my way. TODO: bundle the test suite for old IE testing, stripping out tests that can't work there.
Some of the library's features rely on newer RegExp features. The u
flag can't be pulled out of thin air, for example, and neither can look behind assertions.
Limitations and missing pieces
compose-regexp
will not be able to mixi
-flagged and non-i
-flagged without native support for the scenario. Case-insensitive matching involves case folding both the pattern and the source string, andcompose-regexp
can't access the latter.The logic to detect empty patterns only works for rudimentary ones (enough to deal with what we produce AFAICT). TODO: improve this
License MIT
The MIT License (MIT)
Copyright (c) 2016 Pierre-Yves Gérardy
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.