regexp-composer

v0.2.2

Published

17 days ago

Easy-to-use regular expression builder, using a composable, function-oriented style. Supports all regular expression patterns accepted by the JavaScript RegExp engine.

rotemdan

Regular expression composer

An easy-to-use TypeScript / JavaScript regular expression builder library designed to simplify the writing of regular expressions, in a composable, function-oriented style that's significantly more readable and less error-prone than standard regular expression syntax.

Produces standard JavaScript regular expressions
Supports all regular expression patterns accepted by the JavaScript engine
Supports all JavaScript runtimes (browsers, Node.js, Deno, Bun)
Designed to be Unicode-aware from the ground up. Unicode mode is always enabled and required
Patterns are created using functions and can be composed and embedded in multiple regular expressions
Automatically escapes special characters
Automatically wraps complex patterns with non-capturing groups ((?:pattern))
Accepts codepoints as integers, in addition to hexadecimal strings (converts as needed)
Unifies disjunctions (like hello|world) and character class patterns (like [Va-zX]) to a single anyOf pattern, where they can be freely mixed
Special tokens are expressed as safer constants like inputStart (^), inputEnd ($), anyChar (.) and lineFeed (\n)
Ensures character and codepoint ranges are valid. Will error on charRange('z', 'a') or codepointRange('a4', 'a1')
Fast and lightweight
Full TypeScript type checking
No dependencies

Basic usage

Install package:

npm install regexp-composer

Build and use a simple regular expression

import { buildRegExp, possibly, inputStart } from 'regexp-composer'

// Build regExp object
const regExp = buildRegExp([inputStart, 'Hello world.', possibly(' How are you?')])

// Use it
regExp.test('Hello world.') // returns true
regExp.test('Hello world!') // returns false
regExp.test(' Hello world.') // returns false
regExp.test('Hello world. How are you?') // returns true

You can also encode a pattern to a RegExp source string, without compiling it to a RegExp object, using encodePattern:

import { encodePattern, possibly, inputStart } from 'regexp-composer'

// Build regexp
const regExpSource = encodePattern([inputStart, 'Hello world.', possibly(' How are you?')])

console.log(regExpSource) // Prints '^Hello world\.(?: How are you\?)?'
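
As a quick sanity check, the encoded source string can be compiled with the built-in RegExp constructor. This sketch does not call the library: it copies the source printed above and assumes the `u` and `s` flags, mirroring the flags described as always enabled under "Options for `buildRegExp`".

```javascript
// Source string as printed by the encodePattern example above
const helloSource = String.raw`^Hello world\.(?: How are you\?)?`

// Assumption: 'u' and 's' mirror the flags the builder always enables
const helloRegExp = new RegExp(helloSource, 'us')

const helloResults = [
	'Hello world.',
	'Hello world!',
	' Hello world.',
	'Hello world. How are you?',
].map((text) => helloRegExp.test(text))

console.log(helloResults) // [ true, false, false, true ]
```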

Example patterns

Match the string 'Hello world.':

'Hello world.'

(note characters like . within strings are always taken as literals and will be automatically escaped if needed)

Encodes to:

Hello world\.
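
To illustrate why escaping matters, here is a minimal sketch of literal escaping using only built-in JavaScript. The `escapeLiteral` helper is hypothetical and is not the library's implementation; it only shows the general idea of prefixing RegExp syntax characters with a backslash.

```javascript
// Hypothetical helper (not part of regexp-composer):
// prefixes every RegExp syntax character with a backslash
function escapeLiteral(text) {
	return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

const escapedSource = escapeLiteral('Hello world.')
console.log(escapedSource) // Hello world\.

// Unescaped, the dot matches any character
const unescapedRegExp = new RegExp('Hello world.', 'u')
console.log(unescapedRegExp.test('Hello worldX')) // true

// Escaped, the dot only matches a literal '.'
const escapedRegExp = new RegExp(escapedSource, 'u')
console.log(escapedRegExp.test('Hello worldX')) // false
console.log(escapedRegExp.test('Hello world.')) // true
```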

Match the string 'Hello world.', optionally followed by ' How are you?':

['Hello world.', possibly(' How are you?')]

Encodes to:

Hello world\.(?: How are you\?)?

(note (?: ) is a non-capturing group inserted to wrap the optional pattern)

Match a sequence of one or more English characters or digits:

oneOrMore(anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')))

Encodes to:

[a-zA-Z0-9]+

Match a phone number, like +23 (555) 432-1234:

// The `digit` pattern is reused several times in `phoneNumberPattern`:
const digit = charRange('0', '9')

const phoneNumberPattern = [
	possibly(['+', captureAs('countryCode', repeated([1, 3], digit)), oneOrMore(' ')]),
	possibly(['(', captureAs('areaCode', repeated(3, digit)), ')', oneOrMore(' ')]),
	captureAs('localNumber', [
		repeated(3, digit),
		possibly(anyOf('-', ' ')),
		repeated(4, digit),
	])
]

Encodes to:

(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})
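
The encoded phone number pattern can be tried out with the built-in RegExp constructor. This is a sketch that copies the source string shown above rather than calling the library, and it assumes the `u` and `s` flags.

```javascript
// Encoded source copied from the phone number example above
const phoneSource = String.raw`(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})`

const phoneRegExp = new RegExp(phoneSource, 'us')

// Full number: all three named groups are captured
const fullMatch = phoneRegExp.exec('+23 (555) 432-1234')
console.log(fullMatch.groups.countryCode) // 23
console.log(fullMatch.groups.areaCode) // 555
console.log(fullMatch.groups.localNumber) // 432-1234

// Local number only: the optional groups are left undefined
const localMatch = phoneRegExp.exec('4321234')
console.log(localMatch.groups.countryCode) // undefined
console.log(localMatch.groups.localNumber) // 4321234
```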

Pattern reference

String and character literals

String and character literals are represented as simple strings, like:

'Hello'
'Cześć'
'こんにちは'
'X'
'嗨'

Sequence of patterns

A sequence of patterns is written as an array:

[pattern1, pattern2, pattern3, ...]

Optional

`possibly(pattern)`

Accepts the given pattern if it matches, or skips it if not.

Encodes to pattern? or (?:pattern)?.
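
The difference between the two encodings matters for multi-character patterns. The following sketch uses only the built-in RegExp constructor (the `^` and `$` anchors are added here for illustration) to show what happens with and without the non-capturing group.

```javascript
// Without a group, `?` applies only to the last character ('b')
const ungroupedOptional = new RegExp('^ab?$', 'u')

// With a non-capturing group, `?` applies to the whole sequence ('ab')
const groupedOptional = new RegExp('^(?:ab)?$', 'u')

console.log(ungroupedOptional.test('a')) // true
console.log(ungroupedOptional.test('')) // false

console.log(groupedOptional.test('a')) // false
console.log(groupedOptional.test('')) // true
console.log(groupedOptional.test('ab')) // true
```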

Choice

`anyOf(patterns)`

Accepts the first pattern that is matched in the pattern list, or fails if no pattern matches.

Patterns can be single-character (like 'x' or charRange('a', 'z')) or multi-character (like oneOrMore('Hello')).

Encodes to (?:pattern1|pattern2|pattern3|...).

For efficiency, consecutive single-character patterns are grouped when encoded. For example:

anyOf('V', 'B', 'hello', oneOrMore('bye'), 'good', charRange('a', 'z'), lineFeed, 'world')

Encodes to:

(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)
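
To see how the mixed choice behaves, the sketch below compiles the encoded form with the built-in RegExp constructor. It assumes the character class preserves the case of the 'V' and 'B' arguments, and it adds `^` and `$` anchors so that only whole-string matches count.

```javascript
// Encoded form of the anyOf example, assuming 'V' and 'B' keep their case
const choiceSource = String.raw`(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)`

// Anchors added for illustration, so the whole input must match
const choiceRegExp = new RegExp('^' + choiceSource + '$', 'us')

console.log(choiceRegExp.test('V')) // true (single character from the class)
console.log(choiceRegExp.test('hello')) // true (multi-character alternative)
console.log(choiceRegExp.test('byebyebye')) // true (repeated 'bye')
console.log(choiceRegExp.test('q')) // true (within a-z)
console.log(choiceRegExp.test('\n')) // true (line feed)
console.log(choiceRegExp.test('Hello')) // false (case differs)
console.log(choiceRegExp.test('goodbye')) // false (no single alternative covers it)
```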

`notAnyOfChars(singleCharPatterns)`

Accepts any character except characters that match the given list of single character patterns.

Encodes to [^singleCharPatterns].

For example:

notAnyOfChars('V', 'B', charRange('a', 'z'), lineFeed, codepointRange(5234, 5312), unicodeProperty('Punctuation'))

Encodes to [^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}].
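
The integer codepoints in this example correspond to the hexadecimal values in the encoded form. The sketch below reproduces the conversion with built-in JavaScript and tests the resulting character class; it assembles the source string by hand and does not call the library.

```javascript
// Integer codepoints from the example, converted to hexadecimal
const rangeStartHex = (5234).toString(16) // '1472'
const rangeEndHex = (5312).toString(16) // '14c0'

// Hand-assembled source, matching the encoded form shown above
const negatedSource =
	'[^VBa-z\\n\\u{' + rangeStartHex + '}-\\u{' + rangeEndHex + '}\\p{Punctuation}]'

const negatedRegExp = new RegExp(negatedSource, 'u')

console.log(negatedRegExp.test('A')) // true (not excluded)
console.log(negatedRegExp.test('7')) // true (not excluded)
console.log(negatedRegExp.test('V')) // false (excluded literal)
console.log(negatedRegExp.test('m')) // false (within a-z)
console.log(negatedRegExp.test('!')) // false (punctuation)
console.log(negatedRegExp.test(String.fromCodePoint(5300))) // false (within the codepoint range)
```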

Negating a choice of multi-character patterns

notAnyOfChars only works on single character patterns. Negating a set of multi-character patterns, like NOT('cat', 'dog', 'elephant'), requires knowing the length, or additional criteria, for a successful positive match (otherwise, how would the RegExp engine know what to match?).

To achieve this, you can use a form of conditional matching, like matches(pattern, { except: excludedPattern }), described in a later section:

matches(oneOrMore(unicodeProperty('Letter')), { except: anyOf('cat', 'dog', 'elephant') })

This provides enough information for the RegExp engine to know which patterns to accept, and which to exclude.
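
A rough sketch of how such an exclusion behaves, written directly as a standard regular expression. It assumes the `except` condition becomes a negative lookahead placed before the pattern, as described in the conditional matching section; the exact source the library emits may differ, and the `^` and `$` anchors are added here for illustration.

```javascript
// Assumed encoding: negative lookahead, followed by the positive pattern
const exceptSource = String.raw`(?!(?:cat|dog|elephant))\p{Letter}+`

// Anchors added for illustration, so the whole input must match
const exceptRegExp = new RegExp('^' + exceptSource + '$', 'us')

console.log(exceptRegExp.test('bird')) // true
console.log(exceptRegExp.test('bobcat')) // true ('cat' is not at the start)
console.log(exceptRegExp.test('cat')) // false
console.log(exceptRegExp.test('elephant')) // false
console.log(exceptRegExp.test('catalog')) // false (starts with 'cat')
```

In this sketch the lookahead only inspects the text at the start of the candidate match, so an input that merely begins with an excluded word, like 'catalog', is rejected as well.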

Repetition

`zeroOrMore(pattern)`

Accepts the given pattern, repeated zero or more times.

Encodes to pattern* or (?:pattern)*.

`zeroOrMoreNonGreedy(pattern)`

Accepts the given pattern, repeated zero or more times. Non-greedy.

Encodes to pattern*? or (?:pattern)*?.

`oneOrMore(pattern)`

Accepts the given pattern, repeated one or more times.

Encodes to pattern+ or (?:pattern)+.

`oneOrMoreNonGreedy(pattern)`

Accepts the given pattern, repeated one or more times. Non-greedy.

Encodes to pattern+? or (?:pattern)+?.

`repeated(count, pattern)`

Accepts the given pattern, only if repeated exactly count times.

Encodes to (?:pattern){count}.

`repeated([min, max?], pattern)`

Accepts the given pattern, repeated between min and max times.

When max is not given, it defaults to Infinity.

Encodes to (?:pattern){min,max}, or (?:pattern){min,} when max is not given or set to Infinity.

`repeatedNonGreedy([min, max?], pattern)`

Accepts the given pattern, repeated between min and max times. Non-greedy.

When max is not given, it defaults to Infinity.

Encodes to (?:pattern){min,max}?, or (?:pattern){min,}? when max is not given or set to Infinity.
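
The difference between the greedy and non-greedy forms is easiest to see on a concrete input. This sketch uses the encoded forms directly with the built-in RegExp constructor; the pattern 'ab' and the counts are arbitrary illustration values.

```javascript
// Greedy: takes as many repetitions as allowed (up to 3)
const greedyRepeat = new RegExp('(?:ab){1,3}', 'u')
console.log(greedyRepeat.exec('abababab')[0]) // ababab

// Non-greedy: takes as few repetitions as allowed (at least 1)
const nonGreedyRepeat = new RegExp('(?:ab){1,3}?', 'u')
console.log(nonGreedyRepeat.exec('abababab')[0]) // ab

// Exact count, anchored for illustration
const exactRepeat = new RegExp('^(?:ab){2}$', 'u')
console.log(exactRepeat.test('abab')) // true
console.log(exactRepeat.test('ababab')) // false

// Open-ended range: {min,} when max is omitted
const atLeastTwo = new RegExp('^(?:ab){2,}$', 'u')
console.log(atLeastTwo.test('ab')) // false
console.log(atLeastTwo.test('abababab')) // true
```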

Single character patterns

`codepoint(hexCode)`

Accepts a single character with the given Unicode codepoint, provided as a hexadecimal string.

Encodes to \u{hexCode}.

`codepoint(integerCode)`

Accepts a single character with the given Unicode codepoint, provided as an integer.

integerCode is converted to a hexadecimal string when encoded.

Encodes to \u{hexCode}.

`charRange(startChar, endChar)`

Accepts a single character within the given character range.

Encodes to [startChar-endChar].

`codepointRange(startHexCode, endHexCode)`

Accepts a single character within the given Unicode codepoint range.

startHexCode and endHexCode should be provided as hexadecimal strings.

Encodes to [\u{startHexCode}-\u{endHexCode}].

`codepointRange(startIntegerCode, endIntegerCode)`

Accepts a single character within the given Unicode codepoint range.

startIntegerCode and endIntegerCode are converted to hexadecimal strings when encoded.

Encodes to [\u{startHexCode}-\u{endHexCode}].
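
The conversion and range validation described in this section can be sketched as follows. Both `toHexCode` and `encodeCodepointRange` are hypothetical helpers written for this illustration; they are not the library's implementation, and the error type is an assumption.

```javascript
// Hypothetical helper: accepts an integer or a hexadecimal string
function toHexCode(code) {
	return typeof code === 'number' ? code.toString(16) : code
}

// Hypothetical helper: encodes a range, rejecting reversed bounds
function encodeCodepointRange(start, end) {
	const startHex = toHexCode(start)
	const endHex = toHexCode(end)

	if (parseInt(startHex, 16) > parseInt(endHex, 16)) {
		throw new RangeError('Invalid codepoint range: ' + startHex + ' > ' + endHex)
	}

	return '[\\u{' + startHex + '}-\\u{' + endHex + '}]'
}

// Greek lowercase alpha to omega, given as integers
const greekRangeSource = encodeCodepointRange(0x3b1, 0x3c9)
console.log(greekRangeSource) // [\u{3b1}-\u{3c9}]

const greekRangeRegExp = new RegExp(greekRangeSource, 'u')
console.log(greekRangeRegExp.test('λ')) // true
console.log(greekRangeRegExp.test('a')) // false

// A reversed range is rejected
let reversedRangeError = null
try {
	encodeCodepointRange('a4', 'a1')
} catch (error) {
	reversedRangeError = error
}
console.log(reversedRangeError instanceof RangeError) // true
```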

`unicodeProperty(propertyName)`

Accepts a character matching the given Unicode property name.

Encodes to \p{propertyName}.

`unicodeProperty(propertyName, value)`

Accepts a character matching the given Unicode property name and value.

Encodes to \p{propertyName=value}.

`notUnicodeProperty(property)`

Accepts any character that doesn't match the given Unicode property.

Encodes to \P{property}.

`notUnicodeProperty(property, value)`

Accepts any character that doesn't match the given Unicode property and value.

Encodes to \P{property=value}.
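
A few checks of the encoded Unicode property forms, using the built-in RegExp constructor with the `u` flag (which property escapes require). The sample strings are arbitrary illustration values.

```javascript
// \p{Letter}: letters from any script (anchored for illustration)
const lettersOnly = new RegExp(String.raw`^\p{Letter}+$`, 'u')
console.log(lettersOnly.test('Hello')) // true
console.log(lettersOnly.test('こんにちは')) // true
console.log(lettersOnly.test('abc123')) // false

// \p{Script_Extensions=Greek}: property name with a value
const greekChar = new RegExp(String.raw`\p{Script_Extensions=Greek}`, 'u')
console.log(greekChar.test('λ')) // true
console.log(greekChar.test('l')) // false

// \P{Letter}: any character that is not a letter
const nonLetterChar = new RegExp(String.raw`\P{Letter}`, 'u')
console.log(nonLetterChar.test('abc')) // false
console.log(nonLetterChar.test('42')) // true
```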

Grouping

`capture(pattern)`

Captures an unnamed group.

Encodes to (pattern).

`captureAs(name, pattern)`

Captures a named group.

Encodes to (?<name>pattern).
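
The two encoded group forms can be read back from a match result as follows. This sketch uses the built-in RegExp constructor; the date-like pattern and the group name `month` are illustration values only.

```javascript
// One unnamed group followed by one named group
const dateRegExp = new RegExp(String.raw`([0-9]{4})-(?<month>[0-9]{2})`, 'u')

const dateMatch = dateRegExp.exec('Released 2024-05')

// Unnamed groups are read by index, starting at 1
console.log(dateMatch[1]) // 2024

// Named groups are read by name (and also receive an index)
console.log(dateMatch.groups.month) // 05
console.log(dateMatch[2]) // 05
```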

Backreferences

`sameAs(groupIndex)`

Matches a pattern to a previous unnamed capturing group.

groupIndex is the index of a preceding group. It must be an integer between 1 and 9.

Encodes to (?:\groupIndex).

`sameAs(groupName)`

Matches a pattern to a previous named capturing group.

groupName is the name of a preceding named group.

Encodes to \k<groupName>.
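
Both backreference encodings can be tried with the built-in RegExp constructor. In this sketch the patterns and the group name `word` are arbitrary illustration values.

```javascript
// Indexed backreference: a lowercase letter followed by the same letter
const doubledLetter = new RegExp('([a-z])(?:\\1)', 'u')
console.log(doubledLetter.exec('balloon')[0]) // ll

// Named backreference: a word followed by a space and the same word
const repeatedWord = new RegExp('(?<word>[a-z]+) \\k<word>', 'u')
const repeatedWordMatch = repeatedWord.exec('it was the the best')
console.log(repeatedWordMatch[0]) // the the
console.log(repeatedWordMatch.groups.word) // the
```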

Potential issues with backreference indexes greater than 9

groupIndex has been limited to the range 1..9 because, when more than 9 groups precede the backreference, the encoded RegExp would otherwise be ambiguous between a multi-digit backreference and a backreference followed by one or more digit literals. For example, \10 can be interpreted either as a backreference to the 10th group, or as a backreference to the 1st group followed by the literal character 0.

In the official specification, this ambiguity is resolved by greedily interpreting the sequence \10 as a backreference if there are 10 or more preceding groups. However, this context-sensitive logic breaks the ability to efficiently parse the regular expression using a context-free grammar! For that reason, I've decided to disallow those cases. For backreference indexes greater than 9, you can use named backreferences instead.
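
The ambiguity can be observed directly in the standard RegExp engine. The sketch below builds a pattern with ten capturing groups by hand, then compares `\10` with the grouped form `(?:\1)0`; the anchors and sample strings are illustration values.

```javascript
// Ten single-letter capturing groups
const tenGroups = '(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)'

// After ten groups, \10 is read as a backreference to the 10th group ('j')
const tenthGroupRef = new RegExp('^' + tenGroups + '\\10$', 'u')
console.log(tenthGroupRef.test('abcdefghijj')) // true
console.log(tenthGroupRef.test('abcdefghija0')) // false

// Wrapping the backreference in a group makes "group 1, then literal 0" explicit
const firstGroupThenZero = new RegExp('^' + tenGroups + '(?:\\1)0$', 'u')
console.log(firstGroupThenZero.test('abcdefghija0')) // true
console.log(firstGroupThenZero.test('abcdefghijj')) // false
```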

Conditional matching

These patterns provide a simplified approach to express various lookahead and lookbehind patterns.

`matches(pattern, { ifFollowedBy: followingPattern })`

Matches a pattern, with the condition that it is followed by a second pattern.

Encodes to pattern(?=followingPattern).

(positive lookahead positioned after the pattern).

`matches(pattern, { ifNotFollowedBy: followingPattern })`

Matches a pattern, with the condition that it is not followed by a second pattern.

Encodes to pattern(?!followingPattern).

(negative lookahead positioned after the pattern).

`matches(pattern, { ifPrecededBy: precedingPattern })`

Matches a pattern, with the condition that it is preceded by a second pattern.

Encodes to (?<=precedingPattern)pattern.

(positive lookbehind positioned before the pattern).

`matches(pattern, { ifNotPrecededBy: precedingPattern })`

Matches a pattern, with the condition that it is not preceded by a second pattern.

Encodes to (?<!precedingPattern)pattern.

(negative lookbehind positioned before the pattern).
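
The four conditions above map onto the standard lookahead and lookbehind forms, which can be tried directly with the built-in RegExp constructor. The words used in this sketch are arbitrary illustration values.

```javascript
// ifFollowedBy: pattern(?=followingPattern)
const fooBeforeBar = new RegExp('foo(?=bar)', 'u')
console.log(fooBeforeBar.exec('foobar')[0]) // foo (the lookahead text is not consumed)
console.log(fooBeforeBar.test('foobaz')) // false

// ifNotFollowedBy: pattern(?!followingPattern)
const fooNotBeforeBar = new RegExp('foo(?!bar)', 'u')
console.log(fooNotBeforeBar.test('foobar')) // false
console.log(fooNotBeforeBar.test('foobaz')) // true

// ifPrecededBy: (?<=precedingPattern)pattern
const happyAfterUn = new RegExp('(?<=un)happy', 'u')
console.log(happyAfterUn.test('unhappy')) // true
console.log(happyAfterUn.test('happy')) // false

// ifNotPrecededBy: (?<!precedingPattern)pattern
const happyNotAfterUn = new RegExp('(?<!un)happy', 'u')
console.log(happyNotAfterUn.test('unhappy')) // false
console.log(happyNotAfterUn.test('happy')) // true
```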

`matches(pattern, { ifExtendsTo: extendedPattern })`

Matches a pattern, with the condition that it extends to a second pattern.

Encodes to (?=extendedPattern)pattern.

(positive lookahead positioned before the pattern).

`matches(pattern, { except: excludedPattern })`

Matches a pattern, with the condition that it doesn't extend to a second pattern (effectively excluding it).

Encodes to (?!excludedPattern)pattern.

(negative lookahead positioned before the pattern).

Example:

matches(
	oneOrMore(unicodeProperty('Letter')), {
	except: anyOf('V', 'hello', charRange('a', 'z'))
})

This matches any sequence of letters of length 1 or more, with the exception of the single uppercase letter V, the string hello, or a single lowercase letter between a and z.

`matches(pattern, { ifExtendsBackTo: backwardExtendedPattern })`

Matches a pattern, with the condition that it extends backward to a second pattern.

Encodes to pattern(?<=backwardExtendedPattern).

(positive lookbehind positioned after the pattern).

`matches(pattern, { ifNotExtendsBackTo: backwardExtendedPattern })`

Matches a pattern, with the condition that it doesn't extend backward to a second pattern.

Encodes to pattern(?<!backwardExtendedPattern).

(negative lookbehind positioned after the pattern).
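
The "extends to" and "extends back to" conditions place the lookaround on the opposite side from the usual form. A small sketch with the built-in RegExp constructor, using arbitrary sample words (the anchors in the last case are added for illustration):

```javascript
// ifExtendsTo: (?=extendedPattern)pattern
// Lowercase letters are matched only where the text starts with 'pre'
const startsWithPre = new RegExp('(?=pre)[a-z]+', 'u')
console.log(startsWithPre.exec('prefix')[0]) // prefix
console.log(startsWithPre.test('suffix')) // false

// ifExtendsBackTo: pattern(?<=backwardExtendedPattern)
// Lowercase letters are matched only if the match ends with 'ing'
const endsWithIng = new RegExp('[a-z]+(?<=ing)', 'u')
console.log(endsWithIng.exec('running')[0]) // running
console.log(endsWithIng.test('runner')) // false

// ifNotExtendsBackTo: pattern(?<!backwardExtendedPattern)
// Anchored so the whole word is considered
const notEndingWithIng = new RegExp('^[a-z]+(?<!ing)$', 'u')
console.log(notEndingWithIng.test('runner')) // true
console.log(notEndingWithIng.test('running')) // false
```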

Combining multiple conditions

Conditions can be combined. For example:

matches(
	oneOrMore(unicodeProperty('Letter')), {
	except: anyOf('Cat', 'Dog'),
	ifNotPrecededBy: charRange('0', '9'),
	ifNotFollowedBy: anyOf('?', '!')
})

This means that any sequence of Unicode letters is matched, provided all of these conditions are met:

It is not 'Cat' or 'Dog'
It is not preceded by a digit
It is not followed by a question mark or exclamation mark

Including multiple conditions of the same kind

Although not likely to be frequently used, the RegExp engine does allow defining multiple lookahead or lookbehind patterns, producing a form of intersection (conjunction) between the conditions. You can achieve that by passing an array of condition objects as the second argument to matches:

matches(
	oneOrMore(unicodeProperty('Letter')),
	[
		{ ifPrecededBy: unicodeProperty('Letter') },
		{ ifPrecededBy: unicodeProperty('Script_Extensions', 'Gothic') },
		{ ifFollowedBy: unicodeProperty('Letter') },
		{ ifFollowedBy: unicodeProperty('Script_Extensions', 'Greek') },
	]
)

Special token patterns

inputStart: ^
inputEnd: $
anyChar: .
whitespace: \s
nonWhitespace: \S
digit: \d
nonDigit: \D
wordBoundary: \b
nonWordBoundary: \B
formFeed: \f
carriageReturn: \r
lineFeed: \n
tab: \t
verticalTab: \v

Word character tokens

The word character tokens \w and \W are not directly supported because they are not consistently Unicode-aware (they only gain partial Unicode awareness when the ignoreCase flag is enabled).

To get consistent results, you can use:

anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')) for English word characters
anyOf(unicodeProperty('Letter'), unicodeProperty('Mark'), unicodeProperty('Number')) for Unicode (multilingual) word characters
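
The inconsistency of \w can be observed with the built-in RegExp constructor. This sketch compares \w, with and without the `i` flag, against a character class built from Unicode properties; the character class is written by hand here and may differ from the exact source the library emits.

```javascript
const longS = '\u017F' // 'ſ' (Latin small letter long s)
const eAcute = '\u00E9' // 'é'

// In Unicode mode, \w alone only covers [A-Za-z0-9_]
const wordChar = new RegExp('\\w', 'u')
console.log(wordChar.test(longS)) // false
console.log(wordChar.test(eAcute)) // false

// Adding the `i` flag changes what \w accepts
const wordCharIgnoreCase = new RegExp('\\w', 'ui')
console.log(wordCharIgnoreCase.test(longS)) // true

// A property-based class behaves the same regardless of `i`
const unicodeWordChar = new RegExp('[\\p{Letter}\\p{Mark}\\p{Number}]', 'u')
console.log(unicodeWordChar.test(eAcute)) // true
console.log(unicodeWordChar.test(longS)) // true
```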

Options for `buildRegExp`

Customizable flags:

global: enables the g flag when constructing the RegExp
hasIndices: enables the d flag when constructing the RegExp
ignoreCase: enables the i flag when constructing the RegExp
sticky: enables the y flag when constructing the RegExp

Non-customizable flags:

multiline: the m flag, enabling matching of inputStart (^) tokens to line start, is always disabled in the builder, to ensure clear and consistent semantics for inputStart
dotAll: the s flag, causing the anyChar (.) token to match all characters, including newlines, is always enabled in the builder, to ensure clear and consistent semantics for anyChar
unicode: the u flag, enabling Unicode support, is always enabled in the builder, as it is required by the patterns codepoint, codepointRange, unicodeProperty and notUnicodeProperty
unicodeSets: the v flag, enabling Unicode set support, like \p{Script_Extensions=Greek}&&\p{Letter}, is currently always disabled (it cannot be enabled at the same time as u), but is likely to be used in the future

If you still want to override the non-customizable flags (risking unexpected errors and confusing behavior) you can encode the pattern to a RegExp source string using encodePattern, and compile the resulting string using the RegExp constructor, with any set of flags, like new RegExp(encodePattern(...), flags).
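
The effect of the non-customizable flags can be seen by compiling the same source string with different flag sets. The sketch below uses hand-written source strings in place of `encodePattern` output, and only the built-in RegExp constructor.

```javascript
const twoLines = 'a\nb'

// dotAll (s): '.' also matches newlines
console.log(new RegExp('a.b', 'us').test(twoLines)) // true
console.log(new RegExp('a.b', 'u').test(twoLines)) // false

// multiline (m): '^' also matches at line starts
console.log(new RegExp('^b', 'us').test(twoLines)) // false
console.log(new RegExp('^b', 'usm').test(twoLines)) // true

// The u and v flags cannot be combined
let combinedFlagsError = null
try {
	new RegExp('a', 'uv')
} catch (error) {
	combinedFlagsError = error
}
console.log(combinedFlagsError instanceof SyntaxError) // true
```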

Future

Unicode sets, using the v flag, would enable things like intersections of Unicode properties, for example unicodeProperties('Letter', ['Script_Extensions', 'Greek'])
Case-sensitivity assertions could allow selectively marking patterns as case-sensitive or case-insensitive. For example, [caseInsensitive('Hello'), ' world'] would match "Hello world", "HELLO world", "hello world", "hEllO world", etc.

License

MIT
