regexp-composer

v0.2.2

Easy-to-use regular expression builder, using a composable, function-oriented style. Supports all regular expression patterns accepted by the JavaScript RegExp engine.

Regular expression composer

An easy-to-use TypeScript / JavaScript regular expression builder library designed to simplify the writing of regular expressions, in a composable, function-oriented style that's significantly more readable and less error-prone than standard regular expression syntax.

  • Produces standard JavaScript regular expressions
  • Supports all regular expression patterns accepted by the JavaScript engine
  • Supports all JavaScript runtimes (browsers, Node.js, Deno, Bun)
  • Designed to be Unicode-aware from the ground up. Unicode mode is always enabled and required
  • Patterns are created using functions and can be composed and embedded in multiple regular expressions
  • Automatically escapes special characters
  • Automatically wraps complex patterns with non-capturing groups ((?:pattern))
  • Accepts codepoints as integers, in addition to hexadecimal strings (converts as needed)
  • Unifies disjunctions (like hello|world) and character class patterns (like [Va-zX]) to a single anyOf pattern, where they can be freely mixed
  • Special tokens are expressed as safer constants like inputStart (^), inputEnd ($), anyChar (.) and lineFeed (\n)
  • Ensures character and codepoint ranges are valid. Will error on charRange('z', 'a') or codepointRange('a4', 'a1')
  • Fast and lightweight
  • Full TypeScript type checking
  • No dependencies

Basic usage

Install package:

npm install regexp-composer

Build and use a simple regular expression

import { buildRegExp, possibly, inputStart } from 'regexp-composer'

// Build regExp object
const regExp = buildRegExp([inputStart, 'Hello world.', possibly(' How are you?')])

// Use it
regExp.test('Hello world.') // returns true
regExp.test('Hello world!') // returns false
regExp.test(' Hello world.') // returns false
regExp.test('Hello world. How are you?') // returns true

You can also encode a pattern to a RegExp source string, without compiling it to a RegExp object, using encodePattern:

import { encodePattern, possibly, inputStart } from 'regexp-composer'

// Build regexp
const regExpSource = encodePattern([inputStart, 'Hello world.', possibly(' How are you?')])

console.log(regExpSource) // Prints '^Hello world\.(?: How are you\?)?'
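
As a quick sanity check, the encoded source string printed above can be compiled with the native RegExp constructor. This is a minimal sketch that hardcodes the printed source instead of calling encodePattern, and it assumes the u and s flags, which the builder is described as always enabling. The variable names are illustrative only.

```typescript
// Source string as printed by the encodePattern example above (hardcoded here).
const helloSource = String.raw`^Hello world\.(?: How are you\?)?`

// Assumption: 'u' and 's' mirror the flags the builder always enables.
const helloRegExp = new RegExp(helloSource, 'us')

const helloResults = [
	helloRegExp.test('Hello world.'),              // matches
	helloRegExp.test('Hello world!'),              // '.' is escaped, so '!' does not match
	helloRegExp.test(' Hello world.'),             // leading space fails the ^ anchor
	helloRegExp.test('Hello world. How are you?'), // optional suffix present
]

console.log(helloResults) // prints [ true, false, false, true ]
```

The results agree with the buildRegExp example earlier in this section.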

Example patterns

Match the string 'Hello world.':

'Hello world.'

(note characters like . within strings are always taken as literals and will be automatically escaped if needed)

Encodes to:

Hello world\.

Match the string 'Hello world.', optionally followed by ' How are you?':

['Hello world.', possibly(' How are you?')]

Encodes to:

Hello world\.(?: How are you\?)?

(note (?: ) is a non-capturing group inserted to wrap the optional pattern)

Match a sequence of one or more English letters or digits:

oneOrMore(anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')))

Encodes to:

[a-zA-Z0-9]+

Match a phone number, like +23 (555) 432-1234:

// The `digit` pattern is reused several times in `phoneNumberPattern`:
const digit = charRange('0', '9')

const phoneNumberPattern = [
	possibly(['+', captureAs('countryCode', repeated([1, 3], digit)), oneOrMore(' ')]),
	possibly(['(', captureAs('areaCode', repeated(3, digit)), ')', oneOrMore(' ')]),
	captureAs('localNumber', [
		repeated(3, digit),
		possibly(anyOf('-', ' ')),
		repeated(4, digit),
	])
]

Encodes to:

(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})
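
To see what the named groups capture, the encoded source above can be run against the sample number. This sketch hardcodes the encoded string rather than calling the library, and it assumes the u and s flags that the builder is described as always enabling. The variable names are illustrative.

```typescript
// Encoded source copied from the output above.
const phoneSource = String.raw`(?:\+(?<countryCode>(?:[0-9]){1,3}) +)?(?:\((?<areaCode>(?:[0-9]){3})\) +)?(?<localNumber>(?:[0-9]){3}(?:(?:[- ]))?(?:[0-9]){4})`

// Assumption: 'u' and 's' mirror the flags the builder always enables.
const phoneRegExp = new RegExp(phoneSource, 'us')

const phoneMatch = phoneRegExp.exec('+23 (555) 432-1234')

// Named groups are exposed on the `groups` property of the match.
const phoneParts = {
	countryCode: phoneMatch?.groups?.countryCode,
	areaCode: phoneMatch?.groups?.areaCode,
	localNumber: phoneMatch?.groups?.localNumber,
}

console.log(phoneParts)
// prints { countryCode: '23', areaCode: '555', localNumber: '432-1234' }

// Both prefixes are optional, so a bare local number also matches,
// and the skipped groups are left undefined.
const localOnlyMatch = phoneRegExp.exec('4321234')
console.log(localOnlyMatch?.groups?.localNumber) // prints 4321234
console.log(localOnlyMatch?.groups?.countryCode) // prints undefined
```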

Pattern reference

String and character literals

String and character literals are represented as simple strings, like:

'Hello'
'Cześć'
'こんにちは'
'X'
'嗨'

Sequence of patterns

A sequence of patterns is written as an array:

[pattern1, pattern2, pattern3, ...]

Optional

possibly(pattern)

Accepts the given pattern if it matches, or skips it if it doesn't.

Encodes to pattern? or (?:pattern)?.

Choice

anyOf(patterns)

Accepts the first pattern that is matched in the pattern list, or fails if no pattern matches.

Patterns can be either single-character (like 'x' or charRange('a', 'z')) or multi-character (like oneOrMore('Hello')).

Encodes to (?:pattern1|pattern2|pattern3|...).

For efficiency, consecutive single-character patterns are grouped when encoded. For example:

anyOf('V', 'B', 'hello', oneOrMore('bye'), 'good', charRange('a', 'z'), lineFeed, 'world')

Encodes to:

(?:[VB]|hello|(?:bye)+|good|[a-z\n]|world)

notAnyOfChars(singleCharPatterns)

Accepts any character except characters that match the given list of single character patterns.

Encodes to [^singleCharPatterns].

For example:

notAnyOfChars('V', 'B', charRange('a', 'z'), lineFeed, codepointRange(5234, 5312), unicodeProperty('Punctuation'))

Encodes to [^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}].
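
The encoded class above can be checked directly with the native RegExp constructor. This is a sketch only: the ^ and $ anchors are added by hand to test single characters, the u and s flags are assumed to mirror the builder's, and the first lines simply confirm the integer-to-hexadecimal conversion of the two codepoints.

```typescript
// Integer codepoints from the example, converted to hexadecimal.
const startHex = (5234).toString(16)
const endHex = (5312).toString(16)
console.log(startHex, endHex) // prints 1472 14c0

// Encoded class from above, wrapped in hand-written ^ and $ anchors.
const negatedClass = new RegExp(
	String.raw`^[^VBa-z\n\u{1472}-\u{14c0}\p{Punctuation}]$`,
	'us'
)

const negatedClassResults = {
	upperV: negatedClass.test('V'),          // false: excluded literal
	upperW: negatedClass.test('W'),          // true: not in the excluded set
	lowerQ: negatedClass.test('q'),          // false: excluded by a-z
	comma: negatedClass.test(','),           // false: excluded by \p{Punctuation}
	inRange: negatedClass.test('\u{1472}'),  // false: excluded by the codepoint range
	sevenDigit: negatedClass.test('7'),      // true: not in the excluded set
}

console.log(negatedClassResults)
```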

Negating a choice of multi-character patterns

notAnyOfChars only works on single character patterns. Negating a set of multi-character patterns, like NOT('cat', 'dog', 'elephant'), requires knowing the length, or additional criteria, for a successful positive match (otherwise, how would the RegExp engine know what to match?).

To achieve this, you can use a form of conditional matching, like matches(pattern, { except: excludedPattern }), described in a later section:

matches(oneOrMore(unicodeProperty('Letter')), { except: anyOf('cat', 'dog', 'elephant') })

This provides enough information for the RegExp engine to know which patterns to accept, and which to exclude.
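
The following sketch shows how such an exclusion behaves in the native engine. It assumes the call above encodes to (?!(?:cat|dog|elephant))\p{Letter}+, following the (?!excludedPattern)pattern form given in the conditional matching section; the library's actual output may differ in detail. The ^ and $ anchors in the first expression are hand-written additions.

```typescript
// Assumed encoding of the matches(...) call above.
const exceptSource = String.raw`(?!(?:cat|dog|elephant))\p{Letter}+`

// Anchored variant (hand-written ^ and $) to test whole strings.
const wholeWordExcept = new RegExp(`^${exceptSource}$`, 'us')

const exceptResults = {
	bird: wholeWordExcept.test('bird'),       // true: not excluded
	cat: wholeWordExcept.test('cat'),         // false: excluded
	// The lookahead inspects the text at the start of the match,
	// so a word that begins with an excluded string is rejected too.
	catalog: wholeWordExcept.test('catalog'), // false
}

// Unanchored, the engine may begin a match after the excluded prefix.
const unanchoredExcept = new RegExp(exceptSource, 'us')
const partialMatch = unanchoredExcept.exec('cat')?.[0]

console.log(exceptResults, partialMatch) // partialMatch is 'at'
```

When whole words need to be excluded, combining the pattern with inputStart and inputEnd, or with wordBoundary, is one way to avoid the partial match.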

Repetition

zeroOrMore(pattern)

Accepts the given pattern, repeated zero or more times.

Encodes to pattern* or (?:pattern)*.

zeroOrMoreNonGreedy(pattern)

Accepts the given pattern, repeated zero or more times. Non-greedy.

Encodes to pattern*? or (?:pattern)*?.

oneOrMore(pattern)

Accepts the given pattern, repeated one or more times.

Encodes to pattern+ or (?:pattern)+.

oneOrMoreNonGreedy(pattern)

Accepts the given pattern, repeated one or more times. Non-greedy.

Encodes to pattern+? or (?:pattern)+?.

repeated(count, pattern)

Accepts the given pattern, only if repeated exactly count times.

Encodes to (?:pattern){count}.

repeated([min, max?], pattern)

Accepts the given pattern, repeated between min and max times.

When max is not given, it defaults to Infinity.

Encodes to (?:pattern){min,max}, or (?:pattern){min,} when max is not given or set to Infinity.

repeatedNonGreedy([min, max?], pattern)

Accepts the given pattern, repeated between min and max times. Non-greedy.

When max is not given, it defaults to Infinity.

Encodes to (?:pattern){min,max}?, or (?:pattern){min,}? when max is not given or set to Infinity.
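
The difference between the greedy and non-greedy forms is easiest to see on a concrete input. This sketch uses hand-written equivalents of the encodings listed above for a hypothetical pattern 'ab', compiled with the native RegExp constructor.

```typescript
// Hand-written equivalents of repeated([2, 3], 'ab') and
// repeatedNonGreedy([2, 3], 'ab'), per the encodings listed above.
const greedyRepeat = new RegExp('(?:ab){2,3}', 'us')
const nonGreedyRepeat = new RegExp('(?:ab){2,3}?', 'us')

// Greedy takes as many repetitions as allowed, non-greedy as few as allowed.
const greedyText = greedyRepeat.exec('ababab')?.[0]
const nonGreedyText = nonGreedyRepeat.exec('ababab')?.[0]

console.log(greedyText)    // prints ababab
console.log(nonGreedyText) // prints abab

// With no max, the quantifier is open-ended: {2,}
const openEndedRepeat = new RegExp('(?:ab){2,}', 'us')
const openEndedText = openEndedRepeat.exec('abababab')?.[0]

console.log(openEndedText) // prints abababab
```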

Single character patterns

codepoint(hexCode)

Accepts a single character with the given Unicode codepoint, provided as a hexadecimal string.

Encodes to \u{hexCode}.

codepoint(integerCode)

Accepts a single character with the given Unicode codepoint, provided as an integer.

integerCode is converted to a hexadecimal string when encoded.

Encodes to \u{hexCode}.

charRange(startChar, endChar)

Accepts a single character within the given character range.

Encodes to [startChar-endChar].

codepointRange(startHexCode, endHexCode)

Accepts a single character within the given Unicode codepoint range.

startHexCode and endHexCode should be provided as hexadecimal strings.

Encodes to [\u{startHexCode}-\u{endHexCode}].

codepointRange(startIntegerCode, endIntegerCode)

Accepts a single character within the given Unicode codepoint range.

startIntegerCode and endIntegerCode are converted to hexadecimal strings when encoded.

Encodes to [\u{startHexCode}-\u{endHexCode}].

unicodeProperty(propertyName)

Accepts a character matching the given Unicode property name.

Encodes to \p{propertyName}.

unicodeProperty(propertyName, value)

Accepts a character matching the given Unicode property name and value.

Encodes to \p{propertyName=value}.

notUnicodeProperty(property)

Accepts any character that doesn't match the given Unicode property.

Encodes to \P{property}.

notUnicodeProperty(property, value)

Accepts any character that doesn't match the given Unicode property and value.

Encodes to \P{property=value}.
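
As an illustration of the property escapes, the sketch below compiles two hand-written encodings with the native RegExp constructor: one for unicodeProperty('Script_Extensions', 'Greek') and one for notUnicodeProperty('Letter'). The ^ and $ anchors are added by hand to test single characters, and the u flag is required for \p and \P to be recognized.

```typescript
// \p{property=value}: a single character whose script extensions include Greek.
const greekChar = new RegExp(String.raw`^\p{Script_Extensions=Greek}$`, 'us')

// \P{property}: a single character that is not a letter.
const nonLetterChar = new RegExp(String.raw`^\P{Letter}$`, 'us')

const propertyResults = {
	greekLambda: greekChar.test('λ'),      // true
	latinL: greekChar.test('L'),           // false
	digitSeven: nonLetterChar.test('7'),   // true
	lambdaAgain: nonLetterChar.test('λ'),  // false
}

console.log(propertyResults)
```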

Grouping

capture(pattern)

Captures an unnamed group.

Encodes to (pattern).

captureAs(name, pattern)

Captures a named group.

Encodes to (?<name>pattern).

Backreferences

sameAs(groupIndex)

Matches the same text that was captured by a preceding unnamed capturing group.

groupIndex is the index of a preceding group. It must be an integer between 1 and 9.

Encodes to (?:\groupIndex).

sameAs(groupName)

Matches the same text that was captured by a preceding named capturing group.

groupName is the name of a preceding named group.

Encodes to \k<groupName>.

Potential issues with backreference indexes greater than 9

groupIndex has been limited to the range of 1..9, because otherwise, in the case there are more than 9 groups that precede the backreference, the encoded RegExp would produce an ambiguity with a backreference followed by one or more digit literals. For example, \10 can be interpreted either as a backreference to the 10th group, or as a backreference to the 1st group, followed by the literal character 0.

In the official specification, this ambiguity is resolved by greedily interpreting the sequence \10 as a backreference if there are 10 or more preceding groups. However, this context-sensitive logic breaks the ability to efficiently parse the regular expression using a context-free grammar! For that reason I've decided to disallow those cases. For backreference indexes greater than 9, you can use named backreferences instead.
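
Both kinds of backreference can be tried out with hand-written equivalents of the encodings above. This is a sketch using the native RegExp constructor; the patterns and the group name quote are made up for illustration.

```typescript
// Named backreference: the closing quote must equal the opening one.
const quotedWord = new RegExp(String.raw`^(?<quote>["'])[a-z]*\k<quote>$`, 'us')

// Indexed backreference in the wrapped (?:\1) form, so the digit
// that follows it is unambiguously a literal 0.
const repeatedThenZero = new RegExp('^(a|b)(?:\\1)0$', 'us')

const backreferenceResults = {
	matchingQuotes: quotedWord.test("'abc'"),     // true
	mixedQuotes: quotedWord.test('"abc\''),       // false
	sameLetter: repeatedThenZero.test('aa0'),     // true
	otherLetter: repeatedThenZero.test('ab0'),    // false
}

console.log(backreferenceResults)
```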

Conditional matching

These patterns provide a simplified approach to express various lookahead and lookbehind patterns.

matches(pattern, { ifFollowedBy: followingPattern })

Matches a pattern, with the condition that it is followed by a second pattern.

Encodes to pattern(?=followingPattern).

(positive lookahead positioned after the pattern).

matches(pattern, { ifNotFollowedBy: followingPattern })

Matches a pattern, with the condition that it is not followed by a second pattern.

Encodes to pattern(?!followingPattern).

(negative lookahead positioned after the pattern).

matches(pattern, { ifPrecededBy: precedingPattern })

Matches a pattern, with the condition that it is preceded by a second pattern.

Encodes to (?<=precedingPattern)pattern.

(positive lookbehind positioned before the pattern).

matches(pattern, { ifNotPrecededBy: precedingPattern })

Matches a pattern, with the condition that it is not preceded by a second pattern.

Encodes to (?<!precedingPattern)pattern.

(negative lookbehind positioned before the pattern).
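
The four conditions described so far can be compared side by side. This sketch uses hand-written equivalents of the listed encodings, with the made-up literal patterns 'foo', 'bar', 'un' and 'happy', compiled with the native RegExp constructor.

```typescript
const followedBy = new RegExp('foo(?=bar)', 'us')      // like ifFollowedBy
const notFollowedBy = new RegExp('foo(?!bar)', 'us')   // like ifNotFollowedBy
const precededBy = new RegExp('(?<=un)happy', 'us')    // like ifPrecededBy
const notPrecededBy = new RegExp('(?<!un)happy', 'us') // like ifNotPrecededBy

const lookaroundResults = [
	followedBy.test('foobar'),       // true
	followedBy.test('foobaz'),       // false
	notFollowedBy.test('foobar'),    // false
	notFollowedBy.test('foobaz'),    // true
	precededBy.test('unhappy'),      // true
	precededBy.test('happy'),        // false
	notPrecededBy.test('unhappy'),   // false
	notPrecededBy.test('happy'),     // true
]

// The condition is checked but not consumed: only the main pattern is matched.
const matchedText = followedBy.exec('foobar')?.[0]

console.log(lookaroundResults, matchedText) // matchedText is 'foo'
```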

matches(pattern, { ifExtendsTo: extendedPattern })

Matches a pattern, with the condition that it extends to a second pattern.

Encodes to (?=extendedPattern)pattern.

(positive lookahead positioned before the pattern).

matches(pattern, { except: excludedPattern })

Matches a pattern, with the condition that it doesn't extend to a second pattern (effectively excluding it).

Encodes to (?!excludedPattern)pattern.

(negative lookahead positioned before the pattern).

Example:

matches(
	oneOrMore(unicodeProperty('Letter')), {
	except: anyOf('V', 'hello', charRange('a', 'z'))
})

This matches any sequence of letters of length 1 or more, with the exception of the single uppercase letter V, the string hello, or a single lowercase letter between a and z.

matches(pattern, { ifExtendsBackTo: backwardExtendedPattern })

Matches a pattern, with the condition that it extends backward to a second pattern.

Encodes to pattern(?<=backwardExtendedPattern).

(positive lookbehind positioned after the pattern).

matches(pattern, { ifNotExtendsBackTo: backwardExtendedPattern })

Matches a pattern, with the condition that it doesn't extend backward to a second pattern.

Encodes to pattern(?<!backwardExtendedPattern).

(negative lookbehind positioned after the pattern).

Combining multiple conditions

Conditions can be combined. For example:

matches(
	oneOrMore(unicodeProperty('Letter')), {
	except: anyOf('Cat', 'Dog'),
	ifNotPrecededBy: charRange('0', '9'),
	ifNotFollowedBy: anyOf('?', '!')
})

This means that any sequence of Unicode letters is matched, provided all of these conditions are met:

  • It is not 'Cat' or 'Dog'
  • It is not preceded by a digit
  • It is not followed by a question mark or exclamation mark

Including multiple conditions of the same kind

Although it is unlikely to be used often, the RegExp engine allows multiple lookahead or lookbehind patterns to be defined together, producing a form of intersection (conjunction) between the conditions. You can achieve that by passing an array of condition objects as the second argument to matches:

matches(
	oneOrMore(unicodeProperty('Letter')),
	[
		{ ifPrecededBy: unicodeProperty('Letter') },
		{ ifPrecededBy: unicodeProperty('Script_Extensions', 'Gothic') },
		{ ifFollowedBy: unicodeProperty('Letter') },
		{ ifFollowedBy: unicodeProperty('Script_Extensions', 'Greek') },
	]
)

Special token patterns

  • inputStart: ^
  • inputEnd: $
  • anyChar: .
  • whitespace: \s
  • nonWhitespace: \S
  • digit: \d
  • nonDigit: \D
  • wordBoundary: \b
  • nonWordBoundary: \B
  • formFeed: \f
  • carriageReturn: \r
  • lineFeed: \n
  • tab: \t
  • verticalTab: \v

Word character tokens

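The word character tokens \w and \W are not directly supported because they are not Unicode-aware in a consistent way: in Unicode mode, \w still matches only the ASCII word characters [A-Za-z0-9_], and combining it with the ignoreCase flag adds just a few non-ASCII characters through case folding.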

To get consistent results, you can use:

  • anyOf(charRange('a', 'z'), charRange('A', 'Z'), charRange('0', '9')) for English word characters
  • anyOf(unicodeProperty('Letter'), unicodeProperty('Mark'), unicodeProperty('Number')) for Unicode (multilingual) word characters
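
The difference is visible with a few sample words. This sketch compares the native \w token against a hand-written character class equivalent to the Unicode suggestion above, both compiled with the u and s flags and anchored by hand with ^ and $.

```typescript
const asciiWordOnly = new RegExp(String.raw`^\w+$`, 'us')
const unicodeWord = new RegExp(String.raw`^[\p{Letter}\p{Mark}\p{Number}]+$`, 'us')

const wordResults = {
	asciiHello: asciiWordOnly.test('Hello'),        // true
	asciiPolish: asciiWordOnly.test('Cześć'),       // false: ś and ć are not ASCII
	unicodePolish: unicodeWord.test('Cześć'),       // true
	unicodeJapanese: unicodeWord.test('こんにちは'), // true
	// Note that \w also includes the underscore, while the
	// suggested replacements do not.
	asciiUnderscore: asciiWordOnly.test('snake_case'), // true
	unicodeUnderscore: unicodeWord.test('snake_case'), // false
}

console.log(wordResults)
```

If underscores should count as word characters, '_' can be added as another alternative inside anyOf.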

Options for buildRegExp

Customizable flags:

  • global: enables the g flag when constructing the RegExp
  • hasIndices: enables the d flag when constructing the RegExp
  • ignoreCase: enables the i flag when constructing the RegExp
  • sticky: enables the y flag when constructing the RegExp

Non-customizable flags:

  • multiline: the m flag, enabling matching of inputStart (^) tokens to line start, is always disabled in the builder, to ensure clear and consistent semantics for inputStart
  • dotAll: the s flag, causing the anyChar (.) token to match all characters, including newlines, is always enabled in the builder, to ensure clear and consistent semantics for anyChar
  • unicode: the u flag, enabling Unicode support, is always enabled in the builder, as it is required by the patterns codepoint, codepointRange, unicodeProperty and notUnicodeProperty
  • unicodeSets: the v flag, enabling Unicode set support, like [\p{Script_Extensions=Greek}&&\p{Letter}], is currently always disabled (it cannot be combined with the u flag), but is likely to be used in the future

If you still want to override the non-customizable flags (risking unexpected errors and confusing behavior) you can encode the pattern to a RegExp source string using encodePattern, and compile the resulting string using the RegExp constructor, with any set of flags, like new RegExp(encodePattern(...), flags).
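
To make the effect of the fixed flags concrete, the sketch below compiles two small hand-written sources with different flag sets using the native RegExp constructor. The sources '^b' and 'a.b' stand in for what encodePattern would return for [inputStart, 'b'] and ['a', anyChar, 'b']; that correspondence is an assumption based on the token table above.

```typescript
const twoLines = 'a\nb'

// Without the m flag (the builder's behavior), ^ matches only at input start.
const startWithoutMultiline = new RegExp('^b', 'us').test(twoLines)
// Overriding with m lets ^ match at the start of each line.
const startWithMultiline = new RegExp('^b', 'usm').test(twoLines)

// With the s flag (the builder's behavior), . also matches line terminators.
const dotWithDotAll = new RegExp('a.b', 'us').test(twoLines)
// Without s, . stops at the newline.
const dotWithoutDotAll = new RegExp('a.b', 'u').test(twoLines)

console.log(startWithoutMultiline, startWithMultiline) // prints false true
console.log(dotWithDotAll, dotWithoutDotAll)           // prints true false
```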

Future

  • Unicode sets, using the v flag, would enable things like intersections of Unicode properties, like unicodeProperties('Letter', ['Script_Extensions', 'Greek'])
  • Case sensitivity assertions could make it possible to mark individual patterns as case-sensitive or case-insensitive. For example, [caseInsensitive('Hello'), ' world'] would match "Hello world", "HELLO world", "hello world", "hEllO world", etc.

License

MIT