@hutechtechnical/nobis-ex-dolor-reprehenderit

v1.0.0

Published

8 months ago

The tiny, regex powered, lenient, _almost_ spec-compliant JavaScript tokenizer that never fails.


const jsTokens = require("@hutechtechnical/nobis-ex-dolor-reprehenderit");

const jsString = 'JSON.stringify({k:3.14**2}, null /*replacer*/, "\\t")';

Array.from(jsTokens(jsString), (token) => token.value).join("|");
// JSON|.|stringify|(|{|k|:|3.14|**|2|}|,| |null| |/*replacer*/|,| |"\t"|)

Installation

npm install @hutechtechnical/nobis-ex-dolor-reprehenderit

import jsTokens from "@hutechtechnical/nobis-ex-dolor-reprehenderit";
// or:
const jsTokens = require("@hutechtechnical/nobis-ex-dolor-reprehenderit");

Usage

jsTokens(string, options?)

| Option | Type    | Default | Description         |
| :----- | :------ | :------ | :------------------ |
| jsx    | boolean | false   | Enable JSX support. |

This package exports a generator function, jsTokens, that turns a string of JavaScript code into token objects.

For the empty string, the function yields nothing (which can be turned into an empty list). For any other input, the function always yields something, even for invalid JavaScript, and never throws. Concatenating the token values reproduces the input.

The package is very close to being fully spec compliant (it passes all but 3 of test262-parser-tests), but has taken a couple of shortcuts. See the following sections for limitations of some tokens.

// Loop over tokens:
for (const token of jsTokens("hello, !world")) {
  console.log(token);
}

// Get all tokens as an array:
const tokens = Array.from(jsTokens("hello, !world"));
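
The round-trip property described above (concatenating the token values reproduces the input) is easy to check in tests. Below is a minimal sketch; `reproducesInput` and `codePointTokens` are hypothetical names, and the stand-in tokenizer exists only so that the snippet runs without the package installed. In real use you would pass the `jsTokens` export instead.

```javascript
// Hypothetical helper: true if joining all token values gives back the input.
// `tokenize` is any function that returns an iterable of { type, value } objects.
function reproducesInput(tokenize, input) {
  let joined = "";
  for (const token of tokenize(input)) {
    joined += token.value;
  }
  return joined === input;
}

// Stand-in tokenizer (an assumption for this sketch, NOT the package):
// yields one "Invalid" token per code point.
function* codePointTokens(input) {
  for (const char of input) {
    yield { type: "Invalid", value: char };
  }
}

console.log(reproducesInput(codePointTokens, "hello, !world")); // true
console.log(reproducesInput(codePointTokens, "")); // true
```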

Tokens

Spec: ECMAScript Language: Lexical Grammar + Additional Syntax

export default function jsTokens(input: string): Iterable<Token>;

type Token =
  | { type: "StringLiteral"; value: string; closed: boolean }
  | { type: "NoSubstitutionTemplate"; value: string; closed: boolean }
  | { type: "TemplateHead"; value: string }
  | { type: "TemplateMiddle"; value: string }
  | { type: "TemplateTail"; value: string; closed: boolean }
  | { type: "RegularExpressionLiteral"; value: string; closed: boolean }
  | { type: "MultiLineComment"; value: string; closed: boolean }
  | { type: "SingleLineComment"; value: string }
  | { type: "HashbangComment"; value: string }
  | { type: "IdentifierName"; value: string }
  | { type: "PrivateIdentifier"; value: string }
  | { type: "NumericLiteral"; value: string }
  | { type: "Punctuator"; value: string }
  | { type: "WhiteSpace"; value: string }
  | { type: "LineTerminatorSequence"; value: string }
  | { type: "Invalid"; value: string };
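
As an illustration of working with these token objects, here is a small sketch that drops tokens carrying no code and checks for unclosed tokens. The helper names are hypothetical, and the sample array is hand-written to match the Token shape above rather than produced by running the package.

```javascript
// Token types that carry no code. The names come from the Token type above.
const TRIVIA = new Set([
  "WhiteSpace",
  "LineTerminatorSequence",
  "MultiLineComment",
  "SingleLineComment",
  "HashbangComment",
]);

// Hypothetical helper: keeps only tokens that are not whitespace or comments.
function significantTokens(tokens) {
  return Array.from(tokens).filter((token) => !TRIVIA.has(token.type));
}

// Hypothetical helper: true if any token reports closed: false.
function hasUnclosedToken(tokens) {
  return Array.from(tokens).some((token) => token.closed === false);
}

// Hand-written sample resembling the tokens for: a /* b */ + "c
const sample = [
  { type: "IdentifierName", value: "a" },
  { type: "WhiteSpace", value: " " },
  { type: "MultiLineComment", value: "/* b */", closed: true },
  { type: "WhiteSpace", value: " " },
  { type: "Punctuator", value: "+" },
  { type: "WhiteSpace", value: " " },
  { type: "StringLiteral", value: '"c', closed: false },
];

console.log(significantTokens(sample).map((token) => token.value)); // [ 'a', '+', '"c' ]
console.log(hasUnclosedToken(sample)); // true
```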

StringLiteral

Spec: StringLiteral

If the ending " or ' is missing, the token has closed: false. JavaScript strings cannot contain (unescaped) newlines, so unclosed strings simply end at the end of the line.

Escape sequences are supported, but may be invalid. For example, "\u" is matched as a StringLiteral even though it contains an invalid escape.

Examples:

"string"
'string'
""
''
"\""
'\''
"valid: \u00a0, invalid: \u"
'valid: \u00a0, invalid: \u'
"multi-\
line"
'multi-\
line'
" unclosed
' unclosed
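
To make the rules above concrete, here is a sketch of a lenient string matcher: any escape is accepted, an unescaped newline ends the match, and the closing quote is optional. The regex and the `matchStringLiteral` name are assumptions for illustration, not necessarily what the package uses internally.

```javascript
// Opening quote, then any mix of ordinary characters and backslash escapes,
// then an OPTIONAL closing quote (capture group 2).
const stringLiteral = /(['"])(?:(?!\1)[^\\\n\r]|\\(?:\r\n|[^]))*(\1)?/y;

// Hypothetical helper: matches a string literal starting exactly at `index`.
function matchStringLiteral(input, index = 0) {
  stringLiteral.lastIndex = index;
  const match = stringLiteral.exec(input);
  if (match === null) return null;
  return {
    type: "StringLiteral",
    value: match[0],
    closed: match[2] !== undefined, // group 2 only participates if the quote is present
  };
}

console.log(matchStringLiteral('"string" rest').closed); // true
console.log(matchStringLiteral('" unclosed\nnext line').value); // " unclosed
console.log(matchStringLiteral('"invalid: \\u"').closed); // true
```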

NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

Spec: NoSubstitutionTemplate / TemplateHead / TemplateMiddle / TemplateTail

A template without interpolations is matched as is. For example:

`abc`: NoSubstitutionTemplate
`abc: NoSubstitutionTemplate with closed: false

A template with interpolations is matched as many tokens. For example, `head${1}middle${2}tail` is matched as follows (apart from the two NumericLiterals):

`head${: TemplateHead
}middle${: TemplateMiddle
}tail`: TemplateTail

TemplateMiddle is optional, and TemplateTail can be unclosed. For example, `head${1}tail (note the missing ending `):

`head${: TemplateHead
}tail: TemplateTail with closed: false

Templates can contain unescaped newlines, so unclosed templates go on to the end of input.

Just like for StringLiteral, templates can also contain invalid escapes. `\u` is matched as a NoSubstitutionTemplate even though it contains an invalid escape. Also note that in tagged templates, invalid escapes are not syntax errors: x`\u` is syntactically valid JavaScript.
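
The point about tagged templates can be seen directly in plain JavaScript (ES2018 and later), independent of this package. In the sketch below, `tag` is a hypothetical tag function: for an invalid escape the cooked string is undefined, while the raw text is still available.

```javascript
// Hypothetical tag function that returns both views of the first template chunk.
function tag(strings) {
  return { cooked: strings[0], raw: strings.raw[0] };
}

// Valid syntax because the template is tagged; an untagged `\u` is a SyntaxError.
const result = tag`\u`;

console.log(result.cooked); // undefined
console.log(result.raw.length); // 2 (a backslash followed by "u")
```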

RegularExpressionLiteral

Spec: RegularExpressionLiteral

Regex literals may contain invalid regex syntax. They are still matched as regex literals.

If the ending / is missing, the token has closed: false. JavaScript regex literals cannot contain newlines (not even escaped ones), so unclosed regex literals simply end at the end of the line.

According to the specification, the flags of regular expressions are IdentifierParts (unknown and repeated regex flags become errors at a later stage).

Differentiating between regex and division in JavaScript is really tricky. @hutechtechnical/nobis-ex-dolor-reprehenderit looks at the previous token to tell them apart. As long as the previous tokens are valid, it should do the right thing. For invalid code, @hutechtechnical/nobis-ex-dolor-reprehenderit might be confused and start matching division as regex or vice versa.

Examples:

/a/
/a/gimsuy
/a/Inva1id
/+/
/[/]\//
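
The previous-token idea described above can be sketched as follows. This is a deliberately simplified illustration, not the package's actual logic: the function name and both lookup sets are assumptions, the keyword list is not exhaustive, and ) and } are always treated as ending an expression, which is cruder than what the package does.

```javascript
// Keywords after which an expression (and therefore a regex) can start.
// Illustrative, not exhaustive.
const KEYWORDS_BEFORE_EXPRESSION = new Set([
  "return", "typeof", "instanceof", "in", "of", "void", "delete",
  "throw", "new", "case", "do", "else", "yield", "await",
]);

// Punctuators assumed to end an expression, so a following / means division.
const EXPRESSION_ENDERS = new Set([")", "]", "}", "++", "--"]);

// `previous` is the last token that is not whitespace or a comment,
// or undefined at the start of the input.
function slashStartsRegex(previous) {
  if (previous === undefined) return true;
  switch (previous.type) {
    case "IdentifierName":
      return KEYWORDS_BEFORE_EXPRESSION.has(previous.value);
    case "Punctuator":
      return !EXPRESSION_ENDERS.has(previous.value);
    case "TemplateHead":
    case "TemplateMiddle":
      return true; // these end with ${, where an expression starts
    default:
      return false; // literals, template tails, private identifiers
  }
}

console.log(slashStartsRegex({ type: "Punctuator", value: "=" })); // true
console.log(slashStartsRegex({ type: "NumericLiteral", value: "1" })); // false
console.log(slashStartsRegex({ type: "IdentifierName", value: "return" })); // true
```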

MultiLineComment

Spec: MultiLineComment

If the ending */ is missing, the token has closed: false. Unclosed multi-line comments go on to the end of the input.

Examples:

/* comment */
/* console.log(
    "commented", out + code);
    */
/**/
/* unclosed

SingleLineComment

Spec: SingleLineComment

Examples:

// comment
// console.log("commented", out + code);
//

HashbangComment

Spec: HashbangComment

Note that a HashbangComment can only occur at the very start of the string that is being tokenized. Anywhere else you will likely get an Invalid token # followed by a Punctuator token !.

Examples:

#!/usr/bin/env node
#! console.log("commented", out + code);
#!
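
The "only at the very start" rule can be sketched with an anchored regex. The regex and the `matchHashbang` name are assumptions for illustration; without the m flag, ^ only matches at index 0, and . stops at the first line terminator.

```javascript
// #! followed by everything up to (not including) the first line terminator.
const hashbang = /^#!.*/;

// Hypothetical helper: returns a HashbangComment token or null.
function matchHashbang(input) {
  const match = hashbang.exec(input);
  return match === null ? null : { type: "HashbangComment", value: match[0] };
}

console.log(matchHashbang("#!/usr/bin/env node\nconsole.log(1);").value); // #!/usr/bin/env node
console.log(matchHashbang("x;\n#!/usr/bin/env node")); // null
```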

IdentifierName

Spec: IdentifierName

Keywords, reserved words, null, true, false, variable names and property names.

Examples:

if
for
var
instanceof
package
null
true
false
Infinity
undefined
NaN
$variab1e_name
π
℮
ಠ_ಠ
\u006C\u006F\u006C\u0077\u0061\u0074

PrivateIdentifier

Spec: PrivateIdentifier

Any IdentifierName preceded by a #.

Examples:

#if
#for
#var
#instanceof
#package
#null
#true
#false
#Infinity
#undefined
#NaN
#$variab1e_name
#π
#℮
#ಠ_ಠ
#\u006C\u006F\u006C\u0077\u0061\u0074

NumericLiteral

Spec: NumericLiteral

Examples:

0
1.5
1
1_000
12e9
0.123e-32
0xDead_beef
0b110
12n
07
09.5

Punctuator

Spec: Punctuator + DivPunctuator + RightBracePunctuator

All possible values:

&&  ||  ??
--  ++
.   ?.
<   <=   >   >=
!=  !==  ==  ===
   +   -   %   &   |   ^   /   *   **   <<   >>   >>>
=  +=  -=  %=  &=  |=  ^=  /=  *=  **=  <<=  >>=  >>>=
(  )  [  ]  {  }
!  ?  :  ;  ,  ~  ...  =>

WhiteSpace

Spec: WhiteSpace

Unlike the specification, multiple whitespace characters in a row are matched as one token, not one token per character.
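
A sketch of what "one token per run" looks like with a sticky regex. The character class follows the spec's WhiteSpace production (tab, vertical tab, form feed, U+FEFF, and the Zs category); the regex itself and the `matchWhiteSpace` name are assumptions, not necessarily the package's own.

```javascript
// One or more whitespace characters in a row, matched as a single run.
// Requires Unicode property escapes (\p{...}).
const whiteSpace = /[\t\v\f\ufeff\p{Zs}]+/uy;

// Hypothetical helper: matches a whitespace run starting exactly at `index`.
function matchWhiteSpace(input, index) {
  whiteSpace.lastIndex = index;
  const match = whiteSpace.exec(input);
  return match === null ? null : { type: "WhiteSpace", value: match[0] };
}

// Space, tab and no-break space form one token of length 3.
console.log(matchWhiteSpace("a \t\u00a0b", 1).value.length); // 3
console.log(matchWhiteSpace("ab", 0)); // null
```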

LineTerminatorSequence

Spec: LineTerminatorSequence

CR, LF and CRLF, plus \u2028 and \u2029.
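
For example, counting line terminator sequences with a regex that treats CRLF as a single unit could look like the sketch below. The regex and the `countLineTerminators` name are assumptions for illustration.

```javascript
// CRLF or LF first, then lone CR and the two Unicode separators.
const lineTerminatorSequence = /\r?\n|[\r\u2028\u2029]/g;

// Hypothetical helper: counts line terminator sequences in a string.
function countLineTerminators(input) {
  const matches = input.match(lineTerminatorSequence);
  return matches === null ? 0 : matches.length;
}

console.log(countLineTerminators("a\r\nb\rc\nd\u2028e")); // 4
console.log(countLineTerminators("no newline")); // 0
```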

Invalid

Spec: n/a

Single code points not matched in another token.

Examples:

#
@
💩

JSX Tokens

Spec: JSX Specification

export default function jsTokens(
  input: string,
  options: { jsx: true },
): Iterable<Token | JSXToken>;

export declare type JSXToken =
  | { type: "JSXString"; value: string; closed: boolean }
  | { type: "JSXText"; value: string }
  | { type: "JSXIdentifier"; value: string }
  | { type: "JSXPunctuator"; value: string }
  | { type: "JSXInvalid"; value: string };

The tokenizer switches between outputting runs of Token and runs of JSXToken.
Runs of JSXToken can also contain WhiteSpace, LineTerminatorSequence, MultiLineComment and SingleLineComment.

JSXString

Spec: " JSXDoubleStringCharacters " + ' JSXSingleStringCharacters '

If the ending " or ' is missing, the token has closed: false. JSX strings can contain unescaped newlines, so unclosed JSX strings go on to the end of input.

Note that JSX doesn’t support escape sequences as part of the token grammar. A " or ' always closes the string, even with a backslash before it.

Examples:

"string"
'string'
""
''
"\"
'\'
"multi-
line"
'multi-
line'
" unclosed
' unclosed
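
The difference from StringLiteral (no escapes, newlines allowed) can be sketched like this. The regex and the `matchJSXString` name are assumptions for illustration, not necessarily the package's internals.

```javascript
// Opening quote, anything except that quote (newlines included),
// then an optional closing quote. Backslashes have no special meaning.
const jsxString = /"[^"]*"?|'[^']*'?/y;

// Hypothetical helper: matches a JSX string starting exactly at `index`.
function matchJSXString(input, index = 0) {
  jsxString.lastIndex = index;
  const match = jsxString.exec(input);
  if (match === null) return null;
  const value = match[0];
  // Closed if there is a second quote of the same kind at the end.
  const closed = value.length > 1 && value.endsWith(value[0]);
  return { type: "JSXString", value, closed };
}

// The backslash does not escape the second quote, so the string ends there.
console.log(matchJSXString('"\\" rest').value.length); // 3
console.log(matchJSXString('" unclosed\nmore').closed); // false
```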

JSXText

Spec: JSXText

Anything but <, >, { and }.

JSXIdentifier

Spec: JSXIdentifier

Examples:

div
class
xml
x-element
x------
$htm1_element
ಠ_ಠ

JSXPunctuator

Spec: n/a

All possible values:

<
>
/
.
:
=
{
}

JSXInvalid

Spec: n/a

Single code points not matched in another token.

Examples in JSX tags:

1
`
+
,
#
@
💩

All possible values in JSX children:

>
}

Compatibility

ECMAScript

The intention is to always support the latest ECMAScript version whose feature set has been finalized.

Currently, ECMAScript 2023 is supported.

Annex B and C (strict mode)

Section B: Additional ECMAScript Features for Web Browsers of the spec is optional if the ECMAScript host is not a web browser, and specifies some additional syntax. Section C: The Strict Mode of ECMAScript disallows certain syntax in Strict Mode.

Numeric literals: @hutechtechnical/nobis-ex-dolor-reprehenderit supports legacy octal and octal-like numeric literals, regardless of Strict Mode.
String literals: @hutechtechnical/nobis-ex-dolor-reprehenderit supports legacy octal escapes, since it allows any invalid escapes.
HTML-like comments: Not supported. @hutechtechnical/nobis-ex-dolor-reprehenderit prefers treating 5<!--x as 5 < !(--x) rather than as 5 //x.
Regular expression patterns: @hutechtechnical/nobis-ex-dolor-reprehenderit doesn’t care what’s between the starting / and ending /, so this is supported.

TypeScript

Supporting TypeScript is not an explicit goal, but @hutechtechnical/nobis-ex-dolor-reprehenderit and Babel both tokenize this TypeScript fixture and this TSX fixture the same way, with one edge case:

type A = Array<Array<string>>
type B = Array<Array<Array<string>>>

Both lines above should end with a couple of > tokens, but @hutechtechnical/nobis-ex-dolor-reprehenderit instead matches the >> and >>> operators.
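
If you know from context that a token sits at the end of a type argument list, one possible workaround is to split those operators yourself. The sketch below is an assumption, not something the package provides: `splitClosingAngles` is a hypothetical helper, and it is up to the caller to apply it only where >> and >>> cannot be shift operators.

```javascript
// Hypothetical helper: splits >> and >>> punctuators into single > tokens.
// Only safe on token ranges known to be TypeScript type syntax.
function splitClosingAngles(tokens) {
  const result = [];
  for (const token of tokens) {
    if (token.type === "Punctuator" && /^>{2,3}$/.test(token.value)) {
      for (const char of token.value) {
        result.push({ type: "Punctuator", value: char });
      }
    } else {
      result.push(token);
    }
  }
  return result;
}

// Hand-written tokens resembling the end of: Array<Array<string>>
const tail = [
  { type: "IdentifierName", value: "string" },
  { type: "Punctuator", value: ">>" },
];

console.log(splitClosingAngles(tail).map((token) => token.value)); // [ 'string', '>', '>' ]
```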

JSX

JSX is supported: jsTokens("<p>Hello, world!</p>", { jsx: true }).

JavaScript runtimes

@hutechtechnical/nobis-ex-dolor-reprehenderit should work in any JavaScript runtime that supports Unicode property escapes.
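
If you need to check for that support at run time, a feature test along these lines should work. The function name is hypothetical; the regex is built from a string so that runtimes without support throw at call time rather than failing to parse the file.

```javascript
// Returns true if the runtime understands \p{...} in Unicode-mode regexes.
function supportsUnicodePropertyEscapes() {
  try {
    new RegExp("\\p{L}", "u");
    return true;
  } catch (error) {
    return false;
  }
}

console.log(supportsUnicodePropertyEscapes());
```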

Known errors

Here are a couple of tricky cases:

// Case 1:
switch (x) {
  case x: {}/a/g;
  case x: {}<div>x</div>/g;
}

// Case 2:
label: {}/a/g;
label: {}<div>x</div>/g;

// Case 3:
(function f() {}/a/g);
(function f() {}<div>x</div>/g);

This is what they mean:

// Case 1:
switch (x) {
  case x:
    {
    }
    /a/g;
  case x:
    {
    }
    <div>x</div> / g;
}

// Case 2:
label: {
}
/a/g;
label: {
}
<div>x</div> / g;

// Case 3:
(function f() {}) / a / g;
(function f() {}) < div > x < /div>/g;

But @hutechtechnical/nobis-ex-dolor-reprehenderit thinks they mean:

// Case 1:
switch (x) {
  case x:
    ({}) / a / g;
  case x:
    ({}) < div > x < /div>/g;
}

// Case 2:
label: ({}) / a / g;
label: ({}) < div > x < /div>/g;

// Case 3:
function f() {}
/a/g;
function f() {}
<div>x</div> / g;

In other words, @hutechtechnical/nobis-ex-dolor-reprehenderit:

Mis-identifies regex as division and JSX as comparison in cases 1 and 2.
Mis-identifies division as regex and comparison as JSX in case 3.

This happens because @hutechtechnical/nobis-ex-dolor-reprehenderit looks at the previous token when deciding between regex and division or JSX and comparison. In these cases, the previous token is }, which either means “end of block” (→ regex/JSX) or “end of object literal” (→ division/comparison). How does @hutechtechnical/nobis-ex-dolor-reprehenderit determine if the } belongs to a block or an object literal? By looking at the token before the matching {.

In cases 1 and 2, that’s a :. A : usually means that we have an object literal or ternary:

let some = weird ? { value: {}/a/g } : {}/a/g;

But : is also used for case and labeled statements.

One idea is to look for case before the : as an exception to the rule, but it’s not so easy:

switch (x) {
  case weird ? true : {}/a/g: {}/a/g
}

The first {}/a/g is a division, while the second {}/a/g is an empty block followed by a regex. Both are preceded by a colon with a case on the same line, and it does not seem like you can distinguish between the two without implementing a parser.

Labeled statements are similarly difficult, since they are so similar to object literals:

{
  label: {}/a/g
}

({
  key: {}/a/g
})

Finally, case 3 ((function f() {}/a/g);) is also difficult, because a ) before a { means that the { is part of a block, and blocks are usually statements:

if (x) {
}
/a/g;

function f() {}
/a/g;

But function expressions are of course not statements. It’s difficult to tell a function expression from a function statement without parsing.

Luckily, none of these edge cases are likely to occur in real code.

Known failures

@hutechtechnical/nobis-ex-dolor-reprehenderit advertises that it “never fails”. In truth, it can fail on extreme inputs: the regex engine of the runtime can eventually give up. @hutechtechnical/nobis-ex-dolor-reprehenderit has worked around this to some extent by making its regexes easier on the regex engine. To solve the problem completely, @hutechtechnical/nobis-ex-dolor-reprehenderit would have to stop using regex, but then it wouldn’t be tiny anymore, which is the whole point. Luckily, only extreme inputs can fail, hopefully ones you’ll never encounter.

For example, if you try to parse the string literal "\n\n\n" but with 10 million \n instead of just 3, the regex engine gives up with RangeError: Maximum call stack size exceeded (or similar). Try it out:

Array.from(require("@hutechtechnical/nobis-ex-dolor-reprehenderit")(`"${"\\n".repeat(1e7)}"`));

(Yes, that is the regex engine of the runtime giving up. @hutechtechnical/nobis-ex-dolor-reprehenderit has no recursive functions.)

However, if you repeat a instead of \n 10 million times ("aaaaaa…"), it works:

Array.from(require("@hutechtechnical/nobis-ex-dolor-reprehenderit")(`"${"a".repeat(1e7)}"`));

That’s good, because it’s much more common to have lots of non-escapes in a row in a big string literal, than having mostly escapes. (Obfuscated code might have only escapes though.)
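
If you tokenize untrusted or machine-generated input, you may want to guard against this failure. Below is a minimal sketch: `tokenizeSafely` is a hypothetical wrapper, and the two stand-in tokenizers exist only so that the snippet runs without the package installed. In real use you would pass the `jsTokens` export.

```javascript
// Hypothetical wrapper: collects all tokens, turning a RangeError from the
// regex engine into a result object instead of an exception.
function tokenizeSafely(tokenize, input) {
  try {
    // The generator is lazy, so it must be consumed inside the try block.
    return { ok: true, tokens: Array.from(tokenize(input)) };
  } catch (error) {
    if (error instanceof RangeError) {
      return { ok: false, error };
    }
    throw error; // anything else is unexpected
  }
}

// Stand-ins (assumptions for this sketch, NOT the package):
function* alwaysOverflows() {
  throw new RangeError("Maximum call stack size exceeded");
}
function* singleToken(input) {
  yield { type: "Invalid", value: input };
}

console.log(tokenizeSafely(alwaysOverflows, "x").ok); // false
console.log(tokenizeSafely(singleToken, "x").ok); // true
```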

Safari warning

I’ve seen Safari give up instead of throwing an error.

In Safari, Chrome, Firefox and Node.js the following code successfully results in a match:

/(#)(?:a|b)+/.exec("#" + "a".repeat(1e5));

But for the following code (with 1e7 instead of 1e5), the runtimes differ:

/(#)(?:a|b)+/.exec("#" + "a".repeat(1e7));

Chrome, Firefox and Node.js all throw RangeError: Maximum call stack size exceeded (or similar).
Safari returns null (at the time of writing), silently giving up on matching the regex. In effect it lies that the regex did not match, when in reality it would have matched given enough computing resources.

This means that in Safari, @hutechtechnical/nobis-ex-dolor-reprehenderit might not fail but instead give you unexpected tokens.
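
To see which behavior a given runtime has, the experiment above can be wrapped in a small probe. `probeRegex` is a hypothetical helper; which result you get for large lengths depends on the runtime, so only the small case is annotated.

```javascript
// Runs the regex from the examples above against "#" followed by `length` a's
// and reports how the runtime reacted.
function probeRegex(length) {
  const input = "#" + "a".repeat(length);
  try {
    const match = /(#)(?:a|b)+/.exec(input);
    if (match === null) return "gave up silently";
    return match[0].length === input.length ? "matched" : "partial match";
  } catch (error) {
    return "threw " + error.name;
  }
}

console.log(probeRegex(1e3)); // matched
// probeRegex(1e7) is where runtimes are reported to differ.
```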

Performance

Benchmarked against @babel/parser for comparison, using Node.js 21.6.1 on a MacBook Pro M1 (macOS Sonoma).

| Lines of code |     Size | @hutechtechnical/nobis-ex-dolor-reprehenderit | @babel/parser |
| ------------: | -------: | --------------------------------------------: | ------------: |
|          ~100 | ~4.0 KiB |                                         ~2 ms |        ~10 ms |
|        ~1 000 |  ~39 KiB |                                         ~5 ms |        ~27 ms |
|       ~10 000 | ~353 KiB |                                        ~44 ms |       ~108 ms |
|      ~100 000 | ~5.1 MiB |                                       ~333 ms |        ~2.0 s |
|    ~2 400 000 | ~138 MiB |                                          ~7 s |  ~4 m 9 s (*) |

(*) Required increasing the Node.js memory limit (I set it to 8 GiB).

See benchmark.js if you want to run benchmarks yourself.
