xml-tokenizer

v0.0.18

Published

2 days ago

Straightforward and typesafe XML tokenizer that streams tokens through a callback mechanism

Downloads

125

0High
0Medium
0Low

bennodev

Status: Experimental

xml-tokenizer is a straightforward and typesafe XML tokenizer that streams tokens through a callback mechanism. The implementation is based on the roxmltree tokenizer.rs. See the FAQ why we did not embed the roxmltree crate as WASM.

XML Token Stream: Processes XML documents as a stream, emitting tokens on the fly similar to the SAX approach
Wide Range of Tokens: Handles processing instructions, comments, entity declarations, element starts/ends, attributes, text, and CDATA sections
Validate XML: Validates XML while processing which makes it slower than txml but its still twice as fast as fast-xml-parser
Typesafe: Build with TypeScript for strong type safety

📚 Examples

Vanilla Profiler

🌟 Motivation

Create a typesafe, straightforward, and lightweight XML parser. Many existing parsers either lack TypeScript support, aren't actively maintained, or exceed 20kB gzipped.

My goal was to develop an efficient & flexible alternative by porting roxmltree to TypeScript or integrating it via WASM. While it functions well and is quite versatile due to its streaming approach, it's not as fast as I hoped.

⚖️ Alternatives

📖 Usage

import { select, tokenize, xmlToObject, xmlToSimplifiedObject } from 'xml-tokenizer';

// Parse XML to Javascript object without information lost (uses `tokenize` under the hood)
const xmlObject = xmlToObject('<p>Hello World</p>');

// Or, parse XML to easy to queryable Javascript object
const simplifiedXmlObject = xmlToSimplifiedObject('<p>Hello World</p>');

// Or, parse XML to a stream of tokens
tokenize('<p>Hello World</p>', (token) => {
	switch (token.type) {
		case 'ElementStart':
			console.log('Start of element:', token);
			break;
		case 'Text':
			console.log('Text content:', token.text);
			break;
		// Handle other token types as needed
		default:
			console.log('Token:', token);
	}
});

// Or, stream only a selection of tokens
select(
	xml,
	[
		[
			{ axis: 'child', local: 'bookstore' },
			{ axis: 'child', local: 'book', attributes: [{ local: 'category', value: 'COOKING' }] }
		]
	],
	(selectedToken) => {
		// Handle selected token
	}
);

Token Types

The following token types are supported:

ProcessingInstruction: <?target content?>
Comment: 
EntityDeclaration: <!ENTITY ns_extend "http://test.com">
ElementStart: <ns:elem
Attribute: ns:attr="value"
ElementEnd:
- Open: >
- Close: </ns:name>
- Empty: />
Text: Text content between elements, including whitespace.
Cdata: <![CDATA[text]]>

👀 Differences from XML 1.0 Specification

Attribute Value Handling:
- XML 1.0: Attributes must be explicitly assigned a value in the format Name="Value". An attribute without a value is not valid XML.
- Parser Behavior: Attributes without an explicit value are interpreted as true (e.g., <element attribute/> is parsed as attribute="true").
- Reason: This behavior aligns with HTML-style parsing, which was necessary to handle HTML attributes without explicit values.

🚀 Benchmark

The performance of xml-tokenizer was benchmarked against other popular XML parsers. These tests focus on XML to object conversion and node counting. Interestingly, the version of xml-tokenizer imported directly from npm performed significantly better. The reason for this discrepancy is unclear, but the results seem accurate based on external testing.

XML to Object Conversion

| Parser | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) | | -------------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ | | xml-tokenizer | 46.87 | 19.47 | 24.57 | 21.33 | ±2.06% | | xml-tokenizer (dist) | 53.70 | 17.31 | 25.20 | 18.62 | ±3.28% | | xml-tokenizer (npm) | 163.00 | 5.03 | 8.50 | 6.13 | ±2.32% | | fast-xml-parser | 66.00 | 14.01 | 20.73 | 15.15 | ±3.34% | | txml | 234.52 | 3.38 | 7.61 | 4.26 | ±4.00% | | xml2js | 36.21 | 25.58 | 37.28 | 27.61 | ±4.39% |

Node Counting

| Parser | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) | | ------------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ | | xml-tokenizer | 53.03 | 18.30 | 19.45 | 18.86 | ±0.81% | | xml-tokenizer (npm) | 166.61 | 5.62 | 7.16 | 6.00 | ±0.88% | | saxen | 500.99 | 1.83 | 4.79 | 2.00 | ±1.52% | | sax | 64.44 | 14.96 | 16.34 | 15.52 | ±0.67% |

Running the Benchmarks

The benchmarks can be found in the __tests__ directory and can be executed by running:

pnpm run bench

❓ FAQ

Why removed Rust implementation (WASM)?

We removed the Rust implementation to improve maintainability and because it didn't provide the expected performance boost.

Calling a TypeScript function from Rust on every token event (wasmMix benchmark) results in slow communication, negating Rust's performance benefits. Parsing XML entirely on the Rust site (wasm benchmark) avoids frequent communication but is still too slow due to the overhead of serializing and deserializing data between JavaScript and Rust (mainly the resulting XML-Object). While Rust parsing without returning results is faster than any JavaScript XML parser, needing results in the JavaScript layer makes this approach impractical.

The roxmltree package with the Rust implementation can be found in the _deprecated folder (packages/_deprecated/roxmltree_wasm).

| Parser | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) | | ----------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ | | roxmltree:text | 67.12 | 14.33 | 83.29 | 80.08 | ±1.27% | | roxmltree:wasmMix | 28.17 | 34.83 | 36.71 | 35.49 | ±0.91% | | roxmltree:wasm | 109.30 | 8.30 | 13.16 | 9.15 | ±3.31% |

Why ported `tokenizer.rs` to TypeScript?

We ported tokenizer.rs to TypeScript because frequent communication between Rust and TypeScript negated Rust's performance benefits. The stream architecture required constant interaction between Rust and TypeScript via the tokenCallback, reducing overall efficiency.

Why removed Byte-Based implementation?

We removed the byte-based implementation to enhance maintainability and because it didn't provide the expected performance improvement.

Decoding Uint8Array snippets to JavaScript strings is frequently necessary, nearly on every token event. This decoding process is slow, making this approach less efficient than working directly with strings.

| Parser | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) | | -------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ | | roxmltree:text | 67.12 | 14.33 | 83.29 | 80.08 | ±1.27% | | roxmltree:byte | 12.48 | 78.65 | 16.45 | 14.90 | ±1.15% |

The roxmltree package with the Byte-Based implementation can be found in the _deprecated folder (packages/_deprecated/roxmltree_byte-only).

Why not use a Generator?

While generators can improve developer experience, they introduce significant performance overhead. Our benchmarks show that using a generator dramatically increases the execution time compared to the callback approach. Given our focus on performance, we chose to maintain the callback implementation.

See Generator vs Iterator vs Callback for more details.

Benchmark with Generator

[xml-tokenizer] Total Time: 5345.0000 ms | Average Time per Run: 53.4500 ms | Median Time: 53.0000 ms | Runs: 100
[txml] Total Time: 395.0000 ms | Average Time per Run: 3.9500 ms | Median Time: 4.0000 ms | Runs: 100
[fast-xml-parser] Total Time: 1290.0000 ms | Average Time per Run: 12.9000 ms | Median Time: 13.0000 ms | Runs: 100

Benchmark with Callback

[xml-tokenizer] Total Time: 662.0000 ms | Average Time per Run: 6.6200 ms | Median Time: 6.0000 ms | Runs: 100
[txml] Total Time: 394.0000 ms | Average Time per Run: 3.9400 ms | Median Time: 4.0000 ms | Runs: 100
[fast-xml-parser] Total Time: 1308.0000 ms | Average Time per Run: 13.0800 ms | Median Time: 13.0000 ms | Runs: 100

Benchmark implementation in Vanilla Profiler