xml-tokenizer

v0.0.18

Straightforward and typesafe XML tokenizer that streams tokens through a callback mechanism

Status: Experimental

xml-tokenizer is a straightforward and typesafe XML tokenizer that streams tokens through a callback mechanism. The implementation is based on roxmltree's tokenizer.rs. See the FAQ for why we did not embed the roxmltree crate as WASM.

  • XML Token Stream: Processes XML documents as a stream, emitting tokens on the fly similar to the SAX approach
  • Wide Range of Tokens: Handles processing instructions, comments, entity declarations, element starts/ends, attributes, text, and CDATA sections
  • Validate XML: Validates XML while processing, which makes it slower than txml, but it's still twice as fast as fast-xml-parser
  • Typesafe: Built with TypeScript for strong type safety

📚 Examples

🌟 Motivation

Create a typesafe, straightforward, and lightweight XML parser. Many existing parsers either lack TypeScript support, aren't actively maintained, or exceed 20kB gzipped.

My goal was to develop an efficient & flexible alternative by porting roxmltree to TypeScript or integrating it via WASM. While it functions well and is quite versatile due to its streaming approach, it's not as fast as I hoped.

⚖️ Alternatives

📖 Usage

import { select, tokenize, xmlToObject, xmlToSimplifiedObject } from 'xml-tokenizer';

// Parse XML to a JavaScript object without information loss (uses `tokenize` under the hood)
const xmlObject = xmlToObject('<p>Hello World</p>');

// Or, parse XML to an easy-to-query JavaScript object
const simplifiedXmlObject = xmlToSimplifiedObject('<p>Hello World</p>');

// Or, parse XML to a stream of tokens
tokenize('<p>Hello World</p>', (token) => {
	switch (token.type) {
		case 'ElementStart':
			console.log('Start of element:', token);
			break;
		case 'Text':
			console.log('Text content:', token.text);
			break;
		// Handle other token types as needed
		default:
			console.log('Token:', token);
	}
});

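// Or, stream only a selection of tokens
// (example input for illustration; any XML string works)
const xml =
	'<bookstore><book category="COOKING"><title>Everyday Italian</title></book></bookstore>';
select(
	xml,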
	[
		[
			{ axis: 'child', local: 'bookstore' },
			{ axis: 'child', local: 'book', attributes: [{ local: 'category', value: 'COOKING' }] }
		]
	],
	(selectedToken) => {
		// Handle selected token
	}
);

Token Types

The following token types are supported:

  • ProcessingInstruction: <?target content?>
  • Comment: <!-- text -->
  • EntityDeclaration: <!ENTITY ns_extend "http://test.com">
  • ElementStart: <ns:elem
  • Attribute: ns:attr="value"
  • ElementEnd:
    • Open: >
    • Close: </ns:name>
    • Empty: />
  • Text: Text content between elements, including whitespace.
  • Cdata: <![CDATA[text]]>

👀 Differences from XML 1.0 Specification

  • Attribute Value Handling:
    • XML 1.0: Attributes must be explicitly assigned a value in the format Name="Value". An attribute without a value is not valid XML.
    • Parser Behavior: Attributes without an explicit value are interpreted as true (e.g., <element attribute/> is parsed as attribute="true").
    • Reason: This behavior aligns with HTML-style parsing, which was necessary to handle HTML attributes without explicit values.
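
The attribute rule above can be sketched with a small stand-alone function. This is a simplified regex-based reader written for illustration, not the package's tokenizer; the name `readAttributes` is hypothetical.

```typescript
// Illustrative only: a tiny attribute reader, not the package's implementation.
function readAttributes(tagBody: string): Record<string, string> {
	const attributes: Record<string, string> = {};
	// Matches name="value", name='value', or a bare name without a value.
	const pattern = /([^\s=\/]+)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'))?/g;
	let match: RegExpExecArray | null;
	while ((match = pattern.exec(tagBody)) !== null) {
		const name = match[1];
		if (name === undefined) continue;
		// A missing value falls back to the string "true" (HTML-style behavior).
		attributes[name] = match[2] ?? match[3] ?? 'true';
	}
	return attributes;
}

console.log(readAttributes('disabled class="btn"'));
```

A strict XML 1.0 parser would reject `disabled` here instead of assigning it a value.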

🚀 Benchmark

The performance of xml-tokenizer was benchmarked against other popular XML parsers. These tests focus on XML to object conversion and node counting. Interestingly, the version of xml-tokenizer imported directly from npm performed significantly better than the local source and dist builds. The reason for this discrepancy is unclear, but the results seem accurate based on external testing.

XML to Object Conversion

| Parser               | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) |
| -------------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ |
| xml-tokenizer        | 46.87                           | 19.47         | 24.57         | 21.33          | ±2.06%                         |
| xml-tokenizer (dist) | 53.70                           | 17.31         | 25.20         | 18.62          | ±3.28%                         |
| xml-tokenizer (npm)  | 163.00                          | 5.03          | 8.50          | 6.13           | ±2.32%                         |
| fast-xml-parser      | 66.00                           | 14.01         | 20.73         | 15.15          | ±3.34%                         |
| txml                 | 234.52                          | 3.38          | 7.61          | 4.26           | ±4.00%                         |
| xml2js               | 36.21                           | 25.58         | 37.28         | 27.61          | ±4.39%                         |
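
To read the table, it may help to note that mean time and throughput are reciprocals (mean ms ≈ 1000 / ops/sec), and that relative speed is just a ratio of ops/sec values. The sketch below uses figures copied from the table; the helper names are made up for this example.

```typescript
// ops/sec values copied from the XML to Object Conversion table.
const opsPerSecond = {
	'xml-tokenizer (npm)': 163.0,
	'fast-xml-parser': 66.0,
	txml: 234.52
} as const;

type BenchmarkedParser = keyof typeof opsPerSecond;

// Mean time per run in milliseconds is the reciprocal of throughput.
function meanTimeMs(parser: BenchmarkedParser): number {
	return 1000 / opsPerSecond[parser];
}

// How many times faster `a` is than `b`, by throughput.
function speedup(a: BenchmarkedParser, b: BenchmarkedParser): number {
	return opsPerSecond[a] / opsPerSecond[b];
}

console.log(meanTimeMs('xml-tokenizer (npm)').toFixed(2));
console.log(speedup('xml-tokenizer (npm)', 'fast-xml-parser').toFixed(2));
console.log(speedup('txml', 'xml-tokenizer (npm)').toFixed(2));
```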

Node Counting

| Parser              | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) |
| ------------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ |
| xml-tokenizer       | 53.03                           | 18.30         | 19.45         | 18.86          | ±0.81%                         |
| xml-tokenizer (npm) | 166.61                          | 5.62          | 7.16          | 6.00           | ±0.88%                         |
| saxen               | 500.99                          | 1.83          | 4.79          | 2.00           | ±1.52%                         |
| sax                 | 64.44                           | 14.96         | 16.34         | 15.52          | ±0.67%                         |

Running the Benchmarks

The benchmarks can be found in the __tests__ directory and can be executed by running:

pnpm run bench

❓ FAQ

Why was the Rust implementation (WASM) removed?

We removed the Rust implementation to improve maintainability and because it didn't provide the expected performance boost.

Calling a TypeScript function from Rust on every token event (wasmMix benchmark) results in slow communication, negating Rust's performance benefits. Parsing XML entirely on the Rust side (wasm benchmark) avoids frequent communication but is still too slow due to the overhead of serializing and deserializing data between JavaScript and Rust (mainly the resulting XML object). While Rust parsing without returning results is faster than any JavaScript XML parser, needing results in the JavaScript layer makes this approach impractical.

The roxmltree package with the Rust implementation can be found in the _deprecated folder (packages/_deprecated/roxmltree_wasm).

| Parser            | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) |
| ----------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ |
| roxmltree:text    | 67.12                           | 14.33         | 16.45         | 14.90          | ±1.27%                         |
| roxmltree:wasmMix | 28.17                           | 34.83         | 36.71         | 35.49          | ±0.91%                         |
| roxmltree:wasm    | 109.30                          | 8.30          | 13.16         | 9.15           | ±3.31%                         |

Why was tokenizer.rs ported to TypeScript?

We ported tokenizer.rs to TypeScript because frequent communication between Rust and TypeScript negated Rust's performance benefits. The stream architecture required constant interaction between Rust and TypeScript via the tokenCallback, reducing overall efficiency.

Why was the byte-based implementation removed?

We removed the byte-based implementation to enhance maintainability and because it didn't provide the expected performance improvement.

Decoding Uint8Array snippets to JavaScript strings is necessary on nearly every token event. This decoding process is slow, making this approach less efficient than working directly with strings.

| Parser         | Operations per Second (ops/sec) | Min Time (ms) | Max Time (ms) | Mean Time (ms) | Relative Margin of Error (rme) |
| -------------- | ------------------------------- | ------------- | ------------- | -------------- | ------------------------------ |
| roxmltree:text | 67.12                           | 14.33         | 16.45         | 14.90          | ±1.27%                         |
| roxmltree:byte | 12.48                           | 78.65         | 83.29         | 80.08          | ±1.15%                         |

The roxmltree package with the Byte-Based implementation can be found in the _deprecated folder (packages/_deprecated/roxmltree_byte-only).
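
The extra step that a byte-based tokenizer pays on each token can be sketched as follows. This is a minimal illustration using the standard `TextEncoder` and `TextDecoder` globals, not code from the deprecated package; the function names and hand-picked offsets are assumptions for the example.

```typescript
const sampleXml = '<p>Hello</p>';

// String-based: a token's text is a plain substring of the input.
function sliceFromString(input: string, start: number, end: number): string {
	return input.slice(start, end);
}

// Byte-based: the same range has to be decoded from UTF-8 before
// a JavaScript consumer can use it as a string.
const utf8Decoder = new TextDecoder();
function sliceFromBytes(input: Uint8Array, start: number, end: number): string {
	return utf8Decoder.decode(input.subarray(start, end));
}

const sampleBytes = new TextEncoder().encode(sampleXml);

// Offsets 3..8 cover the text "Hello". The input is ASCII here, so byte
// and character offsets coincide; that is not true for non-ASCII input.
const fromString = sliceFromString(sampleXml, 3, 8);
const fromBytes = sliceFromBytes(sampleBytes, 3, 8);
console.log(fromString === fromBytes);
```

Both paths produce the same string, but the byte-based path runs a decode call per token, which is the cost described above.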

Why not use a Generator?

While generators can improve developer experience, they introduce significant performance overhead. Our benchmarks show that using a generator dramatically increases the execution time compared to the callback approach. Given our focus on performance, we chose to maintain the callback implementation.

See Generator vs Iterator vs Callback for more details.
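
For readers unfamiliar with the two API shapes, here is a minimal sketch of a callback-based and a generator-based tokenizer over the same input. It uses a toy whitespace tokenizer with hypothetical names, not xml-tokenizer's internals, and it only shows the difference in shape, not the performance gap measured below.

```typescript
type ToyToken = { type: 'Word'; text: string };

// Callback style: the tokenizer pushes each token to the consumer.
function tokenizeWithCallback(input: string, onToken: (token: ToyToken) => void): void {
	for (const text of input.split(/\s+/)) {
		if (text.length > 0) {
			onToken({ type: 'Word', text });
		}
	}
}

// Generator style: the consumer pulls tokens one at a time, and every
// pull goes through the iterator protocol (a next() call plus a result object).
function* tokenizeWithGenerator(input: string): Generator<ToyToken> {
	for (const text of input.split(/\s+/)) {
		if (text.length > 0) {
			yield { type: 'Word', text };
		}
	}
}

const fromCallback: string[] = [];
tokenizeWithCallback('a b  c', (token) => {
	fromCallback.push(token.text);
});

const fromGenerator: string[] = [];
for (const token of tokenizeWithGenerator('a b  c')) {
	fromGenerator.push(token.text);
}

console.log(fromCallback.join(','), fromGenerator.join(','));
```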

Benchmark with Generator

[xml-tokenizer] Total Time: 5345.0000 ms | Average Time per Run: 53.4500 ms | Median Time: 53.0000 ms | Runs: 100
[txml] Total Time: 395.0000 ms | Average Time per Run: 3.9500 ms | Median Time: 4.0000 ms | Runs: 100
[fast-xml-parser] Total Time: 1290.0000 ms | Average Time per Run: 12.9000 ms | Median Time: 13.0000 ms | Runs: 100

Benchmark with Callback

[xml-tokenizer] Total Time: 662.0000 ms | Average Time per Run: 6.6200 ms | Median Time: 6.0000 ms | Runs: 100
[txml] Total Time: 394.0000 ms | Average Time per Run: 3.9400 ms | Median Time: 4.0000 ms | Runs: 100
[fast-xml-parser] Total Time: 1308.0000 ms | Average Time per Run: 13.0800 ms | Median Time: 13.0000 ms | Runs: 100

Benchmark implementation in Vanilla Profiler

💡 Resources