tag-soup

v1.1.1

The fastest pure JS SAX/DOM XML/HTML parser.

Downloads

1,102

Readme

TagSoup 🍜

TagSoup is the fastest pure JS SAX/DOM XML/HTML parser.

  • It is the fastest;
  • Tiny and tree-shakable, just 7 kB gzipped, including dependencies;
  • Streaming support with SAX and DOM parsers for XML and HTML;
  • Extremely low memory consumption;
  • Forgives malformed tag nesting and missing end tags;
  • Parses HTML attributes in the same way your browser does, see tests for more details;
  • Recognizes CDATA, processing instructions, and DOCTYPE.

npm install --save-prod tag-soup

Usage

⚠️ API documentation is available here.

SAX

import {createSaxParser} from 'tag-soup';

// Or use
// import {createXmlSaxParser, createHtmlSaxParser} from 'tag-soup';

const saxParser = createSaxParser({

  startTag(token) {
    console.log(token); // → {tokenType: 1, name: 'foo', …} 
  },

  endTag(token) {
    console.log(token); // → {tokenType: 101, data: 'okay', …} 
  },
});

saxParser.parse('<foo>okay');

The SAX parser invokes callbacks during parsing.

Callbacks receive tokens that represent structures read from the input. Tokens are pooled objects: when a handler callback finishes, its token is returned to the pool and reused. Object pooling drastically reduces memory consumption and allows passing a lot of data to the callback.

If you need to retain a token after the callback finishes, use token.clone(), which returns a deep copy of the token.
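To see why a retained token needs to be cloned, here is a minimal standalone sketch of object pooling. It does not use TagSoup: the TextToken shape and the createPooledEmitter helper are hypothetical, and the real tokens carry more fields.

```typescript
// Hypothetical token shape, for illustration only.
interface TextToken {
  data: string;
  clone(): TextToken;
}

// A one-object "pool": the same token instance is reused for every callback.
function createPooledEmitter(onText: (token: TextToken) => void): (data: string) => void {
  const pooled: TextToken = {
    data: '',
    clone() {
      // The token is flat here, so copying its fields yields an independent copy.
      return {data: this.data, clone: this.clone};
    },
  };
  return (data) => {
    pooled.data = data; // overwrite the pooled object in place
    onText(pooled);
  };
}

const retained: TextToken[] = [];
const cloned: TextToken[] = [];

const emit = createPooledEmitter((token) => {
  retained.push(token); // keeps a reference to the pooled object
  cloned.push(token.clone()); // keeps an independent copy
});

emit('first');
emit('second');

console.log(retained.map((token) => token.data)); // → ['second', 'second']
console.log(cloned.map((token) => token.data)); // → ['first', 'second']
```

Both entries in retained point at the same reused object, so only the cloned tokens preserve what each callback actually received.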

The startTag and endTag callbacks are always invoked in the correct order, even if tags in the input were incorrectly nested or missing. For self-closing tags, only the startTag callback is invoked.
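One common way to guarantee well-ordered callbacks is to track open tags on a stack. The sketch below is a standalone illustration of that idea, not TagSoup's actual implementation; the RawTag type and the balance function are hypothetical names.

```typescript
// Hypothetical raw tag as it appears in the input, before any repair.
type RawTag = {kind: 'start' | 'end'; name: string};

// Returns start/end events in a properly nested order.
function balance(tags: RawTag[]): string[] {
  const openTags: string[] = [];
  const events: string[] = [];

  for (const tag of tags) {
    if (tag.kind === 'start') {
      openTags.push(tag.name);
      events.push(`start:${tag.name}`);
      continue;
    }
    const index = openTags.lastIndexOf(tag.name);
    if (index === -1) {
      continue; // orphan end tag with no matching start tag: ignore it
    }
    // Close every tag opened after the matching start tag, then the match itself.
    while (openTags.length > index) {
      events.push(`end:${openTags.pop()}`);
    }
  }

  // Close whatever is still open at the end of the input.
  while (openTags.length > 0) {
    events.push(`end:${openTags.pop()}`);
  }
  return events;
}

// <a><b></a> has a missing </b>, and <foo> is never closed.
console.log(
  balance([
    {kind: 'start', name: 'a'},
    {kind: 'start', name: 'b'},
    {kind: 'end', name: 'a'},
    {kind: 'start', name: 'foo'},
  ])
);
// → ['start:a', 'start:b', 'end:b', 'end:a', 'start:foo', 'end:foo']
```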

Defaults

All SAX parser factories accept two arguments: a handler with callbacks, and options. The most generic parser factory, createSaxParser, doesn't have any defaults.

createXmlSaxParser defaults to xmlParserOptions:

  • CDATA sections, processing instructions and self-closing tags are recognized;
  • XML entities are decoded in text and attribute values;
  • Tag and attribute names are preserved as is;

createHtmlSaxParser defaults to htmlParserOptions:

  • CDATA sections and processing instructions are treated as comments;
  • Self-closing tags are treated as start tags;
  • Tags like p, li, td and others follow implicit end rules, so <p>foo<p>bar is parsed as <p>foo</p><p>bar</p>;
  • Tag and attribute names are converted to lower case;
  • Legacy HTML entities are decoded in text and attribute values.
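The implicit end rule can be pictured with a small standalone sketch. This is an illustration under simplifying assumptions, not TagSoup's code: the HtmlToken type, the implicitEnds table and the serialize function are hypothetical, and only the innermost open tag is checked.

```typescript
type HtmlToken =
  | {kind: 'start'; name: string}
  | {kind: 'text'; data: string};

// Hypothetical rule table: opening the key tag closes the innermost open tag
// if that tag is in the list.
const implicitEnds: Partial<Record<string, string[]>> = {
  p: ['p'],
  li: ['li'],
  td: ['td', 'th'],
};

// Re-serializes tokens, inserting the end tags that the input left out.
function serialize(tokens: HtmlToken[]): string {
  const openTags: string[] = [];
  let output = '';

  for (const token of tokens) {
    if (token.kind === 'text') {
      output += token.data;
      continue;
    }
    const closes = implicitEnds[token.name] ?? [];
    const innermost = openTags[openTags.length - 1];
    if (innermost !== undefined && closes.indexOf(innermost) !== -1) {
      openTags.pop();
      output += `</${innermost}>`; // implicit end tag
    }
    openTags.push(token.name);
    output += `<${token.name}>`;
  }

  // Close the tags that are still open at the end of the input.
  while (openTags.length > 0) {
    output += `</${openTags.pop()}>`;
  }
  return output;
}

// Tokens for the input <p>foo<p>bar
console.log(
  serialize([
    {kind: 'start', name: 'p'},
    {kind: 'text', data: 'foo'},
    {kind: 'start', name: 'p'},
    {kind: 'text', data: 'bar'},
  ])
);
// → '<p>foo</p><p>bar</p>'
```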

You can alter how the parser works through options which give you fine-grained control over parsing dialect.

By default, TagSoup uses speedy-entities to decode XML and HTML entities. A parser created by createHtmlSaxParser decodes only legacy HTML entities, which keeps the bundle size small.

To decode all HTML entities, use the snippet below. It adds 10 kB gzipped to the bundle size.

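import {createHtmlSaxParser} from 'tag-soup';
import {decodeHtml} from 'speedy-entities/lib/full';

// The handler comes first, the options that override the defaults second.
const htmlParser = createHtmlSaxParser({/*callbacks*/}, {
  // Decode the full set of HTML entities instead of only the legacy ones.
  decodeText: decodeHtml,
  decodeAttribute: decodeHtml,
});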

With speedy-entities you can create a custom decoder that recognizes custom entities.
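To illustrate what a custom decoder does, here is a minimal hand-written one. It deliberately does not use the speedy-entities API: createEntityDecoder and the entity names passed to it are hypothetical. Assuming the decodeText and decodeAttribute options accept any function from string to string, a decoder of this shape could be passed in their place.

```typescript
// Builds a decoder for numeric references and the given named entities.
// Unknown entities are left untouched.
function createEntityDecoder(entities: Record<string, string>): (input: string) => string {
  const pattern = /&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi;

  return (input) =>
    input.replace(pattern, (match: string, body: string) => {
      if (body[0] === '#') {
        const isHex = body[1] === 'x' || body[1] === 'X';
        const codePoint = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
        // Code points above U+10FFFF are invalid, so keep the source text.
        return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
      }
      return Object.prototype.hasOwnProperty.call(entities, body) ? entities[body] : match;
    });
}

// "wtfisthis" is a made-up custom entity.
const decode = createEntityDecoder({amp: '&', lt: '<', gt: '>', wtfisthis: '!!!'});

console.log(decode('&wtfisthis; &lt;b&gt; &#x41;&#66; &unknown;'));
// → '!!! <b> AB &unknown;'
```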

The legacy HTML entities are:

aacute Aacute acirc Acirc acute aelig AElig agrave Agrave amp AMP aring Aring atilde Atilde auml Auml brvbar ccedil Ccedil cedil cent copy COPY curren deg divide eacute Eacute ecirc Ecirc egrave Egrave eth ETH euml Euml frac12 frac14 frac34 gt GT iacute Iacute icirc Icirc iexcl igrave Igrave iquest iuml Iuml laquo lt LT macr micro middot nbsp not ntilde Ntilde oacute Oacute ocirc Ocirc ograve Ograve ordf ordm oslash Oslash otilde Otilde ouml Ouml para plusmn pound quot QUOT raquo reg REG sect shy sup1 sup2 sup3 szlig thorn THORN times uacute Uacute ucirc Ucirc ugrave Ugrave uml uuml Uuml yacute Yacute yen yuml

Streaming

SAX parsers support streaming. You can use saxParser.write(chunk) to parse input data chunk by chunk.

const saxParser = createSaxParser({/*callbacks*/});

saxParser.write('<foo>ok');
// Triggers startTag callback for "foo" tag.

saxParser.write('ay');
// Doesn't trigger any callbacks.

saxParser.write('</foo>');
// Triggers text callback for "okay" and endTag callback for "foo" tag.
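The reason the second write triggers no callbacks is that trailing text may continue in the next chunk, so it has to stay buffered until a tag follows. The standalone sketch below shows that buffering idea in a toy tokenizer. It is not TagSoup's implementation: createStreamingTokenizer, StreamHandler and the end() method are hypothetical, and attributes, comments and entities are ignored.

```typescript
interface StreamHandler {
  startTag?(name: string): void;
  endTag?(name: string): void;
  text?(data: string): void;
}

function createStreamingTokenizer(handler: StreamHandler) {
  let buffer = '';

  function drain(isFinal: boolean): void {
    while (buffer.length > 0) {
      if (buffer[0] === '<') {
        const closeIndex = buffer.indexOf('>');
        if (closeIndex === -1) {
          return; // incomplete tag: wait for more input
        }
        const body = buffer.slice(1, closeIndex);
        buffer = buffer.slice(closeIndex + 1);
        if (body[0] === '/') {
          handler.endTag?.(body.slice(1));
        } else {
          handler.startTag?.(body);
        }
      } else {
        const openIndex = buffer.indexOf('<');
        if (openIndex === -1 && !isFinal) {
          return; // the text may continue in the next chunk
        }
        const textEnd = openIndex === -1 ? buffer.length : openIndex;
        handler.text?.(buffer.slice(0, textEnd));
        buffer = buffer.slice(textEnd);
      }
    }
  }

  return {
    write(chunk: string): void {
      buffer += chunk;
      drain(false);
    },
    end(): void {
      drain(true); // flush the remaining text
    },
  };
}

const events: string[] = [];
const tokenizer = createStreamingTokenizer({
  startTag: (name) => events.push(`start:${name}`),
  endTag: (name) => events.push(`end:${name}`),
  text: (data) => events.push(`text:${data}`),
});

tokenizer.write('<foo>ok'); // events: start:foo
tokenizer.write('ay'); // no new events, "okay" is still buffered
tokenizer.write('</foo>'); // events: start:foo, text:okay, end:foo
console.log(events); // → ['start:foo', 'text:okay', 'end:foo']
```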

DOM

import {createDomParser} from 'tag-soup';

// Or use
// import {createXmlDomParser, createHtmlDomParser} from 'tag-soup';

// Minimal DOM handler example
const domParser = createDomParser<any>({

  element(token) {
    return {tagName: token.name, children: []};
  },

  appendChild(parentNode, node) {
    parentNode.children.push(node);
  },
});

const domNode = domParser.parse('<foo>okay');

console.log(domNode[0].children[0].data); // → 'okay'

The DOM parser assembles a node tree using a handler that describes how nodes are created and appended.

The generic parser factory createDomParser requires a handler to be provided.

Both createXmlDomParser and createHtmlDomParser use domHandler if no other handler was provided and use default options (xmlParserOptions and htmlParserOptions respectively) which can be overridden.
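The division of labor between a parser and its handler can be sketched without TagSoup: the builder only tracks ancestors, while the handler decides what a node is and how children are attached. Everything below (ParseEvent, TreeHandler, buildTree, PlainNode) is a hypothetical illustration, and the handler methods take plain strings rather than the library's tokens.

```typescript
type ParseEvent =
  | {kind: 'start'; name: string}
  | {kind: 'end'}
  | {kind: 'text'; data: string};

interface TreeHandler<N> {
  element(name: string): N;
  text(data: string): N;
  appendChild(parent: N, child: N): void;
}

// Assembles a tree of N nodes; N is whatever the handler chooses to create.
function buildTree<N>(parseEvents: ParseEvent[], handler: TreeHandler<N>): N[] {
  const roots: N[] = [];
  const ancestors: N[] = [];

  const attach = (node: N): void => {
    if (ancestors.length === 0) {
      roots.push(node);
    } else {
      handler.appendChild(ancestors[ancestors.length - 1], node);
    }
  };

  for (const parseEvent of parseEvents) {
    if (parseEvent.kind === 'start') {
      const node = handler.element(parseEvent.name);
      attach(node);
      ancestors.push(node);
    } else if (parseEvent.kind === 'end') {
      ancestors.pop();
    } else {
      attach(handler.text(parseEvent.data));
    }
  }
  return roots;
}

interface PlainNode {
  tagName?: string;
  data?: string;
  children: PlainNode[];
}

// Events for the input <foo>okay</foo>
const tree = buildTree<PlainNode>(
  [{kind: 'start', name: 'foo'}, {kind: 'text', data: 'okay'}, {kind: 'end'}],
  {
    element: (name) => ({tagName: name, children: []}),
    text: (data) => ({data, children: []}),
    appendChild: (parent, child) => {
      parent.children.push(child);
    },
  }
);

console.log(tree[0].children[0].data); // → 'okay'
```

Swapping the handler is enough to target a different node representation, which is the design the handler-based factories rely on.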

Streaming

DOM parsers support streaming. You can use domParser.write(chunk) to parse input data chunk by chunk.

const domParser = createXmlDomParser();

domParser.write('<foo>ok');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]

domParser.write('ay');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]

domParser.write('</foo>');
// → [{nodeType: 1, tagName: 'foo', children: [{nodeType: 3, data: 'okay', …}], …}]

Performance

To run a performance test use npm ci && npm run build && npm run perf.

Large input

Performance was measured by parsing a 3.81 MB HTML file.

Results are in operations per second. Higher is better.

SAX benchmark

| Parser | Ops/sec |
| --- | ---: |
| createSaxParser ¹ | 36.3 ± 0.8% |
| createXmlSaxParser ¹ | 30.7 ± 0.5% |
| createHtmlSaxParser ¹ | 23.7 ± 0.5% |
| createSaxParser | 29.2 ± 0.5% |
| createXmlSaxParser | 26.1 ± 0.5% |
| createHtmlSaxParser | 19.9 ± 0.5% |
| @fb55/htmlparser2 | 14.3 ± 0.5% |
| @isaacs/sax-js | 1.7 ± 4.6% |

¹ Parsers were provided a handler with a single text callback. This configuration can be useful if you want to strip tags from the input.

DOM benchmark

| Parser | Ops/sec |
| --- | ---: |
| createDomParser | 13.7 ± 0.5% |
| createXmlDomParser | 12.6 ± 0.5% |
| createHtmlDomParser | 10.6 ± 0.5% |
| @fb55/htmlparser2 | 8.4 ± 0.5% |
| @inikulin/parse5 | 2.8 ± 0.7% |

Small input

Performance was measured by parsing 258 files, 95 kB in size on average, from htmlparser-benchmark.

Results are in operations per second. Higher is better.

SAX benchmark

| Parser | Ops/sec |
| --- | ---: |
| createSaxParser | 1 998.0 ± 0.1% |
| createXmlSaxParser | 1 734.1 ± 0.1% |
| createHtmlSaxParser | 1 285.4 ± 0.1% |
| @fb55/htmlparser2 | 717.5 ± 0.2% |

DOM benchmark

| Parser | Ops/sec |
| --- | ---: |
| createDomParser | 1 087.1 ± 0.2% |
| createXmlDomParser | 853.5 ± 0.2% |
| createHtmlDomParser | 668.0 ± 0.2% |
| @fb55/htmlparser2 | 457.7 ± 0.2% |
| @inikulin/parse5 | 50.8 ± 0.4% |
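To put the tables in relative terms, the ops/sec figures can be divided by a baseline. The sketch below does this for the HTML parser rows (createHtmlSaxParser and createHtmlDomParser) against @fb55/htmlparser2. The numbers are copied from the tables above; the speedup helper is a hypothetical name, and the error margins are ignored.

```typescript
// Ops/sec figures copied from the benchmark tables.
const results = [
  {benchmark: 'SAX, large input', tagSoup: 19.9, htmlparser2: 14.3},
  {benchmark: 'DOM, large input', tagSoup: 10.6, htmlparser2: 8.4},
  {benchmark: 'SAX, small input', tagSoup: 1285.4, htmlparser2: 717.5},
  {benchmark: 'DOM, small input', tagSoup: 668.0, htmlparser2: 457.7},
];

// Ratio of two throughput figures, rounded to two decimals.
function speedup(candidate: number, baseline: number): string {
  return (candidate / baseline).toFixed(2) + '×';
}

for (const row of results) {
  console.log(`${row.benchmark}: ${speedup(row.tagSoup, row.htmlparser2)}`);
}
// SAX, large input: 1.39×
// DOM, large input: 1.26×
// SAX, small input: 1.79×
// DOM, small input: 1.46×
```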

Limitations

TagSoup doesn't reproduce every element structure that a browser builds when it recovers from malformed HTML.

For example, assume the following markup:

<p><strong>okay
<p>nope

With the browser's DOMParser, this markup would be transformed to:

<p><strong>okay</strong></p>
<p><strong>nope</strong></p>

TagSoup doesn't insert the second strong tag:

<p><strong>okay</strong></p>
<p>nope</p> <!-- Note the absent "strong" tag  -->