htmlscanner

v0.7.0

A fast C++ HTML scanner that can also parse badly formed HTML

Introduction

HTMLScanner is a fast HTML/XML scanner/tokenizer for Node.js. The scanner tries to be forgiving and is ideal for messy HTML documents. It should parse most HTML files and, of course, valid XML files as well.

Please note there is no explicit support for namespaces. If you need a full blown XML parser, there are already many good alternatives available for Node.js.

The core of the scanner module is a fast C++ module, about 80% of which is based on the excellent XHScanner created by Andrew Fedoniouk; see also [http://www.codeproject.com/KB/recipes/HTML_XML_Scanner.aspx]. Without XHScanner, HTMLScanner would not be here today.

Installation

Just run the npm install command:

$ npm install htmlscanner

Or if you like to do it yourself:

$ git clone git@github.com:jbaron/htmlscanner.git
$ cd htmlscanner
$ node-waf configure build install

You should now have a file called htmlscanner.node in the lib directory. We use node-waf to build this module. Please note that older versions of node-waf use a different build directory; in that case you should find the file somewhere under the build/default directory. There are also some simple test cases included with this module. For example, run:

$ node test/test_simple.js

Usage

Usage is straightforward:

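var Scanner = require("../lib/htmlscanner").Scanner;
var scanner = new Scanner("<div id=12 class=important>hello</div>");
var token; // declared up front so it does not leak into the global scope
do {
	token = scanner.next();
	console.dir(token);
} while (token[0]); // type 0 (END OF FILE) ends the loop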

The token returned by each scanner.next() call is an array that contains all the information about that token. The sample above produces the following output:

[1,"div","id","12","class","important"] // Type 1 indicates OPEN TAG. Attribute key/value pairs are also included.
[4,"hello"]                             // Type 4 indicates TEXT
[2,"div"]                               // Type 2 indicates CLOSE TAG
[0]                                     // Type 0 indicates END OF FILE

The first element in the array is the token type. The remaining elements depend on that type.
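
Because every token is a flat array, a small helper can turn it into something easier to work with. The sketch below is not part of HTMLScanner: `describeToken` and the type names are hypothetical, and only types 0, 1 and 2 from the sample output are given names. It assumes that attribute keys and values alternate after the tag name, as they do in the sample.

```javascript
// Hypothetical helper, not part of HTMLScanner: turns a flat token array
// into an object. Only the unambiguous types from the sample are named.
var TYPE_NAMES = { 0: "EOF", 1: "OPEN_TAG", 2: "CLOSE_TAG" };

function describeToken(token) {
	var type = token[0];
	var result = { type: type, name: TYPE_NAMES[type] || "OTHER" };
	if (type === 1) {
		// Assumed layout: [1, tagName, key1, value1, key2, value2, ...]
		result.tag = token[1];
		result.attributes = {};
		for (var i = 2; i + 1 < token.length; i += 2) {
			result.attributes[token[i]] = token[i + 1];
		}
	} else if (type === 2) {
		result.tag = token[1];
	} else if (token.length > 1) {
		// For any other type, keep the remaining elements untouched.
		result.payload = token.slice(1);
	}
	return result;
}

console.dir(describeToken([1, "div", "id", "12", "class", "important"]));
```

Keeping the raw array from scanner.next() and only converting it when needed avoids allocating an object for every token in a large document.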

TODO

There are several things still to do:

  • Entity decoding of text. Although much of the code is already there, it is not yet Unicode ready.
  • Add routines for entity encoding.
  • Add support for Buffers. Right now only Strings are supported.
  • Add some additional robustness checks.
  • Compile on platforms other than Linux. The code should be portable, but it has never been tested anywhere else. So if you have success compiling and using this on OSX or Windows, please let us know.
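
Until Buffer support is added, one possible workaround is to decode a Buffer into a String before handing it to the scanner. This is a minimal sketch that assumes the data is UTF-8 encoded; `toScannerInput` is a hypothetical helper name, not part of the module.

```javascript
// Hypothetical helper: the scanner only accepts Strings, so Buffers are
// decoded first. Assumes UTF-8 unless another encoding is passed in.
function toScannerInput(input, encoding) {
	if (Buffer.isBuffer(input)) {
		return input.toString(encoding || "utf8");
	}
	return String(input);
}

var html = toScannerInput(Buffer.from("<p>hello</p>", "utf8"));
console.log(html);
```

The resulting string can then be passed to the Scanner constructor in the same way as in the Usage section.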

Background

There is not much that cannot be done in plain JavaScript. The Chrome team did a great job making the V8 engine a very fast JavaScript solution. However, one area that can become a bottleneck is iterating over a String one character at a time, for example when performing encodings or parsing XML Strings. To be honest, this problem is not specific to JavaScript. When you profile a highly optimized Java program that does a lot of XML parsing and serializing, you see the same types of methods at the top of the CPU usage. So for these types of operations, this library contains a set of optimized C/C++ modules to speed up these tasks within V8.
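
To make that kind of loop concrete, here is a minimal sketch of a character-at-a-time encoder written in plain JavaScript. It is an illustration only, not code from this library; `escapeXml` is a hypothetical name, and the sketch makes no claim about actual performance.

```javascript
// Illustration only: walks the input one character at a time and
// replaces the characters that are special in XML with entities.
function escapeXml(text) {
	var out = "";
	for (var i = 0; i < text.length; i++) {
		var ch = text.charAt(i);
		switch (ch) {
			case "&": out += "&amp;"; break;
			case "<": out += "&lt;"; break;
			case ">": out += "&gt;"; break;
			case '"': out += "&quot;"; break;
			default: out += ch;
		}
	}
	return out;
}

console.log(escapeXml('<a href="x">1 & 2</a>'));
```

Every character passes through the loop body, which is why this style of code tends to show up in profiles of parsing-heavy and encoding-heavy programs.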