@mojojs/dom

v2.1.5

Published

a year ago

A fast and minimalistic HTML/XML DOM parser with CSS selectors

Downloads

3,615

0High
0Medium
0Low

mojojs dom html xml css scraping

A fast and very small HTML/XML DOM parser with CSS selectors for Node.js and browsers. Written in TypeScript.

import DOM from '@mojojs/dom';

// Parse
const dom = new DOM('<div><p id="a">Test</p><p id="b">123</p></div>');

// Find
console.log(dom.at('#b').text());
console.log(dom.find('p').map(el => el.text()).join('\n'));
console.log(dom.find('[id]').map(el => el.attr.id).join('\n'));

// Modify
dom.at('div p').append('<p id="c">456</p>');
dom.find(':not(p)').forEach(el => el.strip());

// Render
console.log(dom.toString());

HML and XML

There are currently two input formats supported, HTML and XML fragments. By default we use a very relaxed custom parser that will try to make sense of whatever tag soup you hand it. In HTML mode, all tags and attribute names are lowercased and selectors need to be lowercase as well.

// HTML
const dom = new DOM('<p>Hello World!</p>');

// XML
const dom = new DOM('<rss><link>http://example.com</link></rss>', {xml: true});

Nodes and Elements

When we parse an HTML/XML document or fragment, it gets turned into a tree of nodes.

<!DOCTYPE html>
<html>
  <head><title>Hello</title></head>
  <body>World!</body>
</html>

There are currently eight different kinds of nodes, #cdata, #comment, #doctype, #document, #element, #fragment,#pi, and #text.

#fragment
|- #doctype (html)
+- #element (html)
   |- #element (head)
   |  +- #element (title)
   |     +- #text (Hello)
   +- #element (body)
      +- #text (World!)

While nodes such as #document and #fragment can be represented by DOM objects, features like dom.attr and dom.tag will not work for them.

CSS Selectors

All CSS selectors that make sense for a standalone parser are supported.

| Pattern | --- | * | E | E:not(s1, s2, …) | E:is(s1, s2, …) | E.warning | E#myid | E[foo] | E[foo="bar"] | E[foo="bar" i] | E[foo="bar" s] | E[foo~="bar"] | E[foo^="bar"] | E[foo$="bar"] | E[foo*="bar"] | E:any-link | E:link | E:visited | E:checked | E:root | E:empty | E:nth-child(n [of S]?) | E:nth-last-child(n | E:first-child | E:last-child | E:only-child | E:nth-of-type(n) | E:nth-last-of-type(n) | E:first-of-type | E:last-of-type | E:only-of-type | E:text(string) | E:text(/pattern/i) | E F | E > F | E + F | E ~ F | Represents | | --- | | any element | | an element of type E | | an E element that does not match either compound selector s1 or compound selector s2 | | an E element that matches compound selector s1 and/or compound selector s2 | | an E element belonging to the class warning | | an E element with ID equal to myid | | an E element with a foo attribute | | an E element whose foo attribute value is exactly equal to bar | | an E element whose foo attribute value is exactly equal to any (ASCII-range) case-permutation of bar | | an E element whose foo attribute value is exactly and case-sensitively equal to bar | | an E element whose foo attribute value is a list of whitespace-separated values, one of which is exactly equal to bar | | an E element whose foo attribute value begins exactly with the string bar | | an E element whose foo attribute value ends exactly with the string bar | | an E element whose foo attribute value contains the substring bar | | an E element being the source anchor of a hyperlink | | an E element being the source anchor of a hyperlink of which the target is not yet visited | | an E element being the source anchor of a hyperlink of which the target is already visited | | a user interface element E that is checked/selected (for instance a radio-button or checkbox) | | an E element, root of the document | | an E element that has no children (neither elements nor text) except perhaps white space | | an E element, the n-th child of its parent matching S | [of S]?) | an E element, the n-th child of its parent matching S, counting from the last one | | an E element, first child of its parent | | an E element, last child of its parent | | an E element, only child of its parent | | an E element, the n-th sibling of its type | | an E element, the n-th sibling of its type, counting from the last one | | an E element, first sibling of its type | | an E element, last sibling of its type | | an E element, only sibling of its type | | an E element containing text content that substring matches the given string case-insensitively | | an E element containing text content that regex matches the given pattern | | an F element descendant of an E element | | an F element child of an E element | | an F element immediately preceded by an E element | | an F element preceded by an E element |

All supported CSS4 selectors are considered experimental and might change as the spec evolves.

API

Everything you need to extract information from HTML/XML documents and make changes to the DOM tree.

// Parse HTML
const dom = new DOM('<div class="greeting">Hello World!</div>');

// Render `DOM` object to HTML
const html = dom.toString();

// Create a new `DOM` object with one HTML tag
const div = DOM.newTag('div', {class: 'greeting'}, 'Hello World!');

Navigate the DOM tree with and without CSS selectors.

// Find one element matching the CSS selector and return it as `DOM` object
const div = dom.at('div > p');

// Find all elements marching the CSS selector and return them as `DOM` objects
const divs = dom.find('div > p');

// Get root element as `DOM` object (document or fragment node)
const root = dom.root();

// Get parent element as `DOM` object
const parent = dom.parent();

// Get all ancestor elements as `DOM` objects
const ancestors = dom.ancestors();
const ancestors = dom.ancestors('div > p');

// Get all child elements as `DOM` objects
const children = dom.children();
const children = dom.children('div > p');

// Get all sibling elements before this element as `DOM` objects
const preceding = dom.preceding();
const preceding = dom.preceding('div > p');

// Get all sibling elements after this element as `DOM` objects
const following = dom.following();
const following = dom.following('div > p');

// Get sibling element before this element as `DOM` objects
const previous = dom.previous();

// Get sibling element after this element as `DOM` objects
const next = dom.next();

Extract information and manipulate elements.

// Check if element matches the given CSS selector
const isDiv = dom.matches('div > p');

// Extract text content from element
const greeting = dom.text();
const greeting = dom.text({recursive: true});

// Get element tag
const tag = dom.tag;

// Set element tag
dom.tag = 'div';

// Get element attribute value
const class = dom.attr.class;

// Set element attribute value
dom.attr.class = 'whatever';

// Remove element attribute
delete dom.attr.class;

// Get element attribute names
const names = Object.keys(dom.attr);

// Get element's rendered content
const content = dom.content();

// Get form value
const formValue = dom.at('input').val();
const formValue = dom.at('option').val();
const formValue = dom.at('select').val();
const formValue = dom.at('textarea').val();
const formValue = dom.at('button').val();

// Find this element's namespace
const namespace = dom.namespace();

// Get a unique CSS selector for this element
const selector = dom.selector();

// Remove element and its children
dom.remove();

// Remove element but preserve its children
dom.strip();

// Replace element and its children
dom.replace('<p>Hello World!</p>');

// Replace this element's content
dom.replaceContent('<p>Hello World!</p>');

// Append HTML/XML fragment after this element
dom.append('<p>Hello World!</p>');

// Append HTML/XML fragment to this element's content
dom.appendContent('<p>Hello World!</p>');

// Prepend HTML/XML fragment before this element
dom.prepend('<p>Hello World!</p>');

// Prepend HTML/XML fragment to this element's content
dom.prependContent('<p>Hello World!</p>');

// Wrap HTML/XML fragment around this element
dom.wrap('<div></div>');

// Wrap HTML/XML fragment around the content of this element
dom.wrapContent('<div></div>');

There is also a node level API that you can for example use to extend the DOM class. It is however still in flux, and therefore not fully documented yet.

// Remove comment nodes that are children of this element
dom.currentNode.childNodes
  .filter(node => node.nodeType === '#comment')
  .forEach(node => node.detach());

// Extract text surrounding this element
const text = dom.currentNode.parentNode.childNodes
  .filter(node => node.nodeType === '#text')
  .map(node => node.value)
  .join('');

Custom Parsers

Additional input formats, such as fully spec compliant HTML5 documents can be supported with custom parsers. There is a parse5 based example included in this distribution that we use for testing.

import DOM, {DocumentNode, FragmentNode} from '@mojojs/dom';

// Minimal custom HTML/XML parser that only creates the document/fragment objects
class Parser {
  parse(text) {
    return new DocumentNode();
  }

  parseFragment(text) {
    return new FragmentNode();
  }
}

// Parse HTML with a custom parser
const dom = new DOM('<p>Hello World!</p>', {parser: new Parser()});

Installation

All you need is Node.js 16.0.0 (or newer).

$ npm install @mojojs/dom

Support

If you have any questions the documentation might not yet answer, don't hesitate to ask in the Forum, on Matrix, or IRC.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme