als-document

v1.4.0

Published

4 months ago

A powerful HTML parser & DOM manipulation library for both backend and frontend.

alexsorkin

als-document: HTML Parser & DOM Manipulation Library

Overview

als-document is a library for parsing HTML and XML and for building and manipulating a virtual DOM structure on both the backend and the frontend. It provides an API for querying and interacting with DOM elements using selectors.

Installation

To install the als-document library, use the following npm command:

npm i als-document

Including the Library

The library provides three different files to cater to different module systems:

index.js: This file uses the CommonJS module system. It's suitable for projects using Node.js or bundlers like Browserify or Webpack. The entry point in package.json for this file is "main".

const { parseHTML, Node, Query, TextNode, SingleNode, Root, Document } = require('als-document');

index.mjs: This file uses the ES Modules (ESM) system. It's suitable for modern JavaScript environments that support ESM. The entry point in package.json for this file is "module".

import { parseHTML, Node, Query, TextNode, SingleNode, Root, Document } from 'als-document';

document.js: Including this file in a page creates a constant named alsDocument, which wraps all the exports.

<script src="/node_modules/als-document/document.js"></script>
<script>
   const { parseHTML, Node, Query, TextNode, SingleNode, buildFromCache, cacheDoc, Root, Document } = alsDocument
</script>

Change log for 1.3

added getter and setter for node.innerText
prev and next now work with childIndex=0
querySelector no longer includes the parent
Document has new getters and setters, including clone
tagName is upper case, _tagName is lower case

parseHTML

parseHTML is a function that takes an HTML string and constructs a DOM tree representation from it. It recognizes various HTML elements, such as comments, scripts, styles, and CDATA, and organizes them into nodes that can be manipulated and queried.

API:

parseHTML(html: string) -> Node

Parses an HTML string and returns a tree structure representing its content.

html: The HTML string to parse.
Returns: A Node object representing the root of the parsed HTML content tree.

Expected Outcome:

When using the parseHTML function, the output will be a tree of nodes representing the HTML content. Each node can be one of the following:

Node: A standard HTML element node with tag name, attributes, and child nodes.
SingleNode: Represents self-closing or void HTML elements.
TextNode: Represents text content in the HTML.

Each node will have a tag name, a dictionary of attributes, and a list of child nodes (if applicable).

Examples

const parsedHTML = parseHTML('<div class="container"><img src="image.jpg" alt="Image"/><p>Hello, world!</p></div>');

// The returned `parsedHTML` object will be a tree-like structure. 
// For instance, parsedHTML.childNodes[0] would represent the <div> element, 
// and parsedHTML.childNodes[0].childNodes[0] would represent the <img> element inside it.

const parsedScript = parseHTML('<script>console.log("Hello, world!");</script>');

// The returned `parsedScript` object will contain a `script` Node with a child node 
// holding the JavaScript code as text content.

Remember, the actual tree structure will be more complex and detailed, but the provided examples give you a basic understanding of how to navigate through the parsed result.
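
To make this kind of navigation concrete, here is a minimal sketch of a depth-first walk that prints an outline of a tree. It runs against a hand-built stand-in object, not real parseHTML output. Only tagName and childNodes come from the description above; the text field on leaf nodes and the outline helper are assumptions made for illustration.

```javascript
// Hypothetical stand-in for the tree returned by parseHTML(...).
// Only `tagName` and `childNodes` are taken from the documentation;
// the `text` field on leaf nodes is an assumption made for this sketch.
const sampleTree = {
  childNodes: [
    {
      tagName: 'DIV',
      childNodes: [
        { tagName: 'IMG', childNodes: [] },
        { tagName: 'P', childNodes: [{ text: 'Hello, world!' }] },
      ],
    },
  ],
};

// Collect one indented line per node, depth first.
// Nodes without a tagName are treated as text leaves.
function outline(node, depth = 0, lines = []) {
  for (const child of node.childNodes || []) {
    const label = child.tagName ? child.tagName : '#text';
    lines.push('  '.repeat(depth) + label);
    outline(child, depth + 1, lines);
  }
  return lines;
}

console.log(outline(sampleTree).join('\n'));
```

The same function should work on any object that exposes childNodes in this way, but how the library's own text nodes report themselves is not covered here, so treat the '#text' label as a placeholder.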

Node

Node is a fundamental class that represents an element node in the DOM tree. It provides functionality similar to the native DOM API in browsers, but with its own implementation.

Properties:

tagName: Represents the tag name of the element (upper cased).
_tagName: Represents the tag name of the element (lower cased).
innerText: Getter and setter for the node's text.
attributes: A dictionary of attributes and their values.
childNodes: An array of child nodes for the element.
isSingle: Boolean value to check if the node is a self-closing tag.
parentNode, previousElementSibling, nextElementSibling, children: Navigation properties to move through the DOM tree.
dataset, classList, style: Special properties for interacting with data-* attributes, classes, and inline styles.

Methods:

getAttribute, setAttribute, removeAttribute: Manipulate element's attributes.
remove: Removes the element from its parent.
innerHTML, outerHTML: Get and set the inner or entire HTML of the element.
querySelector, querySelectorAll: Find elements within the node based on CSS-like selectors.
- Limitation: pseudo-selectors such as :first-of-type or :checked are not available.
- Namespaced tags such as some:namespace are supported.
- The shorthand methods $ (for querySelector) and $$ (for querySelectorAll) are also available.
getElementsByClassName, getElementsByTagName, getElementById: Get elements by class, tag, or id respectively.
insertAdjacentElement, insertAdjacentHTML, insertAdjacentText: Insert content relative to the element.
appendChild: Add a child node to the element.
insert(place, element): place is a number (0-3) or a position name (beforebegin, afterbegin, ...); element is raw HTML or an element.

Examples:

const div = new Node('div');
div.setAttribute('class', 'container');

const img = new SingleNode('img', { src: 'image.jpg', alt: 'An image' });
div.appendChild(img);

console.log(div.outerHTML);  // Outputs: <div class="container"><img src="image.jpg" alt="An image"></div>

const p = new Node('p',{},div); // adding as last child to parent div
p.textContent = "Hello, world!";

const foundP = div.querySelector('p');
console.log(foundP.textContent);  // Outputs: Hello, world!

SingleNode

SingleNode extends the Node class and represents elements that have no closing tag (self-closing tags) in HTML. Examples include <img>, <br>, and <!DOCTYPE>. This class has a restricted set of methods and properties, since these elements can't have child nodes.

TextNode

TextNode is a class that represents text content within the DOM. A TextNode holds raw text data and does not have child nodes.

Document node (extends Node)

Has additional getters and setters:

get documentElement
get html
get head
get body
get title
get charset
set title
get clone: returns a new, cloned instance of Document

Query

The Query class is designed to parse CSS selector strings and transform them into a structured object format, providing detailed insights into each selector and its components.

By using the class, one can expect to transform a CSS selector string into an array of objects.

Each object will represent a selector, containing detailed information such as its tag, identifier, classes, attributes, and associated selectors if any. This can be useful for further processing or analysis of CSS selectors in an application.

Example

let q1 = 'html>body>div.tabs~.some[type $= "radio and some"]>p+div>.some-id .tab-content~input[disabled] div.some'
let result = new Query(q1).selectors
let result1 = Query.get(q1)
// result and result1 should be the same
console.log(result)

Result:

[
   {
      "query": "div.some",
      "tag": "div",
      "classList": [
         "some"
      ],
      "ancestors": [
         {
            "query": ".some-id",
            "classList": [
               "some-id"
            ],
            "parents": [
               {
                  "query": "div",
                  "tag": "div"
               }
            ],
            "prev": {
               "query": "p",
               "tag": "p",
               "parents": [
                  {
                     "query": ".some[0]",
                     "classList": [
                        "some"
                     ],
                     "attribs": [
                        {
                           check:(f),
                           "query": "[type$=\"radio and some\"]",
                           "name": "type",
                           "value": "radio and some",
                           "sign": "$="
                        }
                     ]
                  }
               ],
               "prevAny": {
                  "query": "div.tabs",
                  "tag": "div",
                  "classList": [
                     "tabs"
                  ],
                  "parents": [
                     {
                        "query": "html",
                        "tag": "html"
                     },
                     {
                        "query": "body",
                        "tag": "body"
                     }
                  ]
               },
               "group": "html>body>div.tabs~.some[0]>p"
            },
            "group": "html>body>div.tabs~.some[0]>p+div>.some-id"
         },
         {
            "query": "input[1]",
            "tag": "input",
            "attribs": [
               {
                  "query": "[disabled]",
                  "name": "disabled"
               }
            ],
            "prevAny": {
               "query": ".tab-content",
               "classList": [
                  "tab-content"
               ]
            },
            "group": ".tab-content~input[1]"
         }
      ],
      "group": "html>body>div.tabs~.some[type $= \"radio and some\"]>p+div>.some-id .tab-content~input[disabled] div.some"
   }
]
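
The smallest unit in the result above is a single compound selector such as div.tabs, described by its query, tag, and classList. As a rough illustration of that step only, the sketch below parses one compound selector with regular expressions. It is not the library's implementation: parseCompound is a hypothetical helper, the id key name is an assumption, and combinators and attribute selectors are deliberately ignored.

```javascript
// Parse one compound selector (no combinators, no [attr] parts)
// into an object shaped like the entries in the result above.
// `parseCompound` and the `id` key are hypothetical, for illustration only.
function parseCompound(selector) {
  const result = { query: selector };

  // Tag name: must come first; allows namespaced tags like some:namespace.
  const tag = selector.match(/^[a-zA-Z][\w:-]*/);
  if (tag) result.tag = tag[0];

  // Id: the part after '#'.
  const id = selector.match(/#([\w-]+)/);
  if (id) result.id = id[1];

  // Classes: every part after a '.'.
  const classes = [...selector.matchAll(/\.([\w-]+)/g)].map((m) => m[1]);
  if (classes.length) result.classList = classes;

  return result;
}

console.log(parseCompound('div.tabs'));
console.log(parseCompound('.some-id'));
```

For 'div.tabs' this yields a query, a tag of 'div', and a classList of ['tabs'], which mirrors the "div.tabs" entry in the result. Handling combinators (>, +, ~, and the descendant space) is what produces the parents, prev, prevAny, and ancestors fields, and is beyond this sketch.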

Attribs and check function

If an attribute selector has a value, its attrib object contains a check function that takes one parameter: the value to test.

let s = Query.get('[test^="some"]')[0]
console.log(s.attribs[0].check('some value test')) // true
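
To show what such a check function might do internally, here is a minimal sketch that builds a predicate from an operator and an expected value. The semantics follow the standard CSS attribute selector operators. makeCheck is a hypothetical helper, and whether the library supports every operator listed here is an assumption.

```javascript
// Build a predicate for one attribute selector operator.
// Semantics follow standard CSS attribute selectors; this is a sketch,
// not the library's own code.
function makeCheck(sign, expected) {
  switch (sign) {
    case '=':  return (value) => value === expected;           // exact match
    case '^=': return (value) => value.startsWith(expected);   // prefix
    case '$=': return (value) => value.endsWith(expected);     // suffix
    case '*=': return (value) => value.includes(expected);     // substring
    case '~=': return (value) => value.split(/\s+/).includes(expected); // whole word
    case '|=': return (value) => value === expected || value.startsWith(expected + '-');
    default:   throw new Error(`Unsupported operator: ${sign}`);
  }
}

const startsWithSome = makeCheck('^=', 'some');
console.log(startsWithSome('some value test'));  // true
console.log(startsWithSome('a value with some')); // false
console.log(makeCheck('$=', 'radio and some')('a radio and some')); // true
```

Building the predicate once, when the selector is parsed, means each node only needs a single function call during matching.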

buildFromCache and cacheDoc

Building a DOM from raw HTML usually takes tens of milliseconds. Instead, you can build the DOM once and save its cache as regular stringified JSON. Caching and building from the cache each take less than 5 ms and require very few resources.

How does it work?

const html = `` // some real html 255KB
const root = parseHTML(html); // 31.9ms
const cache = cacheDoc(root); // 2.4ms
const root1 = buildFromCache(cache); // 1.2ms
console.log(root.innerHTML === root1.innerHTML) // true
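
The format that cacheDoc actually produces is not documented here. The sketch below only illustrates the general idea of why rebuilding from a cache can be cheaper than parsing: the tree is reduced to plain data, stored as JSON, and restored without tokenizing any HTML. makeNode, toCache, and fromCache are hypothetical helpers that operate on plain objects, not on the library's classes.

```javascript
// Hypothetical plain-object tree with parent links, standing in for a parsed DOM.
function makeNode(tagName, attributes = {}, parent = null) {
  const node = { tagName, attributes, childNodes: [], parentNode: parent };
  if (parent) parent.childNodes.push(node);
  return node;
}

// Serialize: drop parentNode (it is circular) and keep only plain data.
function toCache(root) {
  const strip = (node) => ({
    tagName: node.tagName,
    attributes: node.attributes,
    childNodes: node.childNodes.map(strip),
  });
  return JSON.stringify(strip(root));
}

// Rebuild: JSON.parse, then restore the parent links in a single pass.
function fromCache(cache) {
  const link = (node, parent) => {
    node.parentNode = parent;
    node.childNodes.forEach((child) => link(child, node));
    return node;
  };
  return link(JSON.parse(cache), null);
}

const sketchRoot = makeNode('DIV', { class: 'container' });
makeNode('P', {}, sketchRoot);

const rebuilt = fromCache(toCache(sketchRoot));
console.log(rebuilt.childNodes[0].tagName);                // P
console.log(rebuilt.childNodes[0].parentNode === rebuilt); // true
```

Because the cache is an ordinary string, it can be written to a file or any key-value store and reused across process restarts.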