@jrc03c/html-diff
v0.0.3
Intro
This tool helps to find differences between HTML files on a per-element basis in addition to finding differences on a per-line or per-character basis. This makes it easier to discover if two elements are basically identical except that one lacks a class name that the other has, or that one has slightly different textContent than the other, or that they have the same children but in different orders, or that one element has a particular child as an immediate descendant whereas another element has the same child as a deeply-nested descendant, etc.
Installation
For use in Node, bundlers, and the browser:
npm install @jrc03c/html-diff
For use at the command line:
npm install -g @jrc03c/html-diff
Usage
CLI
html-diff file1.html file2.html
Optionally, you can pass a "simple" flag (--simple or -s), which causes the output to be printed in a YAML-ish format that is sometimes a little easier to read than JS objects. For example:
html-diff -s file1.html file2.html
JS
In Node or bundlers:
const { getDifferences } = require("@jrc03c/html-diff")
Or in the browser:
<!--
This defines all of the relevant functions, variables, and objects in the
global scope.
-->
<script src="path/to/dist/html-diff.js"></script>
Then:
console.log(getDifferences(element1, element2))
NOTE: Some of the functions in this library expect HTMLElement inputs. If you're using this library in Node, I recommend that you use jsdom to construct virtual DOMs, and then pass elements from those DOMs into this library's functions. For example:
const { JSDOM } = require("jsdom")
const dom1 = new JSDOM("<div>Hello, world!</div>")
const dom2 = new JSDOM("<div>Goodbye, world!</div>")
console.log(
getDifferences(dom1.window.document.body, dom2.window.document.body)
)
API
DEFAULT_OPTIONS
DEFAULT_OPTIONS is an object that holds all of the constants used in the library's calculations. It has these properties and default values:
- attributeWeight = represents how much element attribute differences should be weighted relative to other differences; has a default value of 1
- childDifferenceWeight = represents how much the total differences between child elements (excluding the order of the children) should be weighted relative to other differences; has a default value of 1
- childOrderWeight = represents how much the child order differences should be weighted relative to other differences; has a default value of 1
- classWeight = represents how much element class differences should be weighted relative to other differences; has a default value of 1
- differencePenalty = represents the power to which all differences should be raised, which is useful for exaggerating differences; has a default value of 1
- idWeight = represents how much element ID differences should be weighted relative to other differences; has a default value of 1
- shouldScoreChildren = represents whether or not child scores should contribute to the overall score; has a default value of true, but can be set to false to compare the given elements as though their children don't exist
- tagNameWeight = represents how much element tag name differences should be weighted relative to other differences; has a default value of 1
- textContentWeight = represents how much element text content differences should be weighted relative to other differences; has a default value of 1
To adjust any of the above properties, reassign their values and then pass the entire object (or a copy of it) into any of the functions below that take an options parameter. Note that the options parameter is optional everywhere it appears below.
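The copy-and-override approach described above might look like the following minimal sketch. So that the snippet runs without the package installed, documentedDefaults is a hand-written stand-in that mirrors the default values listed above; it is not the library's own object. The commented lines at the end assume that DEFAULT_OPTIONS is exported alongside getDifferences.

```javascript
// Stand-in for the library's DEFAULT_OPTIONS, copied from the list above.
const documentedDefaults = {
  attributeWeight: 1,
  childDifferenceWeight: 1,
  childOrderWeight: 1,
  classWeight: 1,
  differencePenalty: 1,
  idWeight: 1,
  shouldScoreChildren: true,
  tagNameWeight: 1,
  textContentWeight: 1,
}

// Copy the defaults and override only what you need, so the shared
// defaults object is left untouched.
const options = {
  ...documentedDefaults,
  classWeight: 2,
  shouldScoreChildren: false,
}

// With the real library (assumption: DEFAULT_OPTIONS is exported):
//   const { DEFAULT_OPTIONS, getDifferences } = require("@jrc03c/html-diff")
//   getDifferences(el1, el2, { ...DEFAULT_OPTIONS, classWeight: 2 })
```

Spreading into a new object rather than reassigning properties in place keeps one comparison's settings from leaking into another.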
getAttributes(el)
Returns a list of objects, each of which represents a single attribute on the element and which has properties of "name" and "value". Does not include "class" or "id" attributes because those are evaluated separately.
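To illustrate the shape of the result, here is a rough re-creation of the behavior described above. It is a sketch, not the library's actual implementation: sketchGetAttributes and fakeElement are hypothetical names, and the plain-object stand-in for an element is used only so that the snippet runs without a DOM.

```javascript
// Hypothetical re-creation of the described behavior; not the library's code.
function sketchGetAttributes(el) {
  return Array.from(el.attributes)
    .filter(attr => attr.name !== "class" && attr.name !== "id")
    .map(attr => ({ name: attr.name, value: attr.value }))
}

// A plain-object stand-in for an element, so this runs without a DOM.
const fakeElement = {
  attributes: [
    { name: "id", value: "main" },
    { name: "class", value: "card wide" },
    { name: "data-role", value: "panel" },
    { name: "title", value: "Hello" },
  ],
}

// Only the "data-role" and "title" entries survive the filter.
const attrs = sketchGetAttributes(fakeElement)
```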
getDifferences(el1, el2, [options])
Returns a list of objects, each of which describes a difference between the two given elements. Each difference object has these properties:
- el1 = the path from the document root to the first element in the relevant pair of conflicting elements
- el2 = the path from the document root to the second element in the relevant pair of conflicting elements
- type = the type of difference between the relevant pair of conflicting elements; can be one of:
  - ATTRIBUTE_DIFFERENCE
  - CHILD_CONTENT_DIFFERENCE
  - CHILD_ORDER_DIFFERENCE
  - CLASS_DIFFERENCE
  - ID_DIFFERENCE
  - ORDER_DIFFERENCE
  - TAG_NAME_DIFFERENCE
  - TEXT_CONTENT_DIFFERENCE
- el1Value = the value in the first element where the difference occurred
- el2Value = the value in the second element where the difference occurred
- attribute = the name of the attribute where the difference occurred; this property is present only when the difference type is ATTRIBUTE_DIFFERENCE
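Because every difference object carries a type, one convenient way to read a long result is to group it by that property. The sketch below does so on hand-written sample data. The sample is illustrative only and is not real output from the library: the path strings, the values, and the use of plain strings for type are all assumptions, and groupByType is a hypothetical helper.

```javascript
// Hand-written sample shaped like the difference objects described above.
// Path format, values, and string-valued `type` are assumptions.
const sampleDifferences = [
  {
    el1: "body > div",
    el2: "body > div",
    type: "CLASS_DIFFERENCE",
    el1Value: "card",
    el2Value: "card wide",
  },
  {
    el1: "body > div > a",
    el2: "body > div > a",
    type: "ATTRIBUTE_DIFFERENCE",
    attribute: "title",
    el1Value: "Hi",
    el2Value: "Hello",
  },
  {
    el1: "body > p",
    el2: "body > p",
    type: "CLASS_DIFFERENCE",
    el1Value: "note",
    el2Value: "",
  },
]

// Hypothetical helper: collects difference objects into arrays keyed by type.
function groupByType(differences) {
  const groups = {}
  for (const diff of differences) {
    if (!groups[diff.type]) groups[diff.type] = []
    groups[diff.type].push(diff)
  }
  return groups
}

const grouped = groupByType(sampleDifferences)
```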
getDiffScore(el1, el2, [options])
Returns a score and list of differences (from getDifferences) between the two given elements. The lowest possible score is 1, in which case the elements are identical.
getMostSimilarElement(el, others, [options])
Given a list of elements called others, returns the element that's most similar to el (when compared using getDiffScore).
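Conceptually, this is a search for the candidate with the lowest score. The sketch below shows that idea in isolation, assuming that a lower score means a closer match. pickMostSimilar and toyScore are hypothetical names, and the toy scoring works on numbers rather than elements so that the snippet runs without a DOM; the library's own implementation may differ.

```javascript
// Hypothetical helper: returns the candidate with the lowest score,
// or null when there are no candidates.
// `scoreFn` stands in for a getDiffScore-style comparison.
function pickMostSimilar(target, others, scoreFn) {
  let best = null
  let bestScore = Infinity
  for (const other of others) {
    const score = scoreFn(target, other)
    if (score < bestScore) {
      bestScore = score
      best = other
    }
  }
  return best
}

// Toy scoring on numbers: 1 for identical inputs, growing with distance.
const toyScore = (a, b) => 1 + Math.abs(a - b)

// Of 3, 12, and 25, the value 12 is closest to 10.
const closest = pickMostSimilar(10, [3, 12, 25], toyScore)
```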
getNonChildTextContent(el)
Returns the text content of the given element that does not include any text content from child elements.
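One way to get this kind of "own text" is to concatenate only the element's direct text nodes. The following is a sketch of that technique, not the library's actual code: sketchNonChildTextContent and fakeParagraph are hypothetical names, and plain objects stand in for DOM nodes so that the snippet runs without a DOM.

```javascript
const TEXT_NODE = 3 // same value as Node.TEXT_NODE in the DOM

// Hypothetical re-creation: joins the text of direct text-node children
// and skips element children entirely.
function sketchNonChildTextContent(el) {
  return Array.from(el.childNodes)
    .filter(node => node.nodeType === TEXT_NODE)
    .map(node => node.textContent)
    .join("")
}

// Stand-in for <p>Hello, <b>big</b> world!</p>
const fakeParagraph = {
  childNodes: [
    { nodeType: 3, textContent: "Hello, " },
    { nodeType: 1, textContent: "big" }, // element node, skipped
    { nodeType: 3, textContent: " world!" },
  ],
}

const ownText = sketchNonChildTextContent(fakeParagraph)
```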
To do
- Write unit tests for the main API functions.
- Implement some dynamic programming features (like a dictionary that holds the differences between two elements so that they don't have to be recalculated multiple times). I'm not actually sure how big of a problem this is, but I do know that the functions recurse quite a bit, so it may make a difference in terms of performance.
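As a rough illustration of the caching idea in the last item, the sketch below memoizes a pairwise comparison with nested WeakMaps, so results are dropped automatically once the objects are garbage collected. memoizePairwise and slowCompare are hypothetical names, and the cache ignores any options argument, which a real version would need to account for.

```javascript
// Hypothetical wrapper: caches compare(a, b) per ordered pair of objects.
function memoizePairwise(compare) {
  const cache = new WeakMap()
  return function (a, b) {
    let inner = cache.get(a)
    if (!inner) {
      inner = new WeakMap()
      cache.set(a, inner)
    }
    if (!inner.has(b)) inner.set(b, compare(a, b))
    return inner.get(b)
  }
}

// Toy comparison that counts how often it actually runs.
let calls = 0
const slowCompare = (a, b) => {
  calls++
  return a.text === b.text ? [] : ["TEXT_CONTENT_DIFFERENCE"]
}

const cachedCompare = memoizePairwise(slowCompare)
const first = { text: "Hello" }
const second = { text: "Goodbye" }

cachedCompare(first, second) // computed
cachedCompare(first, second) // served from the cache
```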