page-parser-tree
v0.4.0
Library to find elements in a dynamic web page
This module provides a declarative and robust way to recognize elements on a dynamic webpage. This is useful for building browser extensions that offer rich integration into complex and ever-changing web applications.
When a PageParserTree is instantiated, you provide it with a document or HTML element reference to use as the root and an options object describing how to identify specific types of elements on the page to be tagged with a given identifier. PageParserTree will then produce a TagTree instance of all of the tagged elements found in the page. The TagTree instance will be kept up-to-date with the page's contents through use of MutationObservers.
To identify elements to tag, the primary method is to specify a "Watcher", which uses a CSS selector-like syntax. A watcher specifies a tag name, a list of sources including either the root element or previously-tagged elements to initialize the matched set to, and an array of PageParserTree selectors used to transform the matched set to the set of elements to tag. The PageParserTree selectors may take advantage of MutationObservers so that the page is watched for changes and new elements can be found on the page before the browser has rendered them to the screen. This means a browser extension can react to an element and enhance it before it has appeared on the screen, preventing any visible after-load pop-in effect.
Additionally, a "Finder" is an alternate and more adaptable method of identifying elements to tag. It may be specified in addition to a "Watcher" to provide redundancy, or by itself if the responsiveness of a "Watcher" isn't necessary. To specify a "Finder", you write a function which takes the root element and returns an array (or array-like object) of all elements to tag, and this function will be called on an interval to look for elements on the page to tag. MutationObservers are not used here, so there will likely be a user-noticeable delay between an element appearing on the page and the PageParserTree (and therefore your application) reacting to it.
Finders are best used in addition to Watchers as a fallback method to pick up any elements missed by the Watchers. Watchers tend to be closely tied to the known structure of the page, and may be brittle if the web application is updated or variations of the structure are missed by a browser extension developer. Finders are easier to make robust to variations in the structure of the page (you can use the querySelectorAll method to find elements anywhere on the page matching some rule), but they don't have the immediate responsiveness of Watchers. Using them together creates a graceful degradation route for when a web application's page structure exhibits unforeseen variations.
A PageParserTree instance has a tree property which is an instance of a TagTree. A TagTree has methods such as getAllByTag(tag), which returns a LiveSet of TagTreeNodes. A TagTreeNode has a getValue() method to get the element it contains, and getParent() and getOwnedByTag(tag) methods to retrieve related TagTreeNodes as described in TagTree's documentation.
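These accessors can be combined into small helpers. The following is a minimal sketch rather than part of the library: ownerTags is a hypothetical helper name, and it assumes that getParent() returns null (or undefined) once the root node is reached, which this page does not spell out. It relies only on getParent() and on getTag(), both of which appear in the example on this page.

```javascript
// Hypothetical helper (not part of page-parser-tree): lists the tags of a
// node's chain of owners, nearest owner first.
// Assumption: getParent() returns null or undefined for the root node.
function ownerTags(node) {
  const tags = [];
  for (let owner = node.getParent(); owner != null; owner = owner.getParent()) {
    tags.push(owner.getTag());
  }
  return tags;
}
```

For a message node that is owned by a thread node, the first entry of the returned array would be 'thread'.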
Example
import PageParserTree from 'page-parser-tree';

const page = new PageParserTree(document, {
  tags: {
    message: {
      ownedBy: ['thread']
    },
  },
  watchers: [
    {sources: [null], tag: 'thread', selectors: [
      'body',
      'div.page',
      'div.mainPanel',
      'div.thread'
    ]},
    {sources: ['thread'], tag: 'message', selectors: [
      'div.threadFooter',
      'div.replyArea',
      // Ignore and don't recurse into the div.replyArea element until the web
      // page changes its style attribute so that it's not hidden.
      {$watch: {
        attributeFilter: ['style'],
        cond: element => element.style.display !== 'none'
      }},
      'div.message'
    ]},
  ],
  finders: {
    thread: {
      fn: root => root.querySelectorAll('div.thread')
    },
    message: {
      fn: root => root.querySelectorAll('div.message')
    },
  }
});

// allMessages is a LiveSet of TagTreeNodes pointing to the message elements
const allMessages = page.tree.getAllByTag('message');

// We can inspect its current values.
allMessages.values().forEach(node => {
  const messageElement = node.getValue();
  console.log('found message element', messageElement);

  // The "message" tag was listed as being owned by the "thread" tag, so if
  // this message element is inside an element tagged as "thread", then we can
  // access that thread element.
  const ownerNode = node.getParent();

  // The watcher we gave to find message elements would only find ones contained
  // by threads, but the finder could find a message element not contained by a
  // thread. If for example the web application was updated to have thread
  // elements contain a class name other than "thread", then the watchers would
  // fail to find threads or messages, and the finders would only find messages
  // that are owned by the tree root instead of by threads.
  if (ownerNode.getTag() === 'thread') {
    const threadElement = ownerNode.getValue();
    console.log('message owned by thread', threadElement);

    // From a node, you can also retrieve the nodes it owns. messagesOfThread
    // is also a LiveSet of TagTreeNodes like allMessages.
    const messagesOfThread = ownerNode.getOwnedByTag('message');
    console.log('message is one of', messagesOfThread.values().size, 'messages in thread');
  }
});

// We can also subscribe to changes in a LiveSet's values.
allMessages.subscribe(changes => {
  changes.forEach(change => {
    if (change.type === 'add') {
      console.log('added message element', change.value.getValue());
    } else if (change.type === 'remove') {
      console.log('removed message element', change.value.getValue());
    }
  });
});

// If we just want to call some callbacks for every present and future message
// and when they're removed, then we can use a handy helper from the LiveSet
// library:
import toValueObservable from 'live-set/toValueObservable';
toValueObservable(allMessages).subscribe(({value, removal}) => {
  const messageElement = value.getValue();
  console.log('found message element', messageElement);
  removal.then(() => {
    console.log('message element removed from page', messageElement);
  });
});
API
Functions
PageParserTree::constructor
PageParserTree::constructor(root: Document|HTMLElement, options: PageParserTreeOptions)
This creates a new PageParserTree instance which will immediately start populating a TagTree instance based on the options given. See the PageParserTreeOptions for a full description of the options parameter.
PageParserTree::tree
PageParserTree::tree: TagTree<HTMLElement>
This property contains the TagTree instance that the tagged elements can be accessed from. See the documentation of TagTree for information about the API of TagTree instances.
PageParserTree::dump
PageParserTree::dump(): void
This causes the PageParserTree instance to halt all of its Watchers and Finders, to empty out the tree TagTree as if all of the tagged elements were removed from the page, and to end all of the TagTree's LiveSets so that they no longer keep references to their subscribers. This function is useful if you are performing a clean shutdown of the browser extension while letting the web page continue to operate.
PageParserTree::replaceOptions
PageParserTree::replaceOptions(options: PageParserTreeOptions): void
This replaces the options object that the PageParserTree was instantiated with. This is mainly intended for use in development with hot module replacement to allow live-editing of the options within a running page.
Currently this method has some limitations:
- The tree TagTree will be emptied out as if all elements were removed from the page, and then the Watchers and Finders specified in the new options are started from scratch.
- An error will be thrown if the set of tags or any of their ownedBy lists change.
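Because replaceOptions throws when the set of tags changes, a hot-reload handler may want to guard the call. A minimal sketch, assuming only the documented replaceOptions(options) method; tryReplaceOptions is a hypothetical helper name, and what to do on failure (for example, reloading the page) is left to the caller:

```javascript
// Hypothetical helper: applies new options and reports whether that worked.
function tryReplaceOptions(page, newOptions) {
  try {
    page.replaceOptions(newOptions);
    return true;
  } catch (err) {
    // Thrown, for example, if the set of tags or an ownedBy list changed.
    console.warn('Could not replace PageParserTree options in place:', err.message);
    return false;
  }
}
```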
PageParserTreeOptions
The PageParserTreeOptions specifies the Watchers and Finders used to populate the TagTree and other options.
PageParserTreeOptions::logError
PageParserTreeOptions::logError(err: Error, el: ?HTMLElement): void
This is an optional property specifying a function to be called if PageParserTree encounters an error. It will be passed an Error object and optionally an HTMLElement if one is relevant.
The main reason PageParserTree will call logError is if there are Watchers and a Finder for a tag and they are inconsistent with each other. The error message will include the name of the tag, and the element which was missed by one of them will be passed to logError.
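A logError function could be as small as the sketch below. The names describeParserError and logError are hypothetical, and the message format is only an illustration; the one thing taken from this page is the (err, el) signature, where el may be absent.

```javascript
// Hypothetical formatter: turns the logError arguments into one line of text.
function describeParserError(err, el) {
  const where = el
    ? `<${el.tagName.toLowerCase()} class="${el.className}">`
    : '(no element)';
  return `PageParserTree error: ${err.message} ${where}`;
}

// Intended to be passed as the logError property of the options object.
function logError(err, el) {
  console.error(describeParserError(err, el));
}
```

In a real extension, the same function might also forward the message to an error-reporting service so that Watchers that have gone stale are noticed quickly.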
PageParserTreeOptions::tags
PageParserTreeOptions::tags: {[tag:string]: TagOptions}
The tags property is required and must be an object. Each property must be a tag name with a value containing a TagOptions object. Not all tags need to have an entry here; it's legal to pass an empty object as the tags property.
TagOptions is an object that has an optional ownedBy property which may be an array of strings referring to other tag names. Each node in a TagTree is owned by another node, defaulting to the root node. If you specify any tag names in the ownedBy array, then any node of this tag will be owned by the node of the closest ancestor with a tag in the ownedBy array if any are present.
A tag may own itself; this is useful to represent hierarchical user-interfaces such as comment trees on reddit where a comment element may be the owner of its direct replies.
It is an error to pass options for a tag name that has no Watchers or Finders.
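As an illustration of a self-owning tag, the sketch below describes a comment tree. The div.comment class name is an assumption about a hypothetical page, and using only a Finder (with an empty watchers array) is a simplification chosen for brevity.

```javascript
// Hypothetical options for a comment tree; "div.comment" is an assumption
// about the page, not something defined by the library.
const commentTreeOptions = {
  tags: {
    // A comment is owned by its closest ancestor that is also tagged
    // "comment". Top-level comments have no such ancestor and are owned by
    // the root node.
    comment: {ownedBy: ['comment']}
  },
  watchers: [],
  finders: {
    // Every tag listed in "tags" needs at least one Watcher or Finder.
    comment: {fn: root => root.querySelectorAll('div.comment')}
  }
};
```

With options like these, calling getOwnedByTag('comment') on a comment's node would be expected to yield that comment's direct replies.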
PageParserTreeOptions::finders
PageParserTreeOptions::finders: {[tag:string]: Finder}
The finders property is required and must be an object. Each property names a tag, and the value is a Finder object.
A Finder object has an fn property which must be a function. The fn function must take an HTMLElement representing the root element of the PageParserTree, and it must return an Array or Array-like object of the HTMLElements to tag.
A Finder object may have an interval property controlling how often in milliseconds the Finder function is to be called. The interval property defaults to 5000. The Finder function may be called less often than this depending on page and user activity.
Alternatively, interval may be a function that returns a number. The function will be passed the number of elements that have currently been found on the page, and the amount of time that has passed since the finder started running. If there are a limited number of elements expected to be found, then this allows the finder to throttle back after they're found. If the value Infinity is returned, then the finder will not be run again.
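A sketch of an interval function follows. The names sidebarInterval and sidebarFinder and the div.sidebar selector are hypothetical, the specific delays are arbitrary, and the code assumes the elapsed time is given in milliseconds, which this page does not state explicitly.

```javascript
// Hypothetical interval policy. Assumptions: the second argument is in
// milliseconds, and the page shows at most one "sidebar" element.
function sidebarInterval(elementCount, timeRunning) {
  if (elementCount >= 1) {
    // Already found: check only rarely. Returning Infinity here instead
    // would stop the finder from running again.
    return 30000;
  }
  // Not found yet: poll quickly for the first 10 seconds, then back off.
  return timeRunning < 10000 ? 500 : 5000;
}

const sidebarFinder = {
  fn: root => root.querySelectorAll('div.sidebar'),
  interval: sidebarInterval
};
```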
PageParserTreeOptions::watchers
PageParserTreeOptions::watchers: Array<Watcher>
This property must be an array of Watcher objects. A Watcher object contains the following properties:
{
sources: Array<string|null>;
tag: string;
selectors: Array<Selector>;
}
A watcher functions by starting with a matched set of elements, and transforming that matched set of elements into a new matched set of elements iteratively by using the array of Selector values.
The sources array defines the initial matched set of elements. The value null represents the root element given to the PageParserTree constructor (usually the document). Strings may be given naming tags. Multiple sources may be given. (Alternatively, multiple Watchers may be given for the same tag, if for example the tagged element is to be found in very different parts of the page.)
The valid values for the Selector type are described in the Selectors section.
Selectors
Children
string
This will change the matched set to contain only the direct children of each element of the current matched set, and then filter those elements based on a CSS selector string.
Note that if an element does not initially match the given CSS selector string but is later modified to match it (e.g. the web application changed one of its attributes some time after adding it to the page), then the Children selector will not re-run the CSS selector check on the element. The Children selector is only triggered by changes to an element's child list; the Watch selector must be used if you want to react to any other changes to an element.
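One way to handle elements that only start matching later is to put the stable part of the match in a Children selector and the changing part in a Watch selector (described in its own section on this page). A sketch, in which the tag name, the element structure, and the active class are all hypothetical:

```javascript
// Hypothetical watcher for a tab bar whose tabs gain and lose an "active"
// class while staying in the page.
const activeTabWatcher = {
  sources: [null],
  tag: 'activeTab',
  selectors: [
    'body',
    'div.tabs',
    // Children selector: matches every div.tab child, active or not.
    'div.tab',
    // Watch selector: evaluated when a tab enters the matched set and again
    // whenever its class attribute changes.
    {$watch: {attributeFilter: ['class'], cond: '.active'}}
  ]
};
```

Writing 'div.tab.active' as the Children selector instead would miss a tab that was added without the class and only became active afterwards.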
Filter
{ $filter: (el: HTMLElement) => boolean }
This allows you to specify a function which will be called on every matched element. If the function returns false, then the element will be removed from the matched set.
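A Filter is handy for conditions that a CSS selector string cannot express. In the sketch below, hasText and nonEmptyFilter are hypothetical names. This page does not say whether the function is called again when an element changes later, so the sketch should be read as a check made when the element enters the matched set.

```javascript
// Hypothetical predicate: keep only elements containing non-whitespace text.
const hasText = el => el.textContent.trim() !== '';

// A Filter selector that can be placed in a Watcher's selectors array.
const nonEmptyFilter = {$filter: hasText};
```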
Map
{ $map: (el: HTMLElement) => ?HTMLElement }
This allows you to specify a function which will be called on every matched
element, and each element in the matched set will be replaced with the element
returned by your function. If your function returns null, then the element will
just be removed from the matched set.
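For instance, a Map can step from a matched container to an element inside it. The names toTitle and titleMap and the span.title selector are hypothetical; the sketch relies on the standard DOM behavior that querySelector returns null when nothing matches.

```javascript
// Hypothetical mapping: replace each matched row with the title element
// inside it. querySelector returns null when there is no match, which drops
// the row from the matched set.
const toTitle = el => el.querySelector('span.title');

// A Map selector that can be placed in a Watcher's selectors array.
const titleMap = {$map: toTitle};
```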
Watch
{ $watch: { attributeFilter: string[], cond: string | (el: HTMLElement) => boolean } }
This selector allows you to specify an array of attribute names to react to changes to, and a CSS selector string or a function to evaluate the element against. Every element will have the condition evaluated when it first becomes part of the matched set and whenever any of the listed attributes are modified.
Or
{ $or: Array<Array<Selector>> }
For each array of selectors, this takes the current matched set and creates a new matched set by applying the list of selectors to it. All of the resulting matched sets are combined to create the output matched set. This selector can be thought of as forking the selector list at a given point, using several alternative selector lists to continue it, and then recombining the results.
For an example, imagine a site where "message" elements all match the following CSS selector string:
body > div.main > div.border > div.message,
body > div.footer > div.message {}
Imagine for a moment that CSS supported a feature so that this was an equivalent selector string:
body > :or(div.main > div.border, div.footer) > div.message {}
PageParserTree's Or selector implements an operation like that. Here's an example of a PageParserTree Watcher supporting the above page structure:
{
  sources: [null], tag: 'message', selectors: [
    'body',
    {$or: [
      [
        'div.main',
        'div.border'
      ], [
        'div.footer'
      ]
    ]},
    'div.message'
  ]
}
Log
{ $log: string }
This selector uses console.log to log every time an element becomes part of the matched set at the given position in the chain. The given string will be part of the logged message. This is intended for use in development while debugging.
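When a Watcher finds nothing, placing a Log selector partway through the chain can show how far matching gets. The sketch below reuses the thread selectors from the example on this page; the name debugThreadWatcher and the label text are hypothetical.

```javascript
// Hypothetical debugging variant of the "thread" watcher.
const debugThreadWatcher = {
  sources: [null],
  tag: 'thread',
  selectors: [
    'body',
    'div.page',
    // Logs each element that reaches this position, i.e. each div.page.
    {$log: 'reached div.page'},
    'div.mainPanel',
    'div.thread'
  ]
};
```

If nothing is logged, the mismatch lies in the selectors before the Log entry; otherwise it lies in the ones after it.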
Usage
PageParserTree may be installed with npm. We recommend you save the dependency in your package.json and pin the major version by using the command npm install --save page-parser-tree.
PageParserTree may be used in browsers via a CommonJS bundler such as Browserify or Webpack.
Some of the examples on this page use ES2015 features. ES2015 features aren't required to use PageParserTree, though if you're writing a browser extension targeting a modern browser, then you can probably use let/const declarations and arrow functions without issue. Other features in the examples including import statements may require Babel to be used. We've had good experiences with Babel and highly recommend it, but if you aren't using it then know that you can usually swap import X from 'foo'; with const X = require('foo');.
Bundling Note
To use this module in browsers, a CommonJS bundler such as Browserify or Webpack should be used.
This project may add additional checks in some places if process.env.NODE_ENV is not set to "production". If you're using Browserify, then setting the NODE_ENV environment variable to "production" during build is enough to disable these checks. Instructions for other bundlers can be found in React's documentation, which uses the same convention.
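For Webpack, a minimal sketch could look like the following, assuming webpack 4 or later, where production mode defines process.env.NODE_ENV as "production" in the bundled code. The entry path is hypothetical, and older webpack versions need a different setup.

```javascript
// webpack.config.js (sketch; the entry path is hypothetical)
const webpackConfig = {
  // In webpack 4 and later, production mode defines process.env.NODE_ENV as
  // "production" in the bundled code.
  mode: 'production',
  entry: './src/content-script.js'
};

module.exports = webpackConfig;
```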
Types
Both TypeScript and Flow type definitions for this module are included! The type definitions won't require any configuration to use.
Resources
Mixmax has written a blog post with useful notes about their transition to using page-parser-tree and how it solved some performance issues in their browser extension in Gmail.
About
PageParserTree was written by us at Streak, where we produce the Streak CRM browser extension and the InboxSDK, a library for integrating with Gmail and Inbox by Google, which you should also check out if you're reading this page because you're considering writing a browser extension to integrate with them!