untemplate
v0.1.9
Published
A node.js package that uses templates to extract structured info from HTML.
Downloads
4
Readme
untemplate.js
This node module provides a way to scrape structured information from websites based on HTML templates.
tl;dr
Templating engines like handlebars/jade:
{ structured: 'info' } + {{ template }} = <html>
This module:
<html> - {{ template }} = { structured: 'info' }
Alternatively, you can think of untemplate.js
as a declarative DSL for web scraping.
Usage
import { untemplate } from 'untemplate';
// obtain a DOM element from somewhere
let element = get-dom-from-html(`
<div>
<div>
<span> alberta </span>
</div>
<div>
<span> bc </span>
<span> canada </span>
</div>
</div>
`;
let template = `
<div>
<span> {{ region }} </span>
<span ?> {{ country }} </span>
</div>
`;
let data = untemplate(template, element);
// data: [{ region: 'alberta' }, { region: 'bc', country: 'canada' }]
How do you make templates you ask? See the API section below for details on the deduceTemplate
function.
Features
Refer to the specs file at spec/untemplateSpec.js
for examples of each of these features.
- given a page like
<div> hello </div>
and a template like<div> {{ greeting }} </div>
, produces the associative array{ greeting: 'hello' }
- the templates can contain arbitrary HTML elements
- they can occur zero, one, or many times in the target page
- they can appear anywhere on the page, all occurrences at different levels of nesting
- robust to unexpected additions / removals of textnodes in the DOM
- all
{{ property }}
captures are optional to simplify template creation - can mark DOM nodes in the template as "optional", which means they don't need to occur for the template to match
- can have nested optional DOM nodes
- can deduce the optimal template given a list of dom nodes
API
#untemplate(dsl, element[, cb, rate])
- arguments
dsl
: the template as a string; valid templates are valid HTML, sans attributes, with one notable exception:optional="true"
. This makes the node it's attached to optional in the template. Some sugar:<div optional="true"></div>
<=><div?></div>
.element
: the root DOM element to search for the template in. This must be a proper DOM element, either output from some library likexmldom
or from the browsercb
: (optional) a progress callback that is periodically called with the approximate completion percentage (first argument) and a stop function (second argument) that causesuntemplate
to terminate early if calledrate
: (optional) the approximate percent of progress that occurs between each call ofcb
- returns
- a list of associative arrays containing the structured information contained in
element
- each matching of the template corresponds to one associative array in the returned list
- the returned list is in left-to-right BFS order
- a list of associative arrays containing the structured information contained in
- throws
EarlyStopException
: thrown if thecb
function calls itsstop
argument
#precomputeNeedles(dsl[, cb, rate])
- arguments
dsl
: the template as a string; valid templates are valid HTML, sans attributes, with one notable exception:optional="true"
. This makes the node it's attached to optional in the template. Some sugar:<div optional="true"></div>
<=><div?></div>
.cb
: (optional) a progress callback that is periodically called with the precomputation completion percentage (first argument) and a stop function (second argument) that causesprecomputeNeedles
to terminate early if calledrate
: (optional) the approximate percent of progress that occurs between each call ofcb
- returns
- a JSON object containing information that allows you to call
#untemplateWithNeedles
instead of#untemplate
for a drastic performance increase
- a JSON object containing information that allows you to call
- throws
EarlyStopException
: thrown if thecb
function calls itsstop
argument
#deduceTemplate(elements[, cb, rate])
- arguments
examples
: a list of HTML nodes as strings. These nodes should all represent the same "type of thing" in the page you'd like to apply the template to. For instance, the HTML for each search result in a long list of search results. These nodes must all share the same outermost tag.cb
: (optional) a progress callback that is periodically called with the approximate completion percentage (first argument) and a stop function (second argument) that causesdeduceTemplate
to terminate early if calledrate
: (optional) the approximate percent of progress that occurs between each call ofcb
- returns
- a string representing the minimum template that matches all of the provided examples
- minimum means: fewest number of optional nodes in the template (this count includes the descendants of optional nodes)
- this string all of the text in the provided examples in an aggregated form to communicate which portions of the template correspond to what
- a string representing the minimum template that matches all of the provided examples
- throws
UnresolveableExamplesError
: thrown if the input examples cannot be reconciled for any reason (usually just because they do not share a common outermost tag)EarlyStopException
: thrown if thecb
function calls itsstop
argument
#deduceTemplateVerbose(elements[, prefix, cb, rate])
- arguments
examples
: see arguments for#deduceTemplate
prefix
: (optional) a string to prefix all of the generated property selectors withcb
: (optional) a progress callback that is periodically called with the approximate completion percentage (first argument) and a stop function (second argument) that causesdeduceTemplateVerbose
to terminate early if calledrate
: (optional) the approximate percent of progress that occurs between each call ofcb
- returns
- an object literal conaining the following keys
maximalDsl
: a template to match the examples with property selectors in all possible positionsconsolidatedValues
: an object literal mapping the property selectors inmaximalDsl
to the literal values indslWithLiterals
dslWithLiterals
: the same minimum template returned by#deduceTemplate
- an object literal conaining the following keys
- throws
UnresolveableExamplesError
: thrown in same cases as#deduceTemplate
EarlyStopException
: thrown if thecb
function calls itsstop
argument
Running locally
This project uses Webpack to generate a bundle file and Flow to check types. The following commands will use yarn
, but you can use npm
or npm run
interchangeably.
yarn install
to grab the dependencies.yarn flow
to check the typesyarn test
to run tests (this project uses jasmine)yarn build
to create a bundle file inlib/
For a front-end interface to the #deduceTemplate
function, open the index.html
file in a browser. This will allow you to paste in HTML and deduce the template the minimally matches those html examples.
License
MIT © fin ventures