@skypilot/scraper
v1.0.0-alpha.23
Node-based scriptable web scraper
How to use
- Create a database adapter
const dbFilePath = 'tmp/demo.json';
const database = new LowDb(dbFilePath);
- Create a scraper that uses the database
import { PlaywrightScraper } from './src/PlaywrightScraper';
const scraper = new PlaywrightScraper({ database });
- Use `ScriptBuilder` to build a script:
import { ScriptBuilder } from './src/ScriptBuilder';
const builder = new ScriptBuilder()
  .goTo('https://www.iana.org/domains/reserved') // start at a page
  .runOnAll({ // runs the nested `commands` on each element that matches `query`
    query: 'table#arpa-table > tbody > tr > td > span.domain.label',
    commands: new ScriptBuilder()
      .follow('a') // follow the href in the first `a` tag
      .query({ // gather this data for each element matched by the `runOnAll` query
        title: 'head > title',
        sponsor: '//h2[contains(text(), "Sponsoring Organisation")]/following-sibling::b',
        adminContact: '//h2[contains(text(), "Administrative Contact")]/following-sibling::b',
        techContact: '//h2[contains(text(), "Technical Contact")]/following-sibling::b',
      })
      .write() // writes to the database
  });
- Pass the script into the scraper's `run` method:
const result = scraper.run(builder);
Query
There are two ways to write a query:
1. A `Query` or `ShorthandQuery` object
A `Query` object is the standard way to write a selector:
interface Query {
  selector: string; // a CSS or XPath selector
  attributeName?: string; // if specified, select this attribute's value; otherwise, select the element's text content
  scope?: 'one' | 'all'; // default = 'one'; when used with `runOnAll`, `scope: 'all'` is automatically set
  limit?: Integer; // limits the selection to `limit` elements
  nthOfType?: Integer; // select the `nth` element matching the selector
}
A `ShorthandQuery` is the same as a `Query` object, but uses shorthand names for some of the keys:
interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}
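The correspondence between the two shapes can be sketched as a small conversion function. This is only an illustration: `expandShorthand` is a hypothetical helper, not something the package is known to export, and `Integer` is assumed here to be a plain `number` alias.

```typescript
// Assumption: `Integer` is treated as a plain number alias in this sketch.
type Integer = number;

interface Query {
  selector: string;
  attributeName?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nthOfType?: Integer;
}

interface ShorthandQuery {
  sel: string;
  attr?: string;
  scope?: 'one' | 'all';
  limit?: Integer;
  nth?: Integer;
}

// Hypothetical helper: maps each shorthand key to its long-form counterpart.
function expandShorthand(q: ShorthandQuery): Query {
  const query: Query = { selector: q.sel };
  if (q.attr !== undefined) query.attributeName = q.attr;
  if (q.scope !== undefined) query.scope = q.scope;
  if (q.limit !== undefined) query.limit = q.limit;
  if (q.nth !== undefined) query.nthOfType = q.nth;
  return query;
}

const expanded = expandShorthand({ sel: 'a', attr: 'href', scope: 'all', limit: 5 });
console.log(JSON.stringify(expanded));
// {"selector":"a","attributeName":"href","scope":"all","limit":5}
```

Only keys that are present in the shorthand object are copied, so an omitted option stays omitted in the result.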
See CSS and XPath selectors. Support for text selectors will be added soon.
A query matches the first element matching the selector, with two exceptions:
- When used with `runOnAll` or when `scope: 'all'`, the selector selects all matching elements up to the `limit` (if any)
- When `nthOfType` is set, the selector selects the `nth` matching element
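The matching rules above can be modelled with a short function over an array of already-matched elements. This is a sketch under stated assumptions, not the package's implementation: `selectMatches` and `SelectionOptions` are hypothetical names, `nthOfType` is assumed to be 1-based (as in CSS `:nth-of-type`), and `nthOfType` is assumed to take precedence when it is combined with `scope: 'all'`.

```typescript
interface SelectionOptions {
  scope?: 'one' | 'all';
  limit?: number;
  nthOfType?: number;
}

// Hypothetical helper: `matches` stands in for the elements that matched
// the selector, in document order.
function selectMatches<T>(matches: T[], options: SelectionOptions = {}): T[] {
  const { scope = 'one', limit, nthOfType } = options;

  if (nthOfType !== undefined) {
    // Assumption: 1-based index; an out-of-range index selects nothing.
    const match = matches[nthOfType - 1];
    return match === undefined ? [] : [match];
  }

  if (scope === 'all') {
    // All matches, capped at `limit` when one is given.
    return limit === undefined ? matches.slice() : matches.slice(0, limit);
  }

  // Default: only the first match.
  return matches.slice(0, 1);
}

const rows = ['a', 'b', 'c', 'd'];
console.log(selectMatches(rows));                             // ['a']
console.log(selectMatches(rows, { scope: 'all', limit: 2 })); // ['a', 'b']
console.log(selectMatches(rows, { nthOfType: 3 }));           // ['c']
```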
2. A string query
When a string value is used as the query, that value is treated as the `selector` param.
E.g., if the argument is `'h2'`, it is understood to mean `{ selector: 'h2' }`.
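A minimal sketch of that string-to-object rule, assuming a hypothetical `toQuery` helper (the package may handle this differently internally) and a `Query` type trimmed to the fields used here:

```typescript
// Trimmed copy of the `Query` shape, for illustration only.
interface Query {
  selector: string;
  attributeName?: string;
}

// Hypothetical helper: a bare string becomes `{ selector: <string> }`;
// an object is passed through unchanged.
function toQuery(input: string | Query): Query {
  return typeof input === 'string' ? { selector: input } : input;
}

console.log(JSON.stringify(toQuery('h2')));
// {"selector":"h2"}
console.log(JSON.stringify(toQuery({ selector: 'a', attributeName: 'href' })));
// {"selector":"a","attributeName":"href"}
```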