evalscraper
v0.6.1
evalscraper is middleware for scraping web pages with Google Puppeteer.
Installation
npm install evalscraper
Usage
ESM
import { Scraper, ScrapeTask } from "evalscraper";
CJS
const { Scraper, ScrapeTask } = require("evalscraper");
Create a new Scraper instance.
const scraper = new Scraper();
A ScrapeTask's first parameter is the URL of the page to scrape, followed by one or more arrays, each describing one scrape of that page. The pageFunction evaluates in browser context.
const scrapeTask = new ScrapeTask(
  'https://url-to-scrape/',
  [
    'key',                   // property to hold the returned value of this scrape
    'selector',              // element to select on the page
    pageFunction(selectors), // a function passed an array containing all
                             // instances of 'selector' found on the page;
                             // pageFunction evaluates in browser context
    callback(array),         // optional callback that is passed the
                             // array returned by pageFunction
  ],
  // ...[Next scrape]
);
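For example, a single task might combine two scrapes of one page. This is a sketch: the URL, selectors, and helper names below are hypothetical, not part of the library.

```javascript
// Hypothetical two-scrape task. The pageFunctions run in browser
// context; the slice callback runs back in Node.
const headingText = (els) => els.map((el) => el.textContent);
const linkHrefs = (anchors) => anchors.map((a) => a.href);
const firstFive = (items) => items.slice(0, 5);

const articleTask = new ScrapeTask(
  "https://example.com/articles",
  ["headings", "h2.article-title", headingText],
  ["links", "a.article-link", linkHrefs, firstFive]
);
```

Awaiting scraper.scrape(articleTask) would then resolve to an object with headings and links keys.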
Pass the ScrapeTask to the .scrape() method. It returns a Promise that resolves to an object with key: value pairs determined by the ScrapeTask.
const scrapeOfPage = await scraper.scrape(scrapeTask);
Close the scraper.
await scraper.close();
Multiple Scraper instances can be created.
const scraperFoo = new Scraper();
const scraperBar = new Scraper();
const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperBar.scrape(taskBar);
await scraperFoo.close();
await scraperBar.close();
Or a single Scraper instance can be reused.
const scraperFoo = new Scraper();
const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperFoo.scrape(taskBar);
await scraperFoo.close();
The number of concurrent scrapes you can run will be limited by your hardware.
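The sequential awaits above run one scrape at a time; with separate instances the two tasks can also run concurrently. A minimal sketch (the helper name is hypothetical):

```javascript
// Run two ScrapeTasks at the same time on two Scraper instances.
// Promise.all starts both scrapes before awaiting either result.
async function scrapeInParallel(scraperFoo, scraperBar, taskFoo, taskBar) {
  try {
    const [resultsFoo, resultsBar] = await Promise.all([
      scraperFoo.scrape(taskFoo),
      scraperBar.scrape(taskBar),
    ]);
    return { resultsFoo, resultsBar };
  } finally {
    // close both browsers even if one scrape rejects
    await scraperFoo.close();
    await scraperBar.close();
  }
}
```

For example: scrapeInParallel(new Scraper(), new Scraper(), taskFoo, taskBar).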
Configuration
A Scraper instance can be configured by passing an options object to the constructor.
const scraper = new Scraper({
  // default values
  throwError: true,
  noisy: false, // when true, progress is logged to the console
  timeout: 30000,
  maxRetries: 2,
});
Example
Scrape Hacker News and return the titles and links of the first ten stories.
const { Scraper, ScrapeTask } = require("evalscraper");

const scraper = new Scraper({
  throwError: true,
  noisy: true,
  timeout: 30000,
  maxRetries: 2,
});
// returns the titles and links of
// the first ten Hacker News stories
const newsScrape = new ScrapeTask("https://news.ycombinator.com/", [
  "stories",
  "a.titlelink",
  (anchors) =>
    anchors.map((a) => {
      const story = [];
      story.push(a.textContent);
      story.push(a.href);
      return story;
    }),
  (stories) => stories.slice(0, 10),
]);
async function logStories(scrapeTask) {
  try {
    const hackerNews = await scraper.scrape(scrapeTask);
    hackerNews.stories.forEach((story) =>
      console.log(story[0], story[1], "\n")
    );
    await scraper.close();
  } catch (err) {
    console.log(err);
  }
}
logStories(newsScrape);