evalscraper

v0.6.1

Published

3 years ago

A configurable web page scraper that uses Google Puppeteer


briandehart

puppeteer web scraping

evalscraper

evalscraper is middleware for scraping web pages with Google Puppeteer.

Installation

npm install evalscraper

Usage

ESM

import { Scraper, ScrapeTask } from "./dist/evalscraper";

CJS

const { Scraper, ScrapeTask } = require("./dist/evalscraper");

Create a new Scraper instance.

const scraper = new Scraper();

A ScrapeTask's first parameter is the URL of the page to scrape. It is followed by one or more arrays, each describing a single scrape of that page. pageFunction is evaluated in the browser context.

const scrapeTask =
  new ScrapeTask(
    'https://url-to-scrape/',
    [
      'key',                   // property to hold returned value of this scrape

      'selector',              // element to select on page

      pageFunction(selectors), // a function passed an array containing all
                               // instances of 'selector' found on the page;
                               // pageFunction evaluates in browser context

      callback(array)          // optional callback that is passed an
                               // array returned by pageFunction
    ],
    // ...[Next scrape]
);
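To make the roles of pageFunction and callback more concrete, the sketch below defines both as plain functions and runs them against stand-in objects rather than real DOM elements. The names extractLinks and firstThree and the sample data are hypothetical and not part of evalscraper; in a real ScrapeTask, the matched elements would be supplied by the scraper.

```javascript
// Hypothetical pageFunction: receives every element matching the selector
// and returns plain data extracted from those elements.
const extractLinks = (anchors) =>
  anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }));

// Hypothetical callback: receives the array returned by the pageFunction.
const firstThree = (links) => links.slice(0, 3);

// Stand-ins for DOM anchor elements, used only to exercise the functions here.
const fakeAnchors = [
  { textContent: " One ", href: "https://example.com/1" },
  { textContent: "Two", href: "https://example.com/2" },
  { textContent: "Three", href: "https://example.com/3" },
  { textContent: "Four", href: "https://example.com/4" },
];

const links = firstThree(extractLinks(fakeAnchors));
console.log(links);
```

Following the array layout described above, these functions would slot into a scrape as ['links', 'a', extractLinks, firstThree]. Because pageFunction is evaluated in the browser context, it is safest to keep it free of references to variables in the surrounding Node.js scope.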

Pass the ScrapeTask to the .scrape() method. It returns a Promise that resolves to an object with key: value pairs determined by the ScrapeTask.

const scrapeOfPage = await scraper.scrape(scrapeTask);

Close the scraper.

await scraper.close();
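If a scrape rejects before close() is reached, the scraper may be left open. One way to guard against that is a try/finally wrapper, sketched here under the assumption that scrape() can reject and that close() returns a Promise. withScraper is a hypothetical helper, not part of evalscraper.

```javascript
// Hypothetical helper: run `work` with a scraper and always close it afterward.
// `scraper` is any object with an async close() method.
async function withScraper(scraper, work) {
  try {
    return await work(scraper);
  } finally {
    // Runs whether `work` resolved or rejected.
    await scraper.close();
  }
}
```

With a real instance this might be called as withScraper(new Scraper(), (s) => s.scrape(scrapeTask)). An error from the scrape still propagates to the caller; the wrapper only ensures the close happens first.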

Multiple Scraper instances can be created.

const scraperFoo = new Scraper();
const scraperBar = new Scraper();

const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperBar.scrape(taskBar);

await scraperFoo.close();
await scraperBar.close();

Or a single Scraper instance can be reused.

const scraperFoo = new Scraper();

const resultsFoo = await scraperFoo.scrape(taskFoo);
const resultsBar = await scraperFoo.scrape(taskBar);

await scraperFoo.close();

The number of concurrent scrapes you can run will be limited by your hardware.
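Since hardware bounds how many scrapes can run at once, it can help to cap the number in flight. The sketch below is a generic concurrency limiter, assuming each job is a zero-argument function that returns a Promise (for example () => scraper.scrape(task)). runWithLimit is a hypothetical helper, not part of evalscraper.

```javascript
// Hypothetical helper: run async jobs with at most `limit` in flight at once.
// Each job is a zero-argument function that returns a Promise.
async function runWithLimit(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0; // index of the next job to start

  // Each worker keeps pulling the next unstarted job until none remain.
  async function worker() {
    while (next < jobs.length) {
      const index = next++;
      results[index] = await jobs[index]();
    }
  }

  // Start at least one worker, and never more workers than jobs.
  const workerCount = Math.max(1, Math.min(limit, jobs.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as `jobs`
}
```

A call such as runWithLimit(tasks.map((t) => () => scraper.scrape(t)), 2) would keep two scrapes running at a time. If any job rejects, the returned Promise rejects with that error; a suitable limit depends on the machine and is best found by experiment.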

Configuration

A Scraper instance can be configured by passing an object to the constructor.

const scraper = new Scraper({
  // default values
  throwError: true,
  noisy: false, // when true, progress is logged to console
  timeout: 30000,
  maxRetries: 2,
});

Example

Scrape Hacker News and return the titles and links of the first ten stories.

const { Scraper, ScrapeTask } = require("./dist/evalscraper");

const scraper = new Scraper({
  throwError: true,
  noisy: true,
  timeout: 30000,
  maxRetries: 2,
});

// returns the titles and links of
// the first ten Hacker News stories
const newsScrape = new ScrapeTask("https://news.ycombinator.com/", [
  "stories",
  "a.titlelink",
  (anchors) =>
    anchors.map((a) => {
      const story = [];
      story.push(a.textContent);
      story.push(a.href);
      return story;
    }),
  (stories) => stories.slice(0, 10),
]);

async function logStories(scrapeTask) {
  try {
    const hackerNews = await scraper.scrape(scrapeTask);
    hackerNews.stories.forEach((story) =>
      console.log(story[0], story[1], "\n")
    );
  } catch (err) {
    console.error(err);
  } finally {
    // close the scraper even if the scrape fails
    await scraper.close();
  }
}

logStories(newsScrape);
