kick-off-crawling
v2.2.1
make web scraping easy
Kick-off-crawling provides a simple API/structure for developers to create scrapers and connect those scrapers together to achieve complicated data-mining jobs.
Kick-off-crawling is made possible by the following libraries:
- request and request-promise for making HTTP requests
- puppeteer for fetching JS-generated HTML pages
- cheerio for parsing HTML text into DOM objects
- async for scheduling scraping tasks
- html-minifier for minifying HTML in order to get stable scraping results
Overview
Kick-off-crawling exposes a Scraper class and a kickoff function.
Scrapers are self-managed; each scraper can:
- scrape data and URLs from a DOM object parsed by cheerio
- post the scraped data to a developer-defined callback
- pass a URL on to the next scraper
kickoff takes a URL and a scraper and kicks off the crawling process.
During crawling, new scraping jobs are generated by the scrapers and scheduled by the kickoff function. The crawling process stops when there are no more scraping jobs.
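As a minimal sketch of how the two pieces fit together (assuming the package exports Scraper and kickoff from its top-level module, as the Usage example below suggests):

const { Scraper, kickoff } = require('kick-off-crawling'); // assumed import path

class TitleScraper extends Scraper {
  scrape($, emitter) {
    // $ is the cheerio-parsed DOM of the fetched page
    emitter.emitItem({ title: $('title').text() }); // delivered to the onItem callback
    // emitter.emitJob('https://example.com/next', new TitleScraper()); // schedule a follow-up page
  }
}

kickoff('https://example.com', new TitleScraper(), {
  onItem: (item) => console.log(item),
});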
Quick start & Running the Examples
$ cd ${REPO_PATH}
$ npm install
$ cd examples
$ node google.js # getting top 30 search results
$ node amazon.js # getting top 5 apps from Amazon Appstore
Usage
Working example source code: examples/amazon.js
Here is an example of getting app info from the Amazon Appstore.
1. Define your scrapers
We need to define two Scrapers to complete the task:
- BrowseNodeScraper for getting the list of app detail pages
- DetailPageScraper for getting app info (title, stars, version) from each detail page
Scraping the app list
const { Scraper } = require('kick-off-crawling'); // Scraper base class (import path assumed; the repo example may require it locally)

class BrowseNodeScraper extends Scraper {
  scrape($, emitter) {
    // take the first 5 apps on the browse-node (category) page
    $('#mainResults .s-item-container').slice(0, 5).each((i, x) => {
      const detailPageUrl = $(x).find('.s-access-detail-page').attr('href');
      emitter.emitJob(detailPageUrl, new DetailPageScraper()); // <-- new scraping job
    });
  }
}
Scraping the detail page
class DetailPageScraper extends Scraper {
  getVersion($) {
    let version = '';
    $('#mas-technical-details .masrw-content-row .a-section').each((ii, xx) => {
      const text = $(xx).text();
      const re = /バージョン:\s+(.+)/; // "バージョン" means "Version" on the Japanese detail page
      const matched = re.exec(text);
      if (matched) {
        [, version] = matched;
      }
    });
    return version;
  }
  scrape($, emitter) {
    const item = {
      title: $('#btAsinTitle').text(),
      star: $('.a-icon-star .a-icon-alt').eq(0).text().replace('5つ星のうち ', ''), // strip the "out of 5 stars" prefix
      version: this.getVersion($),
    };
    emitter.emitItem(item); // <-- the emitted item is received by the *onItem* callback set with the `kickoff` function
  }
}
2. Kick off
kickoff(
  'https://www.amazon.co.jp/b/?node=2386870051',
  new BrowseNodeScraper(),
  {
    concurrency: 2, // <-- max 2 requests at a time, default 1
    minify: true, // <-- minify html, default true
    headless: false, // <-- set to true when scraping JS-generated pages, default false
    onItem: (item) => { // <-- receives each item emitted by a scraper
      console.log(item);
    },
    onDone: () => { // <-- called when there are no more scraping tasks, optional
      console.log('done');
    },
  },
);
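These options map onto the libraries listed earlier: headless: true presumably renders pages with puppeteer (needed for JS-generated pages) instead of fetching them with request/request-promise, and minify: true runs the fetched HTML through html-minifier so that scraping results stay stable across runs.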