schabbi-webscraper

v1.2.2

Published

5 months ago

Lightweight and easy to use crawling solution for websites.

patrickschababerle

nodejs npm crawler nodejs crawler webcrawler crawler node crawler

Lightweight and easy-to-use web crawler.

Features

Fast and reliable
Supports custom page handling
Results also include all cookies
Accepts all puppeteer parameters

Requirements

Node.js v15.*

Installation

via NPM

$ npm i schabbi-webscraper

via GitHub

$ git clone https://github.com/PatrickSchababerle/schabbi-webscraper
$ cd schabbi-webscraper
$ npm install

Usage

Standard use case

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl();

With custom option parameters

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').withOptions({
    includeExternalLinks : true,
    userAgent : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36",
    authentication : {
        username : 'Testuser',
        password : 'Test'
    }
}).crawl();
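
Hardcoded credentials as in the example above are fine for a quick test; for shared code it may be preferable to read them from the environment. The following is a minimal sketch under stated assumptions: the helper buildOptions and the environment variable names CRAWL_USER and CRAWL_PASSWORD are hypothetical, and only the option keys includeExternalLinks and authentication are taken from the example above.

```javascript
// Hypothetical helper: builds an options object for withOptions()
// from a set of environment variables.
function buildOptions(env) {
    const options = { includeExternalLinks: true };
    // Only add authentication when both values are present.
    if (env.CRAWL_USER && env.CRAWL_PASSWORD) {
        options.authentication = {
            username: env.CRAWL_USER,
            password: env.CRAWL_PASSWORD
        };
    }
    return options;
}

// Sample input instead of process.env, so no real secrets are printed.
console.log(buildOptions({ CRAWL_USER: 'Testuser', CRAWL_PASSWORD: 'Test' }));
```

With the package installed, the result could then be passed on as Crawler.setUrl('https://www.example.com').withOptions(buildOptions(process.env)).crawl();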

You can decide which crawled links are added to the queue by using the queue option, for example to crawl only pages linked with a specific attribute, class or target:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
    queue : {
        pattern : 'a[href*="/2021/05/06"]'
    }
}).crawl();
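
The pattern above is a standard CSS attribute selector: a[href*="..."] matches links whose href contains the given fragment. As a rough illustration of which links such a pattern lets through, the sketch below applies the same substring test to a plain list of URLs. The helper matchesQueuePattern and the sample URLs are hypothetical, and how the library evaluates the selector internally is not described in this README.

```javascript
// Hypothetical helper: mimics the CSS selector a[href*="<fragment>"],
// which matches links whose href attribute contains the fragment.
function matchesQueuePattern(href, fragment) {
    return href.includes(fragment);
}

// Sample hrefs, made up for illustration.
const hrefs = [
    'https://www.example.com/2021/05/06/some-post',
    'https://www.example.com/2021/04/12/another-post',
    'https://www.example.com/about'
];

// Only links containing the date fragment would be queued.
const queued = hrefs.filter((href) => matchesQueuePattern(href, '/2021/05/06'));
console.log(queued);
```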

You can also decide whether URL parameters are ignored when URLs are added to the queue:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.digitalsterne.de').withOptions({
    ignoreUrlParameter : true
}).crawl();
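
To illustrate what ignoring parameters could mean for the queue, the sketch below strips the query string with Node's built-in URL class, so that URLs differing only in their parameters collapse into a single entry. This is an assumption about the effect of the option, not the library's actual implementation; stripUrlParameters is a hypothetical helper.

```javascript
// Hypothetical helper: removes the query string from a URL.
function stripUrlParameters(rawUrl) {
    const url = new URL(rawUrl);
    url.search = '';
    return url.toString();
}

// A Set drops the duplicates that remain after stripping.
const seen = new Set([
    'https://www.example.com/blog?page=1',
    'https://www.example.com/blog?page=2',
    'https://www.example.com/contact'
].map(stripUrlParameters));

console.log([...seen]);
```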

Work with the crawled pages while they're being processed

With custom functions you can perform actions on each crawled page. The value returned for each page is added to the final results.

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://digitalsterne.de').eachPage(async (page) => {
    const links = await page.$$eval('a', as => as.map(a => a.href));
    return links;
}).crawl().then((result) => {
    console.log(result);
});
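
The exact shape of the result object is not documented here. Assuming the per-page return values can be collected into an array of link arrays, a small post-processing step such as the following sketch could merge and deduplicate them. The helper uniqueLinks and the sample data are hypothetical.

```javascript
// Flattens per-page link arrays and removes duplicates.
// Assumes `perPageLinks` is an array of arrays of URL strings.
function uniqueLinks(perPageLinks) {
    return [...new Set(perPageLinks.flat())];
}

// Sample data standing in for the links collected on two pages.
const sample = [
    ['https://www.example.com/', 'https://www.example.com/a'],
    ['https://www.example.com/a', 'https://www.example.com/b']
];

console.log(uniqueLinks(sample));
```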

Working with the result

crawl() returns a promise that resolves as soon as the crawl has finished:

const Schabbi = require('schabbi-webscraper');
const Crawler = new Schabbi();

Crawler.setUrl('https://www.example.com').crawl().then((result) => {
    console.log(result);
});
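
Because the result is delivered through a promise, the same flow can also be written with async/await. The sketch below takes the crawler as an argument and uses a stand-in object, so it runs without the package installed. The names runCrawl and fakeCrawler are hypothetical, and it is an assumption that a failed crawl rejects the promise.

```javascript
// Awaits a crawl and returns its result. `crawler` is any object that
// exposes the setUrl(...).crawl() chain shown above.
async function runCrawl(crawler, url) {
    try {
        return await crawler.setUrl(url).crawl();
    } catch (err) {
        // Assumption: a failed crawl rejects the promise.
        console.error(`Crawl of ${url} failed:`, err);
        throw err;
    }
}

// Stand-in crawler so the sketch runs without the package installed.
const fakeCrawler = {
    setUrl(url) { this.url = url; return this; },
    crawl() { return Promise.resolve({ url: this.url }); }
};

runCrawl(fakeCrawler, 'https://www.example.com').then((result) => {
    console.log(result);
});
```

With the real package, an instance created via new Schabbi() would be passed in place of fakeCrawler.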

Methods

Configuration

Visit the examples for detailed information on how to use options properly.

About this project

This is one of my first projects on GitHub to be made publicly available. Please feel free to provide feedback!