@bluggie/nodescrapy

v0.1.6

Web crawler in NodeJS

Overview

Nodescrapy is a fast, high-level, and highly configurable web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

Nodescrapy is written in TypeScript and runs in a Node.js environment.

Nodescrapy comes with a built-in web spider, which automatically discovers all the URLs of a website.

Nodescrapy saves the crawl status in a local SQLite database, so crawling can be stopped and resumed.

Nodescrapy provides default integrations with Axios and Puppeteer, so you can choose whether or not to render JavaScript.

By default, Nodescrapy saves the scraping results as JSON files in a local folder.

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const onItemCrawledFunction = (response: HtmlResponse) => {
    // Return the data to store for this page (see the JSON output in Getting Started).
    return { 'data1': 'test' };
}

const crawler = new WebCrawler({
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: onItemCrawledFunction
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));

What does nodescrapy do?

  • Provides a web client configurable with retries and delays.
  • Extremely configurable for writing your own crawler.
  • Provides a configurable discovery implementation to automatically detect linked resources and filter the ones you want.
  • Saves the crawl status in file storage, so a crawl can be paused and resumed.
  • Provides basic statistics on the crawling status.
  • Automatically parses the HTML DOM with Cheerio.
  • Implementations can be easily extended.
  • Fully written in TypeScript.

Documentation

Installation

npm install --save @bluggie/nodescrapy

Getting Started

Initializing nodescrapy is a simple process. First, you import the module and instantiate it with a config argument. You then configure the properties you like (e.g. the request interval), register the onItemCrawled method, and call the crawl method. Let's walk through the process!

After importing the crawler, we create a new instance of it and supply the constructor with the crawler configuration. A simple configuration contains:

  • Entry URL(s) for the crawler (entryUrls).
  • Where to store the crawled items (dataPath).
  • What to do when a new page is crawled (the onItemCrawled function).

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const onItemCrawledFunction = (response: HtmlResponse) => {
    if (!response.url.includes('-for-rent')) {
        return undefined;
    }

    const $ = response.$;
    return {
        'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
    }
}

const crawler = new WebCrawler({
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: onItemCrawledFunction
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));

The function onItemCrawledFunction is required, since the crawler will invoke it to extract the data from the HTML document. It should return undefined if there is nothing to extract from that page, or a {key: value} object if data could be extracted from that page. See onItemCrawled for more information.

When running the application, it will produce the following logs:

info: Jul-08-2022 09:08:57: Crawled started.
info: Jul-08-2022 09:08:57: Crawling https://www.pararius.com/apartments/amsterdam
info: Jul-08-2022 09:09:00: Crawling https://www.pararius.com/apartments/amsterdam/map
info: Jul-08-2022 09:09:01: Crawling https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Crawling https://www.pararius.com/real-estate-agents/amsterdam/expathousing-com-amsterdam
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b64f/president-kennedylaan
info: Jul-08-2022 09:09:20: Saving 2 entries into JSON file: data-2022-07-08T07:09:20.115Z.json
info: Jul-08-2022 12:37:16: Crawled 29 urls. Remaining: 328

This will also store the data in JSON files (by default, 50 entries per JSON file; configurable with the dataBatchSize property).

[
  {
    "provider": "nodescrapy",
    "url": "https://www.pararius.com/apartment-for-rent/amsterdam/2365cc70/gillis-van-ledenberchstraat",
    "data": {
      "data1": "test"
    },
    "added_at": "2022-07-08T10:38:53.431Z",
    "updated_at": "2022-07-08T10:38:53.431Z"
  },
  {
    "provider": "nodescrapy",
    "url": "https://www.pararius.com/apartment-for-rent/amsterdam/61e78537/nieuwezijds-voorburgwal",
    "data": {
      "data1": "test"
    },
    "added_at": "2022-07-08T10:38:55.466Z",
    "updated_at": "2022-07-08T10:38:55.466Z"
  }
]

Crawling modes

Nodescrapy can run in two different modes:

  • START_FROM_SCRATCH
  • CONTINUE

In START_FROM_SCRATCH mode, every run of the crawler starts from zero, going through the entryUrls and all the discovered links.

In CONTINUE mode, the crawler will only crawl the links which were not processed in the last run, plus the new ones that are discovered.

To see how to configure this, go to mode.
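A minimal sketch of resuming a crawl, reusing the shape of the Getting Started example (the string value matches the CrawlContinuationMode enum documented below):

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const crawler = new WebCrawler({
    mode: 'CONTINUE', // Only crawl links not processed in the last run; 'START_FROM_SCRATCH' resets.
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: (response: HtmlResponse) => ({ 'title': response.$('title').text() })
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));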

Data Models

HttpRequest

HttpRequest is a wrapper including:

  • The url which is going to be crawled.
  • The headers which are going to be sent in the request (e.g. User-Agent).

interface HttpRequest {
  url: string;
  
  headers: { [key: string]: string; }
}
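
For example, a beforeRequest hook (supported by the Axios client, see the client configuration below) receives and returns this wrapper. A minimal sketch that overrides the User-Agent header, assuming HttpRequest is exported by the package like HtmlResponse and WebCrawler:

import {HttpRequest} from '@bluggie/nodescrapy'; // Assumes HttpRequest is exported by the package.

const beforeRequest = (request: HttpRequest): HttpRequest => {
    // Override the User-Agent header before the request is sent.
    request.headers['User-Agent'] = 'my-custom-crawler/1.0';
    return request;
};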

HtmlResponse

HtmlResponse is a wrapper including:

  • The url which has been crawled.
  • The originalResponse returned by the HTTP client (an AxiosResponse).
  • $, a Cheerio instance loaded with the page's HTML.

This information should be enough to extract the information you need from that webpage.

interface HtmlResponse {
  url: string;

  originalResponse: AxiosResponse;

  $: CheerioAPI;
}
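
For instance, an onItemCrawled function can use $ for CSS selection, while url and originalResponse provide extra context. A minimal sketch based on the Getting Started example:

import {HtmlResponse} from '@bluggie/nodescrapy';

const onItemCrawled = (response: HtmlResponse) => {
    // response.$ is a Cheerio instance already loaded with the page's HTML.
    const title = response.$('.listing-detail-summary__title').text();

    // Return undefined to skip pages where the selector matched nothing.
    if (!title) {
        return undefined;
    }

    // response.url and response.originalResponse (the raw AxiosResponse) are also available here.
    return { 'title': title };
};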

DataEntry

Represents the data that will be stored in the file system after a page with data has been crawled.

Contains:

  • The id of the entry (primary key).
  • The provider (crawler name).
  • The url.
  • The data extracted by the onItemCrawled function.
  • The timestamps when the data was added and updated (added_at, updated_at).

interface DataEntry {
    id?: number,
    provider: string,
    url: string,
    data: { [key: string]: string; },
    added_at: Date,
    updated_at: Date
}
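
Entries are persisted as plain JSON (see the Getting Started output above), so they can be read back and typed. A sketch that assumes DataEntry is exported by the package and uses the file name from the example logs:

import {readFileSync} from 'fs';
import {DataEntry} from '@bluggie/nodescrapy'; // Assumes DataEntry is exported by the package.

// File produced under dataPath; note that added_at/updated_at are ISO strings in the JSON files.
const raw = readFileSync('./crawled-items/data-2022-07-08T07:09:20.115Z.json', 'utf-8');
const entries: DataEntry[] = JSON.parse(raw);

entries.forEach((entry) => console.log(entry.url, entry.data));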

CrawlContinuationMode

Enum which defines how the crawler will run: either starting from scratch or continuing from the last execution.

Values:

  • START_FROM_SCRATCH
  • CONTINUE

enum CrawlContinuationMode {
    START_FROM_SCRATCH = 'START_FROM_SCRATCH',
    CONTINUE = 'CONTINUE'
}

CrawlerClientLibrary

Enum which defines the client implementation. Puppeteer will automatically render JavaScript using Chrome.

If PUPPETEER is selected, a Chrome executable must be present on the system.

Values:

  • AXIOS
  • PUPPETEER

enum CrawlerClientLibrary {
    AXIOS = 'AXIOS',
    PUPPETEER = 'PUPPETEER'
}
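
The library is selected through the client section of the crawler configuration (see the full configuration example below); a minimal sketch of just that section:

{
    client: {
        library: 'PUPPETEER'
    }
}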

Crawler configuration

Full TypeScript configuration definition

This is a definition of all the configuration options currently supported by the crawler.

{
    name: 'ParariusCrawler',
    mode: 'START_FROM_SCRATCH',
    entryUrls: ['http://www.pararius.com'],
    client: {
        library: 'PUPPETEER',
        autoScrollToBottom: true,
        concurrentRequests: 5,
        retries: 5,
        userAgent: 'Firefox',
        retryDelay: 2,
        delayBetweenRequests: 2,
        timeoutSeconds: 100,
        beforeRequest: (htmlRequest: HttpRequest) => { // Only for AXIOS client.
            htmlRequest.headers.Authorization = 'JWT MyAuth';
            return htmlRequest;
        }
    },
    discovery: {
        allowedDomains: ['www.pararius.com'],
        allowedPath: ['amsterdam/'],
        removeQueryParams: true,
        onLinksDiscovered: undefined
    },
    onItemCrawled: (response: HtmlResponse) => {
        if (!response.url.includes('-for-rent')) {
            return undefined;
        }

        const $ = response.$;
        return {
            'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
        }
    },
    dataPath: './output-json',
    dataBatchSize: 10,
    sqlitePath: './cache.sqlite'
}

name : string

Name of the crawler.

The name of the crawler is important in the following scenarios:

  • When resuming a crawler. The library will find the last status based on the crawler name. If you change the name, the status will be reset.
  • When running multiple crawlers. The library stores the status in a SQLite database indexed by the crawler name.

Default: nodescrapy
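
Example (using the name from the full configuration example above):

{
    name: 'ParariusCrawler'
}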

mode : string

Mode of the crawler. To see the options, check CrawlContinuationMode.

Default: START_FROM_SCRATCH

entryUrls : string[]

List of URLs from which the crawler will start.

Example:
{
    entryUrls: ['https://www.pararius.com/apartments/amsterdam']
}

onItemCrawled : function (response: HtmlResponse) => { [key: string]: any; } | undefined;

Function to extract the data when a URL has been crawled.

If it returns undefined, the URL will be discarded and nothing will be stored for it.

The argument of this function is provided by the crawler, and it is an HtmlResponse.

Example
 {
    onItemCrawled: (response: HtmlResponse) => {
        if (!response.url.includes('-for-rent')) {
        return undefined; // Only extract information from the URLs which contain for-rent.
        }

        const $ = response.$;
        return {
            'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(), // Extract the title of the page.
        }
    }
}

dataPath : string

Configures where the output of the crawler (DataEntries) will be stored.

Example
{
    dataPath: './output-data'
}

This will produce the following files:

./output-data/data-2022-07-11T08:17:38.188Z.json

./output-data/data-2022-07-11T08:17:41.188Z.json

...

dataBatchSize : number

This property configures how many crawled items will be persisted in a single file.

For example, if the number is 5, every JSON file will contain 5 crawled items.

Default: 50
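
Example (each JSON file will then contain 5 crawled items):

{
    dataBatchSize: 5
}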

sqlitePath : string

Configures where to store the SQLite database (full path, including the file name).

Default: node_modules/nodescrapy/cache.sqlite
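
Example (the path used in the full configuration example above):

{
    sqlitePath: './cache.sqlite'
}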

Client configuration

client.library : string

Chooses the client implementation: AXIOS or PUPPETEER. Default: AXIOS

client.concurrentRequests : number

Configures the number of concurrent requests. Default: 1

client.retries : number

Configures the number of retries to perform when a request fails. Default: 2

client.userAgent : string

Configures the user agent of the client.

Default: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36

client.autoScrollToBottom : boolean

If true and the client is Puppeteer, every page will be scrolled to the bottom before being rendered.

Default: true

client.retryDelay : number

Configures how many seconds the client will wait before retrying a failed request. Default: 5

client.timeoutSeconds : number

Configures the timeout of the client, in seconds. Default: 10
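
For reference, the client settings above can be combined in a single client block. A sketch reusing the values from the full configuration example:

{
    client: {
        concurrentRequests: 5,
        retries: 5,
        retryDelay: 2,
        delayBetweenRequests: 2,
        timeoutSeconds: 100
    }
}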

client.beforeRequest : (htmlRequest: HttpRequest) => HttpRequest

Function which allows you to modify the URL or the headers before performing the request. Useful for adding authentication headers or replacing the URL with a proxy one.

Default: undefined

Example
    {
        client.beforeRequest: (request: HttpRequest): HttpRequest => {
            const proxyUrl = `http://www.myproxy.com?url=${request.url}`;
    
            const requestHeaders = request.headers;
            requestHeaders.Authorization = 'JWT ...';
    
            return {
                url: proxyUrl,
                headers: requestHeaders,
            };
        }
    }

Discovery configuration

discovery.allowedDomains : string[]

Whitelist of domains to crawl. Default: the same domains as the entryUrls
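
Example (restricting the crawl to the domain used in the Getting Started example):

{
    discovery.allowedDomains: ['www.pararius.com']
}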

discovery.allowedPath : string[]

How to use this configuration:

  • If the URL contains any of the strings in allowedPath, the URL will be crawled.
  • If the URL matches any of the regular expressions in allowedPath, the URL will be crawled.

Default: ['.*']

Example
{
    discovery.allowedPath: ["/amsterdam", "houses-to-rent", "house-[A-Z]+"]
}

discovery.removeQueryParams : boolean

If true, the query parameters will be trimmed from discovered URLs. Default: false

discovery.onLinksDiscovered : (response: HtmlResponse, links: string[]) => string[]

Function that can be used to remove or add links to crawl. Default: undefined

Example
{
    discovery.onLinksDiscovered: (htmlResponse: HtmlResponse, links: string[]) => {
        links.push('https://mycustomurl.com');
        // We can use htmlResponse.$ to find links by css selectors.
        return links;
    }
}

Examples

You can check some examples in the examples folder.

Roadmap

Features to be implemented:

  • Store status and data in MongoDB.
  • Create more examples.
  • Add mode to retry errors.
  • Increase unit tests coverage.

Contributors

Main contributor: Juan Roldan

The Nodescrapy project welcomes all constructive contributions. Contributions take many forms, from code for bug fixes and enhancements, to additions and fixes to documentation, additional tests, triaging incoming pull requests and issues, and more!

License

MIT