@bluggie/nodescrapy

v0.1.6

Web crawler in NodeJS

Overview

Nodescrapy is a fast, high-level, and highly configurable web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.

Nodescrapy is written in TypeScript and runs in a Node.js environment.

Nodescrapy comes with a built-in web spider, which automatically discovers all the URLs of a website.

Nodescrapy saves the crawl status in a local SQLite database, so crawling can be stopped and resumed.

Nodescrapy provides default integrations with Axios and Puppeteer, so you can choose whether or not to render JavaScript.

By default, Nodescrapy saves the scraping results as JSON files in a local folder.

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const onItemCrawledFunction = (response: HtmlResponse) => {
    // Return the data to store for this page (see the JSON output in Getting Started).
    return { 'data1': 'test' };
}

const crawler = new WebCrawler({
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: onItemCrawledFunction
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));

What does nodescrapy do?

  • Provides a web client configurable with retries and delays.
  • Extremely configurable for writing your own crawler.
  • Provides a configurable discovery implementation to automatically detect linked resources and filter the ones you want.
  • Saves the crawl status in file storage, so a crawl can be paused and resumed.
  • Provides basic statistics on the crawling status.
  • Automatically parses the HTML DOM with Cheerio.
  • Implementations can be easily extended.
  • Fully written in TypeScript.

Documentation

Installation

npm install --save @bluggie/nodescrapy

Getting Started

Initializing nodescrapy is a simple process. First, you import the module and instantiate it with a config argument. You then configure the properties you like (e.g. the request interval), register the onItemCrawled method, and call the crawl method. Let's walk through the process!

After importing the crawler, we create a new instance of it and supply the constructor with the crawler configuration. A simple configuration contains:

  • Entry URL(s) for the crawler (entryUrls).
  • Where to store the crawled items (dataPath).
  • What to do when a new page is crawled (the onItemCrawled function).

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const onItemCrawledFunction = (response: HtmlResponse) => {
    if (!response.url.includes('-for-rent')) {
        return undefined;
    }

    const $ = response.$;
    return {
        'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
    }
}

const crawler = new WebCrawler({
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: onItemCrawledFunction
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));

The function onItemCrawledFunction is required, since the crawler will invoke it to extract the data from the HTML document. It should return undefined if there is nothing to extract from that page, or a {key: value} object if data could be extracted from that page. See onItemCrawled for more information.

When running the application, it will produce the following logs:

info: Jul-08-2022 09:08:57: Crawled started.
info: Jul-08-2022 09:08:57: Crawling https://www.pararius.com/apartments/amsterdam
info: Jul-08-2022 09:09:00: Crawling https://www.pararius.com/apartments/amsterdam/map
info: Jul-08-2022 09:09:01: Crawling https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Crawling https://www.pararius.com/real-estate-agents/amsterdam/expathousing-com-amsterdam
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b64f/president-kennedylaan
info: Jul-08-2022 09:09:20: Saving 2 entries into JSON file: data-2022-07-08T07:09:20.115Z.json
info: Jul-08-2022 12:37:16: Crawled 29 urls. Remaining: 328

This will also store the data in JSON files (by default, 50 entries per JSON file; configurable with the dataBatchSize property).

[
  {
    "provider": "nodescrapy",
    "url": "https://www.pararius.com/apartment-for-rent/amsterdam/2365cc70/gillis-van-ledenberchstraat",
    "data": {
      "data1": "test"
    },
    "added_at": "2022-07-08T10:38:53.431Z",
    "updated_at": "2022-07-08T10:38:53.431Z"
  },
  {
    "provider": "nodescrapy",
    "url": "https://www.pararius.com/apartment-for-rent/amsterdam/61e78537/nieuwezijds-voorburgwal",
    "data": {
      "data1": "test"
    },
    "added_at": "2022-07-08T10:38:55.466Z",
    "updated_at": "2022-07-08T10:38:55.466Z"
  }
]

Crawling modes

Nodescrapy can run in two different modes:

  • START_FROM_SCRATCH
  • CONTINUE

In START_FROM_SCRATCH mode, every run of the crawler starts from zero, going through the entryUrls and all the discovered links.

In CONTINUE mode, the crawler will only crawl the links which were not processed in the last run, plus the new ones that are discovered.

To see how to configure this, go to mode.
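A minimal sketch of resuming a crawl, reusing the shape of the Getting Started example (the string value matches the CrawlContinuationMode enum documented below):

import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';

const crawler = new WebCrawler({
    mode: 'CONTINUE', // Only crawl links not processed in the last run; 'START_FROM_SCRATCH' resets.
    dataPath: './crawled-items',
    entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
    onItemCrawled: (response: HtmlResponse) => ({ 'title': response.$('title').text() })
});

crawler.crawl()
    .then(() => console.log('Crawl finished'));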

Data Models

HttpRequest

HttpRequest is a wrapper including:

  • The url which is going to be crawled.
  • The headers which are going to be sent in the request (e.g. User-Agent).

interface HttpRequest {
  url: string;
  
  headers: { [key: string]: string; }
}
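
For example, a beforeRequest hook (supported by the Axios client, see the client configuration below) receives and returns this wrapper. A minimal sketch that overrides the User-Agent header, assuming HttpRequest is exported by the package like HtmlResponse and WebCrawler:

import {HttpRequest} from '@bluggie/nodescrapy'; // Assumes HttpRequest is exported by the package.

const beforeRequest = (request: HttpRequest): HttpRequest => {
    // Override the User-Agent header before the request is sent.
    request.headers['User-Agent'] = 'my-custom-crawler/1.0';
    return request;
};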

HtmlResponse

HtmlResponse is a wrapper including:

  • The url which has been crawled.
  • The originalResponse returned by the HTTP client (an AxiosResponse).
  • $, a Cheerio instance loaded with the page's HTML.

This information should be enough to extract the information you need from that webpage.

interface HtmlResponse {
  url: string;

  originalResponse: AxiosResponse;

  $: CheerioAPI;
}
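
For instance, an onItemCrawled function can use $ for CSS selection, while url and originalResponse provide extra context. A minimal sketch based on the Getting Started example:

import {HtmlResponse} from '@bluggie/nodescrapy';

const onItemCrawled = (response: HtmlResponse) => {
    // response.$ is a Cheerio instance already loaded with the page's HTML.
    const title = response.$('.listing-detail-summary__title').text();

    // Return undefined to skip pages where the selector matched nothing.
    if (!title) {
        return undefined;
    }

    // response.url and response.originalResponse (the raw AxiosResponse) are also available here.
    return { 'title': title };
};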

DataEntry

Represents the data that will be stored in the file system after a page with data has been crawled.

Contains:

  • The id of the entry (primary key).
  • The provider (crawler name).
  • The url.
  • The data extracted by the onItemCrawled function.
  • The timestamps when the data was added and updated (added_at, updated_at).

interface DataEntry {
    id?: number,
    provider: string,
    url: string,
    data: { [key: string]: string; },
    added_at: Date,
    updated_at: Date
}
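
Entries are persisted as plain JSON (see the Getting Started output above), so they can be read back and typed. A sketch that assumes DataEntry is exported by the package and uses the file name from the example logs:

import {readFileSync} from 'fs';
import {DataEntry} from '@bluggie/nodescrapy'; // Assumes DataEntry is exported by the package.

// File produced under dataPath; note that added_at/updated_at are ISO strings in the JSON files.
const raw = readFileSync('./crawled-items/data-2022-07-08T07:09:20.115Z.json', 'utf-8');
const entries: DataEntry[] = JSON.parse(raw);

entries.forEach((entry) => console.log(entry.url, entry.data));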

CrawlContinuationMode

Enum which defines how the crawler will run: either starting from scratch or continuing from the last execution.

Values:

  • START_FROM_SCRATCH
  • CONTINUE

enum CrawlContinuationMode {
    START_FROM_SCRATCH = 'START_FROM_SCRATCH',
    CONTINUE = 'CONTINUE'
}

CrawlerClientLibrary

Enum which defines the client implementation. Puppeteer will automatically render JavaScript using Chrome.

If PUPPETEER is selected, a Chrome executable must be present on the system.

Values:

  • AXIOS
  • PUPPETEER

enum CrawlerClientLibrary {
    AXIOS = 'AXIOS',
    PUPPETEER = 'PUPPETEER'
}
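
The library is selected through the client section of the crawler configuration (see the full configuration example below); a minimal sketch of just that section:

{
    client: {
        library: 'PUPPETEER'
    }
}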

Crawler configuration

Full TypeScript configuration definition

This is a definition of all the configuration options currently supported by the crawler.

{
    name: 'ParariusCrawler',
    mode: 'START_FROM_SCRATCH',
    entryUrls: ['http://www.pararius.com'],
    client: {
        library: 'PUPPETEER',
        autoScrollToBottom: true,
        concurrentRequests: 5,
        retries: 5,
        userAgent: 'Firefox',
        retryDelay: 2,
        delayBetweenRequests: 2,
        timeoutSeconds: 100,
        beforeRequest: (htmlRequest: HttpRequest) => { // Only for AXIOS client.
            htmlRequest.headers.Authorization = 'JWT MyAuth';
            return htmlRequest;
        }
    },
    discovery: {
        allowedDomains: ['www.pararius.com'],
        allowedPath: ['amsterdam/'],
        removeQueryParams: true,
        onLinksDiscovered: undefined
    },
    onItemCrawled: (response: HtmlResponse) => {
        if (!response.url.includes('-for-rent')) {
            return undefined;
        }

        const $ = response.$;
        return {
            'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
        }
    },
    dataPath: './output-json',
    dataBatchSize: 10,
    sqlitePath: './cache.sqlite'
}

name : string

Name of the crawler.

The name of the crawler is important in the following scenarios:

  • When resuming a crawler. The library will find the last status based on the crawler name. If you change the name, the status will be reset.
  • When running multiple crawlers. The library stores the status in a SQLite database indexed by the crawler name.

Default: nodescrapy
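
Example (using the name from the full configuration example above):

{
    name: 'ParariusCrawler'
}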

mode : string

Mode of the crawler. To see the options, check CrawlContinuationMode.

Default: START_FROM_SCRATCH

entryUrls : string[]

List of URLs from which the crawler will start.

Example:
{
    entryUrls: ['https://www.pararius.com/apartments/amsterdam']
}

onItemCrawled : function (response: HtmlResponse) => { [key: string]: any; } | undefined;

Function to extract the data when a URL has been crawled.

If it returns undefined, the URL will be discarded and nothing will be stored for it.

The argument of this function is provided by the crawler, and it is an HtmlResponse.

Example
 {
    onItemCrawled: (response: HtmlResponse) => {
        if (!response.url.includes('-for-rent')) {
        return undefined; // Only extract information from the URLs which contain for-rent.
        }

        const $ = response.$;
        return {
            'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(), // Extract the title of the page.
        }
    }
}

dataPath : string

Configures where the output of the crawler (DataEntries) will be stored.

Example
{
    dataPath: './output-data'
}

This will produce the following files:

./output-data/data-2022-07-11T08:17:38.188Z.json

./output-data/data-2022-07-11T08:17:41.188Z.json

...

dataBatchSize : number

This property configures how many crawled items will be persisted in a single file.

For example, if the number is 5, every JSON file will contain 5 crawled items.

Default: 50
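
Example (each JSON file will then contain 5 crawled items):

{
    dataBatchSize: 5
}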

sqlitePath : string

Configures where to store the SQLite database (full path, including the file name).

Default: node_modules/nodescrapy/cache.sqlite
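
Example (the path used in the full configuration example above):

{
    sqlitePath: './cache.sqlite'
}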

Client configuration

client.library : string

Chooses the client implementation: AXIOS or PUPPETEER. Default: AXIOS

client.concurrentRequests : number

Configures the number of concurrent requests. Default: 1

client.retries : number

Configures the number of retries to perform when a request fails. Default: 2

client.userAgent : string

Configures the user agent of the client.

Default: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36

client.autoScrollToBottom : boolean

If true and the client is Puppeteer, every page will be scrolled to the bottom before being rendered.

Default: true

client.retryDelay : number

Configures how many seconds the client will wait before retrying a failed request. Default: 5

client.timeoutSeconds : number

Configures the timeout of the client, in seconds. Default: 10
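
For reference, the client settings above can be combined in a single client block. A sketch reusing the values from the full configuration example:

{
    client: {
        concurrentRequests: 5,
        retries: 5,
        retryDelay: 2,
        delayBetweenRequests: 2,
        timeoutSeconds: 100
    }
}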

client.beforeRequest : (htmlRequest: HttpRequest) => HttpRequest

Function which allows you to modify the URL or the headers before performing the request. Useful for adding authentication headers or replacing the URL with a proxy one.

Default: undefined

Example
    {
        client.beforeRequest: (request: HttpRequest): HttpRequest => {
            const proxyUrl = `http://www.myproxy.com?url=${request.url}`;
    
            const requestHeaders = request.headers;
            requestHeaders.Authorization = 'JWT ...';
    
            return {
                url: proxyUrl,
                headers: requestHeaders,
            };
        }
    }

Discovery configuration

discovery.allowedDomains : string[]

Whitelist of domains to crawl. Default: the same domains as the entryUrls
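
Example (restricting the crawl to the domain used in the Getting Started example):

{
    discovery.allowedDomains: ['www.pararius.com']
}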

discovery.allowedPath : string[]

How to use this configuration:

  • If the URL contains any of the strings in allowedPath, the URL will be crawled.
  • If the URL matches any of the regular expressions in allowedPath, the URL will be crawled.

Default: ['.*']

Example
{
    discovery.allowedPath: ["/amsterdam", "houses-to-rent", "house-[A-Z]+"]
}

discovery.removeQueryParams : boolean

If true, the query parameters will be trimmed from discovered URLs. Default: false

discovery.onLinksDiscovered : (response: HtmlResponse, links: string[]) => string[]

Function that can be used to remove or add links to crawl. Default: undefined

Example
{
    discovery.onLinksDiscovered: (htmlResponse: HtmlResponse, links: string[]) => {
        links.push('https://mycustomurl.com');
        // We can use htmlResponse.$ to find links by css selectors.
        return links;
    }
}

Examples

You can check some examples in the examples folder.

Roadmap

Features to be implemented:

  • Store status and data in MongoDB.
  • Create more examples.
  • Add mode to retry errors.
  • Increase unit tests coverage.

Contributors

Main contributor: Juan Roldan

The Nodescrapy project welcomes all constructive contributions. Contributions take many forms, from code for bug fixes and enhancements, to additions and fixes to documentation, additional tests, triaging incoming pull requests and issues, and more!

License

MIT