@bluggie/nodescrapy
Web crawler in NodeJS
Overview
Nodescrapy is a fast, high-level, and highly configurable web crawling and web scraping framework, used to crawl websites and extract structured data from their pages.
Nodescrapy is written in TypeScript and runs in a Node.js environment.
Nodescrapy comes with a built-in web spider that automatically discovers all the URLs of a website.
Nodescrapy saves the crawling status in a local SQLite database, so a crawl can be stopped and resumed.
Nodescrapy provides built-in integrations with Axios and Puppeteer, so you can choose whether or not to render JavaScript.
By default, Nodescrapy saves the scraping results as JSON files in a local folder.
import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';
const onItemCrawledFunction = (response: HtmlResponse) => {
return { "data1": ... }
}
const crawler = new WebCrawler({
dataPath: './crawled-items',
entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
onItemCrawled: onItemCrawledFunction
});
crawler.crawl()
.then(() => console.log('Crawl finished'));
What does nodescrapy do?
- Provides a web client configurable with retries and delays.
- Extremely configurable for writing your own crawler.
- Provides a configurable discovery implementation to auto-detect linked resources and filter the ones you want.
- Saves the crawling status in file storage, so a crawl can be paused and resumed.
- Provides basic statistics on the crawling status.
- Automatically parses the HTML DOM with Cheerio.
- Implementations can be easily extended.
- Fully written in TypeScript.
Documentation
Installation
npm install --save @bluggie/nodescrapy
Getting Started
Initializing nodescrapy is a simple process. First, you require the module and instantiate it with the config argument. You then configure the properties you like (e.g. the request interval), register the onItemCrawled
method, and call the crawl method. Let's walk through the process!
After requiring the crawler, we create a new instance of it. We supply the constructor with the Crawler Configuration. A simple configuration contains:
- Entry URL(s) for the crawler.
- Where to store the crawled items (dataPath).
- What to do when a new page is crawled (the onItemCrawled function).
import {HtmlResponse, WebCrawler} from '@bluggie/nodescrapy';
const onItemCrawledFunction = (response: HtmlResponse) => {
if (!response.url.includes('-for-rent')) {
return undefined;
}
const $ = response.$;
return {
'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
}
}
const crawler = new WebCrawler({
dataPath: './crawled-items',
entryUrls: ['https://www.pararius.com/apartments/amsterdam'],
onItemCrawled: onItemCrawledFunction
});
crawler.crawl()
.then(() => console.log('Crawl finished'));
The onItemCrawledFunction function is required, since the crawler will invoke it to extract the data from the HTML document.
It should return undefined if there is nothing to extract from that page, or an object of {key: value} pairs if data could be extracted from that page.
See onItemCrawled for more information.
When running the application, it will produce the following logs:
info: Jul-08-2022 09:08:57: Crawled started.
info: Jul-08-2022 09:08:57: Crawling https://www.pararius.com/apartments/amsterdam
info: Jul-08-2022 09:09:00: Crawling https://www.pararius.com/apartments/amsterdam/map
info: Jul-08-2022 09:09:01: Crawling https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b6df/president-kennedylaan
info: Jul-08-2022 09:09:04: Crawling https://www.pararius.com/real-estate-agents/amsterdam/expathousing-com-amsterdam
info: Jul-08-2022 09:09:04: Adding crawled entry to data: https://www.pararius.com/apartment-for-rent/amsterdam/b180b64f/president-kennedylaan
info: Jul-08-2022 09:09:20: Saving 2 entries into JSON file: data-2022-07-08T07:09:20.115Z.json
info: Jul-08-2022 12:37:16: Crawled 29 urls. Remaining: 328
This will also store the data in JSON files (by default, 50 entries per JSON file; configurable with the dataBatchSize
property).
[
{
"provider": "nodescrapy",
"url": "https://www.pararius.com/apartment-for-rent/amsterdam/2365cc70/gillis-van-ledenberchstraat",
"data": {
"data1": "test"
},
"added_at": "2022-07-08T10:38:53.431Z",
"updated_at": "2022-07-08T10:38:53.431Z"
},
{
"provider": "nodescrapy",
"url": "https://www.pararius.com/apartment-for-rent/amsterdam/61e78537/nieuwezijds-voorburgwal",
"data": {
"data1": "test"
},
"added_at": "2022-07-08T10:38:55.466Z",
"updated_at": "2022-07-08T10:38:55.466Z"
}
]
Crawling modes
Nodescrapy can run in two different modes:
- START_FROM_SCRATCH
- CONTINUE
In START_FROM_SCRATCH mode, every time the crawler runs it will start from scratch, going through the entryUrls and all the discovered links.
In CONTINUE mode, the crawler will only crawl the links which were not processed in the last run, plus the new ones that are discovered.
To see how to configure this, go to mode
Data Models
HttpRequest
HttpRequest is a wrapper including:
- The url which is going to be crawled
- The headers which are going to be sent in the request (e.g. User-Agent)
interface HttpRequest {
url: string;
headers: { [key: string]: string; }
}
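For illustration, a minimal HttpRequest value might look like this (the header value is a placeholder); such an object is what the client.beforeRequest hook receives and returns (see Client configuration):
const request: HttpRequest = {
    url: 'https://www.pararius.com/apartments/amsterdam',
    headers: { 'User-Agent': 'my-crawler/1.0' } // Headers sent along with the request.
};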
HtmlResponse
HtmlResponse is a wrapper including:
- The crawled url
- The axios response (see AxiosResponse)
- The DOM processed by Cheerio
This information should be enough to extract the information you need from that webpage.
interface HtmlResponse {
url: string;
originalResponse: AxiosResponse;
$: CheerioAPI;
}
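For example, an onItemCrawled implementation might combine these fields as sketched below (the h1 selector and the status check are illustrative):
const onItemCrawled = (response: HtmlResponse) => {
    // Skip pages that did not return a successful status code.
    if (response.originalResponse.status !== 200) {
        return undefined;
    }
    // Use the pre-parsed Cheerio DOM to extract text with a CSS selector.
    return {
        title: response.$('h1').text(),
    };
};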
DataEntry
Represents the data that will be stored in the file system after a page with data has been crawled.
Contains:
- the id of the entry (primary key).
- the provider (crawler name).
- the url.
- the data extracted by the onItemCrawled function.
- when the data was added and updated.
interface DataEntry {
id?: number,
provider: string,
url: string,
data: { [key: string]: string; },
added_at: Date,
updated_at: Date
}
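The JSON files shown in Getting Started contain arrays of these entries. As an illustration (all values are placeholders), a DataEntry might look like this in TypeScript:
const entry: DataEntry = {
    provider: 'nodescrapy',                               // Crawler name.
    url: 'https://www.pararius.com/apartments/amsterdam',
    data: { title: 'Apartments for rent in Amsterdam' },  // Output of onItemCrawled.
    added_at: new Date(),
    updated_at: new Date()
};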
CrawlContinuationMode
Enum which defines how the crawler will run; either starting from scratch or continuing from the last execution.
Values:
- START_FROM_SCRATCH
- CONTINUE
enum CrawlContinuationMode {
START_FROM_SCRATCH = 'START_FROM_SCRATCH',
CONTINUE = 'CONTINUE'
}
CrawlerClientLibrary
Enum which defines the client implementation. Puppeteer will automatically render JavaScript using Chrome.
If Puppeteer is selected, a Chrome executable must be present on the system.
Values:
- AXIOS
- PUPPETEER
enum CrawlerClientLibrary {
AXIOS = 'AXIOS',
PUPPETEER = 'PUPPETEER'
}
Crawler configuration
Full TypeScript configuration definition
This is a definition of all the configuration options currently supported by the crawler.
{
name: 'ParariusCrawler',
mode: 'START_FROM_SCRATCH',
entryUrls: ['http://www.pararius.com'],
client: {
library: 'PUPPETEER',
autoScrollToBottom: true,
concurrentRequests: 5,
retries: 5,
userAgent: 'Firefox',
retryDelay: 2,
delayBetweenRequests: 2,
timeoutSeconds: 100,
beforeRequest: (htmlRequest: HttpRequest) => { // Only for AXIOS client.
htmlRequest.headers.Authorization = 'JWT MyAuth';
return htmlRequest;
}
},
discovery: {
allowedDomains: ['www.pararius.com'],
allowedPath: ['amsterdam/'],
removeQueryParams: true,
onLinksDiscovered: undefined
},
onItemCrawled: (response: HtmlResponse) => {
if (!response.url.includes('-for-rent')) {
return undefined;
}
const $ = response.$;
return {
'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(),
}
},
dataPath: './output-json',
dataBatchSize: 10,
sqlitePath: './cache.sqlite'
}
name : string
Name of the crawler.
The name of the crawler is important in the following scenarios:
- When resuming a crawler. The library will find the last status based on the crawler name. If you change the name, the status will be reset.
- When running multiple crawlers. The library stores the status in a SQLite database indexed by the crawler name.
Default: nodescrapy
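Example (the crawler name shown is illustrative):
{
    name: 'ParariusCrawler'
}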
mode : string
Mode of the crawler. To see the options, check CrawlContinuationMode
Default: START_FROM_SCRATCH
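Example (resumes from the previous run; the mode is passed as a string, as in the full configuration example above):
{
    mode: 'CONTINUE'
}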
entryUrls : string[]
List of URLs from which the crawler will start.
Example:
{
entryUrls: ['https://www.pararius.com/apartments/amsterdam']
}
onItemCrawled : function (response: HtmlResponse) => { [key: string]: any; } | undefined;
Function to extract the data once a URL has been crawled.
If it returns undefined, the URL will be discarded and nothing will be stored for it.
The argument of this function is provided by the crawler, and it is an HtmlResponse
Example
{
onItemCrawled: (response: HtmlResponse) => {
if (!response.url.includes('-for-rent')) {
return undefined; // Only extract information from the URLs which contain for-rent
}
const $ = response.$;
return {
'title': $('.listing-detail-summary__title , #onetrust-accept-btn-handler').text(), // Extract the title of the page.
}
}
}
dataPath : string
Configures where the output of the crawler (DataEntries) will be stored.
Example
{
dataPath: './output-data'
}
This will produce the following files:
./output-data/data-2022-07-11T08:17:38.188Z.json
./output-data/data-2022-07-11T08:17:41.188Z.json
...
dataBatchSize : number
This property configures how many crawled items will be persisted in a single file.
For example, if the number is 5, every JSON file will contain 5 crawled items. Default: 50
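Example (writes 10 entries per JSON file, as in the full configuration example above):
{
    dataBatchSize: 10
}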
sqlitePath : string
Configures where to store the SQLite database (full path, including the file name).
Default: node_modules/nodescrapy/cache.sqlite
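Example (the path shown is illustrative):
{
    sqlitePath: './cache.sqlite'
}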
Client configuration
client.library : string
Chooses the client implementation, either AXIOS or PUPPETEER.
Default: AXIOS
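Example (selects the Puppeteer client; a Chrome executable must be present on the system, see CrawlerClientLibrary):
{
    client: {
        library: 'PUPPETEER'
    }
}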
client.concurrentRequests : number
Configures the number of concurrent requests.
Default: 1
client.retries : number
Configures the number of retries to perform when a request fails.
Default: 2
client.userAgent : string
Configures the user agent of the client.
Default: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
client.autoScrollToBottom : boolean
If true and the client is Puppeteer, every page will be scrolled to the bottom before being rendered.
Default: true
client.retryDelay : number
Configures how many seconds the client will wait before retrying a failed request.
Default: 5
client.timeoutSeconds : number
Configures the timeout of the client, in seconds. Default: 10
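Putting several client options together, a fragment like the following configures a polite Axios client (all values are illustrative; delayBetweenRequests is taken from the full configuration example above):
{
    client: {
        library: 'AXIOS',
        concurrentRequests: 2,      // Two requests in flight at a time.
        retries: 3,                 // Retry a failed request up to 3 times...
        retryDelay: 5,              // ...waiting 5 seconds between retries.
        delayBetweenRequests: 1,    // Wait 1 second between consecutive requests.
        timeoutSeconds: 30,         // Give up on requests that take longer than 30 seconds.
        userAgent: 'my-crawler/1.0' // Illustrative User-Agent string.
    }
}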
client.beforeRequest : (htmlRequest: HttpRequest) => HttpRequest
Function which allows you to modify the URL or the headers before performing the request. Useful for adding authentication headers or swapping the URL for a proxy one.
Default: undefined
Example
{
    client: {
        beforeRequest: (request: HttpRequest): HttpRequest => {
            const proxyUrl = `http://www.myproxy.com?url=${request.url}`;
            const requestHeaders = request.headers;
            requestHeaders.Authorization = 'JWT ...';
            return {
                url: proxyUrl,
                headers: requestHeaders,
            };
        }
    }
}
Discovery configuration
discovery.allowedDomains : string[]
Whitelist of domains to crawl. Default: the same domains as the entryUrls
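Example (restricts discovery to a single domain, as in the full configuration example above):
{
    discovery: {
        allowedDomains: ['www.pararius.com']
    }
}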
discovery.allowedPath : string[]
How to use this configuration:
- If the URL contains any of the strings in allowedPath, the URL will be crawled.
- If the URL matches any of the regular expressions in allowedPath, the URL will be crawled.
Default: ['.*']
Example
{
    discovery: {
        allowedPath: ["/amsterdam", "houses-to-rent", "house-[A-Z]+"]
    }
}
discovery.removeQueryParams : boolean
If true, query parameters will be trimmed from discovered URLs. Default: false
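Example (with this setting, a discovered link such as https://www.pararius.com/apartments/amsterdam?page=2 should be queued without its query string):
{
    discovery: {
        removeQueryParams: true
    }
}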
discovery.onLinksDiscovered : (response: HtmlResponse, links: string[]) => string[]
Function that can be used to remove / add links to crawl. Default: undefined
Example
{
    discovery: {
        onLinksDiscovered: (htmlResponse: HtmlResponse, links: string[]) => {
            links.push('https://mycustomurl.com');
            // We can use htmlResponse.$ to find links by CSS selectors.
            return links;
        }
    }
}
Examples
You can check some examples in the examples folder.
Roadmap
Features to be implemented:
- Store status and data in MongoDB.
- Create more examples.
- Add a mode to retry errors.
- Increase unit tests coverage.
Contributors
Main contributor: Juan Roldan
The Nodescrapy project welcomes all constructive contributions. Contributions take many forms, from code for bug fixes and enhancements, to additions and fixes to documentation, additional tests, triaging incoming pull requests and issues, and more!