dcrawler
v0.0.8
DCrawler is a distributed web spider written in Node.js and queued with MongoDB. It gives you the full power of jQuery to parse big pages as they are downloaded, asynchronously. Simplifying distributed crawling!
node-distributed-crawler
Features
- Distributed crawler
- Configurable URL parser and data parser
- jQuery selectors using cheerio
- Parsed data insertion into a MongoDB collection
- Domain-wise interval configuration in a distributed environment
- Node 0.8+ support
Note: update to the latest version (0.0.4+); don't use 0.0.1.
I am actively updating this library; feature suggestions and fork/pull requests are welcome :)
Installation
$ npm install dcrawler
Usage
var DCrawler = require("dcrawler");

var options = {
    mongodbUri: "mongodb://0.0.0.0:27017/crawler-data",
    profilePath: __dirname + "/" + "profile"
};
var logs = {
    dbUri: "mongodb://0.0.0.0:27017/crawler-log",
    storeHost: true
};

var dc = new DCrawler(options, logs);
dc.start();
Note: The MongoDB connection URIs (mongodbUri and dbUri) should point to the same server, since the queueing of URLs must be centralized.
The DCrawler constructor takes options and logs:
- options with the following properties *:
    - mongodbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler') *
    - profilePath: Location of the profile directory which contains the config files (e.g. /home/crawler/profile) *
- logs to store logs in a centralized location using winston-mongodb, with the following properties:
    - dbUri: MongoDB connection URI (e.g. 'mongodb://0.0.0.0:27017/crawler')
    - storeHost: Boolean, whether or not to store the worker's host name in the log collection.
Note: logs is only required when you want to store centralized logs in MongoDB. If you don't want to store logs, there is no need to pass the logs argument to the DCrawler constructor:
var dc = new DCrawler(options);
Create a config file for each domain inside the profilePath directory. See the example profile for example.com, which contains a config with the following properties (a combined example is shown after the list):
- collection: Name of the collection in which to store parsed data in MongoDB (e.g. 'products') *
- url: URL to start crawling. String or array of URLs (e.g. 'http://example.com' or ['http://example.com']) *
- interval: Interval between requests in milliseconds. Default is 1000 (e.g. for 2 seconds: interval: 2000)
- followUrl: Boolean, whether or not to fetch further URLs from the crawled page and crawl those URLs as well.
- resume: Boolean, whether or not to resume crawling from previously crawled data.
- beforeStart: Function to execute before crawling starts. The function receives a config param which contains the particular profile's config object. Example function:
beforeStart: function (config) {
    console.log("started crawling example.com");
}
- parseUrl: Function to extract further URLs from the crawled page. The function receives error, a response object and the $ jQuery object as params, and returns an array of URL strings. Example function:
parseUrl: function (error, response, $) {
    var _url = [];
    try {
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                if (href.indexOf("http://example.com") === -1) {
                    href = "http://example.com/" + href;
                }
                _url.push(href);
            }
        });
    } catch (e) {
        console.log(e);
    }
    return _url;
}
- parseData: Function to extract information from the crawled page. The function receives error, a response object and the $ jQuery object as params, and returns a data object to insert into the collection. Example function:
parseData: function (error, response, $) {
    var _data = null;
    try {
        var _id = $("h1#productId").html();
        var name = $("span#productName").html();
        var price = $("label#productPrice").html();
        var url = response.uri;

        _data = {
            _id: _id,
            name: name,
            price: price,
            url: url
        };
    } catch (e) {
        console.log(e);
    }
    return _data;
}
- onComplete: Function to execute when crawling completes. The function receives a config param which contains the particular profile's config object. Example function:
onComplete: function (config) {
    console.log("completed crawling example.com");
}
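
For reference, here is what a full profile might look like with all of the above properties combined. This is only a sketch: the exact file format is not shown in this README, so it assumes each profile (e.g. profile/example.com.js) is a Node module exporting its config object, and the field values are illustrative.

// profile/example.com.js -- hypothetical combined profile (sketch only).
// Assumes a profile is a Node module that exports its config object.
module.exports = {
    collection: "products",          // MongoDB collection for parsed data
    url: ["http://example.com"],     // start URL(s)
    interval: 2000,                  // 2 seconds between requests
    followUrl: true,                 // also crawl URLs returned by parseUrl
    resume: false,                   // do not resume from previous crawl data
    beforeStart: function (config) {
        console.log("started crawling example.com");
    },
    parseUrl: function (error, response, $) {
        var _url = [];
        $("a").each(function () {
            var href = $(this).attr("href");
            if (href && href.indexOf("/products") > -1) {
                _url.push(href);
            }
        });
        return _url;
    },
    parseData: function (error, response, $) {
        return {
            _id: $("h1#productId").html(),
            name: $("span#productName").html(),
            price: $("label#productPrice").html(),
            url: response.uri
        };
    },
    onComplete: function (config) {
        console.log("completed crawling example.com");
    }
};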
Chirag (blikenoother -[at]- gmail [dot] com)