mongodb-simplecrawler-queue
v0.4.1
Published
MongoDB FetchQueue Implementation for Simplecrawler
Downloads
8
Maintainers
Readme
MongoDB Implementation of FetchQueue Interface for Simplecrawler
This is an implementation of FetchQueue Interface for simplecrawler queue with MongoDB usage as backend.
Preferences:
- Possibility to pause/stop/kill/terminate running job without queue state losing
- Possibility to run one crawler job in parallel with several crawler instances using one queue. (Including adding/removing instances in runtime)
Installation
npm install --save mongodb-simplecrawler-queue
Usage
All you need is connection configuration: url of the MongoDB instance
const MongoDbQueue = require('mongodb-simplecrawler-queue').MongoDbQueue; // or import { MongoDbQueue } from 'mongodb-simplecrawler-queue';
const Crawler = require('simplecrawler');
const connectionConfig = {
url: 'mongodb://localhost:27017',
dbName: 'crawler',
collectionName: 'queue',
};
const crawlerQueue = new MongoDbQueue(connectionConfig);
crawlerQueue.init(err => {
if (err) {
console.log(err);
process.exit(0);
}
const crawler = new Crawler('https://en.wikipedia.org/wiki/Main_Page');
crawler.maxDepth = 3;
crawler.allowInitialDomainChange = false;
crawler.filterByDomain = true;
// here we are - tell crawler to use mongo as a queue
crawler.queue = crawlerQueue;
crawler.start();
});
Monitoring and garbage collector
It is possible to schedule the queue to execute monitoring tasks or garbage collector tasks periodically using queue configuration:
const connectionConfig = {
url: 'mongodb://localhost:27017',
dbName: 'crawler',
collectionName: 'queue',
// Garbage collector configuration.
GCConfig: {
run: true,
msInterval: 1000 * 60 * 5,
},
// monitoring configuration
monitorConfig: {
run: true,
statisticCollectionName: 'statistic',
msInterval: 1000 * 60,
},
};
Garbage collector
Garbage collector is a task which will be executed periodically and find "old" items which have been spooled but not fetched. It could happen in case when crawler instance started to process new item but job failed/process has been terminated etc. So you don't have "missing" items in the queue.
Garbage collector config object properties:
| Property name | Type | Comment | |---|---|---| | run | boolean | disable/enable GC | | msInterval | number | Interval between Monitoring tasks in milliseconds. |
Monitoring
Monitoring task is a task which will aggregate all statistic information about current queue state and put data in another collection of the MongoDB instance
Monitoring config object properties:
| Property name | Type | Comment | |---|---|---| | run | boolean | disable/enable GC | | statisticCollectionName | String | [default = 'statistic'] Collection name to put data. Warning Please don't use same collections names for queue and statistic! It will break crawler logic and crawler job will never finish. |
Current aggregation data:
Queue monitoring will add statistic data items into statistic Collection each msInterval
milliseconds. So you can use this collection in runtime to see progress sof the job.
Example of the statistic item:
{
// Aggregation statistic from the queue items - see https://github.com/simplecrawler/simplecrawler#queue-statistics-and-reporting
actualDataSizeAvg: 40641.18656716418,
actualDataSizeMax: 203492,
actualDataSizeMin: 179,
contentLengthAvg: 40641.18656716418,
contentLengthMax: 203492,
contentLengthMin: 179,
downloadTimeAvg: 3629.634328358209,
downloadTimeMax: 19584,
downloadTimeMin: 2,
requestLatencyAvg: 34.93283582089552,
requestLatencyMax: 265,
requestLatencyMin: 22,
requestTimeAvg: 3664.5671641791046,
requestTimeMax: 19608,
requestTimeMin: 25,
// reduced count of elements grouped by QueueItem.status (see https://github.com/simplecrawler/simplecrawler#queue-items):
queued: 37449,
downloaded: 696,
headers: 7,
spooled: 2,
downloadprevented: 1,
timeout: 1,
created: 1,
failed: 1,
notfound: 1,
redirected: 1,
pulled: 0,
// general info - total count of items
totalCount: 22952,
// general info - count of items with property "fetched": true. Means items is fully processed
fetchedCount: 134,
// timestamp of the request
timestamp: 1554661504815,
// mongoDB id
_id: "5caa418467fa3c00083b4b7a"
}
Note: if there are no elements with some QueueItem.status, this property will not be included in the statistic item.
WARNING: These GC and Monitoring tasks could be slowed down by parallel general crawler execution. For better performance, please use GC and Monitoring as separate utilities in separate process.
Additional utilities
You can run GCTasks, Monitoring as blocking operations using Utils class and runGC
and runMonitoring
methods.
Also you can fully drop the queue using dropQueue
method. Note: this method also cleans statistic collection.
// TODO examples
Performance
// TODO tests and results, some graphics
MongoDB as a docker image locally
Start mongo as a docker
docker run --name <IMAGE_NAME> -p 27017:27017 -d mongo:3.6.17-xenial
Connect to the running image
docker exec -it <IMAGE_NAME> bash