mongodb-simplecrawler-queue

v0.4.1

Published

3 years ago

MongoDB FetchQueue Implementation for Simplecrawler


novikovevgeniy

mongo simplecrawler queue

MongoDB Implementation of FetchQueue Interface for Simplecrawler

This package implements the simplecrawler FetchQueue interface using MongoDB as the backend.

Features:

Pause, stop, kill, or terminate a running job without losing the queue state.
Run one crawler job in parallel across several crawler instances that share one queue, including adding or removing instances at runtime.

Installation

npm install --save mongodb-simplecrawler-queue

Usage

All you need is the connection configuration: the URL of the MongoDB instance, the database name, and the collection name.

const MongoDbQueue = require('mongodb-simplecrawler-queue').MongoDbQueue; // or import { MongoDbQueue } from 'mongodb-simplecrawler-queue';
const Crawler = require('simplecrawler');

const connectionConfig = {
  url:  'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
};

const crawlerQueue = new MongoDbQueue(connectionConfig);

crawlerQueue.init(err => {
  if (err) {
    console.error(err);
    process.exit(1); // exit with a non-zero code to signal the failure
  }

  const crawler = new Crawler('https://en.wikipedia.org/wiki/Main_Page');
  crawler.maxDepth = 3;
  crawler.allowInitialDomainChange = false;
  crawler.filterByDomain = true;
  
  // here we are - tell crawler to use mongo as a queue
  crawler.queue = crawlerQueue;

  crawler.start();
});

Monitoring and garbage collector

It is possible to schedule the queue to execute monitoring tasks or garbage collector tasks periodically using queue configuration:

const connectionConfig = {
  url:  'mongodb://localhost:27017',
  dbName: 'crawler',
  collectionName: 'queue',
  // Garbage collector configuration. 
  GCConfig: { 
    run: true,
    msInterval: 1000 * 60 * 5,
  },
  // monitoring configuration
  monitorConfig: {
    run: true,
    statisticCollectionName: 'statistic',
    msInterval: 1000 * 60,
  },
};

Garbage collector

The garbage collector is a periodic task that finds "old" items that have been spooled but never fetched. This can happen when a crawler instance starts processing an item but the job fails or the process is terminated. The garbage collector ensures that such items do not go "missing" from the queue.

Garbage collector config object properties:

| Property name | Type | Comment |
|---|---|---|
| run | boolean | disable/enable GC |
| msInterval | number | Interval between GC tasks in milliseconds. |
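
To make the idea concrete, here is a minimal sketch of what a garbage collector pass could look like. It uses a plain in-memory array instead of a MongoDB collection. The function name `requeueStaleItems`, the `spooledAt` field, the `staleAfterMs` threshold, and the choice to reset stale items to `queued` are all assumptions made for illustration; the package's actual implementation may differ.

```javascript
// Hypothetical in-memory model of the queue; the real package keeps items in MongoDB.
// `spooledAt` and `staleAfterMs` are assumed names, not part of the package API.
function requeueStaleItems(items, now, staleAfterMs) {
  let requeued = 0;
  for (const item of items) {
    // An item is considered stale if it was spooled long ago and never fetched.
    const isStale =
      item.status === 'spooled' &&
      !item.fetched &&
      now - item.spooledAt > staleAfterMs;
    if (isStale) {
      item.status = 'queued'; // assumption: make the item available to any instance again
      delete item.spooledAt;
      requeued += 1;
    }
  }
  return requeued;
}

const now = Date.now();
const items = [
  // spooled 10 minutes ago by an instance that presumably died
  { url: 'https://example.com/a', status: 'spooled', fetched: false, spooledAt: now - 10 * 60 * 1000 },
  // spooled 5 seconds ago, most likely still being processed
  { url: 'https://example.com/b', status: 'spooled', fetched: false, spooledAt: now - 5 * 1000 },
  // already fully processed
  { url: 'https://example.com/c', status: 'downloaded', fetched: true },
];

const requeuedCount = requeueStaleItems(items, now, 5 * 60 * 1000);
console.log(requeuedCount, items.map(item => item.status));
```

With a five-minute threshold, only the first item is returned to the queue; the recently spooled item and the fetched item are left untouched.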

Monitoring

The monitoring task aggregates statistics about the current queue state and writes them to another collection of the MongoDB instance.

Monitoring config object properties:

| Property name | Type | Comment |
|---|---|---|
| run | boolean | disable/enable monitoring |
| statisticCollectionName | String | [default = 'statistic'] Name of the collection to write the data to. Warning: do not use the same collection name for the queue and the statistics! It will break the crawler logic and the crawler job will never finish. |
| msInterval | number | Interval between monitoring tasks in milliseconds. |

Current aggregation data: Queue monitoring adds a statistic item to the statistic collection every msInterval milliseconds, so you can query this collection at runtime to see the progress of the job. Example of a statistic item:

{
    // Aggregation statistic from the queue items - see https://github.com/simplecrawler/simplecrawler#queue-statistics-and-reporting
    actualDataSizeAvg: 40641.18656716418,
    actualDataSizeMax: 203492,
    actualDataSizeMin: 179,
    contentLengthAvg: 40641.18656716418,
    contentLengthMax: 203492,
    contentLengthMin: 179,
    downloadTimeAvg: 3629.634328358209,
    downloadTimeMax: 19584,
    downloadTimeMin: 2,
    requestLatencyAvg: 34.93283582089552,
    requestLatencyMax: 265,
    requestLatencyMin: 22,
    requestTimeAvg: 3664.5671641791046,
    requestTimeMax: 19608,
    requestTimeMin: 25,
    
    // reduced count of elements grouped by QueueItem.status (see https://github.com/simplecrawler/simplecrawler#queue-items):
    queued: 37449,
    downloaded: 696,
    headers: 7,
    spooled: 2,
    downloadprevented: 1,
    timeout: 1,
    created: 1,
    failed: 1,
    notfound: 1,
    redirected: 1,
    pulled: 0,
  
    // general info - total count of items
    totalCount: 22952,
    // general info - count of items with property "fetched": true, i.e. items that are fully processed
    fetchedCount: 134,
    // timestamp of the request
    timestamp: 1554661504815,
    // mongoDB id
    _id: "5caa418467fa3c00083b4b7a"
}

Note: if no items have a given QueueItem.status, that property is omitted from the statistic item.
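
Because status counters may be missing, code that reads a statistic item should treat absent properties as zero. The sketch below shows one way to do that and to derive a rough progress ratio from `fetchedCount` and `totalCount`. The helper name `summarizeStatistic` and the sample values are made up for illustration; only the property names come from the example item above.

```javascript
// Status names taken from the statistic item example above.
const KNOWN_STATUSES = [
  'queued', 'downloaded', 'headers', 'spooled', 'downloadprevented',
  'timeout', 'created', 'failed', 'notfound', 'redirected', 'pulled',
];

// Hypothetical helper: normalizes one statistic item read from the statistic collection.
function summarizeStatistic(stat) {
  const byStatus = {};
  for (const status of KNOWN_STATUSES) {
    // A status with no items is omitted from the item, so default it to 0.
    byStatus[status] = stat[status] || 0;
  }
  // Share of items that are fully processed; guard against an empty queue.
  const fetchedRatio = stat.totalCount > 0 ? stat.fetchedCount / stat.totalCount : 0;
  return { byStatus, fetchedRatio, at: new Date(stat.timestamp) };
}

// Made-up sample item with most status counters absent.
const sample = {
  queued: 120,
  downloaded: 30,
  totalCount: 150,
  fetchedCount: 30,
  timestamp: 1554661504815,
};

const summary = summarizeStatistic(sample);
console.log(summary.byStatus.failed); // 0, because the property was absent
console.log(summary.fetchedRatio);    // 0.2
```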

WARNING: The GC and monitoring tasks can be slowed down by a crawler running in parallel. For better performance, run GC and monitoring as separate utilities in a separate process.

Additional utilities

You can run the GC and monitoring tasks as blocking operations using the runGC and runMonitoring methods of the Utils class.

You can also drop the queue entirely using the dropQueue method. Note: this method also clears the statistic collection. // TODO examples

Performance

// TODO tests and results, some graphics

MongoDB as a docker image locally

Start MongoDB in a Docker container

docker run --name <CONTAINER_NAME> -p 27017:27017 -d mongo:3.6.17-xenial

Connect to the running container

docker exec -it <CONTAINER_NAME> bash
