spidey-redis

v1.0.4

Published

2 years ago

Spidey with the power of Redis

smasadhaider

crawler spider data miner scraper web scraper spidey redis

Redis Spidey - Distributed Web Scraping Solution Powered by Redis

RedisSpidey combines Spidey with Redis to support distributed crawling and web scraping. Multiple RedisSpidey instances can run in parallel, all listening to the same Redis queue. RedisSpidey can also push scraped data back to Redis queues, so post-processing can be distributed as well.

Features

Distributed Crawling: Multiple crawler instances can run at the same time, all listening to the same queue.
RedisPipeline: Crawled data can be pushed back to Redis queues for distributed post-processing.
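
The shared-queue idea behind these two features can be illustrated without a Redis server. The following is a minimal in-memory sketch, not RedisSpidey's implementation: `InMemoryQueue`, `workOnce`, and `runWorkers` are hypothetical names, and a plain array stands in for the Redis queues. It only shows that several workers draining one input queue each handle a URL once and write results to one output queue.

```typescript
// In-memory stand-in for a Redis queue (assumption: a simple FIFO).
class InMemoryQueue<T> {
  private items: T[] = [];

  push(item: T): void {
    this.items.push(item);
  }

  // Removes and returns the oldest item, or undefined when the queue is empty.
  pop(): T | undefined {
    return this.items.shift();
  }
}

interface ScrapedItem {
  url: string;
  worker: string;
}

// Hypothetical worker step: take one URL from the shared input queue and
// push a result to the shared output queue. Returns false when there is no work.
function workOnce(
  name: string,
  urls: InMemoryQueue<string>,
  data: InMemoryQueue<ScrapedItem>,
): boolean {
  const url = urls.pop();
  if (url === undefined) return false;
  data.push({ url, worker: name });
  return true;
}

function runWorkers(workerNames: string[], seedUrls: string[]): ScrapedItem[] {
  const urls = new InMemoryQueue<string>();
  const data = new InMemoryQueue<ScrapedItem>();
  seedUrls.forEach((u) => urls.push(u));

  // Round-robin over the workers until the input queue is drained.
  let busy = true;
  while (busy) {
    busy = false;
    for (const name of workerNames) {
      if (workOnce(name, urls, data)) busy = true;
    }
  }

  // Drain the output queue, as a post-processing step would.
  const collected: ScrapedItem[] = [];
  let item = data.pop();
  while (item !== undefined) {
    collected.push(item);
    item = data.pop();
  }
  return collected;
}

const demoResults = runWorkers(
  ['worker-a', 'worker-b'],
  ['https://example.com/1', 'https://example.com/2', 'https://example.com/3'],
);
console.log(demoResults);
```

In a real deployment the workers would be separate processes and the queues would live in Redis, so the round-robin loop here is only a stand-in for true parallelism.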

Installation

npm install spidey-redis

Options

RedisSpidey supports all Spidey options in addition to the following specific options.

| Configuration | Type | Description | Default | Required |
| --- | --- | --- | --- | --- |
| redisUrl | string | Redis URL such as redis://localhost:6379 | null | Yes |
| urlsKey | string | Redis input queue name such as urls:queue | null | Yes |
| dataKey | string | Redis output queue name such as data:queue | null | Yes if using RedisPipeline |
| sleepDelay | number | Wait for new items in queue if empty | 5000ms | No |
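
As a rough illustration of the rules in the options table, the sketch below models the four Redis-specific options as a TypeScript type and checks the "Required" column. The names `RedisOptionsSketch`, `checkOptions`, and `resolveSleepDelay` are hypothetical and are not exported by spidey-redis; the library may validate its options differently, or not at all.

```typescript
// Hypothetical shape mirroring the options table; not part of spidey-redis.
interface RedisOptionsSketch {
  redisUrl?: string;
  urlsKey?: string;
  dataKey?: string;
  sleepDelay?: number; // milliseconds
}

// Default taken from the table above.
const DEFAULT_SLEEP_DELAY_MS = 5000;

// Returns a list of human-readable problems; an empty list means the options
// satisfy the "Required" column of the table.
function checkOptions(
  options: RedisOptionsSketch,
  usesRedisPipeline: boolean,
): string[] {
  const problems: string[] = [];
  if (!options.redisUrl) problems.push('redisUrl is required');
  if (!options.urlsKey) problems.push('urlsKey is required');
  if (usesRedisPipeline && !options.dataKey) {
    problems.push('dataKey is required when using RedisPipeline');
  }
  return problems;
}

// Falls back to the documented default when sleepDelay is not set.
function resolveSleepDelay(options: RedisOptionsSketch): number {
  return options.sleepDelay ?? DEFAULT_SLEEP_DELAY_MS;
}

const exampleOptions: RedisOptionsSketch = {
  redisUrl: 'redis://localhost:6379',
  urlsKey: 'urls:queue',
};

console.log(checkOptions(exampleOptions, true));
console.log(resolveSleepDelay(exampleOptions));
```

With `exampleOptions`, the check reports only the missing `dataKey`, because the second argument says RedisPipeline is in use.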

Usage

import { RedisSpidey, RedisPipeline } from 'spidey-redis';

class AmazonSpidey extends RedisSpidey {
  constructor() {
    super({
      // spidey options ...
      redisUrl: 'redis://localhost:6379',

      // Input queue
      urlsKey: 'amazon:urls',

      // Output queue
      dataKey: 'amazon:data',

      // Redis pipeline to push crawled data to data queue 
      pipelines: [RedisPipeline],
    });
  }
}

Conclusion

RedisSpidey is built for large-scale, distributed web scraping and crawling. Multiple instances can share a single Redis input queue, and scraped data can be pushed back to Redis for distributed post-processing.

License

Spidey is licensed under the MIT License.
