crawlit

v0.1.5

Published

3 years ago

A Node.js crawler that supports custom crawl rules for specific sites through plugins.

Vulnerabilities: 0 high, 0 medium, 0 low

inaction

crawler crawl spidder

crawler

A Node.js crawler that supports custom plugins for implementing site-specific crawl rules. It includes an example plugin for crawling Discuz 2.x sites.

###Finished Features

- Crawl a site.
- Filter: include or exclude URL paths of the site.
- Plugins: Discuz 2.0 attachments, Discuz 2.0 filter.
- Queue and crawl status.
- Update mode.
- Supports a wget cookies config. You can export a site's cookies using Cookie exporter.
- Uses jsdom and jQuery to extract the needed resources from crawled pages.
- GBK to UTF-8 conversion.
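As an illustration of the include/exclude filter idea, here is a minimal sketch of a path-prefix filter. The names `makeUrlFilter`, `include`, and `exclude` are hypothetical and are not taken from crawlit's API; the sketch only assumes that rules are expressed as URL path prefixes on a single site.

```javascript
// Hypothetical helper; not part of crawlit's documented API.
const { URL } = require('url');

function makeUrlFilter(siteRoot, { include = [], exclude = [] } = {}) {
  const root = new URL(siteRoot);
  return function shouldCrawl(href) {
    let url;
    try {
      // Resolve relative links against the site root.
      url = new URL(href, root);
    } catch (err) {
      return false; // unparsable link
    }
    // Stay on the same site.
    if (url.host !== root.host) return false;
    // Exclude rules win over include rules.
    if (exclude.some((prefix) => url.pathname.startsWith(prefix))) return false;
    // An empty include list means "everything that is not excluded".
    return include.length === 0 ||
      include.some((prefix) => url.pathname.startsWith(prefix));
  };
}
```

A filter built with `makeUrlFilter('http://example.com/', { include: ['/forum/'], exclude: ['/forum/admin/'] })` would accept `/forum/thread-1.html` and reject both `/forum/admin/index.php` and links to other hosts.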

##Feature List (reference: http://obmem.info/?p=753)

- Support request.pipe; crawl the whole site in stream.pipe mode.
- Basic site crawling.
- Proxy support.
- Login required? Cookie auth; update and save cookie data.
  - Form login?
  - Cookie support
  - Browser User-Agent setting
  - Multi-proxy support
- Monitor: disk usage(?), total page count, crawled count, crawling count, speed, memory usage, failed list.
- CP: monitor viewer; start/pause/stop crawler; failed/retry; change config.
- gzip/deflate via the 'accept-encoding' header: 5 times speedup.
- Multi-workers/async.
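For the gzip/deflate item, the sketch below shows one way a crawler could request compressed responses and decode them with Node's built-in `zlib` module. `ACCEPT_ENCODING` and `decodeBody` are hypothetical names, not part of crawlit, and the sketch assumes the response body has already been buffered in full.

```javascript
const zlib = require('zlib');

// Request header a crawler could send to ask for compressed responses.
const ACCEPT_ENCODING = { 'accept-encoding': 'gzip, deflate' };

// Hypothetical helper: decode a fully buffered response body according to
// the value of its Content-Encoding header.
function decodeBody(body, contentEncoding) {
  switch ((contentEncoding || '').trim().toLowerCase()) {
    case 'gzip':
      return zlib.gunzipSync(body);
    case 'deflate':
      // Handles zlib-wrapped deflate; raw deflate streams would need
      // zlib.inflateRawSync instead.
      return zlib.inflateSync(body);
    default:
      return body; // identity or unknown encoding: pass through unchanged
  }
}
```

In a streaming crawler the same idea would use `zlib.createGunzip()` or `zlib.createInflate()` in a pipe chain rather than the synchronous calls shown here.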

##Install

    npm install crawlit

##Usage

Basic usage:

    // Load the basic config (assumed to define the `config` object used below)
    require('./config/config.js');
    // To override settings, put them in your own `./config/config.local.js`

    // Settings can also be overridden in code
    config.crawlOption.working_root_path = 'run/crawler';
    config.crawlOption.resourceParser = require('./lib/plugins/discuz');

    var crawlIt = require('crawlit').domCrawler;
    crawlIt.init({update: false});
    // Start crawling
    crawlIt.crawl(config.crawlOption.page);
    // Add other crawl interfaces here
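How crawlit itself combines `config.js` with `config.local.js` is not described here. As a rough sketch of the general idea of layering local overrides on top of defaults, a recursive merge could look like the following. `mergeConfig` is a hypothetical helper and the values are placeholders, not crawlit defaults; only the key names `crawlOption`, `working_root_path`, and `page` come from the usage example.

```javascript
// Hypothetical helper: recursively overlay `overrides` on top of `defaults`
// without mutating either input.
function mergeConfig(defaults, overrides) {
  const out = { ...defaults };
  for (const [key, value] of Object.entries(overrides || {})) {
    const base = out[key];
    const bothPlainObjects =
      value && typeof value === 'object' && !Array.isArray(value) &&
      base && typeof base === 'object' && !Array.isArray(base);
    // Merge nested objects; replace everything else outright.
    out[key] = bothPlainObjects ? mergeConfig(base, value) : value;
  }
  return out;
}

// Placeholder values for illustration only.
const defaults = {
  crawlOption: { working_root_path: 'run/default', page: 'http://example.com/' },
};
const local = { crawlOption: { working_root_path: 'run/crawler' } };
const merged = mergeConfig(defaults, local);
```

Here `merged.crawlOption.working_root_path` takes the local value while `merged.crawlOption.page` keeps the default.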

###More examples: see the QiCai Crawl Example

##License

MIT
