gz-spider
v0.0.2
Web spider framework
A web spider framework for Node.js, based on Puppeteer and Axios.
Features
- IP proxy
- Retry on failure
- Puppeteer support
- Easily compatible with various task queue services
- Easy multiprocessing
Install
npm i gz-spider --save
Usage
const spider = require('gz-spider');
// Register all of your spider code as processers
spider.setProcesser({
  ['getGoogleSearchResult']: async (fetcher, params) => {
    // fetcher.page is the original Puppeteer page
    let resp = await fetcher.axios.get(`https://www.google.com/search?q=${encodeURIComponent(params)}`);
    // throw 'Retry': retry this processer
    // throw 'ChangeProxy': retry this processer with a new proxy
    // throw 'Fail': finish this processer immediately with a failure message
    if (resp.status === 200) {
      // Data processing start
      let result = resp.data + 1;
      // Data processing end
      return result;
    } else {
      throw 'Retry';
    }
  }
});
// Get data
const params = 'gz-spider'; // example search keyword
spider.getData('getGoogleSearchResult', params).then(result => {
  console.log(result);
});
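Because a processer is just an async function of `(fetcher, params)`, it can be exercised without a network by passing in a stand-in fetcher. The sketch below assumes only what the example above shows: `fetcher.axios.get` resolves to an object with `status` and `data`. The names `searchProcesser` and `fakeFetcher` are hypothetical and not part of gz-spider.

```javascript
// Hypothetical processer, written as a standalone function so it can be tested.
async function searchProcesser(fetcher, params) {
  const url = `https://www.google.com/search?q=${encodeURIComponent(params)}`;
  const resp = await fetcher.axios.get(url);
  if (resp.status !== 200) {
    throw 'Retry'; // documented signal: run this processer again
  }
  return { url, length: resp.data.length };
}

// Stand-in fetcher (an assumption for testing only): records each requested
// URL and resolves with a canned response instead of touching the network.
function fakeFetcher(status, data) {
  const calls = [];
  return {
    calls,
    axios: {
      get: async (url) => {
        calls.push(url);
        return { status, data };
      },
    },
  };
}

// Example run against the stand-in fetcher.
searchProcesser(fakeFetcher(200, '<html></html>'), 'web spider')
  .then((result) => console.log(result));
```

The same function could then be registered under a key with `spider.setProcesser`, so the logic that is tested is the logic that runs.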
Config
The framework is divided into three components: fetcher, strategy, and processer.
Fetcher
spider.setFetcher({
  axiosTimeout: 5000,
  proxyTimeout: 180 * 1000,
  proxy() {
    // Async functions are supported, so you can fetch the proxy config from a remote service.
    return {
      host: '127.0.0.1',
      port: '9000'
    };
  }
});
- axiosTimeout: [Number] Per-request timeout, in ms
- proxyTimeout: [Number] When config.proxy is a [Function], the proxy function is re-run to get a new proxy host and port
- proxy: [Object | Function] When proxy is a [Function], async functions are supported, so you can get the proxy config from a remote service
- proxy.host: [String]
- proxy.port: [String]
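As one illustration of a function-valued `proxy`, the sketch below rotates through a fixed list of proxies and hands out the next entry each time it is called. The pool contents and the `createProxyRotator` helper are hypothetical; only the `{ host, port }` return shape comes from the example above.

```javascript
// Hypothetical helper: returns an async function that cycles through `proxies`.
function createProxyRotator(proxies) {
  if (!Array.isArray(proxies) || proxies.length === 0) {
    throw new Error('at least one proxy is required');
  }
  let index = 0;
  return async function proxy() {
    const current = proxies[index];
    index = (index + 1) % proxies.length; // wrap around at the end of the list
    // Return a copy so callers cannot mutate the pool.
    return { host: current.host, port: current.port };
  };
}

// Example pool (made-up addresses).
const rotatingProxy = createProxyRotator([
  { host: '10.0.0.1', port: '8000' },
  { host: '10.0.0.2', port: '8001' },
]);
```

Such a function could be passed as `spider.setFetcher({ proxy: rotatingProxy })`. How often it is invoked is decided by gz-spider, not by this helper.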
Strategy
spider.setStrategy({
retryTimes: 2
});
- retryTimes: [Number] Maximum number of retries for one task
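To make the interaction between `retryTimes` and the thrown signals concrete, here is a minimal sketch of a retry loop. It is not gz-spider's actual implementation: `runWithRetry` and `getFetcher` are hypothetical, and the sketch assumes that `retryTimes` counts retries after the first attempt, that exhausted retries end in `'Fail'`, and that any unrecognized error stops the task.

```javascript
// Illustrative only: a retry loop driven by the 'Retry' / 'ChangeProxy' / 'Fail' signals.
// getFetcher(changeProxy) is a hypothetical factory that returns a fetcher,
// on a fresh proxy when changeProxy is true.
async function runWithRetry(processer, getFetcher, params, { retryTimes = 2 } = {}) {
  let changeProxy = false;
  // One initial attempt plus up to `retryTimes` retries (an assumption).
  for (let attempt = 0; attempt <= retryTimes; attempt++) {
    const fetcher = await getFetcher(changeProxy);
    try {
      return await processer(fetcher, params);
    } catch (signal) {
      if (signal === 'Retry') {
        changeProxy = false; // try again with the same proxy
      } else if (signal === 'ChangeProxy') {
        changeProxy = true; // try again with a new proxy
      } else {
        throw signal; // 'Fail' or an unexpected error: stop immediately
      }
    }
  }
  throw 'Fail'; // retries exhausted
}
```

With `retryTimes: 2`, a processer that always throws `'Retry'` would run three times under these assumptions before the task is reported as failed.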
Work with task queue
Process
Get one task -> `spider.getData(processerKey, processerInput)` -> Complete task with processed data
Simulating a task queue with MySQL
- Create a table `spider-task` that includes at least the columns 'id', 'status', 'processer_key', 'processer_input', 'processer_output'
- Write an API to get one todo task (status = todo), for example GET /spider/task
- Write an API to update the table with the processed data, for example PUT /spider/task
const axios = require('axios');
const spider = require('gz-spider');
// Assumes your processers are already registered with spider.setProcesser()
async function main() {
  while (true) {
    // Get one task
    let resp = await axios.get('http://127.0.0.1:8080/spider/task');
    if (!resp.data.task) break;
    let { id, processerKey, processerInput } = resp.data.task;
    let processerOutput = await spider.getData(processerKey, processerInput);
    // Complete task with processed data
    await axios.put('http://127.0.0.1:8080/spider/task', {
      id, processerOutput,
      status: 'success'
    });
  }
}
main().catch(console.error);
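The two endpoints the loop above talks to only need to hand out one `todo` task and record a result. The sketch below models that contract in memory, which can be handy for trying the worker loop before a MySQL table exists. `createTaskStore`, `takeTodo`, and `complete` are hypothetical names, and the intermediate `doing` status is an assumption added so that the same task is not handed out twice.

```javascript
// In-memory stand-in for the spider-task table (illustrative, not part of gz-spider).
function createTaskStore(rows) {
  const tasks = rows.map((row, i) => ({
    id: i + 1,
    status: 'todo',
    processer_key: row.processerKey,
    processer_input: row.processerInput,
    processer_output: null,
  }));
  return {
    // What GET /spider/task might return: the next todo task, or null if none is left.
    takeTodo() {
      const task = tasks.find((t) => t.status === 'todo');
      if (!task) return null;
      task.status = 'doing'; // assumed status, prevents handing out the task twice
      return {
        id: task.id,
        processerKey: task.processer_key,
        processerInput: task.processer_input,
      };
    },
    // What PUT /spider/task might do: store the output and the final status.
    complete({ id, processerOutput, status }) {
      const task = tasks.find((t) => t.id === id);
      if (!task) return false;
      task.processer_output = processerOutput;
      task.status = status;
      return true;
    },
    // Snapshot of all rows, for inspection.
    all() {
      return tasks.map((t) => ({ ...t }));
    },
  };
}
```

An HTTP layer would wrap `takeTodo()` as `{ task: ... }` in the GET response and pass the PUT body straight to `complete()`.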