floodesh
v0.8.19
Floodesh is a distributed web spider/crawler written with Node.js.
Floodesh
Floodesh is a middleware-based web spider written with Node.js. "Floodesh" is a combination of two words, flood and mesh.
Requirement
Gearman Server
Make sure g++, make, libboost-all-dev, gperf, libevent-dev and uuid-dev have been installed.
$ wget -O - https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz | tar xzvf -
$ cd gearmand-1.1.12
$ ./configure
$ make
$ make install
MongoDB
Quick start
Install scaffold
$ npm install -g floodesh-cli
Initialize
Generate a new app from templates with a single command.
$ mkdir demo
$ cd demo
$ floodesh-cli init # all necessary files will be generated in your directory.
Before use, make sure /data/tests and /var/log/bda/tests exist and that you have write access to them. You can customize the log path by modifying logBaseDir in config/[env]/index.js.
Context
A context instance is a kind of Finite-State Machine implemented with Generators, which are an ECMAScript 6 feature. Through the context, we can access almost all fields of the response and the request, like:
worker.use((ctx, next) => {
  ctx.content = ctx.body.toString(); // decode the response Buffer into a string
  return next();
});
Request
ctx.querystring
- <String>
Get querystring.
ctx.idempotent
- <Boolean>
Check if the request is idempotent.
ctx.search
- <String>
Get the search string. Unlike querystring, it includes the leading "?".
ctx.method
- <String>
Get request method.
ctx.query
- <Object>
Get parsed query-string.
ctx.path
- <String>
Get the request pathname.
ctx.url
- <String>
Return request url, the same as ctx.href.
ctx.origin
- <String>
Get the origin of URL, for instance, "https://www.google.com".
ctx.protocol
- <String>
Return the protocol string "http:" or "https:".
ctx.host
- <String>, hostname:port
Get the host from the "Host" header field. X-Forwarded-Host is supported when a proxy is enabled.
ctx.hostname
- <String>
Get the hostname from the "Host" header field. X-Forwarded-Host is supported when a proxy is enabled.
ctx.secure
- <Boolean>
Check if protocol is https.
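As a rough illustration of how the request fields above might be read inside a middleware, here is a minimal sketch. The ctx.summary field and the hand-written fakeCtx object are assumptions made for this example only; they are not part of floodesh.

```javascript
'use strict';

// Middleware sketch: summarize a few request-side fields on the context.
// `ctx.summary` is a hypothetical field used only for illustration.
const describeRequest = (ctx, next) => {
  ctx.summary = {
    target: ctx.origin + ctx.path, // origin plus pathname
    page: ctx.query.page || '1',   // value taken from the parsed query-string
    secure: ctx.secure,            // true when the protocol is https
  };
  return next();
};

// Hand-written stand-in for a floodesh context (assumption, dry run only).
const fakeCtx = {
  origin: 'https://example.com',
  path: '/search',
  querystring: 'q=floodesh&page=2',
  search: '?q=floodesh&page=2',
  query: { q: 'floodesh', page: '2' },
  secure: true,
};

describeRequest(fakeCtx, () => Promise.resolve());
console.log(fakeCtx.summary);
```

In a real spider the context is supplied by floodesh, so only the middleware function itself would be registered with worker.use.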
Response
ctx.status
- <Number>
Get status code from response.
ctx.message
- <String>
Get status message from response.
ctx.body
- <Buffer>
Get the response body as a Buffer.
ctx.length
- <Number>
Get length of response body.
ctx.type
- <String>
Get the response mime type, for instance, "text/html".
ctx.lastModifieds
- <Date>
Get the Last-Modified date in Date form, if it exists.
ctx.etag
- <String>
Get the ETag of a response.
ctx.header
- <Object>
Return the response header.
ctx.contentType
- <String>
ctx.get(key)
- key <String>
- Return: <String>
Get the value of a response header by key.
ctx.is(types)
- types <String>|<Array>
- Return: <String>|false|null
Check whether the incoming response contains the "Content-Type" header field and whether it matches any of the given mime types. If there is no response body, null is returned. If there is no content type, false is returned. Otherwise, it returns the first type that matches.
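The response helpers above could be combined in a middleware roughly as follows. This is a sketch under assumptions: ctx.html is a hypothetical field, and the get and is methods on fakeCtx are simplified stand-ins written for this example, not floodesh's own implementations.

```javascript
'use strict';

// Middleware sketch: decode the body only for successful HTML responses.
// `ctx.html` is a hypothetical field used only for illustration.
const decodeHtml = (ctx, next) => {
  // ctx.is() returns the matching type, or false/null when nothing matches.
  if (ctx.status === 200 && ctx.is('text/html')) {
    ctx.html = ctx.body.toString();
  }
  return next();
};

// Simplified stand-in for a floodesh context (assumption, dry run only).
const fakeCtx = {
  status: 200,
  body: Buffer.from('<html></html>'),
  header: { 'content-type': 'text/html; charset=utf-8' },
  get(key) {
    return this.header[key.toLowerCase()];
  },
  is(type) {
    const value = this.get('Content-Type');
    if (!value) return false;
    // Compare only the media type, ignoring parameters such as charset.
    return value.split(';')[0].trim() === type ? type : false;
  },
};

decodeHtml(fakeCtx, () => Promise.resolve());
console.log(fakeCtx.html);
```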
Other
ctx.tasks
- <Array>
Array of generated tasks. A task is an object that consists of Options and next, where next is the name of the function in your spider that should handle the next task. Supported format:
[{
opt:<Options>,
next:<String>
}]
ctx.dataSet
- <Map>
A map that stores results, which will be parsed and saved by floodesh.
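A spider step that fills ctx.tasks and ctx.dataSet might look like the sketch below. Several names are assumptions made for illustration: the uri key inside opt, the parseDetail function name, the items key and the shape of the stored record. The source only specifies the opt/next task format and that dataSet is a Map.

```javascript
'use strict';

// Spider-step sketch: queue one follow-up task and record one result.
// Assumptions: `uri` as an Options key, 'parseDetail' as a function name
// in the spider, and the 'items' key / record shape stored in dataSet.
const parseList = (ctx, next) => {
  ctx.tasks.push({
    opt: { uri: 'https://example.com/page/2' },
    next: 'parseDetail',
  });
  ctx.dataSet.set('items', [{ url: ctx.url, length: ctx.length }]);
  return next();
};

// Hand-written stand-in for a floodesh context (assumption, dry run only).
const fakeCtx = {
  url: 'https://example.com/',
  length: 1024,
  tasks: [],
  dataSet: new Map(),
};

parseList(fakeCtx, () => Promise.resolve());
console.log(fakeCtx.tasks, fakeCtx.dataSet.get('items'));
```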
Configuration
index
- retry <Integer>: Retry times at worker side, default 3
- logBaseDir <String>: Directory where project's log directory exists, default '/var/log/bda/'
- parsers <Array>: Array of parsers, which are file names in parser directory without '.js'
bottleneck
- defaultCfg <Object>
  - rate <Integer>: Number of milliseconds to delay between requests
  - concurrent <Integer>: Size of the worker pool
  - priorityRange <Integer>: Range of acceptable priorities starting from 0, default 3
  - defaultPriority <Integer>: Priority of the request
  - homogenous <Boolean>: true
downloader
- headers <Object>: HTTP headers
gearman
- jobs <Integer>: Max number of jobs per worker, default 1
- srvQueueSize <Integer>: Max number of jobs queued to gearman server, default 1000
- mongodb <String>: Mongodb Connection String URI
- worker <Object>:
  - servers <Array>: Array of server list, server should be an object like {'host':'gearman-server'}
- client <Object>:
  - servers <Array>: Same as above
  - loadBalancing <String>: 'RoundRobin'
- retry <Integer>: Retry times at client side
database
- mongodb <String>: Mongodb Connection String URI
logger
seenreq
- repo <String>: [redis|mongodb], memory is used as the repo by default
- removeKeys <Array>: Array of query string keys to skip when testing whether a url has been seen
service
- server <String>: Remote service origin
Error handling
Throw an Error in a synchronous middleware; in an asynchronous one, return a rejected Promise. err.stack will be logged and err.code will be sent to the client to be persisted.
// sync
module.exports = (ctx, next) => {
  // ... your logic here
  throw new Error('crash here');
};
// async
module.exports = (ctx, next) => {
  return new Promise((resolve, reject) => {
    // ... your logic here
    reject(new Error('got error'));
  });
};
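Since err.code is what gets sent to the client, it can help to set it explicitly. The sketch below uses an async function, where a throw becomes a rejected Promise. The 'EMPTY_BODY' code and the requireBody name are hypothetical, and the type of code that floodesh expects is not stated in the source.

```javascript
'use strict';

// Sketch: attach a `code` to the error before it leaves the middleware.
// 'EMPTY_BODY' is a hypothetical code chosen for this example.
const requireBody = async (ctx, next) => {
  if (!ctx.body || ctx.body.length === 0) {
    const err = new Error('empty response body');
    err.code = 'EMPTY_BODY';
    throw err; // inside an async function this becomes a rejected Promise
  }
  return next();
};

// Dry run with a hand-written context (assumption, not a floodesh object).
requireBody({ body: Buffer.alloc(0) }, () => Promise.resolve())
  .catch((err) => console.log(err.code, err.message));
```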
Diagram
Client
State diagram
Flow chart
Worker
Flow chart
Middlewares
- mof-cheerio: A simple wrapper of Cheerio.
- mof-charsetparser: Parse Charset in response headers.
- mof-iconv: Encoding converter middleware using iconv or iconv-lite.
- mof-request: A wrapper of Request.js, with some default options.
- mof-bottleneck: A wrapper of bottleneckp, which is an asynchronous rate limiter with priority.
- mof-proxy: Acquires a proxy from a proxy service.
- mof-whacko: A wrapper of whacko, which is a fork of cheerio that uses parse5 as an underlying platform.
- mof-statsd: A wrapper of statsd-client, which enables you to send metrics to a statsd daemon.
- mof-uarotate: Rotate User-Agent header automatically from a local file.
- mof-seenreq: Only makes sense in flowesh; a simple wrapper of seenreq.
- mof-validbody: Check if a response body meets a pattern, for instance, an HTML body should start with < and a JSON body with {.
- mof-statuscode: Status code detector.
- mof-genestamp: Prints gene and url of a task, along with # of new tasks and # of records.
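To give a feel for what a middleware of this kind does, here is a hand-written stand-in for the body check described under mof-validbody. It is not the real mof-validbody and does not use its API; the validBody name and the error message are assumptions, and only the "HTML starts with <, JSON starts with {" rule comes from the list above.

```javascript
'use strict';

// Stand-in sketch, NOT the real mof-validbody: reject bodies that do not
// start with the expected character for their mime type.
const validBody = (ctx, next) => {
  const first = ctx.body.toString().trimStart()[0];
  const expected = ctx.type === 'application/json' ? '{' : '<';
  if (first !== expected) {
    return Promise.reject(new Error('unexpected body for ' + ctx.type));
  }
  return next();
};

// Dry run with hand-written contexts (assumption, not floodesh objects).
validBody({ body: Buffer.from('<html></html>'), type: 'text/html' }, () => Promise.resolve())
  .then(() => console.log('html body accepted'));
validBody({ body: Buffer.from('oops'), type: 'application/json' }, () => Promise.resolve())
  .catch((err) => console.log(err.message));
```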