floodesh

v0.8.19

Floodesh is a distributed web spider/crawler written with Nodejs.

Floodesh

Floodesh is a middleware-based web spider written with Node.js. "Floodesh" is a combination of two words, flood and mesh.

Requirement

Gearman Server

Make sure g++, make, libboost-all-dev, gperf, libevent-dev and uuid-dev have been installed.

$ wget -qO- https://launchpad.net/gearmand/1.2/1.1.12/+download/gearmand-1.1.12.tar.gz | tar xzf -
$ cd gearmand-1.1.12
$ ./configure
$ make
$ make install

MongoDB

Quick start

Install scaffold

$ npm install -g floodesh-cli

Initialize

Generate a new app from templates with a single command.

$ mkdir demo
$ cd demo
$ floodesh-cli init # all necessary files will be generated in your directory.

Make sure /data/tests and /var/log/bda/tests exist and are writable before use. You can customize the log path by modifying logBaseDir in config/[env]/index.js.

Context

A context instance is a kind of finite-state machine implemented with generators, an ECMAScript 6 feature. Through the context, you can access almost all fields of the request and response, for example:

worker.use( (ctx,next) => {
    ctx.content = ctx.body.toString(); // decode the body Buffer to a string
    return next();
})
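
To make the (ctx, next) calling convention concrete, here is a minimal sketch of how a chain of such middlewares could be run in order. The compose function and fakeCtx object are illustrative stand-ins written for this example; they are not floodesh's actual implementation, which may differ.

```javascript
// Minimal middleware runner, for illustration only (not floodesh's code).
function compose(middlewares) {
    return function run(ctx) {
        let lastIndex = -1;
        function dispatch(i) {
            if (i <= lastIndex) {
                return Promise.reject(new Error('next() called multiple times'));
            }
            lastIndex = i;
            const fn = middlewares[i];
            if (!fn) return Promise.resolve(); // end of the chain
            try {
                // Each middleware receives a next() that runs the following one.
                return Promise.resolve(fn(ctx, () => dispatch(i + 1)));
            } catch (err) {
                return Promise.reject(err); // a synchronous throw becomes a rejection
            }
        }
        return dispatch(0);
    };
}

// A fake context standing in for what a worker would provide.
const fakeCtx = { body: Buffer.from('<html>hello</html>') };

const runChain = compose([
    (ctx, next) => {
        ctx.content = ctx.body.toString();
        return next();
    },
    (ctx, next) => {
        ctx.isHtml = ctx.content.startsWith('<');
        return next();
    },
]);

const chainDone = runChain(fakeCtx);
```

Each middleware decides whether to continue by calling next(); returning without calling it stops the chain at that point.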

Request

ctx.querystring

  • <String>

Get querystring.

ctx.idempotent

  • <Boolean>

Check if the request is idempotent.

ctx.search

  • <String>

Get the search string. Unlike querystring, it includes the leading "?".

ctx.method

  • <String>

Get request method.

ctx.query

  • <Object>

Get parsed query-string.

ctx.path

  • <String>

Get the request pathname

ctx.url

  • <String>

Return request url, the same as ctx.href.

ctx.origin

  • <String>

Get the origin of URL, for instance, "https://www.google.com".

ctx.protocol

  • <String>

Return the protocol string "http:" or "https:".

ctx.host

  • <String>, hostname:port

Parse the "Host" header field host and support X-Forwarded-Host when a proxy is enabled.

ctx.hostname

  • <String>

Parse the "Host" header field hostname and support X-Forwarded-Host when a proxy is enabled.

ctx.secure

  • <Boolean>

Check if protocol is https.
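
As a rough guide to how these request fields relate to each other, the sketch below derives similarly named values from a URL string using Node's built-in URL class. The helper describeRequestUrl is hypothetical, and it ignores the proxy and X-Forwarded-Host handling mentioned above, so treat it as an approximation of the field meanings rather than floodesh's own logic.

```javascript
// Hypothetical helper: derive request-style fields from a URL string.
function describeRequestUrl(href) {
    const u = new URL(href);
    return {
        url: u.href,
        origin: u.origin,
        protocol: u.protocol,                      // includes the trailing ":"
        host: u.host,                              // hostname:port
        hostname: u.hostname,
        path: u.pathname,
        search: u.search,                          // with the leading "?"
        querystring: u.search.replace(/^\?/, ''),  // without the leading "?"
        query: Object.fromEntries(u.searchParams), // parsed query string
        secure: u.protocol === 'https:',
    };
}

const fields = describeRequestUrl('https://www.google.com:8443/search?q=floodesh&page=2');
```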

Response

ctx.status

  • <Number>

Get status code from response.

ctx.message

  • <String>

Get status message from response.

ctx.body

  • <Buffer>

Get the response body in Buffer.

ctx.length

  • <Number>

Get length of response body.

ctx.type

  • <String>

Get the response mime type, for instance, "text/html"

ctx.lastModifieds

  • <Date>

Get the Last-Modified date in Date form, if it exists.

ctx.etag

  • <String>

Get the ETag of a response.

ctx.header

  • <Object>

Return the response header.

ctx.contentType

  • <String>

ctx.get(key)

  • key <String>
  • Return: <String>

Get value by key in response headers

ctx.is(types)

  • types <String>|<Array>
  • Return: <String>|false|null

Check if the incoming response contains the "Content-Type" header field, and whether it matches any of the given mime types. If there is no response body, null is returned. If there is no content type, false is returned. Otherwise, it returns the first type that matches.
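
The three return cases can be sketched as a small standalone function. matchContentType is a hypothetical stand-in, not the library's code: it only compares full mime types, and returning false when nothing matches is an assumption, since the description above does not cover that case.

```javascript
// Hypothetical stand-in for the matching rules described for ctx.is(types).
function matchContentType(contentType, hasBody, types) {
    if (!hasBody) return null;        // no response body
    if (!contentType) return false;   // no Content-Type header
    // Drop parameters such as "; charset=utf-8" and normalise case.
    const actual = contentType.split(';')[0].trim().toLowerCase();
    const candidates = Array.isArray(types) ? types : [types];
    for (const candidate of candidates) {
        if (candidate.toLowerCase() === actual) return candidate;
    }
    return false;                     // assumption: nothing matched
}

const matched = matchContentType('text/html; charset=utf-8', true, ['application/json', 'text/html']);
const noBody = matchContentType('text/html', false, 'text/html');
const noType = matchContentType(undefined, true, 'text/html');
```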

Other

ctx.tasks

  • <Array>

Array of generated tasks. A task is an object consisting of opt and next: opt holds the request Options, and next is the name of the function in your spider to call for the next task. Supported format:

[{
    opt:<Options>,
    next:<String>
}]

ctx.dataSet

  • <Map>

A map to store results, which will be parsed and saved by floodesh.
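
A parser step might use ctx.tasks and ctx.dataSet together, roughly as in the sketch below. The demoCtx object is a stand-in for a real context, and the option name uri, the function name parseItem, and the key and value shape stored in dataSet are all assumptions made for illustration.

```javascript
// Stand-in for the context a worker would pass to a parser middleware.
const demoCtx = { tasks: [], dataSet: new Map(), url: 'https://example.com/list' };

// Hypothetical parser step: queue one follow-up task and store one result.
function parseList(ctx, next) {
    ctx.tasks.push({
        opt: { uri: 'https://example.com/item/1' }, // "uri" is an assumed option name
        next: 'parseItem',                          // hypothetical function name in the spider
    });
    // The key and value shape are assumptions; use whatever your storage expects.
    ctx.dataSet.set('items', [{ source: ctx.url, count: 1 }]);
    return next();
}

parseList(demoCtx, () => Promise.resolve());
```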

Configuration

index

  • retry <Integer>: Retry times at worker side, default 3
  • logBaseDir <String>: Directory where project's log directory exists, default '/var/log/bda/'
  • parsers <Array>: Array of parsers, which are file names in parser directory without '.js'
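
For orientation, a config/[env]/index.js built from the options above might look like the following sketch. The parser names are hypothetical, and the generated template may contain additional fields not shown here.

```javascript
// Illustrative values for config/[env]/index.js; adjust to your project.
const indexConfig = {
    retry: 3,                     // default, per the list above
    logBaseDir: '/var/log/bda/',  // default, per the list above
    parsers: ['home', 'detail'],  // hypothetical parser file names, without '.js'
};

module.exports = indexConfig;
```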

bottleneck

  • defaultCfg <Object>
    • rate <Integer>: Number of milliseconds to delay between requests
    • concurrent <Integer>: Size of the worker pool
    • priorityRange <Integer>: Range of acceptable priorities starting from 0, default 3
    • defaultPriority <Integer>: priority of the request
    • homogenous <Boolean>:true

downloader

gearman

  • jobs <Integer>: Max number of jobs per worker, default 1
  • srvQueueSize <Integer>: Max number of jobs queued to gearman server, default 1000
  • mongodb <String>: Mongodb Connection String URI,
  • worker <Object>:
    • servers <Array>: Array of server list, server should be an object like {'host':'gearman-server'}
  • client <Object>:
    • servers <Array>: Same as above,
    • loadBalancing <String>: 'RoundRobin'
  • retry <Integer>: Retry times at client side
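
The nesting of the gearman options is easier to see written out. This is a sketch only: the MongoDB URI and the retry value are placeholders, and the exact file layout in a generated project is not confirmed here.

```javascript
// Illustrative gearman settings; values marked as placeholders are assumptions.
const gearmanConfig = {
    jobs: 1,                                        // default, per the list above
    srvQueueSize: 1000,                             // default, per the list above
    mongodb: 'mongodb://localhost:27017/floodesh',  // placeholder connection string
    worker: {
        servers: [{ host: 'gearman-server' }],
    },
    client: {
        servers: [{ host: 'gearman-server' }],      // same shape as worker.servers
        loadBalancing: 'RoundRobin',
    },
    retry: 3,                                       // placeholder; no default is documented
};

module.exports = gearmanConfig;
```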

database

logger

seenreq

  • repo <String>: [redis|mongodb]; memory is used as the repo by default.
  • removeKeys <Array>: Array of query-string keys to ignore when testing whether a URL has been seen
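
The idea behind removeKeys can be sketched with an in-memory set: strip the ignored query keys, then compare what is left. The functions normalizeForSeen and isSeen are hypothetical, and sorting the remaining parameters is an extra assumption; seenreq's real normalization may differ.

```javascript
// Hypothetical normaliser: drop ignored query keys before comparing URLs.
function normalizeForSeen(href, removeKeys) {
    const u = new URL(href);
    for (const key of removeKeys) u.searchParams.delete(key);
    u.searchParams.sort(); // assumption: parameter order should not matter
    return u.href;
}

const seen = new Set(); // stands in for the default in-memory repo

function isSeen(href, removeKeys) {
    const key = normalizeForSeen(href, removeKeys);
    if (seen.has(key)) return true;
    seen.add(key);
    return false;
}

// Both URLs differ only in the ignored "session" key and in parameter order.
const firstVisit = isSeen('https://example.com/p?id=7&session=abc', ['session']);
const secondVisit = isSeen('https://example.com/p?session=xyz&id=7', ['session']);
```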

service

  • server <String>: Remote service origin

Error handling

Throw an Error in a synchronous middleware; in an asynchronous one, return a rejected Promise. err.stack will be logged, and err.code will be sent to the client to persist.

// sync
module.exports = (ctx, next) => {
    // ... do some work ...
    throw new Error('crash here');
}

// async
module.exports = (ctx, next) => {
    return new Promise( (resolve, reject) => {
        // ... do some work ...
        reject(new Error('got error'));
    });
}
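
Since err.code is what gets sent to the client, it can help to set it explicitly. The sketch below uses an async function, which yields a rejected Promise when it throws. The helper codedError and the code value 'EBADSTATUS' are hypothetical; which codes floodesh expects is not specified here.

```javascript
// Hypothetical helper: create an Error that carries a code.
function codedError(message, code) {
    const err = new Error(message);
    err.code = code;
    return err;
}

// An async function that throws produces a rejected Promise.
const checkStatus = async (ctx, next) => {
    if (ctx.status !== 200) {
        throw codedError(`unexpected status ${ctx.status}`, 'EBADSTATUS');
    }
    return next();
};
```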

Diagram

Client

State diagram

[Image: floodesh client state]

Flow chart

[Image: floodesh client flow]

Worker

Flow chart

[Image: floodesh worker flow]

Middlewares

  • mof-cheerio: A simple wrapper of Cheerio.
  • mof-charsetparser: Parse Charset in response headers.
  • mof-iconv: Encoding converter middleware using iconv or iconv-lite.
  • mof-request: A wrapper of Request.js, with some default options.
  • mof-bottleneck: A wrapper of bottleneckp, an asynchronous rate limiter with priority.
  • mof-proxy: Acquires a proxy from a proxy service.
  • mof-whacko: A wrapper of whacko, which is a fork of cheerio that uses parse5 as an underlying platform.
  • mof-statsd: A wrapper of statsd-client, which lets you send metrics to a statsd daemon.
  • mof-uarotate: Rotate User-Agent header automatically from a local file.
  • mof-seenreq: Only makes sense in floodesh; a simple wrapper of seenreq.
  • mof-validbody: Check whether a response body matches a pattern; for instance, an HTML body should start with < and a JSON body with {.
  • mof-statuscode: Status code detector.
  • mof-genestamp: Prints gene and url of a task, along with # of new tasks and # of records.
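
As an illustration of the kind of check mof-validbody describes, here is a minimal sketch. The function looksValid is hypothetical and is not that module's API; trimming leading whitespace before the check is an assumption.

```javascript
// Hypothetical body check, mirroring the "<" and "{" rules described above.
function looksValid(body, type) {
    const text = body.toString().trimStart(); // assumption: ignore leading whitespace
    if (type === 'text/html') return text.startsWith('<');
    if (type === 'application/json') return text.startsWith('{');
    return true; // this sketch has no rule for other types
}

const htmlOk = looksValid(Buffer.from('<!doctype html><html></html>'), 'text/html');
const jsonBad = looksValid(Buffer.from('Access denied'), 'application/json');
```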