Light Crawler - Directed Crawler

A simplified directed web crawler, easy to use for scraping pages and downloading resources.

English doc (this page) or Chinese doc (中文文档).

Install

npm install light-crawler

Example

const Crawler = require('light-crawler');
// create an instance of Crawler
let c = new Crawler();
// add a url to request (use addTasks for an array)
c.addTask('http://www.xxx.com');
// define a scraping rule
c.addRule(function (result) {
	// result has 2 props: task and body
	// result.task: id, url, and anything else you added
	// result.body is the HTML of the page
	// scrape result.body, e.g. with cheerio
});
// start your crawler
c.start().then(() => {
	console.log('Finished!');
});

Crawler Property

In light-crawler, requesting a page is called a task. Tasks are put into the task-pool and executed in order.

  • settings: basic settings of the crawler

    • id: id of the crawler, integer or string, default: null

    • interval: crawling interval, default: 0 (ms), or a random value in a range, e.g. [200, 500]

    • retry: retry times, default: 3

    • concurrency: an integer determining how many tasks may run in parallel, default: 1

    • skipDuplicates: whether to skip duplicate tasks (same url), default: false

    • requestOpts: request options for tasks; these are the global request options

      • timeout: default: 10000
      • proxy: proxy address
      • headers: headers of the request, default: {}
      • or any other request option
  • taskCounter: counts all finished tasks, whether they failed or not

  • failCounter: counts all failed tasks

  • doneCounter: counts tasks that completed successfully

  • started: boolean

  • finished: boolean

  • errLog: records all error information during crawling

  • downloadDir: directory where downloaded files are saved, default: ../__dirname

  • drainAwait: the crawler finishes when the task-pool is drained; this prop makes the crawler wait the given time for new tasks after the pool is drained, default: 0 (ms)

  • tasksSize: size of the task-pool; tasks exceeding this size wait in the task-pool's buffer, default: 50

  • logger: show the console log, default: false
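
For example, a minimal sketch (with a placeholder URL) that reads a few of these properties after a crawl finishes:

const Crawler = require('light-crawler');

let c = new Crawler({ logger: true, skipDuplicates: true });
c.addTask('http://www.xxx.com');
c.addRule(function (result) {
	// scrape result.body
});
c.start().then(() => {
	// the counters and the error log are plain properties on the crawler
	console.log('finished tasks:', c.taskCounter);
	console.log('failed tasks:', c.failCounter);
	console.log('errors:', c.errLog);
});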

Crawler API

  • Crawler(opts: object)

constructor of Crawler

// e.g.:
let c = new Crawler({
	interval: 1000,
	retry: 5,
	.... // other props of `crawler.settings`
	requestOpts: {
		timeout: 5000,
		proxy: 'http://xxx'
		.... // other props of `crawler.requestOpts`
	}
});
  • tweak(opts: object)

tweak the crawler's settings
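
A minimal sketch adjusting a couple of the settings listed above (the values shown are arbitrary):

// e.g.:
c.tweak({
	interval: [200, 500],
	retry: 5
});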

  • addTasks(urls: string or array[, props: object])

add tasks into the task-pool

// e.g.

// add single task

// input: url
c.addTask('http://www.google.com');

// input: url, prop
// set request options for the task (overrides the global options)
c.addTask('http://www.google.com', {
	name: 'google',
	requestOpts: { timeout: 1 }
});

// input: url, next (the task's processor)
// crawler rules will not process this task again
c.addTask('http://www.google.com', function (result) {
	console.log('the task has done');
});

// input: url, prop, next
c.addTask('http://www.google.com', { name: 'google' }, function (result) {
	console.log('the task has done');
});

// or input an object
c.addTask({
	url: 'http://www.google.com',
	type: 'SE',
	next: function (result) {
		console.log('the task has done');
	}
});

// add multiple tasks

// input: an array of strings
c.addTasks(['http://www.google.com','http://www.yahoo.com']);

// add props to the tasks
c.addTasks(['http://www.google.com','http://www.yahoo.com'], { type: 'SE' });
// read these props in the processing function
c.addRule(function (result) {
	if (result.task.type == 'SE') {
		console.log('Searching Engine');
	}
});

// input: an array of objects
c.addTasks([
	{
		url: 'http://www.google.com',
		name: 'google'
	},
	{
		url: 'http://www.sohu.com',
		name: 'sohu'
	}
]);
  • addRule(reg: string|object, func: function)

define a rule for scraping

// e.g.:
let tasks = [
	'http://www.google.com/123', 
	'http://www.google.com/2546', 
	'http://www.google.com/info/foo',
	'http://www.google.com/info/123abc'
];
c.addTasks(tasks);
c.addRule('http://www.google.com/[0-9]*', function (result) {
	// matches tasks[0] and tasks[1]
});
c.addRule('http://www.google.com/info/**', function (result) {
	// matches tasks[2] and tasks[3]
});
// or you can omit the rule string
c.addRule(function (result) {
	// matches every url in tasks
});

// $ (i.e. cheerio.load(result.body)) is an optional second arg
c.addRule(function (result, $){
    console.log($('title').text());
});

Tip: light-crawler escapes every . in a rule string, so you can write www.a.com directly instead of www\\.a\\.com. If you need .*, use ** instead, as in the example above. If you really need the regex . (match any single character), write <.>.
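
A short sketch of those escaping rules (the URLs are placeholders):

// '.' in a rule string is treated as a literal dot
c.addRule('http://www.a.com/list', function (result) { /* ... */ });
// '**' stands for '.*'
c.addRule('http://www.a.com/item/**', function (result) { /* ... */ });
// '<.>' stands for the regex '.' (any single character)
c.addRule('http://www.a.com/page<.>', function (result) { /* ... */ });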

  • start()

start the crawler

// e.g.:
c.start().then(function () {
	// on finished
	console.log('done!');
});
  • pause()

pause the crawler

  • resume()

resume the crawler

  • isPaused()

whether the crawler is paused or not

  • stop()

stop the crawler
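
A minimal sketch combining these controls (the delay value is arbitrary):

// e.g.:
c.start().then(() => {
	console.log('done!');
});
// pause the crawler, check its state, and resume it later
c.pause();
console.log(c.isPaused()); // true
setTimeout(() => {
	c.resume();
}, 5000);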

  • uniqTasks()

remove duplicate tasks (deep comparison)

  • log(info: string, isErr: boolean, type: int)

crawler's logger

// e.g.:
// if it's an error, it is appended to c.errLog
c.log('some problems', true);
// console prints:
// [c.settings.id if set]some problems

// type is the color code for the leading '[...]', e.g. '[Crawler is Finished]'
// 1 red, 2 green, 3 yellow, 4 blue, 5 magenta, 6 cyan, and so on
c.log('[Parsed]blahblah~', false, 4);
// console prints:
// [c.settings.id if set][Parsed]blahblah~ ([Parsed] will be blue)

// you can do something after every log() call
const fs = require('fs');
c.on('afterLog', function (info, isErr, type) {
	fs.appendFileSync('c.log', info); // append info to c.log
	....
});

// you can even replace log() entirely
c.log = function (info, isErr, type) {
	// log something....
};

Download Files

just add downloadTask: true to any task whose resource you need to download

// e.g.:
// specify download directory
c.tweak({ downloadDir: 'D:\\yyy' });

let file = 'http://xxx/abc.jpg';
// 'abc.jpg' will be downloaded into 'D:\\yyy'
c.addTask(file, {downloadTask: true});
// or you can specify its name
c.addTask(file, {downloadTask: true, downloadFile: 'mine.jpg'});
// or specify a relative dir (relative to 'D:\\yyy')
// if this directory ('jpg') doesn't exist, the crawler will create it
c.addTask(file, {downloadTask: true, downloadFile: 'jpg/mine.jpg'});
// or specify an absolute dir
c.addTask(file, {downloadTask: true, downloadFile: 'C:\\pics\\mine.jpg'});

Events

  • start

after the crawler is started

// e.g.
c.on('start', function () {
    console.log('started!');
});
  • beforeCrawl

fired before a task is requested; the task's props include id, url, retry, working, requestOpts, downloadTask, downloadFile, and so on

// e.g.
c.on('beforeCrawl', function (task) {
    console.log(task);
});
  • drain

when the task-pool and its buffer are drained

// e.g.
c.on('drain', function () {
    // do something
});
  • error
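
No handler signature is shown for the error event above, so the arguments in this sketch are an assumption (an error object, and possibly the failing task):

// e.g. (handler arguments are assumed, not documented above)
c.on('error', function (err, task) {
	console.error('crawl error:', err);
});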

Utils API

  • getLinks(html: string, baseUrl: string)

get all links in the given HTML (or cheerio element), resolved against baseUrl

// e.g.:
let html = `
  <div>
	<ul>
		<li>
            <a href="http://link.com/a/1">1</a>
            <a href="a/2">2</a>
            <a href="b/3">3</a>
		</li>
		<li><a href="4">4</a></li>
		<li>foo</li>
	</ul>
</div>
`;
let links = Crawler.getLinks(html, 'http://link.com/index.html');
console.log(links);
// ['http://link.com/a/1','http://link.com/a/2','http://link.com/b/3','http://link.com/4']

// you can also pass a cheerio element instead of an html string
const cheerio = require('cheerio');
let $ = cheerio.load(html);
links = Crawler.getLinks($('ul'));
  • getImages(html: string, baseUrl: string)

like getLinks, but collects the src of each <img>.
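
By analogy with getLinks, a minimal sketch; the HTML and the expected output are illustrative assumptions:

// e.g.:
let imgHtml = '<div><img src="http://link.com/a.jpg"><img src="pics/b.png"></div>';
let images = Crawler.getImages(imgHtml, 'http://link.com/index.html');
console.log(images);
// expected (assumption): ['http://link.com/a.jpg', 'http://link.com/pics/b.png']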

  • loadHeaders(file: string)

load request headers from a file such as example.headers:

Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-CN,zh;q=0.8,en;q=0.6
Cache-Control:max-age=0
Connection:keep-alive
Cookie:csrftoken=Wwb44iw
Host:abc
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64)
...

load this file and set the headers for requests:

let headers = Crawler.loadHeaders('example.headers');
c.tweak({
	requestOpts: {
		headers: headers
	}
});
  • getRegWithPath(fromUrl: string)

get a rule reg string based on the path of fromUrl

let reg = Crawler.getRegWithPath('http://www.google.com/test/something.html');
// reg: http://www.google.com/test/**

Advanced Usage

  • addRule
// since 1.5.10, a scraping rule can also be an object
c.addTask('http://www.baidu.com', { name: 'baidu', type: 'S.E.' });
c.addTask('http://www.google.com', { name: 'google', type: 'S.E.' });
// the following rules have the same reg string but different names
c.addRule({ reg: 'www.**.com', name: 'baidu' }, function (r) {
    // scraping r.body
});
c.addRule({ reg: 'www.**.com', name: 'google' }, function (r) {
    // scraping r.body
});

// a match function allows more complex rules
// boolean match(task)
c.addTask('http://www.baidu.com', { tag: 3 });
c.addTask('http://www.google.com', { tag: 50 });
c.addRule({ reg: 'www.**.com', match: function (task) {
		return task.tag > 10;
}}, function (r) {
    // only the google task matches here (tag > 10)
});
  • loadRule

reuse rules across crawlers

// lc-rules.js
exports.crawlingGoogle = {
    reg: 'www.**.com',
    name: 'google',
    scrape: function (r, $) {
        // ...
    }
};

// crawler.js
const rules = require('./lc-rules');
let c = new Crawler();
c.addTask('http://www.google.com', { name: 'google' });
c.loadRule(rules.crawlingGoogle);

// or extend the 'scrape' function with an extra 'expand' callback
// the 'expand' implementation is passed as the second argument of 'loadRule'
// note: 'this' refers to the Crawler inside 'addRule' or 'loadRule' callbacks
crawlingGoogle = {
    // ...
    scrape: function (r, $, expand) {
        expand($('title').text());
    }
};

crawlerAAA.loadRule(crawlingGoogle, function (text) {
    console.log(text);
    this.addTask('www.abc.com');
});

crawlerBBB.loadRule(crawlingGoogle, function (text) {
    console.log(text.toLowerCase());
});
  • removeRule

remove some rules

// by its 'ruleName'
let rule = {
    // ...
    ruleName: 'someone'
    // ...
}
c.loadRule(rule);
c.removeRule('someone');