
araneida

v1.1.2 · Published · Downloads: 1

araneida

A simple page-scraping module that fetches the corresponding data according to the configuration you pass in.

Usage


const option = {
    links, // basic crawl settings
    callback, // callback for each scraped item
    done, // callback when all crawling finishes
    delay, // delay between crawls
    timeout, // request timeout
}

const spider = new Spider(option)

spider.start()

Defining crawl rules

The links setting

  • links
    • title — the name of this crawl target
    • url — the URL to crawl
    • encode — page encoding; defaults to utf8
    • rules — crawl rules
      • list — CSS selector for the list to crawl
      • key — field name for the collected data
      • rule — per-item fields within the list
        • url — the crawl link for this item
          • type — the attribute the url is read from
          • path — CSS selector, same format as the list setting
        • title — the item's title
          • type — the attribute the title is read from
          • path — CSS selector
      • links — sub-page crawl rules, configured like the level above; when url is not set, the url from the parent rule is used as the sub-page crawl path
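The fields above can be put together into a configuration object. A hypothetical example follows; the URL, CSS selectors, and field values are placeholders for illustration, not from the library's own documentation:

```javascript
// Hypothetical links configuration illustrating the fields described above.
// The target URL and CSS selectors are made up for illustration.
const links = [
  {
    title: 'blog-list',                 // name of this crawl target
    url: 'https://example.com/blogs',   // page to fetch
    encode: 'utf8',                     // page encoding (the default)
    rules: {
      list: '.blog-item',               // CSS selector for each list entry
      key: 'blogs',                     // field name for the collected data
      rule: {
        url:   { type: 'href', path: 'a.title' },  // attribute + selector for the link
        title: { type: 'text', path: 'a.title' },  // attribute + selector for the title
      },
      // links: { ... }  // optional sub-page rules; when url is omitted here,
      //                 // the url from the rule above is crawled instead
    },
  },
]
```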

callback

The callback receives the data scraped back from sub-pages:

option.callback = function (data) {
    // data is the scraped result
    // save or process it here
}

done

Callback fired when a single page has finished crawling; it receives that page's data:

option.done = function (data) {
  // process the single page's data
}

allPageDone

Callback fired when all paginated pages have finished:

option.allPageDone = function () {
  // all pages have been crawled
}

page

If you need to crawl paginated results, provide a function that returns the next page URL to crawl. For example, you can use a closure to return such a function:

option.page = (function () {
    let index = 2
    let result
    return function () {
        if (index > 5) {
            result = false
        } else {
            result = `https://segmentfault.com/blogs?page=${index}`
        }
        index++
        return result
    }
})()

Internally, the return value of option.page decides whether crawling continues. In the example above:

option.page() // https://segmentfault.com/blogs?page=2
option.page() // https://segmentfault.com/blogs?page=3
option.page() // https://segmentfault.com/blogs?page=4
option.page() // https://segmentfault.com/blogs?page=5
option.page() // false
option.page() // false

When it gets false, crawling stops and the allPageDone event fires.
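This stop condition can be sketched as a simple loop that drains option.page() until it returns false. This is a minimal illustration of the behaviour described above, not the library's actual internals:

```javascript
// Sketch: consume option.page() until it returns false,
// collecting the page URLs along the way.
const option = {}
option.page = (function () {
    let index = 2
    let result
    return function () {
        if (index > 5) {
            result = false
        } else {
            result = `https://segmentfault.com/blogs?page=${index}`
        }
        index++
        return result
    }
})()

const urls = []
let next
while ((next = option.page()) !== false) {
    urls.push(next) // pages 2 through 5
}
// at this point the crawler would fire allPageDone
```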

delay

The interval to wait before moving on to the next page.
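Putting the pieces together, a full option object might look like the following. The delay and timeout units are assumed to be milliseconds; the README does not state them explicitly, and the values here are placeholders:

```javascript
// Hypothetical full option object; delay/timeout units are assumed (ms).
const option = {
    links: [],               // crawl targets (see the links setting above)
    delay: 1000,             // wait between page visits (assumed ms)
    timeout: 5000,           // request timeout (assumed ms)
    callback (data) {},      // per-item callback
    done (data) {},          // per-page callback
    allPageDone () {},       // fired after the last page
}
```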

Getting started and examples


# install dependencies
npm i

# run the example
npm run example