
@mkbkkj/spiderhole

v1.5.0

Published

With a set of specified rules, a simple configuration is all it takes to build a crawler.

Downloads

1

Readme

Crawler Engine

With a set of specified rules, a simple configuration is all it takes to build a crawler.

Usage

import Spiderhole from 'spiderhole';

// Create the crawler; config is an object of type Config (see "Configuration Rules" below)
const spider = new Spiderhole(config);
/**
 * Save to a file
 * - a .json extension saves as a JSON file
 * - any other extension saves as a line-based file (one record per line)
 */
await spider.save('path.json');

Configuration Rules

// Source information for the value of a single data field
interface DataRowFieldValue {
    type: 'cmd' | 'request';
    options: {
        url?: string;
        // operation performed before reading the value
        before?: string;
        constant?: Record<string, string>;
        value: string | DataRowFieldValue;
    }
}

interface Config {
    // the plugin library to use; the current version supports: cheerio
    use: "cheerio";
    // entry URL
    entry: string;
    // optional, predefined global constants
    constant?: {
        // each key is evaluated once; the cheerio in the execution environment is bound to the entry page
        [ p: string ]: string;
        /**
         * can be a simple string
         * baseURL: "'https://baidu.com'",
         * 
         * or an expression
         * tds: "cheerio('.zszc tr td.yes')",
        */
    },
    // optional, request settings
    request?: {
        // request headers
        headers?: Record<string, string>; 
    },
    /**
     * optional, response interceptors
     * 1. constants are not available at this point, because it is not yet known whether the response is valid
     * 2. contentType is available: the Content-Type of the response data
     * 3. response is available, of type AxiosResponse
     * 4. one interceptor is present by default:
     * {
     *    "when": "contentType !== 'text/html'",
     *    "error": "`invalid response content type: ${contentType}`"
     * }
    */
    response?: Array<{
        // condition expression
        when: string;
        // error message thrown when the condition holds
        error?: string;
    }>;

    // required, data-related settings
    data: {
        // iterator; when a page contains multiple records, use this to loop over them
        iterator: {
            [ p: string ]: string;
        },
        // configuration for each record
        row: {
            /**
             * each key matches a key in iterator,
             * describing how the data is processed while that iterator loops
            */
            [ p: string ]: {
                // condition deciding whether to read values at all
                when?: string;
                // field configuration produced by this iterator
                fields: Array<{
                    // field name
                    name: string;
                    // when to read the value
                    when?: string;
                    // default used when there is no value or the value is not read
                    default: any;
                    // constants
                    constant?: Record<string, string>;
                    // whether the field is required; if true, an error is thrown when there is no value; default: false
                    required?: boolean;
                    // operation performed before reading the value
                    before?: string;
                    // how the value is obtained
                    value: string | DataRowFieldValue;
                }>;
            }
        }
    },
    // events
    event?: {
        // condition for stopping the crawl
        stop?: string;
        // next-page configuration
        next?: {
            // delay before crawling the next page; an array means a random range
            sleep?: number | number[];
            // URL of the next page
            url: string;
        };
    };
}
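As a rough sketch, a minimal configuration following the interface above might look like this. All URLs, selectors, and expressions below are hypothetical and only illustrate the shape of a Config object, not a working crawl:

```typescript
// Hypothetical configuration: every URL, selector, and expression below
// is made up for illustration; only the structure follows the Config interface.
const config = {
  use: "cheerio",                      // the only plugin library supported in this version
  entry: "https://example.com/list",   // entry URL to start crawling (hypothetical)
  constant: {
    // a simple string constant: note the nested quotes, the value is an expression
    baseURL: "'https://example.com'",
  },
  request: {
    headers: { "User-Agent": "spiderhole-demo" },  // custom request headers
  },
  data: {
    // loop over each row of a hypothetical result table
    iterator: { rows: "cheerio('table tr')" },
    row: {
      // key matches the iterator key above
      rows: {
        fields: [
          {
            name: "title",   // field name in the output record
            default: "",     // used when no value is found
            value: "cheerio(item).find('td.title').text()",  // hypothetical expression
          },
        ],
      },
    },
  },
  event: {
    stop: "page > 10",       // hypothetical stop condition
    next: {
      sleep: [1000, 3000],   // random delay between 1 and 3 seconds
      url: "cheerio('.next').attr('href')",
    },
  },
};
```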

Plugin Specification

Call Flow

plugin.save => plugin.start => base._capture


spider.entry();
spider.exec();
spider.field();
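The call flow and the method list above suggest a plugin contract roughly like the following; none of these signatures are documented in the source, so every name and parameter here is an assumption for illustration only:

```typescript
// Hypothetical sketch of the plugin contract implied by the call flow above.
// The signatures are assumptions, not taken from the spiderhole source.
interface SpiderholePlugin {
  // user-facing entry point: save() delegates to start()
  save(path: string): Promise<void>;
  // start() drives the crawl and hands pages to the base class's _capture()
  start(): Promise<void>;
  // per-page hooks listed in the specification
  entry(): void;
  exec(): void;
  field(): void;
}

// A no-op implementation showing the contract compiles and chains as described
class NoopPlugin implements SpiderholePlugin {
  async save(path: string): Promise<void> {
    await this.start();  // save() delegates to start(), as in the call flow
  }
  async start(): Promise<void> {}
  entry(): void {}
  exec(): void {}
  field(): void {}
}
```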