@wtto00/spider-crawler

v0.2.3-beta

spider-crawler

Define crawler rules in JSON format, and Node.js crawls the required content according to those rules.

Usage

import { crawlFromUrl, crawlFromJson, crawlFromHtml } from '@wtto00/spider-crawler';

crawlFromUrl(urlOptions).then((res) => {
  console.log(res);
});
crawlFromJson(jsonOptions).then((res) => {
  console.log(res);
});
crawlFromHtml(htmlOptions).then((res) => {
  console.log(res);
});

Examples

crawlFromUrl

const options = {
  url: 'https://marketplace.visualstudio.com/items?itemName=Orta.vscode-jest',
  rules: {
    name: {
      selector: '.ux-item-name',
      handlers: [{ method: 'text' }, { method: 'trim' }],
    },
    author: {
      selector: '.ux-item-publisher',
      handlers: [{ method: 'text' }, { method: 'trim' }],
    },
    installs: {
      selector: '.installs-text',
      handlers: [
        { method: 'text' },
        { method: 'substring', args: [0, -9] },
        { method: 'trim' },
        { method: 'replace', args: [',', ''] },
        { method: 'number' },
      ],
    },
    tags: {
      selector: '.meta-data-list-link',
      handlers: [
        {
          method: 'map',
          args: [
            {
              text: { handlers: [{ method: 'text' }] },
              link: { handlers: [{ method: 'attr', args: ['href'] }, { method: 'resolveUrl' }] },
            },
          ],
        },
      ],
    },
  },
};

crawlFromUrl(options).then((res) => {
  console.log(res);
});

// {"code":0,"message":"success","data":{"name":"Jest","author":"Orta","installs":1148080,"tags":null}}

crawlFromJson

const options = {
  json: JSON.stringify({ test: '1' }),
  rules: {
    test: {
      selector: 'test',
      handlers: [{ method: 'number' }],
    },
  },
};
// crawlFromJson returns a promise (see Usage), so await the result.
const res = await crawlFromJson(options);
// {"code":0,"message":"success","data":{"test":1}}

crawlFromHtml

const options = {
  html: '<p class="test">content</p>',
  rules: {
    content: {
      selector: '.test',
      handlers: [{ method: 'text' }],
    },
  },
};
// crawlFromHtml returns a promise (see Usage), so await the result.
const res = await crawlFromHtml(options);
// {"code":0,"message":"success","data":{"content":"content"}}

CrawlFromJson Options

| Field | Type   | Notes                                  |
| ----- | ------ | -------------------------------------- |
| json  | string | JSON string                            |
| rules | Rules  | Value extraction and processing rules  |

CrawlFromHtml Options

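| Field   | Type   | Required | Notes                                                          |
| ------- | ------ | -------- | -------------------------------------------------------------- |
| baseUrl | string | No       | Base URL used when processing certain URL attributes in the HTML |
| html    | string | Yes      | HTML string                                                    |
| rules   | Rules  | Yes      | Value extraction and processing rules                          |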
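`baseUrl` matters when a rule applies the `resolveUrl` handler to HTML that the crawler did not fetch itself. As a rough illustration of what resolving against a base URL means, the sketch below uses the standard WHATWG `URL` constructor. Whether the package resolves URLs in exactly this way is an assumption; `resolveAgainst` and the example URLs are hypothetical.

```typescript
// Hypothetical helper (not the package's own code): resolve a possibly
// relative href against a base URL using the standard URL constructor.
function resolveAgainst(href: string, baseUrl: string): string {
  return new URL(href, baseUrl).href;
}

const baseUrl = 'https://example.com/docs/index.html';

// A root-relative path replaces the whole path of the base URL.
const absolutePath = resolveAgainst('/items?id=1', baseUrl);
// A plain relative path is resolved against the base URL's directory.
const relativePath = resolveAgainst('guide.html', baseUrl);
// An href that is already absolute is returned unchanged.
const alreadyAbsolute = resolveAgainst('https://other.example/x', baseUrl);

console.log(absolutePath);    // https://example.com/items?id=1
console.log(relativePath);    // https://example.com/docs/guide.html
console.log(alreadyAbsolute); // https://other.example/x
```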

CrawlFromUrl Options

| Field   | Type        | Notes              |
| ------- | ----------- | ------------------ |
| url     | string      | Request URL        |
| options | RequestInit | Request parameters |
| rules   | Rules       | Crawler rules      |
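The examples in this README do not use the `options` field, so here is a minimal sketch of an options object that sets request parameters as well. It assumes `options` has the same shape as the second argument of `fetch` (which is what the `RequestInit` type normally denotes); the URL, header value, and selector are placeholders.

```typescript
// Sketch of a crawlFromUrl options object with request parameters.
// All concrete values below are placeholders for illustration.
const urlOptions = {
  url: 'https://example.com/',
  // RequestInit-style request parameters (assumed to be forwarded to the request).
  options: {
    method: 'GET',
    headers: { 'User-Agent': 'my-crawler/1.0' },
  },
  rules: {
    title: {
      selector: 'title',
      handlers: [{ method: 'text' }, { method: 'trim' }],
    },
  },
};

console.log(JSON.stringify(urlOptions, null, 2));
```

An object like this would then be passed as `crawlFromUrl(urlOptions)`.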

Rules

type Rules = Record<string, Rule>;

Rule

| Field    | Type               | Required | Notes                                                          |
| -------- | ------------------ | -------- | -------------------------------------------------------------- |
| selector | string             | No       | cheerio selector                                               |
| dataType | 'html' \| 'json'   | No       | Whether `selector` is a cheerio selector or a JSON selector    |
| handlers | Handler[]          | Yes      | The set of processing methods applied to the crawled elements  |

Handler

interface Handler {
  method: Method;
  args?: Args;
}
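To show how `Rules`, `Rule`, and `Handler` fit together, the sketch below spells them out as TypeScript types based on the tables in this README. The package's real `Method` and `Args` definitions are not shown here, so the union of method names and `unknown[]` are assumptions.

```typescript
// Assumed shapes, reconstructed from the README tables (not the package's source).
type Method =
  | 'prefix' | 'substring' | 'replace' | 'trim' | 'number' | 'br2nl'
  | 'sum' | 'resolveUrl' | 'decode' | 'length' | 'attr' | 'find'
  | 'eq' | 'text' | 'html' | 'map' | 'each';
type Args = unknown[]; // assumption: arguments are passed as a positional array

interface Handler {
  method: Method;
  args?: Args;
}

interface Rule {
  selector?: string;           // optional
  dataType?: 'html' | 'json';  // optional
  handlers: Handler[];         // required
}

type Rules = Record<string, Rule>;

// Each key of Rules becomes a key of the returned data object.
const installsRule: Rule = {
  selector: '.installs-text',
  handlers: [{ method: 'text' }, { method: 'trim' }, { method: 'number' }],
};
const rules: Rules = { installs: installsRule };

console.log(Object.keys(rules)); // [ 'installs' ]
```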

Method & Args

All available methods and their corresponding arguments are listed below.

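| Method     | Args              | Description |
| ---------- | ----------------- | ----------- |
| prefix     | (string)          | Prepends a string to the start of the string. |
| substring  | (number, ?number) | Extracts a substring from the string result. |
| replace    | (string, string)  | Global string replacement. |
| trim       | -                 | Removes leading and trailing whitespace. |
| number     | -                 | Converts the string to a number; defaults to 0 if the conversion fails. |
| br2nl      | -                 | Replaces `br` tags in the HTML with the newline character `\n` (matches `<br />`, `<br/>`, `<br >`, `<br>`, including any spaces and `\n` newlines inside them). |
| sum        | -                 | Converts an array of strings to numbers and adds them up. |
| resolveUrl | -                 | Resolves the obtained path against the current request URL. |
| decode     | -                 | Decodes an HTML-encoded string into normal readable text. |
| length     | -                 | Gets the length of an array or string. |
| attr       | (?string)         | cheerio method for getting an attribute. Only the attribute value of the first element in the matched set is returned. |
| find       | (string, ?string) | cheerio method that gets the descendants of each matched element, filtered by a selector, jQuery object, or element. For an array of objects, a given value of each object can be compared to select the matching object. |
| eq         | (number)          | cheerio method that selects the element at the given index. `.eq(-i)` counts backwards from the end. |
| text       | -                 | cheerio method that gets the combined text content of every element in the set, including their descendants. |
| html       | -                 | cheerio method that gets the HTML content string of the first selected element. |
| map        | (Rules)           | cheerio method that passes each element in the matched set through a function and produces a new array of the return values. The function can return a single data item or an array of data items to insert into the resulting set. If an array is returned, its elements are inserted into the set. If the function returns null or undefined, no element is inserted. |
| each       | (Handler[])       | cheerio method that loops over a cheerio object, applies some processing, and produces a new array. It differs from `map` in that `map` always returns an array of objects, whereas `each` does not necessarily return an array of objects. |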
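Handlers run in order, each receiving the previous handler's result. The sketch below re-implements a few of the string handlers from the table to make that chaining concrete. It is an illustration only, not the package's source: `applyHandlers`, the `StringHandler` type, and all edge-case behaviour are assumptions.

```typescript
// Illustrative re-implementation of a few string handlers (hypothetical names).
type StringHandler =
  | { method: 'prefix'; args: [string] }
  | { method: 'substring'; args: [number, number?] }
  | { method: 'replace'; args: [string, string] }
  | { method: 'trim' }
  | { method: 'number' };

function applyHandlers(input: string, handlers: StringHandler[]): string | number {
  let value: string | number = input;
  for (const handler of handlers) {
    // Every handler in this sketch expects a string, so stop once we hold a number.
    if (typeof value !== 'string') break;
    switch (handler.method) {
      case 'prefix':
        value = handler.args[0] + value;
        break;
      case 'substring':
        value = value.substring(handler.args[0], handler.args[1]);
        break;
      case 'replace':
        // Replace every occurrence of a literal string.
        value = value.split(handler.args[0]).join(handler.args[1]);
        break;
      case 'trim':
        value = value.trim();
        break;
      case 'number': {
        const parsed = Number(value);
        // Fall back to 0 when the conversion fails, as the table describes.
        value = Number.isNaN(parsed) ? 0 : parsed;
        break;
      }
    }
  }
  return value;
}

const installs = applyHandlers('  1,148,080 installs ', [
  { method: 'trim' },
  { method: 'substring', args: [0, 9] },
  { method: 'replace', args: [',', ''] },
  { method: 'number' },
]);
console.log(installs); // 1148080
```

Note that JavaScript's `String.prototype.substring` treats negative indices as 0, so an argument list such as `[0, -9]` only trims from the end if the handler uses `slice`-like semantics. How the package handles negative indices is not stated in this README.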