npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

z-almighty-parser-core

v1.2.0

Published

crawler prser core

Downloads

5

Readme

页面爬虫解析器核心

此工具适用于

  1. 对单独页面链接进行解析
  2. 配合队列进行多页面解析

解释说明

支持详情页下一页抓取,支持繁体转换,支持对字数统计,支持对图片数量统计。 目前主要针对静态页的解析,对json请求和jsonp请求的解析做了预留(暂不支持)。

api接口

  • [x] getLinks 获取待抓页链接
  • [x] getContent 获取详情页内容
  • [x] parse 解析获取内容[为getLinksgetContent的集合]
  • [x] isArticleUrl 检测链接是否是详情页
  • [x] isListUrl 检测链接是否是列表页
  • [x] getIdFromArticleUrl 获取页面链接的唯一标示

配置参数

文档说明

实例

解析器案例

糗事百科 - 基础 今日健康 - 繁体 爆料网 - 详情下一页

定义网站规则

module.exports = {
    // 域名 网站域名,设置域名后只处理这些域名下的网页
    domains: 'https://www.qiushibaike.com/',
    // 列表页url的正则,符合这些正则的页面会被当作列表页处理
    listUrlRegexes: [/^https:\/\/www\.qiushibaike\.com(\/[a-z0-9]+(\/page\/[0-9]+)?)?(\/)?$/],
    // 内容页url的正则,符合这些正则的页面会被当作内容页处理
    contentUrlRegexes: [/^https:\/\/www\.qiushibaike\.com\/article\/[0-9]+$/],
    // 从内容页中抽取需要的数据
    fields: [{
        // 作者
        name: 'author',
        meta: {
            selector: ['.author h2'],
            format: 'text'
        }
    }, {
        // 标签 
        name: 'tags',
        meta: {
            format: 'text',
            selector: ['.source a'],
            index: 0
        }
    }, {
        // 网页关键字
        name: 'keywords',
        meta: {
            format: 'meta',
            selector: ['meta[name="keywords"]']
        }
    }, {
        // 网页描述
        name: 'description',
        meta: {
            format: 'meta',
            selector: ['meta[name="description"]']
        }
    }, {
        // 详情
        name: 'content',
        meta: {
            selector: ['.content', '.thumb'],
            format: 'html'
        },
        required: true
    }, {
        name: 'imagesCount',
        meta: {
            selector: ['.thumb'],
            format: 'count',
            countType: 'image'
        },
        defaultValue: 0
    }, {
        name: 'wordsCount',
        meta: {
            selector: ['.content'],
            format: 'count',
            countType: 'text'
        },
        defaultValue: 0
    }, {
        name: 'comments',
        meta: {
            selector: ['.stats-comments .number'],
            format: 'text'
        },
        defaultValue: 0
    }, {
        name: 'likes',
        meta: {
            selector: ['.stats-vote .number'],
            format: 'text'
        },
        defaultValue: 0
    }],
    // 是否模拟用户请求
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
    // 编码 默认utf-8
    charset: null,
    // 回调函数 对所有数据做处理
    afterExtractAll: function (data) {
        data.fields['hits'] = 0;
        return data;
    },
    afterExtractField: function (fieldsName, data) {
        if (fieldsName === 'tags') {
            data = data ? data.split(',') : [];
        }
        if (fieldsName === 'comments') {
            data = +data;
        }
        if (fieldsName === 'likes') {
            data = +data;
        }
        return data;
    }
};

引入

const Crawler = require('almighty-parser-core')
const options = require('../parser/parser-qiushibaike.js')
const parser = new Crawler(options)

API测试

测试案例

parse

{ fields:
   { author: '草莓、牛奶巧克力',
     tags: [ '搞笑图片' ],
     keywords: '',
     description: '笑死我了',
     content: '<div class="content">\n\n笑死我了\n\n</div><div class="thumb">\n\n<img src="//pic.qiushibaike.com/system/pictures/11909/119095438/medium/app119095438.jpg" alt="糗事#119095438">\n\n</div>',
     imagesCount: 1,
     wordsCount: 4,
     comments: 0,
     likes: 457,
     from: 'https://www.qiushibaike.com/article/119095438',
     sourceId: 'com.qiushibaike.www-article-119095438',
     site: 'www.qiushibaike.com',
     hits: 0 },
  urls:
   [ 'https://www.qiushibaike.com/',
     'https://www.qiushibaike.com/hot/',
     'https://www.qiushibaike.com/imgrank/',
     'https://www.qiushibaike.com/text/',
     'https://www.qiushibaike.com/history/',
     'https://www.qiushibaike.com/pic/',
     'https://www.qiushibaike.com/textnew/',
     'https://www.qiushibaike.com/my',
     'https://www.qiushibaike.com/article/116423562',
     'https://www.qiushibaike.com/article/116424718',
     'https://www.qiushibaike.com/article/116421669',
     'https://www.qiushibaike.com/article/116423344',
     'https://www.qiushibaike.com/article/116426229',
     'https://www.qiushibaike.com/article/116423107',
     'https://www.qiushibaike.com/article/104614784',
     'https://www.qiushibaike.com/article/104590828',
     'https://www.qiushibaike.com/article/104629666',
     'https://www.qiushibaike.com/article/104599846',
     'https://www.qiushibaike.com/article/104598154',
     'https://www.qiushibaike.com/article/104619022',
     'https://www.qiushibaike.com/article/118954381',
     'https://www.qiushibaike.com/article/118491926',
     'https://www.qiushibaike.com/article/118563113',
     'https://www.qiushibaike.com/article/118806836',
     'https://www.qiushibaike.com/article/118525804',
     'https://www.qiushibaike.com/article/118770803',
     'https://www.qiushibaike.com/article/119008939',
     'https://www.qiushibaike.com/article/119033005',
     'https://www.qiushibaike.com/article/119036209',
     'https://www.qiushibaike.com/article/118922421',
     'https://www.qiushibaike.com/article/119014594',
     'https://www.qiushibaike.com/article/119009873',
     'https://www.qiushibaike.com/article/118934286',
     'https://www.qiushibaike.com/joke/',
     'https://www.qiushibaike.com/article/' ] }

其余接口测试请下载后运行

npm run test:qiushibaike

License

MIT License