pcrawl

v1.0.2

Published

3 years ago

yunfengsay

Quick Start

$ npm i pcrawl -g
$ pcrawl -c ./config.yml -o ./result.json

args

parser.add_argument("-v", "--version", {
  action: "version",
  version,
});
parser.add_argument("-c", "--config", {
  help: "config file",
});
parser.add_argument("-o", "--output", {
  help: "write the result to the target file",
});
parser.add_argument("-j", "--jquery", {
  help: "whether to inject jQuery into the page, true|false, default false",
});
parser.add_argument("--log", {
  help: "enable logging, true|false, default false",
});

config.yml

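| name         | description |
| ------------ | ----------- |
| url          | URL of the page to crawl. |
| wait         | DOM marker to wait for. Crawling starts as soon as this class or id appears. |
| until        | End marker. Crawling stops as soon as this class or id appears. |
| next         | Action that moves to the next page, e.g. "click xxx", where xxx is a class or id. |
| dataItemFunc | Path to a script. Its function runs each time the `wait` marker appears on the page. |
| authUrl      | Authentication URL. If the site requires login, put the login page address here. After it loads, pcrawl waits 10 (the unit is not stated, presumably seconds) so you can enter your password. |
| authActions  | To automate the login, a series of actions (puppeteer) can also be configured here. |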
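The `next` and `authActions` values are short action strings such as "click .paginate_older" or "type #username ...". How pcrawl parses them internally is not documented here, so the following is only a sketch of one plausible reading: the first word is the verb, the second is the selector, and anything after that is the text to type. `parseAction` is a hypothetical helper, not part of pcrawl's API.

```javascript
// Hypothetical helper (not pcrawl's actual code): split an action string
// into a verb, a selector, and optional trailing text.
// Assumption: selectors contain no spaces.
function parseAction(action) {
  const [verb, selector, ...rest] = action.trim().split(/\s+/);
  if (!verb || !selector) {
    throw new Error(`Malformed action: "${action}"`);
  }
  const parsed = { verb, selector };
  if (rest.length > 0) {
    // Everything after the selector is treated as the text to type.
    parsed.text = rest.join(" ");
  }
  return parsed;
}

console.log(parseAction("click .paginate_older"));
console.log(parseAction("type #username alice"));
```

With a structure like this, a runner could map `click` and `type` onto the corresponding puppeteer page calls, but that mapping is an assumption about the design rather than something the README states.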

dataItemFunc

The script must define a $main function.
$main must return an array.
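As a minimal sketch of that contract, the script below defines `$main` and returns an array using plain DOM calls rather than jQuery. It assumes the script is evaluated inside the crawled page, so the global `document` is available; the `.item` selector and the `text` field are hypothetical placeholders.

```javascript
// Minimal dataItemFunc sketch.
// Assumption: pcrawl evaluates this inside the page, so `document` exists.
// The ".item" selector and the `text` field are hypothetical.
function $main() {
  try {
    const nodes = Array.from(document.querySelectorAll(".item"));
    // Build one plain object per matched element.
    return nodes.map((node) => ({ text: node.textContent.trim() }));
  } catch (e) {
    // Always hand back an array, even when something goes wrong.
    console.log(e);
    return [];
  }
}
```

Returning an empty array on failure keeps the "must return an array" rule intact even when the selector matches nothing or the page is not in the expected state.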

demo

$ pcrawl -c ./config.yml -o ./result.json

config.yml

instapaper:
  url: https://www.instapaper.com/u
  wait: .paginate_older
  until: not .paginate_older
  next: click .paginate_older
  dataItemFunc: ./dataItemFunc.js
  authUrl: https://www.instapaper.com/user/login
  authActions:
    - "type #username [email protected]"
    - "type #password yunfeng0409"
    - "click #log_in"

dataItemFunc.js

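// Runs in the page each time the `wait` marker appears; uses jQuery ($).
function $main(args) {
    try {
        // Collect every article entry on the current page.
        const itemList = $('.article_item').toArray();
        const data = itemList.map(v => {
            // Wrap the first descendant of `v` that matches the selector.
            const find = (className) => {
                return $($(v).find(className)[0]);
            };
            const title = find('.article_title').text();
            const desc = find('.article_preview').text();
            const originUrl = find('.js_domain_linkout').get(0).href;
            const result = {
                title,
                desc,
                originUrl,
            };
            return result;
        });
        return data;
    } catch (e) {
        // On any error, log it and still return an array.
        console.log(e);
        return [];
    }
}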
