@vpukhanov/quick-scraper

v1.7.1

Published

2 years ago

An easy, lightweight scraper for humans with many built-in features.

vpukhanov

scraper decoding iconv utf-8 quick spider crawler

Quick Scraper

An easy, lightweight scraper built with TypeScript for a good developer experience.

Features.

If it works in cheerio, it will work here.
Automatically converts any encoding to UTF-8.
Built with TypeScript.
Great editor support.

Cons.

It doesn't play well with nested structures like this:

<p>
  abcd
  <a href="abcd">Some Url</a>
</p>

In this case, selecting only the text abcd won't work out of the box, because of limitations in how jQuery-style selection handles text that sits next to child elements. To handle such cases, take the raw output string, load it into cheerio or another HTML parser, and apply your logic there.
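The workaround described above can be sketched roughly as follows. This is a minimal example that assumes cheerio is installed alongside the scraper; `ownText` is a hypothetical helper name, not part of quick-scraper, and in real use the HTML string would come from the raw string in the scraper's output rather than from a literal.

```typescript
import * as cheerio from "cheerio";

// Hypothetical helper: return only the text that sits directly inside the
// first matched element, skipping text that belongs to child elements.
function ownText(html: string, selector: string): string {
  const $ = cheerio.load(html);
  return $(selector)
    .first()
    .contents() // child nodes, including text nodes
    .filter((_, node) => node.nodeType === 3) // 3 is the DOM text node type
    .text()
    .trim();
}

// The nested structure from the example above.
const html = `<p>
  abcd
  <a href="abcd">Some Url</a>
</p>`;

console.log(ownText(html, "p")); // prints "abcd"
```

Filtering on text nodes keeps "abcd" and drops "Some Url", which belongs to the nested `<a>` element.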

Installation

Yarn

yarn add quick-scraper

NPM

npm i quick-scraper

Usage

import { quickScraper } from "quick-scraper"

await quickScraper({
  url: "https://typestrong.org/ts-node/",
  options: {
    title: {
      // This property can be changed to the name you want.
      selector: ".hero__subtitle",
    },
    docs: {
      selector: "a.navbar__item:nth-child(1)",
      text: false, // Text is enabled by default, so you need to disable it explicitly.
      href: true, // One of the attributes available by default.
    },
    releaseNotes:{
      selector: "a.navbar__item:nth-child(3)",
      text: true, // You can also enable multiple attributes at once.
      href: true,
    }
  },
});

// Output
/*

{
  rawString: <html></html> structure of your page in string format, load it in cheerio or do whatever you like with it,
  data: {
    title: { text: 'TypeScript execution and REPL for node.js' },
    docs: { href: 'https://typestrong.org/ts-node/docs/' },
    releaseNotes: {
      text: 'Release Notes',
      href: 'https://github.com/TypeStrong/ts-node/releases'
    }
  }
}
*/
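Since each field in the output only carries the keys that were enabled, it can help to type the result and handle missing keys explicitly. The sketch below infers its types from the documented output; `ScrapedField`, `ScrapeResult`, and `summarize` are illustrative names and are not known to be exported by the package.

```typescript
// Shapes inferred from the documented output; assumptions, not package types.
interface ScrapedField {
  text?: string;
  href?: string;
}

interface ScrapeResult {
  rawString: string;
  data: Record<string, ScrapedField>;
}

// Sample result copied from the documented output (rawString shortened).
const result: ScrapeResult = {
  rawString: "<html></html>",
  data: {
    title: { text: "TypeScript execution and REPL for node.js" },
    docs: { href: "https://typestrong.org/ts-node/docs/" },
    releaseNotes: {
      text: "Release Notes",
      href: "https://github.com/TypeStrong/ts-node/releases",
    },
  },
};

// Hypothetical helper: describe one field, with fallbacks for disabled keys.
function summarize(fieldName: string, res: ScrapeResult): string {
  const field = res.data[fieldName];
  if (!field) return `${fieldName}: (not scraped)`;
  return `${fieldName}: ${field.text ?? "(no text)"} -> ${field.href ?? "(no href)"}`;
}

console.log(summarize("docs", result));
console.log(summarize("releaseNotes", result));
```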

Scrape an HTML string.

// This works much like quickScraper; only a few things need to change.

import { scrapeHtml } from 'quick-scraper';

await scrapeHtml({
  html: "html source code from https://typestrong.org",
  options: {
    title: {
      // This property can be changed to the name you want.
      selector: ".hero__subtitle",
    },
    docs: {
      selector: "a.navbar__item:nth-child(1)",
      text: false, // Text is enabled by default, so you need to disable it explicitly.
      href: true, // One of the attributes available by default.
    },
    releaseNotes:{
      selector: "a.navbar__item:nth-child(3)",
      text: true, // You can also enable multiple attributes at once.
      href: true,
    }
  },
});

// Output
/*

{
  rawString: <html></html> structure of your page in string format, load it in cheerio or do whatever you like with it,
  data: {
    title: { text: 'TypeScript execution and REPL for node.js' },
    docs: { href: 'https://typestrong.org/ts-node/docs/' },
    releaseNotes: {
      text: 'Release Notes',
      href: 'https://github.com/TypeStrong/ts-node/releases'
    }
  }
}
*/

Headless Quick Scrape

import puppeteer from "puppeteer";
import { quickScraperHeadless } from "quick-scraper";

const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();

await quickScraperHeadless({
  url: "https://typestrong.org/ts-node/",
  options: {
    title: {
      // This property can be changed to the name you want.
      selector: ".hero__subtitle",
    },
    docs: {
      selector: "a.navbar__item:nth-child(1)",
      text: false, // Text is enabled by default, so you need to disable it explicitly.
      href: true, // One of the attributes available by default.
    },
    releaseNotes:{
      selector: "a.navbar__item:nth-child(3)",
      text: true, // You can also enable multiple attributes at once.
      href: true,
    }
  },
  page: page
});

// Output
/*

{
  rawString: <html>{...}</html> structure of your page in string format, load it in cheerio or do whatever you like with it,
  data: {
    title: { text: 'TypeScript execution and REPL for node.js' },
    docs: { href: 'https://typestrong.org/ts-node/docs/' },
    releaseNotes: {
      text: 'Release Notes',
      href: 'https://github.com/TypeStrong/ts-node/releases'
    }
  }
}
*/

More Examples.

Custom Attribute

await scrapeHtml({
  html: "html source code from https://typestrong.org",
  options: {
    relStatus: {
      // This property can be changed to the name you want.
      selector: "a.navbar__item:nth-child(3)",
      attrs:{
        rel: true // Key will be the identifier of the attribute you want to scrape.
      }
    },
  },
});

Each custom attribute is available under the field's `attrs` key inside `output.data`.
// Output
/*

{
  data: {
    relStatus: { attrs: { rel: "noopener noreferrer" } }
  }
}
*/
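As a small illustration of working with a scraped custom attribute, the sketch below splits the `rel` value from the output above into tokens. The `AttrField` type and the `relTokens` helper are assumptions made for this example and are not part of the package.

```typescript
// Shape inferred from the documented output; an assumption, not a package type.
interface AttrField {
  attrs?: Record<string, string | undefined>;
}

// Hypothetical helper: split a space-separated rel value into its tokens.
function relTokens(field: AttrField): string[] {
  const rel = field.attrs?.rel ?? "";
  return rel.split(/\s+/).filter((token) => token.length > 0);
}

// Sample field copied from the documented output.
const relStatus: AttrField = { attrs: { rel: "noopener noreferrer" } };

const tokens = relTokens(relStatus);
console.log(tokens.includes("noopener")); // prints true
```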

List Item

import { quickScraper } from 'quick-scraper';

const scrapedData = await quickScraper({
  url: "https://www.ptwxz.com/html/11/11622/",
  options: {
    chapters: {
      selector: ".centent > ul > li",
      listItem: true,
    },
  },
});

// The scraped items are available at scrapedData.data.chapters.lists
/*

{
  raw: <html></html> structure of your page in string format, load it in cheerio or do whatever you like with it,
  data: {
    chapters: {
      lists: [
        { text: '第一章 键盘侠' },
        { text: '第2章 杀机' },
        { text: '第3章 晴天霹雳' },
        { text: '第4章 第一部秘典' },
        { text: '第5章 坑爹的抽奖' },
        { text: '第6章 打赌' },
        { text: '第7章 突破' },
        { text: '第8章 信春哥' },
        { text: '第9章 老婆闺蜜' },
        ... 740 More Items
      ]
    }
  }
}
*/
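The `lists` array can then be flattened into plain strings. The following is a minimal sketch based on the output shape shown above; `ListField` and `listTexts` are hypothetical names, and the sample data is a shortened copy of the documented output.

```typescript
// Shapes inferred from the documented output; assumptions, not package types.
interface ListItem {
  text?: string;
}

interface ListField {
  lists?: ListItem[];
}

// Hypothetical helper: collect the non-empty text of every list item,
// tolerating a missing field or a missing lists array.
function listTexts(field: ListField | undefined): string[] {
  return (field?.lists ?? [])
    .map((item) => item.text?.trim() ?? "")
    .filter((text) => text.length > 0);
}

// Shortened copy of the documented output.
const data: Record<string, ListField> = {
  chapters: {
    lists: [{ text: "第一章 键盘侠" }, { text: "第2章 杀机" }],
  },
};

console.log(listTexts(data["chapters"]));
console.log(listTexts(data["missing"]).length); // prints 0
```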

Visualization of this Repo.
