
Scrapa

Yet another Node.js web scraper, converting HTML, XML, and JSON structures into proper JSON objects.

About the project

Scrapa processes requests in two phases. The phases are independent and can be used apart from each other: parse can be used without scrape, and vice versa.

1. Scrape

scrape - Makes the HTTP request, fetches the page, and can also select a specific part of the page, for example, finding a JSON string with a RegExp.

The scrape phase has two types for processing requests; both yield the plain response as a string.

  1. get - a simple fetch
  2. headless - a headless browser, for processing dynamic applications (e.g. React) that rely on complex JavaScript rendering.

You can also use this library to fetch remote content without the parse phase.
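
For example, a minimal standalone fetch using the documented get type (the URL here is just a placeholder):

import { scrape } from 'scrapa';

// Simple GET via fetch; returns the raw body as a string, no parsing involved.
let body = await scrape({ url: 'https://example.com', type: 'get' });

console.debug(body);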

Example: scraping Yahoo News via the headless browser.

import { scrape } from 'scrapa';

let body = await scrape({ url: 'https://news.yahoo.com', type: 'headless' });

console.debug(body);

2. Parse

parse - Handles three input types (HTML, XML, JSON) and converts them into a unified, easily consumed format.

Anonymity

All requests are currently sent with a basic hardcoded user agent: Mozilla/5.0 (iPhone; CPU iPhone OS 15_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.4 Mobile/15E148 Safari/604.1

The headless browser sends the same user agent.
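
The docs don't describe an option to override this. Since the url parameter is documented below as accepting either a page URL or request options, something like the following might let you set your own user agent; this is a hypothetical sketch, not a documented API:

import { scrape } from 'scrapa';

// Hypothetical: 'headers' is NOT a documented scrape option; shown only as a guess
// at how fetch-style request options might be forwarded.
let body = await scrape({
  url: 'https://example.com',
  headers: { 'User-Agent': 'my-custom-agent/1.0' },
});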

Setup

# npm
npm install --save scrapa

Example

Extract the title of the Yahoo News page:

import { scrape, parse } from 'scrapa';

let body = await scrape({url: 'https://news.yahoo.com'});

let parsed = await parse({body, fields: {
    title_now_is: 'head > title'
  }
});

console.info(parsed);
// Output:
{
  total: 1,
  fields: [
    { title_now_is: 'Yahoo News - Latest News & Headlines' }
  ]
}

Extract the top three items from Yahoo News:

import { scrape, parse } from 'scrapa';

let body = await scrape({url: 'https://news.yahoo.com'});

let parsed = await parse({body, fields: {
    article_title: '.js-stream-content ul li div'
  }
});

console.info(parsed);
// Output:
{
  total: 3,
  fields: [
    { article_title: 'COVID-affected tenants face eviction despite CDC ban' },
    { article_title: 'Cayman Islands jails U.S. student in COVID case' },
    { article_title: "Fla. scientist vows to speak COVID-19 'truth to power'" }
  ]
}

Extracting links from Yahoo News by finding the embedded JSON (root.App.main) and using it instead of HTML parsing:

import { scrape, parse } from 'scrapa';

let body = await scrape({
  url: 'https://news.yahoo.com',
  // Backslashes must be doubled inside a string passed to new RegExp.
  regExp: [new RegExp('root\\.App\\.main = (.*?);\\n.*\\}\\(this\\)\\);', 'gm')],
});

let parsed = await parse({
  body,
  type: 'json',
  fields: { href: 'context.dispatcher.stores.PageStore.pageData.links.{Iterator}.href' },
});

console.info(parsed);
// Output (truncated):
{
  total: 23,
  fields: [
    { href: '//s.yimg.com' },
    { href: '//mbp.yimg.com' },
    ...
    { href: 'https://s.yimg.com/cv/apiv2/favicon_y19_32x32.ico' },
    { href: 'https://news.yahoo.com/' }
  ]
}

Documentation and Usage

scrape({url, type='get|headless', regExp = []})

Simple Scraper

Params

  • String url: The page URL, or request options.
  • String type: The type of scrape required; two options: get - a simple GET via fetch; headless - a full browser for JavaScript-heavy applications that render client-side.
  • Array regExp: Array of RegExp instances used to clean the output; useful before passing it to parse.

Return

  • Promise Resolving with:
    • body (String): Scraped raw body

parse({ body, type = 'html', fields = {}, options = {} })

Finds the given fields and extracts the data, formatted in the output under the same field names.

parse supports three input types.

Type Options:

html - Uses Cheerio as the query selector. Fields should contain CSS-style selectors to find the data; all Cheerio CSS selectors are valid. Example usage: {fields: {page_title: 'head > title'}} populates the output field page_title with the page's title. Currently it takes the .innerHTML of each selected element and populates that as the output.

import { parse } from 'scrapa';

let parsed = await parse({body, type: 'html', fields: {
    page_title: 'head > title'
  }
});

console.debug(parsed);

json - Fields should be written as you would regularly read JSON, using DOT notation (store.books.0.title).

Array elements are accessed via DOT notation too, with a numeric index instead of [], as in the example above.

For arrays containing many rows, a special operator gets all the objects: use {Iterator} in place of the index number. It is replaced at runtime so that every item in the array is processed.

Other than that, properties behave as regular JSON array addresses.

import { parse } from 'scrapa';

let parsed = await parse({body, type: 'json', fields: {
  books_title: 'catalog.book.0.title',
  books_price: 'catalog.book.{Iterator}.price',
}});

console.debug(parsed);

xml - Converts the XML input to JSON first; field syntax is then identical to the json type.
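
A short sketch of the xml type under that description (the inline XML body and field name here are illustrative, not from the docs):

import { parse } from 'scrapa';

// XML is converted to JSON first, so field paths use the same DOT notation.
let body = '<catalog><book><title>Dune</title><price>9.99</price></book></catalog>';

let parsed = await parse({body, type: 'xml', fields: {
    book_title: 'catalog.book.title'
  }
});

console.debug(parsed);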

💡 More examples can be found in the unit tests folder.

Params

  • String/Object body: Input body to parse
  • String type: The type of body - 'html', 'json', or 'xml'.
  • Object fields: Key/value pairs of parsing properties, according to the type; see the examples above.
  • Object options: Settings to configure the parsing process and alter the output (see the sketch below):
    • limit (Number): Trims the output to the desired number of items.
    • reverse (Boolean): Reverses the output.
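
A minimal sketch combining the two documented options (the HTML body and selector here are made-up placeholders; the order in which limit and reverse are applied is assumed):

import { parse } from 'scrapa';

// Three list items as input.
let body = '<ul><li>one</li><li>two</li><li>three</li></ul>';

// Reverse the matched items and trim the output to two.
let parsed = await parse({
  body,
  type: 'html',
  fields: { item: 'ul > li' },
  options: { limit: 2, reverse: true },
});

console.debug(parsed);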

Return

  • Promise Resolving with:
    • total (Number): Number of elements found.
    • fields (Array): Array of extracted fields, keyed by the field names passed to parse.

TODO

  • Debug output
  • Cover scrape with tests
  • Add E2E
  • Parse transformation, e.g. parsing a date string into a Date object.