getsy

v0.9.2

Published

A simple browser/client-side web scraper.

Downloads

7

Readme

Getsy

A simple browser/client-side web scraper. Try it out in a REPL: http://www.getgetsy.com

TODOS:

  • [x] Support for websites with infinite scroll.
  • [ ] Support for websites with click pagination.

Installation options:

  • Run npm install --save getsy or yarn add getsy
  • Download the umd build and link it using a script tag

How to use:

This library exposes a single function: getsy(url: string, optionsObject?: options): Promise<Getsy>

parameters:

  • url: The url of the website you wish to scrape.

  • optionsObject(optional):

    • corsProxy(optional string): The endpoint of the CORS proxy you wish to use. (See the CorsProxy section below for more info)

    • resolveURLs(optional boolean): Whether getsy should resolve all relative URLs in the resource to absolute URLs so they don't fail when they are loaded from another page. (defaults to true)

    • iframe: A boolean or object with width and height properties indicating if getsy should start in iframeMode or not. iframe mode will wait for the resource to be mounted in a hidden iframe so you can extract more data through pagination or infinite scrolling. (defaults to false)
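To illustrate what resolving relative URLs means for the resolveURLs option, here is a minimal sketch using the standard URL API. The helper name resolveUrl and the example.com addresses are made up for illustration; this is not getsy's internal code, only the general idea.

```javascript
// Hypothetical helper (not part of getsy): resolve a possibly relative URL
// against the URL of the page it was scraped from.
function resolveUrl(relativeOrAbsolute, pageUrl) {
  return new URL(relativeOrAbsolute, pageUrl).href
}

const page = 'https://example.com/docs/index.html'

console.log(resolveUrl('img/logo.png', page)) // relative to the page's folder
console.log(resolveUrl('/styles/main.css', page)) // relative to the site root
console.log(resolveUrl('https://cdn.example.org/lib.js', page)) // already absolute
```

Without this step, a scraped src="img/logo.png" would be looked up relative to your own page rather than the original site.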

The function returns a promise that resolves to a Getsy object on success and rejects if it was unable to load the requested page.

Getsy objects have a getMe method for scraping the resource's contents. This method is just a wrapper over the jQuery function, so you can chain other jQuery methods on it. If you need the raw data, you can access its content property. (More on the Getsy object below)

Example (Promises):

import getsy from 'getsy'

getsy('https://en.wikipedia.org/wiki/"Hello,_World!"_program').then(myGetsy => {
  console.log(myGetsy.getMe('#firstHeading').text())
})

Example (Async/Await):

import getsy from 'getsy'

async function testing() {
  const myGetsy = await getsy('https://en.wikipedia.org/wiki/"Hello,_World!"_program')

  console.log(myGetsy.getMe('#firstHeading').text())
}

testing()
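Because the promise rejects when the page cannot be loaded, you may want to guard the call. The sketch below is one possible pattern, not something the library prescribes: scrapeText is a hypothetical helper, and getsyFn stands in for the imported getsy function so the snippet does not depend on a browser.

```javascript
// Hypothetical helper: `getsyFn` is expected to behave like getsy(url),
// i.e. return a promise for an object that has a getMe(selector) method.
async function scrapeText(getsyFn, url, selector) {
  try {
    const page = await getsyFn(url)
    return page.getMe(selector).text()
  } catch (error) {
    // Reached when the promise rejects, e.g. the page could not be loaded.
    console.error(`Could not load ${url}:`, error)
    return null
  }
}
```

In a real page you would call it as scrapeText(getsy, someUrl, '#firstHeading') and treat a null result as a failed load.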

Here's how you might use it with a website that has infinite scrolling:

async function infiniteScrape() {
  const myGetsy = await getsy('http://scrollmagic.io/examples/advanced/infinite_scrolling.html', { iframe: true })
  
  console.log(`${myGetsy.getMe('.box1').length} boxes.`)
  
  const { succesfulTimes, totalRetries } = await myGetsy.scroll(10)
  
  console.log(`New content loaded ${succesfulTimes} times with ${totalRetries} total retries.`)
  console.log(`${myGetsy.getMe('.box1').length} boxes.`) // More content!
}

infiniteScrape()

The Getsy Object:

The Getsy object has the following properties and methods:

  • corsProxy: The same one passed from the options object or the default value.

  • content: The original string data received from the request.

  • iframe: A reference to its iframe element if in iframe mode.

  • iframeDoc: A reference to its iframe's document if in iframe mode.

  • getMe(selector: string): JQuery: Query the resource's DOM or the iframe if in iframe mode with a jQuery selector. Returns a JQuery object.

  • scroll(numberOfTimes: number, element?: HTMLElement, interval?: number, retries?: number): Promise<scrollResolve>: Scrolls to the bottom of an element (defaults to body) up to numberOfTimes times to load new data. The interval (defaults to 2000) is the time in milliseconds that Getsy waits before checking whether new content has loaded. If no new content has loaded, it retries as many times as specified by retries (defaults to 5). If it runs out of retries without new content, it resolves the Promise early instead of waiting for the remaining numberOfTimes. Note: the retry count resets to 0 on every successful content load. Returns a Promise that resolves to an object with the number of times new content was loaded (.succesfulTimes) and the total number of retries (.totalRetries).

  • hideFrame(): void: Hides the iframe if applicable.

  • showFrame(): void: Shows the iframe if applicable.
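The retry behaviour described for scroll can be sketched roughly as follows. This is a simplified model written for illustration, not getsy's source: scrollSketch and the loadMore callback are hypothetical, and loadMore is assumed to scroll and resolve to true when new content appeared.

```javascript
const wait = ms => new Promise(resolve => setTimeout(resolve, ms))

// Simplified model of scroll(): not the library's implementation.
async function scrollSketch(numberOfTimes, loadMore, interval = 2000, retries = 5) {
  let succesfulTimes = 0 // spelled like the property getsy resolves with
  let totalRetries = 0

  for (let i = 0; i < numberOfTimes; i++) {
    let retriesLeft = retries // the retry budget resets for every load
    let loaded = false

    while (true) {
      await wait(interval)
      loaded = await loadMore()
      if (loaded || retriesLeft === 0) break
      retriesLeft--
      totalRetries++
    }

    if (!loaded) break // out of retries: resolve early
    succesfulTimes++
  }

  return { succesfulTimes, totalRetries }
}

// Simulated loader: new content appears on the first two checks only.
let remaining = 2
scrollSketch(10, async () => remaining-- > 0, 0).then(result => console.log(result))
```

The early break is what keeps a page that has run out of content from costing the full numberOfTimes × interval wait.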

CorsProxy:

This library uses a CORS proxy to work around the browser's restrictions on cross-origin requests. If you don't provide one, it defaults to https://crossorigin.me/.
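As a rough picture of what a proxy endpoint does to a request, the sketch below builds a proxied address. It assumes a prefix-style proxy, where the target URL is appended to the proxy endpoint; whether getsy or your chosen proxy uses exactly this format is an assumption, and buildProxiedUrl is a hypothetical helper.

```javascript
// Assumption: the proxy expects <proxy endpoint>/<target URL>.
// Hypothetical helper, not part of getsy.
function buildProxiedUrl(targetUrl, corsProxy = 'https://crossorigin.me/') {
  const base = corsProxy.endsWith('/') ? corsProxy : corsProxy + '/'
  return base + targetUrl
}

console.log(buildProxiedUrl('https://example.com/page'))
console.log(buildProxiedUrl('https://example.com/page', 'http://localhost:8080'))
```

The request then goes to the proxy, which fetches the target page server-side and returns it with headers that the browser accepts.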

Some node CorsProxy servers: