
element-scraper

v0.2.0

Published

Scrapes elements from a URL

Downloads

14

Readme

Element-scraper

Scrapes elements from a URL. Written in pure JavaScript, without any extra dependencies.

Be aware: method names are likely to change until version 0.1.0.

Release notes can be found here.

How to install

This module is intended to be used as a helper to fetch elements on a webpage. It does not handle logins at its current stage.

npm i element-scraper

How to use

import {getHtmlData, parseDataForElement, parseElementsInnerText} from 'element-scraper'

Alternate methods

import {hasCorrectHtmlProtocol, isHttps, parseDataForMultiLineElements} from 'element-scraper'

All available GET methods

  • getHtmlData
  • isHttps
  • hasCorrectHtmlProtocol

All available Parsing methods

  • greedyFindSingleLineElements
  • greedyFindMultiLineElementsByAttributeOrText
  • nonGreedyFindMultiLineElementsByAttributeOrText *
  • greedyFindMultiLineElementsByType
  • nonGreedyFindMultiLineElementsByType
  • nonGreedyFindSingleLineElementsInnerText

All parsing methods follow the schema (dataToParse, "What you are looking for"), except for nonGreedyFindSingleLineElementsInnerText(dataToParse, getEmptySpaces), where getEmptySpaces is a boolean. See the example further down this page.


Getting Data

getHtmlData

await getHtmlData(url)

This function is asynchronous. It GETs the entire HTML page you want to parse and returns it as a string.


hasCorrectHtmlProtocol

hasCorrectHtmlProtocol(url)

Checks whether the URL appears to have the correct protocol, as in http or https. It does not, however, check that the URL is completely valid.

isHttps

isHttps(url)

Checks whether the supplied URL is HTTPS. Returns true or false.
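The two checks above can be pictured with a minimal sketch like the one below. It is not the package's source: the function names looksLikeHttpUrl and usesHttps are hypothetical stand-ins, and the sketch assumes the checks only look at the start of the string.

```javascript
// Illustrative sketch only; hypothetical names, not element-scraper's own code.

// True when the string starts with http:// or https:// (case-insensitive).
// Like the check described above, it does not validate the rest of the URL.
function looksLikeHttpUrl(url) {
  return /^https?:\/\//i.test(url);
}

// True only when the string starts with https://.
function usesHttps(url) {
  return /^https:\/\//i.test(url);
}

console.log(looksLikeHttpUrl('https://example.com')); // true
console.log(looksLikeHttpUrl('ftp://example.com'));   // false
console.log(usesHttps('http://example.com'));         // false
```

A string such as 'https://' passes the protocol check even though it is not a usable URL, which matches the caveat that full validity is not verified.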

Parsing elements

Once you have the data string, you can start parsing out the elements you would like to get.

parseDataForElement

greedyFindSingleLineElements(dataToParse, elementMatch)

You pass your data string as dataToParse. To find a specific class name or element ID pass this as elementMatch.
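As a rough picture of what matching an element that sits on a single line could look like, here is a small sketch. It assumes that "single line" means the opening and closing tags are on the same line, and that elementMatch is text found inside the opening tag. The name findSingleLineElements is hypothetical, and the package may work differently.

```javascript
// Illustrative sketch only; hypothetical helper, not element-scraper's own code.

// Escape characters that have a special meaning in a regular expression.
function escapeForRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Find elements whose opening tag contains `elementMatch` and whose closing
// tag is on the same line. `.` does not match newlines, so a match can never
// span more than one line.
function findSingleLineElements(dataToParse, elementMatch) {
  const pattern = new RegExp(
    `<[^>\\n]*${escapeForRegExp(elementMatch)}[^>\\n]*>.*</\\w+>`,
    'g'
  );
  return dataToParse.match(pattern) || [];
}

const page = '<p id="a">one</p>\n<p id="b">two</p>';
console.log(findSingleLineElements(page, 'id="b"')); // [ '<p id="b">two</p>' ]
```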

parseElementsInnerText

nonGreedyFindSingleLineElementsInnerText(dataToParse, getEmptySpaces)

Gets all text between > and <. Empty strings are left out (see note) unless getEmptySpaces is true; it defaults to false when omitted.

Note: the empty strings in the array returned by this function come from the gap between adjacent tags, such as </innerElement></outerElement>, in an element like this: <outerElement>Some text<innerElement>Some more text</innerElement></outerElement>

[
  ' 15°', '15°', 'max',  '',    ' 15°', '15°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 14°', '14°',
  'max',  '',    ' 15°', '15°', 'max',  '',
  ' 15°', '15°', 'max',  '',    ' 15°', '15°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 14°', '14°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 13°', '13°',
  'max',  ''
]

If you want to keep the empty strings, as in the output above, make sure getEmptySpaces is set to true.
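To see where the empty strings come from, the sketch below runs a simple regular expression over the element from the note above. It is an illustration under assumptions, not the package's implementation: the helper name innerTexts is hypothetical, and its second parameter only mirrors the getEmptySpaces flag described here.

```javascript
// Illustrative sketch only; hypothetical helper, not element-scraper's own code.

// Collect every run of text found between a '>' and the next '<'.
// When keepEmpty is false (the default), empty strings are dropped.
function innerTexts(dataToParse, keepEmpty = false) {
  const texts = Array.from(dataToParse.matchAll(/>([^<]*)</g), (m) => m[1]);
  return keepEmpty ? texts : texts.filter((text) => text !== '');
}

const element =
  '<outerElement>Some text<innerElement>Some more text</innerElement></outerElement>';

console.log(innerTexts(element, true)); // [ 'Some text', 'Some more text', '' ]
console.log(innerTexts(element));       // [ 'Some text', 'Some more text' ]
```

The trailing empty string in the first result is the gap between </innerElement> and </outerElement>.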

nonGreedyFindMultiLineElementsByType or greedyFindMultiLineElementsByType

nonGreedyFindMultiLineElementsByType(dataToParse, elementMatch)
greedyFindMultiLineElementsByType(dataToParse, elementMatch)

To get a multi-line element, use nonGreedyFindMultiLineElementsByType or greedyFindMultiLineElementsByType. The non-greedy version stops at the first closing tag of the element it is looking for, while the greedy version continues to the last possible closing tag of the same element. You pass your data string as dataToParse. To find a specific class name or element ID, pass this as elementMatch.
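The difference between stopping at the first closing tag and running to the last one can be shown with lazy and greedy regular expression quantifiers. This is a sketch of the general idea, assuming a plain tag name is being searched for. The names findByTypeNonGreedy and findByTypeGreedy are hypothetical, and the package's own functions may return results in a different shape.

```javascript
// Illustrative sketch only; hypothetical helpers, not element-scraper's own code.
// Both assume `tag` is a plain tag name such as 'div'.

// Lazy quantifier (*?): stop at the first closing tag of this element type.
function findByTypeNonGreedy(dataToParse, tag) {
  const match = dataToParse.match(new RegExp(`<${tag}\\b[\\s\\S]*?</${tag}>`));
  return match ? match[0] : null;
}

// Greedy quantifier (*): run on to the last closing tag of this element type.
function findByTypeGreedy(dataToParse, tag) {
  const match = dataToParse.match(new RegExp(`<${tag}\\b[\\s\\S]*</${tag}>`));
  return match ? match[0] : null;
}

const markup = '<div>\n  a\n</div>\n<p>x</p>\n<div>\n  b\n</div>';

console.log(findByTypeNonGreedy(markup, 'div')); // the first <div> ... </div> only
console.log(findByTypeGreedy(markup, 'div'));    // everything from the first <div> to the last </div>
```

The greedy form is useful for nested elements of the same type, where stopping at the first closing tag would cut the outer element short.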

greedyFindMultiLineElementsByAttributeOrText or nonGreedy

The same idea applies here: the non-greedy version breaks early, while the greedy version does not.

greedyFindMultiLineElementsByAttributeOrText(dataToParse, textOrAttributeMatch)
nonGreedyFindMultiLineElementsByAttributeOrText(dataToParse, textOrAttributeMatch)

This will allow you to get elements with attributes like class name or text strings.
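A minimal sketch of attribute-based matching is shown below. It covers only the non-greedy case and only text that appears inside the opening tag, such as class="temp". The helper name findByAttributeNonGreedy is hypothetical, and this is not the package's implementation.

```javascript
// Illustrative sketch only; hypothetical helper, not element-scraper's own code.

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Find elements whose opening tag contains `attributeText`, stopping at the
// first closing tag with the same tag name (non-greedy). [\s\S] lets the
// match span several lines.
function findByAttributeNonGreedy(dataToParse, attributeText) {
  const pattern = new RegExp(
    `<(\\w+)[^>]*${escapeRegExp(attributeText)}[^>]*>[\\s\\S]*?</\\1>`,
    'g'
  );
  return Array.from(dataToParse.matchAll(pattern), (m) => m[0]);
}

const list = '<ul>\n<li class="temp">\n15°\n</li>\n<li class="wind">3</li>\n</ul>';
console.log(findByAttributeNonGreedy(list, 'class="temp"'));
// [ '<li class="temp">\n15°\n</li>' ]
```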

Participate in this module

If you want to participate in this module, feel free to open pull requests against the main branch. I will review them when time allows. Please make sure your commit message is clear about what it is trying to solve or add. If there are issues with how the module works, create an issue with an example for me to review.

Note

This is a school project and may or may not be maintained after the course.