element-scraper

v0.2.0

Published

2 years ago

Scrapes elements from a URL

oxxygen

Element-scraper

Scrapes elements from a URL. Written in pure JavaScript, without any extra dependencies.

Be aware: method names are likely to change until version 0.1.0.

Release notes can be found here

How to install

This module is intended to be used as a helper to fetch elements on a webpage. It does not handle logins at its current stage.

npm i element-scraper

How to use

import {getHtmlData, parseDataForElement, parseElementsInnerText} from 'element-scraper'

Alternate methods

import {hasCorrectHtmlProtocol, isHttps, parseDataForMultiLineElements} from 'element-scraper'

All available GET methods

getHtmlData,
isHttps,
hasCorrectHtmlProtocol

All available Parsing methods

greedyFindSingleLineElements
greedyFindMultiLineElementsByAttributeOrText
nonGreedyFindMultiLineElementsByAttributeOrText *
greedyFindMultiLineElementsByType
nonGreedyFindMultiLineElementsByType
nonGreedyFindSingleLineElementsInnerText

All parsing methods follow the signature (dataToParse, "what you are looking for"), except nonGreedyFindSingleLineElementsInnerText(dataToParse, getEmptySpaces), whose second argument is a boolean. See the example further down this page.

Getting Data

getHtmlData

await getHtmlData(url)

This asynchronous function GETs the entire HTML page you want to parse and returns it as a string.

hasCorrectHtmlProtocol

hasCorrectHtmlProtocol(url)

Checks whether the URL appears to use the correct protocol, that is, http or https. It does not, however, check that the URL is completely valid.

isHttps

isHttps(url)

Checks whether the supplied URL uses HTTPS and returns true or false.
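
As a rough illustration of what these two checks amount to, here is a minimal sketch that uses only a regular expression and `String.prototype.startsWith`. The names `looksLikeHttpUrl` and `usesHttps` are hypothetical stand-ins and are not part of element-scraper; the module's own implementation may well differ.

```javascript
// Hypothetical stand-ins for illustration, not the module's actual code.

// True when the string starts with http:// or https:// (case-insensitive).
// Like the check described above, it says nothing about the rest of the URL.
function looksLikeHttpUrl(url) {
  return /^https?:\/\//i.test(url);
}

// True only when the string starts with https://.
function usesHttps(url) {
  return url.toLowerCase().startsWith('https://');
}

console.log(looksLikeHttpUrl('https://example.com')); // true
console.log(looksLikeHttpUrl('ftp://example.com'));   // false
console.log(usesHttps('http://example.com'));         // false
```

A prefix check like this is cheap, which is presumably why it makes sense to run it before attempting a request.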

Parsing elements

Once you have the data string, you can start parsing out the elements you would like to get.

greedyFindSingleLineElements

greedyFindSingleLineElements(dataToParse, elementMatch)

You pass your data string as dataToParse. To find a specific class name or element ID pass this as elementMatch.

nonGreedyFindSingleLineElementsInnerText

nonGreedyFindSingleLineElementsInnerText(dataToParse, getEmptySpaces)

Returns all text found between > and <. Empty strings are left out unless getEmptySpaces is true (see note); it defaults to false when omitted.

Note: the empty strings in the array returned by this function come from the empty gap in </innerElement></outerElement> in an element like this: <outerElement>Some text<innerElement>Some more text</innerElement></outerElement>

[
  ' 15°', '15°', 'max',  '',    ' 15°', '15°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 14°', '14°',
  'max',  '',    ' 15°', '15°', 'max',  '',
  ' 15°', '15°', 'max',  '',    ' 15°', '15°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 14°', '14°',
  'max',  '',    ' 14°', '14°', 'max',  '',
  ' 14°', '14°', 'max',  '',    ' 13°', '13°',
  'max',  ''
]

If you want to keep the empty strings, as in the output above, make sure getEmptySpaces is set to true.
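
To make the origin of those empty strings concrete, the following sketch reproduces the effect with a plain regular expression. The function `innerTexts` and its `keepEmpty` parameter are hypothetical and only mimic the behaviour described above; this is not how element-scraper is necessarily implemented.

```javascript
// Sketch only: a regex-based stand-in, not element-scraper's implementation.
// Collects every run of text that sits between a '>' and the next '<'.
function innerTexts(html, keepEmpty = false) {
  const texts = [...html.matchAll(/>([^<]*)</g)].map((match) => match[1]);
  // Adjacent tags such as </innerElement></outerElement> yield '' entries.
  return keepEmpty ? texts : texts.filter((text) => text !== '');
}

const sample =
  '<outerElement>Some text<innerElement>Some more text</innerElement></outerElement>';

console.log(innerTexts(sample, true)); // [ 'Some text', 'Some more text', '' ]
console.log(innerTexts(sample));       // [ 'Some text', 'Some more text' ]
```

The trailing `''` in the first result is the gap between the two closing tags, which is the same kind of entry that shows up in the array above.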

nonGreedyFindMultiLineElementsByType or greedyFindMultiLineElementsByType

nonGreedyFindMultiLineElementsByType(dataToParse, elementMatch)

greedyFindMultiLineElementsByType(dataToParse, elementMatch)

To get a multi-line element, use nonGreedyFindMultiLineElementsByType or greedyFindMultiLineElementsByType. Both look for the opening and closing tag of the same element. The difference is that the non-greedy version stops after the first closing tag of that element, while the greedy version continues to the last possible closing tag. You pass your data string as dataToParse. To find a specific class name or element ID, pass this as elementMatch.
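
The greedy versus non-greedy distinction can be sketched with two regular expressions that differ only in their quantifier. The helpers `findGreedy` and `findNonGreedy` below are hypothetical names used for illustration, and the sketch assumes a simple tag name is passed in; the module's real matching logic may be different.

```javascript
// Sketch only: hypothetical helpers built on plain regular expressions.
// `type` is inserted into the pattern unescaped, so pass simple tag names only.
function findGreedy(html, type) {
  // [\s\S]* spans newlines and is greedy: it runs to the LAST closing tag.
  const match = html.match(new RegExp(`<${type}[\\s\\S]*</${type}>`));
  return match ? match[0] : null;
}

function findNonGreedy(html, type) {
  // [\s\S]*? is lazy: it stops at the FIRST closing tag.
  const match = html.match(new RegExp(`<${type}[\\s\\S]*?</${type}>`));
  return match ? match[0] : null;
}

const page = [
  '<ul>',
  '  <li>first</li>',
  '  <li>second</li>',
  '</ul>',
].join('\n');

console.log(findNonGreedy(page, 'li')); // <li>first</li>
console.log(findGreedy(page, 'li'));    // spans both <li> elements
```

In this sketch the non-greedy variant is the one to reach for when elements of the same type repeat, since the greedy variant swallows everything between the first opening tag and the last closing tag.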

greedyFindMultiLineElementsByAttributeOrText or nonGreedy

The same idea applies here: the non-greedy version breaks early, while the greedy version does not.

greedyFindMultiLineElementsByAttributeOrText(dataToParse, textOrAttributeMatch)

nonGreedyFindMultiLineElementsByAttributeOrText(dataToParse, textOrAttributeMatch)

This will allow you to get elements with attributes like class name or text strings.

Participate in this module

If you want to participate in this module, feel free to open pull requests against the main branch. I will review them when time allows. Please make sure your commit message is clear about what it is trying to solve or add. If there are issues with how the module works, create an issue with an example for me to review.

Note

This is a school project and may or may not be maintained after the course.