scrape-html-web
v0.1.2
Website scraper
scrape-html-web
Extract content from a static HTML website.
Supports ESM and CJS. Requires Node >=16.
Unlike Puppeteer, for example, installing Scrape HTML Web does not download any version of Chromium. This keeps the library fast and lightweight.
Access to every website is not guaranteed; it may depend on the authorization the site requires.
The library is asynchronous.
Note: two dependencies are included in order to work:
Installation
To use Scrape HTML Web in your project, run:
npm install scrape-html-web
or
yarn add scrape-html-web
Usage
// ESM
import { scrapeHtmlWeb } from "scrape-html-web";
// or CommonJS
const { scrapeHtmlWeb } = require("scrape-html-web");
// Example
const options = {
url: "https://nodejs.org/en/blog/",
bypassCors: true, // avoids CORS errors when running in ESM
mainSelector: ".blog-index",
childrenSelector: [
{ key: "date", selector: "time", type: "text" },
// by default, the first option taken into consideration is attr
{ key: "version", selector: "a", type: "text" },
{ key: "link", selector: "a", attr: "href" },
],
};
(async () => {
const data = await scrapeHtmlWeb(options);
console.log(data);
})();
Response
//Example response
[
{
date: "04 Nov",
version: "Node v18.12.1 (LTS)",
link: "/en/blog/release/v18.12.1/",
},
{
date: "04 Nov",
version: "Node v19.0.1 (Current)",
link: "/en/blog/release/v19.0.1/",
},
// ...
{
date: "11 Jan",
version: "Node v17.3.1 (Current)",
link: "/en/blog/release/v17.3.1/",
},
{
date: "11 Jan",
version: "Node v12.22.9 (LTS)",
link: "/en/blog/release/v12.22.9/",
},
];
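The link values in the example response are relative paths. A minimal sketch of turning them into absolute URLs, assuming the scraped page's URL is used as the base. `toAbsoluteLinks` is a hypothetical helper written for this example and is not part of scrape-html-web:

```javascript
// Hypothetical helper (not part of scrape-html-web): resolves each item's
// relative "link" against the URL of the page that was scraped.
function toAbsoluteLinks(items, baseUrl) {
  return items.map((item) => ({
    ...item,
    // The built-in URL constructor resolves a relative path against a base URL.
    link: new URL(item.link, baseUrl).href,
  }));
}

// Sample item copied from the example response.
const sample = [
  {
    date: "04 Nov",
    version: "Node v18.12.1 (LTS)",
    link: "/en/blog/release/v18.12.1/",
  },
];

const absolute = toAbsoluteLinks(sample, "https://nodejs.org/en/blog/");
console.log(absolute[0].link);
```

The helper returns new objects instead of mutating the scraped data, so the original response stays available for comparison.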
options
- url - URL of the website to scrape (required)
- bypassCors - proxy URL settings used to bypass CORS errors in ESM
- mainSelector - the main selector where scraping starts (required)
- list - whether to iterate through the list of elements contained in mainSelector; default is true (not required)
- childrenSelector - an array of parameters that define the object you expect to receive (required)
url
const options = {
url: "https://nodejs.org/en/blog/", // URL from which you want to extract the data
...
};
bypassCors
const options = {
bypassCors: {
customURI: "https://api.allorigins.win/get?url=",
paramExstract: "contents",
}, // bypass cors error in ESM
...
};
const options = {
bypassCors: true,
...
};
You can use the default URL or a custom one.
If you pass a Boolean without specifying anything else, the default URL is used: https://api.allorigins.win/get?url=
It is also possible to pass a custom URL using the following parameters:
- customURI: the custom URL (required)
- paramExstract: the parameter to extract from the response to the call (not required)
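As an illustration of the two forms of bypassCors, here is a rough sketch of how a request URL could be derived from the option and how paramExstract could be used to pick the HTML out of the response. The helpers `buildRequestUrl` and `extractHtml` are hypothetical and only mirror the behaviour described here; the library's internals may differ. In particular, percent-encoding the target URL and reading the HTML from a JSON property are assumptions.

```javascript
// Default proxy URL stated in this README.
const DEFAULT_PROXY = "https://api.allorigins.win/get?url=";

// Hypothetical helper: builds the URL to request for a given bypassCors value.
function buildRequestUrl(targetUrl, bypassCors) {
  if (!bypassCors) return targetUrl; // no proxy requested
  const proxy = bypassCors === true ? DEFAULT_PROXY : bypassCors.customURI;
  // Assumption: the target URL is percent-encoded before being appended.
  return proxy + encodeURIComponent(targetUrl);
}

// Hypothetical helper: picks the HTML out of the proxy's response body.
// Assumption: when paramExstract is set, the body is a parsed JSON object
// and the HTML is stored under that property.
function extractHtml(body, paramExstract) {
  return paramExstract ? body[paramExstract] : body;
}

const requestUrl = buildRequestUrl("https://nodejs.org/en/blog/", true);
console.log(requestUrl);

const html = extractHtml({ contents: "<ul class=\"blog-index\"></ul>" }, "contents");
console.log(html);
```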
mainSelector
const options = {
...
mainSelector: ".blog-index", // the parent selector where scraping starts
...
};
Example HTML to extract from:
<ul class="blog-index">
<li>
<time datetime="2022-11-04T22:34:29+0000">04 Nov</time>
<a href="/en/blog/release/v18.12.1/">Node v18.12.1 (LTS)</a>
</li>
<li>
<time datetime="2022-11-04T18:05:19+0000">04 Nov</time>
<a href="/en/blog/release/v19.0.1/">Node v19.0.1 (Current)</a>
</li>
</ul>
list
const options = {
...
list: true, // or false
// if false, it loops only once over the parent element
// if true, it loops through all elements below the parent element
...
};
childrenSelector
const options = {
...
childrenSelector: [
{ key: "date", selector: "time", type: "text" },
{ key: "version", selector: "a", type: "text" },
{ key: "link", selector: "a", attr: "href" },
],
};
key: the name of the key in the returned object (required)
selector: the selector to search for in the HTML contained by the parent (required)
attr: the attribute you want to read from the matched element (not required)
Some of the more common attributes are: className, tagName, id, href, title, rel, src, style
type: the type of value to obtain (not required, default: "text")
Possible values: text, html
optional
replace - modifies the text or HTML obtained from a selector. It accepts either a RegExp or a custom function (not required)
canBeEmpty - allows the value of an element to be left blank; set to false by default (not required)
Example: { key: "title", selector: ".title", type: "text", canBeEmpty: true }
Example response: { title: '' } if the text in the selector is empty
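The required and optional fields above can be checked before calling the scraper. The following is a minimal sketch of such a check; `checkChildSelector` is a hypothetical helper and not part of scrape-html-web, and it assumes that type values are written in lowercase, as in the examples.

```javascript
const ALLOWED_TYPES = ["text", "html"];

// Hypothetical validator: returns a list of problems found in one
// childrenSelector entry (an empty list means the entry looks valid).
function checkChildSelector(entry) {
  const problems = [];
  if (typeof entry.key !== "string" || entry.key === "") {
    problems.push("key is required");
  }
  if (typeof entry.selector !== "string" || entry.selector === "") {
    problems.push("selector is required");
  }
  // Assumption: type is compared in lowercase, as used in the examples.
  if (entry.type !== undefined && !ALLOWED_TYPES.includes(entry.type)) {
    problems.push('type must be "text" or "html"');
  }
  if (
    entry.replace !== undefined &&
    !(entry.replace instanceof RegExp) &&
    typeof entry.replace !== "function"
  ) {
    problems.push("replace must be a RegExp or a function");
  }
  return problems;
}

const validEntry = checkChildSelector({ key: "link", selector: "a", attr: "href" });
const invalidEntry = checkChildSelector({ key: "date", type: "json" });
console.log(validEntry);
console.log(invalidEntry);
```

Returning a list of problems, instead of throwing on the first one, makes it possible to report every issue in a configuration at once.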
replace
const options = {
url: "https://nodejs.org/en/blog/",
mainSelector: ".blog-index",
childrenSelector: [
{
key: "date",
selector: "time",
type: "text",
replace: (text) => text + " 2022",
/* A custom function that appends the text
" 2022" to the date obtained from the selector */
},
{
key: "version",
selector: "a",
type: "html",
replace: /[{()}]/g,
/* A regex that removes
the parentheses within the HTML */
},
{
key: "link",
selector: "a",
attr: "href",
},
],
};
(async () => {
const data = await scrapeHtmlWeb(options);
console.log("example 2 :", data);
})();
//Example response
[
{
date: "04 Nov 2022",
version: '<a href="/en/blog/release/v18.12.1/">Node v18.12.1 LTS</a>',
link: "/en/blog/release/v18.12.1/",
},
{
date: "04 Nov 2022",
version: '<a href="/en/blog/release/v19.0.1/">Node v19.0.1 Current</a>',
link: "/en/blog/release/v19.0.1/",
},
// ...
{
date: "11 Jan 2022",
version: '<a href="/en/blog/release/v17.3.1/">Node v17.3.1 Current</a>',
link: "/en/blog/release/v17.3.1/",
},
{
date: "11 Jan 2022",
version: '<a href="/en/blog/release/v12.22.9/">Node v12.22.9 LTS</a>',
link: "/en/blog/release/v12.22.9/",
},
];
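The two forms of replace shown in this example can be pictured with a small sketch. `applyReplace` is a hypothetical helper, not the library's implementation; it assumes that a custom function receives the extracted value and returns the new one, and that RegExp matches are removed, which is what the example response suggests.

```javascript
// Hypothetical helper: applies a "replace" option to an extracted value.
function applyReplace(value, replace) {
  // A custom function receives the value and returns the new value.
  if (typeof replace === "function") return replace(value);
  // Assumption: every RegExp match is removed from the value.
  if (replace instanceof RegExp) return value.replace(replace, "");
  // No replace option given: keep the value unchanged.
  return value;
}

const withYear = applyReplace("04 Nov", (text) => text + " 2022");
const withoutParens = applyReplace("Node v18.12.1 (LTS)", /[{()}]/g);
console.log(withYear);
console.log(withoutParens);
```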
❤️ Support
If you make any profit from this or you just want to encourage me, you can offer me a coffee and I'll try to accommodate you.
Please note: 🙏
This library was created for educational purposes. It is not intended for taking information that you are not authorized to access.