html-play
v1.3.0
Fetch and parse dynamic HTML pages with Node.js like a boss 🕶
Features
- Intuitive APIs for extracting useful content like links and images.
- CSS selectors.
- Mocked user-agent (like a real web browser).
- Full JavaScript support.
Using Chromium under the hood by default, thanks to Playwright.

    await htmlPlay(url, { browser: true })
Recipes
Grab a list of all links and images on the page.
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://nodejs.org')

    // Will print all link URLs on the page
    console.log(dom.links)

    // Will print all image URLs on the page
    console.log(dom.images)
Select an element with a CSS selector.
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://nodejs.org')
    const intro = dom.find('#home-intro', { containing: 'Node' })

    // Will print: 'Node.js® is an open-source, cross-platform...'
    console.log(intro.text)
- Let's grab some wallpapers from Unsplash.
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://unsplash.com/t/wallpapers')
    const elements = dom.findAll('img[itemprop=thumbnailUrl]')
    const images = elements.map(({ image }) => image)

    // Will print something like
    // ['https://images.unsplash.com/photo-1705834008920-b08bf6a05223', ...]
    console.log(images)
- Let's load some news from Hacker News.
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://news.ycombinator.com')
    const titles = dom.findAll('.titleline')
    const news = titles.map(({ text, link }) => [text, link])

    // Will print something like
    // [['news 1', 'http://xxx.com'], ['news 2', 'http://yyy.com'], ...]
    console.log(news)
- Load a dynamic website, which means its content is generated by JavaScript!
    // Search for images of "flower" with Google
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://www.google.com/search?&q=flower&tbm=isch', { browser: true })

    // Filtering is still needed if you want this to work...
    console.log(dom.images)
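The filtering mentioned in the recipe above could look something like the sketch below. `filterImageUrls` is a hypothetical helper, not part of html-play, and the rules it applies (keep only absolute http(s) URLs, drop duplicates) are assumptions about what "filtering" might mean for a given page.

```javascript
// Hypothetical helper, not part of html-play: keeps only absolute http(s)
// image URLs and removes duplicates while preserving the original order.
function filterImageUrls(urls) {
  const seen = new Set()
  const result = []
  for (const url of urls) {
    // Skip inline data URIs, empty strings and anything that is not http(s)
    if (!/^https?:\/\//i.test(url)) continue
    if (seen.has(url)) continue
    seen.add(url)
    result.push(url)
  }
  return result
}

// Sample input standing in for `dom.images`
const sampleImages = [
  'data:image/gif;base64,R0lGODlhAQABAAAAACw=',
  'https://example.com/a.jpg',
  'https://example.com/a.jpg',
  '',
  'https://example.com/b.png',
]

// Prints: [ 'https://example.com/a.jpg', 'https://example.com/b.png' ]
console.log(filterImageUrls(sampleImages))
```

In practice you would pass `dom.images` instead of the sample array.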
- Send requests with custom cookies.
    import { htmlPlay } from 'html-play'

    const { dom } = await htmlPlay('https://httpbin.org/cookies', {
      fetch: { fetchInit: { headers: { Cookie: 'a=1; b=2;' } } },
    })

    // Will print { "cookies": { "a": "1", "b": "2" } }
    console.log(dom.text)
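If the cookies live in an object rather than a hand-written string, the header value can be built first. This is a minimal sketch: `toCookieHeader` and `withCookies` are hypothetical helpers, not part of html-play, and URL-encoding the values is an assumption that suits most but not all servers.

```javascript
// Hypothetical helper, not part of html-play: serializes an object of
// cookie names and values into a `Cookie` request header value.
function toCookieHeader(cookies) {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${encodeURIComponent(value)}`)
    .join('; ')
}

// Hypothetical helper: builds an options object in the same shape as the
// cookie recipe above ({ fetch: { fetchInit: { headers } } }).
function withCookies(cookies) {
  return {
    fetch: { fetchInit: { headers: { Cookie: toCookieHeader(cookies) } } },
  }
}

// Prints: a=1; b=2
console.log(withCookies({ a: '1', b: '2' }).fetch.fetchInit.headers.Cookie)
```

The result of `withCookies(...)` would then be passed as the second argument of `htmlPlay`.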
Installation
npm i html-play
If you want to use a browser to "run" the page before parsing, you'll need to install Chromium with Playwright.
npm i playwright
npx playwright install chromium
APIs
Methods
htmlPlay
Fetch a certain URL and return its response with the parsed DOM.
Example:
import { htmlPlay } from 'html-play' const { dom } = await htmlPlay('http://example.com')
Parameters:
- url
  - Type: string
  - The URL to fetch.
- options
  - (Optional) Type: object
  - Default: { fetch: true }
  - fetch
    - (Optional) Type: boolean | object
    - Default: true
    - If set to true, we will use the Fetch API to load the requested URL. You can also specify the options for the Fetch API by passing an object here.
    - fetcher
      - (Optional) Type: function
      - The fetch function we are going to use. We can pass a polyfill here.
    - fetchInit
      - (Optional) Type: object
      - The fetch parameters passed to the fetch function. See fetch#options. You can set HTTP headers or cookies here.
  - browser
    - (Optional) Type: boolean | object
    - Default: false
    - If set to true, we will use Playwright to load the requested URL. You can also specify the options for Playwright by passing an object here.
    - browser
      - (Optional) Type: object
      - The Playwright Browser instance to use.
    - page
      - (Optional) Type: object
      - The Playwright Page instance to use.
    - launchOptions
      - (Optional) The launchOptions passed to Playwright when we are launching the browser. See BrowserType#browser-type-launch.
    - beforeNavigate
      - (Optional) A custom hook function that will be called before the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.
    - afterNavigate
      - (Optional) A custom hook function that will be called after the page is loaded. page and browser can be accessed here as the properties of its first parameter to interact with the page.
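The options described above could be put together roughly as follows. This is a sketch under assumptions: `withDefaultHeaders` is a hypothetical helper, the custom fetcher is assumed to be called like the standard Fetch API (`fetcher(url, init)`) with headers given as a plain object, and the hook bodies only illustrate the kind of Playwright `page` calls you might make.

```javascript
// Hypothetical wrapper, not part of html-play: returns a fetcher that merges
// default headers into each request and then delegates to `baseFetch`.
// Assumes headers are passed as a plain object, not a Headers instance.
function withDefaultHeaders(baseFetch, defaults) {
  return (url, init = {}) =>
    baseFetch(url, { ...init, headers: { ...defaults, ...(init.headers ?? {}) } })
}

// Options for the fetch mode: a custom fetcher plus extra fetch parameters
const fetchOptions = {
  fetch: {
    fetcher: withDefaultHeaders(globalThis.fetch, { 'Accept-Language': 'en' }),
    fetchInit: { headers: { Cookie: 'a=1' } },
  },
}

// Options for the browser mode: launch options plus both navigation hooks
const browserOptions = {
  browser: {
    launchOptions: { headless: true },
    // `page` is read from the first parameter, as described above
    beforeNavigate: async ({ page }) => {
      await page.setViewportSize({ width: 1280, height: 720 })
    },
    afterNavigate: async ({ page }) => {
      await page.waitForSelector('img')
    },
  },
}
```

Either object would be passed as the second argument, for example `htmlPlay(url, browserOptions)`.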
Returns:
A Promise of the Response instance (see below).
Classes
Response
Properties
- url
  - Type: string
  - The URL of the response. If the response is redirected from another URL, the value will be the final redirected URL.
- status
  - Type: number
  - The HTTP status code of the response.
- content
  - Type: string
  - The response content as a plain string.
- dom
  - Type: object
  - The parsed root DOM. See DOMElement.
- json
  - Type: object | undefined
  - The parsed response JSON. If the response is not valid JSON, it will be undefined.
- rawBrowserResponse
  - Type: object
  - The raw response object returned by Playwright.
- rawFetchResponse
  - Type: object
  - The raw response object returned by the Fetch API.
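The behaviour described for json (parsed value, or undefined for invalid JSON) can be reproduced from content with a small helper. This is only a sketch of the idea: `parseJsonOrUndefined` is a hypothetical function and is not taken from the library's source.

```javascript
// Hypothetical helper, not part of html-play: parses a string as JSON and
// returns undefined instead of throwing when the string is not valid JSON.
function parseJsonOrUndefined(content) {
  try {
    return JSON.parse(content)
  } catch {
    return undefined
  }
}

// Prints: { cookies: { a: '1' } }
console.log(parseJsonOrUndefined('{"cookies":{"a":"1"}}'))

// Prints: undefined
console.log(parseJsonOrUndefined('<html></html>'))
```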
DOMElement
Properties
- html
  - Type: string
  - The "outerHTML" of this element.
- link
  - Type: string
  - If the element is an anchor element, this will be the absolute URL of the element's link; otherwise it will be an empty string.
- links
  - Type: string[]
  - The link URLs of all the anchor elements inside this element.
- text
  - Type: string
  - The text of the element with whitespace and line breaks stripped.
- rawText
  - Type: string
  - The original text of the element.
- image
  - Type: string
  - If the element is an image embed element, this will be the absolute URL of the element's image; otherwise it will be an empty string.
- images
  - Type: string[]
  - All the image URLs inside this element.
- backgroundImage
  - Type: string
  - The background image source extracted from the element's inline style.
- element
  - Type: object
  - The corresponding JSDOM element object.
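To make the properties above more concrete, here is one plausible way such values could be derived from raw DOM data. All three functions are hypothetical illustrations and are not the library's implementation; in particular, collapsing whitespace runs to a single space is an assumption about what "stripped" means for text.

```javascript
// Hypothetical: collapses runs of whitespace and line breaks into single
// spaces and trims both ends (one reading of the `text` property).
function normalizeText(rawText) {
  return rawText.replace(/\s+/g, ' ').trim()
}

// Hypothetical: resolves a possibly relative href against the page URL,
// returning an empty string when there is nothing usable to resolve.
function resolveUrl(href, baseUrl) {
  if (!href) return ''
  try {
    return new URL(href, baseUrl).href
  } catch {
    return ''
  }
}

// Hypothetical: extracts the first url(...) of a background or
// background-image declaration from an inline style string.
function extractBackgroundImage(style) {
  const match = /background(?:-image)?\s*:[^;]*url\(\s*(['"]?)(.*?)\1\s*\)/i.exec(style)
  return match ? match[2] : ''
}

// Prints: Node.js® is open-source
console.log(normalizeText('  Node.js®\n  is open-source '))

// Prints: https://nodejs.org/en/about
console.log(resolveUrl('/en/about', 'https://nodejs.org/en'))

// Prints: /img/hero.png
console.log(extractBackgroundImage("color: red; background-image: url('/img/hero.png')"))
```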
Methods
find
Find the first matched child DOMElement inside this element.
Parameters
- selector
  - Type: string
  - The CSS selector to use.
- options
  - (Optional) Type: object
  - containing
    - (Optional) Type: string
    - Check if the element contains the specified substring.
findAll
Find all matched child DOMElements inside this element.
Parameters
- selector
  - Type: string
  - The CSS selector to use.
- options
  - (Optional) Type: object
  - containing
    - (Optional) Type: string
    - Check if the element contains the specified substring.
getAttribute
Parameters
- qualifiedName
  - Type: string
Returns the element's first attribute whose qualified name is qualifiedName, or undefined if there is no such attribute.
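The containing option of find and findAll can be pictured as a substring filter applied on top of the selector match. The sketch below works on plain objects with a text property. `findContaining` and `findAllContaining` are hypothetical helpers, and matching against text (rather than the element's HTML) is an assumption, since the documentation does not say which is used.

```javascript
// Hypothetical: keeps every element whose text includes the substring,
// mirroring what findAll(selector, { containing }) is described to do.
function findAllContaining(elements, substring) {
  return elements.filter(({ text }) => text.includes(substring))
}

// Hypothetical: returns the first element whose text includes the
// substring, or undefined when none does.
function findContaining(elements, substring) {
  return elements.find(({ text }) => text.includes(substring))
}

// Plain objects standing in for DOMElement instances
const candidates = [
  { text: 'Download' },
  { text: 'Node.js is an open-source runtime' },
  { text: 'Learn Node.js' },
]

// Prints: Node.js is an open-source runtime
console.log(findContaining(candidates, 'Node').text)

// Prints: 2
console.log(findAllContaining(candidates, 'Node').length)
```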
Credits
This project is heavily inspired by Requests-HTML, another fabulous library for Python.