html-urls
v2.4.60
Get all links from a HTML markup
html-urls
Get all URLs from HTML markup. It's based on the W3C Link Checker.
Install
$ npm install html-urls --save
Usage
const got = require('got')
const htmlUrls = require('html-urls')
;(async () => {
  // The page to fetch is passed as the first CLI argument.
  const url = process.argv[2]
  if (!url) throw new TypeError('Need to provide a URL as first argument.')
  const { body: html } = await got(url)
  // Passing `url` lets relative links in the markup be resolved.
  const links = htmlUrls({ html, url })
  links.forEach(({ url }) => console.log(url))
  // => [
  //   'https://microlink.io/component---src-layouts-index-js-86b5f94dfa48cb04ae41.js',
  //   'https://microlink.io/component---src-pages-index-js-a302027ab59365471b7d.js',
  //   'https://microlink.io/path---index-709b6cf5b986a710cc3a.js',
  //   'https://microlink.io/app-8b4269e1fadd08e6ea1e.js',
  //   'https://microlink.io/commons-8b286eac293678e1c98c.js',
  //   'https://microlink.io',
  //   ...
  // ]
})()
It returns an object with the following structure for every value detected in the HTML markup:
value
Type: <string>
The original value.
url
Type: <string|undefined>
The normalized URL, if the value can be considered a URL.
uri
Type: <string|undefined>
The normalized value as URI.
See examples for more!
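Since `url` and `uri` may be undefined, it can help to filter the entries before using them. The following is a minimal sketch that assumes the entry shape described above; the `links` array is made-up sample data standing in for real `htmlUrls` output, and which values the library leaves without a `url` is an assumption here.

```javascript
// Hypothetical sample shaped like the documented entries;
// real entries come from htmlUrls({ html, url }).
const links = [
  { value: '/about', url: 'https://example.com/about', uri: 'https://example.com/about' },
  { value: 'mailto:hello@example.com', url: undefined, uri: 'mailto:hello@example.com' },
  { value: 'https://example.com/blog', url: 'https://example.com/blog', uri: 'https://example.com/blog' }
]

// Keep only the entries that carry a normalized URL.
const urls = links
  .filter(link => link.url !== undefined)
  .map(link => link.url)

// Original values of entries that have a URI but no URL.
const nonUrlValues = links
  .filter(link => link.url === undefined && link.uri !== undefined)
  .map(link => link.value)

console.log(urls)
console.log(nonUrlValues)
```

Checking against `undefined` explicitly, rather than relying on truthiness, keeps the intent clear when reading the code later.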
API
htmlUrls([options])
options
html
Type: string
Default: ''
The HTML markup.
url
Type: string
Default: ''
The URL associated with the HTML markup.
It is used to resolve relative links that may be present in the HTML markup.
whitelist
Type: array
Default: []
A list of links to be excluded from the final output. It supports regex patterns.
See matcher to learn more.
removeDuplicates
Type: boolean
Default: true
Remove duplicate links detected across all the HTML tags.
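To illustrate what the `url` and `removeDuplicates` options are for, here is a rough sketch of resolving relative links against a page URL and then dropping repeats. It uses only Node's built-in `URL` class and is not the library's actual implementation; `baseUrl` and `hrefs` are hypothetical values chosen for the example.

```javascript
// Illustrative only: this is not how html-urls works internally.
const baseUrl = 'https://example.com/docs/' // hypothetical page URL
const hrefs = [ // hypothetical href values found in the markup
  '/pricing',
  'guide.html',
  '../about',
  'https://example.com/pricing'
]

// Resolve each href against the page URL with the WHATWG URL API.
const resolved = hrefs.map(href => new URL(href, baseUrl).toString())

// Drop repeated links while preserving first-seen order.
const unique = [...new Set(resolved)]

console.log(unique)
```

Here `/pricing` and the absolute `https://example.com/pricing` resolve to the same address, so only one of them survives deduplication.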
Related
- xml-urls – Get all URLs from Feed/Atom/RSS/Sitemap XML markup.
- css-urls – Get all URLs referenced from stylesheet files.
License
html-urls © Kiko Beats, released under the MIT License. Authored and maintained by Kiko Beats with help from contributors.
kikobeats.com · GitHub @Kiko Beats · X @Kikobeats