walk-site (experimental)
A minimal, Puppeteer-based link crawler with concurrency, depth-limiting, extension filtering, and flexible callbacks.
Installation
npm i walk-site
Quick start
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Only visit links on the same domain
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Page content:', page.content)
  },
})
Examples
With depth limit
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Visit the initial page and its direct links
  depth: 1,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Depth:', page.depth)
  },
})
With concurrency
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Visit up to 5 pages concurrently
  concurrency: 5,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
  },
})
API Reference
walkSite(targetURL, options)
Crawls links starting from targetURL. Returns a Promise that resolves once all pages have been processed, or rejects on internal errors (unless they are caught by onError).
Parameters
targetURL: string | URL
The starting URL to crawl.

options: WalkSiteOptions
Configuration object:

| Option | Type | Description |
| ----------- | ---- | ----------- |
| depth | number \| undefined | Limits crawl depth. depth = 0 visits only targetURL; depth = 1 includes its children, etc. Default is undefined (no limit). |
| concurrency | number \| undefined | Number of pages processed in parallel. Defaults to 1 (serial crawling). |
| onURL | (url: URL, meta: { href: string; depth: number }) => boolean \| void \| Promise<boolean \| void> | Called before enqueuing a link. Return false to skip. |
| onPage | (page: Page) => void \| Promise<void> | Called after navigating to a page. Can be used to extract or process HTML content. |
| onError | (error: unknown, url: URL) => void \| Promise<void> \| undefined | Called on errors (e.g., network errors, or non-2xx HTTP statuses if you treat them as errors). If not provided, errors are logged to console.error. |
| extensions | string[] \| null \| undefined | File extensions recognized as HTML. Defaults to [".html", ".htm"]. If null, all links are followed. If you pass your own array, it completely overrides the default. |
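Putting the options together, here is a minimal sketch of a crawl that limits depth, processes several pages in parallel, follows every link regardless of extension, and reports failures through onError (the target URL and the specific option values are placeholders, not recommendations):

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  depth: 2,         // initial page plus two levels of links
  concurrency: 4,   // process up to 4 pages in parallel
  extensions: null, // follow all links, not just .html/.htm
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log(`${page.status} ${page.url.href} (depth ${page.depth})`)
  },
  onError: (error, url) => {
    console.error('Failed to crawl', url.href, error)
  },
})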
Returns
Promise<void>
Resolves when the entire crawl finishes (or rejects on internal errors, unless you handle them in onError).
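Because the returned promise can reject when an internal error is not handled in onError, callers may want to wrap the crawl themselves. A minimal sketch:

import { walkSite } from 'walk-site'

try {
  await walkSite(new URL('https://example.com'), {
    onPage: (page) => console.log(page.title),
  })
  console.log('Crawl finished')
} catch (error) {
  // Internal errors not handled by onError surface here as a rejection
  console.error('Crawl failed:', error)
}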
OnURL
Type
type OnURL = (
  url: URL,
  metadata: { href: string; depth: number },
) => boolean | void | Promise<boolean | void>
- Return false to skip crawling url.
- Any other return value adds it to the crawl queue.
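A sketch of an OnURL callback that uses both arguments; the same-host check, depth cap, and path filter below are illustrative choices, not library defaults:

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  onURL: (url, { href, depth }) => {
    // Stay on the same host, cap the depth, and skip the blog section
    if (url.hostname !== targetURL.hostname) return false
    if (depth > 3) return false
    if (url.pathname.startsWith('/blog/')) return false
    console.log('Queueing', href)
  },
  onPage: (page) => console.log(page.title),
})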
Page
Type
type Page = {
  title: string
  url: URL
  href: string
  content: string
  depth: number
  ok: boolean
  status: number
}
- title: The <title> of the page.
- url: The final URL as a URL object.
- href: String form of the link from which the page was reached.
- content: The page's HTML, as returned by page.content().
- depth: Depth relative to the starting URL.
- ok: true if the HTTP status was in the 2xx range.
- status: Numeric HTTP status code.
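As an illustration, an onPage callback could collect these fields into a simple report; the results array and the skipping of non-2xx pages are choices made for this sketch, not behavior of the library:

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')
const results = []

await walkSite(targetURL, {
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    // Ignore pages that did not return a 2xx status
    if (!page.ok) return
    results.push({
      title: page.title,
      url: page.url.href,
      depth: page.depth,
      status: page.status,
      bytes: page.content.length,
    })
  },
})

console.table(results)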