walk-site (experimental)
A minimal, Puppeteer-based link crawler with concurrency, depth-limiting, extension filtering, and flexible callbacks.
Installation
npm i walk-site
Quick start
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Only visit links on the same domain
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Page content:', page.content)
  },
})
Examples
With depth limit
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Visit the initial page and its direct links
  depth: 1,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
    console.log('Depth:', page.depth)
  },
})
With concurrency
import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  // Visit up to 5 pages concurrently
  concurrency: 5,
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log('Page title:', page.title)
  },
})
API Reference
walkSite(targetURL, options)
Crawls links starting from targetURL. Returns a Promise that resolves once all pages have been processed, or rejects on internal errors (unless they are caught by onError).
Parameters
targetURL: string | URL
The starting URL to crawl.

options: WalkSiteOptions
Configuration object:

| Option | Type | Description |
| ----------- | ---- | ----------- |
| depth | number \| undefined | Limits crawl depth. depth = 0 visits only targetURL; depth = 1 includes its children, etc. Default is undefined (no limit). |
| concurrency | number \| undefined | Number of pages processed in parallel. Defaults to 1 (serial crawling). |
| onURL | (url: URL, meta: { href: string; depth: number }) => boolean \| void \| Promise<boolean \| void> | Called before enqueuing a link. Return false to skip. |
| onPage | (page: Page) => void \| Promise<void> | Called after navigating to a page. Can be used to extract or process HTML content. |
| onError | (error: unknown, url: URL) => void \| Promise<void> \| undefined | Called on errors (e.g., network errors, or non-2xx HTTP statuses if you treat them as errors). If not provided, errors are logged to console.error. |
| extensions | string[] \| null \| undefined | File extensions recognized as HTML. Defaults to [".html", ".htm"]. If null, all links are followed. If you pass your own array, it completely overrides the default. |
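Putting the options together, here is a minimal sketch of a crawl that limits depth, processes several pages in parallel, follows every link regardless of extension, and reports failures through onError (the target URL and the specific option values are placeholders, not recommendations):

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  depth: 2,         // initial page plus two levels of links
  concurrency: 4,   // process up to 4 pages in parallel
  extensions: null, // follow all links, not just .html/.htm
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    console.log(`${page.status} ${page.url.href} (depth ${page.depth})`)
  },
  onError: (error, url) => {
    console.error('Failed to crawl', url.href, error)
  },
})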
Returns
Promise<void>
Resolves when the entire crawl finishes (or rejects on internal errors, unless you handle them in onError).
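Because the returned promise can reject when an internal error is not handled in onError, callers may want to wrap the crawl themselves. A minimal sketch:

import { walkSite } from 'walk-site'

try {
  await walkSite(new URL('https://example.com'), {
    onPage: (page) => console.log(page.title),
  })
  console.log('Crawl finished')
} catch (error) {
  // Internal errors not handled by onError surface here as a rejection
  console.error('Crawl failed:', error)
}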
OnURL
Type
type OnURL = (
  url: URL,
  metadata: { href: string; depth: number },
) => boolean | void | Promise<boolean | void>
- Return false to skip crawling url.
- Any other return value adds it to the crawl queue.
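A sketch of an OnURL callback that uses both arguments; the same-host check, depth cap, and path filter below are illustrative choices, not library defaults:

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')

await walkSite(targetURL, {
  onURL: (url, { href, depth }) => {
    // Stay on the same host, cap the depth, and skip the blog section
    if (url.hostname !== targetURL.hostname) return false
    if (depth > 3) return false
    if (url.pathname.startsWith('/blog/')) return false
    console.log('Queueing', href)
  },
  onPage: (page) => console.log(page.title),
})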
Page
Type
type Page = {
  title: string
  url: URL
  href: string
  content: string
  depth: number
  ok: boolean
  status: number
}
- title: The <title> of the page.
- url: The final URL as a URL object.
- href: String form of the link from which the page was reached.
- content: The page's HTML, as returned by page.content().
- depth: Depth relative to the starting URL.
- ok: true if the HTTP status was in the 2xx range.
- status: Numeric HTTP status code.
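As an illustration, an onPage callback could collect these fields into a simple report; the results array and the skipping of non-2xx pages are choices made for this sketch, not behavior of the library:

import { walkSite } from 'walk-site'

const targetURL = new URL('https://example.com')
const results = []

await walkSite(targetURL, {
  onURL: (url) => url.hostname === targetURL.hostname,
  onPage: (page) => {
    // Ignore pages that did not return a 2xx status
    if (!page.ok) return
    results.push({
      title: page.title,
      url: page.url.href,
      depth: page.depth,
      status: page.status,
      bytes: page.content.length,
    })
  },
})

console.table(results)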