# 404 Crawler 🏊‍♂️
A command line interface to crawl and detect 404 pages from a sitemap.
## 📊 Usage

### Install

Make sure npm is installed on your computer. To learn more about it, visit https://docs.npmjs.com/downloading-and-installing-node-js-and-npm

In a terminal, run

```sh
npm install -g @algolia/404-crawler
```

After that, you'll be able to use the `404crawler` command in your terminal.
### Examples

Crawl and detect every 404 page from the Algolia website's sitemap:

```sh
404crawler crawl -u https://algolia.com/sitemap.xml
```

Use JavaScript rendering to crawl and identify all 404 or 'Not Found' pages on the Algolia website:

```sh
404crawler crawl -u https://algolia.com/sitemap.xml --render-js
```

Crawl and identify all 404 pages on the Algolia website by analyzing its sitemap, including all sub-path variations:

```sh
404crawler crawl -u https://algolia.com/sitemap.xml --include-variations
```
### Options

- `--sitemap-url` or `-u`: Required. URL of the `sitemap.xml` file.
- `--render-js` or `-r`: Use JavaScript rendering to crawl and identify a 'Not Found' page even when the status code isn't a 404. This option is useful for websites that return a 200 status code even if the page is not found (for example, Next.js with a custom not-found page).
- `--output` or `-o`: Output path for the JSON file of the results. Example: `crawler/results.json`. If not set, no file is written after the crawl.
- `--include-variations` or `-v`: Include all sub-path variations of the URLs found in the `sitemap.xml`. For example, if https://algolia.com/foo/bar/baz is found in the sitemap, the crawler will test https://algolia.com/foo/bar/baz, https://algolia.com/foo/bar, https://algolia.com/foo, and https://algolia.com.
- `--exit-on-detection` or `-e`: Exit as soon as a 404 or 'Not Found' page is detected.
- `--run-in-parallel` or `-p`: Run the crawler with multiple pages in parallel. By default, the number of parallel instances is 10. See the `--batch-size` option to configure this number.
- `--batch-size` or `-s`: Number of parallel crawler instances to run; the higher this number, the more resources are consumed. Only available when the `--run-in-parallel` option is set. Defaults to 10.
- `--browser-type` or `-b`: Browser to use to crawl pages. Can be 'firefox', 'chromium', or 'webkit'. Defaults to 'firefox'.
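These options can be combined. For instance, a parallel crawl with JavaScript rendering in Chromium that writes its results to a JSON file might look like this (a sketch using only the flags documented above; the batch size of 20 and the output path are illustrative values):

```sh
404crawler crawl \
  -u https://algolia.com/sitemap.xml \
  --render-js \
  --run-in-parallel \
  --batch-size 20 \
  --browser-type chromium \
  --output crawler/results.json
```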
## 👨‍💻 Get started (maintainers)

This CLI is built with TypeScript and uses `ts-node` to run the code locally.

### Install

Install all dependencies:

```sh
pnpm i
```

### Run locally

```sh
pnpm 404crawler crawl <options>
```
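For example, a local run against a live sitemap (a sketch using the documented flags; depending on your pnpm version you may need the `--` separator to forward the flags to the CLI):

```sh
pnpm 404crawler crawl -- -u https://algolia.com/sitemap.xml --render-js
```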
### Deploy

1. Update the `package.json` version
2. Commit and push changes
3. Build JS files in `dist/` with `pnpm build`
4. Initialize npm with the Algolia org as scope: `npm init --scope=algolia`, then follow the instructions
5. Publish the package with `npm publish`
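End to end, a release amounts to the following commands (a sketch of the steps above; it assumes the version bump is already committed and that you are logged in to npm with publish rights on the @algolia scope):

```sh
pnpm build                 # compile TypeScript into dist/
npm init --scope=algolia   # set the Algolia org as scope, follow the instructions
npm publish                # publish the package to npm
```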
## 🔗 References
This package uses: