
@jldb/web-to-md

v0.1.0


A CLI tool to crawl websites (documentation sites, for example) and convert them to Markdown.


🕷️ Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter 🕸️

Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀

🌟 Why Web-to-MD?

Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!

🎭 Features That'll Make You Go "Wow!"

  • 🔍 Crawls websites like a pro detective
  • 🧙‍♂️ Magically transforms HTML into beautiful Markdown
  • 🏃‍♂️ Resumes interrupted crawls (because life happens!)
  • 📚 Creates separate Markdown files or one big book of knowledge
  • 🎨 Shows fancy progress bars (because who doesn't love those?)
  • 🚦 Respects rate limits (we're polite crawlers here!)
  • 🌳 Preserves directory structure (if you're into that sort of thing)
  • 🔒 Handles authentication gracefully (no trespassing allowed!)
  • 👥 Multi-worker support (because teamwork makes the dream work!)
  • 🔄 Smart content change detection (no need to crawl what hasn't changed!)

🛠️ Installation

  1. Clone this repo (it won't bite, promise!)
  2. Run npm install (sit back and watch the magic happen)
  3. Run npm run build to compile the TypeScript code

🚀 Usage

Fire up Web-to-MD with this incantation:

npm start -- -u <url> -o <output_directory> [options]

🎛️ Options (Mix and Match to Your Heart's Content)

  • -u, --url <url>: The URL of your web treasure trove (required)
  • -o, --output <output>: Where to stash your Markdown gold (required)
  • -c, --combine: Merge all pages into one massive scroll of knowledge
  • -e, --exclude <paths>: Comma-separated list of paths to skip (shh, we won't tell)
  • -r, --rate <rate>: Max pages per second (default: 5, for the speed demons)
  • -d, --depth <depth>: How deep should we dig? (default: 3, watch out for dragons)
  • -m, --max-file-size <size>: Max file size in MB for combined output (default: 2)
  • -n, --name <name>: Name your combined file (get creative!)
  • -p, --preserve-structure: Keep the directory structure (for the neat freaks)
  • -t, --timeout <timeout>: Timeout in seconds for page navigation (default: 3.5)
  • -i, --initial-timeout <initialTimeout>: Initial timeout for the first page (default: 60)
  • -re, --retries <retries>: Number of retries for initial page load (default: 3)
  • -w, --workers <workers>: Number of concurrent workers (default: 1, for the multitaskers)

🌟 Example (Because We All Need a Little Guidance)

npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3

This will:

  1. Crawl https://docs.example.com
  2. Save Markdown files to ./my_docs
  3. Combine all pages into one file
  4. Crawl up to 5 levels deep
  5. Respect a rate limit of 3 pages per second
  6. Name the combined file "ExampleDocs"
  7. Use 3 concurrent workers for faster crawling

🔧 Config Magic: Resuming and Customizing Your Crawls

Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:

📁 Config File

After a crawl (complete or interrupted), Web-to-MD saves a config.json file in your output directory. This file contains all the settings and state information from your last crawl.
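Concretely, a saved config.json might look something like this. The `settings` keys mirror the documented options; the `state` section and its field names are illustrative assumptions only, since this README doesn't show the exact on-disk shape:

```json
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "maxDepth": 3,
    "numWorkers": 2
  },
  "state": {
    "visitedUrls": ["https://docs.example.com/"],
    "pendingUrls": ["https://docs.example.com/guide"]
  }
}
```

Whatever the exact field names, the idea is the same: the settings half records how the crawl was configured, and the state half records how far it got, which is what makes resuming possible.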

🔄 Resuming a Crawl

To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the config.json file and pick up where it left off.

🎛️ Customizing Your Crawl

You can manually edit the config.json file to customize your next crawl. Here are the available options and their default values:

| Option | Description | Default Value |
|--------|-------------|---------------|
| url | Starting URL for the crawl | (Required) |
| outputDir | Output directory for Markdown files | (Required) |
| excludePaths | Paths to exclude from crawling | [] |
| maxPagesPerSecond | Maximum pages to crawl per second | 5 |
| maxDepth | Maximum depth to crawl | 3 |
| maxFileSizeMB | Maximum file size in MB for combined output | 2 |
| combine | Combine all pages into a single file | false |
| name | Name for the combined output file | undefined |
| preserveStructure | Preserve directory structure | false |
| timeout | Timeout in seconds for page navigation | 3.5 |
| initialTimeout | Initial timeout in seconds for the first page load | 60 |
| retries | Number of retries for initial page load | 3 |
| numWorkers | Number of concurrent workers | 1 |

You can modify these settings in the config.json file to customize your crawl. For example:

{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}

🌟 Example Workflow

  1. Start an initial crawl:

    npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
  2. If the crawl is interrupted, Web-to-MD will save the state in ./my_docs/config.json.

  3. To resume, simply run:

    npm start -- -o ./my_docs
  4. To customize, edit ./my_docs/config.json to change the crawl settings as needed. For example:

{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
  5. Run the crawl again with the updated config:

    npm start -- -o ./my_docs

This workflow allows you to fine-tune your crawls and easily pick up where you left off!

🎭 Contributing

Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝

📜 License

ISC (It's So Cool) License

🙏 Acknowledgements

A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸

Now go forth and crawl some docs! 🕷️📚