@jldb/web-to-md
v0.1.0
A CLI tool to crawl (for example) documentation websites and convert them to Markdown.
🕷️ Web-to-MD: Your Friendly Neighborhood Web Crawler and Markdown Converter 🕸️
Welcome to Web-to-MD, the CLI tool that turns websites into your personal Markdown library! 🚀
🌟 Why Web-to-MD?
Ever wished you could magically transform entire websites into neatly organized Markdown files? Well, wish no more! Web-to-MD is here to save the day (and your sanity)!
🎭 Features That'll Make You Go "Wow!"
- 🔍 Crawls websites like a pro detective
- 🧙‍♂️ Magically transforms HTML into beautiful Markdown
- 🏃‍♂️ Resumes interrupted crawls (because life happens!)
- 📚 Creates separate Markdown files or one big book of knowledge
- 🎨 Shows fancy progress bars (because who doesn't love those?)
- 🚦 Respects rate limits (we're polite crawlers here!)
- 🌳 Preserves directory structure (if you're into that sort of thing)
- 🔒 Handles authentication gracefully (no trespassing allowed!)
- 👥 Multi-worker support (because teamwork makes the dream work!)
- 🔄 Smart content change detection (no need to crawl what hasn't changed!)
🛠️ Installation
- Clone this repo (it won't bite, promise!)
- Run `npm install` (sit back and watch the magic happen)
- Run `npm run build` to compile the TypeScript code
🚀 Usage
Fire up Web-to-MD with this incantation:
```
npm start -- -u <url> -o <output_directory> [options]
```
🎛️ Options (Mix and Match to Your Heart's Content)
- `-u, --url <url>`: The URL of your web treasure trove (required)
- `-o, --output <output>`: Where to stash your Markdown gold (required)
- `-c, --combine`: Merge all pages into one massive scroll of knowledge
- `-e, --exclude <paths>`: Comma-separated list of paths to skip (shh, we won't tell)
- `-r, --rate <rate>`: Max pages per second (default: 5, for the speed demons)
- `-d, --depth <depth>`: How deep should we dig? (default: 3, watch out for dragons)
- `-m, --max-file-size <size>`: Max file size in MB for combined output (default: 2)
- `-n, --name <name>`: Name your combined file (get creative!)
- `-p, --preserve-structure`: Keep the directory structure (for the neat freaks)
- `-t, --timeout <timeout>`: Timeout in seconds for page navigation (default: 3.5)
- `-i, --initial-timeout <initialTimeout>`: Initial timeout for the first page (default: 60)
- `-re, --retries <retries>`: Number of retries for initial page load (default: 3)
- `-w, --workers <workers>`: Number of concurrent workers (default: 1, for the multitaskers)
🌟 Example (Because We All Need a Little Guidance)
```
npm start -- -u https://docs.example.com -o ./my_docs -c -d 5 -r 3 -n "ExampleDocs" -w 3
```
This will:
- Crawl https://docs.example.com
- Save Markdown files to ./my_docs
- Combine all pages into one file
- Crawl up to 5 levels deep
- Respect a rate limit of 3 pages per second
- Name the combined file "ExampleDocs"
- Use 3 concurrent workers for faster crawling
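The rate limit from `-r` can be pictured as simple request spacing. The sketch below is a hypothetical illustration of a pages-per-second limiter, not the tool's implementation:

```javascript
// Hypothetical sketch (not Web-to-MD's actual code): enforce a
// max-pages-per-second rate by giving each caller a scheduled slot
// and making it wait until that slot arrives.
function makeRateLimiter(maxPerSecond) {
  const intervalMs = 1000 / maxPerSecond;
  let nextSlot = 0;
  return function wait() {
    const now = Date.now();
    const delay = Math.max(0, nextSlot - now);
    nextSlot = Math.max(now, nextSlot) + intervalMs;
    return new Promise(resolve => setTimeout(resolve, delay));
  };
}
```

Each worker would call `await wait()` before fetching a page, so requests are spaced evenly rather than fired in bursts.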
🔧 Config Magic: Resuming and Customizing Your Crawls
Web-to-MD comes with a nifty config feature that lets you resume interrupted crawls and customize your crawling experience. Here's how it works:
📁 Config File
After a crawl (complete or interrupted), Web-to-MD saves a `config.json` file in your output directory. This file contains all the settings and state information from your last crawl.
🔄 Resuming a Crawl
To resume an interrupted crawl, simply run Web-to-MD with the same output directory. The tool will automatically detect the `config.json` file and pick up where it left off.
🎛️ Customizing Your Crawl
You can manually edit the `config.json` file to customize your next crawl. Here are the available options and their default values:
| Option | Description | Default Value |
|--------|-------------|---------------|
| `url` | Starting URL for the crawl | (Required) |
| `outputDir` | Output directory for Markdown files | (Required) |
| `excludePaths` | Paths to exclude from crawling | `[]` |
| `maxPagesPerSecond` | Maximum pages to crawl per second | `5` |
| `maxDepth` | Maximum depth to crawl | `3` |
| `maxFileSizeMB` | Maximum file size in MB for combined output | `2` |
| `combine` | Combine all pages into a single file | `false` |
| `name` | Name for the combined output file | `undefined` |
| `preserveStructure` | Preserve directory structure | `false` |
| `timeout` | Timeout in seconds for page navigation | `3.5` |
| `initialTimeout` | Initial timeout in seconds for the first page load | `60` |
| `retries` | Number of retries for initial page load | `3` |
| `numWorkers` | Number of concurrent workers | `1` |
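The `excludePaths` option suggests a simple prefix check on each discovered URL. Here's a minimal sketch of that idea (an assumption, not the tool's code):

```javascript
// Minimal sketch (assumption, not Web-to-MD's actual code): a URL is
// skipped when its pathname starts with any configured excludePaths prefix.
function isExcluded(url, excludePaths) {
  const { pathname } = new URL(url);
  return excludePaths.some(prefix => pathname.startsWith(prefix));
}
```

With `excludePaths: ["/blog", "/forum"]`, anything under those paths would be skipped while the rest of the site is crawled normally.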
You can modify these settings in the `config.json` file to customize your crawl. For example:
```json
{
  "settings": {
    "url": "https://docs.example.com",
    "outputDir": "./my_docs",
    "excludePaths": ["/blog", "/forum"],
    "maxPagesPerSecond": 5,
    "maxDepth": 4,
    "numWorkers": 3
  }
}
```
🌟 Example Workflow
1. Start an initial crawl:

   ```
   npm start -- -u https://docs.example.com -o ./my_docs -d 3 -w 2
   ```

2. If the crawl is interrupted, Web-to-MD will save the state in `./my_docs/config.json`.

3. To resume, simply run:

   ```
   npm start -- -o ./my_docs
   ```

4. To customize, edit `./my_docs/config.json` and change the crawl settings as needed (for example, bumping `maxDepth` or `numWorkers` as shown in the config example above).

5. Run the crawl again with the updated config:

   ```
   npm start -- -o ./my_docs
   ```
This workflow allows you to fine-tune your crawls and easily pick up where you left off!
🎭 Contributing
Got ideas? Found a bug? We're all ears! Open an issue or send a pull request. Let's make Web-to-MD even more awesome together! 🤝
📜 License
ISC (It's So Cool) License
🙏 Acknowledgements
A big thank you to all the open-source projects that made Web-to-MD possible. You rock! 🎸
Now go forth and crawl some docs! 🕷️📚