crawltojson

A powerful and flexible web crawler that converts website content into structured JSON. Perfect for building training datasets, migrating content, web scraping, or any other task that requires structured web content extraction.

🎯 Intended Use

Just two commands to crawl a website and save the content in a structured JSON file.

npx crawltojson config
npx crawltojson crawl

🚀 Features

  • 🌐 Crawl any website with customizable patterns
  • 📦 Export to structured JSON
  • 🎯 CSS selector-based content extraction
  • 🔄 Automatic retry mechanism for failed requests
  • 🌲 Depth-limited crawling
  • ⏱️ Configurable timeouts
  • 🚫 URL pattern exclusion
  • 💾 Stream-based processing for memory efficiency
  • 🎨 Beautiful CLI interface with progress indicators

📋 Table of Contents

  • 🔧 Installation
  • 🚀 Quick Start
  • ⚙️ Configuration Options
  • 🎯 Advanced Usage
  • 📄 Output Format
  • 🎯 Use Cases
  • 🛠️ Development
  • ❗ Troubleshooting
  • 🤝 Contributing
  • 📜 License

🔧 Installation

Global Installation (Recommended)

npm install -g crawltojson

Using npx (No Installation)

npx crawltojson

Local Project Installation

npm install crawltojson

🚀 Quick Start

  1. Generate a configuration file:
     crawltojson config
  2. Start crawling:
     crawltojson crawl

⚙️ Configuration Options

Basic Options

  • url - Starting URL to crawl

    • Example: "https://example.com/blog"
    • Must be a valid HTTP/HTTPS URL
  • match - URL pattern to match (supports glob patterns)

    • Example: "https://example.com/blog/**"
    • Use ** for wildcard matching
    • Default: Same as starting URL with /** appended
  • selector - CSS selector to extract content

    • Example: "article.content"
    • Default: "body"
    • Supports any valid CSS selector
  • maxPages - Maximum number of pages to crawl

    • Default: 50
    • Range: 1 to unlimited
    • Helps control crawl scope

Advanced Options

  • maxRetries - Maximum number of retries for failed requests

    • Default: 3
    • Useful for handling temporary network issues
    • Exponential backoff between retries (see the sketch after this list)
  • maxLevels - Maximum depth level for crawling

    • Default: 3
    • Controls how deep the crawler goes from the starting URL
    • Level 0 is the starting URL
    • Helps prevent infinite crawling
  • timeout - Page load timeout in milliseconds

    • Default: 7000 (7 seconds)
    • Prevents hanging on slow-loading pages
    • Adjust based on site performance
  • excludePatterns - Array of URL patterns to ignore

    • Default patterns:
      [
        "**/tag/**",    // Ignore tag pages
        "**/tags/**",   // Ignore tag listings
        "**/#*",        // Ignore anchor links
        "**/search**",  // Ignore search pages
        "**.pdf",       // Ignore PDF files
        "**/archive/**" // Ignore archive pages
      ]
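
As an illustration of the retry behavior described under maxRetries above, here is a minimal sketch of retrying with exponential backoff. It is not crawltojson's internal code: loadPage is a hypothetical stand-in for whatever fetches a page, and the 1s/2s/4s delays are assumed.

// Illustrative only: generic retry with exponential backoff.
// `loadPage` is a hypothetical stand-in for the actual page fetch.
async function withRetries<T>(
  loadPage: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await loadPage();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Wait 1s, 2s, 4s, ... before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}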

Configuration File

The configuration is stored in crawltojson.config.json. Example:

{
  "url": "https://example.com/blog",
  "match": "https://example.com/blog/**",
  "selector": "article.content",
  "maxPages": 100,
  "maxRetries": 3,
  "maxLevels": 3,
  "timeout": 7000,
  "outputFile": "crawltojson.output.json",
  "excludePatterns": [
    "**/tag/**",
    "**/tags/**",
    "**/#*"
  ]
}

🎯 Advanced Usage

Selecting Content

The selector option supports any valid CSS selector. Examples:

# Single element
article.main-content

# Multiple elements
.post-content, .comments

# Nested elements
article .content p

# Complex selectors
main article:not(.ad) .content
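
Since the crawler drives a Playwright browser (see Troubleshooting below), selector-based extraction behaves the way it does in any Playwright script. The following standalone sketch shows the idea; it is not the package's internal implementation, and the URL and selector are only examples:

// Standalone Playwright sketch of selector-based text extraction.
// Not crawltojson's internal code; URL and selector are examples.
import { chromium } from "playwright";

async function extractText(url: string, selector: string): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { timeout: 7000 });
  // Collect the inner text of every element matching the CSS selector.
  const texts = await page.locator(selector).allInnerTexts();
  await browser.close();
  return texts;
}

extractText("https://example.com/blog", "article.content").then(console.log);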

URL Pattern Matching

The match pattern supports glob-style matching:

# Match exact path
https://example.com/blog/

# Match all blog posts
https://example.com/blog/**

# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
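
How crawltojson matches these patterns internally is not documented here; as an illustration, a glob matcher such as minimatch treats the match and excludePatterns values like this (assuming minimatch is installed):

// Illustration of glob-style URL matching with the minimatch library;
// crawltojson may use a different matcher internally.
import { minimatch } from "minimatch";

const match = "https://example.com/blog/**";
const excludePatterns = ["**/tag/**", "**/#*", "**.pdf"];

function shouldCrawl(url: string): boolean {
  return (
    minimatch(url, match) &&
    !excludePatterns.some((pattern) => minimatch(url, pattern))
  );
}

console.log(shouldCrawl("https://example.com/blog/2024/post")); // true
console.log(shouldCrawl("https://example.com/blog/tag/js"));    // false (excluded)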

Exclude Patterns

Customize excludePatterns for your needs:

{
  "excludePatterns": [
    "**/tag/**",        // Tag pages
    "**/category/**",   // Category pages
    "**/page/*",        // Pagination
    "**/wp-admin/**",   // Admin pages
    "**?preview=true",  // Preview pages
    "**.pdf",           // PDF files
    "**/feed/**",       // RSS feeds
    "**/print/**"       // Print pages
  ]
}

📄 Output Format

The crawler generates a JSON file with the following structure:

[
  {
    "url": "https://example.com/page1",
    "content": "Extracted content...",
    "timestamp": "2024-11-02T12:00:00.000Z",
    "level": 0
  },
  {
    "url": "https://example.com/page2",
    "content": "More content...",
    "timestamp": "2024-11-02T12:00:10.000Z",
    "level": 1
  }
]

Fields:

  • url: The normalized URL of the crawled page
  • content: Extracted text content based on selector
  • timestamp: ISO timestamp of when the page was crawled
  • level: Depth level from the starting URL (0-based)
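
For downstream processing it can help to treat each record as a typed object. A minimal Node.js sketch, assuming the default output file name from the configuration example above:

// Shape of one record, inferred from the example output above.
import { readFileSync } from "node:fs";

interface CrawledPage {
  url: string;       // normalized page URL
  content: string;   // text extracted by the CSS selector
  timestamp: string; // ISO timestamp of when the page was crawled
  level: number;     // depth from the starting URL (0-based)
}

const pages: CrawledPage[] = JSON.parse(
  readFileSync("crawltojson.output.json", "utf8")
);

// Example: keep only pages crawled at the starting level.
const topLevel = pages.filter((page) => page.level === 0);
console.log(`Loaded ${pages.length} pages, ${topLevel.length} at level 0`);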

🎯 Use Cases

  1. Content Migration

    • Crawl existing website content
    • Export to structured format
    • Import into new platform
  2. Training Data Collection

    • Gather content for ML models
    • Create datasets for NLP
    • Build content classifiers
  3. Content Archival

    • Archive website content
    • Create backups
    • Document snapshots
  4. SEO Analysis

    • Extract meta content
    • Analyze content structure
    • Track content changes
  5. Documentation Collection

    • Crawl documentation sites
    • Create offline copies
    • Generate searchable indexes

🛠️ Development

Local Setup

  1. Clone the repository:
     git clone https://github.com/yourusername/crawltojson.git
     cd crawltojson
  2. Install dependencies:
     npm install
  3. Build the project:
     npm run build
  4. Link for local testing:
     npm link

Development Commands

# Run build
npm run build

# Clean build
npm run clean

# Run tests
npm test

# Watch mode
npm run dev

Publishing

  1. Update the version:
     npm version patch|minor|major
  2. Build and publish:
     npm run build
     npm publish

❗ Troubleshooting

Common Issues

  1. Browser installation failed
     # Manual installation
     npx playwright install chromium
  2. Permission errors
     # Fix CLI permissions
     chmod +x ./dist/cli.js
  3. Build errors
     # Clean install
     rm -rf node_modules dist package-lock.json
     npm install
     npm run build

Debug Mode

Set the DEBUG environment variable:

DEBUG=crawltojson* crawltojson crawl
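
The DEBUG=crawltojson* namespace pattern follows the convention of the popular debug package; whether crawltojson uses that exact library is an assumption here. A hypothetical sketch of how such namespaced logging responds to the environment variable:

// Hypothetical sketch of namespaced logging with the "debug" package;
// it is an assumption that crawltojson uses this exact library.
import createDebug from "debug";

const log = createDebug("crawltojson:crawl");

// Prints only when DEBUG includes the namespace, e.g. DEBUG=crawltojson*
log("visiting %s at level %d", "https://example.com/blog", 0);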

🤝 Contributing

  1. Fork the repository
  2. Create feature branch
  3. Commit changes
  4. Push to branch
  5. Create Pull Request

Coding Standards

  • Use ESLint configuration
  • Add tests for new features
  • Update documentation
  • Follow semantic versioning

📜 License

MIT License - see LICENSE for details.

🙏 Acknowledgments


Made with ❤️ by Vivek M. Agarwal