crawltojson
A powerful and flexible web crawler that converts website content into structured JSON. Perfect for building training datasets, migrating content, web scraping, or any other task that requires structured extraction of web content.
🎯 Intended Use
Crawl a website and save its content to a structured JSON file with just two commands:
npx crawltojson config
npx crawltojson crawl
🚀 Features
- 🌐 Crawl any website with customizable patterns
- 📦 Export to structured JSON
- 🎯 CSS selector-based content extraction
- 🔄 Automatic retry mechanism for failed requests
- 🌲 Depth-limited crawling
- ⏱️ Configurable timeouts
- 🚫 URL pattern exclusion
- 💾 Stream-based processing for memory efficiency
- 🎨 Beautiful CLI interface with progress indicators
📋 Table of Contents
- Installation
- Quick Start
- Configuration Options
- Advanced Usage
- Output Format
- Use Cases
- Development
- Troubleshooting
- Contributing
- License
🔧 Installation
Global Installation (Recommended)
npm install -g crawltojson
Using npx (No Installation)
npx crawltojson
Local Project Installation
npm install crawltojson
🚀 Quick Start
- Generate configuration file:
crawltojson config
- Start crawling:
crawltojson crawl
⚙️ Configuration Options
Basic Options
url
- Starting URL to crawl
- Example: "https://example.com/blog"
- Must be a valid HTTP/HTTPS URL
match
- URL pattern to match (supports glob patterns)
- Example: "https://example.com/blog/**"
- Use ** for wildcard matching
- Default: Same as starting URL with /** appended
selector
- CSS selector to extract content
- Example: "article.content"
- Default: "body"
- Supports any valid CSS selector
maxPages
- Maximum number of pages to crawl
- Default: 50
- Range: 1 to unlimited
- Helps control crawl scope
Advanced Options
maxRetries
- Maximum number of retries for failed requests
- Default: 3
- Useful for handling temporary network issues
- Exponential backoff between retries (see the sketch at the end of this section)
maxLevels
- Maximum depth level for crawling
- Default: 3
- Controls how deep the crawler goes from the starting URL
- Level 0 is the starting URL
- Helps prevent infinite crawling
timeout
- Page load timeout in milliseconds
- Default: 7000 (7 seconds)
- Prevents hanging on slow-loading pages
- Adjust based on site performance
excludePatterns
- Array of URL patterns to ignore
- Default patterns:
[
  "**/tag/**",      // Ignore tag pages
  "**/tags/**",     // Ignore tag listings
  "**/#*",          // Ignore anchor links
  "**/search**",    // Ignore search pages
  "**.pdf",         // Ignore PDF files
  "**/archive/**"   // Ignore archive pages
]
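The crawler's internal retry logic is not part of its public API, but a minimal sketch of a retry loop with exponential backoff, the mechanism maxRetries describes, might look like the following. The function name and the 500 ms base delay are illustrative assumptions, not values taken from the package:

// Illustrative only; not the package's internal API. The 500 ms base
// delay is an assumption, not a documented value.
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Wait 500 ms, then 1 s, then 2 s, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Usage: await withRetries(() => page.goto(url, { timeout: 7000 }), 3)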
Configuration File
The configuration is stored in crawltojson.config.json. Example:
{
"url": "https://example.com/blog",
"match": "https://example.com/blog/**",
"selector": "article.content",
"maxPages": 100,
"maxRetries": 3,
"maxLevels": 3,
"timeout": 7000,
"outputFile": "crawltojson.output.json",
"excludePatterns": [
"**/tag/**",
"**/tags/**",
"**/#*"
]
}
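Because the file is plain JSON, it can also be read programmatically. A minimal sketch using Node's built-in fs module; the CrawlConfig interface below is inferred from the example above and is not a type exported by the package:

import { readFileSync } from "node:fs";

// Shape inferred from the example config above; not an official type.
interface CrawlConfig {
  url: string;
  match: string;
  selector: string;
  maxPages: number;
  maxRetries: number;
  maxLevels: number;
  timeout: number;
  outputFile: string;
  excludePatterns: string[];
}

const config: CrawlConfig = JSON.parse(
  readFileSync("crawltojson.config.json", "utf8")
);
console.log(`Crawling ${config.url} (up to ${config.maxPages} pages)`);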
🎯 Advanced Usage
Selecting Content
The selector option supports any valid CSS selector. Examples:
# Single element
article.main-content
# Multiple elements
.post-content, .comments
# Nested elements
article .content p
# Complex selectors
main article:not(.ad) .content
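The crawler is built on Playwright (see Acknowledgments), so conceptually the selector drives a text extraction step like the one sketched below. This is an illustration of the general approach, not the package's actual implementation:

import { chromium } from "playwright";

// Illustrative sketch of selector-based extraction; not the package's code.
async function extractText(url: string, selector: string, timeout = 7000) {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { timeout });
    // Collect the inner text of every element matching the selector.
    const parts = await page.locator(selector).allInnerTexts();
    return parts.join("\n");
  } finally {
    await browser.close();
  }
}

// Example: await extractText("https://example.com/blog/post-1", "article.content")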
URL Pattern Matching
The match pattern supports glob-style matching:
# Match exact path
https://example.com/blog/
# Match all blog posts
https://example.com/blog/**
# Match specific sections
https://example.com/blog/2024/**
https://example.com/blog/*/technical/**
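The crawler's own matcher is not reproduced here, but the following toy converter shows roughly how ** and * style globs map onto URL matching; a real glob library supports far more syntax than this sketch:

// Toy matcher for illustration only; it supports just "*" and "**",
// unlike the full glob syntax the crawler accepts.
function globToRegExp(pattern: string): RegExp {
  const source = pattern
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // protect "**" before handling "*"
    .replace(/\*/g, "[^/]*")               // "*" stays within one path segment
    .replace(/\u0000/g, ".*");             // "**" may cross path segments
  return new RegExp(`^${source}$`);
}

globToRegExp("https://example.com/blog/**")
  .test("https://example.com/blog/2024/hello-world"); // true
globToRegExp("https://example.com/blog/*")
  .test("https://example.com/blog/2024/hello-world"); // false, crosses a segment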
Exclude Patterns
Customize excludePatterns for your needs:
{
"excludePatterns": [
"**/tag/**", // Tag pages
"**/category/**", // Category pages
"**/page/*", // Pagination
"**/wp-admin/**", // Admin pages
"**?preview=true", // Preview pages
"**.pdf", // PDF files
"**/feed/**", // RSS feeds
"**/print/**" // Print pages
]
}
📄 Output Format
The crawler generates a JSON file with the following structure:
[
{
"url": "https://example.com/page1",
"content": "Extracted content...",
"timestamp": "2024-11-02T12:00:00.000Z",
"level": 0
},
{
"url": "https://example.com/page2",
"content": "More content...",
"timestamp": "2024-11-02T12:00:10.000Z",
"level": 1
}
]
Fields:
- url: The normalized URL of the crawled page
- content: Extracted text content based on the selector
- timestamp: ISO timestamp of when the page was crawled
- level: Depth level from the starting URL (0-based)
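In TypeScript terms, each entry can be described with an interface like the one below; this type is inferred from the sample output and is not exported by the package:

import { readFileSync } from "node:fs";

// Inferred from the sample output above; not an official export.
interface CrawledPage {
  url: string;       // normalized URL of the crawled page
  content: string;   // text extracted with the configured selector
  timestamp: string; // ISO 8601 time the page was crawled
  level: number;     // depth from the starting URL, 0-based
}

// Example: load the output file and keep only the starting-level pages.
const pages: CrawledPage[] = JSON.parse(
  readFileSync("crawltojson.output.json", "utf8")
);
const topLevel = pages.filter((p) => p.level === 0);
console.log(`${topLevel.length} of ${pages.length} pages are at level 0`);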
🎯 Use Cases
Content Migration
- Crawl existing website content
- Export to structured format
- Import into new platform
Training Data Collection
- Gather content for ML models
- Create datasets for NLP
- Build content classifiers
Content Archival
- Archive website content
- Create backups
- Document snapshots
SEO Analysis
- Extract meta content
- Analyze content structure
- Track content changes
Documentation Collection
- Crawl documentation sites
- Create offline copies
- Generate searchable indexes
🛠️ Development
Local Setup
- Clone the repository:
git clone https://github.com/yourusername/crawltojson.git
cd crawltojson
- Install dependencies:
npm install
- Build the project:
npm run build
- Link for local testing:
npm link
Development Commands
# Run build
npm run build
# Clean build
npm run clean
# Run tests
npm test
# Watch mode
npm run dev
Publishing
- Update version:
npm version patch|minor|major
- Build and publish:
npm run build
npm publish
❗ Troubleshooting
Common Issues
- Browser Installation Failed
# Manual installation
npx playwright install chromium
- Permission Errors
# Fix CLI permissions
chmod +x ./dist/cli.js
- Build Errors
# Clean install
rm -rf node_modules dist package-lock.json
npm install
npm run build
Debug Mode
Set DEBUG environment variable:
DEBUG=crawltojson* crawltojson crawl
🤝 Contributing
- Fork the repository
- Create feature branch
- Commit changes
- Push to branch
- Create Pull Request
Coding Standards
- Use ESLint configuration
- Add tests for new features
- Update documentation
- Follow semantic versioning
📜 License
MIT License - see LICENSE for details.
🙏 Acknowledgments
- Built with Playwright
- CLI powered by Commander.js
- Inspired by web scraping communities
Made with ❤️ by Vivek M. Agarwal