enterprise-ai-recursive-web-scraper
AI-powered, recursive web scraper utilizing Gemini models, Puppeteer, and Playwright
✨ Features
- 🚀 High Performance:
  - Blazing-fast multi-threaded scraping with concurrent processing
  - Smart rate limiting to prevent API throttling and server overload
  - Automatic request queuing and retry mechanisms
- 🤖 AI-Powered: Intelligent content extraction using Groq LLMs
- 🌐 Multi-Browser: Support for Chromium, Firefox, and WebKit
- 📊 Smart Extraction:
  - Structured data extraction without LLMs using CSS selectors
  - Topic-based and semantic chunking strategies
  - Cosine similarity clustering for content deduplication
- 🎯 Advanced Capabilities:
  - Recursive domain crawling that respects domain boundaries
  - Intelligent rate limiting with a token bucket algorithm
  - Session management for complex multi-page flows
  - Custom JavaScript execution support
  - Enhanced screenshot capture with lazy-load detection
  - iframe content extraction
- 🔒 Enterprise Ready:
  - Proxy support with authentication
  - Custom headers and user-agent configuration
  - Comprehensive error handling and retry mechanisms
  - Flexible timeout and rate-limit management
  - Detailed logging and monitoring
🚀 Quick Start
To install the package, run:
```bash
npm install enterprise-ai-recursive-web-scraper
```
Using the CLI
The `enterprise-ai-recursive-web-scraper` package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.
Installation
Ensure that the package is installed globally to use the CLI:
```bash
npm install -g enterprise-ai-recursive-web-scraper
```
Running the CLI
Once installed, you can use the `web-scraper` command to start scraping. Here's a basic example:

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
```
CLI Options
- `-k, --api-key <key>`: (Required) Your Google Gemini API key
- `-u, --url <url>`: (Required) The URL of the website to scrape
- `-o, --output <directory>`: Output directory for scraped data (default: `scraping_output`)
- `-d, --depth <number>`: Maximum crawl depth (default: `3`)
- `-c, --concurrency <number>`: Concurrent scraping limit (default: `5`)
- `-r, --rate-limit <number>`: Requests per second (default: `5`)
- `-t, --timeout <number>`: Request timeout in milliseconds (default: `30000`)
- `-f, --format <type>`: Output format: `json` | `csv` | `markdown` (default: `json`)
- `-v, --verbose`: Enable verbose logging
- `--retry-attempts <number>`: Number of retry attempts (default: `3`)
- `--retry-delay <number>`: Delay between retries in ms (default: `1000`)
Example usage with rate limiting:
```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose
```
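In this example, `--rate-limit 2` caps throughput at 2 requests per second even though up to 10 pages may be processed concurrently, and `--retry-attempts 3` retries failed requests up to three times before giving up.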
🔧 Advanced Usage
Rate Limiting Configuration
Configure rate limiting to respect server limits and prevent throttling:
```typescript
import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  rateLimiter: new RateLimiter({
    maxTokens: 5,      // Maximum number of tokens in the bucket
    refillRate: 1,     // Tokens refilled per second
    retryAttempts: 3,  // Number of retry attempts
    retryDelay: 1000   // Delay between retries (ms)
  })
});
```
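With this configuration the scraper can burst up to 5 requests immediately (the bucket capacity), then sustain roughly 1 request per second as tokens are refilled; requests that fail are retried up to 3 times with a 1000 ms delay between attempts.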
Structured Data Extraction
To extract structured data using a JSON schema, use the `JsonExtractionStrategy`:
```typescript
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
  baseSelector: "article",
  fields: [
    { name: "title", selector: "h1" },
    { name: "content", selector: ".content" },
    { name: "date", selector: "time", attribute: "datetime" }
  ]
};

const scraper = new WebScraper({
  extractionStrategy: new JsonExtractionStrategy(schema)
});
```
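Each element matched by `baseSelector` becomes a record keyed by the field names above. As a rough illustration only (the exact output envelope depends on the chosen output format), a single extracted record would look something like:

```typescript
// Illustrative shape only: field names come from the schema above, values
// depend on the scraped page, and "date" holds the <time> element's
// datetime attribute rather than its text content.
const exampleRecord = {
  title: "Example article headline",
  content: "Body text taken from the .content element",
  date: "2024-05-01T12:00:00Z"
};
```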
Custom Browser Session
You can customize the browser session with specific configurations:
```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  browserConfig: {
    headless: false,                    // Run with a visible browser window
    proxy: "http://proxy.example.com",  // Route requests through a proxy
    userAgent: "Custom User Agent"      // Override the default user agent string
  }
});
```
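Once a `WebScraper` instance is configured, you start a crawl against a root URL. The snippet below is only a minimal sketch: the `scrapeWebsite` method name, the `maxDepth` option, and the result shape are assumptions based on this README rather than a confirmed API, so check the package's exported type definitions for the actual signatures.

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Minimal usage sketch. NOTE: `scrapeWebsite`, `maxDepth`, and the result
// shape are assumptions, not confirmed by this README; consult the package's
// type definitions for the real API.
async function main() {
  const scraper = new WebScraper({
    maxDepth: 3, // assumed option, mirroring the CLI's --depth default
  });

  const results = await scraper.scrapeWebsite("https://example.com");
  console.log(results);
}

main().catch(console.error);
```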
🤝 Contributors
📄 License
MIT © Mike Odnis
💙 Built with
create-typescript-app