enterprise-ai-recursive-web-scraper
AI-powered, recursive web scraper utilizing Gemini models, Puppeteer, and Playwright
✨ Features
- 🚀 High Performance:
  - Blazing-fast multi-threaded scraping with concurrent processing
  - Smart rate limiting to prevent API throttling and server overload
  - Automatic request queuing and retry mechanisms
- 🤖 AI-Powered: Intelligent content extraction using Groq LLMs
- 🌐 Multi-Browser: Support for Chromium, Firefox, and WebKit
- 📊 Smart Extraction:
  - Structured data extraction without LLMs using CSS selectors
  - Topic-based and semantic chunking strategies
  - Cosine similarity clustering for content deduplication
- 🎯 Advanced Capabilities:
  - Recursive domain crawling that respects domain boundaries
  - Intelligent rate limiting with a token bucket algorithm
  - Session management for complex multi-page flows
  - Custom JavaScript execution support
  - Enhanced screenshot capture with lazy-load detection
  - iframe content extraction
- 🔒 Enterprise Ready:
  - Proxy support with authentication
  - Custom headers and user-agent configuration
  - Comprehensive error handling and retry mechanisms
  - Flexible timeout and rate-limit management
  - Detailed logging and monitoring
🚀 Quick Start
To install the package, run:
```bash
npm install enterprise-ai-recursive-web-scraper
```
Using the CLI
The `enterprise-ai-recursive-web-scraper` package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.
Installation
Ensure that the package is installed globally to use the CLI:
```bash
npm install -g enterprise-ai-recursive-web-scraper
```
Running the CLI
Once installed, you can use the `web-scraper` command to start scraping. Here's a basic example:

```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output
```
CLI Options
- `-k, --api-key <key>`: (Required) Your Google Gemini API key
- `-u, --url <url>`: (Required) The URL of the website to scrape
- `-o, --output <directory>`: Output directory for scraped data (default: `scraping_output`)
- `-d, --depth <number>`: Maximum crawl depth (default: `3`)
- `-c, --concurrency <number>`: Concurrent scraping limit (default: `5`)
- `-r, --rate-limit <number>`: Requests per second (default: `5`)
- `-t, --timeout <number>`: Request timeout in milliseconds (default: `30000`)
- `-f, --format <type>`: Output format: `json` | `csv` | `markdown` (default: `json`)
- `-v, --verbose`: Enable verbose logging
- `--retry-attempts <number>`: Number of retry attempts (default: `3`)
- `--retry-delay <number>`: Delay between retries in ms (default: `1000`)
Example usage with rate limiting:
```bash
web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose
```
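In this example, `--rate-limit 2` caps throughput at 2 requests per second even though up to 10 pages may be processed concurrently, and `--retry-attempts 3` retries failed requests up to three times before giving up.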
🔧 Advanced Usage
Rate Limiting Configuration
Configure rate limiting to respect server limits and prevent throttling:
```typescript
import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  rateLimiter: new RateLimiter({
    maxTokens: 5,      // Maximum number of tokens in the bucket
    refillRate: 1,     // Tokens refilled per second
    retryAttempts: 3,  // Number of retry attempts
    retryDelay: 1000   // Delay between retries (ms)
  })
});
```
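With this configuration the scraper can burst up to 5 requests immediately (the bucket capacity), then sustain roughly 1 request per second as tokens are refilled; requests that fail are retried up to 3 times with a 1000 ms delay between attempts.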
Structured Data Extraction
To extract structured data using a JSON schema, use the `JsonExtractionStrategy`:
```typescript
import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
  baseSelector: "article",
  fields: [
    { name: "title", selector: "h1" },
    { name: "content", selector: ".content" },
    { name: "date", selector: "time", attribute: "datetime" }
  ]
};

const scraper = new WebScraper({
  extractionStrategy: new JsonExtractionStrategy(schema)
});
```
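Each element matched by `baseSelector` becomes a record keyed by the field names above. As a rough illustration only (the exact output envelope depends on the chosen output format), a single extracted record would look something like:

```typescript
// Illustrative shape only: field names come from the schema above, values
// depend on the scraped page, and "date" holds the <time> element's
// datetime attribute rather than its text content.
const exampleRecord = {
  title: "Example article headline",
  content: "Body text taken from the .content element",
  date: "2024-05-01T12:00:00Z"
};
```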
Custom Browser Session
You can customize the browser session with specific configurations:
```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
  browserConfig: {
    headless: false,                    // Run with a visible browser window
    proxy: "http://proxy.example.com",  // Route requests through a proxy
    userAgent: "Custom User Agent"      // Override the default user agent string
  }
});
```
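Once a `WebScraper` instance is configured, you start a crawl against a root URL. The snippet below is only a minimal sketch: the `scrapeWebsite` method name, the `maxDepth` option, and the result shape are assumptions based on this README rather than a confirmed API, so check the package's exported type definitions for the actual signatures.

```typescript
import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Minimal usage sketch. NOTE: `scrapeWebsite`, `maxDepth`, and the result
// shape are assumptions, not confirmed by this README; consult the package's
// type definitions for the real API.
async function main() {
  const scraper = new WebScraper({
    maxDepth: 3, // assumed option, mirroring the CLI's --depth default
  });

  const results = await scraper.scrapeWebsite("https://example.com");
  console.log(results);
}

main().catch(console.error);
```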
🤝 Contributors
📄 License
MIT © Mike Odnis
💙 Built with
create-typescript-app