enterprise-ai-recursive-web-scraper

v1.0.7

AI-powered, recursive web scraper utilizing Gemini models, Puppeteer, and Playwright.

✨ Features

  • 🚀 High Performance:
    • Blazing fast multi-threaded scraping with concurrent processing
    • Smart rate limiting to prevent API throttling and server overload
    • Automatic request queuing and retry mechanisms
  • 🤖 AI-Powered: Intelligent content extraction using Gemini models
  • 🌐 Multi-Browser: Support for Chromium, Firefox, and WebKit
  • 📊 Smart Extraction:
    • Structured data extraction without LLMs using CSS selectors
    • Topic-based and semantic chunking strategies
    • Cosine similarity clustering for content deduplication
  • 🎯 Advanced Capabilities:
    • Recursive crawling that respects domain boundaries
    • Intelligent rate limiting with token bucket algorithm
    • Session management for complex multi-page flows
    • Custom JavaScript execution support
    • Enhanced screenshot capture with lazy-load detection
    • iframe content extraction
  • 🔒 Enterprise Ready:
    • Proxy support with authentication
    • Custom headers and user-agent configuration
    • Comprehensive error handling and retry mechanisms
    • Flexible timeout and rate limit management
    • Detailed logging and monitoring

🚀 Quick Start

To install the package, run:

npm install enterprise-ai-recursive-web-scraper

Using the CLI

The enterprise-ai-recursive-web-scraper package includes a command-line interface (CLI) that you can use to perform web scraping tasks directly from the terminal.

Installation

Ensure that the package is installed globally to use the CLI:

npm install -g enterprise-ai-recursive-web-scraper

Running the CLI

Once installed, you can use the web-scraper command to start scraping. Here’s a basic example of how to use it:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output

CLI Options

  • -k, --api-key <key>: (Required) Your Google Gemini API key
  • -u, --url <url>: (Required) The URL of the website to scrape
  • -o, --output <directory>: Output directory for scraped data (default: scraping_output)
  • -d, --depth <number>: Maximum crawl depth (default: 3)
  • -c, --concurrency <number>: Concurrent scraping limit (default: 5)
  • -r, --rate-limit <number>: Requests per second (default: 5)
  • -t, --timeout <number>: Request timeout in milliseconds (default: 30000)
  • -f, --format <type>: Output format: json|csv|markdown (default: json)
  • -v, --verbose: Enable verbose logging
  • --retry-attempts <number>: Number of retry attempts (default: 3)
  • --retry-delay <number>: Delay between retries in ms (default: 1000)

Example combining rate limiting with other options:

web-scraper --api-key YOUR_API_KEY --url https://example.com --output ./output \
  --depth 5 --concurrency 10 --rate-limit 2 --retry-attempts 3 --format csv --verbose

🔧 Advanced Usage

Rate Limiting Configuration

Configure rate limiting to respect server limits and prevent throttling:

import { WebScraper, RateLimiter } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    rateLimiter: new RateLimiter({
        maxTokens: 5,      // Maximum number of tokens
        refillRate: 1,     // Tokens refilled per second
        retryAttempts: 3,  // Number of retry attempts
        retryDelay: 1000   // Delay between retries (ms)
    })
});
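
The maxTokens and refillRate options follow a standard token bucket model: each request consumes one token, and tokens are replenished at the refill rate up to the bucket capacity. The sketch below only illustrates that model; it is not the library's internal RateLimiter implementation:

// Illustration of the token bucket model behind maxTokens/refillRate.
// Simplified sketch for explanation, not the package's internals.
class TokenBucket {
    private tokens: number;
    private lastRefill = Date.now();

    constructor(private maxTokens: number, private refillRate: number) {
        this.tokens = maxTokens;
    }

    tryConsume(): boolean {
        const now = Date.now();
        // Refill based on elapsed time, capped at the bucket capacity
        this.tokens = Math.min(
            this.maxTokens,
            this.tokens + ((now - this.lastRefill) / 1000) * this.refillRate
        );
        this.lastRefill = now;
        if (this.tokens >= 1) {
            this.tokens -= 1; // one token per request
            return true;
        }
        return false; // caller should wait and retry later
    }
}

With maxTokens: 5 and refillRate: 1, bursts of up to five requests are allowed, after which throughput settles at roughly one request per second.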

Structured Data Extraction

To extract structured data using a JSON schema, you can use the JsonExtractionStrategy:

import { WebScraper, JsonExtractionStrategy } from "enterprise-ai-recursive-web-scraper";

const schema = {
    baseSelector: "article",
    fields: [
        { name: "title", selector: "h1" },
        { name: "content", selector: ".content" },
        { name: "date", selector: "time", attribute: "datetime" }
    ]
};

const scraper = new WebScraper({
    extractionStrategy: new JsonExtractionStrategy(schema)
});
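
With this schema, each element matching baseSelector yields one record whose keys are the field names. A record might look roughly like the following; the shape is shown for illustration only, and the exact wrapper around the results may differ, so check the package's typings:

// Illustrative shape of one extracted record (field names come from the schema above)
const record = {
    title: "Example article heading",    // text content of the article's h1
    content: "Body text from .content",  // text content of the .content element
    date: "2024-01-15T09:30:00Z"         // datetime attribute of the <time> element
};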

Custom Browser Session

You can customize the browser session with specific configurations:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

const scraper = new WebScraper({
    browserConfig: {
        headless: false,                   // launch with a visible browser window
        proxy: "http://proxy.example.com", // route requests through a proxy
        userAgent: "Custom User Agent"     // override the default user agent string
    }
});
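
The constructor examples above only configure the scraper; to run a crawl you call its scraping method with the target URL. The method name and result shape below are assumptions based on the CLI behavior, so verify them against the package's exported TypeScript types before relying on them:

import { WebScraper } from "enterprise-ai-recursive-web-scraper";

// Minimal sketch: the scrapeWebsite method name is an assumption; check the
// package's typings for the exact programmatic API.
async function run() {
    const scraper = new WebScraper({
        // reuse any of the configuration shown above
        // (rateLimiter, extractionStrategy, browserConfig, ...)
    });
    const results = await scraper.scrapeWebsite("https://example.com");
    console.log(results);
}

run().catch(console.error);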

🤝 Contributors

📄 License

MIT © Mike Odnis

💙 Built with create-typescript-app