@trantoanvan/cloudflare-scraper
v1.0.0
cloudflare-scraper
A scraper class for Cloudflare Workers
Overview
The Scraper class is a tool designed for web scraping tasks in a Cloudflare Worker environment. It provides an intuitive API for extracting HTML, text, and attributes from web pages, with support for both single operations and chained multiple operations.
Basic Usage
Creating a Scraper Instance
You can create a Scraper instance using the static create method:
const scraper = await Scraper.create('https://example.com');
Alternatively, you can use the constructor and the fetch method:
const scraper = new Scraper();
await scraper.fetch('https://example.com');
Single Operations
Extracting HTML
To extract the HTML content of an element:
const html = await scraper.html('div.content');
Extracting Text
To extract the text content of an element:
const text = await scraper.text('p.description');
You can also specify options:
const text = await scraper.text('p.description', { spaced: true });
Extracting Attributes
To extract an attribute from an element:
const href = await scraper.attribute('a.link', 'href');
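The single operations above can be combined in one helper. The following is a minimal sketch, not part of the package: `scrapeSummary` is a hypothetical name, the selectors are placeholders, and the Scraper class is passed in as a parameter because this README does not show the import statement. It assumes only the `create`, `text`, and `attribute` calls documented above.

```javascript
// Hypothetical helper (not part of the package): gathers a few fields from one page.
// ScraperClass is injected because the package's export name is not shown in this README.
async function scrapeSummary(ScraperClass, url) {
  const scraper = await ScraperClass.create(url);
  // Selectors are placeholders; adjust them to the target page.
  const description = await scraper.text('p.description', { spaced: true });
  const href = await scraper.attribute('a.link', 'href');
  return { url, description, href };
}
```

Inside a Worker, the returned object could then be serialized with `JSON.stringify` and sent back in a `Response`.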
Chained Operations
You can chain multiple operations together:
const results = await scraper.chain()
  .html('div.content')
  .text('p.description')
  .attribute('a.link', 'href')
  .getResult();
Advanced Usage
Using querySelector
For compatibility with the original API, you can use the querySelector method:
const text = await scraper.querySelector('p.description').getText();
Error Handling
The Scraper class includes built-in error handling for common issues, including Cloudflare-specific errors. It's recommended to wrap your scraping operations in try-catch blocks:
try {
  const scraper = await Scraper.create('https://example.com');
  const content = await scraper.text('div.content');
} catch (error) {
  console.error('Scraping failed:', error.message);
}
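When several scraping calls need the same try-catch treatment, the pattern can be factored into a small wrapper. This is a sketch under assumptions: `tryScrape` is a hypothetical helper, not something the package provides, and it only assumes that scraping calls reject with an error on failure, as the example above suggests.

```javascript
// Hypothetical wrapper (not part of the package): runs a scraping task and
// reports failure as data instead of throwing.
async function tryScrape(task) {
  try {
    return { ok: true, value: await task() };
  } catch (error) {
    // The thrown value may not be an Error instance, so fall back to String().
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, error: message };
  }
}
```

Calling `tryScrape(() => scraper.text('div.content'))` then resolves to either `{ ok: true, value }` or `{ ok: false, error }`, which lets the caller decide how to respond without nesting try-catch blocks.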
Explanation of Key Features
HTMLRewriter Usage: The Scraper class utilizes Cloudflare's HTMLRewriter for efficient HTML parsing. This allows for streaming parsing of HTML, which is crucial for performance in a Worker environment.
Flexible API: The class supports both immediate execution of single operations and chained execution of multiple operations. This flexibility allows for various scraping patterns.
Error Handling: The class includes robust error handling, especially for Cloudflare-specific scenarios. It distinguishes between errors in the Worker itself and errors from the scraped site.
Text Extraction: The getText method includes advanced features like handling multiple selectors and optional spacing between text nodes.
Attribute Extraction: The getAttribute method efficiently extracts the first matching attribute value.
Chainable Operations: The chain method allows for defining multiple operations that are executed together, which can be more efficient for complex scraping tasks.
Stateless Design: The class is designed to be stateless between operations, which aligns well with the Cloudflare Worker environment.
Best Practices
- Use chained operations when scraping multiple elements from the same page to reduce the number of HTTP requests.
- Be mindful of the Worker CPU time limits (10 ms per request on the free plan) when scraping large or complex pages.
- Implement caching mechanisms in your Worker script to store frequently scraped data and reduce load on target websites.
- Respect robots.txt files and implement rate limiting to be a good web citizen.
- Use error handling to gracefully manage cases where the scraped website's structure changes.
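The caching recommendation above could be sketched as follows. This is an illustration only: `createTtlCache` and `getOrLoad` are hypothetical names, and the cache lives in memory, so it lasts only as long as the Worker instance that holds it. Persistent options such as Workers KV or the Cache API may be a better fit in practice.

```javascript
// Hypothetical in-memory cache with a time-to-live (not part of the package).
// `loader` is any async function that produces the value to cache, e.g. a scraping call.
// `now` is injectable so the expiry logic can be exercised without waiting.
function createTtlCache(ttlMs, now = () => Date.now()) {
  const entries = new Map();
  return async function getOrLoad(key, loader) {
    const hit = entries.get(key);
    // Serve the cached value while it has not expired.
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = await loader();
    entries.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}
```

With `const cached = createTtlCache(60000)`, a call such as `cached(url, () => scraper.text('div.content'))` would run the scraping function at most once per minute for a given key.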
Limitations
- The Scraper class is designed for use in a Cloudflare Worker environment and may not work in other contexts without modification.
- Complex JavaScript-rendered content may not be scrapable with this class, as it operates on the initial HTML response.
- Large pages may approach CPU time limits in the Worker environment. Consider implementing pagination or incremental scraping for such cases.
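One possible reading of "incremental scraping" is to split a list of work items across several Worker invocations, handling a small slice per request. The sketch below assumes that interpretation; `nextBatch` is a hypothetical helper, and how the cursor is stored between invocations is left open.

```javascript
// Hypothetical helper (not part of the package): returns the next slice of work
// and the cursor to pass to the following invocation, or null when finished.
function nextBatch(items, cursor, batchSize) {
  const start = Math.max(0, cursor);
  const batch = items.slice(start, start + batchSize);
  const end = start + batch.length;
  return { batch, nextCursor: end < items.length ? end : null };
}
```

Each invocation would scrape only the URLs in `batch` and hand `nextCursor` to the next run, keeping the work per request small.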
By following this guide, you should be able to effectively use the Scraper class for various web scraping tasks in your Cloudflare Worker projects.