puppeteer-ecommerce-scraper
v0.1.4
Client-side rendering approach for extracting product data from ecommerce websites with pagination using undetectable puppeteer-cluster
Puppeteer Ecommerce Scraper
Demo: https://youtu.be/KOoI-CLNHxU
This is a flexible web scraper for extracting product data from various ecommerce websites:
- It uses Puppeteer's high-level API to control Chrome or Chromium, which makes it capable of extracting data from websites that load content dynamically with JavaScript.
- It is also designed to handle pagination and bot detection, and it uses puppeteer-cluster for efficient, parallel scraping.
npm i puppeteer-ecommerce-scraper
Examples
The examples folder contains my example scripts for different e-commerce websites. You can use them as a starting point for your own scraping tasks.
For example, the tiki1.js script configures the scraper to navigate through the Android and iPhone product pages of Tiki (a Vietnamese ecommerce website) and extract each product's title, price, and image URL, using a consistent user profile and a proxy server.
This script uses only 2 functions: clusterWrapper to wrap the scraping process, and scrapeWithPagination, an end-to-end function that scrapes, paginates, and saves the product data from the website automatically. If you want a more customized scraping process, you can use the other functions provided in the different modules. I also provide scripts with the suffix 2 (such as tiki2.js) to demonstrate how to use these functions to scrape the same website.
Functions Provided
The functions and utilities of the scraper are divided into 3 modules: clusterWrapper, scraper, and helpers. They are exported in src/index.js in the following order:
- clusterWrapper
- scraper: { scrapeWithPagination, autoScroll, saveProduct, navigatePage }
- helpers: { isFileExists, createFile, getWebName, url2FileName, getChromeProfilePath, getChromeExecutablePath }
clusterWrapper
async function clusterWrapper({
func, // Function to be executed on each queue entry
queueEntries, // Array or Object of queue entries. These can be the keywords you want to scrape.
proxyEndpoint = '', // Must be in the form of http://username:password@host:port
monitor = false, // Whether to monitor the progress of the scraping process
useProfile = false, // Whether to use a consistent user profile
otherConfigs = {}, // Other configurations for Puppeteer
})
This function uses puppeteer-cluster to launch multiple instances of the browser at the same time (maximum 5) and sets up a web scraping task for each queue entry, with a default timeout of 10 seconds, before closing the cluster. Here, the scraper uses several techniques to avoid detection:
- puppeteer-extra-plugin-stealth: This plugin applies evasion techniques to make the scraping activity appear more like normal browsing or a real user and less like a bot.
- useProfile: By using a consistent user profile (enabled by the useProfile option), the scraper can appear as a returning user rather than a new session each time. This option can also help with CAPTCHAs, because a CAPTCHA solved once may not need to be solved again next time.
- CAPTCHAs: If the website requires solving CAPTCHAs, the script can wait until you solve them manually and then continue the scraping process.
- proxyEndpoint: The scraper can route its requests through different proxy servers to disguise its IP address and avoid IP-based blocking.
You can run the test.js script to see the bot detection result when using this wrapper. Each task loads a page, gets the IP information, and then calls the func function with the Puppeteer page and the queue data from queueEntries.
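The proxyEndpoint format above carries the credentials inside the URL, whereas Chrome's --proxy-server flag typically takes only the server address, so the credentials are usually supplied separately (in Puppeteer, through page.authenticate). Below is a minimal sketch of that split, assuming the endpoint is a well-formed URL. splitProxyEndpoint is a hypothetical helper, not part of this package, and clusterWrapper may handle the endpoint differently internally.

```javascript
// Hypothetical helper (not part of puppeteer-ecommerce-scraper):
// splits "http://username:password@host:port" into the parts a browser launch needs.
function splitProxyEndpoint(proxyEndpoint) {
  if (!proxyEndpoint) return null; // '' means "no proxy", matching the default above

  const parsed = new URL(proxyEndpoint); // throws a TypeError on malformed input
  return {
    // Value for Chrome's --proxy-server flag (server address only, no credentials)
    server: `${parsed.protocol}//${parsed.host}`,
    // Credentials, e.g. for page.authenticate({ username, password })
    username: decodeURIComponent(parsed.username),
    password: decodeURIComponent(parsed.password),
  };
}

const proxy = splitProxyEndpoint('http://user:p%40ss@127.0.0.1:8080');
console.log(proxy.server); // http://127.0.0.1:8080
```

Special characters in the username or password need to be percent-encoded in the endpoint (as with `%40` for `@` here); the sketch decodes them again before use.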
scraper.scrapeWithPagination
async function scrapeWithPagination({
page, // Puppeteer page object, which represents a single tab in Chrome
extractFunc, // Function to extract product info from product DOM
scrapingConfig = { // Configuration for scraping process
url: '', // URL of the webpage to scrape
productSelector: '', // CSS selector for product elements
filePath: '', // File path to save the scraped data. If not provided, the function will generate one based on the URL
fileHeader: '' // Header for the file
},
paginationConfig = { // Configuration for handling pagination
nextPageSelector: '', // CSS selector for the "next page" button
disabledSelector: '', // CSS selector for the disabled state of the "next page" button (to detect the end of pagination)
sleep: 1000, // Delay the execution to allow for page loading or other asynchronous operations to complete
maxPages: 0 // Maximum number of pages to scrape (0 for unlimited)
},
scrollConfig = { // Configuration for auto-scrolling
scrollDelay: NaN, // Delay between scrolls
scrollStep: NaN, // The amount (size) to scroll each time
numOfScroll: 1, // Number of scrolls to perform
direction: 'both' // Scroll direction ('up', 'down', 'both')
},
})
👉 return { products, totalPages, scrapingConfig, paginationConfig, scrollConfig }
The scraper can navigate through multiple pages of results using this function:
- It begins by navigating to the specified url and uses the nextPageSelector and disabledSelector from the paginationConfig to identify the "next page" button on the webpage and click it to load the next set of results.
- This process is repeated until all pages have been scraped (the "next page" button matches disabledSelector) or the maximum limit (maxPages) has been reached.
- Inside the loop, the function waits for the product elements to be visible on the page, then calls autoScroll on the page according to the scrollConfig setup. This ensures that all product elements are fully rendered and can be scraped.
- Next, the function scrapes the product information using the provided extractFunc and then calls saveProduct to write it to the file.
- Finally, the function attempts to navigate to the next page using the navigatePage function and the paginationConfig parameters.
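The loop described above can be sketched as follows. This is a simplified illustration rather than the package's source: paginateSketch and its callbacks (scrapePage, goToNextPage) are hypothetical stand-ins for the scrape-and-save step and for navigatePage, and only maxPages (0 for unlimited) mirrors a documented option.

```javascript
// Simplified sketch of the pagination loop (hypothetical, not the package's code).
// scrapePage():   resolves to an array of products found on the current page.
// goToNextPage(): resolves to true if a next page was opened, false on the last page.
async function paginateSketch({ scrapePage, goToNextPage, maxPages = 0 }) {
  const products = [];
  let totalPages = 0;

  while (true) {
    products.push(...(await scrapePage()));
    totalPages += 1;

    // Stop when the page limit is reached (0 means unlimited)...
    if (maxPages > 0 && totalPages >= maxPages) break;
    // ...or when there is no further page to navigate to.
    if (!(await goToNextPage())) break;
  }

  return { products, totalPages };
}

// Usage with an in-memory stand-in for a three-page listing.
const fakePages = [['a', 'b'], ['c'], ['d']];
let current = 0;

paginateSketch({
  scrapePage: async () => fakePages[current],
  goToNextPage: async () => {
    if (current >= fakePages.length - 1) return false;
    current += 1;
    return true;
  },
  maxPages: 2,
}).then(({ products, totalPages }) => console.log(totalPages, products));
// logs: 2 [ 'a', 'b', 'c' ]
```

Checking maxPages before navigating avoids loading a page that would never be scraped.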
scraper.autoScroll
function autoScroll(
delay, // Delay between scrolls
scrollStep, // The amount (size) to scroll each time
direction // Scroll direction ('up', 'down', 'both')
)
This function automatically scrolls a Puppeteer page object in the specified direction (up, down, or both) by the specified scrollStep amount. It continues to scroll until the end of the page is reached, waiting for the specified delay between each scroll.
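To make the scrolling behavior concrete, here is a minimal sketch of the 'down' direction only. It is not the package's implementation: autoScrollSketch is a hypothetical name, and it takes a window-like object as an explicit argument (an assumption made so the logic can run outside a browser). With Puppeteer, code like this would typically run in the page context via page.evaluate.

```javascript
// Hypothetical sketch of downward auto-scrolling (not the package's code).
// `win` is a window-like object exposing scrollBy() and document.body.scrollHeight.
async function autoScrollSketch(win, delay, scrollStep) {
  // Guard against NaN, zero, or negative steps, which would never reach the end.
  if (!(scrollStep > 0)) throw new RangeError('scrollStep must be a positive number');

  const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
  let scrolled = 0;

  // scrollHeight is re-read on every pass, so content that is lazy-loaded
  // while scrolling extends the loop instead of being cut off.
  while (scrolled < win.document.body.scrollHeight) {
    win.scrollBy(0, scrollStep);
    scrolled += scrollStep;
    await pause(delay);
  }
  return scrolled; // total distance requested, in pixels
}

// Stand-in for a 1000px-tall page, used to exercise the loop without a browser.
const fakeWindow = {
  steps: [],
  document: { body: { scrollHeight: 1000 } },
  scrollBy(x, y) { this.steps.push(y); },
};

autoScrollSketch(fakeWindow, 10, 300).then((total) => console.log(total, fakeWindow.steps.length));
// logs: 1200 4
```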
scraper.saveProduct
function saveProduct(
products, // Array of product information
productInfo, // Object containing information about the product
filePath // File path to save the scraped data
)
If all of productInfo's values are truthy, the function pushes the product into the products array and appends (saves) it to the file at the specified filePath.
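As a rough illustration of the rule above, the sketch below keeps a product only when every value is truthy. It is a stand-in built on assumptions rather than the package's implementation: the name saveProductSketch is hypothetical, and writing one comma-separated line per product is assumed here because the on-disk format is not specified above.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-in for saveProduct (the real file format may differ).
function saveProductSketch(products, productInfo, filePath) {
  const values = Object.values(productInfo);
  // Skip products with any missing (falsy) field, e.g. an empty title or price.
  if (!values.every(Boolean)) return false;

  products.push(productInfo);
  // Assumed format: one comma-separated line per product.
  fs.appendFileSync(filePath, values.join(',') + '\n');
  return true;
}

// Demo: write to a fresh temporary folder so repeated runs do not accumulate data.
const demoFile = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'products-')), 'demo.csv');
const demoProducts = [];
saveProductSketch(demoProducts, { title: 'Phone A', price: '100', imageUrl: 'https://example.com/a.png' }, demoFile);
saveProductSketch(demoProducts, { title: 'Phone B', price: '', imageUrl: 'https://example.com/b.png' }, demoFile); // skipped: empty price
console.log(demoProducts.length); // 1
```

Note that a truthiness check also rejects legitimate values such as a numeric price of 0, so prices are kept as strings in this sketch.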
scraper.navigatePage
async function navigatePage({
page, // Puppeteer page object
nextPageSelector, // CSS selector for the "next page" button
disabledSelector, // CSS selector for the disabled state of the "next page" button (to detect the end of pagination)
sleep = 1000 // Delay the execution to allow for page loading or other asynchronous operations to complete
})
👉 return Boolean indicating whether the navigation was successful or if there is a "next page".
This function uses disabledSelector to check whether the current page is the last one. If there is a "next page", it waits for the current navigation to complete and then clicks the nextPageSelector. Otherwise, it returns false, indicating that there is no "next page" to navigate to. The calling code can use this result to decide whether to continue scraping or stop.
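The decision described above can be sketched with two Puppeteer calls, page.$(selector) (which resolves to null when nothing matches) and page.click(selector). This is a hedged approximation, not the package's code: navigatePageSketch and sleepFor are hypothetical names, and the extra check for a missing "next page" button is an assumption added for single-page listings.

```javascript
// Hypothetical sketch of the decision navigatePage makes (not the package's code).
// `page` only needs Puppeteer's page.$(selector) and page.click(selector).
const sleepFor = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function navigatePageSketch({ page, nextPageSelector, disabledSelector, sleep = 1000 }) {
  // A match for disabledSelector means the "next page" button is disabled: last page.
  const disabledButton = await page.$(disabledSelector);
  if (disabledButton) return false;

  // Assumption: the button may be missing entirely on single-page listings.
  const nextButton = await page.$(nextPageSelector);
  if (!nextButton) return false;

  await page.click(nextPageSelector);
  await sleepFor(sleep); // give the next set of results time to load
  return true;
}
```

Because the function only touches `page.$` and `page.click`, it can be exercised with a small fake object before being pointed at a real page.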
helpers
- isFileExists(filePath): Checks if a file exists at the given filePath. It returns a boolean value indicating whether the file exists.
- createFile(filePath, header = ''): Creates a new file at the given filePath with the provided header as the first line. If the file already exists, it will not be overwritten.
- getWebName(url): Extracts the website name from a URL.
- url2FileName(url): Converts a URL into a filename-safe string by removing invalid characters.
- getChromeProfilePath(): Returns the path to the Chrome profile directory on different platforms (Windows, macOS, Linux).
- getChromeExecutablePath(): Returns the path to the Chrome executable on different platforms (Windows, macOS, Linux).
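The "create once, never overwrite" behavior described for isFileExists and createFile can be illustrated with Node's fs module. The sketch below is an approximation under stated assumptions, not the package's code: fileExistsSketch and createFileSketch are hypothetical names, and creating missing parent folders and returning a boolean are assumptions made for the example.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical stand-ins for isFileExists and createFile (not the package's code).
function fileExistsSketch(filePath) {
  return fs.existsSync(filePath);
}

function createFileSketch(filePath, header = '') {
  if (fileExistsSketch(filePath)) return false; // keep existing data untouched
  fs.mkdirSync(path.dirname(filePath), { recursive: true }); // assumption: create missing folders
  fs.writeFileSync(filePath, header ? header + '\n' : '');
  return true;
}

// Demo in a fresh temporary folder.
const target = path.join(fs.mkdtempSync(path.join(os.tmpdir(), 'scraper-')), 'tiki.csv');
console.log(createFileSketch(target, 'title,price,imageUrl')); // true: file created with header
console.log(createFileSketch(target, 'other header'));         // false: existing file is kept
```

Leaving an existing file alone is what lets a later scraping run append to earlier results instead of discarding them.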
Disclaimer
This scraper is designed for educational purposes only. The user is responsible for complying with the terms of service of the websites being scraped. The scraper should be used responsibly and respectfully to avoid overloading the websites with requests and to prevent IP blocking or other forms of retaliation.