web-scrap-ai
v1.0.4
Published
A chatbot CLI for web scraping using Puppeteer
Downloads
8
Maintainers
Readme
Web-Scrap-AI
Version: 1.0.3 Author: Akash Singh License: GPL-3.0
Table of Contents
- Overview
- Features
- Installation
- Usage
- How It Works
- Environment Variables
- Error Handling
- Docker Support
- Contributing
- License
Overview
Web-Scrap-AI is a powerful command-line tool that combines the capabilities of web scraping and artificial intelligence. It allows users to scrape data from websites using Puppeteer and interact with that data through an AI-powered chatbot. The tool is designed to be intuitive, allowing both automatic and custom class-based scraping, with answers tailored to the specific JSON data scraped from the website.
Features
- Docker Support: Dockerfile provided for easy containerization and deployment.
- Automatic Web Scraping: Quickly scrape a website's data using the built-in automatic scraping function.
- Custom Class-Based Scraping: Target specific parts of a website by specifying class names and assigning custom titles to the scraped data.
- AI-Powered Chatbot: Interact with the scraped data using an AI chatbot that can answer questions based on the data in JSON format.
- Context-Specific Queries: Ask the chatbot about specific sections of the scraped data, like titles or paragraphs, using simple commands such as
/title
or/paragraph
. - Environment Configuration: Easily configure and manage your Groq API key for AI interactions.
Step-by-Step Guide
- Start the CLI: Run the following command in your terminal:
- Enter the Website URL: The CLI will prompt you to enter the URL of the website you want to scrape.
- Choose Scraping Method:
- Automatic Scraping: The tool will automatically scrape the most relevant data from the website.
- Custom Class-Based Scraping: You can specify the class names you want to scrape and assign custom titles to the scraped data.
Custom Class-Based Scraping
Custom class-based scraping allows you to target specific elements of a website by providing the class names of those elements. This is particularly useful when you want to scrape specific sections of a page, such as product details, reviews, or any other content marked with identifiable class names.
How It Class-Based Works
- Class Names: You provide the class names of the elements you want to scrape.
- Titles: For each class name, you assign a title that will be used in the resulting JSON file to categorize the data.
Example
Let's say you want to scrape a website's product names and prices:
- Class name for product names:
.product-title
- Class name for product prices:
.product-price
- Assigned titles:
Product Names
andProduct Prices
During the scraping process, the tool will extract data from elements with these class names and store them under the provided titles in the JSON file.
Installation
To install Web-Scrap-AI , you can use npm:
npm install web-scrap-ai
This will install the package and automatically set up the required script in your package.json
.
Usage
Command-Line Interface (CLI)
After installation, you can use the web-scrap-ai
command to start the tool:
How It Works
Web Scraping
- Automatic Scraping: The tool can automatically scrape data from the provided URL using Puppeteer. It tries to intelligently extract meaningful content from the page.
- Custom Class-Based Scraping: Users can manually specify the class names of the HTML elements they want to scrape. The tool will prompt for the class names and the titles to assign to the data in the JSON file.
AI Chatbot
- Initiating the Chatbot: After scraping, the tool can start an AI chatbot that interacts with the scraped data. The chatbot only answers questions related to the scraped JSON data.
- Selecting the JSON File: If there are multiple JSON files in the root directory, the user can select which file to use for the chatbot session.
- Data-Specific Responses:
The AI can automatically detect commands like
/title
or/paragraph
and respond with information specific to those sections. - Fine-Tuning: The model is fine-tuned during the session to ensure it only answers questions relevant to the selected JSON file.
Environment Variables
This package requires a Groq API key only if ai model doesn't work or you are not satisfied with the answer, which is stored in a .env
file. The key is used to access Groq's AI model.
- To set api key enter "api key" when chatbot is running.
Setting Up the API Key
- If the
.env
file doesn't exist, it will be created automatically when you enter the API key for the first time. - The API key is stored under the variable name
GROQ_API_KEY
.
Error Handling
- Invalid URLs: If the provided URL is invalid, the tool will display an error message and terminate the session.
- API Errors: If there is an issue with the Groq API, the tool will prompt the user to re-enter the API key.
- File System Errors:
The tool includes error handling for file reading/writing operations. If it cannot access the
package.json
or any other required file, it will display an appropriate error message.
Docker Support
Web-Scrap-AI comes with a Dockerfile to allow easy containerization and deployment. You can build and run the Docker container as follows:
Contributing
Contributions to Web-Scrap-AI are welcome! If you have ideas for new features or improvements, feel free to submit a pull request or open an issue.
Support
If you like this project, show your support & love!
License
This project is licensed under the GPL-3.0 License. See the LICENSE file for details.