xscrape

v1.3.1

Published

5 days ago

A flexible and powerful library designed to extract and transform data from HTML documents using user-defined schemas

Downloads

431

Vulnerabilities: 0 High, 0 Medium, 0 Low

johnie

web-scraping data-extraction automation html-parsing data-transformation user-defined-schemas crawler scraper zod yup joi effect-schema

xscrape

xscrape is a library for extracting and transforming data from HTML documents using user-defined schemas. It works with schema validation libraries such as Zod, Effect Schema, and Joi (Yup support is planned), so you can keep using your preferred validation tool.

Features

HTML Parsing: Extract data from HTML using CSS selectors with the help of cheerio.
Schema Validation: Validate and transform extracted data with schema validation libraries like Zod.
Custom Transformations: Provide custom transformations for extracted attributes.
Default Values: Define default values for missing data fields.
Nested Field Support: Define and extract nested data structures from HTML elements.

Schema Support

| Schema Library | Status | Notes |
| --- | --- | --- |
| Zod | ✅ Supported | Default schema tool for xscrape |
| Effect/Schema | ✅ Supported | Support for Effect/Schema for additional flexibility |
| Joi | ✅ Supported | Support for Joi for those familiar with server-side validation |
| Yup | 🚧 Planned | Adding Yup support for schema validation in front-end applications |
| Others... | 🔄 In Consideration | Potential support for other schema tools as per user feedback |
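
The table suggests that each schema library sits behind a common validator abstraction. The sketch below shows one way such an adapter layer could look. The `Validator` interface and the `makeValidator` and `parseTitle` helpers are hypothetical names made up for this illustration and are not taken from xscrape's source.

```typescript
// Hypothetical adapter shape (an assumption, not xscrape's API): any schema
// library can plug in if it offers a function that returns typed data or throws.
interface Validator<T> {
  validate(data: unknown): T;
}

// Wraps a library-specific parse function (for example a Zod schema's `parse`)
// in the common interface.
function makeValidator<T>(parse: (data: unknown) => T): Validator<T> {
  return { validate: (data) => parse(data) };
}

// A hand-written parse function standing in for a real schema library.
function parseTitle(data: unknown): { title: string } {
  if (typeof data === "object" && data !== null && "title" in data) {
    const title = (data as { title: unknown }).title;
    if (typeof title === "string") return { title };
  }
  throw new Error("title must be a string");
}

const titleValidator = makeValidator(parseTitle);
const validated = titleValidator.validate({ title: "Example Title" });
```

With this shape, swapping schema libraries only changes the function passed to `makeValidator`; the code that consumes the validator stays the same.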

Installation

To install this library, use pnpm or npm:

pnpm add xscrape
# or
npm install xscrape

Usage

Below is an example of how to use xscrape for extracting and transforming data from an HTML document:

Define Your Schema

import { z } from 'zod';

const schema = z.object({
  title: z.string().default('No title'),
  description: z.string(),
  keywords: z.array(z.string()),
  views: z.number(),
  image: z
    .object({
      url: z.string(),
      width: z.number(),
      height: z.number(),
    })
    .default({ url: '', width: 0, height: 0 })
    .optional(),
});
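
For reference, the type that `z.infer<typeof schema>` produces for this schema should be roughly equivalent to the hand-written type below. The names `PageData` and `samplePage` are assumptions made for this illustration; the block does not import Zod so that it stays self-contained.

```typescript
// Hand-written approximation of z.infer<typeof schema> (illustrative name).
type PageData = {
  // .default() means the parsed output always contains a string here.
  title: string;
  description: string;
  keywords: string[];
  views: number;
  // .optional() makes the key optional in the inferred type.
  image?: { url: string; width: number; height: number };
};

// A value that satisfies the type without an image.
const samplePage: PageData = {
  title: "Example Title",
  description: "An example description.",
  keywords: ["typescript", "html", "parsing"],
  views: 1234,
};
```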

Define Field Definitions

import { type SchemaFieldDefinitions } from 'xscrape';

type FieldDefinitions = SchemaFieldDefinitions<z.infer<typeof schema>>;

const fields: FieldDefinitions = {
  title: { selector: 'title' },
  description: {
    selector: 'meta[name="description"]',
    attribute: 'content',
    defaultValue: 'No description',
  },
  keywords: {
    selector: 'meta[name="keywords"]',
    attribute: 'content',
    transform: (value) => value.split(','),
    defaultValue: [],
  },
  views: {
    selector: 'meta[name="views"]',
    attribute: 'content',
    transform: (value) => parseInt(value, 10),
    defaultValue: 0,
  },
  // Example of a nested field
  image: {
    fields: {
      url: {
        selector: 'meta[property="og:image"]',
        attribute: 'content',
      },
      width: {
        selector: 'meta[property="og:image:width"]',
        attribute: 'content',
        transform: (value) => parseInt(value, 10),
      },
      height: {
        selector: 'meta[property="og:image:height"]',
        attribute: 'content',
        transform: (value) => parseInt(value, 10),
      },
    },
  },
};
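
To make the roles of `transform` and `defaultValue` concrete, here is a minimal sketch of how a single field could be resolved from a raw attribute value. It assumes the default is used when the value is missing and the transform is applied otherwise; xscrape's actual behavior may differ. `FieldRule`, `resolveField`, and `toInt` are hypothetical names for this illustration only.

```typescript
// Hypothetical shape mirroring the options used above (not xscrape's own type).
type FieldRule<T> = {
  transform?: (value: string) => T;
  defaultValue?: T;
};

// Resolve one field from a raw attribute value that may be missing.
function resolveField<T>(
  raw: string | undefined,
  rule: FieldRule<T>,
): T | string | undefined {
  if (raw === undefined) return rule.defaultValue;
  return rule.transform ? rule.transform(raw) : raw;
}

// parseInt returns NaN for non-numeric input, so a stricter transform can
// reject bad values instead of passing NaN on to validation.
function toInt(value: string): number {
  const parsed = Number.parseInt(value, 10);
  if (Number.isNaN(parsed)) throw new Error(`Not an integer: ${value}`);
  return parsed;
}

const viewsPresent = resolveField("1234", { transform: toInt, defaultValue: 0 });
const viewsMissing = resolveField(undefined, { transform: toInt, defaultValue: 0 });

// Trimming each keyword tolerates input such as "typescript, html ,parsing".
const keywordList = resolveField("typescript, html ,parsing", {
  transform: (value) => value.split(",").map((keyword) => keyword.trim()),
  defaultValue: [] as string[],
});
```

A throwing transform like `toInt` is a design choice: it surfaces malformed markup early rather than letting `NaN` reach a `z.number()` check.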

Create a Scraper and Extract Data

import { createScraper, ZodValidator } from 'xscrape';

const validator = new ZodValidator(schema);
const scraper = createScraper({ fields, validator });

const html = `
   <!DOCTYPE html>
   <html>
   <head>
     <meta name="description" content="An example description.">
     <meta name="keywords" content="typescript,html,parsing">
     <meta name="views" content="1234">
     <meta property="og:image" content="https://example.se/images/c12ffe73-3227-4a4a-b8ad-a3003cdf1d70?h=708&amp;tight=false&amp;w=1372">
     <meta property="og:image:width" content="1372">
     <meta property="og:image:height" content="708">
     <title>Example Title</title>
   </head>
   <body></body>
   </html>
   `;

const data = scraper(html);
console.log(data);

// Outputs:
// {
//   title: 'Example Title',
//   description: 'An example description.',
//   keywords: ['typescript', 'html', 'parsing'],
//   views: 1234,
//   image: {
//     url: 'https://example.se/images/c12ffe73-3227-4a4a-b8ad-a3003cdf1d70?h=708&tight=false&w=1372',
//     width: 1372,
//     height: 708
//   }
// }
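
The example above covers only the successful case. If you prefer result values over exceptions, a small wrapper along the following lines may help. It assumes that a scraper signals invalid data by throwing, which this README does not confirm; `ScrapeResult`, `safeScrape`, and `fakeScraper` are hypothetical names, and `fakeScraper` is a stand-in so the sketch runs without xscrape.

```typescript
// Result type for callers that prefer values over exceptions.
type ScrapeResult<T> =
  | { ok: true; data: T }
  | { ok: false; error: string };

// Wraps any (html) => T function; assumes failures surface as thrown errors.
function safeScrape<T>(scrape: (html: string) => T, html: string): ScrapeResult<T> {
  try {
    return { ok: true, data: scrape(html) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in scraper used only to make this sketch runnable without xscrape.
const fakeScraper = (html: string): { title: string } => {
  const match = /<title>(.*?)<\/title>/.exec(html);
  if (!match) throw new Error("missing <title>");
  return { title: match[1] ?? "" };
};

const okResult = safeScrape(fakeScraper, "<title>Example Title</title>");
const failedResult = safeScrape(fakeScraper, "<p>no title here</p>");
```

Checking `okResult.ok` before reading `okResult.data` lets TypeScript narrow the union, so the failure branch cannot be ignored by accident.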

Configuration

xscrape is configured through the types it provides:

SchemaFieldDefinitions: Determines how fields are extracted from the HTML.
SchemaValidator: Validates the extracted data according to defined schemas.

API Reference

createScraper(config: ScrapeConfig): (html: string) => T: Creates a scraping function based on the specified fields and validator.
ZodValidator: A built-in validator that uses Zod, so you can define schemas and validate extracted data.
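
Because `createScraper` returns a function of type `(html: string) => T`, a scraper can be configured once and reused across many documents. The toy factory below illustrates that pattern only. `createMetaScraper` is a hypothetical helper, not part of xscrape, and it uses a naive regular expression where a real implementation would use an HTML parser such as cheerio.

```typescript
// Toy stand-in for the factory pattern: configure once, reuse on many documents.
function createMetaScraper(
  names: string[],
): (html: string) => Record<string, string | undefined> {
  return (html) => {
    const result: Record<string, string | undefined> = {};
    for (const name of names) {
      // Naive lookup for <meta name="..." content="...">; assumes `name`
      // contains no regex metacharacters and attributes appear in this order.
      const pattern = new RegExp(`<meta\\s+name="${name}"\\s+content="([^"]*)"`, "i");
      result[name] = pattern.exec(html)?.[1];
    }
    return result;
  };
}

const scrapeMeta = createMetaScraper(["description", "views"]);

const pages = [
  '<meta name="description" content="First page"><meta name="views" content="10">',
  '<meta name="description" content="Second page">',
];

// The same configured function handles every page.
const results = pages.map((page) => scrapeMeta(page));
```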

For a complete list of API methods and more advanced configuration options, refer to the documentation on the project homepage https://github.com/johnie/xscrape.

Contributing

Contributions are welcome! Please see the Contributing Guide https://github.com/johnie/xscrape/blob/main/CONTRIBUTING.md for more information.

License

This project is licensed under the MIT License. See the LICENSE https://github.com/johnie/xscrape/blob/main/LICENSE file for details.
