lusail

v0.8.1

lusail-js

JavaScript implementation of Lusail, a domain-specific language for extracting structured data from HTML.

What is Lusail?

Lusail is an extensible domain-specific language designed to make it easy to express the structure of the data that needs to be extracted from an HTML document. It relies on a combination of field definitions and transformation pipelines to dictate data extraction and processing for each field. The transforms within a pipeline process input data sequentially, with each transform receiving the output of its predecessor, applying its specific logic, and then passing the result to the subsequent transform.
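The sequential hand-off between transforms can be pictured with a small, self-contained sketch. This is a simplified model rather than the library's internals: the `Step` type, `runPipeline`, and the three toy steps are hypothetical names, and the steps run synchronously for brevity.

```typescript
// Hypothetical, simplified model of a transformation pipeline (not Lusail's actual code).
type Step = (input: unknown) => unknown;

// Run each step in order, feeding the output of one step into the next.
function runPipeline(input: unknown, steps: Step[]): unknown {
  return steps.reduce((value, step) => step(value), input);
}

// Three toy steps that mirror "select elements, pick the first, read its text".
const selectTitles: Step = () => [{ text: 'Lusail' }, { text: 'Other' }];
const pickFirst: Step = (value) => (value as unknown[])[0];
const readText: Step = (value) => (value as { text: string }).text;

const pageTitle = runPipeline(undefined, [selectTitles, pickFirst, readText]);
console.log(pageTitle); // prints "Lusail"
```

Each step only knows about its own input and output, which is what lets a field definition be written as a flat list of transforms.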

A Lusail template can be defined using any object notation. Here's a simple example of a Lusail template in YAML:

# Get the text content of the first element matching the CSS selector "title" and assign it to the
# field "pageTitle".
pageTitle:
  - cssSelector: title
  - get: single
  - get: text
# Get the text content of the first element matching the CSS selector ".description" and assign it
# to the field "pageDescription".
pageDescription:
  - cssSelector: .description
  - get: single
  - get: text
# Get all the href attributes of elements matching the CSS selector "body > a" and assign the
# resulting array to the "links" field.
links:
  - cssSelector: "body > a"
  - attribute: href
# Get all elements matching the CSS selector ".post" and extract certain fields from each.
posts:
  - cssSelector: .post
  - fields:
      title:
        - cssSelector: .title
        - get: single
        - get: text
      content:
        - cssSelector: .content
        - get: single
        - get: text

Now consider this HTML document:

<html>
  <head>
    <title>Lusail</title>
  </head>
  <body>
    <h1 class="description">JavaScript implementation of Lusail</h1>
    <a href="https://www.example.com">Example</a>
    <a href="https://www.github.com">Example 2</a>
    <div class="post">
      <h2 class="title">Post 1</h2>
      <p class="content">Content 1.</p>
    </div>
    <div class="post">
      <h2 class="title">Post 2</h2>
      <p class="content">Content 2.</p>
    </div>
  </body>
</html>

Applying the above template to the given HTML document will produce:

{
  "pageTitle": "Lusail",
  "pageDescription": "JavaScript implementation of Lusail",
  "links": [ "https://www.example.com", "https://www.github.com" ],
  "posts": [
    { "title": "Post 1", "content": "Content 1." },
    { "title": "Post 2", "content": "Content 2." }
  ]
}

This library is a JavaScript parser for the Lusail language.

Installation

npm install --save lusail

Usage

Create a Lusail instance by passing in a template as a JavaScript object:

import { Lusail, LusailTemplate } from 'lusail';

// Define your Lusail template.
const template: LusailTemplate = {
  pageTitle: [
    { cssSelector: '.title' },
    { get: 'single', index: 0 },
    { get: 'text' },
  ],
};

// Create a Lusail instance.
const lusail = new Lusail(template);

Or define it as a YAML template for a more concise structure:

import { Lusail } from 'lusail';

const yamlTemplate = `
pageTitle:
  - cssSelector: .title
  - get: single
    index: 0
  - get: text
`;

const lusail = Lusail.fromYaml(yamlTemplate);

Then parse your HTML as a string:

const result = await lusail.parseFromString(html);

Or let it fetch the HTML from a URL:

const result = await lusail.parseFromUrl(url);

Supported Transforms

Single

Retrieves a single element from an array by index.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | If index is not defined | single |
| index | The index to pick | If getBy is not specified | 0 |

Range

Retrieves a range of elements by start and end indexes.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | If none of the other properties are provided | range |
| start | The starting index of the range | If none of the other properties are provided | 0 |
| end | The ending index of the range | If none of the other properties are provided | End of the input array |
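As a rough illustration of what the Single and Range transforms do to an input array, here is a plain TypeScript sketch. `pickSingle` and `pickRange` are hypothetical helpers, not library functions, and the sketch assumes that `end` is exclusive (as with `Array.prototype.slice`), which the table does not confirm.

```typescript
// Illustrative stand-ins for the Single and Range transforms; not the library's code.
function pickSingle<T>(items: T[], index = 0): T | undefined {
  return items[index];
}

// Assumption: `end` is exclusive, as with Array.prototype.slice.
function pickRange<T>(items: T[], start = 0, end = items.length): T[] {
  return items.slice(start, end);
}

const links = ['a', 'b', 'c', 'd'];
console.log(pickSingle(links));      // "a" (index defaults to 0)
console.log(pickSingle(links, 2));   // "c"
console.log(pickRange(links, 1, 3)); // ["b", "c"]
console.log(pickRange(links, 2));    // ["c", "d"] (end defaults to the end of the array)
```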

CSS Selector

Retrieves elements matching the given selector.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | cssSelector |
| cssSelector | The CSS selector to match elements | Yes | - |

Element Text

Retrieves the text content of input element(s).

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Triggers this transform | Yes | text |

Attribute

Retrieves the value of the specified attribute of input element(s).

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | attribute |
| attribute | The name of the attribute to retrieve | Yes | - |

Cast

Casts incoming value(s) to a target type.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | cast |
| castTo | The field type to cast the value to | Yes | - |

Date

Casts incoming value(s) to date(s), using an optional format and locale.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | cast |
| castTo | The field type to cast the value to | Yes | date |
| format | The format of the date string or 'timeAgo' for relative time | No | ISO 8601 format |
| locale | The locale to be used when parsing the date | No | - |

Regex

Applies a regular expression substitution to the input value(s).

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | regex |
| regex | The regex pattern to apply | Yes | - |
| replaceWith | The string to replace matched patterns with | No | '$1' |
| requireMatch | Whether to pass the input value if it does not match the pattern | No | false |
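The default `replaceWith` of `'$1'` means that, by default, a match is replaced with its first capture group. The sketch below shows that idea with the standard `String.prototype.replace`; `substitute` is a hypothetical helper, and the sketch deliberately does not model `requireMatch`, since the README does not spell out how non-matching input is handled.

```typescript
// Hypothetical helper showing regex substitution with '$1' as the default replacement.
// It does not model requireMatch; plain String.prototype.replace leaves
// non-matching input unchanged.
function substitute(input: string, pattern: RegExp, replaceWith = '$1'): string {
  return input.replace(pattern, replaceWith);
}

// Keep only the first capture group of the match.
const price = substitute('Price: 42 USD', /^Price: (\d+) USD$/);
console.log(price); // "42"

// An explicit replacement string overrides the default.
const slug = substitute('Hello World', /\s+/g, '-');
console.log(slug); // "Hello-World"
```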

Extract Fields

Extracts fields by applying a sub-template to the input.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | fields |
| fields | The LusailTemplate for extracting fields | Yes | - |

Follow Links

Follows links from input strings and extracts fields given by a sub-template.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | followingLinks, followLinks, or links |
| followLinks | The LusailTemplate to apply to the linked content | Yes | - |

Literal

Transforms the input(s) into a fixed literal value.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | literal |
| literal | The fixed literal value | Yes | - |

Hoist

Hoists nested fields to the top level of the result.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Triggers this transform | Yes | hoist or hoisting |

Existence

Determines whether the value transformed up to this point exists or not. Existence is determined by truthiness. If the value is an array, then existence is determined by the existence of a truthy value in the array.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | existence or exists |
| exists | Whether to check for existence (true) or absence (false) | No | true |
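The truthiness rule described above can be sketched in a few lines. `checkExistence` is a hypothetical function written for illustration, not part of the library's API; it follows the stated rule that an array "exists" when it contains at least one truthy value.

```typescript
// Illustrative sketch of the existence rule; not the library's implementation.
function checkExistence(value: unknown, exists = true): boolean {
  // Arrays exist if any item is truthy; other values exist if they are truthy.
  const found = Array.isArray(value) ? value.some(Boolean) : Boolean(value);
  // exists = true checks for presence, exists = false checks for absence.
  return exists ? found : !found;
}

console.log(checkExistence('hello'));         // true
console.log(checkExistence(''));              // false
console.log(checkExistence([null, '', 'x'])); // true (one truthy item)
console.log(checkExistence([], false));       // true (absence check on an empty array)
```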

Map

Replaces incoming values with new values using a key/value map.

| Property | Description | Required | Default / required value |
| - | - | - | - |
| getBy | Explicitly triggers this transform | No | map or mapping |
| map | The map to use for the conversion | Yes | - |
| strict | Whether to allow unmatched values to pass | No | false |

Adding Custom Transforms

Lusail-js allows you to extend its functionality by registering custom transformations. These additional transformations can then be used in your Lusail templates.

Here's how to create and register a custom transform:

1. Create a custom transformer

Implement a custom transformer that extends the Transformer abstract class.

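import { FieldTransform, Transformer } from 'lusail';

// Options accepted by the custom transform; it is triggered by `getBy: mine`.
export interface MyTransform extends FieldTransform<'mine'> {
  myOption: any;
}

export default class MyTransformer extends Transformer<MyTransform> {
  async execute(input: Element | Element[]): Promise<string | string[]> {
    // Your transformation logic goes here. As a placeholder, return the text content of the input.
    const toText = (element: Element) => element.textContent ?? '';
    return Array.isArray(input) ? input.map(toText) : toText(input);
  }
}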

2. Register a custom transformer factory

A custom transformer factory is a function that returns a Transformer instance if the given FieldTransform matches the custom transformation type. To make your custom transformation available in Lusail templates, you need to register it with Lusail.

import { FieldTransform, Lusail } from 'lusail';
// Assumed path: adjust it to wherever the transformer from step 1 lives.
import MyTransformer, { MyTransform } from './MyTransformer';

// Type guard that recognizes the custom transform by its getBy value.
function isMyTransform(transform: FieldTransform): transform is MyTransform {
  return transform.getBy === 'mine';
}

Lusail.registerTransform(
  (transform, options) => {
    return isMyTransform(transform)
      ? new MyTransformer(transform, options)
      : undefined;
  },
  // Optional precedence argument. Factories that claim higher precedence will be chosen over those
  // with lower precedence in case of conflict.
  2
);
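To make the precedence rule concrete, here is a self-contained sketch of how a registry might resolve conflicts between factories. It is not Lusail's implementation: `TransformSpec`, `Handler`, `register`, and `resolve` are hypothetical names, and the default precedence of 0 is an assumption.

```typescript
// Hypothetical registry illustrating precedence-based factory selection.
type TransformSpec = { getBy: string };
type Handler = (input: string) => string;
type Factory = (spec: TransformSpec) => Handler | undefined;

const registry: { factory: Factory; precedence: number }[] = [];

// Assumption: factories registered without a precedence default to 0.
function register(factory: Factory, precedence = 0): void {
  registry.push({ factory, precedence });
  // Keep the registry sorted so higher-precedence factories are tried first.
  registry.sort((a, b) => b.precedence - a.precedence);
}

// Return the handler from the first factory that claims the transform.
function resolve(spec: TransformSpec): Handler | undefined {
  for (const { factory } of registry) {
    const handler = factory(spec);
    if (handler) return handler;
  }
  return undefined;
}

// Two factories claim `getBy: mine`; the one with precedence 2 wins.
register((spec) => (spec.getBy === 'mine' ? (s: string) => s.toLowerCase() : undefined), 1);
register((spec) => (spec.getBy === 'mine' ? (s: string) => s.toUpperCase() : undefined), 2);

const handler = resolve({ getBy: 'mine' });
console.log(handler?.('Lusail')); // "LUSAIL"
```

Returning `undefined` from a factory, as in the registration example above, is what lets the next candidate be considered.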

3. Use the custom transformer in your templates

Now, your custom transformation type can be used in Lusail templates:

customField:
  - getBy: mine
    myOption: <value>

Documentation

See API Documentation for more details.

Development Status

Please note that Lusail is still under development and has not been thoroughly tested. As such, its use in production environments is not yet recommended. Also note that while we will attempt to follow semantic versioning for the library, there might be breaking changes between minor versions from time to time until we reach a stable state. Please report any issues you encounter and/or submit a pull request so we can make the library better.

Contributing

This is an evolving project, and contributions are welcome. Please read the CONTRIBUTING.md file for guidelines on how to contribute.

License

This project is licensed under the MIT License. See the LICENSE file for details.