cheerio-json-mapper

v1.0.4

Extract data from HTML using JSON mapping with custom pipe functionality

Downloads

949

Cheerio JSON Mapper

Extract HTML markup to JSON using Cheerio.


License: MIT

Install

# npm
npm i -S cheerio-json-mapper

# yarn
yarn add cheerio-json-mapper

Usage

import { cheerioJsonMapper } from 'cheerio-json-mapper';

const html = `
    <article>
        <h1>My headline</h1>
        <div class="content">
            <p>My article text.</p>
        </div>
        <div class="author">
            <a href="mailto:[email protected]">John Doe</a>
        </div>
    </article>
`;

const template = {
  headline: 'article > h1',
  articleText: 'article > .content',
  author: {
    $: 'article > .author',
    name: '> a',
    email: '> a | attr:href | substr:7',
  },
};

const result = await cheerioJsonMapper(html, template);
console.log(result);
// output:
// {
//     headline: "My headline",
//     articleText: "My article text.",
//     author: {
//         name: "John Doe",
//         email: "[email protected]"
//     }
// }

More examples are found in the repo's tests/cases folder.

Core concepts

End-Result Structure First

The main approach is to start from what we need to retrieve: define the end structure, then tell each property which selector to use to get its value.

Hard-coded values (literals)

We can put hard-coded values into the structure by wrapping a string in an extra pair of double or single quotes. Numbers and booleans are automatically detected as literals:

{
  "headline": "article > h1",
  "public": true,
  "copyright": "'© Copyright Us Inc. 2023'",
  "version": 1.23
}
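To make the literal rule concrete, here is a minimal sketch of how a template value could be classified as a literal or a selector. The helper `classifyTemplateValue` is hypothetical and written only for illustration; it is not part of the cheerio-json-mapper API, and the library's real detection logic may differ.

```javascript
// Hypothetical helper for illustration only, not the library's implementation.
function classifyTemplateValue(value) {
  // Numbers and booleans are treated as literals as-is.
  if (typeof value === 'number' || typeof value === 'boolean') {
    return { kind: 'literal', value };
  }
  if (typeof value === 'string') {
    // A string wrapped in a matching pair of single or double quotes is a literal.
    const quoted = value.match(/^(['"])(.*)\1$/s);
    if (quoted) {
      return { kind: 'literal', value: quoted[2] };
    }
    // Any other string is assumed to be a selector (optionally with pipes).
    return { kind: 'selector', selector: value };
  }
  throw new TypeError(`Unsupported template value: ${String(value)}`);
}

const template = {
  headline: 'article > h1',
  public: true,
  copyright: "'© Copyright Us Inc. 2023'",
  version: 1.23,
};

// headline is classified as a selector; public, copyright and version as literals.
const classified = Object.fromEntries(
  Object.entries(template).map(([key, value]) => [key, classifyTemplateValue(value)]),
);
console.log(classified);
```

Note that the outer quotes are stripped from the `copyright` literal, so the resulting value is the plain text between them.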

Scoping

Large documents with nested parts tend to require big and ugly selectors. To simplify things, we can scope an object to only care for a certain selected part.

Add a $ property with a selector to narrow down what the rest of the object uses as its base.

Example:

<article>
  <h1>My headline</h1>
  <div class="content">
    <p>My article text.</p>
  </div>
  <div class="author">
    <span class="name">John Doe</span>
    <span class="tel">555-1234</span>
    <a href="mailto:[email protected]">John Doe</a>
  </div>
  <div class="other">
    <span class="name">This won't be selected due to scoping</span>
  </div>
</article>

const template = {
  $: 'article',
  headline: '> h1',
  articleText: '> .content',
  author: {
    $: '> .author',
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
  },
};

Self-selector

In some cases we want to reuse the object selector ($) as a property selector. This is especially handy when targeting lists, as in this case:

const html = `
  <ul>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
  </ul>
`;
const template = [
  {
    $: 'ul > li',
    value: '$', // uses `ul > li` as property selector
  },
];
const result = await cheerioJsonMapper(html, template);
console.log(result);
// Output:
// [
//   { value: 'One' },
//   { value: 'Two' },
//   { value: 'Three' }
// ];

Note: Don't like the $ name for the scope selector? Change it through options: cheerioJsonMapper(html, template, { scopeProp: '__scope' }).

Pipes

Sometimes the text content of a selected node is not what we need, or not enough. Pipes to the rescue!

Pipes are functions that can be applied to a value, whether it comes from a property selector or from an object. Use pipes to handle any custom needs.

Multiple pipes are supported (separated by the | character) and run in sequence. Note that the value returned from one pipe is passed to the next, which lets us chain functionality (much like *nix terminal pipes, which inspired this syntax).

Pipes can take basic arguments: add a colon (:) after the pipe name, followed by semicolon-separated (;) values.

Pipes can be asynchronous.
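The pipe syntax described above can be sketched in plain JavaScript. This is a simplified illustration, not the library's actual parser: `parsePipeExpression`, `runPipes` and `demoPipes` are hypothetical names, the naive split on | would break on selectors that themselves contain that character, and the only thing taken from this README is the `({ value, args })` shape that pipe functions receive.

```javascript
// Hypothetical helper: split "selector | pipe:arg1;arg2 | ..." into its parts.
function parsePipeExpression(expression) {
  const [selector, ...pipeParts] = expression.split('|').map((part) => part.trim());
  const pipes = pipeParts.map((part) => {
    const colonAt = part.indexOf(':');
    if (colonAt === -1) return { name: part, args: [] };
    return {
      name: part.slice(0, colonAt),
      args: part.slice(colonAt + 1).split(';'), // arguments stay strings
    };
  });
  return { selector, pipes };
}

// Hypothetical runner: apply pipes in order, awaiting each one so that
// synchronous and asynchronous pipes can be mixed freely.
async function runPipes(initialValue, pipes, pipeFns) {
  let value = initialValue;
  for (const { name, args } of pipes) {
    const fn = pipeFns[name];
    if (!fn) throw new Error(`Unknown pipe: ${name}`);
    value = await fn({ value, args });
  }
  return value;
}

// Stand-in pipes for the demo; they are not the library's built-in versions.
const demoPipes = {
  substr: ({ value, args }) => String(value).substring(Number(args[0])),
  upper: async ({ value }) => String(value).toUpperCase(), // async on purpose
};

const parsed = parsePipeExpression('a[href^=mailto:] | substr:7 | upper');
console.log(parsed.selector); // → a[href^=mailto:]

runPipes('mailto:jane@example.org', parsed.pipes, demoPipes).then((result) => {
  console.log(result); // → JANE@EXAMPLE.ORG
});
```

Awaiting every pipe, even the synchronous ones, keeps the runner simple at the cost of one extra microtask per pipe.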

Use pipes in selector props:

{
  email: 'a[href^=mailto:] | attr:href | substr:7',
}

Use pipes in objects:

{
    name: 'span.name',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    telephone: 'span.tel',
    '|': 'requiredProps:name;email'
}

Note: Don't like the | name for the pipe property? Change it through options: cheerioJsonMapper(html, template, { pipeProp: '__pipes' }).

Default pipes included:

  • text - grab .textContent from the selected node (used by default if no other pipes are specified)
  • trim - trim grabbed text
  • lower - turn grabbed text to lower case
  • upper - turn grabbed text to upper case
  • substr - get substring of grabbed text
  • default - if value is nullish/empty, use specified fallback value
  • parseAs - parse a string to different types:
    • parseAs:number - number
    • parseAs:int - integer
    • parseAs:float - float
    • parseAs:bool - boolean
    • parseAs:date - date
    • parseAs:json - JSON
  • log - will console.log current value (use for debugging)
  • attr - get attribute value from selected node
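As a rough illustration of how several default pipes might be combined, the template below chains a few of them. The selectors (`h1`, `span.price`, `a.more`, `h2`) and the fallback text are made up for this sketch, and writing `text` explicitly before other text pipes is an assumption rather than a documented requirement. The small check at the end is also hypothetical: it only catches misspelled pipe names before the template is handed to the mapper.

```javascript
// Pipe names taken from the list of default pipes above.
const documentedPipes = [
  'text', 'trim', 'lower', 'upper', 'substr', 'default', 'parseAs', 'log', 'attr',
];

// Illustrative template; selectors and argument values are assumptions.
const template = {
  title: 'h1 | text | trim | upper',
  price: 'span.price | text | parseAs:float',
  moreLink: 'a.more | attr:href',
  subtitle: 'h2 | text | default:Untitled',
};

// Collect every pipe name used in the template (the part before any colon).
const usedPipes = Object.values(template).flatMap((expression) =>
  expression
    .split('|')
    .slice(1) // drop the selector, keep the pipes
    .map((pipe) => pipe.trim().split(':')[0]),
);

// Any name not in the documented list is probably a typo or a custom pipe.
const unknownPipes = usedPipes.filter((name) => !documentedPipes.includes(name));
console.log(unknownPipes); // → []
```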

Custom pipes

Create your own pipes to handle any customization needed.

const customPipes = {
  /** Replace any http:// link into https:// */
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),

  /** Check that all required props exist; if not, set the object to undefined */
  requiredProps: ({ value, args }) => {
    const obj = value; // as this should be run as object pipe, value should be an object
    const requiredProps = args; // string array
    const hasMissingProps = requiredProps.some((prop) => obj[prop] == null);
    return hasMissingProps ? undefined : obj;
  },
};

const template = [
  {
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    website: 'a[href^=http] | attr:href | onlyHttps',
    '|': 'requiredProps:name;email',
  },
];

const contacts = await cheerioJsonMapper(html, template, { pipeFns: customPipes });
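Because custom pipes are plain functions, they can be exercised directly without running the mapper. The sketch below reuses the two pipes from above and calls them with hand-built `{ value, args }` objects; the sample values are made up, and it assumes `args` arrives as an array of strings, as the comment in the example suggests. The mapper may pass additional fields that are not shown here.

```javascript
const customPipes = {
  /** Replace any http:// link with https:// */
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),

  /** Return the object only if every required prop is present */
  requiredProps: ({ value, args }) => {
    const hasMissingProps = args.some((prop) => value[prop] == null);
    return hasMissingProps ? undefined : value;
  },
};

// Object pipe: all required props are present, so the object passes through.
const complete = customPipes.requiredProps({
  value: { name: 'Jane Roe', email: 'jane@example.org' },
  args: ['name', 'email'],
});
console.log(complete); // → { name: 'Jane Roe', email: 'jane@example.org' }

// Object pipe: email is missing, so the whole object becomes undefined.
const incomplete = customPipes.requiredProps({
  value: { name: 'Jane Roe' },
  args: ['name', 'email'],
});
console.log(incomplete); // → undefined

// Value pipe: only the scheme at the start of the string is rewritten.
const secured = customPipes.onlyHttps({ value: 'http://example.org' });
console.log(secured); // → https://example.org
```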

Examples

More examples are found in the repo's tests/cases folder.

Change Log

v1.0.4 - 2024-10-18

  • Update Cheerio to v1.0.0
  • Add test/example for table formatting

v1.0.3 - 2023-04-11

  • Fixed bug with getScope() method.

v1.0.2 - 2023-04-04

  • Fixed bug when scoped object selectors don't match anything.
  • Support self-selector.
  • Align how default pipes should behave.

v1.0.1 - 2023-03-28

  • Updated README

v1.0.0 - 2023-03-28

  • First release with initial functionality:
    • End-result structure first approach
    • Scoping
    • Pipes with default setup of pipe funcs