cheerio-json-mapper-sync
v1.0.3
Extract data from HTML using JSON mapping with custom pipe functionality
Cheerio JSON Mapper
Extract HTML markup to JSON using Cheerio.
This fork is a synchronous version of the library. It does not support Promise-returning pipe functions.
Install
# npm
npm i -S cheerio-json-mapper-sync
# yarn
yarn add cheerio-json-mapper-sync
Usage
import { cheerioJsonMapperSync } from 'cheerio-json-mapper-sync';

const html = `
  <article>
    <h1>My headline</h1>
    <div class="content">
      <p>My article text.</p>
    </div>
    <div class="author">
      <a href="mailto:[email protected]">John Doe</a>
    </div>
  </article>
`;

const template = {
  headline: 'article > h1',
  articleText: 'article > .content',
  author: {
    $: 'article > .author',
    name: '> a',
    email: '> a | attr:href | substr:7',
  },
};

const result = cheerioJsonMapperSync(html, template);
console.log(result);
// output:
// {
//   headline: "My headline",
//   articleText: "My article text.",
//   author: {
//     name: "John Doe",
//     email: "[email protected]"
//   }
// }
More examples are found in the repo's tests/cases folder.
Core concepts
End-Result Structure First
The main approach is to start from what we need to retrieve: define the end structure, then tell each property which selector to use to get its value.
Hard-coded values (literals)
We can put hard-coded values into the structure by wrapping strings in an extra pair of double or single quotes. Numbers and booleans are automatically detected as literals:
{
  "headline": "article > h1",
  "public": true,
  "copyright": "'© Copyright Us Inc. 2023'",
  "version": 1.23
}
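To make the literal rule concrete, here is a minimal sketch of how a template value could be classified as a literal or a selector. The helper name `classifyTemplateValue` is hypothetical and this is not the library's actual implementation; it only mirrors the rule described above (numbers and booleans are literals, strings wrapped in an extra pair of quotes are literals, everything else is a selector).

```javascript
// Hypothetical helper (not part of the library): decide whether a template
// value is a hard-coded literal or a selector string.
function classifyTemplateValue(value) {
  // Numbers and booleans are always treated as literals.
  if (typeof value === 'number' || typeof value === 'boolean') {
    return { kind: 'literal', value };
  }
  if (typeof value === 'string') {
    // A string wrapped in matching single or double quotes is a literal.
    const quoted = value.match(/^(['"])(.*)\1$/s);
    if (quoted) {
      return { kind: 'literal', value: quoted[2] };
    }
    return { kind: 'selector', value };
  }
  throw new TypeError(`Unsupported template value: ${String(value)}`);
}

const template = {
  headline: 'article > h1',
  public: true,
  copyright: "'© Copyright Us Inc. 2023'",
  version: 1.23,
};

const classified = Object.fromEntries(
  Object.entries(template).map(([key, value]) => [key, classifyTemplateValue(value)]),
);
console.log(classified);
```

Only `headline` ends up as a selector here. The other three properties are passed through as fixed values.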
Scoping
Large documents with nested parts tend to require big and ugly selectors. To simplify things, we can scope an object to only care for a certain selected part.
Add a $ property with a selector to narrow down what the rest of the object should use as its base.
Example:
<article>
  <h1>My headline</h1>
  <div class="content">
    <p>My article text.</p>
  </div>
  <div class="author">
    <span class="name">John Doe</span>
    <span class="tel">555-1234</span>
    <a href="mailto:[email protected]">John Doe</a>
  </div>
  <div class="other">
    <span class="name">This won't be selected due to scoping</span>
  </div>
</article>
const template = {
  $: 'article',
  headline: '> h1',
  articleText: '> .content',
  author: {
    $: '> .author',
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
  },
};
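One way to reason about scoping is that a scoped template is roughly equivalent to a flat template with longer selectors. The sketch below rewrites a scoped template into that flat form with plain string handling. `flattenTemplate` and `joinSelectors` are hypothetical helpers for illustration only. The library itself presumably resolves scopes against the matched elements rather than rewriting strings, and this sketch does not handle quoted literals.

```javascript
// Hypothetical helpers (not part of the library) that show what scoping means
// by expanding each scoped selector into its full form.
function joinSelectors(scope, selector) {
  return scope ? `${scope} ${selector}` : selector;
}

function flattenTemplate(template, parentScope = '') {
  // An object's `$` selector narrows the scope inherited from its parent.
  const scope = template.$ ? joinSelectors(parentScope, template.$) : parentScope;
  const flat = {};
  for (const [key, value] of Object.entries(template)) {
    if (key === '$') continue; // the scope itself is not an output property
    if (key === '|') {
      flat[key] = value; // object pipes are copied as-is
    } else if (typeof value === 'object' && value !== null) {
      flat[key] = flattenTemplate(value, scope);
    } else if (typeof value === 'string') {
      // Only the selector part (before the first pipe) gets the scope prefix.
      const [selector, ...pipes] = value.split('|').map((part) => part.trim());
      const full = selector === '$' ? scope : joinSelectors(scope, selector);
      flat[key] = [full, ...pipes].join(' | ');
    } else {
      flat[key] = value; // numbers and booleans are literals
    }
  }
  return flat;
}

const scoped = {
  $: 'article',
  headline: '> h1',
  articleText: '> .content',
  author: {
    $: '> .author',
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
  },
};

const flat = flattenTemplate(scoped);
console.log(JSON.stringify(flat, null, 2));
```

In the flattened result, `author.name` becomes `article > .author span.name`, which is why the `span.name` inside `.other` is not picked up.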
Self-selector
In some cases we want to reuse the object selector ($) as a property selector. This is especially handy when targeting lists, as in this case:
const html = `
  <ul>
    <li>One</li>
    <li>Two</li>
    <li>Three</li>
  </ul>
`;

const template = [
  {
    $: 'ul > li',
    value: '$', // uses `ul > li` as property selector
  },
];

const result = cheerioJsonMapperSync(html, template);
console.log(result);
// Output:
// [
//   { value: 'One' },
//   { value: 'Two' },
//   { value: 'Three' }
// ]
Note: Don't like the $ name for the scope selector? Change it through options: cheerioJsonMapperSync(html, template, { scopeProp: '__scope' }).
Pipes
Sometimes the text content of a selected node is not what we need, or not enough. Pipes to the rescue!
Pipes are functions that can be applied to a value, whether it comes from a property selector or an object. Use pipes to handle any custom needs.
Multiple pipes are supported (separated by the | character) and run in sequence. The value returned from one pipe is passed to the next, which lets us chain functionality (much like *nix terminal pipes, which inspired this syntax).
Pipes can take basic arguments: add a colon (:) after the pipe name, followed by semicolon-separated (;) values.
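The syntax described above can be illustrated with a small parser. This is a minimal sketch under stated assumptions, not the library's real parsing code: `parsePipeExpression` is a hypothetical name, and the naive split on `|` would break on selectors that themselves contain that character (for example `[lang|=en]`).

```javascript
// Hypothetical parser (not part of the library): split a selector prop into
// its selector and an ordered list of pipes with their arguments.
function parsePipeExpression(expression) {
  const [selector, ...pipeParts] = expression.split('|').map((part) => part.trim());
  const pipes = pipeParts.filter(Boolean).map((part) => {
    // Everything after the first colon is the argument list.
    const colonIndex = part.indexOf(':');
    if (colonIndex === -1) {
      return { name: part, args: [] };
    }
    return {
      name: part.slice(0, colonIndex),
      // Arguments are separated by semicolons.
      args: part.slice(colonIndex + 1).split(';').map((arg) => arg.trim()),
    };
  });
  return { selector, pipes };
}

const parsed = parsePipeExpression('a[href^=mailto:] | attr:href | substr:7');
console.log(JSON.stringify(parsed, null, 2));
```

The colon inside the selector (`mailto:`) is left alone because only the segments after the first `|` are treated as pipes.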
Use pipes in selector props
{
  email: 'a[href^=mailto:] | attr:href | substr:7',
}
Use pipes in objects
{
  name: 'span.name',
  email: 'a[href^=mailto:] | attr:href | substr:7',
  telephone: 'span.tel',
  '|': 'requiredProps:name;email'
}
Note: Don't like the | name for the pipe property? Change it through options: cheerioJsonMapperSync(html, template, { pipeProp: '__pipes' }).
Default pipes included
- text - grab .textContent from the selected node (used by default if no other pipes are specified)
- trim - trim grabbed text
- lower - turn grabbed text to lower case
- upper - turn grabbed text to upper case
- substr - get substring of grabbed text
- default - if value is nullish/empty, use specified fallback value
- parseAs - parse a string to different types:
  - parseAs:number - number
  - parseAs:int - integer
  - parseAs:float - float
  - parseAs:bool - boolean
  - parseAs:date - date
  - parseAs:json - JSON
- log - will console.log current value (use for debugging)
- attr - get attribute value from selected node
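To illustrate how chained pipes pass a value along, here is a minimal sketch with stand-ins for a few of the default pipes, operating on plain strings. Everything in it is an assumption for illustration: `sketchPipes` and `runPipes` are hypothetical names, the `substr` argument order (start;length) is assumed, and the library's own pipes may behave differently in edge cases.

```javascript
// Hypothetical stand-ins for some default pipes (not the library's code).
const sketchPipes = {
  trim: ({ value }) => (value == null ? value : String(value).trim()),
  lower: ({ value }) => (value == null ? value : String(value).toLowerCase()),
  upper: ({ value }) => (value == null ? value : String(value).toUpperCase()),
  // Assumed argument order: start index, then optional length.
  substr: ({ value, args }) => {
    if (value == null) return value;
    const text = String(value);
    const start = Number(args[0] ?? 0);
    return args[1] === undefined
      ? text.slice(start)
      : text.slice(start, start + Number(args[1]));
  },
  // Fall back to the first argument when the value is nullish or empty.
  default: ({ value, args }) => (value == null || value === '' ? args[0] : value),
};

// Run pipes in sequence; each pipe receives the previous pipe's return value.
function runPipes(initialValue, pipes) {
  return pipes.reduce((value, { name, args = [] }) => {
    const pipe = sketchPipes[name];
    if (!pipe) throw new Error(`Unknown pipe: ${name}`);
    return pipe({ value, args });
  }, initialValue);
}

// Hypothetical input value, as if read from an href attribute.
const email = runPipes('  mailto:Jane@Example.com ', [
  { name: 'trim' },
  { name: 'substr', args: ['7'] },
  { name: 'lower' },
]);
console.log(email); // jane@example.com

const fallback = runPipes('', [{ name: 'default', args: ['n/a'] }]);
console.log(fallback); // n/a
```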
Custom pipes
Create your own pipes to handle any customization needed.
const customPipes = {
  /** Replace any http:// link into https:// */
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),

  /** Check if all required props exists - and if not, set object to undefined */
  requiredProps: ({ value, args }) => {
    const obj = value; // as this should be run as object pipe, value should be an object
    const requiredProps = args; // string array
    const hasMissingProps = requiredProps.some((prop) => obj[prop] == null);
    return hasMissingProps ? undefined : obj;
  },
};

const template = [
  {
    name: 'span.name',
    telephone: 'span.tel',
    email: 'a[href^=mailto:] | attr:href | substr:7',
    website: 'a[href^=http] | attr:href | onlyHttps',
    '|': 'requiredProps:name;email',
  },
];

const contacts = cheerioJsonMapperSync(html, template, { pipeFns: customPipes });
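Since custom pipes are plain synchronous functions, they can be tried out directly without running the mapper, which is convenient for unit tests. The sketch below calls the two pipes from the example with hand-written input. It assumes a pipe only needs the `value` and `args` fields shown above (the library may pass more), and the input values are hypothetical.

```javascript
// Same pipe functions as in the README example, called directly.
const customPipes = {
  onlyHttps: ({ value }) => value?.toString().replace(/^http:/, 'https:'),
  requiredProps: ({ value, args }) => {
    const hasMissingProps = args.some((prop) => value[prop] == null);
    return hasMissingProps ? undefined : value;
  },
};

// Hypothetical inputs, shaped like what a pipe receives.
const upgraded = customPipes.onlyHttps({ value: 'http://example.com/about' });
console.log(upgraded); // https://example.com/about

const complete = customPipes.requiredProps({
  value: { name: 'John Doe', email: 'john@example.com' },
  args: ['name', 'email'],
});
console.log(complete); // the object is kept, since both props are present

const incomplete = customPipes.requiredProps({
  value: { name: 'John Doe', email: undefined },
  args: ['name', 'email'],
});
console.log(incomplete); // undefined, since email is missing
```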
Examples
More examples are found in the repo's tests/cases folder.
Change Log
v1.0.3 - 2023-04-11
- Fixed bug with getScope() method.
v1.0.2 - 2023-04-04
- Fixed bug when scoped object selectors don't match anything.
- Support self-selector.
- Align how default pipes should behave.
v1.0.1 - 2023-03-28
- Updated README
v1.0.0 - 2023-03-28
- First release with initial functionality:
  - End-result structure first approach
  - Scoping
  - Pipes with default setup of pipe funcs