html-extract-js
v0.1.9
Extract HTML documents for collecting metadata and core context.
html-extract-js
html-extract-js is a JavaScript library that extracts metadata and core contextual information from the HTML documents of arbitrary web pages.
This library was created at Additor, where it is used for web scraping.
Installation
Using npm:
$ npm install --save html-extract-js
API
Load
First, pass the HTML document as a String or a Buffer. Once the document is ready for extraction, load it into an HtmlExtractor.
const HtmlExtractor = require('html-extract-js');
const extractor = HtmlExtractor.load(html);
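As a rough illustration of the two accepted input types, the sketch below reads an HTML file from disk either as a Buffer or as a String before handing it to HtmlExtractor.load. The helper names (readHtmlAsBuffer, readHtmlAsString, loadExtractorFromFile) are hypothetical and not part of the library.

```javascript
const fs = require('fs');

// Hypothetical helper: read the raw bytes of an HTML file as a Buffer.
function readHtmlAsBuffer(filePath) {
  return fs.readFileSync(filePath);
}

// Hypothetical helper: read an HTML file as a String (UTF-8 assumed by default).
function readHtmlAsString(filePath, encoding = 'utf8') {
  return fs.readFileSync(filePath, encoding);
}

// Hypothetical helper: load an extractor straight from a file on disk.
// The library is required lazily so the two readers above work on their own.
function loadExtractorFromFile(filePath) {
  const HtmlExtractor = require('html-extract-js');
  return HtmlExtractor.load(readHtmlAsBuffer(filePath));
}
```

Passing a Buffer is probably the safer choice when the page encoding is unknown, since the bytes reach the extractor untouched.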
HtmlExtractor
The HtmlExtractor uses cheerio and iconv-lite to extract a document's information.
The HtmlExtractor is a wrapper class around its sub-extractors. By default, it uses two extractors, ContextExtractor and MetaExtractor.
You can also configure the extractor by passing an option parameter.
const option = {
  charset: 'EUC-KR', // if set, iconv-lite converts the HTML document using this charset
};
const extractor = HtmlExtractor.load(html, option);
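When the charset is not known in advance, one possible approach is to derive it from the HTTP Content-Type header of the response that delivered the page. The sketch below is only an illustration under that assumption; charsetFromContentType and buildLoadOption are hypothetical helpers, not part of html-extract-js.

```javascript
// Hypothetical helper: pull the charset out of a Content-Type header value,
// e.g. 'text/html; charset=EUC-KR'. Returns undefined when none is declared.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
  return match ? match[1] : undefined;
}

// Hypothetical helper: build the option object for HtmlExtractor.load,
// setting `charset` only when the header actually declares one.
function buildLoadOption(contentType) {
  const charset = charsetFromContentType(contentType);
  return charset ? { charset } : {};
}
```

The result can then be passed as the second argument, for example HtmlExtractor.load(html, buildLoadOption(contentTypeHeader)), where contentTypeHeader is whatever header string your HTTP client returned.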
URI
const uri = extractor.getURI(); // "https://additor.io"
Title
const title = extractor.getTitle(); // "Additor :: Just Add it. Be an Additor"
Description
const description = extractor.getDescription(); // "Additor is alchemy that turns your scattered information into well-organized content..."
Thumbnail
const thumbnail = extractor.getThumbnail(); // "https://cdn.additor.io/image/main/landing_temp.png"
Favicon
const favicon = extractor.getFavicon(); // "https://cdn.additor.io/image/logo/favicon.ico"
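The five getters above can be gathered into a single plain object, which may be convenient for storing or serializing a scraped page. This is a minimal sketch; collectMetadata is a hypothetical helper, and it assumes only the getters documented in this README.

```javascript
// Hypothetical helper: gather every documented getter into one plain object.
// `extractor` is expected to be the value returned by HtmlExtractor.load(...).
function collectMetadata(extractor) {
  return {
    uri: extractor.getURI(),
    title: extractor.getTitle(),
    description: extractor.getDescription(),
    thumbnail: extractor.getThumbnail(),
    favicon: extractor.getFavicon(),
  };
}
```

For example, JSON.stringify(collectMetadata(extractor)) would give a compact record of the page.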
License
MIT License