html-explorer
v0.2.1
html-explorer - HTML page explorer
html-explorer extracts the main information from an HTML page.
Currently it extracts:
- Page meta:
  - title
  - description
  - keywords
  - canonical
  - feeds
- Main images - an ordered list of images;
- Main videos - an ordered list of videos;
- Page content - main page content/article;
- Page encoding.
Usage
var explorer = require('html-explorer');
explorer.explore('http://edition.cnn.com/')
  .then(function (page) {
    // page object, see "Result structure"
  })
  .catch(function (err) {
    // handle errors
    console.error(err);
  });
Result structure
- url (String) - input url param;
- href (String) - server response url;
- canonical (String) - page canonical;
- title (String);
- description (String);
- keywords (String);
- content (String);
- encoding (String): utf8, windows-1251, iso-8859-2, etc.;
- feeds ([Feed]) - list of feeds:
  - title (String);
  - href (String) - feed url;
- images ([Image]) - a list of images:
  - src (String) - image src;
  - viewWidth (Number) - image view width, if found;
  - viewHeight (Number);
  - width (Number) - real image width;
  - height (Number);
  - alt (String);
  - title (String);
  - rating (Number) - count of words matching page title words;
  - type (String) - (only if the identify option is true) one of: bmp, gif, jpg, png, psd, svg, tiff or webp;
  - data (Buffer) - (only if the identify option is true) image data;
- videos ([Video]) - a list of videos:
  - sourceType (String) - video source type: URL, YOUTUBE, VIMEO or IFRAME;
  - sourceId (String) - depends on sourceType: url or source id;
  - width (Number) - video width;
  - height (Number) - video height.
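As a rough illustration of how the result structure above might be consumed, the sketch below reduces a page object to a short summary. `summarizePage` is a hypothetical helper, not part of html-explorer, and `samplePage` is made-up data shaped like the documented fields.

```javascript
// Hypothetical helper (not part of html-explorer): reduces a page object
// with the documented fields to a small summary.
function summarizePage(page) {
  const images = page.images || [];
  const videos = page.videos || [];
  // Pick the image with the highest rating; ties keep the earlier image.
  const bestImage = images.reduce(function (best, img) {
    return best === null || (img.rating || 0) > (best.rating || 0) ? img : best;
  }, null);
  return {
    // Prefer the canonical url, then the response url, then the input url.
    url: page.canonical || page.href || page.url,
    title: page.title || '',
    image: bestImage ? bestImage.src : null,
    video: videos.length ? videos[0].sourceType + ':' + videos[0].sourceId : null
  };
}

// Made-up sample data shaped like the documented result.
const samplePage = {
  url: 'http://example.com/a',
  href: 'http://example.com/article',
  canonical: 'http://example.com/article',
  title: 'Sample article',
  images: [
    { src: 'http://example.com/1.jpg', rating: 1 },
    { src: 'http://example.com/2.jpg', rating: 3 }
  ],
  videos: [{ sourceType: 'YOUTUBE', sourceId: 'abc123' }]
};

console.log(summarizePage(samplePage));
```

With a real page, the same helper could be called inside the `.then` callback of `explorer.explore`.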
API
explorer.explore(url, [options])
Explores a URL. Returns a promise that resolves to the page object.
Options
- page - html page options:
  - timeout (Number) [5000] - request timeout;
  - headers (Object) [{}] - request headers;
  - canonical (Boolean) [true] - whether to find the page canonical;
  - feeds (Boolean|Function) - whether to find feeds; a function is used to validate each feed;
  - validator (Function) [noop] - validates the page after exploring its info; throw an error if the page is invalid;
  - html (Boolean|String) [false] - whether to return the HTML text. If a string is given, it is used as the remote HTML body;
  - lang (String) - page language, 2-char code;
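The page options above could be combined as in the following sketch. The option names come from the list; the values are arbitrary, and the assumption that `validator` receives the explored page object as its first argument is not confirmed by this README.

```javascript
// Options object using only the page options listed above.
const pageOptions = {
  page: {
    timeout: 3000,
    headers: { 'User-Agent': 'my-crawler/1.0' }, // arbitrary example header
    canonical: true,
    lang: 'en',
    // Assumption: the validator is called with the page object.
    // Throwing marks the page as invalid.
    validator: function (page) {
      if (!page.title) {
        throw new Error('Page has no title');
      }
    }
  }
};

// Usage (needs the html-explorer module and network access):
// require('html-explorer').explore('http://edition.cnn.com/', pageOptions)
//   .then(function (page) { console.log(page.title); })
//   .catch(function (err) { console.error(err.message); });
```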
- content (Boolean|Object) - content options:
  - filter (Boolean|Object):
    - minLine (Number) [50] - accepted minimum line length;
    - minPhrase (Number) [100] - accepted minimum phrase length;
    - phraseEndRegex (Regex) [/[.!?:;¡¿%]$/] - phrase-ending punctuation regex;
    - phraseEnd (Boolean) [false] - require a phrase to end with punctuation;
    - maxInvalidLines (Number) [3] - maximum consecutive invalid lines;
    - minScore (Number) [0.3] - minimum in-text-search score, from 0 to 1;
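To give a feel for the content filter options, here is a minimal sketch of a line check built from `minLine`, `phraseEnd` and the default `phraseEndRegex`. The rule itself is an assumption made for illustration; the module's actual filtering (including `minPhrase`, `maxInvalidLines` and `minScore`) may work differently. `isValidLine` is a hypothetical name.

```javascript
// Defaults copied from the option list above.
const contentDefaults = {
  minLine: 50,
  phraseEndRegex: /[.!?:;¡¿%]$/,
  phraseEnd: false
};

// Assumed rule, for illustration only: keep a line when it has at least
// minLine characters and, if phraseEnd is set, ends with punctuation.
function isValidLine(line, filter) {
  const f = Object.assign({}, contentDefaults, filter);
  const text = line.trim();
  if (text.length < f.minLine) {
    return false;
  }
  if (f.phraseEnd && !f.phraseEndRegex.test(text)) {
    return false;
  }
  return true;
}

const longLine = 'This line is long enough to pass the default minimum length';
console.log(isValidLine('Too short.'));                       // false
console.log(isValidLine(longLine, { phraseEnd: true }));       // false
console.log(isValidLine(longLine + '.', { phraseEnd: true })); // true
```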
- images (Boolean|Object) - images explorer options:
  - limit (Number) [5] - maximum number of images to return;
  - filter (Object):
    - minViewHeight (Number) [180] - accepted minimum image view height;
    - minViewWidth (Number) [220] - accepted minimum image view width;
    - minHeight (Number) [200] - accepted minimum image height;
    - minWidth (Number) [250] - accepted minimum image width;
    - minRating (Number) [0] - accepted minimum image rating(...);
    - minRatio (Number) [null] - accepted minimum image ratio (ratio = width / height);
    - maxRatio (Number) [null] - accepted maximum image ratio;
    - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all images with width=height;
    - invalidExtensions ([String]) [gif, png] - invalid image extensions;
    - src (RegExp) [see source code] - invalidate image by SRC;
    - extraSrc (RegExp) - invalidate image by SRC;
    - cssClass (RegExp) - filter image by its css class;
    - types (String|[String]) - accepted image types (bmp, gif, jpg, png, psd, svg, tiff, webp), default: ['jpg'];
    - invalidTypes (String|[String]) - invalid image types;
  - identify (Boolean) [false] - identify image width, height and type by downloading data;
  - data (Boolean) [false] - set image data property. Works only if identify is true;
  - timeout (Number) [1000] - image downloading timeout, in ms.
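The size and ratio options of the image filter can be pictured with the sketch below. The defaults are taken from the list above, but the checking logic is an assumption for illustration, not the module's implementation; `passesImageFilter` is a hypothetical name.

```javascript
// Defaults copied from the option list above.
const defaultImageFilter = {
  minViewHeight: 180,
  minViewWidth: 220,
  minHeight: 200,
  minWidth: 250,
  minRatio: null,
  maxRatio: null,
  invalidRatio: 1
};

// Assumed semantics, for illustration only.
function passesImageFilter(image, filter) {
  const f = Object.assign({}, defaultImageFilter, filter);
  // Size checks apply only to dimensions that are known.
  if (image.viewWidth && image.viewWidth < f.minViewWidth) return false;
  if (image.viewHeight && image.viewHeight < f.minViewHeight) return false;
  if (image.width && image.width < f.minWidth) return false;
  if (image.height && image.height < f.minHeight) return false;
  if (image.width && image.height) {
    const ratio = image.width / image.height;
    if (f.minRatio !== null && ratio < f.minRatio) return false;
    if (f.maxRatio !== null && ratio > f.maxRatio) return false;
    // invalidRatio may be a single number or a list of numbers.
    if ([].concat(f.invalidRatio).indexOf(ratio) !== -1) return false;
  }
  return true;
}

console.log(passesImageFilter({ width: 640, height: 480 })); // true
console.log(passesImageFilter({ width: 300, height: 300 })); // false (ratio 1)
```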
- video (Boolean|Object) - video explorer options:
  - limit (Number) [1] - maximum number of videos to return;
  - filter (Object):
    - minHeight (Number) [200] - accepted minimum video height;
    - minWidth (Number) [250] - accepted minimum video width;
    - minRatio (Number) [null] - accepted minimum video ratio (ratio = width / height);
    - maxRatio (Number) [null] - accepted maximum video ratio;
    - invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all videos with width=height;
    - src (RegExp) [see source code] - invalidate video by SRC;
    - extraSrc (RegExp) - invalidate video by SRC;
  - priority ([String]) - video source type priority, default: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'];
  - customFinders ([Finder]) - a list of custom video finders.
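The effect of the `priority` option can be sketched as a sort by source type. This is only an illustration of the idea, assuming videos are ordered by the position of their `sourceType` in the priority list; `sortByPriority` is a hypothetical name and the sample videos are made up.

```javascript
// Default priority copied from the option list above.
const defaultPriority = ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'];

// Returns a new array ordered by the position of sourceType in the
// priority list; unknown source types go last.
function sortByPriority(videos, priority) {
  const order = priority || defaultPriority;
  function rank(video) {
    const index = order.indexOf(video.sourceType);
    return index === -1 ? order.length : index;
  }
  return videos.slice().sort(function (a, b) {
    return rank(a) - rank(b);
  });
}

// Made-up sample data shaped like the documented Video fields.
const sampleVideos = [
  { sourceType: 'IFRAME', sourceId: 'http://example.com/embed/1' },
  { sourceType: 'VIMEO', sourceId: '12345' },
  { sourceType: 'YOUTUBE', sourceId: 'abc123' }
];

console.log(sortByPriority(sampleVideos).map(function (v) {
  return v.sourceType;
})); // [ 'YOUTUBE', 'VIMEO', 'IFRAME' ]
```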
Changelog
v0.1.12 - July 16, 2016
- filter page content by relevancy score option;
- added lang option;
- using ascripe module instead of readability-js;
- using in-text-search module;
v0.1.11 - August 16, 2016
- find videos from known iframes
v0.1.9 - August 15, 2015
- explore content with readability-js
- fix videos explore bug
v0.1.6 - August 3, 2015
- explore videos from microdata
v0.1.5 - August 3, 2015
- filter page content
- better encoding detection & add to the response object
v0.1.4 - August 2, 2015
- tests
- extracting page content
- editorconfig, eslint
v0.1.2 - June 17, 2015
- custom video finders
- sort videos by priority option
- head(og:video) video finder
v0.1.1 - June 13, 2015
- decode page urls
- image downloading timeout
v0.1.0 - May 30, 2015
- detect embedded videos
- better images order
v0.0.8 - May 29, 2015
- detect charset from content-type response header
- image filter: invalidRatio
v0.0.7 - May 22, 2015
- filter images by view size - width & height detected in image attributes
- merge images with same src