npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

html-explorer

v0.2.1

Published

HTML Page Explorer

Downloads

14

Readme

html-explorer - HTML page explorer

html-explorer extracts main information from a HTML page.

Currently it extracts:

  • Page meta:
    • title
    • description
    • keywords
    • canonical
    • feeds
  • Main images - a ordered list of images;
  • Main videos - a ordered list of videos;
  • Page content - main page content/article;
  • Page encoding;

Usage

var explorer = require('html-explorer');
explorer.explore('http://edition.cnn.com/')
.then(function(page){
  // page object
});

Result structure

  • url (String) - input url param;

  • href (String) - server response url;

  • canonical (String) - page canonical;

  • title (String);

  • description (String);

  • keywords (String);

  • content (String);

  • encoding (String): utf8, windows-1251, iso-8859-2, etc.;

  • feeds ([Feed]) - list of feeds:

    • title (String);
    • href (String) - feed url;
  • images ([Image]) - a list of images:

    • src (String) - image src;
    • viewWidth (Number) - image view width if founded;
    • viewHeight (Number);
    • width (Number) - real image width;
    • height (Number);
    • alt (String);
    • title (String);
    • rating (Number) - count of words matching page title words;
    • type (String) - (only if identify option is true) - can be: bmp, gif, jpg, png, psd, svg, tiff or webp;
    • data (Buffer) - (only if identify option is true) - image data.
  • videos ([Video]) - a list of videos:

    • sourceType (String) - video source type: URL, YOUTUBE, VIMEO or IFRAME;
    • sourceId (String) - depends of sourceType: url or source id;
    • width (Number) - video width;
    • height (Number) - video height;

API

explorer.explore(url, [options])

Explores an url.

Options

  • page - html page options:

    • timeout (Number) [5000] - request timeout;
    • headers (Object) [{}]- request headers;
    • canonical (Boolean) [true] - find or not;
    • feeds (Boolean|Function) - find or not, function for validating a feed;
    • validator (Function) [noop] - Validates page after exploring info, throw an error if invalid;
    • html (Boolean|String) [false] - Return HTML text or not. If is string it will be used as remote HTML body;
    • lang (String) - page language 2 chars code;
  • content (Boolean|Object) - content options:

    • filter (Boolean|Object):
      • minLine: (Number) [50] - accepted minimum line length;
      • minPhrase: (Number) [100] - accepted minimum phrase length;
      • phraseEndRegex: (Regex) default: /[.!?:;¡¿%]$/ - end phrase puctuation regex;
      • phraseEnd: (Boolean) [false] - require phrase to end with a puctuation;
      • maxInvalidLines: (Number) [3] - maximum consecutive invalid lines;
      • minScore: (Number) [0.3] - min in text search score: 0 to 1;
  • images (Boolean|Object) - images explorer options:

    • limit (Number) [5] - maximum number of images to return;
    • filter (Object):
      • minViewHeight (Number) [180] - accepted minimum image view height;
      • minViewWidth (Number) [220] - accepted minimum image view width;
      • minHeight (Number) [200] - accepted minimum image height;
      • minWidth (Number) [250] - accepted minimum image width;
      • minRating (Number) [0] - accepted minimum image rating(...);
      • minRatio (Number) [null] - accepted minimum image ratio (ratio=width/height);
      • maxRatio (Number) [null] - accepted maximum image ratio;
      • invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all images with width=height;
      • invalidExtensions ([String]) [gif, png] - invalid image extensions;
      • src (RegExp) [see source code] - invalidate image by SRC;
      • extraSrc (RegExp) - invalidate image by SRC;
      • cssClass (RegExp) - filter image by its css class;
      • types (String|[String]) - accepted image types (bmp, gif, jpg, png, psd, svg, tiff, webp), default: ['jpg'];
      • invalidTypes (String|[String]) - invalid image types;
    • identify (Boolean) [false] - identify image width, height and type by downloading data;
    • data (Boolean) [false] - set image data property. Works only if identify is true.
    • timeout (Number) [1000] - image downloading timeout, in ms.
  • video (Boolean|Object) - video explorer options:

    • limit (Number) [1] - maximum number or videos to return;
    • filter (Object):
      • minHeight (Number) [200] - accepted minimum image height;
      • minWidth (Number) [250] - accepted minimum image width;
      • minRatio (Number) [null] - accepted minimum image ratio (ratio=width/height);
      • maxRatio (Number) [null] - accepted maximum image ratio;
      • invalidRatio (Number | [Number]) [1] - example: value [1] will exclude all images with width=height;
      • src (RegExp) [see source code] - invalidate image by SRC;
      • extraSrc (RegExp) - invalidate image by SRC;
    • priority ([String]) - video source type priority - default: ['YOUTUBE', 'VIMEO', 'URL', 'IFRAME'];
    • customFinders ([Finder]) - a list of custom video fiders.

Changelog

v0.1.12 - July 16, 2016

  • filter page content by relevancy score option;
  • added lang option;
  • using ascripe module instead of readability-js;
  • using in-text-search module;

v0.1.11 - August 16, 2016

  • find videos from known iframes

v0.1.9 - August 15, 2015

  • explore content with readability-js
  • fix videos explore bug

v0.1.6 - August 3, 2015

  • explore videos from microdata

v0.1.5 - August 3, 2015

  • filter page content
  • better encoding detection & add to the response object

v0.1.4 - August 2, 2015

  • tests
  • extracting page content
  • editorconfig, eslint

v0.1.2 - June 17, 2015

  • custom video finders
  • sort videos by priority option
  • head(og:video) video finder

v0.1.1 - June 13, 2015

  • decode page urls
  • image downloading timeout

v0.1.0 - May 30, 2015

  • detect embedded videos
  • better images order

v0.0.8 - May 29, 2015

  • detect charset from content-type response header
  • image filter: invalidRatio

v0.0.7 - May 22, 2015

  • filter images by view size - width & heigth detected in image attributes
  • merge images with same src