seize

v0.1.7

Published

2 years ago

Seize is a light web-page content extractor for Node or the browser, inspired by arc90 Readability and Safari Reader.

peremenov

dom html extract extractor document text significant part content readability reader

Install

npm i --save seize

Usage

Seize can be used with DOM libraries such as jsdom. It only extracts a particular DOM node and prepares it for further use.

Example

// Uses the legacy jsdom API (jsdom before v10); newer versions export a JSDOM class instead
var Seize = require('seize'),
    jsdom = require('jsdom').jsdom;

var window = jsdom('<your html here>').defaultView,
    seize  = new Seize(window.document);

seize.content(); // returns a DOM node
seize.text();    // returns only the text

Browser usage

For browser usage, you should clone your DOM object or create a new one from an HTML string:

/**
 * Converts an HTML string to a Document
 * @param  {String} html  HTML document string
 * @return {Document}     parsed document
 */
function HTMLParser(html) {
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
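
The cloning route can be sketched in the same spirit. This is a minimal sketch, not part of Seize: extractFromClone is a hypothetical helper, and the Seize constructor is passed in as a parameter because how it is exposed in the browser build is not documented here. It assumes the reason for cloning is that extraction may modify the document it is given.

```javascript
/**
 * Hypothetical helper (not part of Seize): runs the extractor on a deep
 * copy of a document so the original is never handed to it.
 * @param  {Document} doc    document to extract from
 * @param  {Function} Seize  the Seize constructor
 * @return {Object}          { node: extracted DOM node, text: extracted text }
 */
function extractFromClone(doc, Seize) {
  // cloneNode(true) makes a deep copy, including all descendants
  var clone = doc.cloneNode(true);
  var seize = new Seize(clone);
  return { node: seize.content(), text: seize.text() };
}

// In a page, assuming Seize is reachable as a global:
// var result = extractFromClone(document, Seize);
```

Passing the constructor in keeps the helper independent of how the library is loaded.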

How it works

Here is how the algorithm works:

Get the HTML tags that are expected to be text or content containers, such as p, table, img, etc.
Filter out unnecessary tags, by content and by tag name, that definitely can't be in a content container
Score each container by the tags it contains
Adjust the score by class name, id name, tag XPath score and text score
Sort the candidates by score
Take the first candidate
Clean up the article
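
The scoring, sorting and selection steps above can be sketched roughly as follows. This is an illustration only, not Seize's actual code: plain objects stand in for DOM nodes, and the function names, regular expressions and weights are all assumptions chosen for the example. The filtering, XPath scoring and cleanup steps are left out.

```javascript
// Illustrative only: plain objects stand in for DOM nodes, and every weight
// and pattern below is an assumption, not a value taken from Seize.
var POSITIVE = /article|content|post|entry|text/i;
var NEGATIVE = /comment|sidebar|footer|nav|banner/i;

// Score one candidate container from its tags, class/id names and text.
function scoreCandidate(node) {
  var score = 0;
  // Contained tags: paragraphs are a strong hint of article content.
  score += (node.paragraphs || 0) * 5;
  // Class and id names: reward likely content names, penalize page chrome.
  var names = (node.className || '') + ' ' + (node.id || '');
  if (POSITIVE.test(names)) score += 25;
  if (NEGATIVE.test(names)) score -= 25;
  // Text: longer text scores higher, capped so one huge block can't dominate.
  score += Math.min(Math.floor((node.text || '').length / 100), 10);
  return score;
}

// Sort candidates by descending score and take the first one.
function pickBest(candidates) {
  var sorted = candidates.slice().sort(function (a, b) {
    return scoreCandidate(b) - scoreCandidate(a);
  });
  return sorted.length ? sorted[0] : null;
}

var best = pickBest([
  { id: 'sidebar', className: 'nav', paragraphs: 1, text: 'Links' },
  { id: 'main', className: 'article-body', paragraphs: 6,
    text: new Array(51).join('Lorem ipsum. ') },
  { id: 'comments', className: 'comment-list', paragraphs: 3, text: 'Nice post!' }
]);
console.log(best.id); // the 'main' container wins under these weights
```

Sorting a copy with slice() leaves the input list untouched, which makes it easier to inspect the scores of the losing candidates afterwards.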

Todo

Seize is still in development, so use it at your own risk. You can always help to improve it.

Improve readme
Improve text scoring
Improve detection of pages that can't be extracted
More tests
More examples

Contributing

You are welcome to improve this small piece of software :)

Author

Kir Peremenov
