html-chunk-process

v1.0.1

Published

3 years ago

html-chunk-process chunks HTML to a collection of the largest possible blocks of code, processes these chunks by a custom processor, and then returns the processed chunks after stitching them back together

Downloads

0High
0Medium
0Low

tommedema

html chunk split

#node-html-chunk-process

Do you need to access an HTML-digesting API with a request payload limit? Distributing your payload across multiple requests is incredibly complex in this case, since HTML defines a hierarchical structure that cannot be split in a linear way without breaking context.

This library aims to help by chunking a HTML document (defined as a string) to a collection of the largest possible blocks of valid HTML (where a character length limit defines the boundary). Each chunk is then processed by a passed-in asynchronous processing function (which typically invokes your external API), after which the processed chunks are intelligently stitched back together.

##Install

npm install html-chunk-process

##Why? This is useful when you need to process HTML using an API with a request payload limit (such as a translation library), but cannot send invalid chunks of HTML or when the APIs effectiveness requires context. A naive string split does not account for either of these cases, but html-chunk-process does.

##Illustrative example

Take the following HTML document:

<!DOCTYPE html>
<html class="test">
    <head>
      <title>Hi there</title>
    </head>
    <body>
      This is a page a simple page
      <div>
          and here is more content we don't want
      </div>
      Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children.
    </body>
</html>

html-chunk-process breaks the HTML into valid chunks of HTML where each chunk has a total length less than or equal to the given limit. This works by decomposing the HTML into chunks, like so (with a length limit of 100, although typically the limit would be much greater, such as 10k characters):

{
    tag: 'root',
    attribs: {},
    children: [{
        fragmentPreProcessed: '<!DOCTYPE html>',
        fragmentPostProcessed: '<!DOCTYPE html>'
    }, {
        tag: 'html',
        attribs: {
            class: 'test'
        },
        children: [{
            fragmentPreProcessed: '<head>\n      <title>Hi there</title>\n    </head>',
            fragmentPostProcessed: '<head>\n      <title>Goodbye</title>\n    </head>'
        }, {
            tag: 'body',
            attribs: {},
            children: [{
                fragmentPreProcessed: 'This is a page a simple page',
                fragmentPostProcessed: 'This is a page a simple page'
            }, {
                fragmentPreProcessed: '<div>\n          and here is more content we don\'t want\n      </div>',
                fragmentPostProcessed: '<div>\n          and here is more content we don\'t want\n      </div>'
            }]
        }]
    }]
}

These processed chunks are then stitched back together (and optionally beautified when beautify: true is passed as an option), giving you a processed result like the following:

<!DOCTYPE html>
<html class="test">
<head>
    <title>It works</title>
</head>
<body>
    This is a page a simple page
    <div>
        and here is more content we do want
    </div>
</body>
</html>

##Note If an element has no children but exceeds the limit length it is not included in the result (because there is no reliable way to chunk it). However, all excluded elements are returned as a third parameter in the callback function. See the following code example.

##Code example

test1.html

<!DOCTYPE html>
<html class="test">
    <head>
      <title>Hi there</title>
    </head>
    <body>
      This is a page a simple page
      <div>
          and here is more content we don't want
      </div>
      Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children.
    </body>
</html>

example/index.js

var chunkProcessHTML = require('../');
var fs               = require('fs');
var originalHTML     = fs.readFileSync(__dirname + '/../test/input/test1.html', {encoding: 'utf8'});

chunkProcessHTML({
    lengthInt   : 100,
    originalHTML: originalHTML,
    beautify    : true,
    processorFn : processAsync
}, function(err, processedHTML, excludedFragments)
{
    console.log(
        'original:\n'   +
        '%s\n\n'        +

        'result:\n'     +
        '%s\n\n'        +

        'excluded:\n'   +
        '%j',

        originalHTML, processedHTML, excludedFragments
    );
});

function processAsync(htmlFragment, cb)
{
    //typically this would invoke an external HTML-digesting API with a payload limit
    htmlFragment = htmlFragment.replace('Hi there', 'It works').replace('don\'t', 'do');
    setTimeout(function()
    {
        cb(htmlFragment);
    }, 1000);   
}

output (result)

<!DOCTYPE html>
<html class="test">
<head>
    <title>It works</title>
</head>
<body>
    This is a page a simple page
    <div>
        and here is more content we do want
    </div>
</body>
</html>

output (excluded)

["Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children."]

##Test

npm test

Tests require mocha. The current tests are very minimal, feel free to add more tests and submit a pull request.

Pkg
Stats

Discover Tips

General search

Package details

User packages

Sponsor

About

Twitter

GitHub

Twitter

GitHub

Site

Open Software & Tools

Framework

Server

Data Store

Caching

CSS / Styling

Typeface

Avatars

Data Viz

Date formatting

Infinite scrolling

Markdown rendering

Repository url parsing

User data

Compiling

Types

Odds & Ends

html-chunk-process

v1.0.1

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme