html-chunk-process

v1.0.1

html-chunk-process splits HTML into a collection of the largest possible blocks of code, passes each chunk to a custom processor, and then stitches the processed chunks back together and returns the result.

#node-html-chunk-process

Do you need to access an HTML-digesting API with a request payload limit? Distributing your payload across multiple requests is incredibly complex in this case, since HTML defines a hierarchical structure that cannot be split in a linear way without breaking context.

This library aims to help by chunking an HTML document (given as a string) into a collection of the largest possible blocks of valid HTML, where a character length limit defines the boundary. Each chunk is then passed to an asynchronous processing function that you supply (which typically invokes your external API), after which the processed chunks are stitched back together.

##Install

npm install html-chunk-process

##Why?

This is useful when you need to process HTML using an API with a request payload limit (such as a translation library), but cannot send invalid chunks of HTML, or when the API's effectiveness depends on context. A naive string split does not account for either of these cases, but html-chunk-process does.
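
To make the problem with a naive split concrete, here is a minimal sketch. The `naiveSplit` helper is hypothetical and not part of html-chunk-process; it simply cuts a string every `limit` characters without looking at the markup.

```javascript
// Hypothetical helper, not part of html-chunk-process:
// cuts a string into pieces of at most `limit` characters.
function naiveSplit(html, limit)
{
    var pieces = [];
    for (var i = 0; i < html.length; i += limit)
    {
        pieces.push(html.slice(i, i + limit));
    }
    return pieces;
}

var html   = '<div><p>Hello world</p></div>';
var pieces = naiveSplit(html, 10);

// The first piece opens <div> and <p> without closing them,
// and the second piece ends in the middle of a closing tag.
console.log(pieces);
// [ '<div><p>He', 'llo world<', '/p></div>' ]
```

None of the three pieces is valid HTML on its own, and the sentence "Hello world" is cut in half, so an API that depends on context would see neither complete markup nor complete text.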

##Illustrative example

Take the following HTML document:

<!DOCTYPE html>
<html class="test">
    <head>
      <title>Hi there</title>
    </head>
    <body>
      This is a page a simple page
      <div>
          and here is more content we don't want
      </div>
      Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children.
    </body>
</html>

html-chunk-process breaks the HTML into valid chunks of HTML where each chunk has a total length less than or equal to the given limit. This works by decomposing the HTML into chunks, like so (with a length limit of 100, although typically the limit would be much greater, such as 10k characters):

{
    tag: 'root',
    attribs: {},
    children: [{
        fragmentPreProcessed: '<!DOCTYPE html>',
        fragmentPostProcessed: '<!DOCTYPE html>'
    }, {
        tag: 'html',
        attribs: {
            class: 'test'
        },
        children: [{
            fragmentPreProcessed: '<head>\n      <title>Hi there</title>\n    </head>',
            fragmentPostProcessed: '<head>\n      <title>It works</title>\n    </head>'
        }, {
            tag: 'body',
            attribs: {},
            children: [{
                fragmentPreProcessed: 'This is a page a simple page',
                fragmentPostProcessed: 'This is a page a simple page'
            }, {
                fragmentPreProcessed: '<div>\n          and here is more content we don\'t want\n      </div>',
                fragmentPostProcessed: '<div>\n          and here is more content we do want\n      </div>'
            }]
        }]
    }]
}
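
One way such a decomposition could be computed is sketched below. This is an illustration under stated assumptions, not the library's actual implementation: the input is an already-parsed tree of hypothetical `{ tag, html, children }` nodes, where `html` holds the node's complete serialized markup, and `chunkNode` is a made-up name.

```javascript
// Illustrative only; node shape and function name are assumptions.
// Returns a chunk tree and collects unsplittable, over-limit nodes in `excluded`.
function chunkNode(node, limit, excluded)
{
    if (node.html.length <= limit)
    {
        // Fits within the limit: keep the whole node as one fragment.
        return { fragmentPreProcessed: node.html };
    }
    if (!node.children || node.children.length === 0)
    {
        // Too long and nothing to descend into: set it aside.
        excluded.push(node.html);
        return null;
    }
    // Too long but splittable: descend, keeping children that produced a chunk.
    var children = node.children
        .map(function(child) { return chunkNode(child, limit, excluded); })
        .filter(function(chunk) { return chunk !== null; });
    return { tag: node.tag, children: children };
}

var excluded = [];
var longText = 'long text '.repeat(30); // 300 characters, no children
var body     = {
    tag     : 'body',
    html    : '<body>short text<div>more</div>' + longText + '</body>',
    children: [
        { html: 'short text' },
        { tag: 'div', html: '<div>more</div>', children: [{ html: 'more' }] },
        { html: longText }
    ]
};

var tree = chunkNode(body, 100, excluded);
console.log(JSON.stringify(tree, null, 4));
console.log('excluded fragments: %d', excluded.length);
```

Because the size check comes before the descent, a node that fits is never split further, which is what keeps each chunk as large as the limit allows.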

These processed chunks are then stitched back together (and optionally beautified when beautify: true is passed as an option), giving you a processed result like the following:

<!DOCTYPE html>
<html class="test">
<head>
    <title>It works</title>
</head>
<body>
    This is a page a simple page
    <div>
        and here is more content we do want
    </div>
</body>
</html>
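
The stitching step can be pictured as a recursive walk over the chunk tree shown earlier. The sketch below is an assumption about how that might work, not the library's code: `stitch` and `openTag` are hypothetical names, the tree reuses the `tag`, `attribs`, `children` and `fragmentPostProcessed` fields from the structure above, and attribute values are not escaped.

```javascript
// Hypothetical helper: rebuilds an opening tag from a node's tag and attribs.
// Attribute values are written as-is (no escaping) to keep the sketch short.
function openTag(node)
{
    var attribs = Object.keys(node.attribs || {}).map(function(name)
    {
        return ' ' + name + '="' + node.attribs[name] + '"';
    });
    return '<' + node.tag + attribs.join('') + '>';
}

// Hypothetical helper: concatenates processed fragments depth-first.
function stitch(node)
{
    if (typeof node.fragmentPostProcessed === 'string')
    {
        // Leaf chunk: use the processed markup directly.
        return node.fragmentPostProcessed;
    }
    var inner = node.children.map(function(child) { return stitch(child); }).join('');
    if (node.tag === 'root')
    {
        // The synthetic root has no markup of its own.
        return inner;
    }
    return openTag(node) + inner + '</' + node.tag + '>';
}

var tree = {
    tag: 'root', attribs: {}, children: [
        { fragmentPostProcessed: '<!DOCTYPE html>' },
        { tag: 'html', attribs: { class: 'test' }, children: [
            { fragmentPostProcessed: '<head><title>It works</title></head>' },
            { tag: 'body', attribs: {}, children: [
                { fragmentPostProcessed: 'Some text' },
                { fragmentPostProcessed: '<div>more text</div>' }
            ] }
        ] }
    ]
};

var stitched = stitch(tree);
console.log(stitched);
```

Elements that were split (here `html` and `body`) are rebuilt from their tag and attributes, while everything that was sent to the processor is taken from `fragmentPostProcessed` unchanged.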

##Note

If an element has no children but exceeds the length limit, it is left out of the result, because there is no reliable way to chunk it. All excluded elements are, however, passed to the callback function as its third argument. See the following code example.

##Code example

test1.html

<!DOCTYPE html>
<html class="test">
    <head>
      <title>Hi there</title>
    </head>
    <body>
      This is a page a simple page
      <div>
          and here is more content we don't want
      </div>
      Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children.
    </body>
</html>

example/index.js

var chunkProcessHTML = require('../');
var fs               = require('fs');
var originalHTML     = fs.readFileSync(__dirname + '/../test/input/test1.html', {encoding: 'utf8'});

chunkProcessHTML({
    lengthInt   : 100,
    originalHTML: originalHTML,
    beautify    : true,
    processorFn : processAsync
}, function(err, processedHTML, excludedFragments)
{
    if (err)
    {
        //stop here if chunking or processing failed
        return console.error(err);
    }

    console.log(
        'original:\n'   +
        '%s\n\n'        +

        'result:\n'     +
        '%s\n\n'        +

        'excluded:\n'   +
        '%j',

        originalHTML, processedHTML, excludedFragments
    );
});

function processAsync(htmlFragment, cb)
{
    //typically this would invoke an external HTML-digesting API with a payload limit
    htmlFragment = htmlFragment.replace('Hi there', 'It works').replace('don\'t', 'do');
    setTimeout(function()
    {
        cb(htmlFragment);
    }, 1000);
}

output (result)

<!DOCTYPE html>
<html class="test">
<head>
    <title>It works</title>
</head>
<body>
    This is a page a simple page
    <div>
        and here is more content we do want
    </div>
</body>
</html>

output (excluded)

["Here is content that is very long but doesnt have any children. Really there is no way to know how to chunk it in a reliable, cross-script, and cross-language manner. Skipping this fragment should not be a problem with typical APIs because those will allow over thousands of characters, at which point this would not be a fragment without children."]
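
If you prefer promises over callbacks, the call can be wrapped. The sketch below assumes only the callback signature used in the example above, `(err, processedHTML, excludedFragments)`. `chunkProcessAsync` is a hypothetical helper name, and the library function is passed in as an argument (with a stand-in used for the demonstration) so that the snippet runs without the package installed.

```javascript
// Hypothetical wrapper: turns the callback-style call into a promise.
function chunkProcessAsync(chunkProcessHTML, options)
{
    return new Promise(function(resolve, reject)
    {
        chunkProcessHTML(options, function(err, processedHTML, excludedFragments)
        {
            if (err)
            {
                reject(err);
                return;
            }
            resolve({
                processedHTML    : processedHTML,
                excludedFragments: excludedFragments
            });
        });
    });
}

// Stand-in for require('html-chunk-process'), for demonstration only.
// It runs the processor once on the whole document and excludes nothing.
function fakeChunkProcessHTML(options, cb)
{
    options.processorFn(options.originalHTML, function(processed)
    {
        cb(null, processed, []);
    });
}

var demo = chunkProcessAsync(fakeChunkProcessHTML, {
    lengthInt   : 100,
    originalHTML: '<p>Hi there</p>',
    processorFn : function(htmlFragment, cb)
    {
        cb(htmlFragment.replace('Hi there', 'It works'));
    }
});

demo.then(function(result)
{
    console.log(result.processedHTML);
    console.log(result.excludedFragments);
});
```

With the real package you would pass the function returned by `require('html-chunk-process')` as the first argument instead of the stand-in.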

##Test

npm test

Tests require mocha. The current tests are minimal; feel free to add more and submit a pull request.