html-juicer

v0.2.0

[![CircleCI](https://circleci.com/gh/fragment0/html-juicer.svg?style=svg)](https://circleci.com/gh/fragment0/html-juicer)

A simpler way to clean up third-party web pages. Similar to arc90 readability.

Why

arc90 readability is widely used for getting a clean view of a web page. But its algorithm has some shortcomings, so some pages get a wrong result.

The arc90 readability algorithm first calculates a score for every paragraph and adds the paragraph, its parentNode, and its parentNode's parentNode to a candidate list, then picks the topCandidate with the highest score. With a candidate in hand, arc90 then walks through its siblings looking for other possible content. So under this algorithm, the traversal reaches at most 4 levels deep when looking for a top candidate.
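To make the depth limit concrete, here is a rough sketch of a bottom-up pass on a toy tree. It is not arc90's actual code: the `TreeNode` shape, the `el` builder, the comma-based `scoreText` heuristic, and the half weight on the grandparent are all assumptions made for illustration. Only the "paragraph, parent, grandparent" propagation follows the description above.

```typescript
// Minimal stand-in for a DOM node (hypothetical, not the jsdom or arc90 types).
interface TreeNode {
  id: string;
  text: string; // non-empty only for paragraphs in this sketch
  children: TreeNode[];
  parent: TreeNode | null;
  score: number;
}

function el(id: string, children: TreeNode[] = [], text = ""): TreeNode {
  const n: TreeNode = { id, text, children, parent: null, score: 0 };
  for (const c of children) c.parent = n;
  return n;
}

// Assumed heuristic: one base point plus one point per comma.
function scoreText(text: string): number {
  return 1 + text.split(",").length - 1;
}

// Depth-first list of a node and all of its descendants.
function flatten(root: TreeNode): TreeNode[] {
  const out: TreeNode[] = [root];
  for (const c of root.children) out.push(...flatten(c));
  return out;
}

// Bottom-up pass: a paragraph only credits itself, its parent and its grandparent.
function topCandidateBottomUp(root: TreeNode): TreeNode {
  const all = flatten(root);
  for (const p of all) {
    if (p.text === "") continue;
    const s = scoreText(p.text);
    p.score += s;
    if (p.parent) {
      p.parent.score += s;
      if (p.parent.parent) p.parent.parent.score += s / 2; // assumed half weight
    }
  }
  let best = root;
  for (const n of all) if (n.score > best.score) best = n;
  return best;
}

// An article split into two deeply nested sections.
const sectionA = el("sectionA", [el("wrapA", [el("bodyA", [
  el("p1", [], "one, two"), el("p2", [], "three, four"), el("p3", [], "five"),
])])]);
const sectionB = el("sectionB", [el("wrapB", [el("bodyB", [
  el("p4", [], "six"), el("p5", [], "seven"),
])])]);
const page = el("page", [el("article", [sectionA, sectionB])]);

const topCandidate = topCandidateBottomUp(page);
const covered = flatten(topCandidate).map((n) => n.id);
console.log(topCandidate.id);    // bodyA
console.log(covered.join(", ")); // bodyA, p1, p2, p3
```

In this toy tree the winner is `bodyA`, which has no siblings, so the paragraphs under `sectionB` are never reached.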

But in reality, many well-known blog sites use very deeply nested structures for their content. Take this article on Medium: arc90 readability only gets the first section of the whole article. The bottom-up traversal can't do anything about it.

How we implement it

So html-juicer uses a top-down traversal instead.

We first calculate every paragraph's score like arc90 does, but we also score every parentNode until we reach the root. Then we traverse down the DOM tree to find the most likely root of the article. This is the final target. Simple, right? 🤓
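The two steps can be sketched roughly as below. This is an illustration under stated assumptions, not html-juicer's real implementation: the `TreeNode` type, the `scoreText` heuristic, and especially the stopping rule (keep descending while a single child holds at least `minShare` of the current node's score) are made up for the example, since the text above does not say how the "most likely root" is chosen.

```typescript
// Minimal stand-in for a DOM node (hypothetical, not the jsdom types).
interface TreeNode {
  id: string;
  text: string; // non-empty only for paragraphs in this sketch
  children: TreeNode[];
  parent: TreeNode | null;
  score: number;
}

function el(id: string, children: TreeNode[] = [], text = ""): TreeNode {
  const n: TreeNode = { id, text, children, parent: null, score: 0 };
  for (const c of children) c.parent = n;
  return n;
}

// Assumed heuristic: one base point plus one point per comma.
function scoreText(text: string): number {
  return 1 + text.split(",").length - 1;
}

// Step 1: every paragraph credits itself and all of its ancestors up to the root.
function scoreAncestors(n: TreeNode): void {
  if (n.text !== "") {
    const s = scoreText(n.text);
    for (let a: TreeNode | null = n; a !== null; a = a.parent) a.score += s;
  }
  for (const c of n.children) scoreAncestors(c);
}

// Step 2: walk down while one child keeps most of the score (assumed rule).
function findArticleRoot(root: TreeNode, minShare = 0.9): TreeNode {
  let current = root;
  for (;;) {
    let heaviest: TreeNode | null = null;
    for (const c of current.children) {
      if (heaviest === null || c.score > heaviest.score) heaviest = c;
    }
    if (heaviest === null || current.score === 0) return current;
    if (heaviest.text !== "") return current; // never descend into a paragraph
    if (heaviest.score < current.score * minShare) return current;
    current = heaviest;
  }
}

// An article split into two deeply nested sections.
const sectionA = el("sectionA", [el("wrapA", [el("bodyA", [
  el("p1", [], "one, two"), el("p2", [], "three, four"), el("p3", [], "five"),
])])]);
const sectionB = el("sectionB", [el("wrapB", [el("bodyB", [
  el("p4", [], "six"), el("p5", [], "seven"),
])])]);
const page = el("page", [el("article", [sectionA, sectionB])]);

scoreAncestors(page);
const articleRoot = findArticleRoot(page);
console.log(articleRoot.id); // article
```

Because the score reaches every ancestor, the walk stops at `article`, the first node whose content is split across several children, so both sections end up inside the result.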

More things

With the article root in hand, we do more work based on the caller's config. With the default config, we remove the h1 tag, clean all useless attributes, and rewrite each resource's src to a correct value. All helper methods are well tested in helpers.test.ts.
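As an illustration of the src rewriting step, a relative src can be resolved against the page URL with the standard `URL` constructor. The `resolveSrc` helper below is hypothetical and is not taken from html-juicer's helpers. It simply assumes that "a correct value" means an absolute URL resolved against the `url` config option.

```typescript
// Hypothetical helper: resolve a possibly relative src against the page URL.
function resolveSrc(src: string, pageUrl: URL | string | null): string {
  if (pageUrl === null) return src; // nothing to resolve against
  try {
    return new URL(src, pageUrl).href;
  } catch {
    return src; // leave unparseable values untouched
  }
}

const base = "https://example.com/blog/post";
console.log(resolveSrc("/img/a.png", base));              // https://example.com/img/a.png
console.log(resolveSrc("b.png", base));                   // https://example.com/blog/b.png
console.log(resolveSrc("//cdn.example.com/c.js", base));  // https://cdn.example.com/c.js
console.log(resolveSrc("b.png", null));                   // b.png
```

Already absolute values pass through unchanged, and with `url` left at `null` there is nothing to resolve against, so the original src is kept.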

Usage

Currently html-juicer only works in Node.js.

npm i html-juicer

import {Juicer} from 'html-juicer'

new Juicer(
  html: string,
  config?: {
    useHeaderAsTitle?: boolean
    cleanH1?: boolean
    cleanAttribute?: boolean
    url?: URL | string | null
  },
): {
  content: string
  title: string
}

Config

|name|description|default|
|-|-|-|
|useHeaderAsTitle|use h1 as result title or document.title|true|
|cleanH1|remove h1 tag in article root|true|
|cleanAttribute|clean useless attribute|true|
|url|the url of html|null|
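For reference, the defaults in the table can be written down as a typed object. The `JuicerConfig` name, `DEFAULT_CONFIG`, and `withDefaults` are hypothetical names used only for this sketch. The package may merge its defaults differently.

```typescript
// Option names and defaults copied from the table above; the type name is assumed.
interface JuicerConfig {
  useHeaderAsTitle?: boolean;
  cleanH1?: boolean;
  cleanAttribute?: boolean;
  url?: URL | string | null;
}

const DEFAULT_CONFIG: Required<JuicerConfig> = {
  useHeaderAsTitle: true,
  cleanH1: true,
  cleanAttribute: true,
  url: null,
};

// Hypothetical merge: caller-supplied options win over the defaults.
function withDefaults(config: JuicerConfig = {}): Required<JuicerConfig> {
  return { ...DEFAULT_CONFIG, ...config };
}

const merged = withDefaults({ cleanH1: false, url: "https://example.com/post" });
console.log(merged.cleanH1);          // false
console.log(merged.useHeaderAsTitle); // true
```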

Dependencies

html-juicer depends only on jsdom.