find-main-content

v1.2.0


Find The Main Content In An HTML Page

Module for finding the main content on an HTML page with the help of Cheerio. It can return that content as Markdown, plain text, or HTML.

It strips page chrome such as the header, footer, menus, and sidebars.

Installation

$ npm install find-main-content -S

You also need to install Cheerio:

$ npm install cheerio -S

Simple usage

const cheerio = require('cheerio');
const { findContent } = require('find-main-content');

const $ = cheerio.load('<html> .... </html>');

// findContent returns a data structure that holds the main content plus
// extracted information on links, images, headers, title, description, ...
const html = findContent($); // main content in HTML format
const txt = findContent($, 'txt'); // main content in plain-text format
const md = findContent($, 'md'); // main content in Markdown format
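If you need all three formats for the same page, a small wrapper can make one call per format. This is only a sketch: `extractAllFormats` is a hypothetical helper, not part of find-main-content, and it takes `findContent` as a parameter so it can be tried here against a stub. It assumes, as in the example above, that omitting the format argument yields HTML.

```javascript
// Hypothetical helper (not part of find-main-content).
// `findContent` is passed in so the sketch can run against a stub.
function extractAllFormats(findContent, $) {
  return {
    html: findContent($), // no format argument: HTML, as in the example above
    txt: findContent($, 'txt'),
    md: findContent($, 'md'),
  };
}

// Stub standing in for the real function, for demonstration only.
const stubFindContent = ($, format = 'html') => ({ content: `<${format}>` });

const results = extractAllFormats(stubFindContent, null);
console.log(Object.keys(results)); // [ 'html', 'txt', 'md' ]
```

With the real module you would pass the imported `findContent` and a loaded `$` instead of the stub. This README does not say whether `findContent` modifies the loaded document; if it does, load a fresh `$` before each call.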

Options

You can control how the main content is extracted with the following options. All of them are optional; specify only the ones you need.


const options = {

  // If more than one H1 is found, use the first one as the main title of the page
  useFirstH1: true,

  // Remove the H1 from the main content; the H1 is still included in the returned JSON structure
  removeH1FromContent: true,

  // Some sites put links inside Hn tags; if true, remove them
  removeHeadersWithoutText: true,

  // If true, don't add the images to the final extraction
  removeImages: true,

  // Remove the HTML figcaption tags
  removeFigcaptions: true,

  // Replace links with their anchor text
  replaceLinks: true,

  // Remove HTML forms
  removeForm: false,

  // Remove basic HTML tags that have no children
  removeEmptyTag: false,

  // Remove tags that match the given selectors
  removeTags: '... ', // list of selectors separated by commas or line breaks

  // The HTML selector. If specified, the main content is extracted from the HTML element that matches the selector
  htmlSelector: '...'

};

const cheerio = require('cheerio');
const { findContent } = require('find-main-content');

const $ = cheerio.load('<html> .... </html>');

const data = findContent($, 'html', options);
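A mistyped option name is easy to miss, and this README does not say how unknown keys are handled. One way to guard against typos is to check keys against the documented names before calling `findContent`. The helper below is a minimal sketch: `buildOptions` and `KNOWN_OPTIONS` are hypothetical names, not part of find-main-content, and the selectors in the usage lines are made-up values. It also accepts `removeTags` as an array and joins it with commas, since the option is documented as a comma- or line-break-separated list.

```javascript
// Option names documented in this README; anything else is treated as a typo.
const KNOWN_OPTIONS = new Set([
  'useFirstH1',
  'removeH1FromContent',
  'removeHeadersWithoutText',
  'removeImages',
  'removeFigcaptions',
  'replaceLinks',
  'removeForm',
  'removeEmptyTag',
  'removeTags',
  'htmlSelector',
]);

// Hypothetical helper (not part of find-main-content).
function buildOptions(overrides = {}) {
  const options = {};
  for (const [name, value] of Object.entries(overrides)) {
    if (!KNOWN_OPTIONS.has(name)) {
      throw new Error(`Unknown option: ${name}`);
    }
    // removeTags is documented as one string of selectors separated by
    // commas or line breaks, so an array is joined into that form.
    options[name] =
      name === 'removeTags' && Array.isArray(value) ? value.join(', ') : value;
  }
  return options;
}

// Illustrative values only.
const pageOptions = buildOptions({
  removeImages: true,
  removeTags: ['nav', '.cookie-banner'],
});
console.log(pageOptions.removeTags); // nav, .cookie-banner
```

The resulting object can then be passed as the third argument, as in `findContent($, 'html', pageOptions)`.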

Structure returned by the function findContent

{
  title: '...',
  description: '...',
  images: [
    {
      src: 'https://... .jpg',
      alt: '...'
    },
    ...
  ],
  links: [
    {
      href: 'https://...',
      text: '...'
    },

  ],
  headers: [
    {
      type: 'h1',
      text: '...'
    },
    {
      type: 'h2',
      text: '...'
    },
    ...
  ],
  content: '....' // in either html, markdown or txt format
}
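As an example of consuming this structure, the sketch below turns the `headers` array into an indented plain-text outline. `buildOutline` is a hypothetical helper, not part of find-main-content. It assumes every `type` has the form `h1` to `h6`, as in the structure above, and the sample values are made up.

```javascript
// Hypothetical helper (not part of find-main-content).
// Turns the `headers` array into an indented plain-text outline.
function buildOutline(data) {
  return data.headers
    .map(({ type, text }) => {
      const level = Number(type.slice(1)); // 'h2' -> 2
      return `${'  '.repeat(level - 1)}- ${text}`;
    })
    .join('\n');
}

// Sample object shaped like the structure above; the values are made up.
const sample = {
  title: 'Example page',
  headers: [
    { type: 'h1', text: 'Example page' },
    { type: 'h2', text: 'Getting started' },
    { type: 'h3', text: 'Install' },
  ],
};

console.log(buildOutline(sample));
// - Example page
//   - Getting started
//     - Install
```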