node-html-markdown-cloudflare

v1.3.0

Fast HTML to markdown cross-compiler, compatible with both node and the browser

node-html-markdown

NHM is a fast HTML to markdown converter, compatible with both node and the browser.

It was built with the following two goals in mind:

1. Speed

We had a need to convert gigabytes of HTML daily very quickly. All libraries we found were too slow with node. We considered using a low-level language but decided to attempt to write something that would squeeze every bit of performance out of the JIT that we could. The end result was fast enough to make the cut!

2. Human Readability

The other libraries we tested produced output that broke under numerous conditions and contained many repeated linefeeds. Generally speaking, outside of a markdown viewer, the result was not easy to read.

We took the approach of producing a clean, concise result with consistent spacing rules.

Install

<yarn|npm|pnpm> add node-html-markdown

Benchmarks

-----------------------------------------------------------------------------

Estimated processing times (fastest to slowest):

  [node-html-markdown (reused instance)]
    100 kB:  17ms
    1 MB:    176ms
    50 MB:   8.80sec
    1 GB:    3min, 0sec
    50 GB:   2hr, 30min, 14sec

  [turndown (reused instance)]
    100 kB:  27ms
    1 MB:    280ms
    50 MB:   13.98sec
    1 GB:    4min, 46sec
    50 GB:   3hr, 58min, 35sec

-----------------------------------------------------------------------------

Speed comparison - node-html-markdown (reused instance) is:

  1.02 times as fast as node-html-markdown
  1.57 times as fast as turndown
  1.59 times as fast as turndown (reused instance)

-----------------------------------------------------------------------------
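The figures above can be sanity-checked with a little arithmetic. The sketch below assumes that the larger sizes are linear extrapolations of measured timings and that 1 GB is counted as 1024 MB; neither assumption is stated by the benchmark itself, and the helper names `speedRatio` and `extrapolateSec` are made up for illustration.

```typescript
// Timings in seconds for the 50 MB case, copied from the benchmark output above.
const nhmReusedSec = 8.80;
const turndownReusedSec = 13.98;

/** How many times as fast the faster run is, rounded to two decimals. */
function speedRatio(slowSec: number, fastSec: number): number {
  return Math.round((slowSec / fastSec) * 100) / 100;
}

/** Linear extrapolation (an assumption): scale a measured time by the size ratio. */
function extrapolateSec(measuredSec: number, measuredMB: number, targetMB: number): number {
  return measuredSec * (targetMB / measuredMB);
}

// Ratio of turndown (reused instance) to node-html-markdown (reused instance).
console.log(speedRatio(turndownReusedSec, nhmReusedSec)); // 1.59

// 50 MB -> 1 GB (taken as 1024 MB): roughly 180 seconds, i.e. about 3 minutes.
console.log(extrapolateSec(nhmReusedSec, 50, 1024));
```

Both results agree with the reported "1.59 times as fast" and "1 GB: 3min, 0sec" lines, which suggests the table is internally consistent.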

Usage

import { NodeHtmlMarkdown, NodeHtmlMarkdownOptions } from 'node-html-markdown'


/* ********************************************************* *
 * Single use
 * If using it once, you can use the static method
 * ********************************************************* */

// Single file
NodeHtmlMarkdown.translate(
  /* html */ `<b>hello</b>`, 
  /* options (optional) */ {}, 
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Multiple files
NodeHtmlMarkdown.translate(
  /* FileCollection */ { 
    'file1.html': `<b>hello</b>`, 
    'file2.html': `<b>goodbye</b>` 
  }, 
  /* options (optional) */ {}, 
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);


/* ********************************************************* *
 * Re-use
 * If using it several times, creating an instance saves time
 * ********************************************************* */

const nhm = new NodeHtmlMarkdown(
  /* options (optional) */ {}, 
  /* customTranslators (optional) */ undefined,
  /* customCodeBlockTranslators (optional) */ undefined
);

// Single file
nhm.translate(/* html */ `<b>hello</b>`);

// Multiple Files
nhm.translate(
  /* FileCollection */ { 
    'file1.html': `<b>hello</b>`, 
    'file2.html': `<b>goodbye</b>` 
  }, 
);

Options


export interface NodeHtmlMarkdownOptions {
  /**
   * Use native window DOMParser when available
   * @default false
   */
  preferNativeParser: boolean,

  /**
   * Code block fence
   * @default ```
   */
  codeFence: string,

  /**
   * Bullet marker
   * @default *
   */
  bulletMarker: string,

  /**
   * Style for code block
   * @default fenced
   */
  codeBlockStyle: 'indented' | 'fenced',

  /**
   * Emphasis delimiter
   * @default _
   */
  emDelimiter: string,

  /**
   * Strong delimiter
   * @default **
   */
  strongDelimiter: string,

  /**
   * Strikethrough delimiter
   * @default ~~
   */
  strikeDelimiter: string,

  /**
   * Supplied elements will be ignored (ignores inner text; does not parse children)
   */
  ignore?: string[],

  /**
   * Supplied elements will be treated as blocks (surrounded with blank lines)
   */
  blockElements?: string[],

  /**
   * Max consecutive new lines allowed
   * @default 3
   */
  maxConsecutiveNewlines: number,

  /**
   * Line Start Escape pattern
   * (Note: Setting this will override the default escape settings; you might want to use the textReplace option instead)
   */
  lineStartEscape: [ pattern: RegExp, replacement: string ]

  /**
   * Global escape pattern
   * (Note: Setting this will override the default escape settings; you might want to use the textReplace option instead)
   */
  globalEscape: [ pattern: RegExp, replacement: string ]

  /**
   * User-defined text replacement pattern (Replaces matching text retrieved from nodes)
   */
  textReplace?: [ pattern: RegExp, replacement: string ][]

  /**
   * Keep images with data: URI (Note: These can be up to 1MB each)
   * @example
   * <img src="data:image/gif;base64,R0lGODlhEAAQAMQAAORHHOVSK......0o/">
   * @default false
   */
  keepDataImages?: boolean

  /**
   * Place URLs at the bottom and format links using link reference definitions
   *
   * @example
   * Click <a href="/url1">here</a>. Or <a href="/url2">here</a>. Or <a href="/url1">this link</a>.
   *
   * Becomes:
   * Click [here][1]. Or [here][2]. Or [this link][1].
   *
   * [1]: /url1
   * [2]: /url2
   */
  useLinkReferenceDefinitions?: boolean

  /**
   * Wrap URL text in < > instead of []() syntax.
   * 
   * @example
   * The input <a href="https://google.com">https://google.com</a>
   * becomes <https://google.com>
   * instead of [https://google.com](https://google.com)
   * 
   * @default true
   */
  useInlineLinks?: boolean
}
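To make the numbering behind useLinkReferenceDefinitions concrete, here is a minimal standalone sketch of how repeated URLs can share one reference number. It is not the library's implementation; the `Link` type and the `toReferenceLinks` function are hypothetical names introduced only for this illustration.

```typescript
// Hypothetical shape for a link found in the HTML input.
type Link = { text: string; href: string };

/**
 * Assigns each distinct href a 1-based reference number in order of first
 * appearance, so repeated URLs reuse the same number.
 */
function toReferenceLinks(links: Link[]): { inline: string[]; definitions: string[] } {
  const hrefs: string[] = []; // distinct hrefs, in first-seen order

  const inline = links.map(({ text, href }) => {
    let index = hrefs.indexOf(href);
    if (index === -1) {
      hrefs.push(href);
      index = hrefs.length - 1;
    }
    return `[${text}][${index + 1}]`;
  });

  // One definition line per distinct href, placed at the bottom of the output.
  const definitions = hrefs.map((href, i) => `[${i + 1}]: ${href}`);
  return { inline, definitions };
}

const result = toReferenceLinks([
  { text: 'here', href: '/url1' },
  { text: 'here', href: '/url2' },
  { text: 'this link', href: '/url1' },
]);

console.log(result.inline.join(' '));      // [here][1] [here][2] [this link][1]
console.log(result.definitions.join('\n')); // [1]: /url1 and [2]: /url2, one per line
```

The first and third links share reference 1 because they point to the same URL, matching the example in the option's doc comment.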

Custom Translators

Custom translators are an advanced option to allow handling certain elements a specific way.

These can be modified via the NodeHtmlMarkdown#translators property, or added during creation.

For detail on how to use them see:

The NodeHtmlMarkdown#codeBlockTranslators property is a collection of translators which handles elements within a <pre><code> block.
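As a rough mental model of per-element translators, the toy sketch below looks up a prefix and postfix by tag name and wraps the rendered children with them. Everything here (`ToyNode`, `ToyTranslator`, `render`) is hypothetical and deliberately simplified; it does not use or reproduce the library's actual translator types, so consult the library's own sources for the real configuration shape.

```typescript
// Toy document model: a node is either text or an element with children.
type ToyNode = string | { tag: string; children: ToyNode[] };

// Toy translator: text to emit before and after an element's content.
type ToyTranslator = { prefix?: string; postfix?: string };
type ToyTranslators = { [tag: string]: ToyTranslator };

/** Renders a node, wrapping elements that have a translator registered. */
function render(node: ToyNode, translators: ToyTranslators): string {
  if (typeof node === 'string') return node;
  const inner = node.children.map((child) => render(child, translators)).join('');
  const translator = translators[node.tag];
  if (!translator) return inner; // unknown tags pass their content through
  return (translator.prefix || '') + inner + (translator.postfix || '');
}

// A default set, then a customized set that adds handling for <mark>.
const defaults: ToyTranslators = {
  b: { prefix: '**', postfix: '**' },
  em: { prefix: '_', postfix: '_' },
};
const custom: ToyTranslators = { ...defaults, mark: { prefix: '==', postfix: '==' } };

const doc: ToyNode = { tag: 'p', children: ['say ', { tag: 'mark', children: ['hi'] }] };

console.log(render(doc, defaults)); // say hi
console.log(render(doc, custom));   // say ==hi==
```

Adding an entry for a tag changes only how that element is rendered, which is the idea behind supplying custom translators at creation time.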

Further improvements

Being a performance-centric library, we're always interested in further improvements. There are several probable routes by which we could gain substantial performance increases over the current model.

Such methods include:

  • Writing a custom parser
  • Integrating an async worker-thread based model for multi-threading
  • Fully replacing any remaining regex

These would be fun to implement; however, for the time being, the present library is fast enough for my purposes. That said, I welcome discussion and any PR toward the effort of further improving performance, and I may ultimately do more work in that capacity in the future!

Help Wanted!

Looking to contribute? Check out our help wanted list for a good place to start!