node-red-contrib-unfluff

v1.0.0

Published 3 years ago

An automatic web page content extractor.

Maintainer: balsimpson

Keywords: node-red unfluff html scraping content extraction

Automatically grabs the main text out of a web page. In other words, it turns pretty web pages into boring plain text/JSON data.

This is a Node-RED wrapper for the npm module unfluff. Read more at https://www.npmjs.com/package/unfluff

Outputs a JSON object with the following fields:

title - The document's title (from the <title> tag)
softTitle - A version of title with less truncation
date - The document's publication date
copyright - The document's copyright line, if present
author - The document's author
publisher - The document's publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what's used by Facebook, etc.)
videos - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking <rel> tags or by looking at href URLs.
canonicalLink - The canonical URL of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The URL of the document's favicon.
links - An array of links embedded within the article text. (text and href for each)
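
The field list above can be consumed downstream with a few lines of JavaScript. The following is a minimal sketch, not part of this package: summarizeArticle is a hypothetical helper, and it assumes you have already picked the extracted object off the message (this page does not state which message property the node writes to). It only reads fields named in the list above.

```javascript
// Hypothetical helper (not provided by node-red-contrib-unfluff):
// condenses an unfluff-style result object into a short summary.
function summarizeArticle(data) {
  const text = typeof data.text === "string" ? data.text : "";
  return {
    // Prefer softTitle, fall back to title, then to a placeholder.
    title: data.softTitle || data.title || "(untitled)",
    // Assumption: author may arrive as a string or as an array of names.
    author: Array.isArray(data.author)
      ? data.author.join(", ")
      : (data.author || ""),
    // Count whitespace-separated words in the extracted main text.
    wordCount: text.split(/\s+/).filter(Boolean).length,
    // links is described as an array of { text, href } entries.
    linkCount: Array.isArray(data.links) ? data.links.length : 0,
  };
}
```

Missing fields fall back to empty values, so the helper can be called on partial results without extra checks.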

Input

Send a message with msg.url set to supply the URL dynamically.
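
As a rough illustration of setting msg.url, the sketch below shows logic you might place in a Node-RED function node wired ahead of this node. The helper name toUnfluffMessage and the choice to read the URL from msg.payload are assumptions made for the example. Only msg.url comes from this README.

```javascript
// Hypothetical helper: copies a URL from msg.payload into msg.url.
// Returns null for anything that is not an http(s) URL; in a Node-RED
// function node, returning null stops the message from being passed on.
function toUnfluffMessage(msg) {
  const url = String(msg.payload || "").trim();
  if (!/^https?:\/\//i.test(url)) {
    return null;
  }
  msg.url = url;
  return msg;
}
```

Inside a function node the body would end with `return toUnfluffMessage(msg);`.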
