web-pdf-scraper

v1.0.2

Published

2 years ago

Scrapes the main article from a website or pdf.

saternius

webpage article scrapper text WebToText parser

WebPDFArticleScrape

####WebPDFArticleScrape is an npm module that extracts the main content from PDFs and web page articles.

	var scraper = require('web-pdf-scraper');

Input: URL or Directory

Returned DataStructures:

	sizeMap: Map<key: fontSize, val: Array of all text chunks of said size>
	output: {
		title: [Array of all text chunks classified as titles],
		content: [Array of all text chunks classified as content]
	}
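To make the two shapes concrete, the sketch below writes out a sample `sizeMap` and `output` by hand. The values are made up, and `largestFontSize` is a hypothetical helper, not part of the module. It assumes `sizeMap` is a JavaScript `Map` with numeric keys; if the module returns a plain object or string keys, the helper would need adjusting.

```javascript
// Hand-written sample data in the shapes described above (values are made up).
const exampleSizeMap = new Map([
  [24, ['The Heart']],
  [12, ['The heart is a muscular organ.', 'It pumps blood through the body.']],
  [8, ['Page 1']],
]);

const exampleOutput = {
  title: ['The Heart'],
  content: ['The heart is a muscular organ.', 'It pumps blood through the body.'],
};

// Hypothetical helper: return the largest font size present in a sizeMap.
// Assumes the keys are numbers (or numeric strings).
function largestFontSize(sizeMap) {
  return Math.max(...sizeMap.keys());
}

console.log(largestFontSize(exampleSizeMap));
console.log(exampleOutput.content.length);
```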

###Basic Usage

#####Generating sizeMap of a PDF

		scraper.scrapePDF("pdfDir.pdf").then(
			function(sizeMap){
				console.log(sizeMap);
			}
		).catch(
			function(reason) {
				console.log('Handle rejected promise ('+reason+') here.');
			}
		);

#####Generating output of a PDF

		scraper.smartPDF("pdfDir.pdf").then(
			function(output){
				console.log(output);
			}
		).catch(
			function(reason) {
				console.log('Handle rejected promise ('+reason+') here.');
			}
		);

#####Generating sizeMap of a Web Article

		scraper.scrapeWeb("https://en.wikipedia.org/wiki/Heart").then(
			function(sizeMap){
				console.log(sizeMap);
			}
		).catch(
			function(reason) {
				console.log('Handle rejected promise ('+reason+') here.');
			}
		);

#####Generating output of a Web Article

		scraper.smartWeb("https://en.wikipedia.org/wiki/Heart").then(
			function(output){
				console.log(output);
			}
		).catch(
			function(reason) {
				console.log('Handle rejected promise ('+reason+') here.');
			}
		);
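Once an `output` object is available, a common next step is to turn it into plain text. The following is a minimal sketch of that step; `outputToText` is a hypothetical helper (not part of the module), and the sample object is hand-written rather than a real scrape result.

```javascript
// Hypothetical helper: flatten an `output` object into one plain-text string.
// Titles come first, then a blank line, then the content chunks.
function outputToText(output) {
  const titles = (output.title || []).join('\n');
  const body = (output.content || []).join('\n');
  // Drop empty parts so there is no stray blank separator.
  return [titles, body]
    .filter(function (part) { return part.length > 0; })
    .join('\n\n');
}

// Usage with a hand-written sample (not real scraper results).
const sampleOutput = {
  title: ['Heart'],
  content: ['First paragraph.', 'Second paragraph.'],
};
console.log(outputToText(sampleOutput));
```

With the module loaded, the helper could be chained onto either promise, for example `scraper.smartWeb(url).then(outputToText)`.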

	HOW TO USE MANUAL CONFIG SITE:
		Manually toggle which font sizes correspond to the useful text:
		 -> Yellow: the text is a title
		 -> Green: the text is content
		 -> Red: ignore this text

		Once you have fixed the page, give your configuration a name and publish it.

		A more in-depth visual explanation will be provided in the near future.
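The idea behind the manual configuration can be sketched in code: each font size gets one of three labels, and the labels decide where the text chunks end up. This is only an illustration of the concept. The `manualConfig` format and the `applyConfig` function are assumptions made for this example and do not describe how the config site or the module stores configurations.

```javascript
// Hypothetical configuration: font size -> label, mirroring the three colors.
const manualConfig = new Map([
  [24, 'title'],   // yellow
  [12, 'content'], // green
  [8, 'ignore'],   // red
]);

// Apply a configuration to a sizeMap and build a { title, content } object.
function applyConfig(sizeMap, config) {
  const result = { title: [], content: [] };
  for (const [fontSize, chunks] of sizeMap) {
    const label = config.get(fontSize);
    if (label === 'title' || label === 'content') {
      result[label].push(...chunks);
    }
    // Sizes labeled 'ignore', and sizes missing from the config, are dropped.
  }
  return result;
}

// Hand-written sample sizeMap (values are made up).
const pageSizeMap = new Map([
  [24, ['The Heart']],
  [12, ['First paragraph.', 'Second paragraph.']],
  [8, ['Page 1']],
  [10, ['Figure caption']],
]);

console.log(applyConfig(pageSizeMap, manualConfig));
```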

###Additional Useful Functions:

	scraper.makeVerbose()  // logs more information about the processing to the console
	scraper.stopVerbose()  // turns the extra logging off again

	scraper.ignoreTitles() // skips title classification; its regex sometimes causes HUGE lag
	scraper.markTitles()   // turns title classification back on

	scraper.shutUp()       // stops logging the manual config link

	scraper.closeServer()  // closes the server
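These toggles can be combined with the scraping calls. The sketch below wraps `smartWeb` so that titles are ignored for one request and the server is closed afterwards. `scrapeContentOnly` is a hypothetical wrapper, and it assumes that `ignoreTitles()` affects subsequent calls, that `output.content` is still populated when titles are ignored, and that `closeServer()` is safe to call once scraping has finished. A stub object stands in for the module so the example runs offline.

```javascript
// Hypothetical wrapper: scrape one article with title classification off,
// then restore the setting and close the server, on success or failure.
function scrapeContentOnly(scraper, url) {
  scraper.ignoreTitles();
  return scraper.smartWeb(url).then(
    function (output) {
      scraper.markTitles();
      scraper.closeServer();
      return output.content;
    },
    function (reason) {
      scraper.markTitles();
      scraper.closeServer();
      throw reason;
    }
  );
}

// Stub standing in for require('web-pdf-scraper'); it only records calls.
const calls = [];
const stubScraper = {
  ignoreTitles: function () { calls.push('ignoreTitles'); },
  markTitles: function () { calls.push('markTitles'); },
  closeServer: function () { calls.push('closeServer'); },
  smartWeb: function (url) {
    calls.push('smartWeb');
    return Promise.resolve({ title: [], content: ['Body text from ' + url] });
  },
};

scrapeContentOnly(stubScraper, 'https://en.wikipedia.org/wiki/Heart').then(
  function (content) {
    console.log(content);
    console.log(calls);
  }
);
```

Passing the scraper in as an argument keeps the wrapper easy to exercise with a stub; in real use the argument would be the object returned by `require('web-pdf-scraper')`.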
