web-pdf-scraper
v1.0.2
Scrapes the main article from a website or pdf.
WebPDFArticleScrape
####WebPDFArticleScrape is an npm module that scrapes the main content out of PDFs and web page articles.
var scraper = require('web-pdf-scraper');
Input: URL or Directory
Returned data structures:
sizeMap: Map<key: fontSize, val: Array of all text chunks of that size>
output: { title: [Array of all text chunks classified as titles], content: [Array of all text chunks classified as content] }
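To make the sizeMap shape more concrete, here is a minimal sketch that summarizes one. The sample data and the `summarizeSizeMap` helper are hypothetical and not part of the module. The sketch assumes sizeMap is a JavaScript `Map` keyed by font size, and falls back to `Object.entries` in case it turns out to be a plain object.

```javascript
// Hypothetical sizeMap, shaped like the structure described above:
// font size -> array of text chunks rendered at that size.
const sampleSizeMap = new Map([
  [24, ['Heart']],
  [12, ['The heart is a muscular organ.', 'It pumps blood.']],
  [8, ['Page 1']],
]);

// Hypothetical helper: list font sizes from largest to smallest,
// with the number of chunks and characters found at each size.
function summarizeSizeMap(sizeMap) {
  const entries = sizeMap instanceof Map
    ? Array.from(sizeMap.entries())
    : Object.entries(sizeMap); // assumption: may also be a plain object
  return entries
    .map(([fontSize, chunks]) => ({
      fontSize: Number(fontSize),
      chunkCount: chunks.length,
      charCount: chunks.reduce((total, chunk) => total + chunk.length, 0),
    }))
    .sort((a, b) => b.fontSize - a.fontSize);
}

console.log(summarizeSizeMap(sampleSizeMap));
```

A summary like this can help when deciding which font sizes are likely titles (few, short chunks) and which are body content (many, long chunks).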
###Basic Usage
#####Generating sizeMap of a PDF
scraper.scrapePDF("pdfDir.pdf").then(
function(sizeMap){
console.log(sizeMap);
}
).catch(
function(reason) {
console.log('Handle rejected promise ('+reason+') here.');
}
);
#####Generating output of a PDF
scraper.smartPDF("pdfDir.pdf").then(
function(output){
console.log(output);
}
).catch(
function(reason) {
console.log('Handle rejected promise ('+reason+') here.');
}
);
#####Generating sizeMap of a Web Article
scraper.scrapeWeb("https://en.wikipedia.org/wiki/Heart").then(
function(sizeMap){
console.log(sizeMap);
}
).catch(
function(reason) {
console.log('Handle rejected promise ('+reason+') here.');
}
);
#####Generating output of a Web Article
scraper.smartWeb("https://en.wikipedia.org/wiki/Heart").then(
function(output){
console.log(output);
}
).catch(
function(reason) {
console.log('Handle rejected promise ('+reason+') here.');
}
);
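The calls above can also be combined with async/await. The following is a sketch only: `scrapeSource` and `scrapeAll` are hypothetical helpers, not functions exported by the module. It assumes that `smartWeb` and `smartPDF` return promises as shown above, and that any input starting with `http://` or `https://` is a web article. A PDF served over HTTP is not handled by this routing.

```javascript
// Hypothetical helper: choose smartWeb for URLs and smartPDF otherwise.
// `scraper` is the object returned by require('web-pdf-scraper').
function scrapeSource(scraper, source) {
  const isUrl = /^https?:\/\//i.test(source); // assumption: URLs are web articles
  return isUrl ? scraper.smartWeb(source) : scraper.smartPDF(source);
}

// Hypothetical helper: scrape several sources one after another and
// record failures instead of stopping at the first rejected promise.
async function scrapeAll(scraper, sources) {
  const results = [];
  for (const source of sources) {
    try {
      const output = await scrapeSource(scraper, source);
      results.push({ source, output });
    } catch (reason) {
      results.push({ source, error: String(reason) });
    }
  }
  return results;
}

// Usage (not executed here):
// var scraper = require('web-pdf-scraper');
// scrapeAll(scraper, ['pdfDir.pdf', 'https://en.wikipedia.org/wiki/Heart'])
//   .then(function (results) { console.log(results); });
```

Passing the scraper in as an argument keeps the helpers easy to try out with a stub object before pointing them at real documents.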
HOW TO USE MANUAL CONFIG SITE:
Manually toggle which font sizes correspond to the useful text:
->Yellow : the text is a title
->Green : the text is content
->Red : ignore these texts
Once you have fixed the page, give your configuration a name and publish it.
A more in-depth visual explanation will be provided in the near future.
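The color scheme above can be pictured as a mapping from font size to label. The sketch below is an illustration, not the module's or the config site's actual implementation: the `COLOR_LABELS` table, the `applyConfig` helper, and the shape of the `config` object are all assumptions. It takes a sizeMap as a JavaScript `Map` and produces an object shaped like output.

```javascript
// Hypothetical mapping from the colors used on the config site to labels.
const COLOR_LABELS = { yellow: 'title', green: 'content', red: 'ignore' };

// Hypothetical helper: turn a sizeMap into { title, content } using a
// config object of the form { fontSize: color }. Sizes that are missing
// from the config are ignored, like red ones.
function applyConfig(sizeMap, config) {
  const output = { title: [], content: [] };
  for (const [fontSize, chunks] of sizeMap) {
    const label = COLOR_LABELS[config[fontSize]] || 'ignore';
    if (label === 'title') {
      output.title.push(...chunks);
    } else if (label === 'content') {
      output.content.push(...chunks);
    }
  }
  return output;
}

const exampleSizeMap = new Map([
  [24, ['Heart']],
  [12, ['The heart is a muscular organ.']],
  [8, ['Page 1']],
]);
console.log(applyConfig(exampleSizeMap, { 24: 'yellow', 12: 'green', 8: 'red' }));
```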
###Additional Useful Functions:
scraper.makeVerbose(); // logs more information about the processing to the console
scraper.stopVerbose();
scraper.ignoreTitles(); // skips title classification; its regex can cause severe lag on some inputs
scraper.markTitles();
scraper.shutUp(); // stops logging the manual config link
scraper.closeServer(); // closes the server
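Because these functions change the scraper's behavior for later calls, it can be convenient to scope a setting to a single task. The sketch below shows one way to do that. `withTitlesIgnored` is a hypothetical helper, and it assumes that `markTitles()` turns title classification back on after `ignoreTitles()`, which the names suggest but the list above does not state.

```javascript
// Hypothetical helper: run one task with title classification skipped,
// then restore it even if the task fails.
// Assumption: markTitles() is the counterpart that undoes ignoreTitles().
async function withTitlesIgnored(scraper, task) {
  scraper.ignoreTitles();
  try {
    return await task(scraper);
  } finally {
    scraper.markTitles();
  }
}

// Usage (not executed here):
// var scraper = require('web-pdf-scraper');
// withTitlesIgnored(scraper, function (s) {
//   return s.smartWeb('https://en.wikipedia.org/wiki/Heart');
// }).then(function (output) { console.log(output.content); });
```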