news-scraper-core
v2.0.0
core module for the NewScraper
NewScraper Core Module
The core module for the NewScraper https://github.com/XOP/news-scraper
Goal
NewScraper Core Module (NewScraper) is a Node.js module that receives specific directives as props and returns scraped page data.
Both the directives and the output are in JSON format.
NewScraper is designed to be used as a middleware for a server / hybrid / CLI application.
API
Config
limit
Number, default: undefined (bypass)
Defines a common limit for all directives; when set, it overwrites each directive's own limit (see Input -> limit)
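The interaction between the two limits can be sketched as follows. This is a minimal illustration, assuming that a defined Config limit always takes precedence over the directive's own limit; effectiveLimit is a hypothetical helper, not part of the module's API.

```javascript
// Hypothetical helper: not part of news-scraper-core.
// Assumes a defined config limit always replaces the directive's limit.
function effectiveLimit(config, directive) {
  if (typeof config.limit === 'number') {
    return config.limit; // common limit overwrites Input -> limit
  }
  return directive.limit; // bypass: keep the directive's own limit (may be undefined)
}

console.log(effectiveLimit({ limit: 3 }, { limit: 6 })); // 3
console.log(effectiveLimit({}, { limit: 6 }));           // 6
```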
output
Object:
{
path,
current
}
output.path
String, default: "./"
Path to the scraped data directory
output.current
String, default: "data.json"
Path to the current data json file (used to filter previously shown news)
updateStrategy
String, default: ""
Defines how the scraped data is post-processed:
"scratch" - ignores previous runs, creates a new JSON file on every scraping round
"compare" - compares the scraping results to the previous result and stores them in the output.current file (data.json by default)
"" - bypass, no scraping results are saved
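To illustrate the idea behind the "compare" strategy, here is a minimal sketch of filtering out items that were already present in a previous run. The module's actual comparison logic is not described here; matching items by href and the name filterNewItems are assumptions made for the example.

```javascript
// Hypothetical sketch of a "compare"-style filter.
// Assumption: two items are the same news entry when their href matches.
function filterNewItems(previousItems, scrapedItems) {
  const seen = new Set(previousItems.map((item) => item.href));
  return scrapedItems.filter((item) => !seen.has(item.href));
}

const previous = [{ href: 'https://example.com/a' }];
const scraped = [
  { href: 'https://example.com/a' },
  { href: 'https://example.com/b' },
];

// Only the item that was not in the previous run remains.
console.log(filterNewItems(previous, scraped));
```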
scraperOptions
Object, default: {}
Parameters to pass to the currently used scraper.
Version 1.x - Nightmare; see the Nightmare documentation for all options.
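The documented defaults can be collected into a single object, as in the sketch below. The resolveConfig helper is hypothetical and only shows one way an application might merge its own options over these defaults before passing them on; how the module itself applies defaults is not specified here.

```javascript
// Defaults as listed in the Config section.
const defaultConfig = {
  limit: undefined, // bypass
  output: { path: './', current: 'data.json' },
  updateStrategy: '', // bypass, nothing saved
  scraperOptions: {},
};

// Hypothetical helper: merges user options over the documented defaults.
function resolveConfig(userConfig = {}) {
  return {
    ...defaultConfig,
    ...userConfig,
    output: { ...defaultConfig.output, ...userConfig.output },
    scraperOptions: {
      ...defaultConfig.scraperOptions,
      ...userConfig.scraperOptions,
    },
  };
}

const config = resolveConfig({
  limit: 5,
  output: { path: './data' },
  updateStrategy: 'compare',
});

console.log(config.output); // path is overridden, current keeps its default
```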
Input
Input is a collection of directives in JSON format.
It is recommended that the application store directives in a more readable format (e.g. YAML) and convert them to JSON on the fly.
Example:
[
{
"title": "Smashing magazine",
"url": "http://www.smashingmagazine.com/",
"elem": "article.post",
"link": "h2 > a",
"author": "h2 + ul li.a a",
"time": "h2 + ul li.rd",
"image": "figure > a > img",
"limit": 6
},
{...},
{...}
]
title
String
Name of the resource, required
url
String
Url of the resource, required
elem
String
CSS selector of the news item container element, required
link
String
CSS selector of the link (...) inside the elem
Not required if the elem itself is a link
author
String
CSS selector of the author element inside of the elem
time
String
CSS selector of the time element inside of the elem
image
String
CSS selector of the image element inside of the elem
This can be an img tag or any other element: NewScraper will look for the data-src attribute and the background-image CSS property to find the proper image data
limit
Number
Maximum number of elem items to scrape from the url
See also: Config -> limit
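The field rules above can be checked before a directive is handed to the scraper. The following is a minimal sketch; validateDirective is a hypothetical helper written for this example, and whether the module performs similar validation itself is not stated here.

```javascript
// Hypothetical validator based on the field descriptions above.
// title, url and elem are documented as required; limit is a Number.
const REQUIRED_FIELDS = ['title', 'url', 'elem'];

function validateDirective(directive) {
  const errors = [];
  for (const field of REQUIRED_FIELDS) {
    if (typeof directive[field] !== 'string' || directive[field] === '') {
      errors.push(`"${field}" is required and must be a non-empty string`);
    }
  }
  if (directive.limit !== undefined && typeof directive.limit !== 'number') {
    errors.push('"limit" must be a number');
  }
  return errors; // an empty array means the directive looks valid
}

console.log(
  validateDirective({
    title: 'Smashing magazine',
    url: 'http://www.smashingmagazine.com/',
    elem: 'article.post',
    limit: 6,
  })
); // []
```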
Output
Output includes all Input data:
pages -> [] -> {...}
Plus the parsed scraping result, ready for your favourite templating engine:
pages -> [] -> {data -> [] -> {...}}
Plus the unmodified markup from the specified pages:
pages -> [] -> {data -> [] -> {raw}}
It also contains some metadata, such as the path to the current data file and the exact moment the scraping started.
Example:
{
"meta": {
"file": "/Users/[...]/data/1474811135645.json",
"date": 1474811135645
},
"pages": [
{
"url": "https://www.smashingmagazine.com",
"elem": "article.post",
"link": "h2 > a",
"author": "h2 + ul li.a a",
"time": "h2 + ul li.rd",
"image": "figure > a > img",
"limit": 6,
"data": [
{
"href": "https://www.smashingmagazine.com/2016/09/interview-with-matan-stauber/",
"text": "\n\t\t\tAn Interview With Matan Stauber\n\t\t\tStretching The Limits Of What’s Possible\n\t\t",
"title": "Read 'Stretching The Limits Of What’s Possible'",
"raw": "<article class=\"post-266432 post type-post status-publish format-standard has-post-thumbnail hentry category-general tag-interviews\" vocab=\"http://schema.org/\" typeof=\"TechArticle\"> [ ... a lot of markup ... ] </article>",
"author": "Cosima Mielke",
"time": "September 23rd, 2016",
"imageSrc": "https://www.smashingmagazine.com/wp-content/uploads/2016/09/histography-website-small-opt.png"
},
{... x5}
]
},
{...},
{...}
]
}
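An application will typically walk this structure before rendering it. The sketch below flattens the documented shape (pages -> [] -> {data -> [] -> {...}}) into a single list of items; flattenOutput and the source field are hypothetical names chosen for the example, and the sample object is a shortened stand-in for a real result.

```javascript
// Hypothetical helper: flattens the documented output shape
// into one list of news items, e.g. for a templating engine.
function flattenOutput(output) {
  return output.pages.flatMap((page) =>
    (page.data || []).map((item) => ({
      source: page.url, // page the item was scraped from
      href: item.href,
      author: item.author,
      time: item.time,
      imageSrc: item.imageSrc,
    }))
  );
}

// Shortened sample in the same shape as the example output.
const sampleOutput = {
  meta: { date: 1474811135645 },
  pages: [
    {
      url: 'https://www.smashingmagazine.com',
      data: [
        {
          href: 'https://www.smashingmagazine.com/2016/09/interview-with-matan-stauber/',
          author: 'Cosima Mielke',
          time: 'September 23rd, 2016',
        },
      ],
    },
  ],
};

console.log(flattenOutput(sampleOutput).length); // 1
```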
Events
:construction: coming up!