
news-scraper-core

v2.0.0

core module for the NewScraper

NewScraper Core Module

The core module for the NewScraper https://github.com/XOP/news-scraper

Goal

NewScraper Core Module (NewScraper) is a NodeJS module that receives specific directives as input and returns scraped page data.

Both the directives and the output are in JSON format.

NewScraper is designed to be used as middleware for a server, hybrid, or CLI application.

API

Config

limit
Number, default: undefined (bypass)
Defines the default common limit; overrides the directive's Input -> limit

output
Object:

{ 
    path,
    current
}

output.path
String, default: "./"
Path to the scraped data directory

output.current
String, default: "data.json"
Path to the current data JSON file (used to filter previously shown news)

updateStrategy
String, default: ""
Defines the post-processing logic for the scraped data:
"scratch" - ignores previous runs, creates a new JSON file on every scraping round
"compare" - compares the scraping results to the previous run and stores them in the output.current file (data.json by default)
"" - bypass, no scraping results are saved

scraperOptions
Object, default: {}
Parameters to pass to the currently used scraper.
In version 1.x the scraper is Nightmare; see the Nightmare documentation for all available options.
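Putting the documented fields together, a config object might look like the sketch below. The field names come from the descriptions above; the values and the merge helper are hypothetical, shown only to illustrate the documented precedence where Config -> limit overrides a directive's own limit.

```javascript
// Hypothetical NewScraper config, built from the documented fields:
// limit, output.path, output.current, updateStrategy, scraperOptions.
const config = {
    limit: 5,                       // global limit, overrides directive limits
    output: {
        path: './data/',            // directory for scraped data
        current: 'data.json'        // file used to filter previously shown news
    },
    updateStrategy: 'compare',      // "scratch" | "compare" | ""
    scraperOptions: {}              // passed through to the underlying scraper
};

// Illustrates the documented rule: a defined Config -> limit wins over
// the directive's own limit; otherwise the directive's limit applies.
function effectiveLimit(config, directive) {
    return config.limit !== undefined ? config.limit : directive.limit;
}

const directive = { title: 'Smashing magazine', url: 'http://www.smashingmagazine.com/', limit: 6 };
console.log(effectiveLimit(config, directive)); // 5 - the global limit wins
```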

Input

Input is a collection of directives in JSON format.

It is recommended that the application store directives in a more readable format (e.g. YAML) and convert them to JSON on the fly.

Example:

[
    {
        "title": "Smashing magazine",
        "url": "http://www.smashingmagazine.com/",
        "elem": "article.post",
        "link": "h2 > a",
        "author": "h2 + ul li.a a",
        "time": "h2 + ul li.rd",
        "image": "figure > a > img",
        "limit": 6
    },
    {...},
    {...}
]

title
String
Name of the resource, required

url
String
Url of the resource, required

elem
String
CSS selector of the news item container element, required

link
String
CSS selector of the link inside of the elem. If the elem itself is a link, this is not required

author
String
CSS selector of the author element inside of the elem

time
String
CSS selector of the time element inside of the elem

image
String
CSS selector of the image element inside of the elem. This can be an img tag or any other element - NewScraper will search for the data-src attribute and the background-image CSS property to find the proper image data

limit
Number
Maximum number of elem-s scraped from the url
See also: Config -> limit
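Based on the field list above, title, url and elem are required while the rest are optional. A small hypothetical validator (not part of the news-scraper-core API) makes the rule concrete:

```javascript
// Hypothetical directive validator, derived from the field list above:
// title, url and elem are required; link, author, time, image and limit
// are optional. Just a sketch - not part of the package itself.
function validateDirective(directive) {
    const missing = ['title', 'url', 'elem'].filter(key => !directive[key]);
    return { valid: missing.length === 0, missing };
}

const ok = validateDirective({
    title: 'Smashing magazine',
    url: 'http://www.smashingmagazine.com/',
    elem: 'article.post'
});
const bad = validateDirective({ url: 'http://example.com/' });
console.log(ok.valid);      // true
console.log(bad.missing);   // [ 'title', 'elem' ]
```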

Output

Output includes all Input data
pages -> [] -> {...}

Plus the parsed scraping result, ready for your favourite templating engine
pages -> [] -> {data -> [] -> {...}}

Plus the unmodified markup from the specified pages
pages -> [] -> {data -> [] -> {raw}}

It also contains some meta-data, such as the path to the current data file and the exact timestamp of the scraping start.

Example:

{
    "meta": {
        "file": "/Users/[...]/data/1474811135645.json",
        "date": 1474811135645
    },
    "pages": [
        {
            "url": "https://www.smashingmagazine.com",
            "elem": "article.post",
            "link": "h2 > a",
            "author": "h2 + ul li.a a",
            "time": "h2 + ul li.rd",
            "image": "figure > a > img",
            "limit": 6,
            "data": [
                {
                "href": "https://www.smashingmagazine.com/2016/09/interview-with-matan-stauber/",
                "text": "\n\t\t\tAn Interview With Matan Stauber\n\t\t\tStretching The Limits Of What’s Possible\n\t\t",
                "title": "Read 'Stretching The Limits Of What’s Possible'",
                "raw": "<article class=\"post-266432 post type-post status-publish format-standard has-post-thumbnail hentry category-general tag-interviews\" vocab=\"http://schema.org/\" typeof=\"TechArticle\"> [ ... a lot of markup ... ] </article>",
                "author": "Cosima Mielke",
                "time": "September 23rd, 2016",
                "imageSrc": "https://www.smashingmagazine.com/wp-content/uploads/2016/09/histography-website-small-opt.png"
                },
                {... x5}
            ]
        },
        {...},
        {...}
    ]
}
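The nested shape (pages -> [] -> data -> [] -> {...}) is easy to walk when feeding a template. The sketch below uses a trimmed-down stand-in for the real output object; field names follow the example above.

```javascript
// Walking the documented output shape: pages -> [] -> data -> [] -> {...}.
// This output object is a reduced stand-in for an actual scraping result.
const output = {
    meta: { file: './data/1474811135645.json', date: 1474811135645 },
    pages: [{
        url: 'https://www.smashingmagazine.com',
        data: [{
            href: 'https://www.smashingmagazine.com/2016/09/interview-with-matan-stauber/',
            title: "Read 'Stretching The Limits Of What's Possible'",
            author: 'Cosima Mielke'
        }]
    }]
};

// Flatten every scraped item into a simple line, e.g. for a template engine.
const lines = output.pages.flatMap(page =>
    page.data.map(item => `${item.author}: ${item.title} (${item.href})`)
);
console.log(lines[0]);
```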

Events

:construction: coming up!

MIT License