google-news-scraper

v2.7.0

Lightweight async scraper for Google News

Downloads

666

Readme

📰 Google News Scraper

A lightweight package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.

Installation 🔌

# Install via NPM
npm install google-news-scraper
# Install via Yarn
yarn add google-news-scraper

Usage 🕹️

Simply import the package and pass a config object.

import googleNewsScraper from 'google-news-scraper';

const articles = await googleNewsScraper({ searchTerm: "The Oscars" });

Full documentation on the config object can be found below.

Output 📲

The output is an array of JSON objects, with each article following the structure below:

[
    {
        "title": "Article title",
        "link": "http://url-to-website.com/path/to/article",
        "image": "http://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "datetime": "2024-05-13T08:02:22.000Z",
        "time": "Time/date published (human-readable)",
        "articleType": "String, one of ['regular' | 'topicFeatured' | 'topicSmall']"
    }
]
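
As a rough illustration of working with this structure, the sketch below groups a result array by publication. The groupBySource helper and the sample data are made up for this example and are not part of the package; only the field names come from the structure above.

```javascript
// Hypothetical helper (not part of google-news-scraper): groups articles
// by their "source" field and collects the titles for each publication.
function groupBySource(articles) {
    const groups = {};
    for (const article of articles) {
        const key = article.source || "Unknown";
        if (!groups[key]) groups[key] = [];
        groups[key].push(article.title);
    }
    return groups;
}

// Made-up sample data in the documented shape
const sample = [
    { title: "First story", source: "Example Times", link: "http://example.com/a" },
    { title: "Second story", source: "Example Times", link: "http://example.com/b" },
    { title: "Third story", source: "Sample Post", link: "http://example.com/c" },
];

console.log(groupBySource(sample));
```

In a real application, the array returned by googleNewsScraper would take the place of sample.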

Config ⚙️

The config object passed to the function above has the following properties:

searchTerm

This is the search query you'd like to find articles for. Simply pass the search string like so: searchTerm: "The Oscars".

The search term is no longer a required field, as hahagu added support for topic pages in #44. If neither searchTerm nor baseUrl is supplied, the scraper will just return results from the Google News homepage.

baseUrl

The baseUrl property enables you to specify an alternate base URL for your search. This is useful when you want to scrape, for example, a specific Google news topic.

PLEASE NOTE: Using both a baseUrl that points to a topic AND a searchTerm is not advised, as the searchTerm will typically be ignored in favour of the topic in the baseUrl.

In the scraped URL, your baseUrl will be immediately followed by query parameters (eg: ?hl=en-US&gl=US&ceid=US), so it doesn't matter whether your baseUrl has a trailing slash or not.

Defaults to https://news.google.com/search

prettyURLs

The URLs that Google News supplies for each article are "ugly" links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen"). By default, the scraper will retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html"). This is done using some base64 decoding, so the overhead is negligible. To prevent this default behaviour and retrieve the "ugly" links instead, pass prettyURLs: false to the config object.

Credit to anthonyfranc for the base64 decode fix 🙏

Defaults to true.

timeframe

The results can be filtered to articles published within a given timeframe prior to the request. The timeframe is a string made up of a number followed by a letter representing the time operator. For example, 1y would signify 1 year. Full list of operators below:

  • h = hours (eg: 12h)
  • d = days (eg: 7d)
  • m = months (eg: 6m)
  • y = years (eg: 1y)

Defaults to 7d.
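
If you build timeframe strings from user input, it may help to validate them before passing them on. The following is a minimal sketch of such a check; parseTimeframe is a hypothetical helper, not a function exported by the package, and it assumes the number is a whole number.

```javascript
// Hypothetical validator (not part of google-news-scraper) for the
// documented timeframe format: a whole number followed by h, d, m or y.
const TIMEFRAME_UNITS = { h: "hours", d: "days", m: "months", y: "years" };

function parseTimeframe(timeframe) {
    const match = /^(\d+)([hdmy])$/.exec(timeframe);
    if (!match) {
        throw new Error(`Invalid timeframe "${timeframe}", expected e.g. "7d"`);
    }
    return { amount: Number(match[1]), unit: TIMEFRAME_UNITS[match[2]] };
}

console.log(parseTimeframe("12h")); // { amount: 12, unit: 'hours' }
console.log(parseTimeframe("1y"));  // { amount: 1, unit: 'years' }
```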

getArticleContent

By default, the scraper does not return the article content, as this would require Puppeteer to navigate to each individual article in the results (increasing execution time significantly). If you would like to enable this behaviour and receive the content of each article, simply pass getArticleContent: true in the config. This will add two fields to each article in the output: content and favicon.

[
    {
        "title": "Article title",
        "link": "https://url-to-website.com/path/to/article",
        "image": "https://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "time": "Time/date published (human-readable)",
        "content": "The full text content of the article...",
        "favicon": "https://url-to-website.com/path/to/favicon.png"
    }
]

PLEASE NOTE: Due to the large number of variable factors to take into account, this feature fails on many websites. All errors are handled gracefully and will return an empty string as the content. Please ensure you handle such outcomes in your application.

Defaults to false
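
One possible way to handle the empty-string case described in the note above is to separate the articles whose content was extracted from those where it was not. This is only a sketch; partitionByContent is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper: separates articles whose content was extracted
// from those where extraction failed and content is an empty string.
function partitionByContent(articles) {
    const withContent = [];
    const withoutContent = [];
    for (const article of articles) {
        const hasContent =
            typeof article.content === "string" && article.content.trim() !== "";
        (hasContent ? withContent : withoutContent).push(article);
    }
    return { withContent, withoutContent };
}

// Made-up sample data for illustration
const results = [
    { title: "Readable article", content: "Full text..." },
    { title: "Blocked article", content: "" },
];

const { withContent, withoutContent } = partitionByContent(results);
console.log(withContent.length, withoutContent.length);
```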

logLevel

You can customise the log level to any of the following:

  • none: No logs will be output at all.
  • error: Only errors will be output to the log.
  • warn: Errors and warnings will be output to the log.
  • info: Info, errors and warnings will be output to the log.
  • verbose: All of the above and potentially more. Currently there are no specifically verbose logs, but in future I may move some of the info logs to verbose and/or add some debugging info there.

Defaults to error.
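
To make the hierarchy above concrete, here is a small sketch that models which messages each setting would show. It is an illustration of the list only, not the package's actual logger, and wouldLog is a hypothetical name.

```javascript
// Illustration only: the documented levels, ordered from quietest to noisiest.
const LOG_LEVELS = ["none", "error", "warn", "info", "verbose"];

// Returns true if a message at messageLevel would be shown when the
// logLevel option is set to configuredLevel, per the list above.
function wouldLog(configuredLevel, messageLevel) {
    const configured = LOG_LEVELS.indexOf(configuredLevel);
    const message = LOG_LEVELS.indexOf(messageLevel);
    // A message cannot itself be at level "none"
    if (configured === -1 || message <= 0) {
        throw new Error("Unknown or invalid log level");
    }
    return message <= configured;
}

console.log(wouldLog("warn", "error")); // true
console.log(wouldLog("error", "warn")); // false
```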

queryVars

An object of additional query params to add to the Google News URL string, formatted as key value pairs. This can be useful if you want to search for articles in a specific language, for example:

const articles = await googleNewsScraper({
    searchTerm: "Últimas noticias en Madrid",
    queryVars: {
        gl:"ES",
        ceid:"ES:es"
    },
});

Defaults to null
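
For intuition about how key value pairs like these end up in a URL, the sketch below appends them to a base URL using the standard URL API. It is not the package's own code; appendQueryVars is a hypothetical helper, and the way the package assembles its final URL may differ.

```javascript
// Illustration only, not the package's code: shows how key/value pairs
// such as queryVars are encoded into a URL query string.
function appendQueryVars(baseUrl, queryVars) {
    const url = new URL(baseUrl);
    for (const [key, value] of Object.entries(queryVars || {})) {
        url.searchParams.set(key, value);
    }
    return url.toString();
}

console.log(appendQueryVars("https://news.google.com/search", { gl: "ES", ceid: "ES:es" }));
// https://news.google.com/search?gl=ES&ceid=ES%3Aes
```

Note that the colon in ES:es is percent-encoded as %3A, which matches the ceid values seen in the "ugly" Google News links.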

puppeteerArgs

An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the --no-sandbox and --disable-setuid-sandbox flags, as explained in this SO answer.

Defaults to []
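
As a small sketch of the Heroku case mentioned above, the helper below builds a config object that only includes the two sandbox flags when needed. buildConfig and its onHeroku parameter are hypothetical names chosen for this example; how you detect the environment is up to your application.

```javascript
// Hypothetical helper: builds a config object, adding the two sandbox
// flags mentioned above only when running on Heroku.
function buildConfig(searchTerm, onHeroku) {
    return {
        searchTerm,
        puppeteerArgs: onHeroku ? ["--no-sandbox", "--disable-setuid-sandbox"] : [],
    };
}

console.log(buildConfig("The Oscars", true));
```

The resulting object would then be passed to googleNewsScraper in place of an inline config.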

puppeteerHeadlessMode

Whether or not Puppeteer should run in headless mode. Running in headless mode increases performance by approximately 30% (credit to ole-ve for finding this). If you're not sure about this setting, leave it as it is.

Defaults to true

limit

The total number of articles that you would like to be returned. Please note that with higher numbers, the actual returned number may be lower. Typically the max is 99, but it varies depending on many variables in Puppeteer (such as rate limiting, network conditions etc.).

Defaults to 99
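
For quick reference, the defaults documented in the sections above can be collected into one object. This is a sketch for illustration: the package applies its own defaults internally, and withDefaults is a hypothetical helper that simply shows what a final config would look like after overrides.

```javascript
// The defaults documented in the sections above, collected in one place.
// For reference only; the package applies its own defaults internally.
const DOCUMENTED_DEFAULTS = {
    baseUrl: "https://news.google.com/search",
    prettyURLs: true,
    timeframe: "7d",
    getArticleContent: false,
    logLevel: "error",
    queryVars: null,
    puppeteerArgs: [],
    puppeteerHeadlessMode: true,
    limit: 99,
};

// Hypothetical helper: merges caller overrides over the documented defaults.
function withDefaults(overrides) {
    return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

console.log(withDefaults({ searchTerm: "The Oscars", limit: 20 }));
```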

TypeScript 💙

Google News Scraper includes full TypeScript definitions.

Your IDE should pick the types up automatically, but if not you can find them in the dist/tsc/ folder.

Common JS 👴🏻

Google News Scraper is built to work as an ESM module out of the box, but it also works as a Common JS module. Just use require instead of import:

const googleNewsScraper = require('google-news-scraper');

// Common JS has no top-level await, so wrap the call in an async function
(async () => {
    const articles = await googleNewsScraper({ searchTerm: "The Oscars" });
})();

Performance 📈

My test query returned 94 results, which took 4.5 seconds with article content and 3.6 seconds without it. I'm on a fibre connection, and other queries may return a different number of results, so your mileage may vary.

Upkeep 🧹

Please note that this is a web-scraper, which relies on DOM selectors, so any fundamental changes in the markup on the Google News site will probably break this tool. I'll try my best to keep it up-to-date, but changes to the markup on Google News will be silent and therefore difficult to keep track of. Feel free to submit an issue if the tool stops working.
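
Because such breakage is silent, one option is to run a basic sanity check on the results in your own application. The sketch below is one possible approach; checkResults is a hypothetical helper, and the choice of fields to check is an assumption based on the output structure documented above.

```javascript
// Hypothetical sanity check: returns a list of problems that may hint
// at changed Google News markup, or an empty list if results look fine.
function checkResults(articles) {
    if (!Array.isArray(articles) || articles.length === 0) {
        return ["no articles returned"];
    }
    const problems = [];
    for (const field of ["title", "link", "source"]) {
        const missing = articles.filter((a) => !a[field]).length;
        if (missing === articles.length) {
            problems.push(`every article is missing "${field}"`);
        }
    }
    return problems;
}

console.log(checkResults([]));
```

An empty result can also simply mean that the query matched nothing, so treat these warnings as a hint rather than proof that the tool is broken.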

Bugs 🐞

Due to the size of Chromium, this package is too large to run on Vercel free tier. For more information please refer to this issue.

Please report bugs via the issue tracker.

Contribute 🤝

Feel free to submit a PR if you've fixed an open issue. Thank you.

Python version 🐍

If you're looking for a Python version, there's one here. Please note, the Python version is a fork and is maintained separately. If you have any issues with the Python version, please open an issue on that repo instead of here.