google-news-scraper
v2.7.0
Lightweight async scraper for Google News
📰 Google News Scraper
A lightweight package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.
Installation 🔌
# Install via NPM
npm install google-news-scraper
# Install via Yarn
yarn add google-news-scraper
Usage 🕹️
Simply import the package and pass a config object.
import googleNewsScraper from 'google-news-scraper';
const articles = await googleNewsScraper({ searchTerm: "The Oscars" });
Full documentation on the config object can be found below.
Output 📲
The output is an array of JSON objects, with each article following the structure below:
[
  {
    "title": "Article title",
    "link": "http://url-to-website.com/path/to/article",
    "image": "http://url-to-website.com/path/to/image.jpg",
    "source": "Name of publication",
    "datetime": "2024-05-13T08:02:22.000Z",
    "time": "Time/date published (human-readable)",
    "articleType": "String, one of ['regular' | 'topicFeatured' | 'topicSmall']"
  }
]
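As a rough illustration of working with this output, the sketch below groups article titles by their source field. The sample data and the groupTitlesBySource helper are hypothetical and not part of the package; only the field names come from the structure above.

```javascript
// Hypothetical sample shaped like the output above; the values are made up.
const sampleArticles = [
  { title: "A", link: "https://example.com/a", source: "Example Times", articleType: "regular" },
  { title: "B", link: "https://example.com/b", source: "Example Post", articleType: "topicFeatured" },
  { title: "C", link: "https://example.com/c", source: "Example Times", articleType: "regular" },
];

// Hypothetical helper: collect article titles under each publication name.
function groupTitlesBySource(articles) {
  const grouped = {};
  for (const article of articles) {
    if (!grouped[article.source]) grouped[article.source] = [];
    grouped[article.source].push(article.title);
  }
  return grouped;
}

const bySource = groupTitlesBySource(sampleArticles);
console.log(bySource);
```

In a real application you would pass the array returned by googleNewsScraper in place of sampleArticles.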
Config ⚙️
The config object passed to the function above has the following properties:
searchTerm
This is the search query you'd like to find articles for. Pass the search string like so: searchTerm: "The Oscars".
The search term is no longer a required field, as hahagu added support for topic pages in #44. If neither searchTerm nor baseUrl is supplied, the scraper will just return results from the Google News homepage.
baseUrl
The baseUrl property enables you to specify an alternate base URL for your search. This is useful when you want to scrape, for example, a specific Google News topic.
PLEASE NOTE: Using both a baseUrl that points to a topic AND a searchTerm is not advised, as the searchTerm will typically be ignored in favour of the topic in the baseUrl.
In the scraped URL, your baseUrl will be immediately followed by query parameters (eg: ?hl=en-US&gl=US&ceid=US), so it doesn't matter whether your baseUrl has a trailing slash or not.
Defaults to https://news.google.com/search
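If you build config objects dynamically, a small guard can catch the topic-plus-searchTerm combination described in the note above. This is only a sketch: checkConfig is a hypothetical helper, not part of the package, the topic URL is a placeholder, and detecting a topic by a /topics/ path segment is an assumption about how Google News topic URLs look.

```javascript
// Hypothetical helper: returns a list of warnings for a config object.
function checkConfig(config) {
  const warnings = [];
  if (config.baseUrl && config.searchTerm) {
    // Assumption: topic pages are identified by "/topics/" in the URL path.
    const pointsToTopic = new URL(config.baseUrl).pathname.includes("/topics/");
    if (pointsToTopic) {
      warnings.push("searchTerm will typically be ignored in favour of the topic in baseUrl");
    }
  }
  return warnings;
}

// Placeholder topic URL, for illustration only.
const topicConfig = {
  baseUrl: "https://news.google.com/topics/EXAMPLE_TOPIC_ID",
  searchTerm: "The Oscars",
};
console.log(checkConfig(topicConfig));
console.log(checkConfig({ searchTerm: "The Oscars" }));
```

The first call reports one warning and the second reports none, since a searchTerm on its own is the normal use case.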
prettyURLs
The URLs that Google News supplies for each article are "ugly" links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen"), but by default the scraper will retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html"). This is done using some base64 decoding, so the overhead is negligible. To prevent this default behaviour and retrieve the "ugly" links instead, pass prettyURLs: false to the config object.
Credit to anthonyfranc for the base64 decode fix 🙏
Defaults to true.
timeframe
The results can be filtered to articles published within a given timeframe prior to the request.
The format of the timeframe is a string composed of a number followed by a letter representing the time operator. For example, 1y would signify 1 year. Full list of operators below:
- h = hours (eg: 12h)
- d = days (eg: 7d)
- m = months (eg: 6m)
- y = years (eg: 1y)
Defaults to 7d.
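If the timeframe comes from user input, you may want to validate it before passing it to the scraper. The sketch below checks a string against the format described above. parseTimeframe and TIME_UNITS are hypothetical names, not part of the package, and the scraper may validate the value differently internally.

```javascript
// Operators listed in the timeframe documentation above.
const TIME_UNITS = { h: "hours", d: "days", m: "months", y: "years" };

// Hypothetical validator: returns { amount, unit } for a valid timeframe, otherwise null.
function parseTimeframe(timeframe) {
  const match = /^(\d+)([hdmy])$/.exec(timeframe);
  if (!match) return null;
  return { amount: Number(match[1]), unit: TIME_UNITS[match[2]] };
}

console.log(parseTimeframe("12h")); // { amount: 12, unit: 'hours' }
console.log(parseTimeframe("2w"));  // null
```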
getArticleContent
By default, the scraper does not return the article content, as this would require Puppeteer to navigate to each individual article in the results (increasing execution time significantly). If you would like to enable this behaviour and receive the content of each article, pass getArticleContent: true in the config. This will add two fields to each article in the output: content and favicon.
[
  {
    "title": "Article title",
    "link": "https://url-to-website.com/path/to/article",
    "image": "https://url-to-website.com/path/to/image.jpg",
    "source": "Name of publication",
    "time": "Time/date published (human-readable)",
    "content": "The full text content of the article...",
    "favicon": "https://url-to-website.com/path/to/favicon.png"
  }
]
PLEASE NOTE: Due to the large number of variables to take into account, this feature fails on many websites. All errors are handled gracefully and will return an empty string as the content. Please ensure you handle such outcomes in your application.
Defaults to false
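One way to handle the empty-content case is to separate the articles whose content was retrieved from those where it came back empty. This is a minimal sketch: the sample data and the partitionByContent helper are hypothetical, and only the content field and its empty-string fallback come from the description above.

```javascript
// Hypothetical sample shaped like the getArticleContent output; the values are made up.
const scraped = [
  { title: "A", content: "Full text of article A..." },
  { title: "B", content: "" }, // content retrieval failed for this article
];

// Hypothetical helper: split articles by whether any content was retrieved.
function partitionByContent(articles) {
  const withContent = [];
  const withoutContent = [];
  for (const article of articles) {
    const hasContent = typeof article.content === "string" && article.content.trim() !== "";
    (hasContent ? withContent : withoutContent).push(article);
  }
  return { withContent, withoutContent };
}

const { withContent, withoutContent } = partitionByContent(scraped);
console.log(`${withContent.length} with content, ${withoutContent.length} without`);
```

Articles in withoutContent still carry their title and link, so your application could fall back to showing those.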
logLevel
You can customise the log level to any of the following:
- none: No logs will be output at all.
- error: Only errors will be output to the log.
- warn: Errors and warnings will be output to the log.
- info: Info, errors and warnings will be output to the log.
- verbose: All of the above and potentially more. Currently there are no specifically verbose logs, but in future I may move some of the info logs to verbose and/or add some debugging info there.
Defaults to error.
queryVars
An object of additional query params to add to the Google News URL string, formatted as key value pairs. This can be useful if you want to search for articles in a specific language, for example:
const articles = await googleNewsScraper({
searchTerm: "Últimas noticias en Madrid",
queryVars: {
gl:"ES",
ceid:"ES:es"
},
});
Defaults to null
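To see how such key value pairs look once serialised into a query string, you can use the standard URLSearchParams API. This only illustrates the encoding. How the scraper merges queryVars with its own parameters internally is not documented here, so treat the combined URL as an assumption.

```javascript
// The same key value pairs as in the example above.
const queryVars = { gl: "ES", ceid: "ES:es" };

// URLSearchParams percent-encodes reserved characters such as ":".
const queryString = new URLSearchParams(queryVars).toString();
console.log(queryString); // gl=ES&ceid=ES%3Aes

// Assumption: the parameters are appended to the default base URL.
const exampleUrl = `https://news.google.com/search?${queryString}`;
console.log(exampleUrl);
```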
puppeteerArgs
An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the --no-sandbox and --disable-setuid-sandbox flags, as explained in this SO answer.
Defaults to []
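If the same code runs both locally and on Heroku, you could add the sandbox flags only when needed. The sketch below is one possible approach: buildConfig is a hypothetical helper, not part of the package, and it assumes that the DYNO environment variable is set when running on a Heroku dyno.

```javascript
// Flags named in the note above for Heroku deployments.
const herokuArgs = ["--no-sandbox", "--disable-setuid-sandbox"];

// Hypothetical helper: builds a config object, adding the flags only on Heroku.
// Assumption: Heroku sets the DYNO environment variable on its dynos.
function buildConfig(searchTerm, env = process.env) {
  return {
    searchTerm,
    puppeteerArgs: env.DYNO ? herokuArgs : [],
  };
}

console.log(buildConfig("The Oscars", { DYNO: "web.1" }));
console.log(buildConfig("The Oscars", {}));
```

The resulting object can be passed straight to googleNewsScraper.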
puppeteerHeadlessMode
Whether or not Puppeteer should run in headless mode. Running in headless mode increases performance by approximately 30% (credit to ole-ve for finding this). If you're not sure about this setting, leave it as it is.
Defaults to true
limit
The total number of articles that you would like to be returned. Please note that with higher numbers, the actual returned number may be lower. Typically the max is 99, but it varies depending on many variables in Puppeteer (such as rate limiting, network conditions etc.).
Defaults to 99
TypeScript 💙
Google News Scraper includes full TypeScript definitions.
Your IDE should pick the types up automatically, but if not you can find them in the dist/tsc/ folder.
Common JS 👴🏻
Google News Scraper is built to work as an ESM module out of the box, but it also works as a Common JS module. Just use require instead of import:
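const googleNewsScraper = require('google-news-scraper');
// Top-level await is not available in Common JS, so use the returned promise instead
googleNewsScraper({ searchTerm: "The Oscars" })
  .then((articles) => console.log(articles));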
Performance 📈
My test query returned 94 results, which took 4.5 seconds with article content and 3.6 seconds without it. I'm on a fibre connection, and other queries may return a different number of results, so your mileage may vary.
Upkeep 🧹
Please note that this is a web-scraper, which relies on DOM selectors, so any fundamental changes in the markup on the Google News site will probably break this tool. I'll try my best to keep it up-to-date, but changes to the markup on Google News will be silent and therefore difficult to keep track of. Feel free to submit an issue if the tool stops working.
Bugs 🐞
Due to the size of Chromium, this package is too large to run on the Vercel free tier. For more information please refer to this issue.
Please report bugs via the issue tracker.
Contribute 🤝
Feel free to submit a PR if you've fixed an open issue. Thank you.
Python version 🐍
If you're looking for a Python version, there's one here. Please note, the Python version is a fork and is maintained separately. If you have any issues with the Python version, please open an issue on that repo instead of here.