opengraph-io
v3.0.1
Published
Website scraper to grab OpenGraph tags or supplement them when they don't exist
Downloads
4,040
Readme
OpenGraph ( Node Client 3.0.1 )
OpenGraph.io client library for nodejs. Given a URL, the client will make a HTTP request to OpenGraph.io which will scrape the site for OpenGraph tags. If tags exist the tags will be returned to you.
If some tags are missing, the client will attempt to infer them from the content on the page, and these inferred tags will be returned as part of the hybridGraph
.
The hybridGraph
results will always default to any OpenGraph tags that were found on the page. If only some tags
were found, or none were, the missing tags will be inferred from the content on the page.
To get a free forever key, signup at OpenGraph.io.
The vast majority of projects will be totally covered using one of our inexpensive plans.
Dedicated plans are also available upon request.
Installation
To install the OpenGraph.io client...
Install the NPM package
npm install opengraph-io --save
Usage
The library provides an OpenGraphIO class that you can instantiate with the required options. Here's an example:
// With Require
const { OpenGraphIO } = require('opengraph-io');
// Or with Import
import { OpenGraphIO } from 'opengraph-io';
const options = {
appId: 'YOUR_APP_ID', // This is your OpenGraph.io App ID and Required. Sign up for a free one at https://www.opengraph.io/
service: 'site', // We currently have three services: site, extract, and scrape. Site will be the default
cacheOk: true, // If a cached result is available, use it for quickness
useProxy: false, // Proxies help avoid being blocked and can bypass capchas
maxCacheAge: 432000000, // The maximum cache age to accept
acceptLang: 'en-US,en;q=0.9', // Language to present to the site.
fullRender: false // This will cause JS to execute when rendering to deal with JS dependant sites
};
const ogClient = new OpenGraphIO(options);
The options shown above are the default options. To understand more about these parameters, please view our documentation at: https://www.opengraph.io/documentation/
The options supplied to the constructor above will be applied to any requests made by the library but can be overridden
by supplying parameters at the time of calling getSiteInfo
.
Retrieving Open Graph Data
To retrieve Open Graph data for a specific URL, you can use the getSiteInfo method. It returns a Promise that resolves to the Open Graph data.
const url = 'https://www.example.com';
ogClient.getSiteInfo(url)
.then((data) => {
// Handle the Open Graph data
console.log(data);
})
.catch((error) => {
// Handle errors
console.error(error);
});
Using Async Await
If you are using async await, the same call will behave very similarly
async function fetchOpenGraphData(url) {
try {
const data = await ogClient.getSiteInfo(url);
// Handle the Open Graph data
console.log(data);
} catch (error) {
// Handle errors
console.error(error);
}
}
// OR, with custom options for the request
async function fetchOpenGraphDataWithProxy(url) {
const options = {
useProxy: true,
cacheOk: false
}
try {
const data = await ogClient.getSiteInfo(url, options);
// Handle the Open Graph data
console.log(data);
} catch (error) {
// Handle errors
console.error(error);
}
}
OpenGraphIO Services
Service Options: site
, extract
, scrape
Site Service
Unleash the power of our Unfurling API to effortlessly extract Open Graph tags from any URL.
Options
| Parameter | Required | Example | Description |
|-------------|----------|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| appId | yes | - | The API key for registered users. Create an account (no cc ever required) to receive your app_id
. |
| cacheOk | no | false | This will force our servers to pull a fresh version of the site being requested. By default, this value is true. |
| fullRender | no | false | This will fully render the site using a chrome browser before parsing its contents. This is especially helpful for single page applications and JS redirects. This will slow down the time it takes to get a response by around 1.5 seconds. |
| useProxy | no | false | Route your request through residential and mobile proxies to avoid bot detection. This will slow down requests 3-10 seconds and can cause requests to time out. NOTE: Proxies are a limited resource and expensive for our team maintain. Free accounts share a small pool of proxies. If you plan on using proxies often, paid accounts provide dedicated concurrent proxies for your account. |
| maxCacheAge | no | 432000000 | This specifies the maximum age in milliseconds that a cached response should be. If not specified, the value is set to 5 days. (5 days * 24 hours * 60 minutes * 60 seconds * 1000ms = 432,000,000 ms) |
| acceptLang | no | en-US,en;q=0.9 auto | This specifies the request language sent when requesting the URL. This is useful if you want to get the site for languages other than English. The default setting for this will return an English version of a page if it exists. Note: if you specify the value auto, the API will use the same language settings as your current request. For more information on what to supply for this field, please see: Accept-Language - MDN Web Docs |
Extract Service
The extract endpoint enables you to extract information from any website by providing its URL. With this endpoint, you can extract any element you need from the website, including but not limited to the title, header elements (h1 to h5), and paragraph elements (p).
Options
| Parameter | Required | Example | Description |
|--------------|----------|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| appId | yes | - | The API key for registered users. Create an account (no cc ever required) to receive your app_id
. |
| htmlElements | no | h1,h2,p | This is an optional parameter and specifies the HTML elements you want to extract from the website. The value should be a comma-separated list of HTML element names. If this parameter is not supplied, the default elements that will be extracted are h1, h2, h3, h4, h5, p, and title. |
| cacheOk | no | false | This will force our servers to pull a fresh version of the site being requested. By default, this value is true. |
| fullRender | no | false | This will fully render the site using a chrome browser before parsing its contents. This is especially helpful for single page applications and JS redirects. This will slow down the time it takes to get a response by around 1.5 seconds. |
| useProxy | no | false | Route your request through residential and mobile proxies to avoid bot detection. This will slow down requests 3-10 seconds and can cause requests to time out. NOTE: Proxies are a limited resource and expensive for our team maintain. Free accounts share a small pool of proxies. If you plan on using proxies often, paid accounts provide dedicated concurrent proxies for your account. |
| maxCacheAge | no | 432000000 | This specifies the maximum age in milliseconds that a cached response should be. If not specified, the value is set to 5 days. (5 days * 24 hours * 60 minutes * 60 seconds * 1000ms = 432,000,000 ms) |
| acceptLang | no | en-US,en;q=0.9 auto | This specifies the request language sent when requesting the URL. This is useful if you want to get the site for languages other than English. The default setting for this will return an English version of a page if it exists. Note: if you specify the value auto, the API will use the same language settings as your current request. For more information on what to supply for this field, please see: Accept-Language - MDN Web Docs |
Scrape Service
Just need the raw HTML? The Scrape Site endpoint is used to scrape the HTML of a website given its URL
Options
| Parameter | Required | Example | Description |
|--------------|----------|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| appId | yes | - | The API key for registered users. Create an account (no cc ever required) to receive your app_id
. |
| cacheOk | no | false | This will force our servers to pull a fresh version of the site being requested. By default, this value is true. |
| fullRender | no | false | This will fully render the site using a chrome browser before parsing its contents. This is especially helpful for single page applications and JS redirects. This will slow down the time it takes to get a response by around 1.5 seconds. |
| useProxy | no | false | Route your request through residential and mobile proxies to avoid bot detection. This will slow down requests 3-10 seconds and can cause requests to time out. NOTE: Proxies are a limited resource and expensive for our team maintain. Free accounts share a small pool of proxies. If you plan on using proxies often, paid accounts provide dedicated concurrent proxies for your account. |
| maxCacheAge | no | 432000000 | This specifies the maximum age in milliseconds that a cached response should be. If not specified, the value is set to 5 days. (5 days * 24 hours * 60 minutes * 60 seconds * 1000ms = 432,000,000 ms) |
| acceptLang | no | en-US,en;q=0.9 auto | This specifies the request language sent when requesting the URL. This is useful if you want to get the site for languages other than English. The default setting for this will return an English version of a page if it exists. Note: if you specify the value auto, the API will use the same language settings as your current request. For more information on what to supply for this field, please see: Accept-Language - MDN Web Docs |
Using Retry Strategies
NOTE: each retry will be counted as an additional request based on the paramters for that request
With retry strategies you can define properties you require to be populated (e.g. title or description) and also provide the strategies in the order you would like them to be attempted.
The client will work through the strategies until the requires
parameter list is satisfied.
Key points to know is this is a
- Note that all requests made are returned using an element of result named
allRequests
. - You must provide
requires
as a parameter of option, it's used to define what values you want back to consider it a "good" response.
The example below will execute as follows:
- Make a normal request for this url.
- If the previous request did not return the
openGraph.title
field, it will make a request using full javascript rendering - If the previous request did not return the
openGraph.title
field, it will make a request using our proxy service
If all retry strategies have failed the requirements specified in requires
then it will return the last
request made.
const options = {
appId: 'YOUR_APP_ID', // This is your OpenGraph.io App ID and Required. Sign up for a free one at https://www.opengraph.io/
cacheOk: false,
retryStrategies: [
{
// Requests using the default options.
requires: ["openGraph.title"]
},
{
// Make a full request to full render
fullRender: true,
requires: ["openGraph.title"]
},
{
// Make a request using the proxy
useProxy: true,
requires: ["openGraph.title"]
}
]
};
const ogClient = new OpenGraphIO(options);
opengraph.getSiteInfo('https://www.newegg.com/Product/Product.aspx?Item=N82E16813157762')
.then(function(result){
console.log('Site title is', result.hybridGraph.title);
});
If you only need retry strategies for a single request, you can also pass them in as a parameter to the request.
const options = {
appId: 'YOUR_APP_ID', // This is your OpenGraph.io App ID and Required. Sign up for a free one at https://www.opengraph.io/
cacheOk: false,
};
const ogClient = new OpenGraphIO(options);
const retryStrategies = [
{
// Requests using the default options.
requires: ["hybridGraph.title"]
},
{
// Make a full request to full render
fullRender: true,
requires: ["hybridGraph.title"]
},
{
// Make a request using the proxy
useProxy: true,
requires: ["hybridGraph.title"]
}
]
opengraph.getSiteInfo('https://www.newegg.com/Product/Product.aspx?Item=N82E16813157762', { retryStrategies })
.then(function(result){
console.log('Site title is', result.hybridGraph.title);
});
Support
Feel free to reach out at any time with questions or suggestions by adding to the issues for this repo or if you'd prefer, head over to https://www.opengraph.io/support/ and drop us a line!
License
MIT License
Copyright (c) Opengraph.io
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.