sitemapper

v3.2.20

Published

5 days ago

Parser for XML Sitemaps to be used with Robots.txt and web crawlers

Downloads

130,934

seantomburke

parse sitemap xml robots.txt sitemaps crawlers webcrawler

Sitemap-parser

Parse a sitemap's XML to get all the URLs for your crawler.

Version 2

Installation

npm install sitemapper --save

Simple Example

const Sitemapper = require('sitemapper');

const sitemap = new Sitemapper();

// fetch() resolves with an object; the URL list is on its `sites` property
sitemap.fetch('https://wp.seantburke.com/sitemap.xml').then(function ({ sites }) {
  console.log(sites);
});

Examples

import Sitemapper from 'sitemapper';

(async () => {
  const Google = new Sitemapper({
    url: 'https://www.google.com/work/sitemap.xml',
    timeout: 15000, // 15 seconds
  });

  try {
    const { sites } = await Google.fetch();
    console.log(sites);
  } catch (error) {
    console.log(error);
  }
})();

// or

const sitemapper = new Sitemapper();
sitemapper.timeout = 5000;

sitemapper
  .fetch('https://wp.seantburke.com/sitemap.xml')
  .then(({ url, sites }) => console.log(`url:${url}`, 'sites:', sites))
  .catch((error) => console.log(error));
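
Both examples resolve with a `sites` array of URL strings. As a minimal sketch of what a crawler might do next, the helper below deduplicates the URLs and groups them by hostname. `groupByHost` and the sample data are hypothetical and not part of sitemapper; the array stands in for the `sites` value of a resolved `fetch()`.

```javascript
// Hypothetical helper (not part of sitemapper): dedupe URLs and group them by hostname.
function groupByHost(sites) {
  const groups = new Map();
  for (const site of new Set(sites)) {
    let host;
    try {
      host = new URL(site).hostname;
    } catch {
      continue; // skip entries that are not valid absolute URLs
    }
    if (!groups.has(host)) groups.set(host, []);
    groups.get(host).push(site);
  }
  return groups;
}

// Sample data standing in for the `sites` array of a resolved fetch().
const sampleSites = [
  'https://example.com/',
  'https://example.com/about',
  'https://blog.example.com/post-1',
  'https://example.com/about', // duplicate, dropped by the Set
  'not a url', // malformed, skipped
];

for (const [host, urls] of groupByHost(sampleSites)) {
  console.log(host, urls.length);
}
```

In real use, the same helper could be called inside `.then(({ sites }) => groupByHost(sites))`.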

Options

You can add options on the initial Sitemapper object when instantiating it.

requestHeaders: (Object) - Additional Request Headers (e.g. User-Agent)
timeout: (Number) - Maximum timeout in ms for a single URL. Default: 15000 (15 seconds)
url: (String) - Sitemap URL to crawl
debug: (Boolean) - Enables/Disables debug console logging. Default: False
concurrency: (Number) - Sets the maximum number of sitemaps crawled concurrently. Default: 10
retries: (Number) - Sets the maximum number of retries to attempt in case of an error response (e.g. 404 or Timeout). Default: 0
rejectUnauthorized: (Boolean) - If true, it will throw on invalid certificates, such as expired or self-signed ones. Default: True
lastmod: (Number) - Timestamp of the minimum lastmod value allowed for returned urls
proxyAgent: (HttpProxyAgent|HttpsProxyAgent) - instance of npm "hpagent" HttpProxyAgent or HttpsProxyAgent to be passed to npm "got"
exclusions: (Array) - Array of regex patterns to exclude URLs from being processed
fields: (Object) - An object of fields to be returned from the sitemap. Leaving a field out has the same effect as <field>: false. If not specified, sitemapper defaults to returning the 'classic' array of URLs. Available fields:
- loc: (Boolean) - The URL location of the page
- lastmod: (Boolean) - The date of last modification of the page
- changefreq: (Boolean) - How frequently the page is likely to change
- priority: (Boolean) - The priority of this URL relative to other URLs on your site
- image:loc: (Boolean) - The URL location of the image (for image sitemaps)
- image:title: (Boolean) - The title of the image (for image sitemaps)
- image:caption: (Boolean) - The caption of the image (for image sitemaps)
- video:title: (Boolean) - The title of the video (for video sitemaps)
- video:description: (Boolean) - The description of the video (for video sitemaps)
- video:thumbnail_loc: (Boolean) - The thumbnail URL of the video (for video sitemaps)

For Example:

fields: {
  loc: true,
  lastmod: true,
  changefreq: true,
  priority: true,
}
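
When specific fields are requested, the results are no longer plain URL strings. The sketch below assumes each entry is an object keyed by the requested field names with string values; that shape, the sample data, and the `sortByLastmod` helper are assumptions for illustration, not something this page specifies.

```javascript
// Sample entries in the assumed shape: one object per URL, keyed by field name.
const entries = [
  { loc: 'https://example.com/a', lastmod: '2023-01-10', changefreq: 'weekly', priority: '0.5' },
  { loc: 'https://example.com/b', lastmod: '2024-03-02', changefreq: 'daily', priority: '0.8' },
  { loc: 'https://example.com/c', changefreq: 'monthly', priority: '0.3' }, // no lastmod
];

// Hypothetical helper: newest first; entries with a missing or unparsable lastmod go last.
function sortByLastmod(items) {
  const time = (item) => {
    const t = Date.parse(item.lastmod);
    return Number.isNaN(t) ? -Infinity : t;
  };
  return [...items].sort((x, y) => {
    const a = time(x);
    const b = time(y);
    if (a === b) return 0;
    return b > a ? 1 : -1;
  });
}

console.log(sortByLastmod(entries).map((entry) => entry.loc));
```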

An example using most of the available options:

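import Sitemapper from 'sitemapper';
import { HttpProxyAgent } from 'hpagent';

const sitemapper = new Sitemapper({
  url: 'https://art-works.community/sitemap.xml',
  timeout: 15000, // 15 seconds per URL
  requestHeaders: {
    'User-Agent':
      'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0',
  },
  debug: true,
  concurrency: 2,
  retries: 1,
  rejectUnauthorized: false, // accept expired or self-signed certificates
  fields: {
    loc: true,
    lastmod: true,
    changefreq: true,
    priority: true,
  },
  // hpagent agents take an options object with a `proxy` URL
  proxyAgent: new HttpProxyAgent({ proxy: 'http://localhost:8080' }),
});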
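
The `exclusions` and `lastmod` options are not exercised in the constructor example above. The sketch below mimics, on sample data, the filtering those options are described as performing. It is a local approximation and not sitemapper's implementation: `applyFilters` is a hypothetical name, the timestamp is assumed to be in milliseconds since the epoch, and dropping entries that have no `lastmod` when a minimum is set is also an assumption.

```javascript
// Local approximation of the described filtering; not sitemapper's own code.
function applyFilters(entries, { exclusions = [], lastmod } = {}) {
  return entries.filter(({ loc, lastmod: modified }) => {
    // Drop URLs matching any exclusion pattern (avoid the `g` flag: test() is stateful with it).
    if (exclusions.some((pattern) => pattern.test(loc))) return false;
    if (lastmod !== undefined) {
      const t = Date.parse(modified);
      // Assumption: entries with a missing or unparsable lastmod are dropped.
      if (Number.isNaN(t) || t < lastmod) return false;
    }
    return true;
  });
}

// Option values as they might be passed to the constructor.
const filterOptions = {
  exclusions: [/\/private\//, /\.pdf$/],
  lastmod: Date.parse('2024-01-01'), // assumed to be milliseconds since the epoch
};

const sampleEntries = [
  { loc: 'https://example.com/news', lastmod: '2024-03-02' },
  { loc: 'https://example.com/private/admin', lastmod: '2024-03-02' },
  { loc: 'https://example.com/report.pdf', lastmod: '2024-05-01' },
  { loc: 'https://example.com/archive', lastmod: '2022-07-15' },
];

console.log(applyFilters(sampleEntries, filterOptions).map((entry) => entry.loc));
```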
