
@web-extractors/arachnid-seo v1.0.3-beta


Arachnid-SEO

An open-source web crawler that extracts internal link information for SEO auditing & optimization purposes. The project builds on the Puppeteer headless browser and is inspired by the Arachnid PHP library.

Features

  1. Simple Node.js library with asynchronous crawling capability.
  2. Crawls site pages, bounded by a maximum depth or a maximum result count.
  3. Implements the BFS (Breadth-First Search) algorithm, traversing pages level by level.
  4. Event-driven implementation lets users of the library consume output in real time (crawling started/completed/skipped/failed, etc.).
  5. Extracts the following SEO-related information for each page in a site:
    • Page title, main heading (H1) and subheading (H2) tag contents.
    • Page status code/text, enabling detection of broken links (4xx/5xx).
    • Meta tag information, including description, keywords, author and robots tags.
    • Broken image resources and images with missing alt attributes.
    • Page indexability status and, if a page is not indexable, the reason (e.g. blocked by robots.txt, client error, canonicalized).
    • Information about page resources (document/stylesheet/JavaScript/image files, etc. requested by a page).
    • More SEO-oriented information will be added soon.

Getting Started

Installing

Node.js v10.0.0+ is required.

npm install @web-extractors/arachnid-seo

Basic Usage

const Arachnid = require('@web-extractors/arachnid-seo').default;
const crawler = new Arachnid('https://www.example.com');
crawler.setCrawlDepth(2)
       .traverse()
       .then((results) => console.log(results)); // page results
// Or with async/await:
// const results = await crawler.traverse();

Results output:

Map(3) {
  "https://www.example.com/" => {
    "url": "https://www.example.com/",
    "urlEncoded": "https://www.example.com/",
    "isInternal": true,
    "statusCode": 200,
    "statusText": "",
    "contentType": "text/html; charset=UTF-8",
    "depth": 1,
    "resourceInfo": [
      {
        "type": "document",
        "count": 1,
        "broken": []
      }
    ],
    "responseTimeMs": 340,
    "DOMInfo": {
      "title": "Example Domain",
      "h1": [
        "Example Domain"
      ],
      "h2": [],
      "meta": [],
      "images": {
        "missingAlt": []
      },
      "canonicalUrl": "",
      "uniqueOutLinks": 1
    },
    "isIndexable": true,
    "indexabilityStatus": ""
  },
  "https://www.iana.org/domains/example" => {
    "url": "https://www.iana.org/domains/example",
    "urlEncoded": "https://www.iana.org/domains/example",
    "statusCode": 301,
    "statusText": "",
    "contentType": "text/html; charset=iso-8859-1",
    "isInternal": false,
    "robotsHeader": null,
    "depth": 2,
    "redirectUrl": "https://www.iana.org/domains/reserved",
    "isIndexable": false,
    "indexabilityStatus": "Redirected"
  },
  "https://www.iana.org/domains/reserved" => {
    "url": "https://www.iana.org/domains/reserved",
    "urlEncoded": "https://www.iana.org/domains/reserved",
    "isInternal": false,
    "statusCode": 200,
    "statusText": "",
    "contentType": "text/html; charset=UTF-8",
    "depth": 2,
    "isIndexable": true,
    "indexabilityStatus": ""
  }
}
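
Each entry in the results Map can then be inspected to build a report. Below is a minimal sketch based on the fields shown in the output above (the report logic and names are illustrative, not part of the library) that flags broken links and images with missing alt attributes:

const Arachnid = require('@web-extractors/arachnid-seo').default;

const crawler = new Arachnid('https://www.example.com');
crawler.setCrawlDepth(2)
       .traverse()
       .then((results) => {
         for (const [url, page] of results) {
           // Broken links are reported with 4xx/5xx status codes
           if (page.statusCode >= 400) {
             console.log(`Broken link (${page.statusCode}): ${url}`);
           }
           // DOMInfo is only present for crawled internal HTML pages
           const missingAlt = page.DOMInfo ? page.DOMInfo.images.missingAlt : [];
           if (missingAlt.length > 0) {
             console.log(`Images missing alt on ${url}:`, missingAlt);
           }
         }
       });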

Advanced Usage

The library is designed using the builder pattern: crawling behaviour is configured through chainable methods, described in the sections below.

Method chaining
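
For example, several of the options described below can be combined in a single chain before calling traverse(). This is a sketch with illustrative values, assuming (per the builder pattern above) that each setter returns the crawler:

const Arachnid = require('@web-extractors/arachnid-seo').default;

const crawler = new Arachnid('https://www.example.com')
                        .setCrawlDepth(3)              // crawl up to 3 levels deep
                        .setConcurrency(5)             // process 5 URLs at a time
                        .shouldFollowSubdomains(true); // also follow subdomain links

crawler.traverse().then((results) => console.log(results.size));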

Setting maximum depth

To specify the maximum link depth to crawl, the setCrawlDepth method can be used:

A depth of 1 is used by default if neither setCrawlDepth nor setMaxResultsNum is set.

crawler.setCrawlDepth(3);

Setting maximum results number

To specify the maximum number of results to crawl, the setMaxResultsNum method can be used:

setMaxResultsNum overrides setCrawlDepth when both methods are used.

crawler.setMaxResultsNum(100);

Setting number of concurrent requests

To improve crawl speed, the package crawls 5 URLs concurrently by default. To change that concurrency value, the setConcurrency method can be used:

This modifies the number of pages/tabs Puppeteer creates at the same time; increasing it to a large number may have a memory impact.

crawler.setConcurrency(10);

Setting Puppeteer Launch Options

To pass additional arguments to the Puppeteer browser instance, the setPuppeteerOptions method can be used:

Refer to the puppeteer documentation for more information about the available options.

The sample below runs Arachnid-SEO on UNIX without needing to install extra dependencies:

  crawler.setPuppeteerOptions({
      args: [
        '--disable-gpu',
        '--disable-dev-shm-usage',
        '--disable-setuid-sandbox',
        '--no-first-run',
        '--no-sandbox',
        '--no-zygote',
        '--single-process'
      ]
  });

Enable following subdomains links

By default, the crawler only follows and extracts information from internal links on the same domain. To also follow subdomain links, the shouldFollowSubdomains method can be used:

crawler.shouldFollowSubdomains(true);

Ignoring Robots.txt rules

By default, the crawler respects robots.txt allow/disallow rules. To ignore them, the ignoreRobots method can be used:

crawler.ignoreRobots();

Using Events

Arachnid-SEO provides methods to track crawling progress by emitting various events, as shown below:

Events example

const Arachnid = require('@web-extractors/arachnid-seo').default;
const crawler = new Arachnid('https://www.example.com/')
                        .setConcurrency(5)
                        .setCrawlDepth(2);

crawler.on('results', response => console.log(response))
       .on('pageCrawlingSuccessed', pageResponse => processResponsePerPage(pageResponse))
       .on('pageCrawlingFailed', pageFailed => handleFailedCrawling(pageFailed));
       // See https://github.com/web-extractors/arachnid-seo-js#using-events for the full list of events emitted

crawler.traverse();

See the full examples for the complete list of emitted events.

List of events

event: 'info'
  • Emitted when a general activity takes place, e.g. getting the next batch of pages to process.
  • Payload: <InformativeMessage(String)>
event: 'error'
  • Emitted when an error occurs while processing a link or a batch of links, e.g. a URL with an invalid hostname.
  • Payload: <ErrorMessage(String)>
event: 'pageCrawlingStarted'
  • Emitted when crawling of a page starts (Puppeteer opens a tab for the page URL).
  • Payload: <{url(String), depth(int)}>
event: 'pageCrawlingSuccessed'
  • Emitted when a success response is received for a URL (2xx/3xx).
  • Payload: <{url(String), statusCode(int)}>
event: 'pageCrawlingFailed'
  • Emitted when a failure response is received for a URL (4xx/5xx).
  • Payload: <{url(String), statusCode(int)}>
event: 'pageCrawlingFinished'
  • Emitted when a page URL is marked as processed, after extracting all information and adding it to the results map.
  • Payload: <{url(String), ResultInfo}>
event: 'pageCrawlingSkipped'
  • Emitted when crawling or extracting page info is skipped due to non-HTML content, or an invalid or external link.
  • Payload: <{url(String), reason(String)}>
event: 'results'
  • Emitted when crawling of all links matching the parameters is complete, returning all link information.
  • Payload: <Map<{url(String), ResultInfo}>>
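
A minimal sketch wiring up the events above for progress logging; the handler bodies are illustrative, and the payload shapes follow the list above:

const Arachnid = require('@web-extractors/arachnid-seo').default;

const crawler = new Arachnid('https://www.example.com').setCrawlDepth(2);

crawler.on('info', (message) => console.log(`[info] ${message}`))
       .on('error', (message) => console.error(`[error] ${message}`))
       .on('pageCrawlingStarted', ({ url, depth }) => console.log(`Started ${url} (depth ${depth})`))
       .on('pageCrawlingSuccessed', ({ url, statusCode }) => console.log(`OK ${statusCode} ${url}`))
       .on('pageCrawlingFailed', ({ url, statusCode }) => console.warn(`Failed ${statusCode} ${url}`))
       .on('pageCrawlingSkipped', ({ url, reason }) => console.log(`Skipped ${url}: ${reason}`))
       .on('pageCrawlingFinished', ({ url }) => console.log(`Finished ${url}`))
       .on('results', (results) => console.log(`Crawl complete: ${results.size} pages`));

crawler.traverse();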

Changelog

We are still in Beta version :new_moon:

Contributing

Feel free to raise a ticket under the Issues tab or submit a PR for any bug fix, feature, or enhancement.

Authors

License

MIT License