pagean

v13.1.0


aarongoldenthal


Pagean

Pagean is a web page analysis tool designed to automate tests requiring web pages to be loaded in a browser window (for example 404 error loading an external resource, page renders with horizontal scrollbars). The specific tests are outlined below, but are all general tests that do not include any page-specific logic.

Installation

Install Pagean globally (as shown below), or locally, via npm.

npm install -g pagean

Usage

Pagean runs as a command line tool and is executed as follows:

Installed globally:
> pagean [options]

Installed locally:
> npx pagean [options]

Options:
  -V, --version        output the version number
  -c, --config <file>  the path to the pagean configuration file (default: "./.pageanrc.json")
  -h, --help           display help for command

Pagean requires a configuration file, which can be specified via the CLI as detailed previously; otherwise the default file .pageanrc.json in the project root is used. This file provides the URLs to be tested and options to configure the tests and reports. Details on the available tests and the configuration file format are provided below.

Test cases

The tests use Puppeteer to launch a headless Chrome browser. The URLs defined in the configuration file are each loaded once, and after page load the applicable tests are executed. Test results are passed or failed, but can be configured to report warning instead of failure. Only a failed test causes the test process to fail and exit with an error code (a warning does not). If a page URL fails to load, it is retried up to two additional times and if unsuccessful the URL is logged as a page error with the error message.
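
The retry behavior described above can be sketched as follows. This is only an illustration under stated assumptions, not Pagean's actual code: loadWithRetry and the loadPage callback are hypothetical names, and loadPage stands in for whatever loads the URL in the browser.

```javascript
// Hypothetical sketch: one initial attempt plus up to two retries.
// `loadPage` is an assumed async callback that rejects if the page fails to load.
async function loadWithRetry(loadPage, url, maxRetries = 2) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      // Success on any attempt ends the loop immediately.
      return { url, page: await loadPage(url) };
    } catch (error) {
      lastError = error;
    }
  }
  // All attempts failed: report a page error carrying the error message.
  return { url, pageError: lastError.message };
}
```

With the default of two retries, a URL that never loads is attempted three times in total before it is recorded as a page error.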

Broken link test

The broken link test checks for broken links on the page. It checks any <a> tag on the page with href pointing to another location on the current page or another page (that is, only http(s) or file protocols).

For links within the page, this test checks for existence of the element on the page, passing if the element exists and failing otherwise (and passing for cases that are always valid, for example # or #top for the current page). It doesn't check the visibility of the element. Failing tests return a response of "#element Not Found" (where #element identifies the specific element).
For links to other pages, the test tries to confirm as efficiently as possible whether the target link is valid. It first makes a HEAD request for that URL and checks the response. If an erroneous response is returned (>= 400 with no execution error) and it is not code 429 (Too Many Requests), the request is retried with a GET request. The test passes for HTTP responses < 400 and fails otherwise (if the HTTP response is >= 400 or another error occurs).
- This can result in false failure indications, specifically for file: links (404 or ECONNREFUSED) or where the browser passes a domain identity with the request (page loads when tested, but 401 response for links to that page). For these cases, or other false failures, the test configuration allows a Boolean checkWithBrowser option that instead checks links by loading the target in the browser (via Puppeteer). Note this can increase test execution time, in some cases substantially, due to the time needed to open a new browser tab and load the page and all assets.
- Note that file: links can only be tested with the checkWithBrowser option.
- If the link to another page includes a hash it's removed prior to checking. The test in this case is confirming a valid link, not that the element exists, which is only done for the current page.
- The test configuration allows an ignoredLinks array listing link URLs to ignore for this test. Note this only applies to links to other pages, not links within the page, which are always checked.
To optimize performance, link test results are cached and those links aren't re-tested for the entire test run (across all tested URLs). The test configuration allows a Boolean ignoreDuplicates option that can be set to false to bypass this behavior and re-test all links. The results for any failed links are included in the reports in any case.
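
The HEAD-then-GET strategy, hash removal, and caching described above can be sketched as follows. This is an illustration only: checkExternalLink, stripHash, shouldRetryWithGet, and the injected request(method, url) helper are all hypothetical names, and the real implementation may differ.

```javascript
// Hypothetical sketch of checking a link to another page.
// `request(method, url)` is an assumed async helper that resolves to an HTTP
// status code, or rejects with an error that has a `code` (e.g. "ENOTFOUND").
const statusCache = new Map();

// Links to other pages are checked without the hash.
const stripHash = (href) => {
  const url = new URL(href);
  url.hash = '';
  return url.href;
};

// Retry with GET only for error responses other than 429 (Too Many Requests).
const shouldRetryWithGet = (status) => status >= 400 && status !== 429;

async function checkExternalLink(href, request, ignoreDuplicates = true) {
  const target = stripHash(href);
  let status;
  if (ignoreDuplicates && statusCache.has(target)) {
    // Already tested during this run, so reuse the earlier result.
    status = statusCache.get(target);
  } else {
    try {
      status = await request('HEAD', target);
      if (shouldRetryWithGet(status)) {
        status = await request('GET', target);
      }
    } catch (error) {
      // Execution errors are reported by code and are not retried.
      status = error.code || error.message;
    }
    statusCache.set(target, status);
  }
  return { href, status, passed: typeof status === 'number' && status < 400 };
}
```

Injecting the request helper keeps the decision logic separate from the HTTP client, so the sketch does not assume which client Pagean uses.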

For any failing test, the data array in the test report includes the original URL and the response code or error as shown below.

[
  {
    "href": "https://about.gitlab.com/not-found",
    "status": 404
  },
  {
    "href": "http://localhost:3000/brokenLinks.html#notlinked",
    "status": "#notlinked Not Found"
  },
  {
    "href": "https://this.url.does.not.exist/",
    "status": "ENOTFOUND"
  }
]

Note: this test checks all links on the page, and doesn't respect mechanisms intended to limit web crawlers such as robots.txt or noindex tags.

Console error test

The console error test fails if any error is written to the browser console, but is otherwise simply a subset of the console output test. This separation allows testing for console errors while permitting any other console output.

Console output test

The console output test fails if any output is written to the browser console. An array is included in the report with all entries, as shown below:

[
  {
    "type": "error",
    "text": "Failed to load resource: net::ERR_NAME_NOT_RESOLVED",
    "location": {
      "url": "https://this.url.does.not.exist/file.js"
    }
  }
]
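
The relationship between the two console tests can be sketched as a filter on the entry type. The evaluateConsole function and the sample "log" entry are hypothetical and only illustrate the idea; the entry shape follows the report example above.

```javascript
// Hypothetical sketch: the console error test considers only "error" entries,
// while the console output test considers every entry.
const consoleEntries = [
  { type: 'log', text: 'App started' }, // illustrative entry, not from a real report
  {
    type: 'error',
    text: 'Failed to load resource: net::ERR_NAME_NOT_RESOLVED',
    location: { url: 'https://this.url.does.not.exist/file.js' }
  }
];

const evaluateConsole = (entries) => {
  const errors = entries.filter((entry) => entry.type === 'error');
  return {
    consoleOutputTest: { passed: entries.length === 0, data: entries },
    consoleErrorTest: { passed: errors.length === 0, data: errors }
  };
};

const results = evaluateConsole(consoleEntries);
console.log(results.consoleOutputTest.data.length); // 2
console.log(results.consoleErrorTest.data.length); // 1
```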

External script test

The external script test is intended to identify any externally loaded JavaScript files (for example loaded from a CDN) and aggregate those files so they can undergo further analysis (for example dependency vulnerability scanning). The test is included here since these tests load fully rendered pages, therefore allowing the aggregation of this data for pages generated using any language or framework. By default the test returns a warning if the page includes any JavaScript files loaded from a different domain than the page (this can be overridden to fail instead by setting failWarn: false, see the Configuration section below). These files are then downloaded and saved in the "pagean-external-scripts" directory in the project root. Subdirectories are created for each domain, then following the URL path. For example, the following script…

<script src="https://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js"></script>

…is saved as ./pagean-external-scripts/bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js. The data array in the test report includes the original file URL and the local saved filename or applicable error, as shown below.

[
  {
    "url": "https://code.jquery.com/jquery-3.4.1.slim.min.js",
    "localFile": "pagean-external-scripts/code.jquery.com/jquery-3.4.1.slim.min.js"
  },
  {
    "url": "http://bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js",
    "error": "Request failed with status code 404"
  }
]

Each external script is saved only once, but is reported on any page where it's referenced.
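
The mapping from a script URL to its saved location can be sketched as below. The localFileFor and isExternal helpers are hypothetical, the base directory name is taken from the report example above, and comparing hosts (which includes the port) is an assumption about what "different domain" means.

```javascript
const path = require('node:path');

// Hypothetical helper: derive the local filename for an external script URL
// as <base directory>/<domain>/<URL path>.
const localFileFor = (scriptUrl, baseDir = 'pagean-external-scripts') => {
  const { hostname, pathname } = new URL(scriptUrl);
  return path.posix.join(baseDir, hostname, pathname);
};

// Assumption: a script is external when its host differs from the page's host.
// Relative script URLs are resolved against the page URL first.
const isExternal = (scriptUrl, pageUrl) =>
  new URL(scriptUrl, pageUrl).host !== new URL(pageUrl).host;

console.log(localFileFor('https://code.jquery.com/jquery-3.4.1.slim.min.js'));
// pagean-external-scripts/code.jquery.com/jquery-3.4.1.slim.min.js
```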

Horizontal scrollbar test

The horizontal scrollbar test fails if the rendered page has a horizontal scrollbar. If a specific browser viewport size is desired for this test, that can be configured in the puppeteerLaunchOptions.

Page load time test

The page load time test fails if the page load time (from start through the load event) exceeds the defined threshold in the configuration file (or the default of 2 seconds). The actual load time is included in the report. Tests time out at twice the page load time threshold.
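
The threshold and timeout relationship can be sketched as follows. The evaluateLoadTime function is a hypothetical name used only for illustration; times are in seconds, matching the pageLoadTimeThreshold setting.

```javascript
// Hypothetical sketch of the threshold logic; all times are in seconds.
const DEFAULT_THRESHOLD = 2;

const evaluateLoadTime = (loadTime, threshold = DEFAULT_THRESHOLD) => ({
  // The test fails only if the load time exceeds the threshold.
  passed: loadTime <= threshold,
  // The actual load time is included in the report.
  loadTime,
  // The test times out at twice the threshold.
  timeout: threshold * 2
});

console.log(evaluateLoadTime(2.5).passed); // false
console.log(evaluateLoadTime(2.5, 3).timeout); // 6
```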

Rendered HTML test

The rendered HTML test is intended for cases where content is dynamically created prior to page load (that is, the load event firing). The rendered HTML is returned and checked with HTMLHint, and the test fails if any issues are found. An array is included in the report with all HTMLHint issues, as shown below:

[
  {
    "col": 9,
    "evidence": "    <div id=\"div1\"></div>",
    "line": 6,
    "message": "The id value [ div1 ] must be unique.",
    "raw": " id=\"div1\"",
    "rule": {
      "description": "The value of id attributes must be unique.",
      "id": "id-unique",
      "link": "https://github.com/thedaviddias/HTMLHint/wiki/id-unique"
    },
    "type": "error"
  }
]

An htmlhintrc file can be specified in the configuration file, otherwise the default "./.htmlhintrc" file is used (if it exists). See the Configuration section below.

Note: this test may not find some errors in the original HTML that are removed/resolved as the page is parsed (for example closing tags with no opening tags).

Reports

Based on the reporters configuration, Pagean results may be displayed in the console and saved in two reports in the project root directory (any or all of the three):

A JSON report named pagean-results.json.
An HTML report named pagean-results.html.

Both reports contain:

The time of test execution.
A summary of the total tests and results (passed, warning, failed, and page errors).
The detailed test results, including the URL tested, list of tests performed on that URL with results, and, if applicable, any relevant data associated with the test failure (for example the console errors if the console error test fails).

Complete reports for the example case in this project (the tests as specified in the project .pageanrc.json file) can be found at the preceding links.

Configuration

Pagean looks for a configuration file as specified via the CLI, or defaults to a file named .pageanrc.json in the project root. If the configuration file is not found, is not valid JSON, or doesn't contain any URLs to check, the job fails.
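
The three failure conditions can be sketched as a small validation step. This is a hypothetical illustration: parseConfig and its error messages are invented for the example, and it ignores the sitemap option described below.

```javascript
// Hypothetical sketch working on the configuration file contents.
// `text` is undefined when the file was not found.
const parseConfig = (text) => {
  if (text === undefined) {
    throw new Error('Configuration file not found');
  }
  let config;
  try {
    config = JSON.parse(text);
  } catch {
    throw new Error('Configuration file is not valid JSON');
  }
  if (!Array.isArray(config?.urls) || config.urls.length === 0) {
    throw new Error('Configuration file contains no URLs to check');
  }
  return config;
};

const config = parseConfig('{ "urls": ["http://localhost:3000/"] }');
console.log(config.urls.length); // 1
```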

Below is an example .pageanrc.json file, which is broken into seven major properties:

htmlhintrc: An optional path to an htmlhintrc file to be used in the rendered HTML test.
project: An optional name of the project, which is included in HTML and JSON reports.
puppeteerLaunchOptions: An optional set of options to pass to Puppeteer on launch. The complete list of available options can be found at https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#puppeteerlaunchoptions.
reporters: An optional array of reporters indicating the test reports that should be provided. There are three possible options - cli, html, and json. The cli option reports all test details to the console, but the final results summary is always output (even with cli disabled). If reporters is specified, at least one reporter must be included. The default value, as specified below, is all three reporters enabled.
settings: These settings enable/disable or configure tests, and are applied to all tests overriding the default values.
- The shorthand notation allows easy enabling/disabling of tests. In this format the test name is given with a Boolean value to enable or disable the test. In this case any other test-specific settings use the default values.
- The longhand version includes an object for each test. Every test includes two possible properties (some tests include additional settings):
  - enabled: A Boolean value to enable/disable the test (default true for all tests).
  - failWarn: A Boolean value causing a failed test to report a warning instead of failure. A warning result doesn't cause the test process to fail (exit with an error code). The default value for all tests is false except the externalScriptTest, as shown below.

The shorthand:

"settings": {
    "consoleErrorTest": true
}

is equivalent to the longhand:

"settings": {
    "consoleErrorTest": {
        "enabled": true,
        "failWarn": false
    }
}
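
The shorthand-to-longhand equivalence can be sketched as a normalization step. The normalizeSettings and defaultsFor functions are hypothetical names, and only the two common properties (enabled and failWarn) are modeled.

```javascript
// Hypothetical sketch: expand shorthand Boolean settings to the longhand form.
// Defaults follow the text above: enabled is true for all tests, and failWarn
// is false for all tests except externalScriptTest.
const defaultsFor = (testName) => ({
  enabled: true,
  failWarn: testName === 'externalScriptTest'
});

const normalizeSettings = (settings = {}) =>
  Object.fromEntries(
    Object.entries(settings).map(([testName, value]) => [
      testName,
      typeof value === 'boolean'
        ? { ...defaultsFor(testName), enabled: value }
        : { ...defaultsFor(testName), ...value }
    ])
  );

console.log(JSON.stringify(normalizeSettings({ consoleErrorTest: true })));
// {"consoleErrorTest":{"enabled":true,"failWarn":false}}
```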

sitemap: Specify a sitemap with URLs to test. If a sitemap is specified, the URLs from the sitemap are added to the urls array. If a URL is in the urls array with settings, those settings are retained. Note that <sitemapindex> is currently not supported. The sitemap object can have the following properties:
- url: The URL of the sitemap (required if sitemap is included). This can be either an actual URL or a local file.
- find: A string to search for in sitemap URLs (for example https://somehere.test) (required if replace is specified).
- replace: The string to replace the find string with (for example http://localhost:3000) (required if find is specified).
- exclude: An array of strings with regular expressions to exclude URLs from the sitemap (for example ['\.pdf$'] to exclude any PDF files). Since these are string representations of regular expressions, the backslash must be escaped (for example \\.). Exclude is performed before find/replace, so uses the original URLs from the sitemap.
urls: An array of URLs to be tested, which must contain at least one value. Each array entry can either be a URL string, or an object that contains a url string and an optional settings object. This object can contain any of the settings values identified previously and overrides that setting for testing that URL. The url string can be either an actual URL or a local file, as shown in the example below.
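
The sitemap handling described above (exclude first, then find/replace, then merge into urls) can be sketched as follows. The processSitemapUrls and mergeUrls functions are hypothetical, and replacing only the first occurrence of the find string is an assumption.

```javascript
// Hypothetical sketch: exclusions run against the original sitemap URLs,
// and find/replace is applied afterwards.
const processSitemapUrls = (sitemapUrls, { find, replace, exclude = [] } = {}) => {
  // Each exclude entry is a string representation of a regular expression.
  const patterns = exclude.map((pattern) => new RegExp(pattern));
  return sitemapUrls
    .filter((url) => !patterns.some((pattern) => pattern.test(url)))
    .map((url) => (find === undefined ? url : url.replace(find, replace)));
};

// Add sitemap URLs to the configured urls array, keeping any existing entry
// (and its settings) for a URL that is already listed.
const mergeUrls = (urls, sitemapUrls) => {
  const known = new Set(
    urls.map((entry) => (typeof entry === 'string' ? entry : entry.url))
  );
  return [...urls, ...sitemapUrls.filter((url) => !known.has(url))];
};

const fromSitemap = processSitemapUrls(
  ['https://somehere.test/', 'https://somehere.test/guide.pdf'],
  { find: 'https://somehere.test', replace: 'http://localhost:3000', exclude: ['\\.pdf$'] }
);
console.log(fromSitemap); // [ 'http://localhost:3000/' ]
```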

The following shows all available settings, except sitemap, with the default values.

{
  "puppeteerLaunchOptions": {
    "headless": "new"
  },
  "reporters": ["cli", "html", "json"],
  "settings": {
    "brokenLinkTest": {
      "enabled": true,
      "failWarn": false,
      "checkWithBrowser": false,
      "ignoreDuplicates": true
    },
    "consoleErrorTest": {
      "enabled": true,
      "failWarn": false
    },
    "consoleOutputTest": {
      "enabled": true,
      "failWarn": false
    },
    "externalScriptTest": {
      "enabled": true,
      "failWarn": true
    },
    "horizontalScrollbarTest": {
      "enabled": true,
      "failWarn": false
    },
    "pageLoadTimeTest": {
      "enabled": true,
      "failWarn": false,
      "pageLoadTimeThreshold": 2
    },
    "renderedHtmlTest": {
      "enabled": true,
      "failWarn": false
    }
  }
}
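
How defaults, global settings, and per-URL settings combine can be sketched as a layered merge. The effectiveSettings and outcome functions are hypothetical, and merging property by property within each test is an assumption about the override behavior.

```javascript
// Hypothetical sketch: per-URL settings override global settings, which
// override the defaults. Shorthand Booleans are treated as { enabled: value }.
const toObject = (value) =>
  typeof value === 'boolean' ? { enabled: value } : value ?? {};

const effectiveSettings = (defaults, globalSettings = {}, urlSettings = {}) =>
  Object.fromEntries(
    Object.keys(defaults).map((name) => [
      name,
      {
        ...defaults[name],
        ...toObject(globalSettings[name]),
        ...toObject(urlSettings[name])
      }
    ])
  );

// A failed test is reported as a warning when failWarn is true.
const outcome = (passed, { failWarn }) =>
  passed ? 'passed' : failWarn ? 'warning' : 'failed';

const defaults = {
  pageLoadTimeTest: { enabled: true, failWarn: false, pageLoadTimeThreshold: 2 },
  externalScriptTest: { enabled: true, failWarn: true }
};
const settings = effectiveSettings(
  defaults,
  { pageLoadTimeTest: { pageLoadTimeThreshold: 3 } },
  { externalScriptTest: false }
);
console.log(settings.pageLoadTimeTest.pageLoadTimeThreshold); // 3
console.log(outcome(false, settings.externalScriptTest)); // warning
```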

Numerous example config files used in the tests can be found here.

Container images

Provided with the Pagean project are container images configured to run the tests. All available image tags can be found in the registry.gitlab.com/gitlab-ci-utils/pagean repository here. Details on each release can be found on the Releases page.

Note: any images in the gitlab-ci-utils/pagean/tmp repository are temporary images used during the build process and may be deleted at any point.

Puppeteer cache location

In Puppeteer v19 the default cache location for installing the Chrome binary was changed from within the project's node_modules folder to ~/.cache/puppeteer. To simplify execution in a container, the PUPPETEER_CACHE_DIR environment variable is set to install the Chrome binaries in /home/pptruser/.cache/puppeteer during container build, so setting it to another value before execution can cause errors where Puppeteer can't find the Chrome binary.

GitLab CI configuration

The following is an example job from a .gitlab-ci.yml file to use this image to run Pagean against another project in GitLab CI:

pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  script:
    - pagean
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/

Testing with a static HTTP server

The container image shown previously includes serve and wait-on installed globally to run a local HTTP server for testing static content. The example job below illustrates how to use this for Pagean tests. The script starts the server in this project's ./tests/fixtures/site directory and uses wait-on to hold the script until the server is running and returns a valid response. The referenced pageanrc file is the same as the project default pageanrc, but references all test URLs from the local server.

pagean:
  image: registry.gitlab.com/gitlab-ci-utils/pagean:latest
  stage: test
  before_script:
    # Start static server in test cases directory, discarding any console output,
    # and wait until the server is running.
    - serve ./tests/fixtures/site > /dev/null 2>&1 & wait-on http://localhost:3000
  script:
    - pagean -c static-server.pageanrc.json
  artifacts:
    when: always
    paths:
      - pagean-results.html
      - pagean-results.json
      - pagean-external-scripts/

Linting pageanrc files

A command line tool is also available to lint pageanrc files, which is executed as follows:

Installed globally:
> pageanrc-lint [options] [file] (default: "./.pageanrc.json")

Installed locally:
> npx pageanrc-lint [options] [file] (default: "./.pageanrc.json")

Lint a pageanrc file

Options:
  -V, --version  output the version number
  -j, --json     output JSON with full details
  -h, --help     display help for command

The --json option outputs the JSON results to stdout in all cases for consistency ([] if no errors found, so that it always outputs valid JSON). Otherwise errors are output to stderr, for example:

.\tests\test-configs\cli-tests\some-test.pageanrc.json
  <pageanrc>.puppeteerLaunchOptions                  must NOT have fewer than 1 properties
  <pageanrc>.reporters[0]                            must be equal to one of the allowed values (cli, html, json)
  <pageanrc>.settings.consoleOutputTest              must be either Boolean or object with the appropriate properties
  <pageanrc>.settings.pageLoadTimeTest.foo           must NOT contain additional properties: "foo"
  <pageanrc>.settings.pageLoadTimeTest               must be either Boolean or object with the appropriate properties
  <pageanrc>.sitemap                                 must use 'find' and 'replace' together
  <pageanrc>.urls[2].settings.consoleOutputTest      must be either Boolean or object with the appropriate properties
  <pageanrc>.urls[3]                                 must be either URL string or object with the appropriate properties
  <pageanrc>.urls[5]                                 must have required property 'url'

In some cases, a single error might result in multiple messages based on the options in the schema definition, especially for cases that can be either a single value or an object with specific properties (for example the errors for <pageanrc>.settings.pageLoadTimeTest in the preceding example).

Note that because of the large number of options, which are dependent on an external project, the linting of puppeteerLaunchOptions only checks that at least one property is provided; it doesn't check the detailed settings.