site-audit-seo
Web service and CLI tool for SEO site audit: crawl site, lighthouse all pages, view public reports in browser. Also output to console, json, csv.
Web report viewer: site-audit-seo-viewer.
Russian description below.
Using without install
Open https://viasite.github.io/site-audit-seo-viewer/.
Features:
- Crawls the entire site, collects links to pages and documents
- Does not follow links outside the scanned domain (configurable)
- Analyses each page with Lighthouse (see below)
- Analyses the main page text with Mozilla Readability and Yake
- Searches for pages with SSL mixed content
- Scans a list of URLs passed with --url-list (see the example after this list)
- Set default report fields and filters
- Scan presets
- Documents with the extensions doc, docx, xls, xlsx, ppt, pptx, pdf, rar, zip are added to the list with a depth == 0
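For example, scanning an arbitrary URL list works by pointing -u at a page that contains the list (the address below is a hypothetical placeholder):
site-audit-seo -u https://example.com/urls.txt --url-list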
Technical details:
- Does not load images, css, js (configurable)
- Each site is saved to a file with the domain name in ~/site-audit-seo/
- Some URLs are ignored (preRequest in src/scrap-site.js)
Web viewer features:
- Fixed table header and url column
- Add/remove columns
- Column presets
- Field groups by categories
- Filter presets (e.g. h1_count != 1)
- Color validation
- Verbose page details (+ button)
- Direct URL to the same report with selected fields, filters, sort
- Stats for all scanned pages, validation summary
- Persistent URL to the report when using --upload (see the example after this list)
- Switch between last uploaded reports
- Rescan current report
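For example, to get a persistent public report URL, upload the JSON during the scan (example.com is a placeholder):
site-audit-seo -u https://example.com --upload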
Fields list (18.08.2020):
- url
- mixed_content_url
- canonical
- is_canonical
- previousUrl
- depth
- status
- request_time
- redirects
- redirected_from
- title
- h1
- page_date
- description
- keywords
- og_title
- og_image
- schema_types
- h1_count
- h2_count
- h3_count
- h4_count
- canonical_count
- google_amp
- images
- images_without_alt
- images_alt_empty
- images_outer
- links
- links_inner
- links_outer
- text_ratio_percent
- dom_size
- html_size
- html_size_rendered
- lighthouse_scores_performance
- lighthouse_scores_pwa
- lighthouse_scores_accessibility
- lighthouse_scores_best-practices
- lighthouse_scores_seo
- lighthouse_first-contentful-paint
- lighthouse_speed-index
- lighthouse_largest-contentful-paint
- lighthouse_interactive
- lighthouse_total-blocking-time
- lighthouse_cumulative-layout-shift
- and 150 more Lighthouse tests!
Install
Zero-knowledge install
Requires Docker.
Windows: download and run install-run.bat.
The script will clone the repository to %LocalAppData%\Programs\site-audit-seo and run the service on http://localhost:5302.
Linux/MacOS:
curl https://raw.githubusercontent.com/viasite/site-audit-seo/master/install-run.sh | bash
The script will clone the repository to $HOME/.local/share/programs/site-audit-seo and run the service on http://localhost:5302.
The service will be available at http://localhost:5302.
Default ports:
- Backend: 5301
- Frontend: 5302
- Yake: 5303
You can change them in the .env file or in docker-compose.yml.
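A minimal sketch of such an override via .env (the variable names below are assumptions for illustration, not taken from the project; check the repository's .env and docker-compose.yml for the real keys):
# hypothetical .env values, names are assumptions
BACKEND_PORT=5301
FRONTEND_PORT=5302
YAKE_PORT=5303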
Install with NPM:
npm install -g site-audit-seo
For Linux users:
npm install -g site-audit-seo --unsafe-perm=true
After installing on Ubuntu, you may need to change the owner of the Chrome directory from root to your user.
Run this (replace $USER with your username, or run it as your user rather than as root):
sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"
Install a developer instance with docker-compose
git clone https://github.com/viasite/site-audit-seo
cd site-audit-seo
git clone https://github.com/viasite/site-audit-seo-viewer data/front
docker-compose pull # to skip the build step
docker-compose up -d
Error details: Invalid file descriptor to ICU data received.
Command line usage:
$ site-audit-seo --help
Usage: site-audit-seo -u https://example.com
Options:
-u --urls <urls> Comma separated url list for scan
-p, --preset <preset> Table preset (minimal, seo, seo-minimal, headers, parse, lighthouse,
lighthouse-all) (default: "seo")
-t, --timeout <timeout> Timeout for page request, in ms (default: 10000)
-e, --exclude <fields> Comma separated fields to exclude from results
-d, --max-depth <depth> Max scan depth (default: 10)
-c, --concurrency <threads> Threads number (default: by cpu cores)
--lighthouse Appends base Lighthouse fields to preset
--delay <ms> Delay between requests (default: 0)
-f, --fields <json> Field in format --field 'title=$("title").text()' (default: [])
--default-filter <defaultFilter> Default filter when JSON viewed, example: depth>1
--no-skip-static Scan static files
--no-limit-domain Scan not only current domain
--docs-extensions <ext> Comma-separated extensions that will be add to table (default:
doc,docx,xls,xlsx,ppt,pptx,pdf,rar,zip)
--follow-xml-sitemap Follow sitemap.xml (default: false)
--ignore-robots-txt Ignore disallowed in robots.txt (default: false)
--url-list assume that --url contains url list, will set -d 1 --no-limit-domain
--ignore-robots-txt (default: false)
--remove-selectors <selectors> CSS selectors for remove before screenshot, comma separated (default:
".matter-after,#matter-1,[data-slug]")
-m, --max-requests <num> Limit max pages scan (default: 0)
--influxdb-max-send <num> Limit send to InfluxDB (default: 5)
--no-headless Show browser GUI while scan
--remove-csv Delete csv after json generate (default: true)
--remove-json Delete json after serve (default: true)
--no-remove-csv No delete csv after generate
--no-remove-json No delete json after serve
--out-dir <dir> Output directory (default: "~/site-audit-seo/")
--out-name <name> Output file name, default: domain
--csv <path> Skip scan, only convert existing csv to json
--json Save as JSON (default: true)
--no-json No save as JSON
--upload Upload JSON to public web (default: false)
--no-color No console colors
--partial-report <partialReport>
--lang <lang> Language (en, ru, default: system language)
--no-console-validate Don't output validate messages in console
--disable-plugins <plugins> Comma-separated plugin list (default: [])
--screenshot Save page screenshot (default: false)
-V, --version output the version number
-h, --help display help for command
Custom fields
Linux/Mac:
site-audit-seo -d 1 -u https://example -f 'title=$("title").text()' -f 'h1=$("h1").text()'
site-audit-seo -d 1 -u https://example -f noindex=$('meta[content="noindex,%20nofollow"]').length
Windows:
site-audit-seo -d 1 -u https://example -f title=$('title').text() -f h1=$('h1').text()
Remove fields from results
This will output fields from the seo preset, excluding canonical fields:
site-audit-seo -u https://example.com --exclude canonical,is_canonical
Lighthouse
Analyse each page with Lighthouse
site-audit-seo -u https://example.com --preset lighthouse
Analyse seo + Lighthouse
site-audit-seo -u https://example.com --lighthouse
Config file
You can copy .site-audit-seo.conf.js to your home directory and tune options.
Send to InfluxDB
This is a beta feature. How to configure it:
- Add this to ~/.site-audit-seo.conf:
module.exports = {
influxdb: {
host: 'influxdb.host',
port: 8086,
database: 'telegraf',
measurement: 'site_audit_seo', // optional
username: 'user',
password: 'password',
maxSendCount: 5, // optional, by default only part of the pages is sent
}
};
- Use --influxdb-max-send in the terminal.
- Create a command to scan your URLs:
site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log
- Add the command to cron.
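For example, a crontab entry that runs the command above nightly at 03:00 (assuming site-audit-seo is on cron's PATH; the URL and log path are the ones from the command above):
0 3 * * * site-audit-seo -u https://page-with-url-list.txt --url-list --lighthouse --upload --influxdb-max-send 100 >> ~/log/site-audit-seo.log 2>&1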
Plugins
- Readability - main page text length, reading time
- Yake - keywords extraction from main page text
See CONTRIBUTING.md for details about plugin development.
Install plugins:
cd data
npm install site-audit-seo-readability
npm install site-audit-seo-yake
Disable plugins:
You can pass an argument like --disable-plugins readability,yake. This is faster, but less data is extracted.
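For example, to scan without the two plugins listed above (example.com is a placeholder):
site-audit-seo -u https://example.com --disable-plugins readability,yake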
Credits
Based on headless-chrome-crawler (puppeteer). Uses the forked version @popstas/headless-chrome-crawler.
Bugs
- Sometimes identical pages are written to the csv. This happens in 2 cases:
  1. Redirect from another page to this one (solved by setting skipRequestedRedirect: true, hardcoded).
  2. Simultaneous requests of the same page in parallel threads.
Free audit tools alternatives
- WebSite Auditor (Link Assistant) - desktop app, 500 pages
- Screaming Frog SEO Spider - desktop app, same as site-audit-seo, 500 pages
- Seobility - 1 project up to 1000 pages free
- Neilpatel (Ubersuggest) - 1 project, 150 pages
- Semrush - 1 project, 100 pages per month free
- Seoptimer - good for single page analysis
Free data scrapers
- Web Scraper - free for local use extension
- Portia - self-hosted visual scraper builder, scrapy based
- Crawlab - distributed web crawler admin platform, self-hosted with Docker
- OutWit Hub - free edition, pro edition for $99
- Octoparse - 10 000 records free
- Parsers.me - 1 000 pages per run free
- website-scraper - opensource, CLI, download site to local directory
- website-scraper-puppeteer - same but puppeteer based
- Gerapy - distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js
Russian
Scans one or several sites into a json file with a web interface.
Features:
- Crawls the entire site, collects links to pages and documents
- Summary of results after the scan
- Documents with the extensions doc, docx, xls, xlsx, pdf, rar, zip are added to the list with a depth of 0
- Search for pages with SSL mixed content
- Each site is saved to a file with the domain name
- Does not follow links outside the scanned domain (configurable)
- Does not load images, css, js (configurable)
- Some URLs are ignored (preRequest in src/scrap-site.js)
- Each page can be run through Lighthouse (see below)
- Scan an arbitrary list of URLs, --url-list
Install:
npm install -g site-audit-seo
If you are on Ubuntu:
npm install -g site-audit-seo --unsafe-perm=true
npm run postinstall-puppeteer-fix
Or run this (replace $USER with your username, or run as your user, not as root):
sudo chown -R $USER:$USER "$(npm prefix -g)/lib/node_modules/site-audit-seo/node_modules/puppeteer/.local-chromium/"
Error details: Invalid file descriptor to ICU data received.
Usage
site-audit-seo -u https://example.com
Custom fields
You can pass additional fields like this:
site-audit-seo -d 1 -u https://example -f "title=$('title').text()" -f "h1=$('h1').text()"
Lighthouse
Run each page through Lighthouse:
site-audit-seo -u https://example.com --preset lighthouse
Regular seo audit + Lighthouse:
site-audit-seo -u https://example.com --lighthouse
How to count content in the csv
- Open it in a text editor
- Count documents by searching for ,0
- Exclude pagination pages by searching for ?
- Subtract 1 (the header row)
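As a command-line sketch of the same counts (the csv path is a hypothetical example based on the default output directory):
grep -c ',0' ~/site-audit-seo/example.com.csv   # documents: rows containing ",0" (depth 0)
grep -vc '?' ~/site-audit-seo/example.com.csv   # rows without "?" (excludes pagination URLs)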
Bugs
- Sometimes identical pages are written to the csv. This happens in 2 cases:
  1. Redirect from another page to this one (solved by setting skipRequestedRedirect: true, done).
  2. Simultaneous requests of the same page in parallel threads.
TODO:
- Unique links
- Offline w3c validation
- Words count
- Sentences count
- Do not load images with non-standard URLs, like this
- External follow links
- Broken images
- Breadcrumbs - https://github.com/glitchdigital/structured-data-testing-tool
- joeyguerra/schema.js - https://gist.github.com/joeyguerra/7740007
- smhg/microdata-js - https://github.com/smhg/microdata-js
- Indicate page scan error
- Find broken encoding, like регионального