

Squid

[![npm Package](https://img.shields.io/npm/v/squid-crawler.svg?style=flat-square)](https://www.npmjs.com/package/squid-crawler)

Squid is a high-fidelity archival crawler that uses Chrome or Chrome Headless.

Squid aims to address the need for a high-fidelity crawler akin to Heritrix that is still easy enough for the personal archivist to set up and use.

Squid does not seek (at the moment) to dethrone Heritrix as the king of wide archival crawls; rather, it seeks to address Heritrix's shortcomings, namely:

  • No JavaScript execution
  • Everything is plain text
  • Requiring configuration to know how to preserve the web
  • Setup time and technical knowledge required of its users

For more information about this see

Squid is built using Node.js and chrome-remote-interface.

Can't install Node on your system? Then Squid highly recommends WARCreate or WAIL.
WARCreate did this first, and if it had not, Squid would not exist :two_hearts:

If recording the web is what you seek Squid highly recommends Webrecorder.

Out Of The Box Crawls

Page Only

Preserve the page such that there is no difference between replaying the page and viewing it in a web browser at preservation time.

Page + Same Domain Links

The Page Only option, plus preservation of all links found on the page that are on the same domain as the page.

Page + All internal and external links

The Page + Same Domain Links option, plus all links from other domains.

Crawls Operate In Terms Of A Composite Memento

A memento is an archived copy of a web resource (RFC 7089). The datetime when the copy was archived is called its Memento-Datetime. A composite memento is a root resource, such as an HTML web page, together with all of the embedded resources (images, CSS, etc.) required for a complete presentation.

More information about this terminology can be found via ws-dl.blogspot.com

Usage

There are two shell scripts provided to help you use the project at its current stage.

run-chrome.sh

You can change the chromeBinary variable to point to the Chrome command to use, that is, the command used to launch Chrome.

The current value of chromeBinary is google-chrome-beta.

The remoteDebugPort variable is used for --remote-debugging-port=<port>.

Chrome v59 (stable) or v60 (beta) are actively tested on Ubuntu 16.04.

v60 is currently used and known to work well :+1:

Chrome < v59 will not work.

No testing is done on Canary or google-chrome-unstable, so your mileage may vary if you use those versions. Windows, sorry, you're not supported yet.

The script takes one argument, headless, if you wish to use Chrome headless; otherwise it runs Chrome with a head :grinning:

For more information see Google web dev updates.
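
The script itself is not reproduced in this README, but here is a minimal sketch of what run-chrome.sh might look like, assuming only the chromeBinary and remoteDebugPort variables and the optional headless argument described above (everything else, including backgrounding the process, is an illustrative assumption rather than the project's actual script):

#!/usr/bin/env bash
# Chrome command to launch; the README notes google-chrome-beta is the current value
chromeBinary="google-chrome-beta"
# Passed to Chrome as --remote-debugging-port=<port>; must match the port in the crawl config
remoteDebugPort=9222

if [ "$1" = "headless" ]; then
  # Run Chrome headless (supported in Chrome 59+)
  "$chromeBinary" --headless --remote-debugging-port="$remoteDebugPort" &
else
  # Run Chrome with a head
  "$chromeBinary" --remote-debugging-port="$remoteDebugPort" &
fi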

run-crawler.sh

Once Chrome has been started, you can use run-crawler.sh, passing it -c <path-to-config.json>.

More information can be retrieved by using -h or --help.
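
For example, assuming Chrome is already running via run-chrome.sh and your config file sits next to the scripts (the config.json file name here is just illustrative):

# start a crawl using the settings in config.json
./run-crawler.sh -c config.json

# list the available options
./run-crawler.sh --help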

The config.json example below is also provided beside the two shell scripts without the annotations, since the annotations (comments) are not valid JSON.

{
  // supports page-only, page-same-domain, page-all-links:
  // crawl only the page, crawl the page and all same-domain links,
  // or crawl the page and all links, in terms of a composite memento
  "mode": "page-same-domain",
  // an array of seeds or a single seed
  "seeds": [
    "http://acid.matkelly.com"
  ],
  "warc": {
    "naming": "url", // currently this is the only option supported, do not change
    "output": "path" // where you want the WARCs to be placed; optional, defaults to cwd
  },
  // the Chrome instance we connect to is running on host, port;
  // port must match the --remote-debugging-port=<port> set when launching Chrome;
  // localhost is the default host when only --remote-debugging-port is set
  "connect": {
    "host": "localhost",
    "port": 9222
  },
  // time is in milliseconds
  "timeouts": {
    // wait at maximum 8s for Chrome to navigate to a page
    "navigationTimeout": 8000,
    // wait 7 seconds after page load
    "waitAfterLoad": 7000
  },
  // optional auto scrolling of the page; same feature as Webrecorder's auto-scroll page.
  // time is in milliseconds and indicates the duration of the scroll
  // in proportion to page size: higher values mean longer smooth scrolling,
  // lower values mean a faster smooth scroll
  "scroll": 4000
}
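
Because the annotated version above is not valid JSON, the file you actually pass to run-crawler.sh needs the comments stripped out. Here is a sketch of writing such a comment-free copy from the shell; the heredoc, the config.json file name, and the literal "path" placeholder for the WARC output directory are illustrative, and the values simply mirror the annotated example above:

# write a comment-free copy of the example config (valid JSON);
# replace "path" with the directory the WARCs should go to, or drop
# "output" entirely to use the default (the current working directory)
cat > config.json <<'EOF'
{
  "mode": "page-same-domain",
  "seeds": ["http://acid.matkelly.com"],
  "warc": {
    "naming": "url",
    "output": "path"
  },
  "connect": {
    "host": "localhost",
    "port": 9222
  },
  "timeouts": {
    "navigationTimeout": 8000,
    "waitAfterLoad": 7000
  },
  "scroll": 4000
}
EOF

# then point the crawler at it
./run-crawler.sh -c config.json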

JavaScript Style Guide