@harvard-lil/wacz-preparator
v0.0.5
wacz-preparator
CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.
wacz-preparator --extractor "archive-it" --username "lil" --password $PASSWORD --collection-id 12345
See also: wacz-exhibitor for embedding a self-contained web archive collection on a web page.
Foreword
⚠️🥼🧪 Experimental:
This pipeline was originally developed in the context of The Harvard Library Innovation Lab's partnership with the Radcliffe Institute's Schlesinger Library on experimental access to web archives.
We have only tested it on The Schlesinger #meToo Web Archives collection and would welcome feedback from the community to help solidify it.
In particular, we would love to hear more about:
- Any edge cases this pipeline currently doesn't account for.
- General interest in exploring new ways of storing, copying, and giving access to web archives.
Contact: [email protected]
How does it work?
Given a specific extractor and valid combination of credentials, wacz-preparator will perform the following steps in order to pull and package a remote web archives collection into a single WACZ file.
Example: Archive-It Extractor
| # | Description | Notes |
| --- | --- | --- |
| 01 | Check validity of credentials and access to the collection | |
| 02 | Create local collection folder if not already present | Because the underlying files are kept around in that folder, processing can be interrupted, resumed, and run multiple times over. |
| 03 | Pull Collection Information | |
| 04 | Pull list of available WARC files | |
| 05 | Pull crawl information for all WARC files | This includes retrieving seeds (URLs). |
| 06 | Pull page title for all of the crawled URLs | Will first try to fetch that information from the seed metadata. If not available, will try to pull that information from the Wayback Machine. |
| 07 | Delete "loose" WARCs from local collection folder | This comparison allows for discarding WARC files that may have previously been pulled locally but are no longer part of the collection. |
| 08 | Compare hashes of local WARC files against remote hashes (1) | This allows for determining what files need to be downloaded or re-downloaded. |
| 09 | Pull WARC files | Only the files that are not already present locally will be pulled. |
| 10 | Compare hashes of local WARC files against remote hashes (2) | At this stage, there should be no discrepancies. |
| 11 | Build pages list | |
| 12 | Prepare WACZ file | |
At the end of this process, a WACZ file named after the collection ID should be available (e.g. 12345.wacz).
WACZ files can be read with any compatible playback software, such as replayweb.page.
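Steps 07 to 10 of the table above boil down to comparing two sets of hashes. The snippet below is a minimal sketch of that comparison, not the library's actual implementation: `planWarcSync`, the shape of its arguments (plain objects mapping a WARC filename to a hash string), and the sample filenames are all assumptions made for illustration.

```javascript
// Hypothetical helper, not part of wacz-preparator's documented API.
// Both arguments map a WARC filename to its hash (e.g. a hex digest).
function planWarcSync (localHashes, remoteHashes) {
  const toDelete = []
  const toDownload = []

  // "Loose" files: present locally but no longer listed remotely.
  for (const filename of Object.keys(localHashes)) {
    if (!Object.hasOwn(remoteHashes, filename)) {
      toDelete.push(filename)
    }
  }

  // Files that are missing locally, or whose hash differs from the
  // remote one, need to be downloaded or re-downloaded.
  for (const [filename, remoteHash] of Object.entries(remoteHashes)) {
    const isPresent = Object.hasOwn(localHashes, filename)
    if (!isPresent || localHashes[filename] !== remoteHash) {
      toDownload.push(filename)
    }
  }

  return { toDelete, toDownload }
}

// Made-up sample data: one up-to-date file, one stale file,
// one loose file, and one file that was never downloaded.
const plan = planWarcSync(
  { 'a.warc.gz': 'aaa', 'b.warc.gz': 'stale', 'old.warc.gz': 'ccc' },
  { 'a.warc.gz': 'aaa', 'b.warc.gz': 'bbb', 'c.warc.gz': 'ddd' }
)

console.log(plan)
```

Running the same comparison again after the download step should, in this model, return two empty lists, which matches the "no discrepancies" expectation of step 10.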
Note: All of the operations that involve talking to the Archive-It API are run in parallel batches: the --concurrency option sets how many requests can run in parallel.
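To illustrate what running requests in parallel batches can look like, here is a minimal sketch under stated assumptions: `runInBatches` and `fakeRequest` are hypothetical names, the timer-based task stands in for a real API call, and the way wacz-preparator schedules its requests internally may differ. The default of 50 mirrors the CLI default for --concurrency.

```javascript
// Hypothetical helper: runs `tasks` (functions returning promises) in
// consecutive batches of at most `concurrency` tasks each.
async function runInBatches (tasks, concurrency = 50) {
  if (!Number.isInteger(concurrency) || concurrency < 1) {
    throw new RangeError('concurrency must be a positive integer')
  }

  const results = []

  for (let i = 0; i < tasks.length; i += concurrency) {
    const batch = tasks.slice(i, i + concurrency)
    // All tasks of a batch are started at once. The next batch only
    // starts once every promise of the current batch has resolved;
    // a single rejection makes the whole call reject.
    const batchResults = await Promise.all(batch.map((task) => task()))
    results.push(...batchResults)
  }

  return results
}

// Stand-in for an API request: resolves with the id it was given.
const fakeRequest = (id) => () => new Promise((resolve) => {
  setTimeout(() => resolve(id), 10)
})

const tasks = [1, 2, 3, 4, 5].map(fakeRequest)

const done = runInBatches(tasks, 2).then((results) => {
  console.log(results)
  return results
})
```

With five tasks and a concurrency of 2, this sketch runs three batches (2, 2, then 1 task), and results come back in the same order as the input tasks.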
Getting Started
Dependencies
wacz-preparator requires Node.js 18+.
Compatibility
This program has been written for UNIX-like systems and is expected to work on Linux, macOS, and Windows Subsystem for Linux.
Installation
wacz-preparator is available on npmjs.org and can be installed as follows:
# As a CLI
npm install -g @harvard-lil/wacz-preparator
# As a library
npm install @harvard-lil/wacz-preparator --save
CLI
Here are a few examples of how wacz-preparator can be used in the command line to extract a full collection from Archive-It into a WACZ file:
# The program needs an Archive-It username, password, and collection-id to operate ...
wacz-preparator --extractor "archive-it" --username 'foo' --password 'bar' --collection-id 12345
# ... the password can / should be passed as an environment variable
wacz-preparator --extractor "archive-it" --username 'foo' --password $PASSWORD --collection-id 12345
# Unless specified otherwise with --output-path, wacz-preparator will work in the current directory
wacz-preparator --extractor "archive-it" --output-path "/path/to/directory" --username 'foo' --password $PASSWORD --collection-id 12345
# The resulting WACZ file can be signed using an authsign-compatible endpoint.
# See: https://specs.webrecorder.net/wacz-auth/0.1.0/#implementations
wacz-preparator --extractor "archive-it" --signing-url "https://example.com/sign" --username foo --password $PASSWORD --collection-id 12345
# Use --help to list the available options, and see what the defaults are.
wacz-preparator --help
Usage: wacz-preparator [options]
CLI and Javascript library for packaging a remote web archive collection into a single WACZ file.
More info: https://github.com/harvard-lil/wacz-preparator
Options:
-v, --version Display Library and CLI version.
-e, --extractor <string> Web Archiving platform to extract the collection from. (choices: "archive-it", default: "archive-it")
-u, --username <string> API username (required for Archive-it). (default: null)
-p, --password <string> API password (required for Archive-it). (default: null)
-i, --collection-id <string> Id of the collection to process (required for Archive-it). (default: null)
-o, --output-path <string> Path in which wacz-preparator will work. (default: pwd)
-c, --concurrency <number> Sets a limit for parallel requests to the Archive-It API. (default: 50)
--auto-clear <bool> Automatically delete the collection-specific folder that was created? (choices: "true", "false", default: "false")
--signing-url <string> Authsign-compatible endpoint for signing WACZ file.
--signing-token <string> Authentication token to --signing-url, if needed.
--log-level <string> Controls CLI verbosity. (choices: "silent", "trace", "debug", "info", "warn", "error", default: "info")
-h, --help Show options list.
JavaScript Library
wacz-preparator can also be used as a JavaScript library in a Node.js project.
Example: Using the Preparator.process() method
import { ArchiveItExtractor } from "@harvard-lil/wacz-preparator"

const collection = new ArchiveItExtractor({
  username: 'username',
  password: 'password',
  collectionId: 12345
})

if (await collection.process()) {
  // WACZ file is ready!
  // ...
}
The process() method runs through all the steps described in the "How does it work?" section.
It is also possible to go through each individual step manually and customize the behavior of wacz-preparator.
Development
Standard JS
This codebase uses the Standard JS coding style.
- npm run lint can be used to check formatting.
- npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
- Most IDEs can be configured to automatically check and enforce this coding style.
JSDoc
JSDoc is used for both documentation and loose type checking purposes on this project.
Testing
⚠️ In its current state, this experimental codebase doesn't come with an automated test suite.
Available CLI
# Runs linter
npm run lint
# Runs linter and attempts to automatically fix issues
npm run lint-autofix
# Step-by-step NPM publishing helper
npm run publish-util