@harvard-lil/js-wacz
v0.1.4
JavaScript module and CLI tool for working with web archive data using the WACZ format specification.
js-wacz
JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.
It can be used to combine a set of .warc / .warc.gz files into a single .wacz file:
... programmatically (Node.js):
import { WACZ } from '@harvard-lil/js-wacz'
const archive = new WACZ({
  input: 'collection/*.warc.gz',
  output: 'collection.wacz',
})
await archive.process() // "collection.wacz" is ready!
... or via the command line:
js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"
js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.
Install
js-wacz requires Node.js 18+.
npm can be used to install this package and make the js-wacz command accessible system-wide:
npm install -g @harvard-lil/js-wacz
CLI: create command
The create command combines one or more .warc or .warc.gz files into a single .wacz file.
js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"
js-wacz accepts the following options and arguments for customizing how the WACZ file is assembled.
--file, -f
This is the only required argument, which indicates what file(s) should be processed and added to the resulting WACZ file.
The target can be a single file, or a glob pattern such as folder/*.warc.gz.
# Single file:
js-wacz create --file archive.warc
# Collection:
js-wacz create --file "collection/*.warc"
Note: When using globs, make sure to surround the path with quotation marks.
--output, -o
Specify where the resulting .wacz file should be created, and what its filename should be.
Defaults to archive.wacz in the current directory if not provided.
js-wacz create --file cool-beans.warc --output cool-beans.wacz
--pages, -p
Path to a folder containing pages.jsonl files (pages.jsonl, extraPages.jsonl, ...).
If not provided, js-wacz will attempt to detect pages in WARC records to build its own pages.jsonl index.
# Assuming the following file exists: collection/pages/pages.jsonl
js-wacz create -f "collection/*.warc.gz" --pages collection/pages/
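For orientation, the sketch below parses the text of a pages.jsonl file into page entries. It assumes the layout described in the WACZ specification (a header line carrying a format field, followed by one JSON object per page with at least url and ts). parsePagesJsonl is a hypothetical helper written for illustration; it is not part of js-wacz and may not match how js-wacz reads these files internally.

```javascript
// Hypothetical helper, not part of js-wacz: parses the text of a pages.jsonl
// file into an array of page entries.
// Assumption: per the WACZ specification, the first line is a header object
// (with a "format" field) and every following line describes one page.
function parsePagesJsonl (text) {
  const pages = []

  for (const line of text.split('\n')) {
    if (line.trim() === '') continue // skip blank lines

    const entry = JSON.parse(line)
    if ('format' in entry) continue // header line, not a page

    if (typeof entry.url !== 'string' || !('ts' in entry)) {
      throw new Error(`Page entry is missing "url" or "ts": ${line}`)
    }
    pages.push(entry)
  }

  return pages
}

const sample = [
  '{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}',
  '{"url": "https://lil.law.harvard.edu", "ts": "2023-02-22T12:00:00.000Z", "title": "Example"}'
].join('\n')

console.log(parsePagesJsonl(sample).length) // 1
```

A check like this can be handy before handing a folder to --pages, since a malformed line is easier to spot here than inside a finished WACZ file.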
--cdxj
Pass a directory of existing CDXJ files, rather than indexing from WARCs. Must be used in combination with --pages.
js-wacz create -f "collection/*.warc.gz" --pages collection/pages/ --cdxj collection/indexes/
--url
If provided, will be used as the mainPageUrl attribute for datapackage.json.
Must be a valid URL.
js-wacz create -f "collection/*.warc.gz" --url "https://lil.law.harvard.edu"
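As a rough way to check a value before passing it to --url, the sketch below relies on the URL constructor built into Node.js. isValidUrl is a hypothetical helper, and the exact validation js-wacz applies may differ.

```javascript
// Hypothetical helper, not part of js-wacz: returns true if `value` can be
// parsed by the WHATWG URL constructor built into Node.js.
function isValidUrl (value) {
  try {
    return Boolean(new URL(value)) // the constructor throws on invalid input
  } catch {
    return false
  }
}

console.log(isValidUrl('https://lil.law.harvard.edu')) // true
console.log(isValidUrl('lil.law.harvard.edu')) // false (no scheme)
```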
--ts
If provided, will be used as the mainPageDate attribute for datapackage.json.
Can be any value that can be parsed by JavaScript's Date() constructor.
js-wacz create -f "collection/*.warc.gz" --ts "2023-02-22T12:00:00.000Z"
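To see whether a given value is something Date() can parse, a quick check along these lines can help. parseTimestamp is a hypothetical helper for illustration only; how js-wacz normalizes the value before writing mainPageDate is not shown here.

```javascript
// Hypothetical helper, not part of js-wacz: parses `value` with JavaScript's
// Date() constructor and returns it as an ISO 8601 string.
function parseTimestamp (value) {
  const date = new Date(value)

  // An unparseable value yields an "Invalid Date", whose time value is NaN.
  if (Number.isNaN(date.getTime())) {
    throw new Error(`Cannot parse "${value}" as a date.`)
  }

  return date.toISOString()
}

console.log(parseTimestamp('2023-02-22T12:00:00.000Z')) // 2023-02-22T12:00:00.000Z
console.log(parseTimestamp('2023-02-22')) // 2023-02-22T00:00:00.000Z
```

Date-only ISO strings such as 2023-02-22 are interpreted as UTC, whereas other formats may be interpreted in the local time zone, so a full ISO 8601 timestamp is the least ambiguous choice.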
--title
If provided, will be used as the title attribute for datapackage.json.
js-wacz create -f "collection/*.warc.gz" --title "My collection."
--desc
If provided, will be used as the description attribute for datapackage.json.
js-wacz create -f "collection/*.warc.gz" --desc "My cool collection of web archives."
--signing-url
If provided, will be used as an API endpoint for applying a cryptographic signature to the resulting WACZ file.
This endpoint is expected to be authsign-compatible.
js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign"
--signing-token
Used together with --signing-url when the signing server requires authentication.
js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign" --signing-token "FOO-BAR"
--log-level
Can be used to determine how verbose js-wacz needs to be.
- Possible values are: silent, trace, debug, info, warn, error
- Default is: info
js-wacz create -f "collection/*.warc.gz" --log-level trace
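When js-wacz is driven from a script rather than typed by hand, the options above can be assembled programmatically. The sketch below builds the argument list for js-wacz create using only the flags documented in this section, and enforces the two constraints stated above (--file is required, --cdxj needs --pages). buildCreateArgs and the option key names are hypothetical and not part of js-wacz.

```javascript
// Hypothetical helper, not part of js-wacz: turns an options object into the
// argument list for `js-wacz create`, using only the flags documented above.
const FLAGS = {
  file: '--file',
  output: '--output',
  pages: '--pages',
  cdxj: '--cdxj',
  url: '--url',
  ts: '--ts',
  title: '--title',
  desc: '--desc',
  signingUrl: '--signing-url',
  signingToken: '--signing-token',
  logLevel: '--log-level'
}

function buildCreateArgs (options) {
  if (!options.file) {
    throw new Error('--file is the only required argument.')
  }
  if (options.cdxj && !options.pages) {
    throw new Error('--cdxj must be used in combination with --pages.')
  }

  const args = ['create']
  for (const [key, flag] of Object.entries(FLAGS)) {
    if (options[key] !== undefined) {
      args.push(flag, String(options[key]))
    }
  }
  return args
}

const args = buildCreateArgs({
  file: 'collection/*.warc.gz',
  output: 'collection.wacz',
  logLevel: 'trace'
})

console.log(args.join(' '))
// create --file collection/*.warc.gz --output collection.wacz --log-level trace
```

Assuming js-wacz is installed and on the PATH, the resulting array could be handed to Node's child_process.execFile('js-wacz', args). Because execFile does not go through a shell by default, the glob pattern reaches js-wacz unexpanded, which is the same effect the quotation marks have on the command line.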
Programmatic use
js-wacz's CLI and underlying logic are decoupled, and it can therefore be consumed as a JavaScript module (currently only with Node.js).
Example: Creating a signed WACZ programmatically
import { WACZ } from '@harvard-lil/js-wacz'
try {
  const archive = new WACZ({
    input: 'collection/*.warc.gz',
    output: 'collection.wacz',
    signingUrl: 'https://example.com/sign',
    signingToken: 'FOO-BAR',
  })

  await archive.process()
  // collection.wacz is ready
} catch (err) {
  // ...
}
Although a process() convenience method is made available, every step of said process can be run individually and the archive's state inspected / edited throughout.
Notable affordances
- WACZ.addPage() allows for manually adding an entry to pages.jsonl.
- WACZ.addFileToZip() allows for manually adding any additional data to the final WACZ file.
- The datapackageExtras option allows for adding an arbitrary JSON-serializable object to datapackage.json under extras.
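As an illustration of the datapackageExtras option, the sketch below prepares a constructor options object and checks up front that the extras can be serialized to JSON. withExtras and the sample extras fields (context, capturedBy) are hypothetical; only the input, output and datapackageExtras option names come from this README.

```javascript
// Hypothetical helper, not part of js-wacz: returns a copy of `options` with
// `extras` attached as datapackageExtras, after checking that the extras can
// be serialized to JSON.
function withExtras (options, extras) {
  let serialized
  try {
    serialized = JSON.stringify(extras) // throws on circular references
  } catch (err) {
    throw new Error(`datapackageExtras is not JSON-serializable: ${err.message}`)
  }

  // JSON.stringify returns undefined for values such as functions.
  if (typeof serialized !== 'string') {
    throw new Error('datapackageExtras is not JSON-serializable.')
  }

  return { ...options, datapackageExtras: JSON.parse(serialized) }
}

const options = withExtras(
  { input: 'collection/*.warc.gz', output: 'collection.wacz' },
  { context: 'Captured for a demo', capturedBy: 'example-crawler' }
)

console.log(Object.keys(options).join(', ')) // input, output, datapackageExtras

// Intended use (requires the package to be installed):
// import { WACZ } from '@harvard-lil/js-wacz'
// const archive = new WACZ(options)
// await archive.process()
```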
Feature parity with py-wacz
js-wacz aims for partial feature parity with Webrecorder's py-wacz.
This section lists notable differences in implementation that might affect interoperability.
Main differences in currently implemented features:
- CLI: create --detect-pages: --detect-pages is implied in js-wacz unless --pages is provided.
- CLI: create --file: that argument can be implied in py-wacz, but it is always explicit in js-wacz.
Development
Standard JS
This codebase uses the Standard JS coding style.
- npm run lint can be used to check formatting.
- npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
- Most IDEs can be configured to automatically check and enforce this coding style.
JSDoc
JSDoc is used for both documentation and loose type checking purposes on this project.
Testing
This project uses Node.js' built-in test runner.
npm run test
Test-specific environment variables
The following environment variables allow for testing features requiring access to a third-party server.
These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.
| Name | Description |
| --- | --- |
| TEST_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_SIGNING_TOKEN | If required by the server at TEST_SIGNING_URL, an authentication token. |
Available CLI
# Runs test suite
npm run test
# Runs linter
npm run lint
# Runs linter and attempts to automatically fix issues
npm run lint-autofix
# Step-by-step NPM publishing helper
npm run publish-util
# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer