
@harvard-lil/js-wacz

v0.1.4


js-wacz


JavaScript module and CLI tool for working with web archive data using the WACZ format specification, similar to Webrecorder's py-wacz.

It can be used to combine a set of .warc / .warc.gz files into a single .wacz file:

... programmatically (Node.js):

import { WACZ } from '@harvard-lil/js-wacz'

const archive = new WACZ({ 
  input: 'collection/*.warc.gz', 
  output: 'collection.wacz',
})

await archive.process() // "collection.wacz" is ready!

... or via the command line:

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz makes use of workers to process as many WARC files in parallel as the host machine can handle.




Install

js-wacz requires Node.js 18+.

npm can be used to install this package and make the js-wacz command accessible system-wide:

npm install -g @harvard-lil/js-wacz



CLI: create command

The create command helps combine one or multiple .warc or .warc.gz files into a single .wacz file.

js-wacz create -f "collection/*.warc.gz" -o "collection.wacz"

js-wacz accepts the following options and arguments for customizing how the WACZ file is assembled.

--file, -f

This is the only required argument, which indicates what file(s) should be processed and added to the resulting WACZ file.

The target can be a single file, or a glob pattern such as folder/*.warc.gz.

# Single file:
js-wacz create --file archive.warc
# Collection:
js-wacz create --file "collection/*.warc"

Note: When using globs, make sure to surround the path with quotation marks.

--output, -o

Specify where the resulting .wacz file should be created, and what its filename should be.

Defaults to archive.wacz in the current directory if not provided.

js-wacz create --file cool-beans.warc --output cool-beans.wacz

--pages, -p

Path to a folder containing pages.jsonl files (pages.jsonl, extraPages.jsonl ...).

If not provided, js-wacz attempts to detect pages in WARC records and builds its own pages.jsonl index.

# Assuming the following file exists: collection/pages/pages.jsonl
js-wacz create -f "collection/*.warc.gz" --pages collection/pages/
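
For reference, a pages.jsonl file is line-delimited JSON. The sketch below shows one way such a file could be generated, assuming the layout described in the WACZ specification (a header line followed by one entry per page). The helper name buildPagesJsonl and the sample page are illustrative and not part of js-wacz.

```javascript
// Hypothetical helper (not part of js-wacz): serializes page entries as JSON Lines.
function buildPagesJsonl (pages) {
  // Header line, assumed here to follow the WACZ specification for pages.jsonl.
  const header = { format: 'json-pages-1.0', id: 'pages', title: 'All Pages' }

  const entries = pages.map(({ url, ts, title }) => ({
    url,
    ts: new Date(ts).toISOString(), // Normalize to an ISO 8601 timestamp
    title
  }))

  // One JSON document per line, newline-terminated.
  return [header, ...entries].map((line) => JSON.stringify(line)).join('\n') + '\n'
}

const pagesJsonl = buildPagesJsonl([
  { url: 'https://lil.law.harvard.edu', ts: '2023-02-22T12:00:00.000Z', title: 'Example page' }
])

console.log(pagesJsonl)
```

The resulting string could be written to collection/pages/pages.jsonl before running the command above.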

--cdxj

Pass a directory of existing CDXJ files, rather than indexing from WARCs. Must be used in combination with --pages.

js-wacz create -f "collection/*.warc.gz" --pages collection/pages/ --cdxj collection/indexes/

--url

If provided, will be used as the mainPageUrl attribute for datapackage.json.

Must be a valid URL.

js-wacz create -f "collection/*.warc.gz" --url "https://lil.law.harvard.edu"
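
A value can be checked before it is passed to --url. This is a minimal sketch using Node.js' built-in URL class; isValidUrl is a hypothetical helper, and js-wacz's own validation may be stricter or looser.

```javascript
// Hypothetical pre-flight check (not part of js-wacz).
function isValidUrl (value) {
  try {
    // The URL constructor throws a TypeError when `value` is not an absolute URL.
    return Boolean(new URL(value))
  } catch {
    return false
  }
}

console.log(isValidUrl('https://lil.law.harvard.edu')) // true
console.log(isValidUrl('lil.law.harvard.edu')) // false: no scheme
```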

--ts

If provided, will be used as the mainPageDate attribute for datapackage.json.

Can be any value that can be parsed by JavaScript's Date() constructor.

js-wacz create -f "collection/*.warc.gz" --ts "2023-02-22T12:00:00.000Z"
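
Whether a given value can be parsed by the Date() constructor can be verified ahead of time. The sketch below is an illustration only; isParsableDate is a hypothetical helper and does not reflect how js-wacz validates --ts internally.

```javascript
// Hypothetical helper (not part of js-wacz).
function isParsableDate (value) {
  // An unparsable input yields an "Invalid Date", whose getTime() is NaN.
  return !Number.isNaN(new Date(value).getTime())
}

console.log(isParsableDate('2023-02-22T12:00:00.000Z')) // true
console.log(isParsableDate('not a date')) // false
```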

--title

If provided, will be used as the title attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --title "My collection."

--desc

If provided, will be used as the description attribute for datapackage.json.

js-wacz create -f "collection/*.warc.gz" --desc "My cool collection of web archives."

--signing-url

If provided, will be used as an API endpoint for applying a cryptographic signature to the resulting WACZ file.

This endpoint is expected to be authsign-compatible.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign"

--signing-token

Used together with --signing-url when the signing server requires authentication.

js-wacz create -f "collection/*.warc.gz" --signing-url "https://example.com/sign" --signing-token "FOO-BAR"

--log-level

Sets how verbose js-wacz's log output is.

  • Possible values are: silent, trace, debug, info, warn, error
  • Default is: info

js-wacz create -f "collection/*.warc.gz" --log-level trace
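
When the CLI is driven from a script, the options documented above can be assembled from a plain object. The following is a sketch under stated assumptions: buildCreateArgs and the FLAGS map are hypothetical names, not part of js-wacz, and only the flags listed in this section are covered.

```javascript
// Hypothetical helper: maps an options object to `js-wacz create` arguments.
// Flag names are taken from the options documented in this section.
const FLAGS = {
  file: '--file',
  output: '--output',
  pages: '--pages',
  cdxj: '--cdxj',
  url: '--url',
  ts: '--ts',
  title: '--title',
  desc: '--desc',
  signingUrl: '--signing-url',
  signingToken: '--signing-token',
  logLevel: '--log-level'
}

function buildCreateArgs (options) {
  // Constraints stated in the option descriptions.
  if (!options.file) throw new Error('--file is required')
  if (options.cdxj && !options.pages) throw new Error('--cdxj must be used with --pages')

  const args = ['create']
  for (const [key, flag] of Object.entries(FLAGS)) {
    if (options[key] !== undefined) args.push(flag, String(options[key]))
  }
  return args
}

const createArgs = buildCreateArgs({
  file: 'collection/*.warc.gz',
  output: 'collection.wacz',
  logLevel: 'trace'
})

console.log(createArgs.join(' '))
```

The array can then be handed to spawn('js-wacz', createArgs) from node:child_process. Because no shell is involved in that case, the glob pattern reaches js-wacz unexpanded without additional quoting.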



Programmatic use

js-wacz's CLI and underlying logic are decoupled, and it can therefore be consumed as a JavaScript module (currently only with Node.js).

Example: Creating a signed WACZ programmatically

import { WACZ } from '@harvard-lil/js-wacz'

try {
  const archive = new WACZ({
    input: 'collection/*.warc.gz', // Single file or glob pattern
    output: 'collection.wacz',
    signingUrl: 'https://example.com/sign', // authsign-compatible endpoint
    signingToken: 'FOO-BAR' // Only needed if the signing server requires authentication
  })

  await archive.process()

  // collection.wacz is ready
} catch (err) {
  // ...
}

Although a process() convenience method is made available, every step of said process can be run individually and the archive's state inspected / edited throughout.

Notable affordances

  • WACZ.addPage() allows for manually adding an entry to pages.jsonl.
  • WACZ.addFileToZip() allows for manually adding any additional data to the final WACZ file.
  • The datapackageExtras option allows for adding an arbitrary JSON-serializable object to datapackage.json under extras.



Feature parity with py-wacz

js-wacz aims for partial feature parity with Webrecorder's py-wacz.

This section lists notable differences in implementation that might affect interoperability.

Main differences in currently implemented features:

  • CLI: create --detect-pages: --detect-pages is implied in js-wacz unless --pages is provided.
  • CLI: create --file: this argument can be implied in py-wacz, but must always be passed explicitly in js-wacz.



Development

Standard JS

This codebase uses the Standard JS coding style.

  • npm run lint can be used to check formatting.
  • npm run lint-autofix can be used to check formatting and automatically edit files accordingly when possible.
  • Most IDEs can be configured to automatically check and enforce this coding style.

JSDoc

JSDoc is used for both documentation and loose type checking purposes on this project.

Testing

This project uses Node.js' built-in test runner.

npm run test

Test-specific environment variables

The following environment variables allow for testing features requiring access to a third-party server.

These are optional, and can be added to a local .env file which will be automatically interpreted by the test runner.

| Name | Description |
| --- | --- |
| TEST_SIGNING_URL | URL of an authsign-compatible endpoint for signing WACZ files. To run such an endpoint locally, use npm run dev-signer, which will overwrite .env and set this variable to http://localhost:5000/sign; see .services/signer. |
| TEST_SIGNING_TOKEN | If required by the server at TEST_SIGNING_URL, an authentication token. |
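
Since both variables are optional, code that depends on them needs to handle their absence. A minimal sketch of that pattern follows; getTestSigningConfig is a hypothetical helper and is not taken from the js-wacz test suite.

```javascript
// Hypothetical helper: returns signing settings, or null when no endpoint is configured.
function getTestSigningConfig (env = process.env) {
  const signingUrl = env.TEST_SIGNING_URL
  if (!signingUrl) return null // Signing-related tests could be skipped in this case

  return {
    signingUrl,
    signingToken: env.TEST_SIGNING_TOKEN ?? null // Token is only needed by some servers
  }
}

console.log(getTestSigningConfig({ TEST_SIGNING_URL: 'http://localhost:5000/sign' }))
console.log(getTestSigningConfig({}))
```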

Available CLI

# Runs test suite
npm run test

# Runs linter
npm run lint

# Runs linter and attempts to automatically fix issues
npm run lint-autofix

# Step-by-step NPM publishing helper
npm run publish-util

# Runs a local instance of wacz-signer for test purposes (see "Testing" section)
npm run dev-signer
