
warcio v2.4.2

Streaming web archive (WARC) file support for modern browsers and Node.

Downloads: 5,766

Readme

warcio.js

Streaming web archive (WARC) file support for modern browsers and Node.

This package is an approximate TypeScript port of the Python warcio module.


Package Contents

  • dist/index.js - ESM module for the package; external dependencies not bundled.
  • dist/index.cjs - CJS module for the package; external dependencies not bundled.
  • dist/index.all.js - ESM module with all external dependencies included, ready to be imported directly in the browser, e.g. import { ... } from "https://cdn.jsdelivr.net/npm/warcio/dist/index.all.js"
  • dist/utils.js - ESM module for just the utils module; can be imported with import "warcio/utils" in Node or the browser.
  • dist/utils.cjs - CJS module for the utils module.
  • dist/cli.js - ESM module for the CLI script, installed as the warcio.js executable.
  • dist/cli.cjs - CJS module for the CLI script.

Browser Usage

Reading WARC Files

warcio.js is designed to read WARC files incrementally using async iterators.

Browser Streams API ReadableStream is also supported.

Gzip-compressed WARC records are automatically decompressed using the pako library, while gzip compression uses native Compression Streams where available.
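The native compression path mentioned above can be sketched with the Compression Streams API alone; this is a standalone illustration of that web platform feature, not warcio.js internals:

```javascript
// Sketch: gzip-compress bytes with the native Compression Streams API,
// available in modern browsers and Node 18+ as a global.
async function gzipBytes(bytes) {
  const stream = new Blob([bytes])
    .stream()
    .pipeThrough(new CompressionStream("gzip"));

  // collect the compressed chunks (each is a Uint8Array)
  const chunks = [];
  for await (const chunk of stream) {
    chunks.push(chunk);
  }

  // concatenate into a single Uint8Array
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

Where Compression Streams are unavailable, a userland library such as pako fills the same role for decompression.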

<script type="module">
  import { WARCParser } from "https://cdn.jsdelivr.net/npm/warcio/dist/index.all.js";


  async function readWARC(url) {
    const response = await fetch(url);

    const parser = new WARCParser(response.body);

    for await (const record of parser) {
      // ways to access warc data
      console.log(record.warcType);
      console.log(record.warcTargetURI);
      console.log(record.warcHeader("WARC-Target-URI"));
      console.log(record.warcHeaders.headers.get("WARC-Record-ID"));

      // iterate over WARC content one chunk at a time (as Uint8Array)
      for await (const chunk of record) {
        // process each chunk
      }

      // access content as text
      const text = await record.contentText();
    }
  }

  readWARC("https://example.com/path/to/mywarc.warc");
</script>

The WARCParser() constructor accepts any async iterator, or any object with a ReadableStream.getReader()-style read() method.

The shorthand for await (const record of WARCParser.iterRecords(reader)) can also be used when the parser object itself is not needed.
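For illustration, a minimal reader-style source over an in-memory array of chunks (a hypothetical helper, not part of the warcio API) has the read() shape described above:

```javascript
// Minimal reader-style source: exposes an async read() returning
// { value, done }, the ReadableStream.getReader()-style shape that
// WARCParser accepts as input.
function chunkReader(chunks) {
  let i = 0;
  return {
    async read() {
      return i < chunks.length
        ? { value: chunks[i++], done: false }
        : { value: undefined, done: true };
    },
  };
}
```

An object like this could be handed to new WARCParser(...) or WARCParser.iterRecords(...) in place of a fetch response body.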

Streaming WARCs in the Browser

A key feature of warcio.js is support for streaming WARC records from a server via a Service Worker.

For example, the following could be used to load a single WARC record (via a Range request), parse the HTTP headers, and return a streaming Response from a Service Worker.

The response continues reading from the upstream source.
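One detail worth calling out in the Range request: HTTP byte ranges are inclusive, so the end byte is offset + length - 1. As a small standalone helper (hypothetical, for illustration):

```javascript
// Build an inclusive HTTP Range header value for a WARC record
// located at a given byte offset with a given length.
function rangeHeader(offset, length) {
  // HTTP Range end positions are inclusive, hence the -1
  return `bytes=${offset}-${offset + length - 1}`;
}
```

For example, a record at offset 1189 with length 1365 yields bytes=1189-2553, so the next record begins at byte 2554.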

import { WARCParser } from "https://cdn.jsdelivr.net/npm/warcio/dist/index.all.js";

async function streamWARCRecord(url, offset, length) {
  const response = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });

  const parser = new WARCParser(response.body);

  // parse WARC record, which includes WARC headers and HTTP headers
  const record = await parser.parse();

  // get the response options for Response constructor
  const { status, statusText, headers } = record.getResponseInfo();

  // get a ReadableStream from the WARC record and return streaming response
  return new Response(record.getReadableStream(), {
    status,
    statusText,
    headers,
  });
}

Accessing WARC Content

warcio.js provides several ways to access WARC record content. When dealing with HTTP response records, the default behavior is to decode the transfer and content encodings, de-chunking and decompressing as necessary.

For example, the following accessors, as shown above, provide access to the decompressed/dechunked content.


  // iterate over each chunk (Uint8Array)
  for await (const chunk of record) {
    // process each decoded chunk
  }

  // iterate over lines
  for await (const line of record.iterLines()) {
    // process each line
  }

  // read one line
  const line = await record.readline();

  // read entire contents as Uint8Array
  const payload = await record.readFully(true);

  // read entire contents as a String (calls readFully)
  const text = await record.contentText();

Raw WARC Payload

The raw WARC content is also available using the following methods:


  // iterate over each raw chunk (not dechunked or decompressed)
  for await (const chunk of record.reader) {
    // process each raw chunk
  }

  const rawPayload = await record.readFully(false);

The readFully() method can read either the raw or decoded content. When using readFully(), the payload is stored in the record as record.payload so that it can be accessed again.

Note that decoded and raw access should not be mixed. Attempting to access raw data after beginning decoding will result in an exception:

// read decoded line
const line = await record.readline();

// this will throw an error: raw data is no longer available
const full = await record.readFully(false);

// this is ok
const fullDecoded = await record.readFully(true);

Node Usage

warcio.js can also be used in Node. Since the 1.6.0 release, warcio uses native ESM modules and requires Node 18.x. (Use warcio.js < 1.6.0 to support Node 12+.)

warcio.js uses a number of web platform features, including the Web Streams API, that are now supported natively in Node 18.x.

After installing the package (e.g. with npm add warcio), the above example could be run as follows:

import { WARCParser } from "warcio";
import fs from "fs";

async function readWARC(filename) {
  const nodeStream = fs.createReadStream(filename);

  const parser = new WARCParser(nodeStream);

  for await (const record of parser) {
    // ways to access warc data
    console.log(record.warcType);
    console.log(record.warcTargetURI);
    console.log(record.warcHeader("WARC-Target-URI"));
    console.log(record.warcHeaders.headers.get("WARC-Record-ID"));

    // iterate over WARC content one chunk at a time (as Uint8Array)
    for await (const chunk of record) {
      // process each chunk
    }

    // OR, access content as text
    const text = await record.contentText();
  }
}

To build the browser-packaged files in dist/, run npm run build.

To run tests, run npm run test.

CLI Indexing Tools

warcio.js also includes a command-line interface, installed as warcio.js (or run via node ./dist/cli.js).

index

The index command can be used in Node to index WARC files (similar to the Python warcio index):

warcio.js index <path-to-warc> --fields <comma,sep,fields>

The index command accepts an optional comma-separated list of fields, which can include offset, length, any WARC header, and any HTTP header prefixed with http:, e.g.:

warcio.js index ./test/data/example.warc --fields warc-type,warc-target-uri,http:content-type,offset,length
{"warc-type":"warcinfo","offset":0,"length":484}
{"warc-type":"warcinfo","offset":484,"length":705}
{"warc-type":"response","warc-target-uri":"http://example.com/","http:content-type":"text/html","offset":1189,"length":1365}
{"warc-type":"request","warc-target-uri":"http://example.com/","offset":2554,"length":800}
{"warc-type":"revisit","warc-target-uri":"http://example.com/","http:content-type":"text/html","offset":3354,"length":942}
{"warc-type":"request","warc-target-uri":"http://example.com/","offset":4296,"length":800}

cdx-index

It can also generate standard CDX(J) indexes in CDX, CDXJ, and line-delimited JSON formats, using standard CDX fields:

warcio.js cdx-index <path-to-warc> --format cdxj
warcio.js cdx-index ./test/data/example.warc
com,example)/ 20170306040206 {"url":"http://example.com/","mime":"text/html","status":200,"digest":"G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK","length":1365,"offset":1189,"filename":"example.warc"}
com,example)/ 20170306040348 {"url":"http://example.com/","mime":"warc/revisit","status":200,"digest":"G7HRM7BGOKSKMSXZAHMUQTTV53QOFSMK","length":942,"offset":3354,"filename":"example.warc"}
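The com,example)/ keys in this output are SURT-canonicalized URLs. A simplified sketch of the host-reversal part is below; note that real CDX tooling applies further canonicalization rules (lowercasing, port and www handling, etc.), and full URL canonicalization is not yet implemented in warcio.js itself:

```javascript
// Simplified SURT key sketch: reverse the dot-separated host components,
// join with commas, then append ")" and the URL path.
// e.g. http://example.com/ -> com,example)/
function surtKey(url) {
  const u = new URL(url);
  const host = u.hostname.split(".").reverse().join(",");
  return `${host})${u.pathname}`;
}
```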

Programmatic Usage

The indexers can also be used programmatically, both in the browser and in Node with a custom writer.

The indexers provide an async iterator which yields the index data as objects instead of writing it anywhere.

With 2.1.0, CDXAndRecordIndexer also provides access to each WARC record (and the corresponding request record, for response and revisit records) during iteration.

For example, the following snippet logs all HTML pages in a WARC:

<script type="module">
  import { CDXAndRecordIndexer } from "https://cdn.jsdelivr.net/npm/warcio/dist/index.all.js";

  async function indexWARC(url) {
    const response = await fetch(url);
    const indexer = new CDXAndRecordIndexer();

    const files = [{ reader: response.body, filename: url }];

    for await (const {cdx, record, reqRecord} of indexer.iterIndex(files)) {
      if (cdx.mime === "text/html") {
        const text = await record.contentText();
        console.log(`${cdx.url} is an HTML page of size: ${text.length}`);
      }
    }
  }

  indexWARC("https://example.com/path/to/mywarc.warc");
</script>

Writing WARC Files

WARC records can be created using the WARCRecord.create() static method and serialized using WARCSerializer.

When serializing, the WARC-Payload-Digest, WARC-Block-Digest and Content-Length headers are automatically computed to ensure correct values, overriding those provided in warcHeaders.
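The digest itself is simply a hash over the payload bytes. As a hedged sketch in Node (the sha-256/hex choice here is an assumption for illustration; the serializer's actual algorithm and encoding may differ):

```javascript
import { createHash } from "node:crypto";

// Sketch: compute a WARC-Payload-Digest-style value over payload bytes.
// The "algorithm:digest" label format follows the WARC header convention;
// sha-256 with hex encoding is assumed here for illustration only.
function payloadDigest(bytes) {
  return "sha-256:" + createHash("sha256").update(bytes).digest("hex");
}
```

The serializer performs an equivalent computation over the streamed payload, which is why any digest headers supplied in warcHeaders are overridden.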

Setting gzip: true in opts will serialize to GZIP-compressed records.

Calling WARCSerializer.serialize(record, opts) serializes the entire WARC record into a single buffer.

This is the simplest way to serialize WARC records and works well for storing smaller-sized data in WARC.

<script type="module">
  import {
    WARCRecord,
    WARCSerializer,
  } from "https://cdn.jsdelivr.net/npm/warcio/dist/index.all.js";

  async function main() {
    // First, create a warcinfo record
    const warcVersion = "WARC/1.1";

    const info = {
      software: "warcio.js in browser",
    };
    const filename = "sample.warc";

    const warcinfo = await WARCRecord.createWARCInfo(
      { filename, warcVersion },
      info
    );

    const serializedWARCInfo = await WARCSerializer.serialize(warcinfo);

    // Create a sample response
    const url = "http://example.com/";
    const date = "2000-01-01T00:00:00Z";
    const type = "response";
    const httpHeaders = {
      "Custom-Header": "somevalue",
      "Content-Type": 'text/plain; charset="UTF-8"',
    };

    async function* content() {
      // content should be a Uint8Array, so encode if emitting a string
      yield new TextEncoder().encode("sample content\n");
    }

    const record = await WARCRecord.create(
      { url, date, type, warcVersion, httpHeaders },
      content()
    );

    const serializedRecord = await WARCSerializer.serialize(record);

    console.log(new TextDecoder().decode(serializedWARCInfo));
    console.log(new TextDecoder().decode(serializedRecord));
  }

  main();
</script>
The same example in Node:

import { WARCRecord, WARCSerializer } from "warcio";

async function main() {
  // First, create a warcinfo record
  const warcVersion = "WARC/1.1";

  const info = {
    software: "warcio.js in node",
  };
  const filename = "sample.warc";

  const warcinfo = await WARCRecord.createWARCInfo(
    { filename, warcVersion },
    info
  );

  const serializedWARCInfo = await WARCSerializer.serialize(warcinfo);

  // Create a sample response
  const url = "http://example.com/";
  const date = "2000-01-01T00:00:00Z";
  const type = "response";
  const httpHeaders = {
    "Custom-Header": "somevalue",
    "Content-Type": 'text/plain; charset="UTF-8"',
  };

  async function* content() {
    // content should be a Uint8Array, so encode if emitting a string
    yield new TextEncoder().encode("sample content\n");
  }

  const record = await WARCRecord.create(
    { url, date, type, warcVersion, httpHeaders },
    content()
  );

  const serializedRecord = await WARCSerializer.serialize(record);

  console.log(new TextDecoder().decode(serializedWARCInfo));
  console.log(new TextDecoder().decode(serializedRecord));
}

main();

Writing Larger WARC Records

For larger WARC records, it is not ideal to buffer the entire WARC payload into memory.

Starting with 2.2.0, warcio.js supports streaming serialization with the help of an external buffer. To compute the digests, the data needs to be read twice: once to compute the digest, and again to be written to the WARC. To support this, warcio.js uses hash-wasm for incremental digest computation, along with an external buffer that can write the data and read it back later.

For the Node version, a WARCSerializer provided via warcio/node will automatically buffer responses >2MB to a temporary file on disk.

If using Node and expecting large WARC records, it is recommended to use import { WARCSerializer } from "warcio/node". Otherwise, import { WARCSerializer } from "warcio" is sufficient.

For browser-based usage, the payload is still buffered in memory (in chunks), but customized solutions can be implemented by extending the base serializer (src/lib/warcserializer.ts#132) and implementing custom write() and readAll() functions.

import { WARCRecord } from "warcio";
import { WARCSerializer } from "warcio/node";

async function main() {
  const url = "https://example.com/some/large/file";

  const resp = await fetch(url);

  const record = await WARCRecord.create({type: "response", url}, resp.body);

  const serializer = new WARCSerializer(record, {gzip: true});

  for await (const chunk of serializer) {
    // process WARC record chunks incrementally
    console.log(chunk);
  }
}

main();

Using standard Node fs functions, content fetched via fetch() can easily be streamed directly into WARC files:

import fs from "node:fs";
import { pipeline } from "node:stream/promises";
import { Readable } from "node:stream";

import { WARCRecord } from "warcio";
import { WARCSerializer } from "warcio/node";

async function fetchAndWrite(url, warcOutputStream) {
  const resp = await fetch(url);

  const record = await WARCRecord.create({type: "response", url}, resp.body);

  // set max data per WARC payload that can be buffered in memory to 16K
  // payloads larger than that are automatically buffered to a temporary file
  const serializer = new WARCSerializer(record, {gzip: true, maxMemSize: 16384});

  await pipeline(Readable.from(serializer), warcOutputStream, {end: false});
}

async function main() {
  const outputFile = fs.createWriteStream("test.warc.gz");

  await fetchAndWrite("https://example.com/some/large/file1.bin", outputFile);

  await fetchAndWrite("https://example.com/another/large/file2", outputFile);

  outputFile.close();
}

main();

Not Yet Implemented

This library is still new, and some functionality is not yet implemented compared to the Python warcio, including:

  • ~~Writing WARC files #2~~ Implemented!
  • ~~Chunked Payload Decoding #3~~ Implemented!
  • Brotli Payload Decoding #4
  • Reading ARC files #5
  • ~~Digest computation #6~~ Implemented!
  • URL canonicalization #7

They should eventually be added in future versions. See the referenced issues to track progress on each of these items.

Differences from node-warc

The node-warc package is designed for use in Node specifically.

node-warc also includes various capture utilities which are out of scope for warcio.js.

warcio.js is intended to run both in the browser and in Node, with an interface comparable to the Python warcio.

Wherever possible, an attempt is made to maintain compatibility. For example, the WARC record accessors record.warcType and record.warcTargetURI in warcio.js are compatible with those used in node-warc.