@archivator/archivable

v0.2.0

Archivable Data Transfer Object and NormalizedAsset Entities for archiving web pages and their assets

A Data Transfer Object (or Entity object) and related utilities to manipulate Web Page Metadata while archiving.

| Version | Size            | Dependencies                                      |
| ------- | --------------- | ------------------------------------------------- |
| npm     | npm bundle size | Libraries.io dependency status for latest release |

Usage

Archivable

import Archivable from '@archivator/archivable'

// HTML Source document URL from where the asset is embedded
// Ignore document origin if resource has full URL, protocol relative, non TLS
const sourceDocument =
  'http://example.org/@ausername/some-lengthy-string-ending-with-a-hash-1a2d8a61510'

const selector = '#main'
const truncate = '.ad,.sponsor'

/** @type {import('@archivator/archivable').ArchivableType} */
const dto = new Archivable(sourceDocument, selector, truncate)
// ... Do things with `dto`

When using CSV format

import Archivable from '@archivator/archivable'

// The following lines would be from a text file where we have one item per line
// Each item MUST have two semicolons
const lines = [
  'http://example.org/@ausername/some-lengthy-string-ending-with-a-hash-1a2d8a61510;#main;.ad,.sponsor',
]

for (const line of lines) {
  /** @type {import('@archivator/archivable').ArchivableType} */
  const dto = Archivable.fromLine(line)
  // ... Do things with `dto`
}
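
To make the line format concrete, here is a minimal sketch of how such a line could be split into its three fields. `parseArchivableLine` and `ArchivableFields` are hypothetical names used for illustration only; this is not the implementation of `Archivable.fromLine`, which may validate or trim differently.

```typescript
// Hypothetical illustration, NOT part of @archivator/archivable.
interface ArchivableFields {
  url: string
  selector: string
  truncate: string
}

function parseArchivableLine(line: string): ArchivableFields {
  // Two semicolons separate exactly three fields: URL, selector, truncate list.
  const parts = line.split(';').map((part) => part.trim())
  if (parts.length !== 3) {
    throw new Error(
      `Expected exactly two semicolons, found ${parts.length - 1}: ${line}`,
    )
  }
  const [url = '', selector = '', truncate = ''] = parts
  return { url, selector, truncate }
}

const fields = parseArchivableLine(
  'http://example.org/@ausername/some-lengthy-string-ending-with-a-hash-1a2d8a61510;#main;.ad,.sponsor',
)
console.log(fields.url)
console.log(fields.selector) // #main
console.log(fields.truncate) // .ad,.sponsor
```

Note that a plain split like this assumes the URL itself contains no semicolon.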

DocumentAssets and NormalizedAsset

While archiving a web page, we might have a list of all the assets the document references. They can be embedded in <img src="..."> tags and other similar constructs.

Each "NormalizedAsset" is an entity from which we can figure out where an asset can be downloaded in relation to the current source document URL, like web browsers do.

NormalizedAsset contains:

  • match: the initial value passed in, which is useful if we want to rewrite the source document
  • reference: the normalized hash for the asset; we could use that value to replace the reference in the source document's HTML with a local name
  • dest: where we would archive the asset; it is basically directoryNameNormalizer(sourceDocument) + reference
  • src: where we should attempt to download the asset from

import { DocumentAssets, NormalizedAsset } from '@archivator/archivable'

// HTML Source document URL from where the asset is embedded
// Notice the source might not be the same as where images are stored
const sourceDocument = 'http://www1.example.net/articles/1'

// Image tag src attribute value, e.g. `<img src="//www.example.org/a/b/c.png" />`
// Notice we used a protocol-relative URL
// (i.e. no protocol is specified, so the source document's protocol is used)
const assetUrl = '//www.example.org/a/b/c.png'

/**
 * `normalized` is an instance of `NormalizedAsset`, and should look like this
 *
 * ```json
 * {
 *   "dest": null,
 *   "match": "//www.example.org/a/b/c.png",
 *   "reference": null,
 *   "src": "http://www.example.org/a/b/c.png"
 * }
 * ```
 *
 * @type {import('@archivator/archivable').NormalizedAssetType}
 */
const normalized = new NormalizedAsset(sourceDocument, assetUrl)
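
The way "src" is derived from "match" can be reproduced with the standard WHATWG `URL` constructor, which resolves a reference against a base the way browsers do. This is only a sketch of the idea; `resolveAssetUrl` is a hypothetical helper, and the package may resolve URLs by other means.

```typescript
// Hypothetical helper for illustration; not the package's implementation.
function resolveAssetUrl(sourceDocument: string, match: string): string {
  // The second argument is the base URL; missing parts of `match`
  // (here, the protocol) are taken from it.
  return new URL(match, sourceDocument).href
}

const overHttpDocument = resolveAssetUrl(
  'http://www1.example.net/articles/1',
  '//www.example.org/a/b/c.png',
)
const overHttpsDocument = resolveAssetUrl(
  'https://www1.example.net/articles/1',
  '//www.example.org/a/b/c.png',
)

console.log(overHttpDocument) // http://www.example.org/a/b/c.png
console.log(overHttpsDocument) // https://www.example.org/a/b/c.png
```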

DocumentAssets

When we have more than one asset to download, we can use the DocumentAssets class to hold the list.

Because it implements the Iterable protocol, we can iterate over it as if it were an array of NormalizedAsset items.

import { DocumentAssets } from '@archivator/archivable'

// HTML Source document URL from where the asset is embedded
const sourceDocument = 'http://renoirboulanger.com/about/projects/'

// List of asset URLs you might find in that document
// e.g. `<img src="//www.example.org/a/b/c.png" />`
// Notice some URLs are relative, some are protocol-relative, and others point to another domain
const matches = [
  // Case 1: Protocol-relative URL (no protocol, otherwise fully qualified) on another domain
  '//www.example.org/a/b/c.png',
  // Case 2: Relative URL to the current source document
  '../../avatar.jpg',
  // Case 3: Fully qualified URL that is local to the site
  'http://renoirboulanger.com/wp-content/themes/twentyseventeen/assets/images/header.jpg',
  // Case 4: Fully qualified URL that is outside
  'https://s3.amazonaws.com/github/ribbons/forkme_right_gray_6d6d6d.png',
  // Case 5: Protocol-relative URL on another domain, with a query string
  '//www.gravatar.com/avatar/cbf8c9036c204fe85e15155f9d70faec?s=500',
  // Case 6: Relative URL to the domain name, starting at root
  '/wp-content/themes/renoirb/assets/img/zce_logo.jpg',
]

/**
 * Leverage the ECMAScript 2015+ iteration protocol.
 *
 * Pass a collection of strings, get a normalized list with iteration.
 *
 * @type {Iterable<import('@archivator/archivable').NormalizedAssetType>}
 */
const assets = new DocumentAssets(sourceDocument, matches)
for (const normalized of assets) {
  // `assets` is iterable, so we can loop over it like an array.
  // If we were in an asynchronous function, we'd be able to await each step.
  // In this example, we simply log each NormalizedAsset item.
  console.log(normalized)
}
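
For readers unfamiliar with the iteration protocol, the sketch below shows how a class becomes usable in `for...of` by implementing `[Symbol.iterator]` as a generator method. `AssetList` and `ResolvedAsset` are hypothetical stand-ins written for this illustration; the real DocumentAssets class may be implemented differently and produces richer NormalizedAsset items.

```typescript
// Hypothetical stand-in, NOT the package's DocumentAssets class.
interface ResolvedAsset {
  match: string
  src: string
}

class AssetList implements Iterable<ResolvedAsset> {
  private readonly sourceDocument: string
  private readonly matches: readonly string[]

  constructor(sourceDocument: string, matches: readonly string[]) {
    this.sourceDocument = sourceDocument
    this.matches = matches
  }

  // A generator method is enough to satisfy the iteration protocol.
  // Items are computed lazily, one per loop step.
  *[Symbol.iterator](): Generator<ResolvedAsset> {
    for (const match of this.matches) {
      yield { match, src: new URL(match, this.sourceDocument).href }
    }
  }
}

const list = new AssetList('http://renoirboulanger.com/about/projects/', [
  '//www.example.org/a/b/c.png',
  '../../avatar.jpg',
  '/wp-content/themes/renoirb/assets/img/zce_logo.jpg',
])

for (const asset of list) {
  console.log(`${asset.match} -> ${asset.src}`)
}

// Any iterable can also be materialized into a real array.
const sources = Array.from(list, (asset) => asset.src)
```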

Change reference hashing format

In the above example, the first item looks like this:

{
  "match": "//www.example.org/a/b/c.png",
  "src": "http://www.example.org/a/b/c.png",
  "dest": "renoirboulanger.com/about/projects/4c49ccbf4cdbdbcfc7f91cf87f6e9636008e4a97.png",
  "reference": "4c49ccbf4cdbdbcfc7f91cf87f6e9636008e4a97.png"
}

The asset file name "4c49ccbf4cdbdbcfc7f91cf87f6e9636008e4a97.png" is the SHA1 hash of "http://www.example.org/a/b/c.png", followed by the file extension.

Notice that the initial match was "//www.example.org/a/b/c.png" (the "match" attribute), but the "src" (where we will download the image from) uses http because that is the protocol of the "sourceDocument". If the protocol were https, the "src" (and therefore the hash) would be different.
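
The relationship between "src" and "reference" can be sketched with Node.js' crypto module. `referenceFor` is a hypothetical helper written for this illustration and the way it extracts the extension is an assumption; it is not the package's own code.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical helper: SHA1 of the resolved URL, plus the extension
// found in the URL's path (empty when the last path segment has none).
function referenceFor(src: string): string {
  const digest = createHash('sha1').update(src).digest('hex')
  const pathname = new URL(src).pathname
  const dot = pathname.lastIndexOf('.')
  const extension = dot > pathname.lastIndexOf('/') ? pathname.slice(dot) : ''
  return digest + extension
}

const overHttp = referenceFor('http://www.example.org/a/b/c.png')
const overHttps = referenceFor('https://www.example.org/a/b/c.png')

console.log(overHttp) // 40 hexadecimal characters followed by ".png"
console.log(overHttps) // a different 40-character hash followed by ".png"
console.log(overHttp === overHttps) // false
```

Because the hash is computed over the full resolved URL, the same image referenced from an http and an https document ends up under two different file names.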

If you'd prefer a shorter file name, or want to use a different hashing function, you can change the behavior with the DocumentAssets.setReferenceHandler(hasherFn, normalizerFn) method.

The arguments are:

hasherFn : Your own hashing function. See crypto.ts if you're OK with Node.js’ Crypto module.

normalizerFn : A function with signature (file: string) => string that returns the file extension to append. Refer to assetFileExtensionNormalizer in normalizer/asset.ts.

// ... Continuing from example above
import {
  HashingFunctionType,
  createHashFunction,
  NormalizedAssetFileExtensionExtractorType,
  // Used below; assumed to be exported by the package as well
  assetReferenceHandlerFactory,
} from '@archivator/archivable'

// One can set one's own hash function,
// as long as the function returned by createHashFunction has the type `(msg: string) => string`
const hashingHandler = createHashFunction('md5', 'hex')

/**
 * In the example below, in every case, the file extension would ALWAYS be ".foo".
 * We could eventually use the file's mime-type, or the source's response headers. #TODO
 */
const extensionHandler: NormalizedAssetFileExtensionExtractorType = (
  foo: string,
): string => `.foo`

// `assets` is the DocumentAssets instance created in the example above
assets.setReferenceHandler(
  assetReferenceHandlerFactory(hashingHandler, extensionHandler),
)

With the above configuration in place, the reference for the item "//www.example.org/a/b/c.png" would be the MD5 hash 6a324cd1a0e4e480c4db3e0558360527 followed by .foo

The first item would then look like this:

[
  {
    "dest": "renoirboulanger.com/about/projects/6a324cd1a0e4e480c4db3e0558360527.foo",
    "match": "//www.example.org/a/b/c.png",
    "reference": "6a324cd1a0e4e480c4db3e0558360527.foo",
    "src": "http://www.example.org/a/b/c.png"
  }
]
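
To show what the two handlers amount to without depending on the package, here is a minimal sketch built directly on Node.js' crypto module. `makeHasher`, `makeReferenceHandler`, `Hasher` and `ExtensionExtractor` are hypothetical names for this illustration; they are assumptions about the general shape, not the package's createHashFunction or assetReferenceHandlerFactory. The optional `length` parameter illustrates one way to get the shorter file names mentioned earlier.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical types mirroring the two roles described above.
type Hasher = (message: string) => string
type ExtensionExtractor = (file: string) => string

// Hypothetical factory: returns a `(msg: string) => string` hashing function.
// When `length` is given, the hexadecimal digest is truncated to that size.
function makeHasher(algorithm: string, length?: number): Hasher {
  return (message) => {
    const digest = createHash(algorithm).update(message).digest('hex')
    return length === undefined ? digest : digest.slice(0, length)
  }
}

// Hypothetical composition: reference = hash of the URL + file extension.
function makeReferenceHandler(
  hasher: Hasher,
  extensionOf: ExtensionExtractor,
): (src: string) => string {
  return (src) => hasher(src) + extensionOf(src)
}

const alwaysFoo: ExtensionExtractor = () => '.foo'
const md5Reference = makeReferenceHandler(makeHasher('md5'), alwaysFoo)
const shortReference = makeReferenceHandler(makeHasher('sha1', 12), alwaysFoo)

const url = 'http://www.example.org/a/b/c.png'
console.log(md5Reference(url)) // 32 hexadecimal characters followed by ".foo"
console.log(shortReference(url)) // 12 hexadecimal characters followed by ".foo"
```

Truncating a digest shortens file names at the cost of a higher collision risk, so the length is a trade-off to choose per archive.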