
@tryghost/mg-webscraper

v0.15.0


Migrate WebScraper

Scrapes metadata from post URLs during a migration. Fetches each post's original URL and extracts fields like meta_title, meta_description, og_image, etc. using CSS selectors. Results are cached to disk so re-runs don't hit the network.

Install

npm install @tryghost/mg-webscraper --save

or

pnpm add @tryghost/mg-webscraper

Usage

Constructor

import WebScraper from '@tryghost/mg-webscraper';

const webScraper = new WebScraper(fileCache, config, postProcessor, skipFn);

| Parameter     | Type     | Description                                                                      |
|---------------|----------|----------------------------------------------------------------------------------|
| fileCache     | object   | File cache instance (from mg-fs-utils) for caching scraped responses             |
| config        | object   | Scrape configuration with CSS selectors (see below)                              |
| postProcessor | function | Optional transform applied to scraped data before merging. Defaults to identity. |
| skipFn        | function | Optional function to skip specific posts in hydrate(). Not used by scrapePost(). |

Post processor

The postProcessor function receives the raw scraped data and can clean, rename, or filter fields before they're applied to the post. It's called by both hydrate and scrapePost.

const postProcessor = (scrapedData) => {
    // Use og:image as feature_image
    if (scrapedData.og_image) {
        scrapedData.feature_image = scrapedData.og_image;
        delete scrapedData.og_image;
    }

    // Strip trailing whitespace from titles
    if (scrapedData.meta_title) {
        scrapedData.meta_title = scrapedData.meta_title.trim();
    }

    return scrapedData;
};

const webScraper = new WebScraper(fileCache, config, postProcessor);

Skip function

The skipFn is called once per post in hydrate() with the post object ({url, data}). Return a truthy value to skip scraping that post. Useful for skipping posts that already have the metadata you need, or filtering by post type.

const skipFn = (post) => {
    // Skip pages, only scrape posts
    return post.data.type === 'page';
};

const webScraper = new WebScraper(fileCache, config, null, skipFn);

scrapePost does not use skipFn — when using forEachPost, control which posts are scraped via the filter option instead:

await context.forEachPost(async (post) => {
    await webScraper.scrapePost(post);
}, {filter: {tag: {slug: 'news'}}});

Scrape configuration

The config.posts object maps field names to scrape-it selectors:

const config = {
    posts: {
        meta_title: {
            selector: 'title'
        },
        meta_description: {
            selector: 'meta[name="description"]',
            attr: 'content'
        },
        og_image: {
            selector: 'meta[property="og:image"]',
            attr: 'content'
        },
        og_title: {
            selector: 'meta[property="og:title"]',
            attr: 'content'
        }
    }
};
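
scrape-it selector entries can also normalize a value as it is extracted via a convert function, which can replace simple cleanup work in the postProcessor. This is a hedged sketch; the field names are illustrative, so check the scrape-it documentation for your installed version:

```javascript
// Sketch: per-field normalization at scrape time using scrape-it's
// convert option. Field names here are examples, not required names.
const config = {
    posts: {
        meta_title: {
            selector: 'title',
            // Trim whitespace as the value is scraped
            convert: (value) => value.trim()
        },
        published_at: {
            selector: 'meta[property="article:published_time"]',
            attr: 'content',
            // Normalize to an ISO 8601 timestamp when a value is present
            convert: (value) => (value ? new Date(value).toISOString() : value)
        }
    }
};
```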

Per-post scraping with MigrateContext

Use scrapePost when working with mg-context's MigrateContext. It does two things:

  1. Sets fields automatically — scrapes the post's source URL using config.posts selectors, runs the postProcessor, and sets the resulting fields (e.g. meta_title, og_image) on the post via post.set(). Fields not in the post's schema are silently skipped.
  2. Stores the raw scraped response — saves the unprocessed scraped data on post.webscrapeData, persisted to the database as JSON. This is stored before post-processing, so later pipeline steps can access or reprocess the original data without re-scraping.
const webScraper = new WebScraper(fileCache, config, postProcessor);

await context.forEachPost(async (post) => {
    await webScraper.scrapePost(post, {wait_after_scrape: 100});
});

The post is duck-typed — it needs getSourceValue(key) to provide the URL, set(key, value) for writing fields, and a webscrapeData setter. No explicit save() is needed — forEachPost saves each post automatically after the callback returns.
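
Under those assumptions, a minimal stand-in post might look like the following. This is a hypothetical sketch for illustration; a real MigrateContext post carries much more behavior:

```javascript
// Minimal duck-typed post satisfying the three things scrapePost needs:
// getSourceValue(), set(), and a webscrapeData property.
function makeFakePost(url) {
    const source = {url};
    const fields = {};
    let rawScrape = null;

    return {
        // scrapePost reads the URL to scrape from here
        getSourceValue(key) {
            return source[key];
        },
        // scrapePost writes processed fields (meta_title, og_image, ...) here
        set(key, value) {
            fields[key] = value;
        },
        get(key) {
            return fields[key];
        },
        // scrapePost stores the raw, unprocessed response here
        set webscrapeData(data) {
            rawScrape = data;
        },
        get webscrapeData() {
            return rawScrape;
        }
    };
}

const post = makeFakePost('https://example.com/my-post');
post.set('meta_title', 'Hello');
post.webscrapeData = {og_image: 'https://example.com/img.png'};
```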

The raw response is available on the post in any later pipeline step:

await context.forEachPost(async (post) => {
    const scraped = post.webscrapeData;
    if (scraped?.og_image && !post.get('feature_image')) {
        post.set('feature_image', scraped.og_image);
    }
});

Legacy pipeline with hydrate

In the legacy pipeline, hydrate creates a Listr task array that scrapes all posts in ctx.result.posts:

const tasks = webScraper.hydrate(ctx);
const runner = makeTaskRunner(tasks, options);
await runner.run();

Each task calls scrapeUrl for one post, then merges the result into post.data via processScrapedData.
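
The merge step can be pictured roughly like this. It is a simplified sketch, not the package's actual processScrapedData implementation:

```javascript
// Simplified sketch: merge post-processed scrape results into the legacy
// pipeline's mutable post.data object, skipping empty values so scraping
// never blanks out data the post already has.
function mergeScrapedData(post, scrapedData, postProcessor = (d) => d) {
    const processed = postProcessor(scrapedData);

    for (const [key, value] of Object.entries(processed)) {
        if (value === null || value === undefined || value === '') {
            continue;
        }
        post.data[key] = value;
    }
    return post;
}

const post = {data: {title: 'Existing title'}};
mergeScrapedData(post, {meta_title: 'Scraped title', og_image: ''});
```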

How it works

  1. scrapeUrl(url, config, filename, wait) — checks the file cache for filename.json. If cached, returns it. Otherwise scrapes the URL, writes the response to the cache, strips empty fields, and optionally waits before returning.
  2. scrapePost(post, options) — gets the URL from post.getSourceValue('url'), calls scrapeUrl, stores the raw response on post.webscrapeData, runs the postProcessor, then sets each non-empty field via post.set().
  3. hydrate(ctx) — maps ctx.result.posts to Listr task objects, each calling scrapeUrl then processScrapedData to merge into the post's mutable data object.
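
The cache-first behavior of step 1 can be sketched as follows. This is illustrative only: the real implementation uses mg-fs-utils and scrape-it, and the fileCache method names here (hasFile, readFile, writeFile) and the doScrape helper are assumptions, not the package's API:

```javascript
// Illustrative cache-first scrape. A cache hit returns the stored raw
// response without touching the network; a miss scrapes, caches the raw
// response, strips empty fields, and optionally pauses before returning.
async function scrapeUrlSketch(url, config, filename, wait, fileCache, doScrape) {
    const cacheKey = `${filename}.json`;

    if (await fileCache.hasFile(cacheKey)) {
        return fileCache.readFile(cacheKey);
    }

    const response = await doScrape(url, config);
    await fileCache.writeFile(cacheKey, response);

    // Drop empty strings and null/undefined values
    const cleaned = Object.fromEntries(
        Object.entries(response).filter(([, v]) => v !== '' && v != null)
    );

    // Optional pause, to be polite to the origin server
    if (wait) {
        await new Promise((resolve) => setTimeout(resolve, wait));
    }
    return cleaned;
}
```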

Develop

This package lives in a monorepo, managed with Nx and pnpm workspaces.

Follow the instructions for the top-level repo.

  1. git clone this repo & cd into it as usual
  2. Run pnpm install to install top-level dependencies.

Test

  • pnpm lint: run just ESLint
  • pnpm test: run lint and tests
  • pnpm test:local: build and run tests (for single-package development)

Copyright & License

Copyright (c) 2013-2026 Ghost Foundation - Released under the MIT license.