portadom

v1.0.4

Portadom

Single DOM manipulation interface across Browser API, JSDOM, Cheerio, Playwright.

If you write web scrapers, you will know that you have multiple ways of parsing and manipulating the HTML / DOM:

  • Download the HTML and feed it into JSDOM or Cheerio.
  • Use browser automation like Playwright, Puppeteer, or Selenium.
  • Or work right from inside the DevTools console, if you need to test something out.

When I'm writing scrapers, my approach is usually:

  1. Define the transformations in DevTools with vanilla JS.
  2. Check if the HTML data can be extracted statically, just from the HTML (no JS).
  3. If static HTML is enough, then migrate vanilla JS to JSDOM or Cheerio.
  4. If I need a JS runtime, migrate the vanilla JS to Playwright or another browser automation tool.

Migrating from one to another is error-prone, and it is easy to miss some features along the way.

Portadom takes care of this. Here's how you can move the same DOM manipulation logic from Cheerio to Playwright:

Before:

import { load as loadCheerio } from 'cheerio';
import { cheerioPortadom } from 'portadom';

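// Loading step changes
// `url` is the address the HTML was downloaded from (example value)
const url = 'https://example.com';
const html = `<div>
  <a href="#">Click Me!</a>
</div>`;
const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), url);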

// DOM manipulation remains the same
const btn = dom.findOne('a');
const btnText = await btn.text();
// btnText == "Click Me!"

After:

import { playwrightLocatorPortadom } from 'portadom';

// Loading step changes
const page = await somehowLoadPage();
const bodyLoc = page.locator('body');
const dom = playwrightLocatorPortadom(bodyLoc, page);

// DOM manipulation remains the same
const btn = dom.findOne('a');
const btnText = await btn.text();
// btnText == "Click Me!"
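
To illustrate why the manipulation code can stay unchanged, here is a minimal sketch of the adapter idea: one small async interface with two interchangeable backends. Everything in it (`ToyNode`, `MiniDom`, `staticDom`, `remoteDom`, `linkText`) is hypothetical and written for illustration only; it is not Portadom's actual implementation or API.

```typescript
// Hypothetical toy tree standing in for parsed HTML.
interface ToyNode {
  tag: string;
  text?: string;
  children?: ToyNode[];
}

// Hypothetical shared interface: every method is async so that both
// in-memory and out-of-process backends can implement it.
interface MiniDom {
  findOne(tag: string): Promise<MiniDom | null>;
  text(): Promise<string | null>;
}

// Pre-order search for the first descendant with the given tag.
function findByTag(node: ToyNode, tag: string): ToyNode | null {
  for (const child of node.children ?? []) {
    if (child.tag === tag) return child;
    const nested = findByTag(child, tag);
    if (nested) return nested;
  }
  return null;
}

// Backend 1: the data is already in memory (similar in spirit to static parsing).
function staticDom(node: ToyNode): MiniDom {
  return {
    findOne: async (tag) => {
      const found = findByTag(node, tag);
      return found ? staticDom(found) : null;
    },
    text: async () => node.text ?? null,
  };
}

// Backend 2: every call crosses an async boundary first (similar in spirit
// to talking to a browser), simulated here with an extra promise hop.
function remoteDom(node: ToyNode): MiniDom {
  return {
    findOne: async (tag) => {
      await Promise.resolve();
      const found = findByTag(node, tag);
      return found ? remoteDom(found) : null;
    },
    text: async () => {
      await Promise.resolve();
      return node.text ?? null;
    },
  };
}

// Extraction logic written once against the shared interface.
async function linkText(dom: MiniDom): Promise<string | null> {
  const link = await dom.findOne('a');
  return link ? link.text() : null;
}

const tree: ToyNode = {
  tag: 'div',
  children: [{ tag: 'a', text: 'Click Me!' }],
};

async function main(): Promise<void> {
  console.log(await linkText(staticDom(tree)));
  console.log(await linkText(remoteDom(tree)));
}

main();
```

Only the construction step differs between the two calls in `main`; `linkText` never needs to know which backend it received.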

Installation

npm install portadom

How to use

Minimal example

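import { load as loadCheerio } from 'cheerio';
import { cheerioPortadom } from 'portadom';

// `url` is the address the HTML was downloaded from (example value)
const url = 'https://example.com';
const html = `<div>
  <a href="#">Click Me!</a>
</div>`;
const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), url);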

const btn = dom.findOne('a');
const btnText = await btn.text();
// btnText == "Click Me!"

const btnProp = await btn.href();
// btnProp == "https://example.com#"
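
The `href()` result above is absolute even though the markup contains only `#`, which suggests that the `url` argument serves as the base for resolving relative links. Below is a minimal sketch of that kind of resolution using the standard `URL` class rather than Portadom itself. The helper name `resolveHref` is hypothetical, and Portadom's exact output formatting may differ from what `URL` produces.

```typescript
// Hypothetical helper: resolve an href against an optional base URL.
function resolveHref(href: string, baseUrl: string | null): string {
  // Without a base there is nothing to resolve against; keep the raw value
  if (baseUrl === null) return href;
  try {
    return new URL(href, baseUrl).href;
  } catch {
    // Invalid base or href: fall back to the raw value
    return href;
  }
}

console.log(resolveHref('#', 'https://example.com')); // https://example.com/#
console.log(resolveHref('/jobs/42', 'https://example.com/list?page=2')); // https://example.com/jobs/42
console.log(resolveHref('/jobs/42', null)); // /jobs/42
```

Note that the standard URL parser normalizes an empty path to `/`, so the first result contains a slash before the `#`.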

Full example

import { load as loadCheerio } from 'cheerio';
import { cheerioPortadom } from 'portadom';

// `html` and `pageUrl` come from your own fetching step
const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), pageUrl);
// ...
const rootEl = dom.root();
const url = await dom.url();

// Find and extract data
const entries = await rootEl.findMany('.list-row:not(.native-agent):not(.reach-list)')
  .mapAsyncSerial(async (el) => {
    const employerName = await el.findOne('.employer').text();
    const employerUrl = await el.findOne('.offer-company-logo-link').href();
    const employerLogoUrl = await el.findOne('.offer-company-logo-link img').src();

    const offerUrlEl = el.findOne('h2 a');
    const offerUrl = await offerUrlEl.href();
    const offerName = await offerUrlEl.text();
    // Offer ID is an "O" followed by two or more digits somewhere in the URL
    const offerId = offerUrl?.match(/O\d{2,}/)?.[0] ?? null;

    const location = await el.findOne('.job-location').text();

    const salaryText = await el.findOne('.label-group > a[data-dimension7="Salary label"]').text();

    // All labels except the salary label, with empty texts dropped
    const labels = await el.findMany('.label-group > a:not([data-dimension7="Salary label"])')
      .mapAsyncSerial((labelEl) => labelEl.text())
      .then((arr) => arr.filter(Boolean) as string[]);

    const footerInfoEl = el.findOne('.list-footer .info');
    const lastChangeRelativeTimeEl = footerInfoEl.findOne('strong');
    const lastChangeRelativeTime = await lastChangeRelativeTimeEl.text();
    // Remove the element so it's easier to get the text content
    await lastChangeRelativeTimeEl.remove();
    const lastChangeTypeText = await footerInfoEl.textAsLower();
    // 'pridané' is Slovak for 'added'
    const lastChangeType = lastChangeTypeText === 'pridané' ? 'added' : 'modified';

    return {
      listingUrl: url,
      employerName,
      employerUrl,
      employerLogoUrl,
      offerName,
      offerUrl,
      offerId,
      location,
      salaryText,
      labels,
      lastChangeRelativeTime,
      lastChangeType,
    };
  });
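
The name `mapAsyncSerial` suggests that the callback runs for one element at a time, with each call awaited before the next one starts. Here is a minimal sketch of that pattern as a standalone function. `mapSerial` and `demo` are hypothetical names, and this is an assumption about the behavior, not Portadom's implementation.

```typescript
// Hypothetical helper: async map that processes items strictly one by one.
async function mapSerial<T, R>(
  items: readonly T[],
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const item of items) {
    // Wait for the current callback to finish before starting the next
    results.push(await fn(item));
  }
  return results;
}

// Records the order in which callbacks start and finish.
async function demo(): Promise<{ results: number[]; log: string[] }> {
  const log: string[] = [];
  const results = await mapSerial([1, 2, 3], async (n) => {
    log.push(`start ${n}`);
    await Promise.resolve();
    log.push(`end ${n}`);
    return n * 2;
  });
  return { results, log };
}

demo().then(({ results, log }) => {
  console.log(results); // [ 2, 4, 6 ]
  console.log(log.join(', ')); // start 1, end 1, start 2, end 2, start 3, end 3
});
```

Serial processing is the safer default when a callback mutates the DOM, as the `remove()` call in the example above does, or when each call is a round trip to a browser. With `Promise.all` over a plain `map`, all callbacks would start before any of them finished.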

Loading

Here is how you can load DOM in different environments:

Browser

When working with the browser Document, the node is an Element.

import { browserPortadom } from 'portadom';

const dom = browserPortadom(document.body);
const btnNode = await dom.findOne('a').node;

// Or start from any other element
const startNode = document.querySelector('...');
const scopedDom = browserPortadom(startNode);
const scopedBtnNode = await scopedDom.findOne('a').node;

Cheerio

In Cheerio, the node is the Cheerio Element wrapper. See DOM traversal with Cheerio.

import { cheerioPortadom } from 'portadom';
import { load as loadCheerio } from 'cheerio';

const $ = loadCheerio(html);
const dom = cheerioPortadom($.root(), url);
const btnNode = await dom.findOne('a').node;

// Or start from any other selection
const startNode = $('a');
const scopedDom = cheerioPortadom(startNode, url);
const scopedBtnNode = await scopedDom.findOne('a').node;

// Set `null` if you don't have a URL for the HTML
const domWithoutUrl = cheerioPortadom($.root(), null);

Playwright (using Locators)

In Playwright, you can either work with the Locators or the ElementHandles.

When using Locators, the node is a Locator instance.

import { playwrightLocatorPortadom } from 'portadom';

const page = await somehowLoadPage();
const bodyLoc = page.locator('body');
const dom = playwrightLocatorPortadom(bodyLoc, page);
const btnNode = await dom.findOne('a').node;

Playwright (using Handles)

When using ElementHandles, the node is an ElementHandle instance.

NOTE: You can pass a Locator to playwrightHandlePortadom, but it will be converted to a JSHandle internally.

import { playwrightHandlePortadom } from 'portadom';

const page = await somehowLoadPage();

// Use `evaluateHandle` with page-side logic to query the target element
const bodyHandle = await page.evaluateHandle(() => document.body);
const elHandle = await page.evaluateHandle(() => document.querySelector('.myClass'));

// Or use other helpers such as `getByText` (returns a Locator)
const textLoc = page.getByText('hello');

// Or use locators
const bodyLoc = page.locator('body');

// Pass any one of the above as the starting node
const dom = playwrightHandlePortadom(bodyHandle, page);
const btnNode = await dom.findOne('a').node;

Chaining

For cross-compatibility, each method on a Portadom instance returns a Promise.

But this leads to then / await hell when you need to call multiple methods in a row:

const employerName = (await (await el.findOne('.employer'))?.text()) ?? null;

To get around that, the results are wrapped in a chainable instance. This applies to each method that returns a Portadom instance or an array of Portadom instances.

So instead, we can call:

const employerName = await el.findOne('.employer').text();

You don't have to chain the commands. Instead, you can access the underlying promise through the promise property. For example, this:

const mapPromises = await dom.findOne('ul')
  .parent()
  .findMany('li[data-id]')
  .map((li) => li.attr('data-id'));
const attrs = await Promise.all(mapPromises);

Is the same as:

const ul = await dom.findOne('ul').promise;
const parent = await ul?.parent().promise;
const idEls = await parent?.findMany('li[data-id]').promise;
const mapPromises = idEls?.map((li) => li.attr('data-id')) ?? [];
const attrs = await Promise.all(mapPromises);
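
One way such a chainable wrapper could work is sketched below: the wrapper holds a promise of a node and exposes the node's methods, deferring each call until the promise resolves. All names here (`AsyncNode`, `Chain`, `chain`, `toNode`) are hypothetical, and this is an illustration of the general pattern, not Portadom's actual implementation.

```typescript
// Hypothetical node type whose methods all return promises.
interface AsyncNode {
  child(name: string): Promise<AsyncNode | null>;
  text(): Promise<string | null>;
}

// Chainable wrapper around a promise of a node.
interface Chain {
  promise: Promise<AsyncNode | null>;
  child(name: string): Chain;
  text(): Promise<string | null>;
}

function chain(promise: Promise<AsyncNode | null>): Chain {
  return {
    promise,
    // Returns another wrapper right away; the lookup runs once `promise` resolves
    child: (name) => chain(promise.then((node) => (node ? node.child(name) : null))),
    // A missing node short-circuits to null instead of throwing
    text: () => promise.then((node) => (node ? node.text() : null)),
  };
}

// Toy data source used to build AsyncNode instances.
type Tree = { text: string; children: Record<string, Tree> };

function toNode(tree: Tree): AsyncNode {
  return {
    child: async (name) => {
      const sub = tree.children[name];
      return sub ? toNode(sub) : null;
    },
    text: async () => tree.text,
  };
}

const data: Tree = {
  text: 'root',
  children: {
    row: { text: 'row', children: { employer: { text: 'ACME', children: {} } } },
  },
};

async function main(): Promise<void> {
  const root = chain(Promise.resolve(toNode(data)));

  // Chained: a single await at the end
  const chained = await root.child('row').child('employer').text();

  // Step by step through the `promise` property
  const row = await root.child('row').promise;
  const employer = row ? await row.child('employer') : null;
  const stepwise = employer ? await employer.text() : null;

  console.log(chained, stepwise); // ACME ACME
  console.log(await root.child('missing').child('employer').text()); // null
}

main();
```

Because a missing node propagates as `null` through the chain, the caller needs a single null check at the end rather than one after every step.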

Reference

See the full documentation here.

Real-life examples