unprint

v0.14.3

Simplify common web scraping tasks while staying in control of the data.

unprint

unprint is a web scraping utility built around JSDOM, providing convenience methods for quickly extracting common data types.

Install

npm install unprint

Usage

const unprint = require('unprint');

Global options

unprint.options({
	headers: {
		'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
	},
	limits: { // request throttling
		default: {
			concurrency: 10,
			interval: 10, // ms
		},
		[hostname]: {
			enable: true, // enabled by default
			concurrency: 1,
			interval: 1000,
		},
	},
});
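
To illustrate how the per-hostname limits could relate to the defaults, here is a minimal sketch. It assumes that host-specific settings override `default` and that `enable: false` switches throttling off for that host. The `resolveLimits` helper is hypothetical and not part of unprint; the library's own merging logic may differ.

```javascript
// Hypothetical helper, not part of unprint: picks the throttling settings
// that would apply to a URL, assuming host entries override the defaults.
function resolveLimits(limits, url) {
	const { hostname } = new URL(url);
	const hostLimits = limits[hostname];

	if (hostLimits && hostLimits.enable === false) {
		return null; // assumption: throttling is switched off for this host
	}

	return { ...limits.default, ...hostLimits };
}

const limits = {
	default: { concurrency: 10, interval: 10 },
	'www.example.com': { enable: true, concurrency: 1, interval: 1000 },
};

console.log(resolveLimits(limits, 'https://www.example.com/page')); // host-specific limits
console.log(resolveLimits(limits, 'https://other.example.org/')); // falls back to the defaults
```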

Querying

For flexibility, unprint query methods can be used with or without initialization. If you already have a DOM element, from another library or another unprint instance, you can query it with the uninitialized methods exposed directly on the library by passing the element as the first argument:

unprint.query.element(element, 'h1#title') // HTMLHeadingElement

Both unprint.get() and unprint.init() return query methods that are pre-initialized: the element argument is dropped in favor of the element that was retrieved or received. Initialized query methods therefore do not accept a custom element and usually expect the selector as the first argument instead.

// Fetch and parse a page with unprint
const result = await unprint.get('http://localhost:3101/html');
const { query } = result.context;

query.element('h1#title'); // HTMLHeadingElement

// Alternatively, initialize unprint with HTML fetched by other means
const res = await fetch('http://localhost:3101/html');
const body = await res.text();
const { query } = await unprint.init(body);

query.element('h1#title'); // HTMLHeadingElement

From here on, the query methods will be described in their initialized form. The API for the uninitialized methods is identical, except for the element passed as the first argument.

Selector

The selector can be a CSS selector, an XPath selector starting with /, or an array of either or both acting as fallbacks. If the selector is falsy, the input element will be used.
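
The fallback behaviour described above can be sketched as follows. This is an illustration only: `resolveSelector` and the stub `finders` are hypothetical names, and the real library performs actual DOM and XPath lookups rather than the string stubs used here.

```javascript
// Hypothetical sketch, not unprint's implementation: try each selector in
// order and return the first match, treating selectors starting with "/" as XPath.
function resolveSelector(selector, finders, inputElement) {
	if (!selector) {
		return inputElement; // falsy selector: use the input element itself
	}

	const candidates = Array.isArray(selector) ? selector : [selector];

	for (const candidate of candidates) {
		const find = candidate.startsWith('/') ? finders.xpath : finders.css;
		const match = find(candidate);

		if (match) {
			return match;
		}
	}

	return null;
}

// Stub finders standing in for real DOM lookups.
const finders = {
	css: (sel) => (sel === 'h1#title' ? 'css-match' : null),
	xpath: (sel) => (sel === '//h1[@id="title"]' ? 'xpath-match' : null),
};

console.log(resolveSelector(['h2.missing', '//h1[@id="title"]'], finders, 'root')); // xpath-match
console.log(resolveSelector(null, finders, 'root')); // root
```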

Querying multiple elements

Most methods can be used in plural, returning an array of results, e.g. query.elements(), query.dates().

Query an element

  • query.element([selector], [options])

Returns the element node directly.

Query an attribute

query.attribute(selector, attribute, [options]) or query.attr()

Return the contents of an attribute. Alias for query.element([selector], { attribute: [attribute] }).

Query existence

query.exists(selector, [options])

Return the presence of an element as a boolean.

Query count

query.count(selector, [options])

Return the number of elements that match the selector.

Query the content

query.content([selector], [options])

Return the text contents of an element (.textContent).

Query a number

query.number([selector], [options])

Options

  • match: The regular expression to use to extract a number from text, default /\d+(\.\d+)?/ for decimal numbers.
  • matchIndex: The index of the match result, useful for expressions containing groups or a global flag, default 0.
  • separator: The decimal separator, either . (as used in the US) or , (as used in much of Europe), default .

Return the contents of the element or attribute as a Number primitive.
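
As a rough illustration of how match, matchIndex and separator could interact, consider the sketch below. `extractNumber` is a hypothetical stand-in that works on plain strings, and it assumes the caller supplies a matching expression when using a comma separator; unprint's own handling may differ.

```javascript
// Hypothetical sketch of the options above; not unprint's actual code.
function extractNumber(text, options = {}) {
	const { match = /\d+(\.\d+)?/, matchIndex = 0, separator = '.' } = options;
	const result = text.match(match);

	if (!result || result[matchIndex] === undefined) {
		return null;
	}

	// Normalize a comma decimal separator so Number() can parse it.
	const raw = separator === ',' ? result[matchIndex].replace(',', '.') : result[matchIndex];

	return Number(raw);
}

console.log(extractNumber('Rating: 4.5 stars')); // 4.5
console.log(extractNumber('Prix: 12,75 EUR', { match: /\d+(,\d+)?/, separator: ',' })); // 12.75
console.log(extractNumber('Page 3 of 12', { match: /of (\d+)/, matchIndex: 1 })); // 12
```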

Query the HTML

query.html([selector], [options])

Return the HTML contents of an element (.innerHTML).

Query the text

query.text([selector], [options])

Return the text contents of an element, skipping non-text children, as opposed to querying content.

Options

  • join: Join text nodes into one string
  • trim: Remove excess whitespace
  • filter: Remove empty text nodes
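
The effect of these three options can be sketched on a plain array of strings standing in for an element's text nodes. `collectText` is hypothetical, and the defaults shown (all enabled) and the single-space join are assumptions made for the example, not documented behaviour.

```javascript
// Hypothetical sketch: childTexts stands in for the text nodes of an element.
function collectText(childTexts, options = {}) {
	// Assumption: all three options default to true.
	const { join = true, trim = true, filter = true } = options;

	let texts = childTexts;

	if (trim) {
		// Collapse runs of whitespace and strip the ends.
		texts = texts.map((text) => text.replace(/\s+/g, ' ').trim());
	}

	if (filter) {
		// Drop text nodes that ended up empty.
		texts = texts.filter((text) => text.length > 0);
	}

	return join ? texts.join(' ') : texts;
}

console.log(collectText(['  Hello ', '\n', ' world\n'])); // Hello world
console.log(collectText(['  Hello ', '\n', ' world\n'], { join: false })); // separate strings
```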

Query a URL

query.url([selector], [options])

Options

  • origin: The hostname to prefix when it is not included in the URL (/path).
  • protocol: The protocol to use when it is not included in the URL (//www.example.com), default http.

Returns the href from an anchor element (or any other specified target) as a string.
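
A minimal sketch of what origin and protocol are for, using a hypothetical `normalizeUrl` helper on plain strings. It assumes origin is passed including its scheme; how unprint itself expects the value to be formatted is not confirmed here.

```javascript
// Hypothetical sketch of the origin and protocol options; not unprint's code.
function normalizeUrl(href, options = {}) {
	const { origin, protocol = 'http' } = options;

	if (href.startsWith('//')) {
		return `${protocol}:${href}`; // protocol-relative URL
	}

	if (href.startsWith('/') && origin) {
		return `${origin}${href}`; // root-relative path
	}

	return href; // already absolute, or nothing to prefix with
}

console.log(normalizeUrl('/path', { origin: 'https://www.example.com' })); // https://www.example.com/path
console.log(normalizeUrl('//www.example.com/image.jpg')); // http://www.example.com/image.jpg
```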

Query an image

query.image([selector], [options]) or query.img()

Options:

  • All options supported by query.url().

Returns the src from an image element (or any other specified target) as a string.

Query a dataset

query.dataset(selector, property, [options]) or query.data()

Return the contents of a data- attribute.

Query a source set

query.sourceSet([selector], [options]) or query.srcSet()

Options:

  • includeDescriptor: Produce an array of { descriptor, url } instead of URL strings.
  • All options supported by query.url().

Returns an array of media URLs (as strings) from the srcset of a media element, sorted by descriptor from large to small.
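
The large-to-small ordering can be illustrated with a simplified parser. `parseSourceSet` is a hypothetical helper, it treats a missing descriptor as 1x, and it does not handle URLs that themselves contain commas.

```javascript
// Hypothetical sketch of srcset parsing; not unprint's implementation.
function parseSourceSet(srcset, options = {}) {
	const sources = srcset
		.split(',')
		.map((entry) => entry.trim())
		.filter(Boolean)
		.map((entry) => {
			const [url, descriptor = '1x'] = entry.split(/\s+/);
			return { descriptor, url };
		})
		// parseFloat('1280w') is 1280 and parseFloat('2x') is 2
		.sort((a, b) => parseFloat(b.descriptor) - parseFloat(a.descriptor));

	return options.includeDescriptor ? sources : sources.map((source) => source.url);
}

const srcset = 'small.jpg 320w, large.jpg 1280w, medium.jpg 640w';

console.log(parseSourceSet(srcset)); // large.jpg first, small.jpg last
console.log(parseSourceSet(srcset, { includeDescriptor: true }));
```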

Query a video

query.video([selector], [options])

Options:

  • All options supported by query.url().

Returns the src from a video source element (or any other specified target) as a string.

Query a date

query.date(selector, format, [options])

Arguments

  • format (string, array): The input format as a string or array of strings described by the Moment.js docs.

Options

  • match (RegExp): The text to extract before attempting to parse it as a date. The default expression will attempt to extract any of 01-01-1970, 1970-01-01, 01/01/1970 or January 1, 1970 with optional 00:00[:00] time.
  • matchIndex: The index of the match result, useful for expressions containing groups or a global flag, default 0.
  • timezone (string): The name of the input timezone, defaults to 'UTC'.

Returns a Date object.

Query a duration

query.duration(selector, format, [options]) or query.dur()

Options

  • match (RegExp): The text to extract before attempting to parse it as a duration. The default expression will attempt to extract (hh:)mm:ss and PT##H##M##S.

Returns the duration in seconds as a number.
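
To make the two default formats concrete, here is a sketch that converts (hh:)mm:ss and PT##H##M##S strings to seconds. `parseDuration` and its regular expressions are written for this example and are not taken from unprint.

```javascript
// Hypothetical sketch; not unprint's implementation.
function toSeconds(parts) {
	const [hours, minutes, seconds] = parts.map((part) => Number(part || 0));
	return (hours * 3600) + (minutes * 60) + seconds;
}

function parseDuration(text) {
	// ISO 8601 style: PT1H30M15S, PT45M, PT30S
	const iso = text.match(/PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?/);

	if (iso && iso[0].length > 2) { // ignore a bare "PT" without components
		return toSeconds(iso.slice(1));
	}

	// Clock style: 1:02:03 or 12:34
	const clock = text.match(/(?:(\d+):)?(\d{1,2}):(\d{2})/);

	if (clock) {
		return toSeconds(clock.slice(1));
	}

	return null;
}

console.log(parseDuration('PT1H30M15S')); // 5415
console.log(parseDuration('Runtime 1:02:03')); // 3723
console.log(parseDuration('12:34')); // 754
```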

Query JSON

query.json([selector], [options])

Returns the parsed JSON content of an element as an object.

Query style

query.style([selector], [options])

Options

  • styleAttribute: the CSS style attribute to extract, returns an object with all properties by default.
  • attemptBugfix: Attempts to fix or bypass JSDOM quirks, in particular those affecting style attributes containing url(), at the risk of losing some surrounding definitions (e.g. url() 0 0 no-repeat; may become url()). Try disabling this property if you require those definitions; it may break the attribute entirely, though.

Returns the CSS style attributes of an element as an object.
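
The shape of the returned object can be sketched with a naive parser for a style attribute string. `parseStyle` is hypothetical, it splits on semicolons and the first colon only, and it would mishandle values that contain semicolons (such as some data URIs); unprint relies on JSDOM instead.

```javascript
// Hypothetical sketch; not unprint's implementation.
function parseStyle(styleText, options = {}) {
	const style = {};

	for (const declaration of styleText.split(';')) {
		const index = declaration.indexOf(':');

		if (index === -1) {
			continue; // skip empty or malformed declarations
		}

		const property = declaration.slice(0, index).trim();
		const value = declaration.slice(index + 1).trim();

		if (property) {
			style[property] = value;
		}
	}

	// With styleAttribute set, return only that property's value.
	return options.styleAttribute ? style[options.styleAttribute] : style;
}

const styleText = 'color: red; background: url(http://example.com/bg.png) 0 0 no-repeat;';

console.log(parseStyle(styleText));
console.log(parseStyle(styleText, { styleAttribute: 'color' })); // red
```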

HTTP request

  • unprint.get(url, [options])
  • unprint.post(url, body, [options])

Options

  • select: Pre-query and initialize a specific element on the page
  • selectAll: Pre-query and initialize multiple specific elements on the page

Returns

{
	context: {				// using select or no option
		query,				// (object)		unprint querying methods
	},
	context: [{				// using selectAll
		query,
	}],
	html,				// (string)		HTML body
	data,				// (object)		parsed JSON response
	status,				// (number)		HTTP status code
	ok,					// (boolean)	status code >= 200 and < 300
	response,			// (object)		the original axios response object, alias 'res'
	res,				// (object)		alias for 'response'
}

Proxy

unprint.options({ // or unprint.options();
	proxy: {
		enable: true,
		use: false, // don't use for all requests by default
		host: '127.0.0.1',
		port: 8888,
		hostnames: [
			'www.google.com',
			'www.example.com',
		],
	}
});

unprint.get('https://www.example.com', {
	proxy: {
		use: true, // use proxy for this request
		// all other proxy options can be supplied here
	},
});
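
One way to read the interplay of enable, use and hostnames is sketched below. This is an interpretation rather than confirmed behaviour: `shouldUseProxy` is a hypothetical helper that assumes per-request options override the global ones, and that a listed hostname is proxied even when use is false.

```javascript
// Hypothetical sketch of the decision logic; unprint's real rules may differ.
function shouldUseProxy(url, globalProxy = {}, requestProxy = {}) {
	const proxy = { ...globalProxy, ...requestProxy };

	if (!proxy.enable) {
		return false; // proxying is switched off entirely
	}

	if (proxy.use) {
		return true; // proxy every request covered by these options
	}

	// Otherwise only proxy the listed hostnames.
	const { hostname } = new URL(url);
	return (proxy.hostnames || []).includes(hostname);
}

const globalProxy = {
	enable: true,
	use: false,
	hostnames: ['www.google.com', 'www.example.com'],
};

console.log(shouldUseProxy('https://www.google.com/search', globalProxy)); // true
console.log(shouldUseProxy('https://news.example.org/', globalProxy)); // false
console.log(shouldUseProxy('https://news.example.org/', globalProxy, { use: true })); // true
```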

Feedback events

Usage:

  • unprint.on('trigger', callbackFn)
  • unprint.off('trigger', callbackFn)

Triggers:

  • requestInit: An HTTP request is about to be made
  • requestSuccess: The HTTP request completed with an OK status code
  • requestError: The HTTP request completed with an error status code
  • query: A query method was used
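
The on/off pattern mirrors Node's EventEmitter, which is used here as a stand-in so the example runs without network access. The payload shape (an object with a url) is an assumption made for the illustration; the data unprint actually passes to callbacks is not documented above.

```javascript
const { EventEmitter } = require('node:events');

// Stand-in emitter for illustration; with unprint you would call
// unprint.on() and unprint.off() with the same trigger names.
const events = new EventEmitter();
const seen = [];

function logRequest(payload) {
	seen.push(payload.url); // assumed payload shape
}

events.on('requestInit', logRequest);
events.emit('requestInit', { url: 'http://localhost:3101/html' });

// After off(), the same callback no longer receives events.
events.off('requestInit', logRequest);
events.emit('requestInit', { url: 'http://localhost:3101/other' });

console.log(seen); // [ 'http://localhost:3101/html' ]
```

Keeping a reference to the callback, rather than passing an inline function, is what makes it possible to unsubscribe later.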