kreepy

A simple Node.js web crawler which converts downloaded files to Markdown.

Why another crawler?

I just wanted something simple to convert my PmWiki into a set of Markdown files, while carrying over the special semantics I had established in the wiki to a file-based system.

Kreepy lets you do something similar, and its concise code base makes it easy to understand and extend.

Note that this is a very minimal implementation and many edge cases aren't handled.

Getting started

With Kreepy, you put your crawling logic into app.js, overriding the functionality provided by the Engine class in engine.js. The provided app.js demonstrates how to do this, using the crawl of my PmWiki as an example.
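
As a rough orientation, an app.js built this way might look something like the sketch below. The constructor options and export shape are assumptions based on the description that follows, not the documented API; the individual hook overrides are sketched in the later sections.

```js
// app.js -- illustrative sketch only; the real Engine options and
// method names in engine.js may differ from what's assumed here.
const { Engine } = require('./engine');

class WikiCrawler extends Engine {
  // Override hooks such as urlClassify, extractMeta and urlToFile here;
  // the sections below sketch what those overrides might look like.
}

// Assumption: configuration is passed to the constructor.
const crawler = new WikiCrawler({ concurrency: 1 });

// Seed the queue with the first URL and start crawling.
crawler.start('https://example.com/pmwiki/index.php/Main/HomePage');
```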

Kreepy logic

Here's a brief account of how Kreepy runs.

1. Load URL

Start Kreepy via engine.start(), which adds the first URL to the queue; the engine then works through the queue as follows. Processing can happen in parallel if config.concurrency is above 1.

  • If we've reached the limit of URLs to fetch (set by config.maxUrlsToProcess), stop
  • Download the content and load it up as a document object model (DOM)
  • Call the engine's extractLinks function
  • If the URL has been classified for full crawling, call extractAndSave to save the content

If the response is an image or PDF and config.saveBinaries is true, it is saved by saveBinary. The path for the file is determined by urlToFile.
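
For reference, the knobs mentioned in this step could sit in a config along these lines. The key names come from this README; how the object is actually wired into the engine (module export, constructor argument, etc.) is an assumption.

```js
// Illustrative config for step 1; key names follow the README,
// but how Kreepy consumes the object is an assumption.
const config = {
  concurrency: 4,         // process up to four URLs in parallel
  maxUrlsToProcess: 500,  // stop once this many URLs have been fetched
  saveBinaries: true,     // save images/PDFs to disk via saveBinary
};

module.exports = config;
```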

2. Extract links

Working with the DOM, find all anchor ("a") tags and process them in series. This behaviour can be changed, for example to pull links only from a certain area of the page, by setting config.linkSelector (default: "a").

If you want to include images, for example, you could use a selector such as a, img.

  • Normalise the URL using urlPreprocess (i.e. transform the anchor's href)
  • Classify the URL with urlClassify: should the URL be processed in full, just have its links traversed, or be skipped entirely? (See the sketch after this list.)
  • If we've already seen the URL, or it shouldn't be processed, don't do anything more with it
  • Otherwise, add it to the list of seen URLs and add it to the worker queue (which is processed in step 1)
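
Below is a sketch of how urlPreprocess and urlClassify might be overridden for a PmWiki-style site. The method signatures and the classification return values are assumptions; only the hook names and the three-way full / traverse-only / skip distinction come from this README.

```js
// Illustrative overrides for step 2; signatures and return values are assumed.
const { Engine } = require('./engine');

class WikiCrawler extends Engine {
  // Normalise the href before it is classified or queued.
  urlPreprocess(href) {
    return href
      .replace(/#.*$/, '')            // drop fragment identifiers
      .replace(/\?action=\w+$/, '');  // drop PmWiki action parameters
  }

  // Decide what to do with each URL: crawl fully, follow links only, or skip.
  urlClassify(url) {
    if (!url.startsWith('https://example.com/pmwiki/')) return 'skip';
    if (url.includes('/Site/')) return 'traverse'; // follow links, don't save
    return 'full';                                 // convert and save
  }
}
```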

3. Extract and save content

  • Extract metadata for the YAML front-matter using extractMeta
  • Extract the content as a string using extractContent
  • Generate a path for the URL using urlToFile (see the sketch after this list), creating directories if needed
  • Save the file
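
As an illustration of the path mapping in this step, a urlToFile for a PmWiki might look like the following; the function's signature and return value are assumptions.

```js
// Illustrative urlToFile for step 3; the real hook's signature may differ.
const path = require('path');

function urlToFile(url) {
  // e.g. https://example.com/pmwiki/index.php/Group/PageName -> out/Group/PageName.md
  const { pathname } = new URL(url);
  const page = pathname.replace(/^\/pmwiki\/index\.php\//, '');
  return path.join('out', `${page || 'index'}.md`);
}
```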

4. Extract metadata

By default, this does nothing. In my PmWiki crawler, I pull out the title and certain blocks of content.
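
A sketch of such an override is below, written as a standalone function using cheerio; the DOM library Kreepy actually hands to the hook, and the hook's signature, are assumptions.

```js
// Illustrative extractMeta for step 4; the DOM library (cheerio here)
// and the hook's signature are assumptions.
const cheerio = require('cheerio');

function extractMeta(html, url) {
  const $ = cheerio.load(html);
  return {
    title: $('title').text().trim() || 'Untitled', // page title for the front-matter
    source: url,                                   // where the page came from
  };
}
```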

5. Extract content

  • Rewrite links so they point to the file system rather than the web, using urlToFile; this in turn calls processUrlForFile for early pre-processing
  • Process the content as a DOM with processDom (it's often easier to clean up or remove content while it's in DOM form)
  • Convert the DOM to HTML with domToContent (overriding this allows you to pull just an extract from the document)
  • Convert the content to Markdown with toMd
  • Replace HTML entities
  • Clean up the Markdown with processMd (a sketch of these content hooks follows this list)
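
As a rough illustration of the hooks in this step, the sketch below cleans up a page with processDom, extracts the wiki body with domToContent, and tidies the Markdown with processMd. The cheerio usage, the selectors and the signatures are all assumptions.

```js
// Illustrative content hooks for step 5; library choice (cheerio),
// selectors and signatures are assumptions.
const cheerio = require('cheerio');

// processDom: strip site chrome before conversion.
function processDom(html) {
  const $ = cheerio.load(html);
  $('#wikihead, #wikifoot, .noprint').remove(); // example PmWiki selectors
  return $;
}

// domToContent: pull just the main wiki text out of the document.
function domToContent($) {
  return $('#wikitext').html() || $('body').html();
}

// processMd: final Markdown clean-up.
function processMd(md) {
  return md.replace(/\n{3,}/g, '\n\n').trim(); // collapse runs of blank lines
}
```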