
alexandria

v0.1.7

A content cache to store crawled content in Elasticsearch

Alexandria provides a simple interface to store resources downloaded from the web. It uses Elasticsearch to index downloaded content.

Example Usage

```
alexandria = require "alexandria"
{EventChannel} = require "mutual"
sleep = require "sleep"

# Elasticsearch configuration
options =
  indexName: "example_crawler_cache"
  host: "127.0.0.1"
  port: 9200
  secure: false

events = new EventChannel
events.on "error", (err) ->
  console.log "Error: #{err}"

run = ->
  console.log "\nExample usage of alexandria..."
  do events.serially (go) ->
    go ->
      console.log "Initializing alexandria"
      # Initialize alexandria with the Elasticsearch server details
      alexandria.initialize(options)
    go ->
      console.log "Deleting store"
      # Delete the entire crawler cache, e.g. to recrawl from scratch
      alexandria.deleteStore()
    go ->
      console.log "Creating store"
      # Recreate the crawler cache
      alexandria.createStore()
    go ->
      console.log "Getting resource from cache"
      # Sleep for a second to give Elasticsearch enough time to replicate
      sleep.sleep 1
      # Get a resource from the cache; the second argument indicates
      # whether to download the resource if it is not already cached
      alexandria.getResource("http://en.wikipedia.org/wiki/Main_Page", true)
    go ({contentType, content}) ->
      # Work with the downloaded content
      console.log "Fetched #{contentType} resource"
    go ->
      console.log "Putting resource into cache"
      # Sleep for a second to give Elasticsearch enough time to replicate
      sleep.sleep 1
      # Put a resource into the cache: url, content type, content
      alexandria.putResource(
        "http://example.com/test-resource.html"
        "text/html; charset=utf-8"
        "<html><head></head><body><div></div></body></html>"
      )
    go ->
      console.log "Getting urls of all resources in cache\n"
      # Sleep for a second to give Elasticsearch enough time to replicate
      sleep.sleep 1
      # Get urls of all resources in cache from a given domain
      alexandria.getAllResourceUrls("wikipedia.org")

run()
```
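If you save the example as, say, example.coffee (any filename will do), you can run it with the coffee CLI after installing the dependencies (npm install alexandria mutual sleep) and making sure Elasticsearch is reachable at the configured host and port.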

Notes on Elasticsearch index and type mappings

  • The index where resources are stored can be configured via the options passed to initialize.

  • For every domain whose resources are stored, two type mappings are created:

  • One type stores the content type and the actual content, either as plain text (for text content) or base64 encoded (for binary content). The id of a document of this type is the md5 digest of the content.

  • The other type stores the url of the resource and a reference to the document that holds the actual content. The id of a document of this type is the md5 digest of the url.

  • The types are named after the domain, with dots replaced by underscores; the content type additionally gets a _content suffix. For example, the domain example.com gets the two types example_com_content and example_com (see the sketch after these notes).

  • Following are example mappings (in CSON format) of the two types that would be created for the domain example.com:

    • Mapping for the type that stores resource urls (the content_ref field is the md5 digest of the content itself):

      ```
      example_com:
          properties:
              url: type: "string"
              content_ref: type: "string", index: "not_analyzed"
      ```
          
    • Mapping for the type that stores the content (text_content holds the plain text when the content type is text; otherwise binary_content holds the content, base64 encoded):

      ```
      example_com_content:
          properties:
              content_type: type: "string", index: "no"
              text_content: type: "string"
              binary_content: type: "binary"
      ```
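
For illustration, here is a minimal sketch of how the type names and document ids described above can be derived. The helpers md5 and typeNamesFor are hypothetical names for this example, not part of alexandria's API:

```
crypto = require "crypto"

# md5 digest as a hex string; used as the document id
# (hypothetical helper, not part of alexandria's API)
md5 = (value) ->
  crypto.createHash("md5").update(value).digest("hex")

# Derive the two type names for a domain: dots become underscores,
# and the content type gets a _content suffix
typeNamesFor = (domain) ->
  base = domain.replace /\./g, "_"
  urlType: base
  contentType: "#{base}_content"

console.log typeNamesFor "example.com"
# -> { urlType: 'example_com', contentType: 'example_com_content' }

# The id of the url document for a resource is the md5 digest of its url
console.log md5 "http://example.com/test-resource.html"
```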