hittp

v0.7.7

Published

3 years ago

HTTP library specifically designed for crawling the web. Built-in caching and per-domain queueing

sunnypurewal

http get stream request crawl scrape webcrawl throttle cache

hittp

hittp is an HTTP library specifically designed for crawling the web, but it can also be used as a general-purpose HTTP library. It has built-in caching, which makes testing a web scraper fast and easy to replicate. It also has per-domain queueing: multiple requests to the same host are delayed so that your crawler does not overload the server.

See Turbo Crawl for a powerful web crawling library based on hittp.

Simple GET

When you just want to fetch an entire page into memory.

const hittp = require("hittp")

hittp.get("newyorktimes.com").then((html) => {
  console.log(html)
})
/* OR */
const getNYTimes = async () => {
  const html = await hittp.get("newyorktimes.com")
  console.log(html)
}

Streaming GET

When you want to fetch a page and save it to file.

const fs = require("fs") //filesystem
const hittp = require("hittp")

hittp.stream("newyorker.com/sitemap.xml").then((httpstream) => {
  const f = fs.createWriteStream("./sitemap.xml")
  httpstream.pipe(f)
})
/* OR */
const getSitemap = async () => {
  const httpstream = await hittp.stream("newyorker.com/sitemap.xml")
  const file = fs.createWriteStream("./sitemap.xml")
  httpstream.pipe(file)
}

Web Crawling

hittp is especially useful when making many requests to one host. Requests are queued, and the same host will not be hit more than once every options.delay_ms milliseconds. This ensures that the website you are crawling is not overloaded with requests.

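const fs = require("fs")
const hittp = require("hittp")

const urls = [] // Some long list of URLs
const options = {
  delay_ms: 3000 // Wait at least 3 seconds between requests to the same host
}
for (let url of urls) {
  if (typeof(url) === "string") url = hittp.str2url(url)
  if (!url) continue // str2url returns null for invalid URLs
  // Flatten host and path into one file name so no subdirectories are needed
  const name = (url.host + url.pathname).replace(/\//g, "_")
  hittp.stream(url, options).then((httpstream) => {
    const file = fs.createWriteStream(`./${name}.html`)
    httpstream.pipe(file)
  })
}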

Given a long list of URLs from many domains, hittp can fetch many webpages at once while still respecting each server's delay. This is a key aspect of building a web crawler and hittp takes care of it.
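To make the per-host delay concrete, here is a minimal sketch of how such a scheduler could be written. This is not hittp's actual implementation: the class name `HostScheduler` and its `reserve` method are hypothetical, and the example URLs are placeholders. It only computes how long each request should wait so that requests to the same host are spaced at least `delayMs` apart, while different hosts do not block each other.

```javascript
// Hypothetical per-host scheduler, for illustration only (not part of hittp).
class HostScheduler {
  constructor(delayMs) {
    this.delayMs = delayMs
    // host -> earliest timestamp (ms) at which the next request may start
    this.nextFree = new Map()
  }

  // Reserve the next slot for this URL's host and return how many
  // milliseconds the caller should wait before sending the request.
  reserve(url, now = Date.now()) {
    const host = new URL(url).host
    const startAt = Math.max(now, this.nextFree.get(host) ?? 0)
    this.nextFree.set(host, startAt + this.delayMs)
    return startAt - now
  }
}

const scheduler = new HostScheduler(3000)
const t0 = Date.now()
const waits = [
  "https://a.example/1",
  "https://a.example/2",
  "https://b.example/1",
  "https://a.example/3"
].map((url) => scheduler.reserve(url, t0))

console.log(waits) // [ 0, 3000, 0, 6000 ]
```

A caller could pass the returned wait to `setTimeout` before issuing the request. Keeping the bookkeeping in a `Map` keyed by host is what lets requests to many domains proceed in parallel.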

str2url

When you want to convert a string into a URL object with the protocol, host, and path filled in automatically. Returns null if the string is not a valid URL.

const hittp = require("hittp")

const url = hittp.str2url("vox.com")
console.log(url.href)
// 'https://vox.com'
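For readers curious how this kind of normalization can work, the following is a rough sketch built on Node's built-in `URL` class. The helper name `toUrl` is hypothetical and the default `https://` protocol is an assumption; hittp's own `str2url` may handle edge cases differently and may format `href` differently.

```javascript
// Hypothetical helper, for illustration only; not hittp's str2url.
function toUrl(input) {
  if (typeof input !== "string" || input.trim() === "") return null
  const trimmed = input.trim()
  // Assumption: default to https:// when the string has no protocol.
  const hasProtocol = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)
  const candidate = hasProtocol ? trimmed : `https://${trimmed}`
  try {
    return new URL(candidate)
  } catch (err) {
    return null // new URL() throws a TypeError on unparseable input
  }
}

console.log(toUrl("vox.com").href) // https://vox.com/ (the URL class adds the trailing slash)
console.log(toUrl("http://example.com/a?b=1").pathname) // /a
console.log(toUrl("not a url")) // null
```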

Configuration

The default configuration can be overridden with an options argument.

const hittp = require("hittp")
// Defaults:
const options = {
  timeout_ms: 10000,
  decoded: true,
  delay_ms: 0,
  cachePath: "./.hittp/cache"
}
hittp.get("qz.com", options).then((html) => {
  // Do something with the html
})

decoded

Setting this option to true will return html as a string. Setting this option to false will return html as a Buffer.
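The difference between the two return types can be shown with plain Node, without making a request. In this sketch the `body` Buffer is a stand-in for what a call with `decoded: false` would hand back, and UTF-8 is an assumed encoding; a real page may use a different one.

```javascript
// Stand-in for a raw response body (assumed to be UTF-8 encoded).
const body = Buffer.from("<h1>caf\u00e9</h1>", "utf8")

console.log(Buffer.isBuffer(body)) // true
console.log(body.length) // 14 (bytes; the accented letter takes two)

// Decoding by hand yields the kind of string you get with decoded: true.
const html = body.toString("utf8")
console.log(typeof html) // string
console.log(html.length) // 13 (characters)
```

A Buffer is the better choice when the bytes are written straight to disk or the encoding must be detected first; a string is more convenient for parsing.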

cachePath

Setting this option to null will disable caching.

Don't forget to add your cache path to .gitignore! Default path is ./.hittp
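As an illustration of what a URL-keyed disk cache can look like, here is a small sketch. It is not hittp's cache: the `DiskCache` class, the one-file-per-URL layout, and the SHA-256 file names are all assumptions made for the example. It writes to a temporary directory so it can be run safely.

```javascript
const crypto = require("crypto")
const fs = require("fs")
const os = require("os")
const path = require("path")

// Hypothetical cache, for illustration only: one file per URL,
// named by the SHA-256 hash of the URL string.
class DiskCache {
  constructor(cachePath) {
    this.cachePath = cachePath // null disables caching, as with the option above
    if (cachePath) fs.mkdirSync(cachePath, { recursive: true })
  }

  fileFor(url) {
    const key = crypto.createHash("sha256").update(url).digest("hex")
    return path.join(this.cachePath, key)
  }

  // Returns the cached body, or null on a miss or when caching is disabled.
  get(url) {
    if (!this.cachePath) return null
    const file = this.fileFor(url)
    return fs.existsSync(file) ? fs.readFileSync(file, "utf8") : null
  }

  set(url, html) {
    if (!this.cachePath) return
    fs.writeFileSync(this.fileFor(url), html)
  }
}

const cacheDir = fs.mkdtempSync(path.join(os.tmpdir(), "cache-sketch-"))
const cache = new DiskCache(cacheDir)
const missed = cache.get("https://qz.com/")
cache.set("https://qz.com/", "<html>cached</html>")
const hit = cache.get("https://qz.com/")

console.log(missed) // null
console.log(hit) // <html>cached</html>
```

Because every cached page becomes a file on disk, the cache directory grows with each crawl, which is why it belongs in .gitignore.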
