Google Crawler
This project is an effort to turn a publicly available paste into an NPM package. The original paste is available at the following URL:
- http://pastebin.com/VPSC1ndf
It's an Express middleware that serves raw HTML to Google's crawler, following their AJAX crawling specification:
- https://developers.google.com/webmasters/ajax-crawling/
It allows indexing JavaScript-heavy applications (SPAs) by providing an HTML
rendering of pages when they are requested with the special _escaped_fragment_
query parameter.
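For example (the URLs here are hypothetical), under Google's AJAX crawling scheme a crawler that encounters a hashbang URL re-requests it with the fragment moved into the query string, and that second form is what this middleware intercepts:

http://example.com/#!/products
-> http://example.com/?_escaped_fragment_=/products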
It relies on a PhantomJS backend to run the frontend's JavaScript.
Installation
This module is available through NPM:
npm install --save google-crawler
Usage
var express = require('express');
var google_crawler = require('google-crawler');

var server = express();

// Register the middleware before your routes so that crawler
// requests are intercepted first.
server.use(google_crawler({
  scraper: 'http://scraper.example.com/img/'
}));

// Continue setting things up..
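As a rough sketch of the runtime flow (host and path are hypothetical), a request carrying the escaped fragment is answered with a PhantomJS-rendered snapshot rather than the SPA shell:

curl 'http://example.com/?_escaped_fragment_=/products'
# The middleware asks the scraper backend to render the page
# and returns the resulting HTML to the crawler.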
On your frontend, you'll want to include the following element in the page's <head>:

<meta name="fragment" content="!">

Per Google's specification, this tells the crawler that an HTML snapshot is available and that it should re-request the page with the _escaped_fragment_ parameter.
Configuration
The middleware accepts the following parameters:
shebang
: a boolean determining whether or not to build URLs with a shebang (#!).

scraper
: a URL pointing to the PhantomJS backend.
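A minimal sketch passing both options (the scraper URL is a placeholder, and shebang: true assumes the boolean described above):

server.use(google_crawler({
  shebang: true,
  scraper: 'http://scraper.example.com/'
}));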
Sample backend
PhantomJS backends are expected to be built with phantom-crawler:
- https://bitbucket.org/wizzbuy/phantom-crawler
Here's a sample crawler:
phantom.injectJs('crawler/crawler.js');

new Crawler()
  .chrome()
  .debug()
  .crawl(function () {
    // Runs against the rendered page: serialize the DOM into
    // the HTML snapshot returned to the middleware.
    return [
      '<!DOCTYPE html>',
      '<html>',
      document.head.outerHTML,
      document.body.outerHTML,
      '</html>'
    ].join('\n');
  })
  // Listen on $PORT, falling back to 8888.
  .serve(require('system').env.PORT || 8888);
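Assuming the script above is saved next to phantom-crawler's crawler/crawler.js (the file name sample-crawler.js is arbitrary), the backend can be started with PhantomJS:

phantomjs sample-crawler.js    # serves on port 8888 unless PORT is set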