google-docs-markup

v1.0.1

Published

3 years ago

Builds an Abstract Syntax Tree out of a Google Doc.

Downloads

0High
0Medium
0Low

adamhooper

google docs markup rendering template engine parser abstract syntax tree

Parse Google Docs

This is a parser for journo-developers. It's for this workflow:

Share a Google Doc in your newsroom
Collaborate on the story
Publish that story through your code
Repeat as necessary

Google Docs is fantastic for editing but icky for publishing. This library lets you decide how to convert your Google Doc to HTML (or any other format).

Usage

google-docs-markup is a parser that outputs a JSON Array. You can render its output with any programming language. To use the parser, though, you'll need NodeJS, at least version 5.11.

Using google-docs-markup

First, install it:

npm init -y # if you haven't already
npm install --save-dev google-docs-markup

Then pass it HTML:

const gdm = require('google-docs-markup')
const html = '' // get a document's HTML somehow
const blocks = gdm.parse_google_docs_html(html)

Now you have "Blocks"!

Downloading from the Google Drive API

We skipped the hard part: grabbing the HTML from the Google Drive API. Beware: you must use the API, because "published" Google Docs use different HTML conventions. The API uses OAuth, so it's confusing. We suggest you use google-docs-markup offline and store the resulting documents in your project's repository. Here's how we do it:

Install google-docs-console-download:

npm install --save-dev google-docs-console-download

Create a function to download a file:

const gdcd = require('google-docs-console-download')(null)

function downloadGoogleDocAsBlocks(docId, callback) {
  gdcd.download(docId, (err, html) => {
    if (err) return callback(err)
    const blocks = gdm.parse_google_docs_html(html)
    callback(null, blocks)
  })
}

And if your project uses multiple files, maybe you'll want a config file. Perhaps write "config/google-docs.json" with these contents:

[
  { "slug": "index", "googleId": "1qLoJYmUEJvpQdP4Xplp6I5JBsMpRY9RZTnak2gPhiEQ" },
  { "slug": "page1", "googleId": "1qLoJYmUEJvpQdP4Xplp6I5JBsMpRY9RZTnak2gPhiEQ" }
]

Write a script that downloads them all:

const docs_config = require('./config/google-docs')
const fs = require('fs')

function downloadAndSaveAll(docs, dirname, callback) {
  const todo = docs.slice()

  (function tick() {
    const doc = todo.shift()
    if (!doc) return callback()

    downloadGoogleDocAsBlocks(doc.googleId, (err, blocks) => {
      if (err) return callback(err)
      fs.writeFile(`${dirname}/${doc.slug}.json`, JSON.stringify(blocks), callback)
    })
  })()
}

downloadAndSaveAll(docs_config, '.', (err) => {
  if (err) console.error(err)
})

Run that script manually every time editors modify the Google Docs.

Now you have JSON markup in ./index.json and ./page1.json. If you're loading it in Node, it's easy:

const index_blocks = require('./index')
const page1_blocks = require('./page1')

"Blocks" Output Format

google-docs-markup output looks like this:

const blocks = [
  { type: 'h1', texts: [ { type: '', text: 'The title' } ] },
  { type: 'p', texts: [
    { text: 'This sentence contains ' },
    { text: 'italics', italic: true },
    { text: ', ' },
    { text: 'bold italics', italic: true, bold: true },
    { text: ' and ' },
    { text: 'underlines', underline: true },
    { text: '.' }
  ] },
  { type: 'page-break' },
  { type: 'hr' },
  { type: 'ul', blocks: [
    { type: 'li', texts: [ { text: 'List item' } ] }
  ] },
  { type: 'ol', blocks: [
    { type: 'li', texts: [ { text: 'Numbered list item' } ] }
  ] }
]

Blocks

Think of this tree as a super-duper-simple HTML document. A document is made of "blocks". Some blocks have "block" children. Other blocks have "text" children. We never mix the two.

Blocks can have these types, derived from HTML:

h1, h2, h3, h4: headers. Contain texts.
p: paragraph. Contains texts.
ul, ol: lists. Contain li blocks.
li: list item. Contains texts.
hr: horizontal rule. Contains nothing.
page-break: page break. Contains nothing. Not HTML.

There is no recursive nesting. You can't build a list of lists. You can't put a header in a list or a list in a paragraph.

There are no attributes. Blocks don't have HTML classes or any other formatting information. We leave that to texts.

There are no empty blocks. If there's no text, there's nothing.

Texts

Texts always have a text attribute. They can also have:

italic: true if the text is italicized in the Google Doc
bold: true if the text is bolded in the Google Doc
underline: true if the text is underlined in the Google Doc

There are no empty texts.

Tips and tricks

Use underline to build a templating engine

Here's the nifty thing: you probably aren't going to output underlined text. Who even does that these days? We underline code.

For example, in our Google Doc, we underline something like render(...). Then when we render our blocks and texts, we replace any text with underlined:true with eval(text).

Beware smart quotes: Google Docs usually turns render("foo") into render(“foo”), which is invalid in most programming languages. It does the same thing with single quotes. Try to keep those characters out of your underlined code, or your newsroom practices will probably break it:

If you're rendering with eval() in JavaScript, write Strings \like this`` (backticks are an alternative to single or double quotes).
If you're rendering with eval() in Ruby, write Strings %{like this} (that escape sequence is an alternative to single or double quotes).
If you're using a programming language with fewer features (like Python or a compiled language), don't use eval(): parse the underlying code using a mini-language you design yourself (one that doesn't use require single or double quotes).

Include instructions below a page break

We maintain an example Google Doc that shows off all features. It doubles as a page of instructions.

Copy/paste that page into a fresh Google Doc. Write your story above the instructions; then write any notes (a scratchpad), then leave the instructions. Add a page break after the story and before the notes.

The instructions will help journalists with syntax. And you can easily drop everything after the first page break. For instance:

const html = ...;
const blocks = gdm.parse_google_docs_html(html);
blocks.splice(blocks.find(b => b.type === 'page-break'))

Adding features and fixing bugs

Run node_modules/.bin/mocha -w and verify all tests pass
Write a new test in the test/ directory and verify that mocha reports failure
Edit code in index.js until tests pass
git commit -a with a descriptive message
Repeat from step 2 until this library has all the features you want and no bugs you know of
git push

Releasing

After you've developed, you probably want other people to use your changes.

So publish to npm:

Edit package.json and set a new semantic version number.
git commit package.json
git push
npm publish

Then make your other projects depend on this new version:

npm install --save google-docs-markup