hotmetal
v1.0.17
Published
Parse, hyphenate, slice, and render your HTML.
Downloads
9
Maintainers
Readme
Table of Contents generated with DocToc
HoTMetaL
Motivation
HoTMetaL has been developed to simplify the finding of good line breaks in HTML sources; this is a core ingredient for the MingKwai Typesetter.
The Problem
The MingKwai Typesetter is an application to typeset print pages from MarkDown sources that are converted to HTML5 / CSS3, and then rendered by the browser component of an nwjs (formerly node-webkit) app.
While the choice of HTML5, CSS3 and a web browser to typeset text is a natural one given that it is the one globally most widespread text rendering technology, has been under very intense and competitive development for a quarter of a century now, and has in the process become both highly optimized and internationalized for a wide range of languages and scripts.
However, producing print masters from a rendering in the browser window has never been very much in the focus of vendors, and, hence, many of the techniques developed by printers have received a rather negligent treatment, one example being fine control over how lines are broken into paragraphs, and the typesetting of columns.
Fortunately, there is a wonderful and versatile programming language—JavaScript—that is closely wedded to the Document Object Model (DOM) that can be used to fill out any gaps of HTML and CSS.
The particular problem that HoTMetaL is intended to solve can be stated as follows: Given the source of an HTML paragraph, some collection of CSS style rules and an HTML layout which contains a block element intended to receive lines of type, how can we make it so that
❶ we can tell whether a given portion of the paragraph fits into the receiving container without occupying more than a single line and without containing less material than would be possible, given the length of a line?
❷ we can control where line breaks occur to optimize the appearance of a paragraph (as has been pioneered by Donal Knuth's TeX typesetting system)?
❸ we can later distribute lines so that common taks in book layout—such as the production of balanced columns, possibly with intervening illustrations—become feasible?
The Solution
The answer to problem ❶ can only be: we must actually typeset a line under 'realistic' conditions, that is, we must actually put the pertinent HTML tags onto an actual web page and then test whether the line is too short, just right, or too long. For any attempt to do it 'the TeX way'—i.e. by considering font metrics instead of actual fonts—is bound to ultimately reconstruct more or less the entire browser rendering engine in JavaScript, which is certainly too hard to be solved in a satisfactory manner.
The (partial) answer to problem ❷ is that we must find all those positions in a given HTML source text where line breaks are permitted, given the combination of script and language at a given point. This seemingly simple task is surprisingly difficult when we consider just a few points:
In an English text, we require that properly formatted texts use hyphens at the end of lines where otherwise a long word would cause an overly short line; those hyphens must only occur where permitted by intricate rules (which may not entirely lent themselves to a formalization and may require lists of difficult cases and exceptions as dictated by common usage);
In more traditionally typeset Chinese texts, all the characters, including punctuation, are expected to take up the exact same space, so that the result displays a rigid grid. Line breaks may occur at any point between any two characters; it may even be permitted to have a trailing period as the first (and, at the end of a paragraph, only) character on a line (in more modern Chinese texts, the tendency seems to be to abandon the strict grid in favor of variable spacing between characters and give less room to punctuation).
Other languages may use other devices such as elongated characters or, (as in Thai) inner-word breaks without hyphens that may, however, only occur at syllable boundaries.
Text Partitioning
Fortunately, there has been done quite some work in the field of language processing. First, there is the
Unicode Line Break Algorithm (UAX #14) which has been implemented in
JavaScript as a NodeJS module called linebreak
and may be
installed as easy as npm install linebreak
.
There's yet another implementation of UAX avalaible which goes by the unassuming name of Unicode Tokenizer.
Second, there is a hyphenation module, hypher
, with quite a few
language-specific hyphenation patterns available.
The combination of hypher
and linebreak
allows us to find all positions where e.g. an English text
may be broken. For example, the nonsense text:
'Paragraph internationalization assignment (certainly) relativity.'
will be partioned as
[ 'Para★', 'graph ', 'in★', 'ter★', 'na★', 'tion★', 'al★', 'iza★', 'tion ',
'as★', 'sign★', 'ment ', '(cer★', 'tainly) ', 'rel★', 'a★', 'tiv★', 'ity.', ]
where the stars ★
indicate 'soft hyphens' (i.e. hyphens that will only be shown when occurring at the end
of the line).
Assuming the existence of method to test whether a given text takes up a single or more than a single line in the browser, we can, then, take such a partitioning and apply it successively to a web page:
➀ ✅ Para-
➅ ✅ Paragraph
➆ ✅ Paragraph in-
➇ ✅ Paragraph inter-
➈ ✅ Paragraph interna-
➉ ✅ Paragraph internation-
➁ ✅ Paragraph international-
➄ ❌ Paragraph internationaliza-
➃ ❌ Paragraph internationalization
➂ ❌ Paragraph internationalization as-
A naive method to distribute material accross lines then just tests consecutive lines of increasing lengths;
as soon as it finds the first line that occupies more than a single line, it will accept the 'last good
line' (i.e. the previous line) and re-start the cycle, beginning with the part that caused the line to
become too long (in our case, line ➄ will end up to be typeset, followed by a line that starts with
Paragraph internationaliza-
). Of course, there may always be unbreakable portions that are too long for a
single line; in such cases, we could typeset that line anyway and issue a quality warning so the user is
alerted and gets a chance to fix things whichever way they see fit.
It's easy to see that the naive method will sometimes produce a fair number of consecutive hyphens,
paragraphs with a lot of hyphenations where a slight adjustment would have yielded less hyphenations, and
paragraphs where spaces happen to occur at similar places in adjacent lines, which produces unsightly
'rivers' of whitespace. But its simplicity and unassuming generality are still attractive; also, it seems to
produce acceptable results in reasonable environments (where the length of words is not too long compared to
the length of the lines). Be it said that it appears to work correctly for English, Chinese, Tibetan, and
Korean; for Thai, a syllable-segmentizer would be needed. This is already quite an achievement given that
it was possible to do with installing a mere two open-source modules from npm
!
One development left for the future is the adaption of the TeX (Knuth & Plass) line breaking
algorithm for the use in HoTMetaL; as it stands, said package uses an
HTML <canvas>
element to test for line lengths, which is a limitation that has become unnecessary.
Another worthwhile future development may be to implement so-called optical margin alignment, also known as hanging indentation. Because punctuation (and parts of other characters) are allowed to occupy some space in the margin, optical alignment does not only achieve a smoother ocerall impression, it also ever so slightly the effective line lengths, which should contribute to a more balanced spacing.
HTML Partitioning
It has already been said that in order to correctly test for line lengths, we must produce (partial) lines
under 'realistic' conditions; in other words, it will be necessary to put all of the HTML tags onto the web
page that are in effect for the portion in question. To clarify the problem, let's have a look at another
nonsense snippet of text, this time with peppered wiht meaningless, random tags. In this sample, breakpoints
with soft hyphens are again indicated with ★
, while breakpoints without hyphens are marked ✚:
Lo ✚<div id='mydiv'><em><i>✚ar★cade ✚& ✚一 ✚il★lus★tra★tion ✚<img src='x.jpg'>
✚<b>bro★mance</b>✚ cy★ber★space ✚<span class='foo'></span>✚ nec★es★sar★ily</i></em>✚ com★ple★te★ly.</div>
Now, since we do not know beforehand anything about font metrics, lines could end up starting and ending just anywhere, depending on font faces, font sizes, font styles, borders, image sizes—in other words, we must assume that each potential breakpoint may become an actual breakpoint. Thus, for example,
✚il★lus★tra★tion ✚
may be one HTML fragement (not quite yet, as we'll see momentarily) to be tested, and
✚il★lus★tra★tion ✚<img src='x.jpg'>
may be another one. However, just slicing such a piece out of its HTML context will not do; as
inspection reveals, we must observe that illustration
appears inside of three
tags: <div id='mydiv'>
, <em>
, and <i>
. Without these tags, we can not be sure that the font selection,
its size and style will be correct (rather, we can almost be sure they all will be incorrect in this case).
Therefore, it becomes necessary to walk back through the HTML structure and look for all the closing and
opening tags. Since HTML tags must always be openend and closed in a symmetric fashion, we know that
we only have to look for openening tags that do not correspond to a closing text. Also, we must close
all tags that have not already been closed when we're done with the relevant text portion. An additional
complication comes with the so-called 'self-closing tags' of HTML5, such as <br>
, <hr>
, <img>
and
others; these act as indivisible units and do not have a corresponding closing tag.
For example, when testing the word gnu
as it appears in this HTML:
foo <b> bar <i>baz</i> gnu frob</b>
we see that it is preceded by a closing tag </i>
which, as long as the HTML source is grammatically
correct, means that we can safely ignore the next opening tag; indeed, the next tag turns out to be <i>
.
When we hit upon the <b>
tag, we have to take it into consideration as there is no other closing tag so
far. Therefore, our HTML fragment becomes <b>gnu</b>
(with the trailing space elided). Likewise, our two
earlier examples become
<div id='mydiv'><em><i>illustration</i></em></div>
and
<div id='mydiv'><em><i>illustration <img src='x.jpg'></i></em></div>
respectively.
The HoTMetaL Data Structure
In order to make slicing and dicing of HTML a straightforward matter, HoTMetaL will parse HTML and turn it into ordinary lists (i.e. JavaScript arrays). In such a list, each breakpoint corresponds to one list element which is represented by a triplet representing opening tags, the text (or lone tag), and the closing tags. Opening and closing tags are again represented as lists; all tags appear in the order they appear in the document. To clarify, here are three equivalent views on the HoTMetaL list that results from parsing this HTML:
<p><b>very</b> nice <i>and</i> also good <img src="x.jpg"></p>
Feeding this source to HOTMETAL.parse()
, we have three ways to print out the resulting structure:
html = """<p><b>very</b> nice <i>and</i> also good <img src="x.jpg"></p>"""
HOTMETAL.parse html, ( error, hotml ) ->
throw error if error?
console.log hotml
console.log HOTMETAL.rpr hotml
console.log HOTMETAL.as_html hotml
# list structure, formatted for readability
[ [ [ '<p>', '<b>' ], 'very', [ '</b>' ], ],
[ [], ' nice ', [], ],
[ [ '<i>' ], 'and', [ '</i>' ], ],
[ [], ' also ', [], ],
[ [], 'good ', [], ],
[ [], '<img src="x.jpg">', [ '</p>' ] ] ]
HOTMETAL.rpr()
(short for 'representation') gives a nice overview how the parse method organizes
the input:
# as rendered by `HOTMETAL.rpr()`:
0______ 1________________ 2___
<p>,<b> very_____________ </b>
_______ nice_____________ ____
<i>____ and______________ </i>
_______ also_____________ ____
_______ good_____________ ____
_______ <img src="x.jpg"> </p>
It can be seen that producing HTML from this structure is as easy as concatenating all the texts in the lists, and getting a slice of valid HTML is almost as easy: We start by copying the triplets between the start and stop indexes from the original list; then, we then Ⓐ walk backwards and, in each triplet we encounter on the way, keep a count of the closing tags; whenever we meet with an unmatched openening tag, we add it to the openening tags of the first triplet in the result. Then, we Ⓑ walk forward through the result triplets, pushing all the openening tags to a stack from which we then pop the closing tags; the remaining tags on the stack are those that still have to be closed:
API
Parsing
parse
H.parse = ( html, settings, handler ) ->
Given an HTML source, an optional settings
object and a callback handler
, produce a HoTMetaL object
(a nested list; henceforth referred to as hotml
).
Currectly, a single setting, settings[ 'hyphenation' ]
, is honored; this value is used as argument
to [PIPEDREAMS](https://github.com/loveencounterflow/pipedreams2#new_hyphenate).new_hyphenate()
.
$parse
H.$parse = ->
(not yet implemented)
Modifying
slice
H.slice = ( hotml, start = 0, stop = null ) ->
Return a slice of the hotml
list. The method with its start
and stop
arguments mimicks the behavior of
the Array::slice()
method. When both start
and stop
are given, the slice between index start
(inclusively) and stop
(exclusively) will be returned; the slice will always be derived from a deep copy
of hotml
, so modifying sublists in the slice will not affect the original list.
Special cases are treated as follows: With start
and stop
omitted, a deep copy of hotml
is returned.
Otherwise, start
and stop
are both confined to range [ 0 .. hotml.length ]
. When start
and
stop
coincide or stop
is less than start
, an empty list is returned.
Typesetting
break_lines
H.break_lines = ( html, test_line, set_line, handler ) ->
$break_lines
`H.$break_lines = ( test_line ) ->``
get_column_linecounts
H.get_column_linecounts = ( strategy, line_count, column_count ) ->
Rendering
as_html
H.as_html = ( hotml ) ->
Return the hotml
object as an HTML string.
rpr
H.rpr = ( hotml ) ->
Return the hotml
object as an ASCII-art table for display in the terminal.
H = require 'hotmetal'
html = """<img src='x.jpg'>lo <div id='mydiv'><em><i>arcade & 一 illustration
<b>bromance</b> cyberspace <span class='foo'></span> necessarily</i></em> completely.</div>"""
H.parse html, ( error, hotml ) =>
throw error if error?
for start in [ 0, 3, 10, ]
for delta in [ 0 .. 5 ]
stop = start + delta
# urge start, stop, H.rpr H.slice hotml, start, stop
info start, stop, H.as_html H.slice hotml, start, stop
urge JSON.stringify hotml
help H.rpr hotml
info H.as_html hotml