@amermathsoc/texml-to-html

v18.1.0

Published

A NodeJS library for converting AMS-style JATS XML to HTML

Downloads

36

Readme

texml-to-html

Converting texml-generated JATS/BITS-like XML to HTML.

Getting started

Quick example

For a first test run, try an example, e.g.,

  • Install via npm: $ npm i @amermathsoc/texml-to-html
  • Process a test file: $ node node_modules/@amermathsoc/texml-to-html/examples/cli.js node_modules/@amermathsoc/texml-to-html/text/article.xml > htmlOutput.html

Basic usage

import fs from "fs";
import path from "path";
import xml2html from "@amermathsoc/texml-to-html";

const article = xml2html(
  fs.readFileSync(path.resolve(process.argv[2])).toString()
).window.document;
console.log(article.toString());

Overview

Our general strategy for elements and attributes is to follow allow-lists and discard everything else.

We primarily recurse through the input DOM, building the output DOM as we go. Rarely, we deviate from this approach for practical reasons (e.g., for metadata extraction).
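The allow-list recursion described above can be sketched roughly as follows. This is a simplified illustration, not the library's code: nodes are plain objects rather than DOM nodes, the function name buildOutput is made up, and dropping the whole subtree of an unlisted element is an assumption about what "discard" means.

```javascript
// Illustrative only: nodes are plain objects ({ name, children }) or strings,
// not the DOM nodes the library actually works with.
const PRESERVED = new Set([
  "hr", "p", "pre", "sup", "sub", "table", "tbody", "thead", "th", "tr", "td",
]);

// Hypothetical helper: recurse through the input tree and build a new output tree.
function buildOutput(node) {
  if (typeof node === "string") return node; // text is copied as-is
  // Assumption: an element that is not on the allow-list is dropped with its subtree.
  if (!PRESERVED.has(node.name)) return null;
  const children = (node.children ?? [])
    .map(buildOutput)
    .filter((child) => child !== null);
  return { name: node.name, children };
}

const input = {
  name: "p",
  children: [
    "x",
    { name: "sup", children: ["2"] },
    { name: "unknown-element", children: ["dropped"] },
  ],
};
const output = buildOutput(input);
console.log(JSON.stringify(output));
```

The real converter also maps many texml elements to different HTML elements (see the mappings further down), which this sketch leaves out.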

preserved elements

Some elements in texml's XML output have the same name (and purpose) as in HTML. We preserve them in the output:

  • hr
  • p
  • pre
  • sup
  • sub
  • table
  • tbody
  • thead
  • th
  • tr
  • td

The following custom tag names are preserved:

  • cite-group (wrapper around citations)
  • cite-detail (wrapper for optional argument of \cite)

preserved attributes

Some attributes in texml's XML output have the same name (and purpose) as in HTML.

As per our general strategy, we only preserve some attributes on some elements.

  • attribute names that are preserved
    • class
    • id
    • rowspan
    • colspan
    • hidden
  • element names where those attributes are preserved
    • all preserved elements (cf. above)
    • abstract
    • app
    • boxed-text
    • def-list
    • def
    • fig
    • fn
    • notes
    • p
    • preface
    • ref-list
    • sec-heading
    • statement
    • target
    • term
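As a rough sketch of the attribute allow-list above, assuming attributes are given as a plain object of name/value pairs (the helper name filterAttributes is hypothetical, and the per-element check is omitted for brevity):

```javascript
// Attribute names taken from the list above.
const PRESERVED_ATTRIBUTES = ["class", "id", "rowspan", "colspan", "hidden"];

// Hypothetical helper: keep only the allow-listed attributes.
// `attributes` is a plain object, not a DOM NamedNodeMap.
function filterAttributes(attributes) {
  return Object.fromEntries(
    Object.entries(attributes).filter(([name]) =>
      PRESERVED_ATTRIBUTES.includes(name)
    )
  );
}

const kept = filterAttributes({
  id: "thm1",
  class: "theorem",
  style: "color:red",
  onclick: "x()",
});
console.log(kept); // only id and class remain
```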

data-* attributes

Beyond HTML elements and attributes, texml-to-html stores data in custom data-* attributes. The following lists should help as a guide to understanding this structural information.

:warning: This list can easily fall out of date. It should be automated.

data-* attribute values and origin

  • data-ams-doc
    • titlepage
    • subtitle
    • article
    • copyright-page
    • copyright
    • amsref
    • paragraph
    • subtitle
    • sec-meta
    • secheading
    • {@specific-use} [sec, ack, front-matter-part, dedication, custom-meta]
    • statement
    • graphic
    • inline-graphic
    • math inline
    • math block
    • math tex [added in ams-html output]
    • amsref
    • stringname
    • verse-group
    • label
    • title
    • notes [on section, from XML notes element, cf. AmerMathSoc/texml-to-html#329]
    • biblioentry [formerly role="doc-biblioentry" (deprecated)]
    • tags - container for (duplicated) equation tags
      • container is inside data-ams-doc="math block" elements
      • ams-html uses these to generate the math panel DTs for equations
    • app-group [from book-app-group]
  • data-ams-doc-contrib
    • {@content-type} [expected: "authors", "editors", "translators", "contributors"]
    • {@contrib-type} [expected: "author", "editor", "translator", "contributor"]
    • {@contrib-type} name
  • data-ams-doc-contrib-comment
    • STRING
  • data-ams-style
    • {styled-content@style-type} [expected: sans-serif]
    • roman
    • sc
    • monospace
    • underline
    • {disp-quote@specific-use}
    • (inline-)graphic{@specific-use}
    • {@style} [expected: theorem styles, sec styles]
    • boxed (from boxed-text)
  • data-ams-ref
    • {@ref-type} [expected: bibr, fn, disp-formula, sec, fig, table, algorithm, list, statement]
    • notrid
    • fn-return [added in ams-html output]
    • toc-entry@specific-use [expected: section, chapter, etc.]
  • data-ams-doc-level
    • [0-9]
  • data-ams-content-type
    • {@content-type}
    • {@notes-type} (for notes elements, cf. AmerMathSoc/texml-to-html#329)
  • data-ams-specific-use
    • {@specific-use}
  • data-ams-qed-box
    • BOOLEAN
  • data-ams-position
    • anchor
  • data-ams-width
    • graphic | inline-graphic @width
  • data-ams-height
    • graphic | inline-graphic @height
  • data-ams-doc-alttitle
    • alt-title (book only, for sectioning content only)
  • data-ams-href
    • stores href for span that avoids a nested link (cf. xref, ext-link)

downstream data-* attributes

While the vast majority of data attributes originate in texml-to-html, we have a few cases where downstream tooling introduces custom attributes. We list the attribute names, the related tools and purpose:

  • [deprecated] data-eqn-tag-#
    • superseded by data-ams-tags (cf. earlier)
    • ams-eqn-store used to use these numbered attribute names to store extracted equation tags for downstream use (e.g., ams-html math panel)
  • [deprecated] data-ams-tags
    • this attribute contained a stringified array of equation tag strings
    • while initially generated by texml-to-html, ams-eqn-store overwrote them with rendered output if TeX was present in the strings.

role values

The following ARIA-DPUB role attribute values are used:

  • doc-preface
  • doc-bibliography
  • doc-appendix
  • doc-dedication
  • doc-noteref
  • doc-biblioref
  • doc-footnote
  • doc-chapter
  • doc-abstract
  • doc-toc
  • doc-footnote

texml XML to data-* mappings

The following list gives the reverse point of view, going from texml XML to the resulting data-* attributes.

  • book
    • book-id[@book-id-type = 'publ_key'] => data-ams-doc="series"
    • book-meta => data-ams-doc="titlepage"
      • contains JSON blob with book-meta, collection-meta; cf. metadata section in this document
    • book-back//ref-list => data-ams-doc-level="1"
    • book//sec/alt-title => data-ams-doc-alttitle
  • article => data-ams-doc="article" [this is a somewhat messy part as we pick and choose, sort of reversely; much of it repeated on copyright page without data attributes]
    • front => data-ams-doc="frontmatter" (some info in head element)
      • contains JSON blob with (most of) journal-meta, article-meta; cf. metadata section in this document.
      • the following descendants create additional HTML (since they may contain tex-math):
        • article-meta>title-group => passthrough
        • notes => data-ams-doc=notes with data-ams-content-type (for @notes-type)
        • abstract => via role
        • kwd-group elements => data-ams-doc with @vocab or "keywords"
        • funding-group => data-ams-doc=funding-group
  • styled-content => data-ams-style="{@style-type}"
    • roman => data-ams-style="roman"
      • or \textrm{...} (if inside text)
    • sc => data-ams-style="sc"
      • or $\mathsc{...}$ (if inside text)
    • monospace => data-ams-style="monospace"
      • or \texttt{...} (if inside text)
    • underline => data-ams-style="underline"
  • disp-quote => data-ams-style="{@specific-use}"
  • xref => data-ams-ref="{@ref-type}"
    • xref[not(@rid)] => data-ams-ref="notrid" [but note on ref-type=fn, bibr]
    • or \xhref[@ref-type]{#@rid}{...} (if inside disp/inline-formula)
  • p//p => data-ams-doc="paragraph"
  • fn/label => data-ams-doc="label"
  • statement/secheading/title | statement/secheading/label => data-ams-doc="secheading"
  • mixed-citation => data-ams-doc="biblioentry" [formerly role="doc-biblioentry" (deprecated)]
  • sec | ack | front-matter-part | front-matter/dedication => data-ams-doc="{@specific-use}"
    • [e.g., sec/title | app/title | sec/label | app/label | front-matter-part/title]
      • subtitle => data-ams-doc="subtitle"
      • sec-meta => data-ams-doc="sec-meta" + "data-ams-contributors" (json blob, cf. metadata section in this document) + "data-ams-byline" (pre-generated byline)
      • label => data-ams-doc="label"
      • title => data-ams-doc="title"
  • abstract => role="doc-abstract"
  • statement => data-ams-doc="statement"
  • graphic | inline-graphic => data-ams-doc="{name()}" data-ams-style="{@specific-use}"
    • graphic | inline-graphic @width => data-ams-width
    • graphic | inline-graphic @height => data-ams-height
  • inline-formula => data-ams-doc="math inline" OR $...$
    • within disp/inline-formula, inline-formula may appear inside text. In that case, we have to create TeX strings for MathJax to process, wrapped in $
  • disp-formula => data-ams-doc="math block"
  • disp-formula-group => data-ams-doc="statement" data-ams-content-type="disp-formula-group"
  • raw-citation => data-ams-doc="amsref"
  • string-name => data-ams-doc="stringname"
  • verse-group => data-ams-doc="verse-group"
  • boxed-text => div@data-ams-style="boxed"
  • notes => section with data-ams-doc="notes"
    • @notes-type => @data-ams-content-type (and role=dedication for dedications)
    • use cases: dedication (articles), article and section notes (NOTI only), drm notice & epub note (books)
  • toc-entry@specific-use => data-ams-ref
  • attributes
    • @disp-level => data-ams-doc-level [data-ams-doc-level is also added to some elements that lack disp-level]
    • @content-type => data-ams-content-type
    • @style => data-ams-style [e.g., statement, sec]
    • @specific-use => data-ams-specific-use [sometimes mapped to other attributes, e.g., style]
    • @has-qed-box => data-ams-qed-box
    • @position => data-ams-position
    • @text-color, @background-color, @border-color => data-ams-style-color (as combined CSS declarations)
      • currently appears on: boxed-text, styled-content
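The attribute mappings at the end of this list can be sketched as a small lookup. The helper name mapAttributes, the plain-object input, and the exact CSS property chosen for each colour attribute are assumptions made for illustration; the real output format of data-ams-style-color may differ.

```javascript
// Left-hand names and data-* targets come from the list above.
const ATTRIBUTE_MAP = new Map([
  ["disp-level", "data-ams-doc-level"],
  ["content-type", "data-ams-content-type"],
  ["style", "data-ams-style"],
  ["specific-use", "data-ams-specific-use"],
  ["has-qed-box", "data-ams-qed-box"],
  ["position", "data-ams-position"],
]);

// Assumption: which CSS property each colour attribute turns into.
const COLOR_MAP = new Map([
  ["text-color", "color"],
  ["background-color", "background-color"],
  ["border-color", "border-color"],
]);

// Hypothetical helper: map XML attributes (plain object) to data-* attributes.
function mapAttributes(xmlAttributes) {
  const result = {};
  const declarations = [];
  for (const [name, value] of Object.entries(xmlAttributes)) {
    if (ATTRIBUTE_MAP.has(name)) {
      result[ATTRIBUTE_MAP.get(name)] = value;
    } else if (COLOR_MAP.has(name)) {
      declarations.push(`${COLOR_MAP.get(name)}: ${value};`);
    }
    // Anything else is discarded, following the allow-list strategy.
  }
  if (declarations.length > 0) {
    // Colour attributes are combined into one string of CSS declarations.
    result["data-ams-style-color"] = declarations.join(" ");
  }
  return result;
}

const mapped = mapAttributes({
  "disp-level": "2",
  "text-color": "red",
  "border-color": "blue",
  foo: "bar",
});
console.log(mapped);
```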

metadata handling

Publication metadata (both journal/series and article/book metadata) is primarily stored in a JSON blob in a script tag in the frontmatter (for articles) and titlepage (for books) sections respectively.

Section metadata (contributor metadata and pre-generated byline) is stored similarly as a JSON blob inside the data-ams-contributors attribute.

The relevant components in texml-to-html (i.e., article-metadata-json.js, book-meta-json.js, sec-meta.js) should provide a (hopefully easy enough) overview of how the XML metadata is mapped and stored. The snapshots in the test folder should also be helpful, alongside any (example) article's JSON blob.
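For downstream code, reading the section metadata back might look roughly like the sketch below. The helper name readSectionContributors is hypothetical, and the JSON shape in the stand-in element is invented for the example; consult sec-meta.js and the test snapshots for the actual structure.

```javascript
// Hypothetical helper: works with anything that exposes getAttribute(name),
// e.g. a DOM element taken from the converted document.
function readSectionContributors(element) {
  const raw = element.getAttribute("data-ams-contributors");
  if (raw === null || raw === undefined) return null;
  try {
    return JSON.parse(raw);
  } catch {
    return null; // malformed JSON is treated as "no metadata" in this sketch
  }
}

// Stand-in for a real element; the JSON shape here is invented.
const fakeSecMeta = {
  attributes: {
    "data-ams-contributors": '{"authors":[{"name":"A. Author"}]}',
  },
  getAttribute(name) {
    return this.attributes[name] ?? null;
  },
};

const contributors = readSectionContributors(fakeSecMeta);
console.log(contributors.authors[0].name);
```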

The following are commonly found metadata items:

For journal articles:

  • <article-meta>
    • <contrib-group> etc (cf. "contributors" below)
    • <self-uri>
    • <title-group> etc
    • <pub-date> etc
    • <notes>
    • <kwd-group>
      • MSC (using <compound-kwd> etc)
      • article keywords (using <kwd>)
    • <related-article> etc (correction/erratum forward/backward)
    • <custom-meta-group> (communicated by, NOTI categories, NOTI titlepic)
    • <funding-group> / <funding-statement>
    • <permissions>
      • <copyright-statement>
    • <history> etc
    • <article-id> etc
    • <abstract> etc
    • <volume>, <issue>
  • <journal-meta>
    • <journal-id>
    • <journal-title-group> etc
    • <issn> etc
    • <publisher>
    • <self-uri> etc

For books:

  • <collection-meta>
    • <publisher> etc
    • <title-group>
    • <volume-in-collection>
      • <volume-number>
    • <custom-meta-group> (for subseries)
  • <book-meta>
    • <book-id> etc
    • <book-title-group> etc
    • <book-volume-number>
    • <publisher>
    • <contrib-group> etc (cf. "contributors" below)
    • <self-uri>
    • <title-group> etc
    • <pub-date> etc
    • <notes>
    • <issn>
    • <isbn>
    • <permissions>
  • <front-matter>
    • <toc> etc
    • <notes>
    • <preface>
    • <front-matter-part>

math mode

Note. There is some overlap with other sections of this document. Ensure that updates are consistent across the document.

For math mode, texml creates MathJax-optimized TeX strings that may contain XML markup; for content not supported by MathJax it falls back to SVG creation. This mix requires extra processing.

xref

Math mode output may contain xref elements. Each one is turned into something like \xhref[@ref-type]{#@rid}{...}; the custom xhref MathJax macro works in both (MathJax's) text and math mode.
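A minimal sketch of building such a string, assuming the xref's content has already been converted to TeX (the helper name xrefToTex is made up for this example):

```javascript
// Hypothetical helper: build the TeX string for an xref inside math mode.
// refType and rid correspond to the xref's @ref-type and @rid attributes.
function xrefToTex(refType, rid, content) {
  return `\\xhref[${refType}]{#${rid}}{${content}}`;
}

const xrefTex = xrefToTex("disp-formula", "eq1", "(1)");
console.log(xrefTex);
```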

text inside math mode

In the case where math mode contains text mode, texml creates text elements possibly containing text XML markup. We turn this into MathJax-compatible TeX strings.

  • text only appears inside tex-math; it is converted to \text{...}
  • italic creates \textit{}
  • bold creates \textbf{}
  • roman creates \textrm{...}
  • sc creates $\mathsc{...}$
  • monospace creates \texttt{...}
  • ext-link creates \href{}{} (which works in both text and math mode)
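The conversions in this list could be sketched as a small recursive function. The node shape ({ name, children }), the href property on ext-link nodes, and the function name textToTex are assumptions for illustration; escaping of special TeX characters is not handled here.

```javascript
// Wrappers follow the list above; keys are texml element names.
const TEXT_WRAPPERS = new Map([
  ["italic", (tex) => `\\textit{${tex}}`],
  ["bold", (tex) => `\\textbf{${tex}}`],
  ["roman", (tex) => `\\textrm{${tex}}`],
  ["sc", (tex) => `$\\mathsc{${tex}}$`],
  ["monospace", (tex) => `\\texttt{${tex}}`],
]);

// Hypothetical helper: turn a text element (inside tex-math) into a TeX string.
function textToTex(node) {
  if (typeof node === "string") return node;
  const inner = (node.children ?? []).map(textToTex).join("");
  if (node.name === "text") return `\\text{${inner}}`;
  // Assumption: the link target is available as node.href.
  if (node.name === "ext-link") return `\\href{${node.href}}{${inner}}`;
  const wrap = TEXT_WRAPPERS.get(node.name);
  return wrap ? wrap(inner) : inner; // unknown markup: keep the content only
}

const textTex = textToTex({
  name: "text",
  children: ["if ", { name: "italic", children: ["n"] }, " is even"],
});
console.log(textTex);
```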

nested math mode

In the case where math mode is nested (math mode inside text mode inside math mode), we have to adjust our processing to create MathJax-compatible TeX strings.

  • nested math mode can only appear within text and will only be inline-formula which is (essentially) converted to $...$
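One way to picture the nesting rule is to track whether the converter is currently inside a text element. The sketch below is an illustration under assumed names and node shapes (mathToTex, { name, children }), not the library's implementation.

```javascript
// Hypothetical helper: convert math-mode content to a TeX string while
// tracking whether we are currently inside a text element.
function mathToTex(node, insideText = false) {
  if (typeof node === "string") return node;
  const convertChildren = (flag) =>
    (node.children ?? []).map((child) => mathToTex(child, flag)).join("");
  if (node.name === "text") {
    return "\\text{" + convertChildren(true) + "}";
  }
  if (node.name === "inline-formula") {
    // Children of a formula are back in math mode.
    const inner = convertChildren(false);
    // Only a formula nested inside text is wrapped in $...$.
    return insideText ? "$" + inner + "$" : inner;
  }
  return convertChildren(insideText);
}

const nestedTex = mathToTex({
  name: "tex-math",
  children: [
    "x = 1 ",
    {
      name: "text",
      children: ["whenever ", { name: "inline-formula", children: ["y > 0"] }],
    },
  ],
});
console.log(nestedTex);
```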

algorithm layout

Note. Since the markup and attributes are heavily scoped, we do not reproduce the attributes in other sections.

texml creates pseudo-namespaced elements for algorithm layout (e.g., from the algorithmicx package).

We convert the markup to HTML custom elements with attributes. Further processing happens downstream to enable adequate styling.

  • alg:algorithm => alg-algorithm
  • alg:line => alg-line
    • @lineno => alg-lineno (preceding sibling of alg-line)
    • data-ams-alg-spanslineno (if first child is alg:require or alg:ensure)
  • alg:block => alg-block
    • data-ams-alg-blocklevel (calculated from nesting)
  • alg:statement, alg:require, alg:ensure, alg:globals => alg-statement
  • alg:comment => alg-comment
  • pass through:
    • alg:body
    • alg:outputs
    • alg:inputs
    • alg:condition
    • alg:if
    • alg:elsif
    • alg:else
    • alg:for
    • alg:forall
    • alg:while
    • alg:repeat
    • alg:until
    • alg:loop
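The renaming above could be sketched as follows, again on plain objects rather than DOM nodes. Several points are assumptions made for this example: "pass through" is read as "emit the children in place of the element", the block level counts enclosing alg:block ancestors starting at 0, and data-ams-alg-spanslineno is left out. The function name convertAlg is hypothetical.

```javascript
// Renamed elements, following the list above.
const RENAMED = new Map([
  ["alg:algorithm", "alg-algorithm"],
  ["alg:line", "alg-line"],
  ["alg:block", "alg-block"],
  ["alg:statement", "alg-statement"],
  ["alg:require", "alg-statement"],
  ["alg:ensure", "alg-statement"],
  ["alg:globals", "alg-statement"],
  ["alg:comment", "alg-comment"],
]);

// Returns an array because one input node may yield zero, one, or several nodes.
function convertAlg(node, blockDepth = 0) {
  if (typeof node === "string") return [node];
  const isBlock = node.name === "alg:block";
  const children = (node.children ?? []).flatMap((child) =>
    convertAlg(child, isBlock ? blockDepth + 1 : blockDepth)
  );
  const newName = RENAMED.get(node.name);
  // Assumption: pass-through elements are unwrapped, keeping only their children.
  if (newName === undefined) return children;
  const element = { name: newName, attributes: {}, children };
  if (isBlock) {
    // Assumption: level = number of enclosing alg:block ancestors (outermost is 0).
    element.attributes["data-ams-alg-blocklevel"] = String(blockDepth);
  }
  const lineno = node.attributes?.lineno;
  if (node.name === "alg:line" && lineno !== undefined) {
    // @lineno becomes an alg-lineno element preceding the alg-line.
    return [{ name: "alg-lineno", attributes: {}, children: [lineno] }, element];
  }
  return [element];
}

const algOutput = convertAlg({
  name: "alg:algorithm",
  children: [
    {
      name: "alg:line",
      attributes: { lineno: "1" },
      children: [
        { name: "alg:if", children: [{ name: "alg:statement", children: ["x"] }] },
      ],
    },
  ],
});
console.log(JSON.stringify(algOutput, null, 2));
```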