
@unified-myst/core-parse

The core entry point for MyST parsing in unified.

Quickstart

import { Processor } from '@unified-myst/core-parse'

const parser = new Processor()
const result = parser.toAst('Hello world!')
console.log(JSON.stringify(result.ast, null, 2))

yields:

{
  "type": "root",
  "children": [
    {
      "type": "paragraph",
      "children": [
        {
          "type": "text",
          "value": "Hello world!",
          "position": {
            "start": {
              "line": 1,
              "column": 1,
              "offset": 0
            },
            "end": {
              "line": 1,
              "column": 13,
              "offset": 12
            }
          }
        }
      ],
      "position": {
        "start": {
          "line": 1,
          "column": 1,
          "offset": 0
        },
        "end": {
          "line": 1,
          "column": 13,
          "offset": 12
        }
      }
    }
  ],
  "position": {
    "start": {
      "line": 1,
      "column": 1,
      "offset": 0
    },
    "end": {
      "line": 1,
      "column": 13,
      "offset": 12
    }
  }
}

Parsing process

The parsing process is as follows:

  • Run all beforeConfig event hooks, in priority order.

    • beforeConfig processors are operations which modify the config, before it is validated.
  • Run all beforeRead event hooks, in priority order.

    • beforeRead processors are operations which initialise global state and can also modify the source text.
  • Parse the source text into micromark tokens.

    • These can be loosely regarded as a Concrete Syntax Tree (CST), directly mapping to the original source text.
    • The tokenizer is based on the CommonMark specification, plus the core syntax extensions.
    • At this point roles and directives are single tokens, and their content is not yet processed.
  • Compile the tokens into an MDAST syntax tree.

  • Walk the syntax tree and process all roles and directives into additional syntax nodes.

  • Run all afterRead event hooks, in priority order.

    • afterRead processors are operations which modify the syntax tree.
  • Run all afterTransforms event hooks, in priority order.

    • afterTransforms processors are operations which extract information from the syntax tree into the global state.
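For a concrete feel of the event hooks, here is a minimal sketch of registering an afterRead hook. The hook shape follows the extension example in the next section; the extension name hooksDemo and the hook name logChildren are hypothetical.

import { Processor } from '@unified-myst/core-parse'

// afterRead hooks receive the syntax tree (and the config), and may modify it.
function logChildren(ast, config) {
    console.log('root has', ast.children.length, 'children')
}

const processor = new Processor().use({
    name: 'hooksDemo',
    hooks: {
        afterRead: { logChildren: { priority: 100, processor: logChildren } },
    },
})
processor.toAst('Hello world!')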

Extension mechanism

Everything is an extension!

import { u } from 'unist-builder'
import { toc } from 'mdast-util-toc'
import { Processor, RoleProcessor, DirectiveProcessor } from '@unified-myst/core-parse'

// A role processor converts a single role token into MDAST nodes.
class RoleAbbr extends RoleProcessor {
    run() {
        const abbr = u('abbr', [])
        abbr.children = this.nestedInlineParse(this.node.content)
        return [abbr]
    }
}

// A directive processor converts a single directive token into MDAST nodes.
class DirectiveNote extends DirectiveProcessor {
    static has_content = true
    run() {
        const note = u('note', [])
        note.children = this.nestedParse(this.node.body)
        return [note]
    }
}

// An afterRead hook, which can modify the syntax tree after parsing.
function addToc(ast, config) {
    if (config.myExtension?.addtoc) {
        const table = toc(ast)
        ast.children.unshift(table.map)
    }
}

const myExtension = {
  name: 'myExtension',
  roles: { abbr: { processor: RoleAbbr } },
  directives: { note: { processor: DirectiveNote } },
  hooks: { afterRead: { addtoc: { priority: 100, processor: addToc } } },
  config: { addtoc: { default: false, type: 'boolean' } },
}
const parser = new Processor().use(myExtension)
parser.setConfig({ myExtension: { addtoc: true } })
const result = parser.toAst('hallo')

.use calls can be chained to add multiple extensions.
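For example (extensionA and extensionB here are hypothetical extension objects, shaped like myExtension above):

const parser = new Processor().use(extensionA).use(extensionB)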

Configuration

Each extension can supply its own configuration, as a "stub" for the properties key of a JSON Schema.

import { Processor } from '@unified-myst/core-parse'

const processor = new Processor()
processor.use({
    name: 'core',
    config: {
        cname1: { default: '', type: 'string' },
        cname2: { default: [], type: 'array' },
    },
})

From these configuration stubs, the full configuration schema is generated.

console.log(JSON.stringify(processor.getConfigSchema(), null, 2))
// {
//   "type": "object",
//   "properties": {
//     "core": {
//       "type": "object",
//       "properties": {
//         "cname1": {
//           "default": "",
//           "type": "string"
//         },
//         "cname2": {
//           "default": [],
//           "type": "array"
//         }
//       },
//       "additionalProperties": false
//     }
//   },
//   "additionalProperties": false
// }

A configuration can be set using the setConfig method; the getConfig method can then be used to retrieve the full configuration, which merges the set config with any defaults.

processor.setConfig({ core: { cname1: 'test' } })
console.log(JSON.stringify(processor.getConfig(), null, 2))
// {
//   "core": {
//     "cname1": "test",
//     "cname2": []
//   }
// }

Design decisions

The design is intended to closely mirror that of docutils and Sphinx. Their documentation generation and extension mechanism has been developed over many years and has a relatively large community, so the similar API should facilitate porting of existing Sphinx extensions.

It diverges from docutils/Sphinx, though, in a number of key ways, to address some design shortfalls (in my opinion) of that system, as detailed below. It is also focussed on facilitating scientific writing and publishing, as opposed to software documentation.

Firstly, the underlying AST is based on MDAST, rather than docutils nodes. The key improvement of MDAST is that it is JSONable, allowing for serialisation into a language-agnostic format. Together with myst-spec, this allows for a better separation of concerns between AST generation (e.g. parsing from Markdown) and rendering (e.g. outputting HTML). The AST can also be inspected and manipulated by mdast's existing ecosystem of utilities, as in the sketch below.
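A minimal sketch of this, assuming the unist-util-visit package is installed; since the AST is plain, JSON-compatible MDAST, generic unist utilities can walk it directly:

import { Processor } from '@unified-myst/core-parse'
import { visit } from 'unist-util-visit'

// Visit every text node in the parsed tree and print its value.
const result = new Processor().toAst('Hello *world*!')
visit(result.ast, 'text', (node) => {
    console.log(node.value)
})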

Similarly, the configuration is parsed in a JSON format, and extensions can add their own configuration options, each including a JSON Schema "stub" to validate a specific configuration key. In this way, a schema can be auto-generated to validate the entire configuration in a language-agnostic manner. Configuration variables are also namespaced by extension name, for clarity and to avoid key clashes.

Other key improvements, where extensions are first-class citizens of the API:

  • transforms become afterRead event hooks (cf. https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx-core-events)

  • roles and directives are non-global

  • the parser is introspectable: get the config schema, see what roles/directives/transforms are loaded, ...
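As a small illustration of introspection, using only the getConfigSchema method shown above (and reusing the myExtension object from the extension example): the schema's top-level properties are namespaced by extension name, so they reveal which extensions are loaded.

import { Processor } from '@unified-myst/core-parse'

const processor = new Processor().use(myExtension)
// Top-level schema properties are namespaced by extension name,
// so their keys list the loaded extensions.
console.log(Object.keys(processor.getConfigSchema().properties))
// e.g. [ 'myExtension' ]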

TODO

  • Is there any difference between GFM footnotes and Pandoc footnotes (which are also the basis for markdown-it footnotes)?

  • Add logging (and create error nodes)

    • Allow directive/role/hook processors to write to a log object
  • Errors with node-resolve when trying to build the browser bundle

  • Errors with the workspace build of types, because of incorrect ordering (since core-parse depends on other packages)

  • Minimise AST walks:

    • Concept of transforms that are purely data collectors
      • Then they can be run at the same time, rather than performing multiple AST walks
      • Maybe change the signature of afterTransforms, so that it is called on a single walk through the AST (for every node)
  • Disabling extensions, and even specific directives/roles/transforms within an extension.

  • How to handle conversion to output formats?

    • For HTML, essentially we want extensions to be able to supply https://github.com/syntax-tree/mdast-util-to-hast#optionshandlers, and this would likely be similar for other formats.
    • But do we also include mdast-util-to-hast as a dependency here, since this would not be good for package size when it is not used?
    • So then, I guess, we have a package that builds on this to add them.
  • Better propagation of positions for nested parsing

    • Ideally, this requires a declarative directive structure, so that we can parse the argument/options/body directly to CST tokens, and thus capture their positions.
      • Indentation also gets a bit weird in code fences (see https://spec.commonmark.org/0.30/#example-131): currently we indent all body lines by the same amount as the indentation of the opening fence, but this is not strictly correct.
    • For roles, it would be good to get the offset of the start of the content from the start of the role.
      • This becomes problematic, though, if the content spans multiple lines and the role is within another construct, and is thus "indented".
    • Another, bigger change would be to have a separate syntax for "directives/roles that just wrap nested content"; then you could do proper token parsing: