megadoc-corpus
v7.1.0
Published
The corpus is megadoc's model for indexing documents. It is a database (or, a graph) of _abstract_ representations of documentation.
Downloads
26
Readme
megadoc-corpus
The corpus is megadoc's model for indexing documents. It is a database (or, a graph) of abstract representations of documentation.
The corpus has enough semantics to provide a powerful and intuitive implementation for resolving links, among other things like plotting dependency and co-relation graphs between documents.
Megadoc's HTML serializer is entirely powered by the corpus and exposes CorpusAPI an API for interacting with it - so as an upstream developer, learning how to use and utilize the corpus will pay off greatly.
Design
The corpus is formed initially as a tree of nodes, each representing either a document or an entity inside a document, grouped by what are called namespace nodes.
What a document or document entity is is context-specific and, in practice, irrelevant to the corpus - all it cares about is an abstract representation of those documents that allows us to treat the database in a generic manner.
The distinction in the codebase and this documentation between documents or document entities themselves and their corpus representation is made by the use of the word "node". So, when we're talking about a node, it would mean the abstract node in the corpus and not the underlying document. The actual documents and document entities themselves referred to as, simply, a document, a document entity, or an entity for brevity.
The Abstract Document Tree (ADT)
Each node in the corpus must have an id and a type. Types are pre-defined and
may be one of Namespace
, Document
, or DocumentEntity
.
At the root of the corpus, we have a singleton node of type Corpus
that
houses a set of Namespace
nodes. Each plugin that registers with the corpus
has to define its own unique Namespace
node and nest its documents underneath
it.
[<frame>Corpus Model|
[Corpus]
[Corpus] -> [JavaScripts]
[Corpus] -> [Markdown]
[JavaScripts] -> [Core]
[Core] -> [X]
[X] -> [#add]
[JavaScripts] -> [Y]
[Y] -> [@name]
[Y] -> [#speak]
[Markdown] -> [Article A]
]
A Namespace
node is the direct owner of a plugin's Document
nodes. However,
from that point down, the hierarchy is flexible in that a Document
node may
contain either Document
nodes as children (namespacing at the plugin level)
or leaf DocumentEntity
nodes.
This architecture is flexible enough to support a wide array of documentation schemes; most kinds of textual documentation, public API references, library documentation, etc., can be structured within this model.
UIDs
Each node added to the corpus is stamped with a uid
- an identifier that is
guaranteed to be unique among all nodes and can always be used to reference
the exact document, although it's usually not human-friendly. To address that,
we get to indexing...
Indexing
A lot of care has gone into making the algorithm and heuristics behind resolving links to documents provide for an intuitive and natural experience for the author. The effect it strives to achieve is that "It Just Works". For this reason, the implementation is heavily influenced by the node from which a link is being made - the context node.
For every node added to the corpus, a list of indices is built. An index represents a term that can be used to link to a node - but only from a specific context. Some indices are private, a semantic for indicating that the term may only be used by the document itself (i.e. when referencing its entities), entities amongst each other, and friends which we'll touch upon later.
The indices generated for a node are customizable. We cover the process in great detail over in the /doc/dev/using-the-corpus.md#tuning-the-indexer corpus usage guide.
The resolver
The corpus resolver performs a depth-first search in the ADT to resolve the node that is being linked to. The parameters needed for resolving are the term, and the context node.
Constraints
- Document identifiers MUST NOT begin with any of the following characters:
./
(a dot followed by a forward slash)/
(a forward slash)
Private contextual resolving
A link could be formed using a short-hand notation when it points to an entity within the same document (aka, a private link). The index used for resolving such links is kept private in the sense that external documents will not be able to reference the entities using this notation.
For example, within the context of a document X
, linking to #someMethod
should resolve to X#someMethod
if that entity exists, but linking to that
same entity from a different document, Y
, yields nothing.
Resolving by (relative) filepath
Any link that begins with ./
is expected to point to a document that can be
found relative from the current document's filepath:
\[[./path/to/file]]
Resolving by (absolute) filepath
Any link that begins with /
is expected to point to a document that can be
found at that path, relative to the [[Config.assetRoot]]:
\[[/path/to/file]]
Let's look at an example corpus with the following contents:
- corpus
|
| - megadoc-plugin-markdown (namespace id = MD)
| |
| |-- X (doc/X.md)
| |-- Y (doc/Y.md)
| |-- Z (doc/Z.md)
|
| - megadoc-plugin-js (namespace id = JS)
| |
| |-- X (filePath = js/lib/X.js)
| |-- Core.X (filePath = js/lib/core/X.js)
| | |-- @id
| | |-- #add
| |-- Core.Y (filePath = js/lib/core/Y.js)
| | |-- @id
| |-- Z (filePath = js/lib/Z.js)
The following table shows the resolving behavior:
Context Node | Link | Resolved Node ---------------- | ------------ | ----------------- JS/Core.X@id | X | JS/Core.X JS/Core.X@id | JS/X | JS/X JS/Core.X@id | MD/X | MD/X JS/Core.X#add | @id | JS/Core.X@id JS/Core.X#add | Y@id | JS/Core.Y@id JS/Core.X#add | Core.Y@id | JS/Core.Y@id JS/Core.X#add | JS/Core.Y@id | JS/Core.Y@id JS/Core.Y | X | JS/Core.X JS/Core.Y | #add | ! (unknown) JS/Core.Y | X#add | JS/Core.X#add JS/Core.Y | X.js | ! (ambiguous) JS/Core.Y | ./X.js | JS/Core.X JS/Core.Y | ../X.js | JS/X JS/Z | X | JS/X MD/X | X | MD/X MD/X | Core.X | JS/Core.X MD/Y | X | MD/X
Here's a visual example (although it does not fully conform to the table above):
Friend nodes
When resolving an index from a certain context node, we will consider the
indices in a binary fashion: either indices that are private or public. Private
indices are ones that are semantically too specific to be linked to outside the
scope they were defined in; in our example above, JS/Core.X@id
would have
a private index of @id
which would not make much sense once we're outside of
JS/Core.X
.
However, this rule is broken in one case, and is modeled as a friendship between nodes. Nodes that are considered "friends" can peek into each others' private indices but not their entity indices. The reason for this will be explained after we look at the following table:
| -- JS
| |-- X
| |-- Y
| |-- Core
| |-- X <- friends: [Core], can access Core.Y using "Y"
| | |-- @id <- friends: [X, Core], can access Core.X#add using "#add"
| | |-- #add
| |-- Y <- friends: [Core], can access Core.X using "X"
| | |-- @name <- friends: [Y, Core], can NOT access Core.X#add using "#add"!!!
| |-- Z <- friends: [NS1]
It so happens that when you're authoring docs in a context like the Core.X
document above, you would naturally want to link to other related documents by
their "short names", which to the corpus are private indices. For this reason,
Core.X
is considered a friend of Core.Y and may call it by its "short name",
Y
.
In order to support this style of linking, but still protect from invalid
links, we prohibit nodes from accessing their friends' entity indices. So,
in Core.X
, you may not link to @name
which resides in Y
- only using
Y@name
.
ADT Types
def("Corpus", {
fields: {
namespaces: array("Namespace")
}
});
def("Namespace", {
fields: {
id: String,
symbol: or(String, null), // defaults to "/"
corpusContext: or(String, null),
documents: or(array("Document"), null),
parentNode: "Corpus"
}
});
def("Node", {
fields: {
id: String,
href: or(String, null),
title: or(String, null),
summary: or(String, null),
filePath: or(String, null),
properties: or(array("Property"), null)
}
});
def("Document", {
base: "Node",
fields: {
symbol: or(String, null),
parentNode: or("Namespace", "Document"),
documents: or(array("Document"), null),
entities: or(array("DocumentEntity"), null),
}
});
def("DocumentEntity", { // terminal
base: "Node",
fields: {
parentNode: "Document"
}
});
def("Property", {
build: [ 'key', 'value' ],
fields: {
key: String,
value: or(String, Number, Boolean, RegExp, Array, Object, null)
}
});
A sample AST for the example above:
// JS/Core.X@id -> \[[X]] -> JS/Core.X
// JS/Core.X@id -> \[[@id]] -> JS/Core@id
// JS/Core.X@id -> \[[#add]] -> JS/Core#add
{ // DocumentEntity
type: "DocumentEntity",
id: "@id",
uid: "JS/Core.X@id",
indices: { "@id": 1, "X@id": 2, "Core.X@id": 3 },
parentNode: { // Document
type: "Document",
id: "X",
uid: "JS/Core.X",
indices: [ "X", "Core.X" ],
entities: [
...
],
parentNode: { // Document
type: "Document",
id: "Core",
uid: "JS/Core",
symbol: ".",
entities: [
...
],
documents: [
...
],
properties: [
{
key: "namespace",
value: true
}
],
parentNode: { // Namespace
type: "Namespace",
id: "JS",
uid: "JS",
symbol: "/",
corpusContext: "JavaScript Modules",
documents: [
...
],
parentNode: { // Corpus
type: "Corpus",
namespaces: [
...
]
}
}
}
}
};
TODO
- resolved link title should be tuned based on the contextNode's scope - private scopes should not use the FQN-index and so on