curtiz-japanese-nlp
v2.0.1
Annotate Curtiz2 Markdown files with Japanese natural language parsing
curtiz-japanese-nlp — WIP ☣️☢️🧪🧫🧬⚗️☢️☣️
N.B. All references to “Curtiz” are to version 2 of the Curtiz format (using @ symbols), and not version 1 (using ◊ lozenges).
Curtiz version 2 soft-specification
The Curtiz Markdown format for defining Japanese flashcards treats certain Markdown headers as flashcards, e.g., the following header:
@ 私 @ わたし
which is ### @ 私 @ わたし in the original Markdown. That is, a flashcard header has one or more # symbols, whitespace, an @ symbol, whitespace, and then arbitrary text, separated by one or more @ separators. (@ was chosen because it is easy to type on mobile and real keyboards, in Japanese and English.) The first text is treated as the prompt: that’s what the flashcard app will show. Any text after the prompt is taken as an acceptable answer.
So the following header accepts three answers for the same prompt:
@ 私 @ わたし @ わたくし @ あたし
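The header rule above can be sketched in a few lines of JavaScript. This is only an illustration of the format as described here: parseAtHeader is a hypothetical helper, not a function exported by this package, and it assumes that fields are separated by an @ with whitespace on both sides.

```javascript
// Hypothetical helper (not part of curtiz-japanese-nlp): splits a Curtiz
// at-header into its prompt and its accepted answers.
function parseAtHeader(line) {
  // One or more '#', whitespace, '@', whitespace, then the remaining text.
  const match = line.match(/^#+\s+@\s+(.+)$/);
  if (!match) { return null; } // not a flashcard header
  // Assumption: fields are separated by '@' with whitespace on both sides.
  const [prompt, ...answers] = match[1].split(/\s+@\s+/).map(s => s.trim());
  return {prompt, answers};
}

console.log(parseAtHeader('### @ 私 @ わたし @ わたくし @ あたし'));
```

The first field becomes the prompt and every later field an accepted answer, so the header above yields one prompt and three answers.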
Next, any bullets immediately following the at-header that are themselves followed by @ are treated as Curtiz-specific metadata.
Example:
@ 僕の花の色 @ ぼくのはなのいろ
- @fill 僕[の]花
- @fill 花[の]色
- @ 僕 @ ぼく @pos pronoun
- @ 花 @ はな @pos noun-common-general
- @ 色 @ いろ @pos noun-common-general
This example demonstrates both sub-quizzes that are currently supported: @fill allows for a fill-in-the-blank (perhaps where the prompt is shown, minus the text to be filled in), and @ indicates a flashcard just like the @-headers: prompt @ response. These are amenable to plain flashcards on their own as well as fill-in-the-blank in the sentence. If the sub-prompt (in this bullet) cannot be found or uniquely determined in the header's prompt, then an @omit adverb can optionally be used to indicate the portion of the header prompt to be hidden. The optional @pos adverb contains the part of speech (as determined by MeCab), and facilitates disambiguation of flashcards.
Both these optional adverbs are demonstrated below.
@ このおはなしを話す @ このおはなしをはなす
- @fill を
- @ 話 @ はなし @pos noun-common-verbal_suru @omit はなし
- @ 話す @ はなす @pos verb-general
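To make the bullet syntax concrete, here is a minimal sketch of how one metadata bullet could be broken into its parts. parseBullet and the shape of its result are hypothetical, chosen for illustration, and are not this package's API. The sketch assumes every field starts with an @ that follows whitespace: a bare @ introduces the prompt or an answer, while @fill, @pos and @omit carry a keyword.

```javascript
// Hypothetical helper (not part of curtiz-japanese-nlp): breaks one
// metadata bullet into prompt, answers, and keyword fields.
function parseBullet(line) {
  const match = line.trim().match(/^-\s+(@.*)$/);
  if (!match) { return null; } // not a Curtiz metadata bullet
  const result = {prompt: null, answers: [], adverbs: {}};
  // Split before every '@' that follows whitespace, keeping the '@'.
  for (const field of match[1].split(/\s+(?=@)/)) {
    const [, keyword, value] = field.match(/^@(\S*)\s*(.*)$/);
    if (keyword !== '') {
      result.adverbs[keyword] = value; // e.g. fill, pos, omit
    } else if (result.prompt === null) {
      result.prompt = value; // first bare '@' field is the prompt
    } else {
      result.answers.push(value); // later bare '@' fields are answers
    }
  }
  return result;
}

console.log(parseBullet('- @ 話 @ はなし @pos noun-common-verbal_suru @omit はなし'));
console.log(parseBullet('- @fill を'));
```

For simplicity an @fill bullet is reported as a keyword field with no prompt; a real parser would likely model the two sub-quiz types separately.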
This module's features
This module uses MeCab with the UniDic dictionary, and the J.DepP bunsetsu chunker, to add readings, @fill-in-the-blank quizzes, and @ flashcards into a Curtiz Markdown file.
Make sure you have these three applications installed before attempting to use this!
It will add a reading to the header if none exists.
It will add sub-quizzes (@fill and @) if there is a special bullet under the header: - @pleaseParse.
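As a small illustration of the opt-in rule, the check below tests whether a header block asks for sub-quizzes. requestsSubQuizzes is a hypothetical name, and the sketch assumes the bullet contains nothing but @pleaseParse; how this module actually detects the bullet is not specified here.

```javascript
// Hypothetical helper (not part of curtiz-japanese-nlp): true when a block
// (header line first, then its body lines) contains a '- @pleaseParse' bullet.
function requestsSubQuizzes(block) {
  // Skip the header itself and look for the special bullet in the body.
  return block.slice(1).some(line => /^\s*-\s+@pleaseParse\s*$/.test(line));
}

console.log(requestsSubQuizzes(['## @ 僕の花の色', '- @pleaseParse']));
console.log(requestsSubQuizzes(['## @ 私 @ わたし', '- @fill を']));
```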
Usage
This package provides:
- a command-line utility that will consume an input file or standard input, and spit out the Markdown file annotated with readings and sub-quizzes; and
- a JavaScript library to do this programmatically.
Command-line utility
The command-line utility can be invoked on a file or can consume standard input. Make sure you have Node.js installed, then in your terminal (Terminal.app in macOS, Command Prompt in Windows, xterm in Linux, etc.), run either of the following:
$ npx curtiz-japanese-nlp README.md
and replace README.md with the path to your Markdown file, or
$ cat README.md | npx curtiz-japanese-nlp
Library API
Install this package into your JavaScript/TypeScript package via
$ npm install curtiz-japanese-nlp
Then in your JavaScript code, you may:
const curtiz = require('curtiz-japanese-nlp');
In TypeScript or with ES modules (ES2015 and later), you may:
import * as curtiz from 'curtiz-japanese-nlp';
The following functions will then be available under the curtiz namespace.
async function parseHeaderBlock(block: string[]): Promise<string[]>
A block is an array of strings, one line per element, with the first line assumed to contain a Markdown header (something starting with one or more # hash symbols).
parseHeaderBlock returns a promise of an array of strings, which will contain annotated Markdown.
This is the core function provided by this library.
The remaining functions below are helper utility functions.
function splitAtHeaders(text: string): string[][]
This is intended to split the contents of a file (a single string text) into an array of blocks (each block being an array of strings itself, each string being a line of Markdown).
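To show the data shape involved, here is a minimal sketch of a header splitter. splitAtHeadersSketch is a hypothetical stand-in, not the library's implementation. It assumes that any line starting with one or more # followed by whitespace begins a new block, and that text before the first header forms a block of its own; the real function may handle these cases differently.

```javascript
// Hypothetical stand-in for illustration; not the library's splitAtHeaders.
function splitAtHeadersSketch(text) {
  const blocks = [];
  for (const line of text.split('\n')) {
    // Start a new block at each header line, and also for any text that
    // precedes the first header (an assumption of this sketch).
    if (/^#+\s/.test(line) || blocks.length === 0) { blocks.push([]); }
    blocks[blocks.length - 1].push(line);
  }
  return blocks;
}

const sample = '# Notes\nsome text\n## @ 私 @ わたし\n- @pleaseParse';
console.log(splitAtHeadersSketch(sample));
```

Each inner array starts with its header line, which matches the block shape that parseHeaderBlock is described as expecting.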
async function parseAllHeaderBlocks(blocks: string[][], concurrentLimit: number = 8)
This is intended to annotate an array of blocks (blocks), each block being an array of strings and each string being a line of Markdown.
The concurrentLimit argument allows you to limit the number of concurrent system calls to mecab and jdepp that are being made.
A promise for an array of blocks is returned.
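The idea behind a concurrency limit can be sketched with plain promises. mapWithLimit below is a hypothetical helper written for this illustration and says nothing about how parseAllHeaderBlocks is implemented. It starts at most limit workers, and each worker claims the next unprocessed item as soon as it finishes its current one.

```javascript
// Hypothetical helper (not part of curtiz-japanese-nlp): maps an async
// worker over items with at most `limit` calls in flight at once.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item
  async function runner() {
    while (next < items.length) {
      const i = next++; // claimed synchronously, so runners never share an item
      results[i] = await worker(items[i], i);
    }
  }
  // Start up to `limit` runners and wait for all of them to drain the queue.
  await Promise.all(Array.from({length: Math.min(limit, items.length)}, runner));
  return results;
}

// Demo: a slow fake task stands in for an external call such as mecab.
let active = 0;
let peak = 0;
const demo = mapWithLimit([1, 2, 3, 4, 5], 2, async x => {
  active++;
  peak = Math.max(peak, active);
  await new Promise(resolve => setTimeout(resolve, 10));
  active--;
  return x * 2;
});
demo.then(doubled => console.log(doubled, 'peak concurrency:', peak));
```

Results keep the order of the input even though tasks may finish out of order, which matters when the annotated blocks are later joined back into one file.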
You can use both these helper functions along with the primary function as follows, assuming you are inside an async function:
const fs = require('fs');
let annotated = await curtiz.parseAllHeaderBlocks(curtiz.splitAtHeaders(fs.readFileSync('README.md', 'utf8')));
console.log(annotated.map(s => s.join('\n')).join('\n'));
The first line loads Node.js's built-in fs module. The second line reads the contents of README.md, splits them into blocks at Markdown header boundaries, and then annotates them all.
The third line logs the entire annotated Markdown.