usfm-grammar
v3.0.0-alpha.10
Uses the tree-sitter-usfm3 parser to convert USFM files to other formats such as USJ, USX, and CSV, and converts them back to USFM
USFM Grammar
Description
USFM Grammar is a JavaScript library for parsing USFM (Unified Standard Format Markers) and converting it to and from USJ (Unified Scripture JSON). The library parses USFM strings into a syntax tree, converts that tree into a JSON structure (USJ), and converts USJ back to USFM.
Installation
You can install USFM Grammar via npm:
npm install usfm-grammar
Usage
Importing, parsing USFM, checking errors
const {USFMParser} = require('usfm-grammar');
const USFM = '\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning..\\v 2 some more text';
const usfmParser = new USFMParser(USFM);
console.log(usfmParser.errors); // parse errors found in the input, if any
USJ
Here's how you can use USFM Grammar in your JavaScript projects to work with the JSON format, USJ:
const {USFMParser} = require('usfm-grammar');
const USFM = '\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning..\\v 2 some more text';
const usfmParser = new USFMParser(USFM);
const USJ = usfmParser.toUSJ(); // USFM to USJ
console.log(JSON.stringify(USJ, null, 2));
// Constructor arguments are positional: null for the USFM string, then the USJ object.
const usfmParser2 = new USFMParser(null, USJ); // USJ to USFM
const usfmGen = usfmParser2.usfm;
console.log(usfmGen);
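Because USJ is plain JSON, the output can be traversed with ordinary JavaScript. The sketch below works on a hand-written object shaped like the sample USJ used in the validation section of this README, not on real parser output, and collectVerseNumbers is a hypothetical helper that is not part of usfm-grammar.

```javascript
// Hand-written stand-in for the object returned by toUSJ(); real output may carry more fields.
const sampleUSJ = {
  type: 'USJ',
  version: '0.3.0',
  content: [
    { type: 'book', marker: 'id', code: 'GEN', content: [] },
    { type: 'chapter', marker: 'c', number: '1', sid: 'GEN 1' },
    { type: 'para', marker: 'p', content: [
      { type: 'verse', marker: 'v', number: '1' },
      'In the beginning..',
      { type: 'verse', marker: 'v', number: '2' },
      'some more text',
    ] },
  ],
};

// Hypothetical helper: depth-first walk that collects the number of every verse node.
function collectVerseNumbers(node, found = []) {
  if (typeof node === 'string') return found; // plain text has no children
  if (node.type === 'verse') found.push(String(node.number));
  for (const child of node.content || []) collectVerseNumbers(child, found);
  return found;
}

console.log(collectVerseNumbers(sampleUSJ)); // [ '1', '2' ]
```

The same recursive pattern can be adapted to gather text, chapter numbers, or any other node type.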
Working with USJ also gives options such as filtering selected markers to edit the original USFM content. To understand how excludeMarkers, includeMarkers, combineTexts and Filter work, refer to the section on filtering on USJ.
USX
To work with the XML format, USX:
const {USFMParser} = require('usfm-grammar');
const { DOMImplementation, XMLSerializer } = require('xmldom');
const USFM = '\\id GEN\n\\c 1\n\\p\n\\v 1 In the beginning..\\v 2 some more text';
const usfmParser = new USFMParser(USFM);
const usxElem = usfmParser.toUSX(); // USFM to USX
const usxSerializer = new XMLSerializer();
const usx = usxSerializer.serializeToString(usxElem);
console.log(usx);
// Constructor arguments are positional: USFM string, USJ object, then the USX element.
const usfmParser2 = new USFMParser(null, null, usxElem); // USX to USFM
const usfmGen = usfmParser2.usfm;
console.log(usfmGen);
Autofix and Validation
Experimental Validation and Autofix feature for USFM:
const {Validator} = require("usfm-grammar");
const wrongUSFM="\\id GEN\n\\c 1\n\\v 1 test verse"
const checker = new Validator();
const resp = checker.isValidUSFM(wrongUSFM); // true or false
console.log(checker.message) // List of errors if present
const editedUSFM = checker.autoFixUSFM(wrongUSFM);
console.log(checker.message); // Report on autofix attempt
Validation of USJ format:
const {Validator} = require("usfm-grammar");
const simpleUSJ = {
type: 'USJ',
version: '0.3.0',
content: [
{ type: 'book', marker: 'id', code: 'GEN', content: [] },
{ type: 'chapter', marker: 'c', number: '1', sid: 'GEN 1' },
{ type: 'para', marker: 'p', content: [
{type: 'verse', marker: 'v', number: 1 },
"In the beginning..",
{type: 'verse', marker: 'v', number: 2 }
] }
]
}
const checker = new Validator();
console.log(checker.isValidUSJ(simpleUSJ));
console.log(checker.message);
From ESM Project
When using the library in an ES module, if import {USFMParser} from 'usfm-grammar' doesn't work for you, you could try:
import pkg from 'usfm-grammar';
const {USFMParser} = pkg;
...
Filtering on USJ
Filtering on USJ, the JSON output, is a feature that supports data extraction, markup cleaning and similar tasks. The excludeMarkers and includeMarkers arguments of the USFMParser.toUSJ() method make this possible. USFMParser.toList() accepts the same inputs and performs similar operations. CLI versions of these arguments are also available, so the filtering feature can be replicated there.
excludeMarkers
The first input parameter to toUSJ() and toList() of the USFMParser class. Defaults to null. When provided, all markers except those listed will be included in the output.
includeMarkers
The second input parameter to toUSJ() and toList() of the USFMParser class. Defaults to null. When provided, only the markers listed will be included in the output. includeMarkers is applied before excludeMarkers.
combineTexts
The fourth input parameter to toUSJ() and toList() of the USFMParser class. Defaults to true. After filtering out markers like paragraphs and characters, we are left with the texts from within them, provided 'text-in-excluded-parent' is not also excluded. These text snippets may come as separate components in the contents list. When this option is true, consecutive text snippets are concatenated. The concatenation is done in a punctuation- and space-aware manner. Users who need more control over space handling, or who for any other reason prefer the text snippets as separate components in the output, can set this to false.
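To picture what combineTexts does, the sketch below merges runs of adjacent strings in a content array. It is a simplified stand-in and not the library's implementation: combineAdjacentTexts is a hypothetical helper, and the single spacing rule it applies (no added space before closing punctuation) is an assumption made for illustration.

```javascript
// Hypothetical helper: merge consecutive string items, leave objects (markers) untouched.
function combineAdjacentTexts(content) {
  const out = [];
  for (const item of content) {
    const last = out[out.length - 1];
    if (typeof item === 'string' && typeof last === 'string') {
      // Assumed rule: add one space between snippets unless one is already there
      // or the next snippet starts with whitespace or closing punctuation.
      const needsSpace = !/\s$/.test(last) && !/^[\s.,;:!?)]/.test(item);
      out[out.length - 1] = last + (needsSpace ? ' ' : '') + item;
    } else {
      out.push(item);
    }
  }
  return out;
}

// Content as it might look after paragraph and character markers were filtered out.
const filtered = [
  { type: 'verse', marker: 'v', number: '1' },
  'In the beginning',
  'God created',
  ', the heavens',
  { type: 'verse', marker: 'v', number: '2' },
  'some more text',
];

console.log(combineAdjacentTexts(filtered));
```

Here the three snippets after verse 1 collapse into one string, while the verse objects stay where they were and keep the texts of different verses apart.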
usfm_grammar.Filter
This class provides a set of enums that can be passed as the excludeMarkers and includeMarkers inputs, so that users do not have to list individual markers. The class has the following options:
BOOK_HEADERS : identification and introduction markers
TITLES : section headings and associated markers
COMMENTS : comment markers like \rem
PARAGRAPHS : paragraph markers like \p, poetry markers, list and table markers
CHARACTERS : all character-level markups like \em, \w, \wj etc. and their nested versions with +
NOTES : footnote, cross-reference and their content markers
STUDY_BIBLE : \esb and \cat
BCV : \id, \c and \v
TEXT : 'text-in-excluded-parent'
To inspect which markers each option contains, print it, for example console.log(Filter.TITLES). The options can be used individually or combined to get the desired filtering of markers and data. The + operator does not concatenate arrays in JavaScript, so the examples below combine options with spread syntax, which assumes each option is an array of marker names:
let output = usfmParser.toUSJ(null, Filter.BCV);
output = usfmParser.toUSJ(null, [...Filter.BCV, ...Filter.TEXT]);
output = usfmParser.toUSJ([...Filter.PARAGRAPHS, ...Filter.CHARACTERS]);
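One JavaScript detail matters when combining groups: the + operator does not join arrays, it converts them to strings first. The sketch below assumes each Filter option is an array of marker names. FilterLike is a made-up stand-in whose values follow the option descriptions in this section; the real lists come from the library and may differ.

```javascript
// Made-up stand-in for Filter; the real marker lists come from usfm-grammar.
const FilterLike = {
  BCV: ['id', 'c', 'v'],
  TEXT: ['text-in-excluded-parent'],
};

// `+` coerces both arrays to strings and concatenates the strings.
const wrong = FilterLike.BCV + FilterLike.TEXT;
console.log(typeof wrong); // string

// Spread syntax builds a new array holding the markers of both groups.
const combined = [...FilterLike.BCV, ...FilterLike.TEXT];
console.log(combined); // [ 'id', 'c', 'v', 'text-in-excluded-parent' ]
```

FilterLike.BCV.concat(FilterLike.TEXT) gives the same array if spread syntax is not preferred.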
Inner contents of excluded markers
For markers like \p, \q etc., excluding them only removes them from the hierarchy and retains their inner contents, such as \v and text. But for certain other markers like \f, \x, \esb etc., excluding them also excludes their inner contents. The following is the set of all markers whose inner contents are discarded if they are listed in excludeMarkers or not included in includeMarkers: BOOK_HEADERS, TITLES, COMMENTS, NOTES, STUDY_BIBLE
Warning: Generally, it is recommended NOT to use excludeMarkers and includeMarkers together, as this could lead to unexpected behaviour and data loss. For instance, if includeMarkers has \fk and excludeMarkers has \f, the output will not contain \fk, as all inner contents of \f will be discarded.
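The two behaviours described in this section, lifting children out of an excluded paragraph versus dropping a note together with its contents, can be sketched on a plain USJ-like content array. This is an illustration only: excludeFromContent and DISCARD_WITH_CONTENTS are hypothetical names, the marker set is a small sample, and the library's real filtering may differ in detail.

```javascript
// Sample of markers whose inner contents go away with them (assumed subset of NOTES, STUDY_BIBLE).
const DISCARD_WITH_CONTENTS = new Set(['f', 'x', 'esb']);

// Hypothetical helper: apply an exclude list to a content array, recursively.
function excludeFromContent(content, excluded) {
  const out = [];
  for (const item of content) {
    if (typeof item === 'string') {
      out.push(item); // text is kept as is in this sketch
    } else if (!excluded.includes(item.marker)) {
      // Kept marker: copy it and filter its children, if any.
      const kept = { ...item };
      if (item.content) kept.content = excludeFromContent(item.content, excluded);
      out.push(kept);
    } else if (!DISCARD_WITH_CONTENTS.has(item.marker)) {
      // Excluded container such as \p: drop the wrapper, lift the children.
      out.push(...excludeFromContent(item.content || [], excluded));
    }
    // Excluded note-like marker such as \f: dropped along with everything inside it.
  }
  return out;
}

const content = [
  { type: 'para', marker: 'p', content: [
    { type: 'verse', marker: 'v', number: '1' },
    'In the beginning',
    { type: 'note', marker: 'f', content: [
      { type: 'char', marker: 'fk', content: ['beginning'] },
    ] },
  ] },
];

console.log(JSON.stringify(excludeFromContent(content, ['p', 'f'])));
// [{"type":"verse","marker":"v","number":"1"},"In the beginning"]
```

The \fk inside the footnote disappears even though it was never named in the exclude list, which is the kind of data loss the warning describes.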
Contributing
Contributions are welcome! If you find any issues or have suggestions for improvements, feel free to open an issue or create a pull request on GitHub.
License
This project is licensed under the MIT License - see the LICENSE file for details.