# @biblioteksentralen/xml-utils
XML parsing utils based on @xmldom/xmldom and xpath.
Usage example:
```ts
import fs from "node:fs";
import { parseXml } from "@biblioteksentralen/xml-utils";

const data = fs.readFileSync("test-fixtures/marcxchange-v1.xml", "utf-8");

const xml = parseXml(data, {
  namespaces: {
    marc: "info:lc/xmlns/marcxchange-v1",
  },
});

const recordIds = xml
  .elements("//marc:record") // returns XmlElement[]
  .map((record) => record.text("./marc:controlfield[@tag='001']"));

console.log(recordIds);
```
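Since the package is based on `@xmldom/xmldom` and `xpath`, the same query can also be expressed directly against those two packages. This is a minimal sketch for orientation, not the package's actual implementation, and it assumes the `data` string from the example above:

```ts
import { DOMParser } from "@xmldom/xmldom";
import xpath from "xpath";

// Parse the document and bind the marc prefix for XPath queries
const doc = new DOMParser().parseFromString(data, "text/xml");
const select = xpath.useNamespaces({ marc: "info:lc/xmlns/marcxchange-v1" });

// Roughly what the elements()/text() wrappers boil down to
const records = select("//marc:record", doc) as Node[];
const recordIds = records.map((record) =>
  select("string(./marc:controlfield[@tag='001'])", record)
);
```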
A helper method is also included for removing namespaces. It uses fairly simple regular expressions, so use it at your own risk: it should be safe for most documents, but is likely to fail in edge cases (a sketch of the approach, and one such edge case, follows the example below).
```ts
import { parseXml, stripNamespaces } from "@biblioteksentralen/xml-utils";

const xml = parseXml(stripNamespaces(data));

const recordIds = xml
  .elements("//record")
  .map((record) => record.text("./controlfield[@tag='001']"));
```
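To make the caveat concrete, here is a hypothetical regex-based stripper in the same spirit (not the package's actual code), along with the kind of input that trips this approach up:

```ts
// Hypothetical sketch of a regex-based namespace stripper (not the package's actual code)
const stripNs = (xml: string) =>
  xml
    .replace(/\sxmlns(:\w+)?="[^"]*"/g, "") // drop namespace declarations
    .replace(/(<\/?)\w+:/g, "$1"); // drop prefixes from tag names

// Fine for typical documents:
stripNs('<marc:record xmlns:marc="info:lc/xmlns/marcxchange-v1"/>');
// => '<record/>'

// ...but a regex cannot see context, so markup-like text inside CDATA is mangled too:
stripNs("<note><![CDATA[</marc:record> is an end tag]]></note>");
// => '<note><![CDATA[</record> is an end tag]]></note>'
```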
## Why not use fast-xml-parser (or something similar)?
- [fast-xml-parser](https://www.npmjs.com/package/fast-xml-parser) outputs nice and friendly JSON for simple XML documents, but only because it ignores attributes and namespaces and doesn't preserve ordering by default. It excels at converting XML documents that probably should have been JSON in the first place. It can be configured to keep attributes and namespaces and to preserve ordering, but the output then becomes quite verbose, since every element gets an extra level, and is no longer something you can easily infer a nice-looking JSON schema from.
- Any XML element can be repeated, and you cannot tell from the structure alone which ones are. fast-xml-parser solves this by guessing: the author field of a book with one author becomes an object, while the same field of a book with multiple authors becomes an array (see the sketch after this list). It can be configured to always parse specific fields as arrays, but it's hard to know whether your list is exhaustive without knowing the source really well.
- We don't need the "fast" part, since I/O will usually be the bottleneck and we're not doing anything time-sensitive.
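A small sketch of the guessing behaviour described above. The `XMLParser` class and its `isArray` option are fast-xml-parser's real API; the `author` field is just an assumed example document:

```ts
import { XMLParser } from "fast-xml-parser";

const oneAuthor = "<book><author>Ann</author></book>";
const twoAuthors = "<book><author>Ann</author><author>Bob</author></book>";

// Default behaviour: the output shape depends on how many authors the document happens to have
const parser = new XMLParser();
console.log(parser.parse(oneAuthor).book.author); // "Ann" – a plain value
console.log(parser.parse(twoAuthors).book.author); // ["Ann", "Bob"] – an array

// isArray forces a consistent shape, but you have to enumerate the repeatable fields yourself
const strictParser = new XMLParser({ isArray: (name) => name === "author" });
console.log(strictParser.parse(oneAuthor).book.author); // ["Ann"]
```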