@fgv/ts-bcp47
v5.0.2
BCP-47 Tag Utilities
Summary
TypeScript utilities for parsing, manipulating and comparing BCP-47 language tags.
Installation
with npm:
npm install @fgv/ts-bcp47
API Documentation
Extracted API documentation is here
Overview
Classes and functions to:
- parse and validate BCP-47 (RFC 5646) language tags
- normalize BCP-47 language tags into canonical or preferred form
- compare BCP-47 language tags
TL;DR
For those who already understand BCP-47 language tags and just want to get started, here are a few examples:
import { Bcp47 } from '@fgv/ts-bcp47';
// parse a tag to extract primary language and region
const { primaryLanguage, region } = Bcp47.tag('en-us').orThrow().subtags;
// primaryLanguage is 'en', region is 'us'
// parse a tag to extract primary language and region in canonical form
const canonical = Bcp47.tag('en-us', { normalization: 'canonical' }).orThrow().subtags;
// canonical.primaryLanguage is 'en', canonical.region is 'US'
// normalize a tag to fully-preferred form
const preferred = Bcp47.tag('art-lojban', { normalization: 'preferred' }).orThrow().tag;
// preferred is "jbo"
// tags match regardless of case
const caseMatch = Bcp47.similarity('es-MX', 'es-mx').orThrow(); // 1.0 (exact)
// suppressed script matches explicit script
const suppressedScriptMatch = Bcp47.similarity('es-MX', 'es-latn-mx').orThrow(); // 1.0 (exact)
// macro-region matches contained region well
const macroRegionMatch = Bcp47.similarity('es-419', 'es-MX').orThrow(); // 0.7 (macroRegion)
const siblingMatch = Bcp47.similarity('es-419', 'es-ES').orThrow(); // 0.3 (sibling)
// region matches neutral fairly well
const neutralMatch = Bcp47.similarity('es', 'es-MX').orThrow(); // 0.6 (neutral)
// unlike tags do not match
const unlikeMatch = Bcp47.similarity('en', 'es').orThrow(); // 0.0 (none)
// different scripts do not match
const differentScriptMatch = Bcp47.similarity('zh-Hans', 'zh-Hant').orThrow(); // 0.0 (none)
Note: This library uses the Result pattern, so the return value from any method that might fail is a Result object that must be tested for success or failure. These examples use either orThrow or orDefault to convert an error result to either an exception or undefined.
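For readers new to the Result pattern, the idea can be sketched in a few lines. This is a simplified illustration only, not the library's actual Result type: the `Result` interface, the `succeed` and `fail` helpers and the `primaryLanguageOf` function are all hypothetical names invented for this example.

```typescript
// Hypothetical, simplified Result type (an assumption for illustration;
// the Result object returned by this library is richer than this).
interface Result<T> {
  isSuccess(): boolean;
  orThrow(): T; // returns the value, or throws on failure
  orDefault(): T | undefined; // returns the value, or undefined on failure
}

function succeed<T>(value: T): Result<T> {
  return { isSuccess: () => true, orThrow: () => value, orDefault: () => value };
}

function fail<T>(message: string): Result<T> {
  return {
    isSuccess: () => false,
    orThrow: () => {
      throw new Error(message);
    },
    orDefault: () => undefined,
  };
}

// Hypothetical operation that might fail: extract a 2-3 letter primary language.
function primaryLanguageOf(tag: string): Result<string> {
  const first = tag.split('-')[0];
  return /^[a-z]{2,3}$/i.test(first)
    ? succeed(first.toLowerCase())
    : fail(`'${tag}' does not start with a 2-3 letter language subtag`);
}
```

A caller then chooses how a failure should surface: `primaryLanguageOf('en-US').orThrow()` returns the value or raises an exception, while `primaryLanguageOf('!').orDefault()` quietly yields undefined.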
Anatomy of a BCP-47 language tag.
As specified in RFC 5646, a language tag consists of a series of subtags (mostly optional), each of which describes some aspect of the language being referenced.
Subtags
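The full set of subtags that make up a language tag, in the order in which RFC 5646 requires them to appear, is:
- primary language
- extlang (extended language)
- script
- region
- variant
- extension
- private-use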
Grandfathered Tags
The RFC allows for a handful of grandfathered tags which do not meet the current specification. Those tags are recognized in their entirety and are not composed of subtags, so for grandfathered tags only, even primary language is undefined.
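The subtag structure described above can be sketched as a small syntactic splitter. This is a minimal illustration that follows the subtag lengths in the RFC 5646 grammar, not the parser used by this library: the `Subtags` shape and the `parseSubtags` function are hypothetical names, and the sketch deliberately does not handle grandfathered tags or tags that consist only of private-use subtags.

```typescript
// Hypothetical shape for the pieces of a tag (names are assumptions).
interface Subtags {
  primaryLanguage: string;
  extlangs: string[];
  script: string | undefined;
  region: string | undefined;
  variants: string[];
  extensions: { singleton: string; subtags: string[] }[];
  privateUse: string[];
}

// Splits a tag by syntax only; returns undefined if the shape is not recognized.
function parseSubtags(tag: string): Subtags | undefined {
  const parts = tag.split('-');
  let i = 0;
  // Consume up to `max` consecutive parts that match `re`.
  const takeAll = (re: RegExp, max: number = Infinity): string[] => {
    const out: string[] = [];
    while (out.length < max && i < parts.length && re.test(parts[i])) out.push(parts[i++]);
    return out;
  };

  const primaryLanguage: string | undefined = takeAll(/^[a-z]{2,8}$/i, 1)[0];
  if (primaryLanguage === undefined) return undefined;
  // Extlangs (three letters each, at most three) only follow a 2-3 letter language.
  const extlangs = primaryLanguage.length <= 3 ? takeAll(/^[a-z]{3}$/i, 3) : [];
  const script: string | undefined = takeAll(/^[a-z]{4}$/i, 1)[0];
  const region: string | undefined = takeAll(/^([a-z]{2}|[0-9]{3})$/i, 1)[0];
  const variants = takeAll(/^([a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/i);

  // An extension is a singleton other than 'x' followed by at least one subtag.
  const extensions: { singleton: string; subtags: string[] }[] = [];
  while (i < parts.length && /^[0-9a-wy-z]$/i.test(parts[i])) {
    const singleton = parts[i++];
    const subtags = takeAll(/^[a-z0-9]{2,8}$/i);
    if (subtags.length === 0) return undefined;
    extensions.push({ singleton, subtags });
  }

  // Private use is 'x' followed by at least one subtag.
  let privateUse: string[] = [];
  if (i < parts.length && /^x$/i.test(parts[i])) {
    i++;
    privateUse = takeAll(/^[a-z0-9]{1,8}$/i);
    if (privateUse.length === 0) return undefined;
  }

  // Anything left over means the tag does not fit the expected order.
  if (i !== parts.length) return undefined;
  return { primaryLanguage, extlangs, script, region, variants, extensions, privateUse };
}
```

Because a grandfathered tag such as `i-klingon` does not follow this structure, the sketch returns undefined for it; a complete implementation would need to recognize such tags as whole strings first.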
Validation
Tag validation considers the tag in its current form and never changes the tag itself.
The specification defines two levels of conformance for language tags, and this library defines a third.
Well-Formed Tags
A well-formed tag meets the basic syntactic requirements of the specification, but might not be valid in terms of content.
Valid Tags
A valid tag meets both the syntactic and semantic requirements of the specification, meaning that either all subtags or the full tag (in the case of grandfathered tags) are registered in the IANA language subtag registry, and neither extension nor variant subtags are repeated.
Strictly Valid Tags
A strictly valid tag is valid according to the specification and also meets the rules for variant and extlang prefixes defined by the specification and recorded in the language registry.
Examples
Some examples:
- eng-US is well-formed because it meets the language tag syntax, but is not valid because eng is not a registered language subtag.
- en-US is both well-formed and valid, because en is a registered language subtag.
- es-valencia-valencia is well-formed but not valid, because the valencia variant subtag is repeated.
- es-valencia is well-formed and valid, but it is not strictly valid because the language subtag registry defines a ca prefix for the valencia subtag.
- ca-valencia is well-formed, valid, and strictly valid.
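The three conformance levels can be illustrated with a toy classifier. This is a sketch under heavy assumptions, not how this library validates tags: it only understands tags of the form language, optional region, optional variants, and it checks them against a tiny hand-copied excerpt of registry data. The `classify` function, the `Conformance` type and the three lookup tables are hypothetical names for this example.

```typescript
type Conformance = 'malformed' | 'well-formed' | 'valid' | 'strictly-valid';

// Tiny illustrative excerpt (lowercased); the real IANA registry is far larger.
const languages = new Set<string>(['en', 'es', 'ca']);
const regions = new Set<string>(['us', 'es', 'mx']);
// Variant -> language prefixes recorded for it (simplified to a bare language).
const variantPrefixes = new Map<string, string[]>([['valencia', ['ca']]]);

function classify(tag: string): Conformance {
  // Simplified syntax: language[-region][-variant]* only.
  const m = /^([a-z]{2,3})(?:-([a-z]{2}|[0-9]{3}))?((?:-[a-z0-9]{5,8})*)$/.exec(tag.toLowerCase());
  if (m === null) return 'malformed';
  const language = m[1];
  const region: string | undefined = m[2];
  const variants = m[3] ? m[3].slice(1).split('-') : [];

  // Valid: every subtag is registered and no variant is repeated.
  const registered =
    languages.has(language) &&
    (region === undefined || regions.has(region)) &&
    variants.every((v) => variantPrefixes.has(v));
  const noRepeats = new Set(variants).size === variants.length;
  if (!registered || !noRepeats) return 'well-formed';

  // Strictly valid: each variant is used with one of its registered prefixes.
  const prefixesOk = variants.every((v) => (variantPrefixes.get(v) ?? []).indexOf(language) >= 0);
  return prefixesOk ? 'strictly-valid' : 'valid';
}
```

With this excerpt, `classify('es-valencia')` yields 'valid' while `classify('ca-valencia')` yields 'strictly-valid', mirroring the examples above. Real prefixes can span several subtags, which this sketch ignores.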
Normalization
Normalization transforms a tag to produce a new tag which is semantically identical, but preferred for some reason.
Not-normalized
A non-normalized tag must be well-formed and might be valid or strictly valid, but it does not use the letter case conventions recommended in the specification.
Canonical Form
A tag in canonical form meets all of the letter case conventions recommended by the specification, in addition to being at least well-formed.
Preferred Form
In addition to being strictly valid and canonical, tags in preferred form do not have any deprecated, redundant or suppressed subtags.
Examples
- zh-cmn-hans is strictly valid, but not canonical or preferred.
- zh-cmn-Hans is strictly valid and canonical, but not preferred, because the subtag registry lists zh-cmn-Hans as redundant, with the preferred value cmn-Hans.
- cmn-Hans is strictly valid, canonical and preferred.
- en-latn-us is strictly valid, but not canonical or preferred.
- en-Latn-US is strictly valid and canonical, but not preferred, because the subtag registry lists Latn as the suppressed script for the en language.
- en-US is strictly valid, canonical and preferred.
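The letter case conventions behind canonical form are simple enough to sketch directly. RFC 5646 recommends lowercase for all subtags, except that two-letter subtags are uppercase and four-letter subtags are title case when they are neither at the start of the tag nor after a singleton. The `toCanonicalCase` function below is a hypothetical helper written for this example, not part of this library, and it adjusts case only: it does not validate the tag or apply preferred values.

```typescript
// Applies the RFC 5646 letter case conventions to a tag (case only, no validation).
function toCanonicalCase(tag: string): string {
  let afterSingleton = false;
  return tag
    .split('-')
    .map((subtag, index) => {
      const lower = subtag.toLowerCase();
      // First subtag, singletons, and everything after a singleton stay lowercase.
      if (index === 0 || afterSingleton || subtag.length === 1) {
        afterSingleton = afterSingleton || subtag.length === 1;
        return lower;
      }
      if (subtag.length === 2) return subtag.toUpperCase(); // e.g. region 'US'
      if (subtag.length === 4) return lower[0].toUpperCase() + lower.slice(1); // e.g. script 'Hans'
      return lower;
    })
    .join('-');
}
```

For example, `toCanonicalCase('zh-cmn-hans')` produces `zh-cmn-Hans` and `toCanonicalCase('en-latn-us')` produces `en-Latn-US`, matching the canonical forms listed above. Reaching preferred form additionally requires registry data, which this sketch does not attempt.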
Tag Matching
The match function compares language tags by semantic similarity, unlike RFC 4647, which relies on purely syntactic rules. In many cases the semantic match yields better results, for example by recognizing that a macro-region such as 419 encompasses a specific region.
For any given language tag pair, the match function returns a similarity score in the range 0.0 (no similarity) to 1.0 (exact match).
The degrees of similarity are (from most to least similar):
- exact (1.0) - The two language tags are semantically identical.
- variant (0.9) - The tags vary only in extension or private subtags.
- region (0.8) - The tags match on language, script and region but vary in variant, extension or private-use subtags.
- macroRegion (0.7) - The tags match on language and script, and one of the region subtags is a macro-region (e.g. 419 for Latin America) which encompasses the second region tag.
- neutralRegion (0.6) - The tags match on language and script, and only one of the tags contains a region subtag.
- affinity (0.5) - The tags match on language and script, and two region subtags have an orthographic affinity. Orthographic affinity is defined in this package in the overrides.json file.
- preferredRegion (0.4) - The tags match on language and script, and one of the tags is the preferred region subtag for the language. Preferred region is also defined in this package in overrides.json.
- sibling (0.3) - The tags match on language and script but both have region tags that are otherwise unrelated.
- undetermined (0.2) - One of the languages is the special language und.
- none (0.0) - The tags do not match at all.
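One way to put these scores to work is to pick the best available tag for a requested one. The sketch below is an illustration under assumptions, not an API of this library: `similarityScores`, `nameOf` and `bestMatch` are hypothetical names, and the similarity function is passed in as a parameter so the example stays self-contained. In real use it could presumably be a thin wrapper around the `Bcp47.similarity(...).orThrow()` call shown in the TL;DR section.

```typescript
// Scores copied from the list above, from most to least similar.
const similarityScores = {
  exact: 1.0,
  variant: 0.9,
  region: 0.8,
  macroRegion: 0.7,
  neutralRegion: 0.6,
  affinity: 0.5,
  preferredRegion: 0.4,
  sibling: 0.3,
  undetermined: 0.2,
  none: 0.0,
} as const;

type SimilarityName = keyof typeof similarityScores;
type Similarity = (a: string, b: string) => number;

// Maps a numeric score back to the closest named level at or below it.
function nameOf(score: number): SimilarityName {
  const names = Object.keys(similarityScores) as SimilarityName[];
  return names.find((name) => score >= similarityScores[name]) ?? 'none';
}

// Returns the available tag most similar to `desired`, or undefined if
// no candidate reaches the `minimum` score.
function bestMatch(
  desired: string,
  available: string[],
  similarity: Similarity,
  minimum: number = similarityScores.sibling
): string | undefined {
  let best: string | undefined;
  let bestScore = 0;
  for (const candidate of available) {
    const score = similarity(desired, candidate);
    if (score >= minimum && score > bestScore) {
      best = candidate;
      bestScore = score;
    }
  }
  return best;
}
```

The default minimum of sibling (0.3) is an arbitrary choice for this sketch; an application that prefers to fall back to a default language rather than a loosely related one would raise it.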
See Also
- RFC 5646 - Tags for Identifying Languages
- IANA Language Subtag Registry
