@fgv/ts-bcp47
v4.0.2
BCP-47 Tag Utilities
Summary
TypeScript utilities for parsing, manipulating and comparing BCP-47 language tags.
Installation
with npm:
npm install @fgv/ts-bcp47
API Documentation
Extracted API documentation is here
Overview
Classes and functions to:
- parse and validate BCP-47 (RFC 5646) language tags
- normalize BCP-47 language tags into canonical or preferred form
- compare BCP-47 language tags
TL;DR
For those who already understand BCP-47 language tags and just want to get started, here are a few examples:
import { Bcp47 } from '@fgv/ts-bcp47';
// parse a tag to extract primary language and region
const { primaryLanguage, region } = Bcp47.tag('en-us').orThrow().subtags;
// primaryLanguage is 'en', region is 'us'
// parse a tag to extract primary language and region in canonical form
const canonical = Bcp47.tag('en-us', { normalization: 'canonical' }).orThrow().subtags;
// canonical.primaryLanguage is 'en', canonical.region is 'US'
// normalize a tag to fully-preferred form
const preferred = Bcp47.tag('art-lojban', { normalization: 'preferred' }).orThrow().tag;
// preferred is "jbo"
// tags match regardless of case
const caseMatch = Bcp47.similarity('es-MX', 'es-mx').orThrow(); // 1.0 (exact)
// suppressed script matches explicit script
const scriptMatch = Bcp47.similarity('es-MX', 'es-latn-mx').orThrow(); // 1.0 (exact)
// macro-region matches contained region well
const macroMatch = Bcp47.similarity('es-419', 'es-MX').orThrow(); // 0.7 (macroRegion)
const siblingMatch = Bcp47.similarity('es-419', 'es-ES').orThrow(); // 0.3 (sibling)
// region matches neutral fairly well
const neutralMatch = Bcp47.similarity('es', 'es-MX').orThrow(); // 0.6 (neutralRegion)
// unlike tags do not match
const languageMismatch = Bcp47.similarity('en', 'es').orThrow(); // 0.0 (none)
// different scripts do not match
const scriptMismatch = Bcp47.similarity('zh-Hans', 'zh-Hant').orThrow(); // 0.0 (none)
Note: This library uses the Result pattern, so the return value from any method that might fail is a Result object that must be tested for success or failure. These examples use orThrow, which converts an error result to an exception; orDefault instead converts an error result to undefined.
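To make the note above concrete, here is a minimal, self-contained sketch of the Result pattern. The IResult, Success and Failure types and the parsePrimaryLanguage helper are hypothetical names invented for illustration; they are not the types this library exports, and the toy parser only looks at the first subtag.

```typescript
// Hypothetical, simplified Result type (an assumption for illustration only;
// the library's actual Result type may differ in shape and naming).
interface IResult<T> {
  isSuccess(): boolean;
  orThrow(): T;
  orDefault(): T | undefined;
}

class Success<T> implements IResult<T> {
  constructor(public readonly value: T) {}
  isSuccess(): boolean { return true; }
  orThrow(): T { return this.value; }
  orDefault(): T | undefined { return this.value; }
}

class Failure<T> implements IResult<T> {
  constructor(public readonly message: string) {}
  isSuccess(): boolean { return false; }
  orThrow(): T { throw new Error(this.message); }
  orDefault(): T | undefined { return undefined; }
}

// Toy parser: accepts a tag only if its first subtag is 2-3 letters.
function parsePrimaryLanguage(tag: string): IResult<string> {
  const [first = ''] = tag.split('-');
  return /^[A-Za-z]{2,3}$/.test(first)
    ? new Success(first.toLowerCase())
    : new Failure<string>(`${tag}: malformed primary language`);
}

// orThrow suits code paths where failure is unexpected;
// orDefault suits code paths that can fall back gracefully.
const language = parsePrimaryLanguage('en-US').orThrow();
const fallback = parsePrimaryLanguage('1-US').orDefault() ?? 'und';
```

In this sketch a caller that cannot recover uses orThrow, while a caller with a sensible fallback uses orDefault and supplies its own default value.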
Anatomy of a BCP-47 Language Tag
As specified in RFC 5646, a language tag consists of a series of subtags (mostly optional), each of which describes some aspect of the language being referenced.
Subtags
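The full set of subtags that can make up a language tag is, in the order RFC 5646 defines them:
- primary language
- extlang (extended language)
- script
- region
- variant
- extension
- private-use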
Grandfathered Tags
The RFC allows for a handful of grandfathered tags which do not meet the current specification. Those tags are recognized in their entirety and are not composed of subtags, so for grandfathered tags only, even primary language is undefined.
Validation
Tag validation considers the tag in its current form and never changes the tag itself.
The specification defines two levels of conformance for language tags, and this library defines a third.
Well-Formed Tags
A well-formed tag meets the basic syntactic requirements of the specification, but might not be valid in terms of content.
Valid Tags
A valid tag meets both the syntactic and semantic requirements of the specification, meaning that either all subtags or the full tag (in the case of grandfathered tags) are registered in the IANA language subtag registry, and neither extension nor variant subtags are repeated.
Strictly Valid Tags
A strictly valid tag is valid according to the specification and also meets the rules for variant and extlang prefixes defined by the specification and recorded in the language registry.
Examples
Some examples:
- eng-US is well-formed because it meets the language tag syntax, but it is not valid because eng is not a registered language subtag.
- en-US is both well-formed and valid, because en is a registered language subtag.
- es-valencia-valencia is well-formed but not valid, because the valencia variant subtag is repeated.
- es-valencia is well-formed and valid, but it is not strictly valid because the language subtag registry defines a ca prefix for the valencia subtag.
- ca-valencia is well-formed, valid, and strictly valid.
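The three conformance levels can be sketched with a toy classifier. Everything below is an assumption made for illustration: the classify function, the Conformance type and the tiny hard-coded registry excerpt are hypothetical, the syntax is reduced to language, optional region and variants, and prefix checking is simplified to comparing against the primary language. The real IANA registry and the library's own validators cover far more.

```typescript
// Toy excerpt of the registry (hypothetical data for illustration only).
const LANGUAGES = new Set(['en', 'es', 'ca']);
const REGIONS = new Set(['us', 'mx', 'es']);
const VARIANT_PREFIXES = new Map<string, string[]>([['valencia', ['ca']]]);

type Conformance = 'malformed' | 'well-formed' | 'valid' | 'strictly-valid';

function classify(tag: string): Conformance {
  // Simplified syntax: language, optional region, any number of variants.
  const syntax =
    /^([a-z]{2,3})(?:-([a-z]{2}|[0-9]{3}))?((?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$/;
  const m = syntax.exec(tag.toLowerCase());
  if (m === null) return 'malformed';

  const language = m[1] ?? '';
  const region: string | undefined = m[2];
  const variants = (m[3] ?? '').split('-').filter((v) => v.length > 0);

  // Valid: every subtag is registered and no variant is repeated.
  const registered =
    LANGUAGES.has(language) &&
    (region === undefined || REGIONS.has(region)) &&
    variants.every((v) => VARIANT_PREFIXES.has(v));
  const repeated = new Set(variants).size !== variants.length;
  if (!registered || repeated) return 'well-formed';

  // Strictly valid: each variant is used with one of its registered prefixes.
  const prefixesOk = variants.every((v) =>
    (VARIANT_PREFIXES.get(v) ?? []).includes(language)
  );
  return prefixesOk ? 'strictly-valid' : 'valid';
}
```

Each level builds on the previous one, so the classifier reports only the highest level a tag reaches.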
Normalization
Normalization transforms a tag to produce a new tag which is semantically identical, but preferred for some reason.
Not-normalized
A non-normalized tag must be well-formed and might be valid or strictly valid, but it does not use the letter case conventions recommended in the spec.
Canonical Form
A tag in canonical form meets all of the letter case conventions recommended by the specification, in addition to being at least well-formed.
Preferred Form
In addition to being strictly valid and canonical, tags in preferred form do not have any deprecated, redundant or suppressed subtags.
Examples
- zh-cmn-hans is strictly valid, but not canonical or preferred.
- zh-cmn-Hans is strictly valid and canonical, but not preferred, because the subtag registry lists zh-cmn-Hans as redundant, with the preferred value cmn-Hans.
- cmn-Hans is strictly valid, canonical and preferred.
- en-latn-us is strictly valid, but not canonical or preferred.
- en-Latn-US is strictly valid and canonical, but not preferred, because the subtag registry lists Latn as the suppressed script for the en language.
- en-US is strictly valid, canonical and preferred.
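The letter case conventions behind canonical form can be sketched without any registry data. RFC 5646 recommends lowercase for all subtags, except that two-letter subtags are uppercase and four-letter subtags are title case when they neither start the tag nor follow a singleton. The toCanonicalCase helper below is a hypothetical illustration of that rule, not the library's implementation. It assumes a well-formed, non-grandfathered tag and treats every subtag after the first singleton as lowercase. Preferred form is not covered, because it depends on registry data such as suppressed scripts.

```typescript
// Hypothetical helper: applies only the RFC 5646 letter case conventions.
function toCanonicalCase(tag: string): string {
  let seenSingleton = false;
  return tag
    .split('-')
    .map((subtag, index) => {
      const lower = subtag.toLowerCase();
      // A one-character subtag (e.g. "x" or "u") starts an extension
      // or private-use sequence; everything from there on is lowercase.
      if (lower.length === 1) seenSingleton = true;
      if (index === 0 || seenSingleton) return lower;
      // Two-letter subtags (regions) are uppercase.
      if (lower.length === 2) return lower.toUpperCase();
      // Four-letter subtags (scripts) are title case.
      if (lower.length === 4) return lower.charAt(0).toUpperCase() + lower.slice(1);
      return lower;
    })
    .join('-');
}

const canonicalChinese = toCanonicalCase('zh-cmn-hans'); // 'zh-cmn-Hans'
const canonicalEnglish = toCanonicalCase('en-latn-us'); // 'en-Latn-US'
```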
Tag Matching
The match function matches language tags using semantic similarity, unlike RFC 4647, which relies on purely syntactic rules. This semantic match yields much better results in many cases.
For any given language tag pair, the match function returns a similarity score in the range 0.0 (no similarity) to 1.0 (exact match).
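One common use of a similarity score in this range is choosing the best available language for a desired one. The sketch below is an assumption about how a caller might do that. The chooseBest helper and the Scorer type are hypothetical, and the stand-in scorer simply replays scores quoted in the TL;DR examples rather than calling the library.

```typescript
// Any function that returns a score from 0.0 (none) to 1.0 (exact).
type Scorer = (desired: string, available: string) => number;

// Hypothetical helper: returns the candidate with the highest score
// above the threshold, or undefined if nothing scores above it.
function chooseBest(
  desired: string,
  available: string[],
  score: Scorer,
  threshold = 0.0
): string | undefined {
  let best: string | undefined;
  let bestScore = threshold;
  for (const candidate of available) {
    const s = score(desired, candidate);
    if (s > bestScore) {
      best = candidate;
      bestScore = s;
    }
  }
  return best;
}

// Stand-in scorer replaying scores from the examples in this document.
const KNOWN_SCORES = new Map<string, number>([
  ['es-419|es-MX', 0.7],
  ['es-419|es-ES', 0.3],
]);
const toyScore: Scorer = (a, b) => KNOWN_SCORES.get(`${a}|${b}`) ?? 0.0;

const best = chooseBest('es-419', ['en', 'es-ES', 'es-MX'], toyScore);
```

Passing a threshold lets a caller reject weak matches, for example by ignoring anything at or below the sibling score.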
The degrees of similarity are (from most to least similar):
- exact (1.0) - The two language tags are semantically identical.
- variant (0.9) - The tags vary only in extension or private-use subtags.
- region (0.8) - The tags match on language, script and region but vary in variant, extension or private-use subtags.
- macroRegion (0.7) - The tags match on language and script, and one of the region subtags is a macro-region (e.g. 419 for Latin America) which encompasses the second region subtag.
- neutralRegion (0.6) - The tags match on language and script, and only one of the tags contains a region subtag.
- affinity (0.5) - The tags match on language and script, and the two region subtags have an orthographic affinity. Orthographic affinity is defined in this package in the overrides.json file.
- preferredRegion (0.4) - The tags match on language and script, and one of the tags has the preferred region subtag for the language. Preferred region is also defined in this package in overrides.json.
- sibling (0.3) - The tags match on language and script but both have region subtags that are otherwise unrelated.
- undetermined (0.2) - One of the languages is the special language und.
- none (0.0) - The tags do not match at all.
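The region-related levels in the list above can be sketched as a small comparison function. This is a simplified illustration, not the library's algorithm. It assumes language and script have already matched, it skips the affinity and preferredRegion levels because they depend on the package's overrides.json data, and the MACRO_REGIONS table is a tiny hypothetical excerpt of macro-region containment.

```typescript
// Scores taken from the list above.
const SIMILARITY = {
  exact: 1.0,
  macroRegion: 0.7,
  neutralRegion: 0.6,
  sibling: 0.3,
} as const;

// Toy containment data (hypothetical excerpt): 419 is Latin America.
const MACRO_REGIONS = new Map<string, Set<string>>([
  ['419', new Set(['AR', 'CO', 'MX'])],
]);

function contains(macro: string, region: string): boolean {
  return MACRO_REGIONS.get(macro)?.has(region) ?? false;
}

// Compares two optional region subtags, assuming language and script match.
function regionSimilarity(a?: string, b?: string): number {
  const ra = a?.toUpperCase();
  const rb = b?.toUpperCase();
  if (ra === rb) return SIMILARITY.exact;
  if (ra === undefined || rb === undefined) return SIMILARITY.neutralRegion;
  if (contains(ra, rb) || contains(rb, ra)) return SIMILARITY.macroRegion;
  return SIMILARITY.sibling;
}
```

With this toy data, comparing 419 with MX yields the macroRegion score, while comparing 419 with ES falls through to sibling because ES is not in the Latin America set.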
See Also
- RFC 5646 - Tags for Identifying Languages
- IANA Language Subtag Registry