ucd-full

v16.0.0

Published

5 months ago

Encoding the Unicode Character Database as json files

ehoogerbeets

internationalization i18n globalization g11n locale Unicode

UCD-full

The full Unicode Character Database files encoded as json.

This project publishes exactly the same set of data files as the Unicode Character Database, but parsed and encoded in a format that is more convenient for JavaScript developers to consume.

To Install

In your package.json:

    "devDependencies": {
        ...
        "ucd-full": "^16.0.0",
        ...
    }

The major and minor version numbers of this package are the same as those of the UCD version that it supports. The third-level (patch) version number may change, however, as bugs or inaccuracies are discovered and fixed in this package. For example, version 13.0.2 still encodes UCD 13.0.0, but has had 2 bug fixes.

The UCD files are very large, and the json files are not minified (the whitespace is kept to make development easier), so it is not recommended to load this data directly at run time. Instead, put ucd-full in your devDependencies and write some code that extracts, at build time, the subset of data that your application needs.
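As one possible shape for such a build step, the sketch below reduces Scripts-style data to the numeric ranges for a single script. It assumes that the top-level key of the file is `Scripts` and that each entry has `range` and `script` properties, as the schema table in this README suggests. The `extractScript` helper, the inline sample entries and the output file name are hypothetical.

```javascript
// Hypothetical build-time helper: keep only the ranges for one script.
// In a real build script the input would come from the package:
//   const data = require("ucd-full/Scripts.json");
// An inline sample with the assumed shape is used here instead.
const sampleData = {
    Scripts: [
        { range: ["0041", "005A"], script: "Latin" },
        { range: ["0061", "007A"], script: "Latin" },
        { range: ["00AA"], script: "Latin" },
        { range: ["0370", "0373"], script: "Greek" }
    ]
};

function extractScript(data, scriptName) {
    return data.Scripts
        .filter((entry) => entry.script === scriptName)
        .map((entry) => {
            const start = parseInt(entry.range[0], 16);
            // A single-codepoint range has only one element.
            const end = parseInt(entry.range[entry.range.length - 1], 16);
            return [start, end];
        });
}

const latinRanges = extractScript(sampleData, "Latin");
// Compact output that the application could ship instead of the full file.
const output = JSON.stringify(latinRanges);
console.log(output);
// In a build script: require("fs").writeFileSync("latin-ranges.json", output);
```

Converting the hex strings to numbers at build time keeps the shipped data small and saves the application from parsing them again.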

To Use

To use the files, require them directly:

var scripts = require("ucd-full/Scripts.json");
var caseFolding = require("ucd-full/CaseFolding.json");
var graphemeBreakProperty = require("ucd-full/auxiliary/GraphemeBreakProperty.json");

The set of files and their relative directory structure is the same as that of the UCD files.

Data Types

Codepoints

Codepoints are given as hexadecimal strings, exactly as they appear in the UCD files. They are not converted into actual Unicode characters, which means any json parser should be able to parse the files easily.
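Because the values stay as hex strings, the consumer has to convert them. A minimal sketch of that conversion follows. The helper names `toCodePoint` and `toChar` are hypothetical, and the optional `U+` prefix is tolerated only because the Unihan example in this README shows codepoints written that way.

```javascript
// Convert a hex codepoint string such as "042D" (or "U+3400") to a number.
function toCodePoint(hex) {
    return parseInt(hex.replace(/^U\+/, ""), 16);
}

// Convert a hex codepoint string to the character it denotes.
function toChar(hex) {
    return String.fromCodePoint(toCodePoint(hex));
}

console.log(toCodePoint("042D"));   // 1069
console.log(toChar("042D"));        // Э
console.log(toCodePoint("U+3400")); // 13312
```

`String.fromCodePoint` is used rather than `String.fromCharCode` so that codepoints above U+FFFF are handled correctly.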

Ranges

In general, any field in any file that holds a range of Unicode code points is encoded as an array:

[ "042D" ],             // a single codepoint
[ "0620", "063F" ],     // a range of codepoints between and including U+0620 and U+063F

A range is encoded as a tuple with a start and an end, and is inclusive of both the start and end characters. A single codepoint is encoded as an array with one element.
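A range array can be turned into numeric bounds and tested against a character as in the following sketch. The helper names `rangeBounds` and `rangeContains` are hypothetical and not part of the package.

```javascript
// Normalize a range array to numeric [start, end], both inclusive.
function rangeBounds(range) {
    const start = parseInt(range[0], 16);
    // A one-element array denotes a single codepoint.
    const end = range.length > 1 ? parseInt(range[1], 16) : start;
    return [start, end];
}

// Test whether the first code point of a string falls inside the range.
function rangeContains(range, ch) {
    const cp = ch.codePointAt(0);
    const [start, end] = rangeBounds(range);
    return cp >= start && cp <= end;
}

console.log(rangeBounds(["042D"]));                      // [ 1069, 1069 ]
console.log(rangeContains(["0620", "063F"], "\u0627"));  // true
console.log(rangeContains(["0620", "063F"], "A"));       // false
```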

Lists or Sequences

For any field that is a sequence or a list, the data is also encoded as an array of discrete codepoints. Example entry from NormalizationTest.json:

        {
            "sourceSequence": [
                "1E0C",
                "0307"
            ],
            "NFCSequence": [
                "1E0C",
                "0307"
            ],
            "NFDSequence": [
                "0044",
                "0323",
                "0307"
            ],
            "NFKCSequence": [
                "1E0C",
                "0307"
            ],
            "NFKDSequence": [
                "0044",
                "0323",
                "0307"
            ]
        },
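As a rough illustration of how such sequences might be consumed, the sketch below joins each sequence into a string and compares the result of the built-in `String.prototype.normalize` against the expected forms. The `sequenceToString` helper is hypothetical, and only three of the five fields from the entry are copied inline.

```javascript
// Join an array of hex codepoint strings into a JavaScript string.
function sequenceToString(sequence) {
    return String.fromCodePoint(...sequence.map((hex) => parseInt(hex, 16)));
}

// The NormalizationTest.json entry from this README, copied inline
// (only the fields used here).
const entry = {
    sourceSequence: ["1E0C", "0307"],
    NFCSequence: ["1E0C", "0307"],
    NFDSequence: ["0044", "0323", "0307"]
};

const source = sequenceToString(entry.sourceSequence);
// Compare the engine's normalization with the expected sequences.
const nfdOk = source.normalize("NFD") === sequenceToString(entry.NFDSequence);
const nfcOk = source.normalize("NFC") === sequenceToString(entry.NFCSequence);
console.log(nfdOk, nfcOk);
```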

One Column Files

Data files that have only one column are encoded in json as a simple array of values. Example snippet from the LineBreakTest.json file:

{
    "LineBreakTest": [
        "× 0023 × 0023 ÷",
        "× 0023 × 0020 ÷ 0023 ÷",
        "× 0023 × 0308 × 0023 ÷",
        "× 0023 × 0308 × 0020 ÷ 0023 ÷",
        "× 0023 ÷ 2014 ÷",
        ...
    ]
}
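Each value in a one-column test file is left as the raw line, so the consumer has to parse it. A possible parser is sketched below, assuming the usual convention of the Unicode break test files that ÷ marks a position where a break is allowed and × marks a position where it is not. The `parseBreakTest` helper is hypothetical.

```javascript
const BREAK = "\u00F7";    // ÷ : a break is allowed at this position
const NO_BREAK = "\u00D7"; // × : no break is allowed at this position

// Parse one test line into its text and per-position break flags.
// breaks[i] describes the position before the i-th code point;
// the last flag describes the position at the end of the text.
function parseBreakTest(line) {
    const codePoints = [];
    const breaks = [];
    for (const token of line.trim().split(/\s+/)) {
        if (token === BREAK) {
            breaks.push(true);
        } else if (token === NO_BREAK) {
            breaks.push(false);
        } else {
            codePoints.push(parseInt(token, 16));
        }
    }
    return { text: String.fromCodePoint(...codePoints), breaks };
}

const parsed = parseBreakTest("× 0023 × 0020 ÷ 0023 ÷");
console.log(JSON.stringify(parsed));
```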

Two Column Files

Data files that have only two columns, especially when they are a mapping between two values, are encoded in json as a simple object that maps one value to the other. Example snippet from BidiMirroring.json:

{
    "BidiMirroring": {
        "2039": "203A",
        "2045": "2046",
        "2046": "2045",
        "2208": "220B",
        "2209": "220C",
        "2215": "29F5",
        "2220": "29A3",
        "2221": "299B",
        ...
    }
}
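A two-column map like this can be used as a direct lookup table. The sketch below assumes that the keys are uppercase hex strings padded to at least four digits, as in the snippet. The helpers `toHexKey` and `mirrorOf` are hypothetical.

```javascript
// A few entries copied from the BidiMirroring.json snippet. Real code
// would use require("ucd-full/BidiMirroring.json").BidiMirroring instead.
const bidiMirroring = {
    "2039": "203A",
    "2045": "2046",
    "2046": "2045"
};

// Format a character's first code point the way the keys are written:
// uppercase hex, zero-padded to at least four digits.
function toHexKey(ch) {
    return ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

// Return the mirrored character, or the input unchanged if none is mapped.
function mirrorOf(map, ch) {
    const mapped = map[toHexKey(ch)];
    return mapped === undefined ? ch : String.fromCodePoint(parseInt(mapped, 16));
}

console.log(mirrorOf(bidiMirroring, "\u2045") === "\u2046"); // true
console.log(mirrorOf(bidiMirroring, "a"));                   // a
```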

Multi-Column Files

Files with multiple fields are encoded as an array of objects with property names and values. Typically every entry in this type of file has the same schema, though some entries may be missing one or more of the properties. Example from ArabicShaping.json:

{
    "ArabicShaping": [
        {
            "codepoint": "0600",
            "name": "ARABIC NUMBER SIGN",
            "type": "U",
            "joiningGroup": "No_Joining_Group"
        },
        {
            "codepoint": "0601",
            "name": "ARABIC SIGN SANAH",
            "type": "U",
            "joiningGroup": "No_Joining_Group"
        },
        ...
    ]
}
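Since the entries form a flat array, looking up one codepoint means scanning it. One way to avoid repeated scans is to index the array once, as sketched below. The helpers `indexByCodepoint` and `getProperty` are hypothetical, and the fallback argument covers entries that are missing a property.

```javascript
// Two entries copied from the ArabicShaping.json snippet.
const arabicShaping = [
    { codepoint: "0600", name: "ARABIC NUMBER SIGN", type: "U", joiningGroup: "No_Joining_Group" },
    { codepoint: "0601", name: "ARABIC SIGN SANAH", type: "U", joiningGroup: "No_Joining_Group" }
];

// Build a Map keyed by hex codepoint string for fast lookups.
function indexByCodepoint(entries) {
    return new Map(entries.map((entry) => [entry.codepoint, entry]));
}

// Read a property, falling back when the entry or the property is missing.
function getProperty(index, codepoint, property, fallback) {
    const entry = index.get(codepoint);
    return entry && entry[property] !== undefined ? entry[property] : fallback;
}

const index = indexByCodepoint(arabicShaping);
console.log(getProperty(index, "0601", "name", null));         // ARABIC SIGN SANAH
console.log(getProperty(index, "0601", "missingField", null)); // null
console.log(getProperty(index, "FFFF", "name", null));         // null
```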

For the Unihan_* files, the data in the .txt files is denormalized. That is, the properties for a particular codepoint are given on separate lines. In the json encoding, they are all put together in a single object per codepoint, which makes them much easier to process. The schema for entries in such files is typically variable. Example from Unihan_Readings.json:

{
    "Unihan_Readings": [
        {
            "codepoint": "U+3400",
            "kCantonese": "jau1",
            "kDefinition": "(same as U+4E18 丘) hillock or mound",
            "kMandarin": "qiū"
        },
        {
            "codepoint": "U+3401",
            "kCantonese": "tim2",
            "kDefinition": "to lick; to taste, a mat, bamboo bark",
            "kHanyuPinyin": "10019.020:tiàn",
            "kMandarin": "tiàn"
        },
        {
            "codepoint": "U+3402",
            "kCantonese": "hei2",
            "kDefinition": "(J) non-standard form of U+559C 喜, to like, love, enjoy; a joyful thing"
        },
        {
            "codepoint": "U+3404",
            "kMandarin": "kuà"
        },
        ...
    ]
}
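Because the schema varies from entry to entry, code that reads these files should check that a property is present before using it. The sketch below gathers one property across all entries. The `collectProperty` helper is hypothetical, and stripping the `U+` prefix is only a design choice to make the keys look like the codepoints in the other files.

```javascript
// Entries copied from the Unihan_Readings.json snippet
// (definitions omitted for brevity).
const readings = [
    { codepoint: "U+3400", kCantonese: "jau1", kMandarin: "qiū" },
    { codepoint: "U+3401", kCantonese: "tim2", kMandarin: "tiàn" },
    { codepoint: "U+3402", kCantonese: "hei2" },
    { codepoint: "U+3404", kMandarin: "kuà" }
];

// Collect one property across all entries, skipping entries that lack it.
function collectProperty(entries, property) {
    const result = {};
    for (const entry of entries) {
        if (Object.prototype.hasOwnProperty.call(entry, property)) {
            // Strip the "U+" prefix so that keys are bare hex strings.
            result[entry.codepoint.replace(/^U\+/, "")] = entry[property];
        }
    }
    return result;
}

console.log(collectProperty(readings, "kMandarin"));
```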

Schema

The schema for each file is typically given in the comments at the header of the original UCD txt file, though some files have special html descriptions because they are a bit more complicated (NamesList.txt, for example).

These are all the files and their schemas. For more information about what the fields and their values mean, see the corresponding original Unicode source txt files:

| File | Fields |
| -------- | ---------- |
| json/auxiliary/GraphemeBreakProperty.json | range, property |
| json/auxiliary/GraphemeBreakTest.json | one column |
| json/auxiliary/LineBreakTest.json | one column |
| json/auxiliary/SentenceBreakProperty.json | range, property |
| json/auxiliary/SentenceBreakTest.json | one column |
| json/auxiliary/WordBreakProperty.json | range, property |
| json/auxiliary/WordBreakTest.json | one column |
| json/BidiBrackets.json | codepoint, bracket, type |
| json/BidiCharacterTest.json | codePointSequence, direction, embeddingLEvel, resolvedLevelList, indexList |
| json/BidiMirroring.json | two column codepoint map |
| json/BidiTest.json | input, bitset, levels |
| json/Blocks.json | range, block |
| json/CaseFolding.json | codepoint, status, mapping |
| json/CJKRadicals.json | radical, character, unified |
| json/CompositionExclusions.json | one column |
| json/DerivedAge.json | range, unicodeVersion |
| json/DerivedCoreProperties.json | range, property |
| json/DerivedNormalizationProps.json | range, property, normalized |
| json/EastAsianWidth.json | range, width |
| json/emoji | range, property |
| json/emoji/emoji-data.json | range, property |
| json/emoji/emoji-variation-sequences.json | variationSequence, style |
| json/EmojiSources.json | codepointSequence, docomo, kddi, softbank |
| json/EquivalentUnifiedIdeograph.json | range, unified |
| json/extracted/DerivedBidiClass.json | range, class |
| json/extracted/DerivedBinaryProperties.json | range, property |
| json/extracted/DerivedCombiningClass.json | range, combiningClass |
| json/extracted/DerivedDecompositionType.json | range, type |
| json/extracted/DerivedEastAsianWidth.json | range, width |
| json/extracted/DerivedGeneralCategory.json | range, category |
| json/extracted/DerivedJoiningGroup.json | range, group |
| json/extracted/DerivedJoiningType.json | range, type |
| json/extracted/DerivedLineBreak.json | range, property |
| json/extracted/DerivedName.json | range, name |
| json/extracted/DerivedNumericType.json | range, type |
| json/extracted/DerivedNumericValues.json | range, decimalValue, whole |
| json/HangulSyllableType.json | range, hangulType |
| json/Index.json | two column name map |
| json/IndicPositionalCategory.json | range, positionalCategory |
| json/IndicSyllabicCategory.json | range, syllabicCategory |
| json/Jamo.json | two column codepoint map |
| json/LineBreak.json | range, lineBreakProperty |
| json/NameAliases.json | codepoint, alias, type |
| json/NamedSequences.json | name, codepointSequence |
| json/NamedSequencesProv.json | name, codepointSequence |
| json/NormalizationCorrections.json | codepoint, original, corrected, unicodeVersion |
| json/NormalizationTest.json | sourceSequence, NFCSequence, NFDSequence, NFKCSequence, NFKDSequence |
| json/NushuSources.json | codepoint, kSrc_NushuDuben, kReading |
| json/PropertyAliases.json | shortName, longName |
| json/PropertyValueAliases.json | property, value1short, value1long, value2short, value2long |
| json/PropList.json | range, property |
| json/ScriptExtensions.json | range, extension |
| json/Scripts.json | range, script |
| json/SpecialCasing.json | codepoint, lowerSequence, titleSequence, upperSequence |
| json/StandardizedVariants.json | variationSequence, description |
| json/TangutSources.json | codepoint, kTGT_MergedSrc, kRSTUnicode |
| json/UnicodeData.json | codepoint, name, category, canonicalCombiningClass, bidirectionalCategory, mirrored, unicode1.0Name |
| json/Unihan_DictionaryIndices.json | codepoint, variety of other fields |
| json/Unihan_DictionaryLikeData.json | codepoint, variety of other fields |
| json/Unihan_IRGSources.json | codepoint, variety of other fields |
| json/Unihan_NumericValues.json | codepoint, variety of other fields |
| json/Unihan_OtherMappings.json | codepoint, variety of other fields |
| json/Unihan_RadicalStrokeCounts.json | codepoint, variety of other fields |
| json/Unihan_Readings.json | codepoint, variety of other fields |
| json/Unihan_Variants.json | codepoint, variety of other fields |
| json/USourceData.json | sourceId, status, codepoint, radicalStrokeCount, dictionaryPosition, source, comments, totalStrokes, firstResidualStroke |
| json/VerticalOrientation.json | range, verticalOrientation |

NamesList

The file NamesList.json is special because of its complicated schema. Multiple lines in the NamesList file describe a single codepoint. There may be multiple values for the aliases, for example, so aliases and most other properties are encoded as arrays of strings that contain all of these multiple values. For compatibility mappings and decompositions, each mapping or decomposition is further parsed into an array of codepoints where possible.

Each entry in the names list is an object with the following properties:

{
    codepoint: "string",
    name: "string",
    aliases: "array of string",
    comments: "array of string",
    crossReferences: "array of string",
    compatibilityMappings: "array of array of string",
    decompositions: "array of array of string",
    variations: "array of string"
}
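To show how the nested arrays might be read, the sketch below converts each parsed mapping into a string. The entry is a hypothetical example shaped like the schema, not a value copied from NamesList.json. The `mappingsToStrings` helper is also hypothetical. It assumes that a property may be absent, and it skips any item that is not a plain hex codepoint, since the mappings are parsed into codepoints only where possible.

```javascript
// Hypothetical entry following the schema; the values are illustrative
// and not copied from NamesList.json.
const entry = {
    codepoint: "00C5",
    name: "LATIN CAPITAL LETTER A WITH RING ABOVE",
    decompositions: [["0041", "030A"]]
};

// Matches a bare hex codepoint of four to six digits.
const HEX = /^[0-9A-Fa-f]{4,6}$/;

// Turn each parsed mapping (an array of hex strings) into a string.
// A missing property is treated as an empty list (an assumption), and
// items that are not plain hex codepoints are skipped.
function mappingsToStrings(mappings) {
    return (mappings || []).map((sequence) =>
        String.fromCodePoint(
            ...sequence.filter((item) => HEX.test(item)).map((hex) => parseInt(hex, 16))
        )
    );
}

const decomposed = mappingsToStrings(entry.decompositions);
console.log(decomposed[0] === "A\u030A");                    // true
console.log(mappingsToStrings(entry.compatibilityMappings)); // []
```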

License

This derivative work, ucd-full, is covered under the Apache2 license.

The Unicode data in this package is covered by the Unicode Copyright and Terms of Use and by the Unicode, Inc. License Agreement - Data Files and Software. The Unicode, Inc. License Agreement - Data Files and Software is as follows:

UNICODE, INC. LICENSE AGREEMENT - DATA FILES AND SOFTWARE

See Terms of Use for definitions of Unicode Inc.'s Data Files and Software.

NOTICE TO USER: Carefully read the following legal agreement. BY DOWNLOADING, INSTALLING, COPYING OR OTHERWISE USING UNICODE INC.'S DATA FILES ("DATA FILES"), AND/OR SOFTWARE ("SOFTWARE"), YOU UNEQUIVOCALLY ACCEPT, AND AGREE TO BE BOUND BY, ALL OF THE TERMS AND CONDITIONS OF THIS AGREEMENT. IF YOU DO NOT AGREE, DO NOT DOWNLOAD, INSTALL, COPY, DISTRIBUTE OR USE THE DATA FILES OR SOFTWARE.

Permission is hereby granted, free of charge, to any person obtaining a copy of the Unicode data files and any associated documentation (the "Data Files") or Unicode software and any associated documentation (the "Software") to deal in the Data Files or Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, and/or sell copies of the Data Files or Software, and to permit persons to whom the Data Files or Software are furnished to do so, provided that either

(a) this copyright and permission notice appear with all copies of the Data Files or Software, or

(b) this copyright and permission notice appear in associated Documentation.

THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THE DATA FILES OR SOFTWARE.

Except as contained in this notice, the name of a copyright holder shall not be used in advertising or otherwise to promote the sale, use or other dealings in these Data Files or Software without prior written authorization of the copyright holder.