multibyte
v1.0.4
Published
Unicode-aware string functions
Downloads
138
Maintainers
Readme
multibyte
multibyte provides common string functions that respect multibyte Unicode characters.
npm install multibyte
The problem and the solution
On one hand, JavaScript strings use UTF-16 encoding, and on the other hand, JavaScript strings behave like an Array of code points. Unicode characters that take more than 2 bytes (like newer emoji) get split into 2 code points in many situations.
If you display Unicode text from a UTF-8 source, you need these multibyte
functions that take advantage of the fact that Array.from(string)
is Unicode safe.
import {
charAt,
codePointAt,
length,
slice,
split,
truncateBytes,
} from 'multibyte';
// JavaScript String.prototype.charAt() can return a UTF-16 surrogate
'a🚀c'.charAt(1); // ❌ "\ud83d" (half a rocket)
charAt('a🚀c', 1); // ✅ "🚀"
// JavaScript String.prototype.codePointAt() can return a UTF-16 surrogate
'🚀abc'.codePointAt(1); // ❌ 56960 (surrogate pair of rocket emoji)
codePointAt('🚀abc', 1); // ✅ 97 (the letter a)
// JavaScript returns length in UTF-16, not Unicode characters
'a🚀c'.length; // ❌ 4
length('a🚀c'); // ✅ 3
// JavaScript slices along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(2, 3); // ❌ "\ude80" (half a rocket)
slice('a🚀cdef', 2, 3); // ✅ "c"
// JavaScript splits along UTF-16 boundaries, not Unicode characters
'a🚀c'.split(''); // ❌ ["a", "\ud83d", "\ude80", "c"]
split('a🚀c', ''); // ✅ ["a", "🚀", "c"] ✅
// JavaScript slices strings along UTF-16 boundaries, not Unicode characters
'a🚀cdef'.slice(0, 2); // ❌ "a\ud83d" (half a rocket)
truncateBytes('a🚀cdef', 2); // ✅ "a" (including the rocket would be 3 total bytes)
BOM (Byte order mark) - U+FEFF
Under the hood, all these functions strip a leading BOM if present.