codepoint-iterator
v1.1.1
Published
Fast uint8array to utf-8 codepoint iterator for streams and array buffers by @okikio & @jonathantneal
Downloads
126
Maintainers
Readme
codepoint-iterator
codepoint-iterator
is a utility library that provides functions for converting an iterable of UTF-8 filled Uint8Array's into Unicode code points. The library supports both synchronous and asynchronous iterables and offers different ways to produce code points, including as an async generator, as an array, or by invoking a callback for each code point.
Installation
Node
npm install codepoint-iterator
yarn add codepoint-iterator
or
pnpm install codepoint-iterator
import { asCodePointsIterator, asCodePointsArray, asCodePointsCallback } from "codepoint-iterator";
Deno
import { asCodePointsIterator, asCodePointsArray, asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";
Web
<script src="https://unpkg.com/codepoint-iterator" type="module"></script>
You can also use it via a CDN, e.g.
import { asCodePointsIterator } from "https://esm.run/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://esm.sh/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://unpkg.com/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://cdn.skypack.dev/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://deno.bundlejs.com/file?q=codepoint-iterator";
// or any number of other CDN's
API
asCodePointsIterator(iterable)
Converts an iterable of UTF-8 filled Uint8Array's into an async generator of Unicode code points.
asCodePointsArray(iterable)
Converts an iterable of UTF-8 filled Uint8Array's into an array of Unicode code points.
asCodePointsCallback(iterable, cb)
Processes an iterable of UTF-8 filled Uint8Array's and invokes a callback for each code point.
Examples
Check out the examples/ folder on GitHub.
Using asCodePointsIterator
with an async iterable tokenizer
import { asCodePointsIterator } from "codepoint-iterator";
// or
// import { asCodePointsIterator } from "https://deno.land/x/codepoint_iterator/mod.ts";
async function* tokenizer(input) {
// Simulate an async iterable that yields chunks of UTF-8 bytes
for (const chunk of input) {
yield new TextEncoder().encode(chunk);
}
}
(async () => {
const input = ["Hello", " ", "World!"];
for await (const codePoint of asCodePointsIterator(tokenizer(input))) {
console.log(String.fromCodePoint(codePoint));
}
})();
Using asCodePointsArray
with ChatGPT or another AI workload
import { asCodePointsArray } from "codepoint-iterator";
// or
// import { asCodePointsArray } from "https://deno.land/x/codepoint_iterator/mod.ts";
// Simulate an AI workload that returns a response as an array of Uint8Array chunks
async function getAIResponse() {
return [new TextEncoder().encode("Hello, "), new TextEncoder().encode("I am an AI.")];
}
(async () => {
const responseChunks = await getAIResponse();
const codePoints = await asCodePointsArray(responseChunks);
const responseText = String.fromCodePoint(...codePoints);
console.log(responseText);
})();
Using asCodePointsCallback
for a CSS tokenizer
import { asCodePointsCallback } from "codepoint-iterator";
// or
// import { asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";
async function tokenizeCSS(css: string) {
const tokens: string[] = [];
let currentToken = "";
// Create an array containing the Uint8Array object
const cssChunks = [new TextEncoder().encode(css)];
await asCodePointsCallback(cssChunks, (codePoint: number) => {
const char = String.fromCodePoint(codePoint);
if (char === '{' || char === '}') {
if (currentToken) {
tokens.push(currentToken.trim());
currentToken = "";
}
tokens.push(char);
} else {
currentToken += char;
}
});
return tokens;
}
const css = `
body {
background-color: white;
color: black;
}
h1 {
font-size: 24px;
}
`;
const tokens: string[] = await tokenizeCSS(css);
console.log(tokens);
// Output: [ 'body', '{', 'background-color: white;', 'color: black;', '}', 'h1', '{', 'font-size: 24px;', '}' ]
Using asCodePointsCallback
for Text Manupalation
import { asCodePointsCallback } from "codepoint-iterator";
// or
// import { asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";
// Text Analysis
const frequencyMap = new Map<number, number>();
const updateFrequency = (codePoint: number) => {
const count = frequencyMap.get(codePoint) || 0;
frequencyMap.set(codePoint, count + 1);
};
const text1 = new TextEncoder().encode('Hello, World!');
await asCodePointsCallback([text1], updateFrequency);
console.log("Text Analysis", frequencyMap);
// Text Filtering
const filteredText: string[] = [];
const filterControlCharacters = (codePoint: number) => {
if (codePoint >= 32) {
filteredText.push(String.fromCodePoint(codePoint));
}
};
const text2 = new TextEncoder().encode(`Hello,
World!`);
await asCodePointsCallback([text2], filterControlCharacters);
console.log("Text Filtering", filteredText.join(''));
// Character Set Validation
const validateAscii = (codePoint: number) => {
if (codePoint > 127) {
throw new Error(`Non-ASCII character found: ${String.fromCodePoint(codePoint)}`);
}
};
const text3 = new TextEncoder().encode('Hello, 世界!');
try {
await asCodePointsCallback([text3], validateAscii);
console.error("Character Set Validation", "passed");
} catch (error) {
console.error("Character Set Validation", error.message);
}
// Text Transformation
const transformedText: string[] = [];
const toUpperCase = (codePoint: number) => {
transformedText.push(String.fromCodePoint(codePoint).toUpperCase());
};
const text4 = new TextEncoder().encode('Hello, World!');
await asCodePointsCallback([text4], toUpperCase);
console.log("Text Transformation", transformedText.join(''));
// Unicode Normalization
const normalizedText: string[] = [];
const accumulateCodePoints = (codePoint: number) => {
normalizedText.push(String.fromCodePoint(codePoint));
};
const text5 = new TextEncoder().encode('Café');
await asCodePointsCallback([text5], accumulateCodePoints);
console.log("Unicode Normalization", normalizedText.join('').normalize('NFD'));
// Text Encoding Conversion
const utf16Buffer: number[] = [];
const toUtf16 = (codePoint: number) => {
if (codePoint <= 0xFFFF) {
utf16Buffer.push(codePoint);
} else {
const highSurrogate = Math.floor((codePoint - 0x10000) / 0x400) + 0xD800;
const lowSurrogate = ((codePoint - 0x10000) % 0x400) + 0xDC00;
utf16Buffer.push(highSurrogate, lowSurrogate);
}
};
const text6 = new TextEncoder().encode('Hello, 世界!');
await asCodePointsCallback([text6], toUtf16);
const utf16Array = new Uint16Array(utf16Buffer);
console.log("Text Encoding Conversion", new TextDecoder('utf-16le').decode(utf16Array));
Usage with Async Iterables
The functions in codepoint-iterator
support both synchronous and asynchronous iterables. This means you can use them with data sources that produce chunks of bytes asynchronously, such as file streams, network streams, or other async generators.
Here's an example of using asCodePointsIterator
with an async iterable that reads chunks from a file:
Node
import { asCodePointsIterator } from "codepoint-iterator";
// or
// import { asCodePointsIterator } from "https://deno.land/x/codepoint_iterator/mod.ts";
import { createReadStream } from "node:fs";
const fileStream = createReadStream("example.txt");
(async () => {
for await (const codePoint of asCodePointsIterator(fileStream)) {
console.log(String.fromCodePoint(codePoint));
}
})();
In this example, we use the createReadStream
function from the fs
module to create a readable stream for a file. We then pass the stream to asCodePointsIterator
, which processes the chunks of bytes and yields the corresponding Unicode code points.
Deno
import { asCodePointsIterator, getIterableStream } from "https://deno.land/x/codepoint_iterator/mod.ts";
(async () => {
const file = await Deno.open("example.txt", { read: true })
for await (const codePoint of asCodePointsIterator(getIterableStream(file.readable))) {
console.log(String.fromCodePoint(codePoint));
}
})();
In this example, we are using the asCodePointsIterator
function from the codepoint-iterator
library to read the contents of a file named example.txt
and print each Unicode code point as a character. The getIterableStream
function is used to convert a Deno readable stream into an iterable of Uint8Array
chunks.
Showcase
A couple sites/projects that use codepoint-iterator
:
- Your site/project here...
Benchmarks
The asCodePointsIterator
, asCodePointsArray
, and asCodePointsCallback
functions been thorougly tested to make sure they are the most performant variants for iterators, arrays, and callbacks possible. You can check the latest benchmark results in the GitHub Actions page.
Machine: GitHub Action ubuntu-latest
As of Monday December 4, 2023
on Deno v1.38.4
here are the results:
Note: I recommend using
asCodePointsCallback
whenever possible as it's the fastest variant.
Conclusion
codepoint-iterator
is a versatile library that makes it easy to work with Unicode code points in JavaScript and TypeScript. Whether you're tokenizing text, processing AI responses, or working with file streams, codepoint-iterator
provides a simple and efficient way to handle UTF-8 encoded data. Give it a try and see how it can simplify your code!
Contributing
Thanks @jonathantneal for the assistance with developing the
codepoint-iterator
library.
This package is written Deno first, so you will have to install Deno.
Run tests
deno task test
Run benchmarks
deno task bench
Build package
deno task build
Note: This project uses Conventional Commits standard for commits, so, please format your commits using the rules it sets out.
Licence
See the LICENSE file for license rights and limitations (MIT).