codepoint-iterator

v1.1.1

Fast uint8array to utf-8 codepoint iterator for streams and array buffers by @okikio & @jonathantneal

codepoint-iterator is a utility library that provides functions for converting an iterable of UTF-8 encoded Uint8Array chunks into Unicode code points. The library supports both synchronous and asynchronous iterables and offers three ways to produce code points: as an async generator, as an array, or by invoking a callback for each code point.

Installation

Node

npm install codepoint-iterator

or

yarn add codepoint-iterator

or

pnpm install codepoint-iterator

import { asCodePointsIterator, asCodePointsArray, asCodePointsCallback } from "codepoint-iterator";

Deno

import { asCodePointsIterator, asCodePointsArray, asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";

Web

<script src="https://unpkg.com/codepoint-iterator" type="module"></script>

You can also use it via a CDN, e.g.

import { asCodePointsIterator } from "https://esm.run/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://esm.sh/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://unpkg.com/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://cdn.skypack.dev/codepoint-iterator";
// or
import { asCodePointsIterator } from "https://deno.bundlejs.com/file?q=codepoint-iterator";
// or any number of other CDNs

API

asCodePointsIterator(iterable)

Converts an iterable of UTF-8 encoded Uint8Array chunks into an async generator of Unicode code points.

asCodePointsArray(iterable)

Converts an iterable of UTF-8 encoded Uint8Array chunks into an array of Unicode code points.

asCodePointsCallback(iterable, cb)

Processes an iterable of UTF-8 encoded Uint8Array chunks and invokes a callback for each code point.
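
To make the descriptions above more concrete, here is a minimal sketch of streaming UTF-8 decoding. It is not the library's implementation: decodeUtf8Chunks is a hypothetical helper, it only accepts synchronous iterables, and it assumes well-formed input. What it tries to show is the property all three functions rely on: a character whose bytes are split across two chunks should still come out as a single code point.

```typescript
// Hypothetical helper for illustration only; not part of codepoint-iterator.
// Assumes well-formed UTF-8: continuation bytes are not validated.
function decodeUtf8Chunks(
  chunks: Iterable<Uint8Array>,
  cb: (codePoint: number) => void,
): void {
  let codePoint = 0; // payload bits collected so far for the current character
  let remaining = 0; // continuation bytes still expected

  // Both variables live outside the chunk loop, so a partially decoded
  // character survives the boundary between two chunks.
  for (const chunk of chunks) {
    for (let i = 0; i < chunk.length; i++) {
      const byte = chunk[i];
      if (remaining > 0) {
        // Continuation byte (10xxxxxx): append its six payload bits.
        codePoint = (codePoint << 6) | (byte & 0x3f);
        remaining--;
        if (remaining === 0) cb(codePoint);
      } else if (byte < 0x80) {
        cb(byte); // 1-byte sequence (ASCII)
      } else if ((byte & 0xe0) === 0xc0) {
        codePoint = byte & 0x1f; // 2-byte lead (110xxxxx)
        remaining = 1;
      } else if ((byte & 0xf0) === 0xe0) {
        codePoint = byte & 0x0f; // 3-byte lead (1110xxxx)
        remaining = 2;
      } else if ((byte & 0xf8) === 0xf0) {
        codePoint = byte & 0x07; // 4-byte lead (11110xxx)
        remaining = 3;
      } else {
        cb(0xfffd); // unexpected byte: emit the replacement character
      }
    }
  }
}

// "H世" with the three bytes of "世" (E4 B8 96) split across two chunks.
const seen: number[] = [];
decodeUtf8Chunks(
  [new Uint8Array([0x48, 0xe4, 0xb8]), new Uint8Array([0x96])],
  (codePoint) => {
    seen.push(codePoint);
  },
);
console.log(seen.map((cp) => String.fromCodePoint(cp)).join("")); // prints "H世"
```

Keeping the decoder state outside the chunk loop is the design choice that matters here; everything else is plain bit masking.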

Examples

Check out the examples/ folder on GitHub.

Using asCodePointsIterator with an async iterable tokenizer

import { asCodePointsIterator } from "codepoint-iterator";
// or 
// import { asCodePointsIterator } from "https://deno.land/x/codepoint_iterator/mod.ts";

async function* tokenizer(input) {
  // Simulate an async iterable that yields chunks of UTF-8 bytes
  for (const chunk of input) {
    yield new TextEncoder().encode(chunk);
  }
}

(async () => {
  const input = ["Hello", " ", "World!"];
  for await (const codePoint of asCodePointsIterator(tokenizer(input))) {
    console.log(String.fromCodePoint(codePoint));
  }
})();

Using asCodePointsArray with ChatGPT or another AI workload

import { asCodePointsArray } from "codepoint-iterator";
// or 
// import { asCodePointsArray } from "https://deno.land/x/codepoint_iterator/mod.ts";

// Simulate an AI workload that returns a response as an array of Uint8Array chunks
async function getAIResponse() {
  return [new TextEncoder().encode("Hello, "), new TextEncoder().encode("I am an AI.")];
}

(async () => {
  const responseChunks = await getAIResponse();
  const codePoints = await asCodePointsArray(responseChunks);
  const responseText = String.fromCodePoint(...codePoints);
  console.log(responseText);
})();

Using asCodePointsCallback for a CSS tokenizer

import { asCodePointsCallback } from "codepoint-iterator";
// or 
// import { asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";

async function tokenizeCSS(css: string) {
  const tokens: string[] = [];
  let currentToken = "";

  // Create an array containing the Uint8Array object
  const cssChunks = [new TextEncoder().encode(css)];

  await asCodePointsCallback(cssChunks, (codePoint: number) => {
    const char = String.fromCodePoint(codePoint);
    if (char === '{' || char === '}') {
      if (currentToken) {
        tokens.push(currentToken.trim());
        currentToken = "";
      }
      tokens.push(char);
    } else {
      currentToken += char;
    }
  });

  return tokens;
}

const css = `
  body {
    background-color: white;
    color: black;
  }

  h1 {
    font-size: 24px;
  }
`;

const tokens: string[] = await tokenizeCSS(css);
console.log(tokens);
// Output: [ 'body', '{', 'background-color: white;', 'color: black;', '}', 'h1', '{', 'font-size: 24px;', '}' ]

Using asCodePointsCallback for Text Manipulation

import { asCodePointsCallback } from "codepoint-iterator";
// or 
// import { asCodePointsCallback } from "https://deno.land/x/codepoint_iterator/mod.ts";

// Text Analysis
const frequencyMap = new Map<number, number>();
const updateFrequency = (codePoint: number) => {
  const count = frequencyMap.get(codePoint) || 0;
  frequencyMap.set(codePoint, count + 1);
};
const text1 = new TextEncoder().encode('Hello, World!');
await asCodePointsCallback([text1], updateFrequency);
console.log("Text Analysis", frequencyMap);



// Text Filtering
const filteredText: string[] = [];
const filterControlCharacters = (codePoint: number) => {
  if (codePoint >= 32) {
    filteredText.push(String.fromCodePoint(codePoint));
  }
};
const text2 = new TextEncoder().encode(`Hello,
World!`);
await asCodePointsCallback([text2], filterControlCharacters);
console.log("Text Filtering", filteredText.join(''));



// Character Set Validation
const validateAscii = (codePoint: number) => {
  if (codePoint > 127) {
    throw new Error(`Non-ASCII character found: ${String.fromCodePoint(codePoint)}`);
  }
};
const text3 = new TextEncoder().encode('Hello, 世界!');
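try {
  await asCodePointsCallback([text3], validateAscii);
  console.log("Character Set Validation", "passed");
} catch (error) {
  // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
  console.error("Character Set Validation", error instanceof Error ? error.message : error);
}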



// Text Transformation
const transformedText: string[] = [];
const toUpperCase = (codePoint: number) => {
  transformedText.push(String.fromCodePoint(codePoint).toUpperCase());
};
const text4 = new TextEncoder().encode('Hello, World!');
await asCodePointsCallback([text4], toUpperCase);
console.log("Text Transformation", transformedText.join(''));



// Unicode Normalization
const normalizedText: string[] = [];
const accumulateCodePoints = (codePoint: number) => {
  normalizedText.push(String.fromCodePoint(codePoint));
};
const text5 = new TextEncoder().encode('Café');
await asCodePointsCallback([text5], accumulateCodePoints);
console.log("Unicode Normalization", normalizedText.join('').normalize('NFD'));



// Text Encoding Conversion
const utf16Buffer: number[] = [];
const toUtf16 = (codePoint: number) => {
  if (codePoint <= 0xFFFF) {
    utf16Buffer.push(codePoint);
  } else {
    const highSurrogate = Math.floor((codePoint - 0x10000) / 0x400) + 0xD800;
    const lowSurrogate = ((codePoint - 0x10000) % 0x400) + 0xDC00;
    utf16Buffer.push(highSurrogate, lowSurrogate);
  }
};
const text6 = new TextEncoder().encode('Hello, 世界!');
await asCodePointsCallback([text6], toUtf16);
const utf16Array = new Uint16Array(utf16Buffer);
console.log("Text Encoding Conversion", new TextDecoder('utf-16le').decode(utf16Array));
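
The sample text in the Text Encoding Conversion example only contains characters below U+10000, so its surrogate-pair branch never runs. As a small worked example of that arithmetic, the sketch below uses a hypothetical helper, toUtf16Units (not part of codepoint-iterator), on U+1F600, which needs two UTF-16 code units.

```typescript
// Hypothetical helper mirroring the surrogate-pair arithmetic used in the
// Text Encoding Conversion example; not part of codepoint-iterator.
function toUtf16Units(codePoint: number): number[] {
  if (codePoint <= 0xffff) return [codePoint];
  const offset = codePoint - 0x10000; // 20-bit value
  const high = 0xd800 + (offset >> 10); // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const units = toUtf16Units(0x1f600);
console.log(units.map((u) => u.toString(16)).join(" ")); // prints "d83d de00"
console.log(String.fromCharCode(...units) === String.fromCodePoint(0x1f600)); // prints true
```

The shift and mask are equivalent to the Math.floor division and modulo by 0x400 in the example above.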

Usage with Async Iterables

The functions in codepoint-iterator support both synchronous and asynchronous iterables. This means you can use them with data sources that produce chunks of bytes asynchronously, such as file streams, network streams, or other async generators.
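
As a rough illustration of how a single function can accept both kinds of input, the sketch below relies on the fact that `for await` also works on synchronous iterables. The names ByteSource, countBytes, and delayedChunks are hypothetical and exist only for this example; how the library handles the two cases internally is not shown here.

```typescript
// Hypothetical names for illustration; not part of codepoint-iterator.
type ByteSource = Iterable<Uint8Array> | AsyncIterable<Uint8Array>;

async function countBytes(source: ByteSource): Promise<number> {
  let total = 0;
  // `for await` consumes async iterables and falls back to the sync
  // iterator protocol when the source is a plain iterable such as an array.
  for await (const chunk of source) {
    total += chunk.length;
  }
  return total;
}

// An async source, standing in for a file or network stream.
async function* delayedChunks(): AsyncGenerator<Uint8Array> {
  yield new Uint8Array([0x48, 0x69]);
  yield new Uint8Array([0x21]);
}

(async () => {
  // Synchronous source: a plain array of chunks.
  console.log(await countBytes([new Uint8Array([0x48, 0x69]), new Uint8Array([0x21])])); // prints 3
  // Asynchronous source: an async generator.
  console.log(await countBytes(delayedChunks())); // prints 3
})();
```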

Here's an example of using asCodePointsIterator with an async iterable that reads chunks from a file:

Node

import { asCodePointsIterator } from "codepoint-iterator";
// or 
// import { asCodePointsIterator } from "https://deno.land/x/codepoint_iterator/mod.ts";
import { createReadStream } from "node:fs";

const fileStream = createReadStream("example.txt");

(async () => {
  for await (const codePoint of asCodePointsIterator(fileStream)) {
    console.log(String.fromCodePoint(codePoint));
  }
})();

In this example, we use the createReadStream function from the node:fs module to create a readable stream for a file. Node readable streams are async iterables that yield Buffer chunks (Buffer is a Uint8Array subclass), so we can pass the stream directly to asCodePointsIterator, which processes the chunks of bytes and yields the corresponding Unicode code points.

Deno

import { asCodePointsIterator, getIterableStream } from "https://deno.land/x/codepoint_iterator/mod.ts";

(async () => {
  const file = await Deno.open("example.txt", { read: true });

  for await (const codePoint of asCodePointsIterator(getIterableStream(file.readable))) {
    console.log(String.fromCodePoint(codePoint));
  }
})();

In this example, we are using the asCodePointsIterator function from the codepoint-iterator library to read the contents of a file named example.txt and print each Unicode code point as a character. The getIterableStream function is used to convert a Deno readable stream into an iterable of Uint8Array chunks.
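
For readers curious what such a conversion can look like, here is a minimal sketch of turning a stream with a getReader() method into an async iterable of chunks. This is one common approach and not necessarily how getIterableStream is implemented. The names ChunkReader, ChunkStream, and iterateStream are hypothetical, and the two interfaces are assumptions that cover only the parts of the Web Streams API the sketch touches, so the block stays self-contained.

```typescript
// Structural types (assumptions) describing only what this sketch needs.
interface ChunkReader<T> {
  read(): Promise<{ done: boolean; value?: T }>;
  releaseLock(): void;
}

interface ChunkStream<T> {
  getReader(): ChunkReader<T>;
}

// Hypothetical helper; not part of codepoint-iterator.
async function* iterateStream<T>(stream: ChunkStream<T>): AsyncGenerator<T> {
  const reader = stream.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return; // the stream has no more chunks
      if (value !== undefined) yield value;
    }
  } finally {
    // Runs on normal completion and when the consumer stops early,
    // so the stream is not left locked.
    reader.releaseLock();
  }
}
```

A standard ReadableStream of Uint8Array chunks is expected to satisfy ChunkStream structurally, so the generator's output could in principle be passed to asCodePointsIterator in the same way as in the Deno example above.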

Showcase

A couple of sites/projects that use codepoint-iterator:

  • Your site/project here...

Benchmarks

The asCodePointsIterator, asCodePointsArray, and asCodePointsCallback functions have been thoroughly tested to make sure they are the most performant variants found for iterators, arrays, and callbacks, respectively. You can check the latest benchmark results on the GitHub Actions page.

Machine: GitHub Action ubuntu-latest

As of Monday, December 4, 2023, on Deno v1.38.4, here are the results:

An image displaying the results of a full run of the benchmark

Note: I recommend using asCodePointsCallback whenever possible as it's the fastest variant.

Conclusion

codepoint-iterator is a versatile library that makes it easy to work with Unicode code points in JavaScript and TypeScript. Whether you're tokenizing text, processing AI responses, or working with file streams, codepoint-iterator provides a simple and efficient way to handle UTF-8 encoded data. Give it a try and see how it can simplify your code!

Contributing

Thanks @jonathantneal for the assistance with developing the codepoint-iterator library.

This package is written Deno-first, so you will need to install Deno to work on it.

Run tests

deno task test

Run benchmarks

deno task bench

Build package

deno task build

Note: This project uses the Conventional Commits standard, so please format your commit messages according to its rules.

Licence

See the LICENSE file for license rights and limitations (MIT).