segment-string
v0.0.8
A lightweight wrapper around Intl.Segmenter for segment-aware string operations
Key Features
- Intuitive Intl.Segmenter Wrapper: Simplifies text segmentation with a clean API.
- Standards-Based: Built on native Intl.Segmenter for robust compatibility.
- Lightweight & Tree-Shakeable: Minimal footprint with optimal bundling (836B minified + gzipped).
- Highly Performant: Uses iterators for efficient, on-demand processing.
- Full TypeScript Support: Strict types for safe, predictable usage.
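The on-demand processing mentioned in the list above can be illustrated with the native API alone: the object returned by Intl.Segmenter's segment() is iterable, so a consumer can stop early. The sketch below is only an illustration under that assumption; takeGraphemes is a hypothetical helper and not part of segment-string.

```typescript
// Hypothetical helper (not part of segment-string): yields at most `limit`
// graphemes and stops iterating as soon as enough have been produced.
function* takeGraphemes(text: string, limit: number): Generator<string> {
  if (limit <= 0) return;
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  let taken = 0;
  for (const { segment } of segmenter.segment(text)) {
    yield segment;
    taken += 1;
    if (taken >= limit) return;
  }
}

// Only the first three graphemes are pulled from the iterator.
const firstThree = [...takeGraphemes("a".repeat(100_000), 3)];
console.log(firstThree); // ['a', 'a', 'a']
```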
Installation
npm install segment-string
Getting Started
segment-string is a lightweight wrapper for Intl.Segmenter, designed to simplify locale-sensitive text segmentation in JavaScript and TypeScript. It lets you easily segment and manipulate text by graphemes, words, or sentences, ideal for handling complex cases like multi-character emojis or language-specific boundaries.
import { SegmentString } from "segment-string";
const str = new SegmentString("Hello, world! 👩👩👧👦🌍🌈");
// Segment by grapheme
console.log([...str.graphemes()]); // ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', '👩👩👧👦', '🌍', '🌈']
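To see why grapheme segmentation matters for the string above, it may help to compare the grapheme count with String.prototype.length using the native Intl.Segmenter directly. This is a sketch of the underlying behavior, not the library's code; the family emoji is written with explicit escapes so that its zero-width joiners (U+200D) are visible, and the variable names are illustrative.

```typescript
// Four emoji joined by three zero-width joiners form one visible character.
const family = "\u{1F469}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const sample = `Hello, world! ${family}\u{1F30D}\u{1F308}`;

const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeList = Array.from(graphemeSegmenter.segment(sample), (s) => s.segment);

console.log(sample.length); // 29 (UTF-16 code units)
console.log(graphemeList.length); // 17 (user-perceived characters)
console.log(graphemeList[14] === family); // true
```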
SegmentString Class
The SegmentString class encapsulates a string and provides methods for segmentation, counting, and retrieving segments at specified indices with locale and granularity options.
Constructor
new SegmentString(str: string, locales?: Intl.LocalesArgument);
- str: The string to segment.
- locales: Optional locales argument for segmentation.
Methods
segments(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Iterable<string>
Segments the string by the specified granularity and returns the segments as strings.
rawSegments(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Intl.Segments | Iterable<Intl.SegmentData>
Returns raw Intl.SegmentData objects based on granularity and options.
segmentCount(granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): number
Counts segments in the string based on the specified granularity.
segmentAt(index: number, granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): string | undefined
Retrieves the segment at a specific index, supporting negative indices.
rawSegmentAt(index: number, granularity: Granularity, options?: SegmentationOptions | WordSegmentationOptions): Intl.SegmentData | undefined
Returns the raw segment data at a specific index, supporting negative indices.
graphemes(options?: SegmentationOptions): Iterable<string>
Returns an iterable of grapheme segments as strings.
rawGraphemes(options?: SegmentationOptions): Iterable<Intl.SegmentData>
Returns an iterable of raw grapheme segments.
graphemeCount(options?: SegmentationOptions): number
Counts grapheme segments in the string.
graphemeAt(index: number, options?: SegmentationOptions): string | undefined
Returns the grapheme at a specific index, supporting negative indices.
rawGraphemeAt(index: number, options?: SegmentationOptions): Intl.SegmentData | undefined
Returns the raw grapheme data at a specific index, supporting negative indices.
words(options?: WordSegmentationOptions): Iterable<string>
Returns an iterable of word segments, with optional filtering for word-like segments.
rawWords(options?: WordSegmentationOptions): Iterable<Intl.SegmentData>
Returns an iterable of raw word segments, with optional filtering for word-like segments.
wordCount(options?: WordSegmentationOptions): number
Counts word segments in the string.
wordAt(index: number, options?: WordSegmentationOptions): string | undefined
Returns the word at a specific index, supporting negative indices.
rawWordAt(index: number, options?: WordSegmentationOptions): Intl.SegmentData | undefined
Returns the raw word data at a specific index, supporting negative indices.
sentences(options?: SegmentationOptions): Iterable<string>
Returns an iterable of sentence segments.
rawSentences(options?: SegmentationOptions): Iterable<Intl.SegmentData>
Returns an iterable of raw sentence segments.
sentenceCount(options?: SegmentationOptions): number
Counts sentence segments in the string.
sentenceAt(index: number, options?: SegmentationOptions): string | undefined
Returns the sentence at a specific index, supporting negative indices.
rawSentenceAt(index: number, options?: SegmentationOptions): Intl.SegmentData | undefined
Returns the raw sentence data at a specific index, supporting negative indices.
[Symbol.iterator](): Iterator<string>
Returns an iterator over the graphemes of the string.
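Several of the methods above accept negative indices. The sketch below shows one way such a lookup could work with the native API, assuming negative indices count back from the end as they do for Array.prototype.at. graphemeAtSketch is a hypothetical stand-in, not the library's implementation.

```typescript
// Hypothetical stand-in for graphemeAt(); assumes negative indices count
// back from the end, as with Array.prototype.at.
function graphemeAtSketch(text: string, index: number): string | undefined {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const all = Array.from(segmenter.segment(text), (s) => s.segment);
  const resolved = index < 0 ? all.length + index : index;
  return all[resolved]; // undefined when the index is out of bounds
}

console.log(graphemeAtSketch("abc", 0)); // 'a'
console.log(graphemeAtSketch("abc", -1)); // 'c'
console.log(graphemeAtSketch("abc", 5)); // undefined
```

For simplicity this version collects every segment before indexing; a non-negative index could instead be resolved by iterating only as far as needed.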
Example Usage
import { SegmentString } from "segment-string";
const text = new SegmentString("Hello, world! 👩👩👧👦🌍🌈");
// Segmenting by words
for (const word of text.words()) {
console.log(word); // 'Hello', ',', ' ', 'world', '!', ' 👩👩👧👦🌍🌈'
}
// Segmenting graphemes and counting
console.log([...text.graphemes()]); // ['H', 'e', 'l', 'l', 'o', ',', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', '👩👩👧👦', '🌍', '🌈']
console.log("Grapheme count:", text.graphemeCount()); // 17
console.log("String length:", text.toString().length); // 29
// Accessing a specific word
const secondWord = text.wordAt(1, { isWordLike: true }); // 'world'
console.log(secondWord);
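The example above does not cover sentence granularity. As a rough guide to what sentence segments look like, here is a sketch using the native Intl.Segmenter that the library is described as wrapping; it does not call segment-string, and the variable names are illustrative.

```typescript
const sentenceSegmenter = new Intl.Segmenter("en", { granularity: "sentence" });
const sentenceList = Array.from(
  sentenceSegmenter.segment("Hello world. How are you?"),
  (s) => s.segment,
);

// Each sentence keeps its trailing whitespace.
console.log(sentenceList); // ['Hello world. ', 'How are you?']
```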
SegmentSplitter Class
Alternatively, the SegmentSplitter class allows you to create an instance that can be directly used with JavaScript's String.prototype.split method for basic segmentation.
Constructor
new SegmentSplitter<T extends Granularity>(granularity: T, options?: SegmentationOptions<T>);
- granularity: Specifies the segmentation granularity level ('grapheme', 'word', or 'sentence').
- options: Optional settings to customize the segmentation for the given granularity.
Example Usage
import { SegmentSplitter } from "segment-string";
const str = "Hello, world!";
const wordSplitter = new SegmentSplitter("word", { isWordLike: true });
const words = str.split(wordSplitter);
console.log(words); // ["Hello", "world"]
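String.prototype.split accepts any object that has a [Symbol.split] method, which is presumably how a splitter instance can be passed to it. The sketch below shows that mechanism with the native Intl.Segmenter. WordLikeSplitter is a hypothetical class written for illustration and is not the library's SegmentSplitter.

```typescript
// Hypothetical splitter: split() delegates to the [Symbol.split] method,
// passing the string and the optional limit.
class WordLikeSplitter {
  private readonly segmenter = new Intl.Segmenter("en", { granularity: "word" });

  [Symbol.split](input: string, limit?: number): string[] {
    const words: string[] = [];
    for (const part of this.segmenter.segment(input)) {
      if (limit !== undefined && words.length >= limit) break;
      // Keep only word-like segments, dropping spaces and punctuation.
      if (part.isWordLike) words.push(part.segment);
    }
    return words;
  }
}

console.log("Hello, world!".split(new WordLikeSplitter())); // ['Hello', 'world']
console.log("Hello, world!".split(new WordLikeSplitter(), 1)); // ['Hello']
```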
Individual Functions
getRawSegments
function getRawSegments(
  str: string,
  granularity: Granularity,
  options?: SegmentationOptions | WordSegmentationOptions,
): Intl.Segments | Iterable<Intl.SegmentData>;
- Description: Returns raw Intl.SegmentData objects based on granularity and options.
- Parameters:
  - str: The string to segment.
  - granularity: Specifies the segmentation level ('grapheme', 'word', or 'sentence').
  - options: Includes locales for specifying locale and isWordLike for filtering word-like segments.
- Returns: An iterable of raw Intl.SegmentData.
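For readers unfamiliar with Intl.SegmentData, the sketch below prints the fields of each raw segment produced by the native Intl.Segmenter at word granularity. It does not call getRawSegments, and the variable names are illustrative.

```typescript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });
const rawParts = [...wordSegmenter.segment("Hi there")];

for (const part of rawParts) {
  // index is the segment's offset within the input, in UTF-16 code units.
  console.log(JSON.stringify(part.segment), part.index, part.isWordLike);
}
// "Hi" 0 true
// " " 2 false
// "there" 3 true
```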
getSegments
function getSegments(
  str: string,
  granularity: Granularity,
  options?: SegmentationOptions | WordSegmentationOptions,
): Iterable<string>;
- Description: Returns segments of the string as plain strings.
- Parameters: Similar to getRawSegments.
- Returns: An iterable of segments as strings.
segmentCount
function segmentCount(
  str: string,
  granularity: Granularity,
  options?: SegmentationOptions | WordSegmentationOptions,
): number;
- Description: Returns the count of segments based on granularity and options.
- Parameters: Similar to getRawSegments.
- Returns: Number of segments.
rawSegmentAt
function rawSegmentAt(
  str: string,
  index: number,
  granularity: Granularity,
  options?: SegmentationOptions | WordSegmentationOptions,
): Intl.SegmentData | undefined;
- Description: Returns the raw segment data at a specified index, supporting negative indices.
- Parameters: Similar to getRawSegments, plus an index parameter.
- Returns: The Intl.SegmentData at the specified index, or undefined if out of bounds.
segmentAt
function segmentAt(
  str: string,
  index: number,
  granularity: Granularity,
  options?: SegmentationOptions | WordSegmentationOptions,
): string | undefined;
- Description: Returns the segment at a specified index, supporting negative indices.
- Parameters: Similar to getRawSegments, plus an index parameter.
- Returns: The segment at the specified index, or undefined if out of bounds.
filterRawWordLikeSegments
function filterRawWordLikeSegments(
  segments: Intl.Segments,
): Iterable<Intl.SegmentData>;
- Description: Filters and returns an iterable of raw word-like segment data where isWordLike is true.
- Parameters:
  - segments: The segments to filter.
- Returns: An iterable of Intl.SegmentData for each word-like segment.
filterWordLikeSegments
function filterWordLikeSegments(segments: Intl.Segments): Iterable<string>;
- Description: Filters and returns an iterable of word-like segments as strings where isWordLike is true.
- Parameters:
  - segments: The segments to filter.
- Returns: An iterable of strings for each word-like segment.
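A word-like filter of this kind can be written as a lazy generator over native segment data. The following is a sketch under that assumption, not the library's source; wordLikeOnly is a hypothetical name.

```typescript
// Hypothetical filter: yields the text of each segment whose isWordLike
// flag is true, skipping whitespace and punctuation.
function* wordLikeOnly(segments: Iterable<Intl.SegmentData>): Generator<string> {
  for (const part of segments) {
    if (part.isWordLike) yield part.segment;
  }
}

const wordSegments = new Intl.Segmenter("en", { granularity: "word" }).segment(
  "Hello, world!",
);
const wordLike = [...wordLikeOnly(wordSegments)];
console.log(wordLike); // ['Hello', 'world']
```

Because the generator pulls from the underlying iterator one segment at a time, nothing is filtered until the result is consumed.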
💙 This package was templated with create-typescript-app.