@nlptools/nlptools
v0.0.2
Main NLPTools package - Complete suite of NLP algorithms, text distance, similarity, splitting, and tokenization utilities
This is the main NLPTools package (@nlptools/nlptools), which re-exports every algorithm and utility in the toolkit. It provides a single entry point to all string distance and similarity algorithms, text splitting, and tokenization utilities.
Features
- 🎯 All-in-One: Complete access to all NLPTools algorithms
- 📦 Convenient: Single import for all functionality
- ✂️ Text Splitting: Document chunking and text processing utilities
- 🪙 Tokenization: Fast text encoding and decoding for LLM models
- 📏 Distance & Similarity: Comprehensive string comparison algorithms
- 🚀 Performance Optimized: Automatically uses the fastest implementations available
- 📝 TypeScript First: Full type safety with comprehensive API
- 🔧 Easy to Use: Consistent API across all algorithms
Installation
```sh
# Install with npm
npm install @nlptools/nlptools

# Install with yarn
yarn add @nlptools/nlptools

# Install with pnpm
pnpm add @nlptools/nlptools
```

Usage
Basic Setup
```typescript
import * as nlptools from "@nlptools/nlptools";

// All algorithms are available as named functions
console.log(nlptools.levenshtein("kitten", "sitting")); // 3
console.log(nlptools.jaro("hello", "hallo")); // 0.8666666666666667
console.log(nlptools.cosine("abc", "bcd")); // 0.6666666666666666
```

Distance vs Similarity
Most algorithms have both distance and normalized versions:
```typescript
// Distance algorithms (lower is more similar)
const distance = nlptools.levenshtein("cat", "bat"); // 1

// Similarity algorithms (higher is more similar, 0-1 range)
const similarity = nlptools.levenshtein_normalized("cat", "bat"); // 0.6666666666666666
```

Text Splitting
This package includes text splitters from @nlptools/splitter:
```typescript
import { RecursiveCharacterTextSplitter } from "@nlptools/nlptools";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const text = "Your long document text here...";
const chunks = await splitter.splitText(text);
console.log(chunks);
```

Tokenization
This package includes tokenization utilities from @nlptools/tokenizer:
```typescript
import { Tokenizer } from "@nlptools/nlptools";

// Load tokenizer from HuggingFace Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Encode text
const encoded = tokenizer.encode("Hello World");
console.log(encoded.ids); // [9906, 4435]
console.log(encoded.tokens); // ['Hello', 'ĠWorld']

// Get token count
const tokenCount = tokenizer.encode("This is a sentence.").ids.length;
console.log(`Token count: ${tokenCount}`);
```

Available Algorithm Categories
This package includes all algorithms from @nlptools/distance, @nlptools/splitter, and @nlptools/tokenizer:
Edit Distance Algorithms
- `levenshtein` - Classic edit distance
- `fastest_levenshtein` - High-performance Levenshtein distance
- `damerau_levenshtein` - Edit distance with transpositions
- `myers_levenshtein` - Myers bit-parallel algorithm
- `jaro` - Jaro similarity
- `jarowinkler` - Jaro-Winkler similarity
- `hamming` - Hamming distance for equal-length strings
- `sift4_simple` - SIFT4 algorithm
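As a point of reference, the classic Levenshtein distance and a normalized form can be sketched in plain TypeScript. The normalization formula used here, (maxLen - distance) / maxLen, is an assumption for illustration; the package's `levenshtein_normalized` may compute its score differently.

```typescript
// Sketch of classic Levenshtein edit distance using a single-row DP table.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0]; // dp value for (i-1, j-1)
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,     // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Assumed normalization: fraction of the longer string left unedited.
function levenshteinNormalized(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : (maxLen - levenshtein(a, b)) / maxLen;
}

console.log(levenshtein("kitten", "sitting"));     // 3
console.log(levenshteinNormalized("cat", "bat"));  // 0.6666666666666666
```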
Sequence-based Algorithms
- `lcs_seq` - Longest common subsequence
- `lcs_str` - Longest common substring
- `ratcliff_obershelp` - Gestalt pattern matching
- `smith_waterman` - Local sequence alignment
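To make the first entry concrete, here is a minimal sketch of the longest-common-subsequence length that `lcs_seq` is built on (the package may expose the result as a distance or a normalized similarity rather than the raw length):

```typescript
// Two-row DP for the longest common subsequence length.
function lcsLength(a: string, b: string): number {
  // dp[j] holds the LCS length of a[0..i) and b[0..j).
  let dp = new Array<number>(b.length + 1).fill(0);
  for (let i = 1; i <= a.length; i++) {
    const next = new Array<number>(b.length + 1).fill(0);
    for (let j = 1; j <= b.length; j++) {
      next[j] =
        a[i - 1] === b[j - 1]
          ? dp[j - 1] + 1                 // extend the common subsequence
          : Math.max(dp[j], next[j - 1]); // skip a char of a or of b
    }
    dp = next;
  }
  return dp[b.length];
}

console.log(lcsLength("abcde", "ace")); // 3
```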
Token-based Algorithms
- `jaccard` - Jaccard similarity
- `cosine` - Cosine similarity
- `sorensen` - Sørensen-Dice coefficient
- `tversky` - Tversky index
- `overlap` - Overlap coefficient
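The documented `cosine("abc", "bcd")` result is consistent with set-based similarity over single-character tokens; here is a hedged sketch of that reading (the package's actual tokenization and weighting may differ):

```typescript
// Set-based cosine: |A ∩ B| / sqrt(|A| * |B|) over character sets.
function cosineChars(a: string, b: string): number {
  const A = new Set(a);
  const B = new Set(b);
  if (A.size === 0 || B.size === 0) return 0;
  let shared = 0;
  for (const ch of A) if (B.has(ch)) shared++;
  return shared / Math.sqrt(A.size * B.size);
}

// Set-based Jaccard: |A ∩ B| / |A ∪ B| over character sets.
function jaccardChars(a: string, b: string): number {
  const A = new Set(a);
  const B = new Set(b);
  let shared = 0;
  for (const ch of A) if (B.has(ch)) shared++;
  const union = A.size + B.size - shared;
  return union === 0 ? 1 : shared / union;
}

console.log(cosineChars("abc", "bcd"));  // 0.6666666666666666
console.log(jaccardChars("abc", "bcd")); // 0.5
```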
Bigram Algorithms
- `jaccard_bigram` - Jaccard similarity on character bigrams
- `cosine_bigram` - Cosine similarity on character bigrams
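The bigram variants apply the same set formulas to adjacent character pairs instead of single characters. A minimal sketch of the cosine version (the package's exact bigram extraction may differ):

```typescript
// Collect the set of adjacent character pairs in a string.
function bigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) out.add(s.slice(i, i + 2));
  return out;
}

// Cosine similarity over bigram sets: |A ∩ B| / sqrt(|A| * |B|).
function cosineBigram(a: string, b: string): number {
  const A = bigrams(a);
  const B = bigrams(b);
  if (A.size === 0 || B.size === 0) return 0;
  let shared = 0;
  for (const g of A) if (B.has(g)) shared++;
  return shared / Math.sqrt(A.size * B.size);
}

// "abc" -> {ab, bc}, "bcd" -> {bc, cd}: one shared bigram out of 2 and 2.
console.log(cosineBigram("abc", "bcd")); // 0.5
```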
Naive Algorithms
- `prefix` - Prefix similarity
- `suffix` - Suffix similarity
- `length` - Length-based similarity
Text Splitters
- `RecursiveCharacterTextSplitter` - Splits text recursively using different separators
- `CharacterTextSplitter` - Splits text by character count
- `MarkdownTextSplitter` - Specialized splitter for Markdown documents
- `TokenTextSplitter` - Splits text by token count
- `LatexTextSplitter` - Specialized splitter for LaTeX documents
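The core idea behind recursive splitting can be sketched in a few lines: try separators in order, and recurse with the next separator on any piece still over `chunkSize`. This is a deliberately simplified sketch; a real recursive splitter typically also merges small pieces back together and applies `chunkOverlap`, both omitted here.

```typescript
// Simplified recursive splitting: coarse separators first, then finer ones,
// finally hard character cuts when no separators remain.
function splitRecursively(
  text: string,
  separators: string[],
  chunkSize: number,
): string[] {
  if (text.length <= chunkSize) return [text];
  const [sep, ...rest] = separators;
  if (sep === undefined) {
    // No separators left: fall back to hard cuts at chunkSize.
    const out: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      out.push(text.slice(i, i + chunkSize));
    }
    return out;
  }
  return text
    .split(sep)
    .filter(Boolean)
    .flatMap((piece) => splitRecursively(piece, rest, chunkSize));
}

console.log(splitRecursively("aaa bbb\n\nccc ddd", ["\n\n", " "], 4));
// [ 'aaa', 'bbb', 'ccc', 'ddd' ]
```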
Tokenization Utilities
- `Tokenizer` - Main tokenizer class for encoding and decoding text
- `encode()` - Convert text to token IDs and tokens
- `decode()` - Convert token IDs back to text
- `tokenize()` - Split text into token strings
- `AddedToken` - Custom token configuration class
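The encode/decode contract these utilities follow can be illustrated with a toy whitespace tokenizer over a fixed vocabulary. This is not the real BPE `Tokenizer`; it only shows the round-trip shape: `encode` returns `ids` and `tokens`, and `decode` maps IDs back to text.

```typescript
// Toy tokenizer: whitespace-delimited words mapped to fixed IDs.
class ToyTokenizer {
  private vocab: Map<string, number>;
  private inverse: Map<number, string>;

  constructor(words: string[]) {
    this.vocab = new Map(words.map((w, i): [string, number] => [w, i]));
    this.inverse = new Map(words.map((w, i): [number, string] => [i, w]));
  }

  encode(text: string): { ids: number[]; tokens: string[] } {
    const tokens = text.split(/\s+/).filter(Boolean);
    const ids = tokens.map((t) => this.vocab.get(t) ?? -1); // -1 = unknown
    return { ids, tokens };
  }

  decode(ids: number[]): string {
    return ids.map((id) => this.inverse.get(id) ?? "<unk>").join(" ");
  }
}

const toy = new ToyTokenizer(["Hello", "World"]);
const enc = toy.encode("Hello World");
console.log(enc.ids);             // [ 0, 1 ]
console.log(toy.decode(enc.ids)); // Hello World
```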
Universal Compare Function
```typescript
const result = nlptools.compare("hello", "hallo", "jaro");
console.log(result); // 0.8666666666666667
```

Performance
The package automatically selects the fastest implementation available:
- WebAssembly algorithms: 10-100x faster than pure JavaScript
- High-performance implementations: Including fastest-levenshtein for optimal speed
