stringzilla
v2.0.4
Published
Crunch multi-gigabyte strings with ease
Downloads
15
Readme
StringZilla 🦖
StringZilla is the Godzilla of string libraries, splitting, sorting, and shuffling large textual datasets faster than you can say "Tokyo Tower" 😅
- ✅ Single-header pure C 99 implementation docs
- ✅ Direct CPython bindings with minimal call latency docs
- ✅ SWAR and SIMD acceleration on x86 (AVX2) and ARM (NEON)
- ✅ Radix-like sorting faster than C++
std::sort
- ✅ Memory-mapping to work with larger-than-RAM datasets
- ✅ Memory-efficient compressed arrays to work with sequences
- 🔜 JavaScript bindings are on their way.
This library saved me tens of thousands of dollars pre-processing large datasets for machine learning, even on the scale of a single experiment.
So if you want to process the 6 Billion images from LAION, or the 250 Billion web pages from the CommonCrawl, or even just a few million lines of server logs, and haunted by Python's open(...).readlines()
and str().splitlines()
taking forever, this should help 😊
Performance
StringZilla is built on a very simple heuristic:
If the first 4 bytes of the string are the same, the strings are likely to be equal. Similarly, the first 4 bytes of the strings can be used to determine their relative order most of the time.
Thanks to that it can avoid scalar code processing one char
at a time and use hyper-scalar code to achieve memcpy
speeds.
The implementation fits into a single C 99 header file and uses different SIMD flavors and SWAR on older platforms.
Substring Search
| Backend \ Device | IoT | Laptop | Server |
| :----------------------- | ---------------------: | -----------------------: | ------------------------: |
| Speed Comparison 🐇 | | | |
| Python for
loop | 4 MB/s | 14 MB/s | 11 MB/s |
| C++ for
loop | 520 MB/s | 1.0 GB/s | 900 MB/s |
| C++ string.find
| 560 MB/s | 1.2 GB/s | 1.3 GB/s |
| Scalar StringZilla | 2 GB/s | 3.3 GB/s | 3.5 GB/s |
| Hyper-Scalar StringZilla | 4.3 GB/s | 12 GB/s | 12.1 GB/s |
| Efficiency Metrics 📊 | | | |
| CPU Specs | 8-core ARM, 0.5 W/core | 8-core Intel, 5.6 W/core | 22-core Intel, 6.3 W/core |
| Performance/Core | 2.1 - 3.3 GB/s | 11 GB/s | 10.5 GB/s |
| Bytes/Joule | 4.2 GB/J | 2 GB/J | 1.6 GB/J |
Split, Partition, Sort, and Shuffle
Coming soon.
Quick Start: Python 🐍
- Install via pip:
pip install stringzilla
- Import the classes you need:
from stringzilla import Str, Strs, File
Basic Usage
StringZilla offers two mostly interchangeable core classes:
from stringzilla import Str, File
text_from_str = Str('some-string')
text_from_file = Str(File('some-file.txt'))
The Str
is designed to replace long Python str
strings and wrap our C-level API.
On the other hand, the File
memory-maps a file from persistent memory without loading its copy into RAM.
The contents of that file would remain immutable, and the mapping can be shared by multiple Python processes simultaneously.
A standard dataset pre-processing use case would be to map a sizeable textual dataset like Common Crawl into memory, spawn child processes, and split the job between them.
Basic Operations
- Length:
len(text) -> int
- Indexing:
text[42] -> str
- Slicing:
text[42:46] -> Str
- String conversion:
str(text) -> str
- Substring check:
'substring' in text -> bool
- Hashing:
hash(text) -> int
Advanced Operations
text.contains('substring', start=0, end=9223372036854775807) -> bool
text.find('substring', start=0, end=9223372036854775807) -> int
text.count('substring', start=0, end=9223372036854775807, allowoverlap=False) -> int
text.splitlines(keeplinebreaks=False, separator='\n') -> Strs
text.split(separator=' ', maxsplit=9223372036854775807, keepseparator=False) -> Strs
Collection-Level Operations
Once split into a Strs
object, you can sort, shuffle, and reorganize the slices.
lines: Strs = text.split(separator='\n')
lines.sort()
lines.shuffle(seed=42)
Need copies?
sorted_copy: Strs = lines.sorted()
shuffled_copy: Strs = lines.shuffled(seed=42)
Basic list
-like operations are also supported:
lines.append('Pythonic string')
lines.extend(shuffled_copy)
Low-Level Python API
The StringZilla CPython bindings implement vector-call conventions for faster calls.
import stringzilla as sz
contains: bool = sz.contains("haystack", "needle", start=0, end=9223372036854775807)
offset: int = sz.find("haystack", "needle", start=0, end=9223372036854775807)
count: int = sz.count("haystack", "needle", start=0, end=9223372036854775807, allowoverlap=False)
levenstein: int = sz.levenstein("needle", "nidl")
Quick Start: C 🛠️
There is an ABI-stable C 99 interface, in case you have a database, an operating system, or a runtime you want to integrate with StringZilla.
#include "stringzilla.h"
// Initialize your haystack and needle
sz_string_view_t haystack = {your_text, your_text_length};
sz_string_view_t needle = {your_subtext, your_subtext_length};
// Perform string-level operations
sz_size_t character_count = sz_count_char(haystack.start, haystack.length, "a");
sz_size_t substring_position = sz_find_substring(haystack.start, haystack.length, needle.start, needle.length);
// Hash strings
sz_u32_t crc32 = sz_hash_crc32(haystack.start, haystack.length);
// Perform collection level operations
sz_sequence_t array = {your_order, your_count, your_get_start, your_get_length, your_handle};
sz_sort(&array, &your_config);
Contributing 👾
Future development plans include:
- [x] Replace PyBind11 with CPython, blog
- [x] Bindings for JavaScript
- [ ] Faster string sorting algorithm
- [ ] Reverse-order operations in Python
- [ ] Splitting with multiple separators at once
- [ ] Splitting CSV rows into columns
- [ ] UTF-8 validation.
- [ ] Arm SVE backend
- [ ] Bindings for Java and Rust
Here's how to set up your dev environment and run some tests.
Development
CPython:
# Clean up, install, and test!
rm -rf build && pip install -e . && pytest scripts/ -s -x
# Install without dependencies
pip install -e . --no-index --no-deps
NodeJS:
npm install && npm test
Benchmarking
To benchmark on some custom file and pattern combinations:
python scripts/bench.py --haystack_path "your file" --needle "your pattern"
To benchmark on synthetic data:
python scripts/bench.py --haystack_pattern "abcd" --haystack_length 1e9 --needle "abce"
Packaging
To validate packaging:
cibuildwheel --platform linux
Compiling C++ Tests
cmake -B ./build_release -DSTRINGZILLA_BUILD_TEST=1 && make -C ./build_release -j && ./build_release/stringzilla_test
On MacOS it's recommended to use non-default toolchain:
# Install dependencies
brew install libomp llvm
# Compile and run tests
cmake -B ./build_release \
-DCMAKE_C_COMPILER="/opt/homebrew/opt/llvm/bin/clang" \
-DCMAKE_CXX_COMPILER="/opt/homebrew/opt/llvm/bin/clang++" \
-DSTRINGZILLA_USE_OPENMP=1 \
-DSTRINGZILLA_BUILD_TEST=1 \
&& \
make -C ./build_release -j && ./build_release/stringzilla_test
License 📜
Feel free to use the project under Apache 2.0 or the Three-clause BSD license at your preference.
If you like this project, you may also enjoy USearch, UCall, UForm, UStore, SimSIMD, and TenPack 🤗