strofa

v0.3.1

Published

2 years ago

Compression for short messages using Markov-Huffman encoding

rolftimmermans

strofa – Compress short messages

Use strofa to compress very short messages. It encodes data by using a statistical model to predict each byte based on the previous byte. The same model is used during decompression.

Compression is based on Markov-Huffman coding. High probability byte sequences can be compressed into very few bits. Models work best if they match the type of message to be compressed. A compression model is domain-specific.
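
To make the idea concrete, here is a minimal toy sketch of order-1 Markov-Huffman coding: one Huffman table per previous byte, built from sample messages. It is not strofa's implementation. The names `createToyCoder`, `START` and `END` are hypothetical, the output is a string of "0"/"1" characters rather than packed bytes, and symbols never seen in a given context are simply rejected.

```javascript
// Toy order-1 Markov-Huffman coder. Illustrative only, not strofa's code.
const START = 0; // assumed context used before the first byte
const END = 256; // assumed end-of-message symbol (outside the byte range)

// Count how often each symbol follows each previous byte in the samples.
function countTransitions(samples) {
  const counts = new Map(); // previous byte -> Map(next symbol -> count)
  for (const sample of samples) {
    let prev = START;
    for (const sym of [...Buffer.from(sample, "utf8"), END]) {
      if (!counts.has(prev)) counts.set(prev, new Map());
      const row = counts.get(prev);
      row.set(sym, (row.get(sym) || 0) + 1);
      prev = sym;
    }
  }
  return counts;
}

// Build a Huffman code (symbol -> bit string) from one row of counts.
function huffman(row) {
  const nodes = [...row].map(([sym, n]) => ({ sym, n }));
  if (nodes.length === 1) return new Map([[nodes[0].sym, "0"]]);
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.n - b.n);
    const [a, b] = nodes.splice(0, 2); // merge the two rarest nodes
    nodes.push({ n: a.n + b.n, left: a, right: b });
  }
  const codes = new Map();
  const walk = (node, bits) => {
    if (node.sym !== undefined) codes.set(node.sym, bits);
    else {
      walk(node.left, bits + "0");
      walk(node.right, bits + "1");
    }
  };
  walk(nodes[0], "");
  return codes;
}

function createToyCoder(samples) {
  const tables = new Map(); // previous byte -> Huffman table
  for (const [prev, row] of countTransitions(samples)) {
    tables.set(prev, huffman(row));
  }

  function encode(text) {
    let bits = "";
    let prev = START;
    for (const sym of [...Buffer.from(text, "utf8"), END]) {
      const table = tables.get(prev);
      const code = table && table.get(sym);
      if (code === undefined) throw new Error("symbol not in toy model");
      bits += code;
      prev = sym;
    }
    return bits;
  }

  function decode(bits) {
    const bytes = [];
    let prev = START;
    let i = 0;
    for (;;) {
      const table = tables.get(prev);
      if (!table) throw new Error("invalid bit stream");
      let match;
      // Codes are prefix-free, so at most one code matches at position i.
      for (const [sym, code] of table) {
        if (bits.startsWith(code, i)) { match = { sym, code }; break; }
      }
      if (!match) throw new Error("invalid bit stream");
      i += match.code.length;
      if (match.sym === END) return Buffer.from(bytes).toString("utf8");
      bytes.push(match.sym);
      prev = match.sym;
    }
  }

  return { encode, decode };
}

const toy = createToyCoder(["hello", "help", "hero"]);
const toyBits = toy.encode("hello");
console.log(toyBits.length, "bits, versus", 8 * 5, "bits uncompressed");
console.log(toy.decode(toyBits)); // "hello"
```

Because each context gets its own table, a byte that almost always follows the previous one costs very few bits, which is why a model trained on similar messages matters so much.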

With strofa it is trivial to construct compression models based on a set of sample data that you provide. A compression model can be serialized to a binary representation of roughly 10-30 KB, with an absolute upper limit of 82 KB. Built-in compression models are provided for English text and email addresses.

Installation

The strofa compression algorithm is written in JavaScript. Use it with Node.js:

npm install strofa

For browsers a minified version is available. It requires JavaScript Uint8Array typed array support, which is available in Chrome 7+, Firefox 4+, Internet Explorer 10+, Opera 11.6+ and Safari 5.1+.

This distributable does not ship with any of the standard compression models. You can download the models in binary format if you need them.

Usage

English text

/* Compress as Buffer/Uint8Array. */
strofa.english.encode("All which is not prose is verse...");
// <Buffer 41 ec 2f 68 5f fa af 6f a5 18 d3 7d 44 78 cf 98 c4>

/* Compress as URL-safe base64. */
strofa.english.encodeBase64("and all which is not verse is prose.");
// 'YfgyF7Qv_Ve30jxm-ooxp8g'

The English compression model is created from a number of English books.

Email addresses

/* Compress as Buffer/Uint8Array. */
strofa.email.encode("[email protected]");
//=> <Buffer 5e 3a f0 d9 e8 e5 da d5 40 c7 c0>

/* Compress as URL-safe base64. */
strofa.email.encodeBase64("[email protected]");
//=> 'Xjrw2ejl2tVAx8'

The email address compression model is based on 150 million email addresses that were part of the leaked Adobe accounts database. Needless to say, the email addresses are not included in this repository, nor are they recoverable from the compression model.

Benefits:

Commonly used domains are compressed very efficiently: gmail.com uses just 12 bits.
Predictable patterns occur in the username part of an address. These can be compressed by about half.
Email addresses do not have to be valid.

Caveats:

There may be a strong bias towards Western email addresses.
Email addresses should be lower case. Upper case characters are unexpected and a single one takes about 3-4 bytes to encode (although any that follow will use exactly 1 byte).
No UTF-8 email addresses are included.
The username and the domain parts follow different patterns but the compressor makes no distinction.
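
Given the lower-case caveat above, it may be worth normalizing addresses before compressing them. A minimal sketch follows; `normalizeEmail` is a hypothetical helper, not part of strofa. Lower-casing discards the original capitalization, so only use it where that is acceptable.

```javascript
// Hypothetical helper: trim and lower-case an address before passing it
// to an email coder, so that no upper case characters need to be encoded.
function normalizeEmail(address) {
  return address.trim().toLowerCase();
}

console.log(normalizeEmail("  John.Doe@Example.COM ")); // "john.doe@example.com"
```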

Host names

URLs

Customized compression models

var model = new strofa.Model;

/* Provide as much sample data as possible. */
model.push("Hello world!");
// model.push("...")

/* Create a compressor based on your sample data. */
var coder = model.createCoder();

/* Compression is best when messages resemble your model. */
coder.encode("Hello");
// <Buffer ef>

coder.encode("Hi!");
// <Buffer b6 08 60>

coder.encodeBase64("Hello");
// '7w'

API

When used with Node.js, require the strofa module first:

var strofa = require("strofa");

Built-in compression

`strofa.email`

Loads and returns the built-in compressor for email addresses. Loading is synchronous, so you should access this during your app's initialization. Returns the same compressor when accessed multiple times.

`strofa.english`

Loads and returns the built-in compressor for English text. Loading is synchronous, so you should access this during your app's initialization. Returns the same compressor when accessed multiple times.

Model

`new strofa.Model`

Creates and returns a new compression model.

`model.push(string)`

Improves the model with the given sample message string. You should add as many samples as possible.

`model.createCoder()`

Creates and returns a new Coder based on the model. The coder can be used to compress and decompress messages.

`model.toBuffer()`

Serializes the model into a binary format that can be used at a later stage to instantiate a Coder. Returns a Buffer. The browser version returns a Uint8Array with a few additional functions mimicking a Buffer.

`model.toJSON()`

Serializes the model into a JSON format that can be used at a later stage to instantiate a Coder. When serialized to a string the JSON format is significantly larger than the binary format, so you should almost always use toBuffer() if you want to store a compression model.

Coder

`strofa.Coder.fromBuffer(buffer)`

Creates and returns a new Coder based on the serialized model stored in the given buffer. This requires serialization format compatibility.

`strofa.Coder.fromJSON(json)`

Creates and returns a new Coder based on the serialized model stored as JSON. This requires serialization format compatibility.
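
As a rough sketch of how `model.toBuffer()` and `strofa.Coder.fromBuffer()` could be combined to persist a custom model on disk under Node.js: the helper names `saveModel` and `loadCoder` and the file path are assumptions made for illustration. The `strofa` module is passed in as an argument so the helpers do not assume how it was loaded.

```javascript
const fs = require("fs");

// Write the binary serialization of a model to a file.
// `model` is expected to provide toBuffer(), as documented for strofa.Model.
function saveModel(model, file) {
  fs.writeFileSync(file, model.toBuffer());
}

// Read a serialized model back and create a coder from it.
// `strofa` is the module object returned by require("strofa").
function loadCoder(strofa, file) {
  return strofa.Coder.fromBuffer(fs.readFileSync(file));
}

// Possible usage (hypothetical file name):
//   saveModel(model, "custom.model");
//   var coder = loadCoder(strofa, "custom.model");
```

Since the serialization format may change between versions, it is probably wise to record which strofa version produced a stored model.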

`coder.encode(string)`

Compresses the given string. The string is stored internally as UTF-8. Compression happens per byte. Non-ASCII characters are encoded but generally don't compress very well, because most context information is unavailable. Returns a Buffer. The browser version returns a Uint8Array with a few additional functions mimicking a Buffer. Use decode() to decompress.
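
Because compression happens per UTF-8 byte, it can help to check how many bytes a message occupies before compression. The sketch below uses Node's `Buffer.byteLength`; the helper name `utf8Length` is hypothetical.

```javascript
// Number of bytes a string occupies when stored as UTF-8.
function utf8Length(text) {
  return Buffer.byteLength(text, "utf8");
}

console.log(utf8Length("hello"));        // 5: one byte per ASCII character
console.log(utf8Length("h\u00e9llo"));   // 6: U+00E9 (e with acute) takes two bytes
console.log(utf8Length("\u65e5\u672c")); // 6: each of these CJK characters takes three bytes
```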

`coder.encodeBase64(string)`

Same as encode(), but returns the compressed representation as a URL-safe base64 encoded string. Use decodeBase64() to decompress.
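
For readers unfamiliar with URL-safe base64, the following sketch shows one common way to derive it from standard base64 in Node.js: replace `+` with `-`, replace `/` with `_`, and drop the `=` padding. The helper names are hypothetical, and the output of strofa's own encodeBase64() may differ in detail (for example in how trailing bits are handled), so always decode with decodeBase64().

```javascript
// Convert bytes to URL-safe base64 without padding.
function toUrlSafeBase64(bytes) {
  return Buffer.from(bytes)
    .toString("base64")
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, "");
}

// Convert URL-safe base64 (with or without padding) back to bytes.
function fromUrlSafeBase64(text) {
  return Buffer.from(text.replace(/-/g, "+").replace(/_/g, "/"), "base64");
}

console.log(toUrlSafeBase64([0xef]));  // "7w"
console.log(fromUrlSafeBase64("7w"));  // <Buffer ef>
```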

`coder.decode(buffer)`

Decompresses the given buffer returned by encode(). Returns the original string when used with the same coder object or with a coder object created from the exact same model serialization. This requires compression format compatibility.

No validation is performed. That means that a single incorrect bit may completely alter the message beyond recognition. You should add validation checks yourself if required. Checksums are best appended to the end of the message.
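
One way to add such a check is sketched below: a small checksum is appended to the end of the message before compression and verified after decompression. The functions `checksum`, `encodeChecked` and `decodeChecked` are hypothetical, and the byte-sum checksum is deliberately simple; a CRC or hash would catch more errors. The `plainCoder` object is only a stand-in so the sketch runs without strofa; any object with encode() and decode() should work the same way.

```javascript
// Illustrative checksum: sum of the UTF-8 bytes modulo 256, as two hex digits.
function checksum(text) {
  let sum = 0;
  for (const byte of Buffer.from(text, "utf8")) sum = (sum + byte) % 256;
  return sum.toString(16).padStart(2, "0");
}

// Append the checksum to the end of the message, then compress.
function encodeChecked(coder, text) {
  return coder.encode(text + checksum(text));
}

// Decompress, then verify and strip the trailing checksum.
function decodeChecked(coder, buffer) {
  const decoded = coder.decode(buffer);
  if (decoded.length < 2) throw new Error("checksum mismatch");
  const text = decoded.slice(0, -2);
  if (checksum(text) !== decoded.slice(-2)) throw new Error("checksum mismatch");
  return text;
}

// Stand-in coder (no compression), used here only to exercise the helpers.
const plainCoder = {
  encode: (s) => Buffer.from(s, "utf8"),
  decode: (b) => Buffer.from(b).toString("utf8"),
};

const checked = encodeChecked(plainCoder, "hello");
console.log(decodeChecked(plainCoder, checked)); // "hello"
```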

`coder.decodeBase64(string)`

Decompresses the given URL-safe base64 encoded string returned by encodeBase64(). Returns the original string when used with the same coder object or with a coder object created from the exact same model serialization. This requires compression format compatibility.

Compatibility

Releases of strofa follow semantic versioning. Compatibility of the compression output and of the compression model format is treated as part of API compatibility. That means:

Before 1.0 the compression output, the model serialization format and the API may not be compatible between minor versions. You should be explicit when adding a dependency in your package.json file, for example: "strofa": "0.2.x".
Starting at 1.0 the compression output, the model serialization format and the API will be compatible between minor versions and patch versions.
Starting at 1.0 breaking changes will be limited to major versions only. You should be explicit when adding a dependency in your package.json file, for example: "strofa": "1.x".

License

The strofa compression library and algorithm are licensed under the Apache License, Version 2.0; you may not use this project except in compliance with the License. See the file LICENSE for details.
