parquet-wasm
v0.6.1
Published
WebAssembly Parquet reader and writer.
Downloads
97,128
Readme
WASM Parquet
WebAssembly bindings to read and write the Apache Parquet format to and from Apache Arrow using the Rust parquet
and arrow
crates.
This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical JS Arrow library.
Including read and write support and all compression codecs, the brotli-compressed WASM bundle is 1.2 MB. Refer to custom builds for how to build a smaller bundle. A minimal read-only bundle without compression support can be as small as 456 KB brotli-compressed.
Install
parquet-wasm
is published to NPM. Install with
yarn add parquet-wasm
or
npm install parquet-wasm
API
Parquet-wasm has both a synchronous and asynchronous API. The sync API is simpler but requires fetching the entire Parquet buffer in advance, which is often prohibitive.
Sync API
Refer to these functions:
readParquet
: Read a Parquet file synchronously.readSchema
: Read an Arrow schema from a Parquet file synchronously.writeParquet
: Write a Parquet file synchronously.
Async API
readParquetStream
: Create a ReadableStream that emits Arrow RecordBatches from a Parquet file.ParquetFile
: A class for reading portions of a remote Parquet file. UsefromUrl
to construct from a remote URL orfromFile
to construct from aFile
handle. Note that when you're done using this class, you'll need to callfree
to release any memory held by the ParquetFile instance itself.
Both sync and async functions return or accept a Table
class, an Arrow table in WebAssembly memory. Refer to its documentation for moving data into/out of WebAssembly.
Entry Points
| Entry point | Description | Documentation |
| ------------------------------------------------------------------------- | ------------------------------------------------------- | -------------------- |
| parquet-wasm
, parquet-wasm/esm
, or parquet-wasm/esm/parquet_wasm.js
| ESM, to be used directly from the Web as an ES Module | Link |
| parquet-wasm/bundler
| "Bundler" build, to be used in bundlers such as Webpack | Link |
| parquet-wasm/node
| Node build, to be used with synchronous require
in NodeJS | Link |
ESM
The esm
entry point is the primary entry point. It is the default export from parquet-wasm
, and is also accessible at parquet-wasm/esm
and parquet-wasm/esm/parquet_wasm.js
(for symmetric imports directly from a browser).
Note that when using the esm
bundles, you must manually initialize the WebAssembly module before using any APIs. Otherwise, you'll get an error TypeError: Cannot read properties of undefined
. There are multiple ways to initialize the WebAssembly code:
Asynchronous initialization
The primary way to initialize is by awaiting the default export.
import wasmInit, {readParquet} from "parquet-wasm";
await wasmInit();
Without any parameter, this will try to fetch a file named 'parquet_wasm_bg.wasm'
at the same location as parquet-wasm
. (E.g. this snippet input = new URL('parquet_wasm_bg.wasm', import.meta.url);
).
Note that you can also pass in a custom URL if you want to host the .wasm
file on your own servers.
import wasmInit, {readParquet} from "parquet-wasm";
// Update this version to match the version you're using.
const wasmUrl = "https://cdn.jsdelivr.net/npm/[email protected]/esm/parquet_wasm_bg.wasm";
await wasmInit(wasmUrl);
Synchronous initialization
The initSync
named export allows for
import {initSync, readParquet} from "parquet-wasm";
// The contents of esm/parquet_wasm_bg.wasm in an ArrayBuffer
const wasmBuffer = new ArrayBuffer(...);
// Initialize the Wasm synchronously
initSync(wasmBuffer)
Async initialization should be preferred over downloading the Wasm buffer and then initializing it synchronously, as WebAssembly.instantiateStreaming
is the most efficient way to both download and initialize Wasm code.
Bundler
The bundler
entry point doesn't require manual initialization of the WebAssembly blob, but needs setup with whatever bundler you're using. Refer to the Rust Wasm documentation for more info.
Node
The node
entry point can be loaded synchronously from Node.
const {readParquet} = require("parquet-wasm");
const wasmTable = readParquet(...);
Using directly from a browser
You can load the esm/parquet_wasm.js
file directly from a CDN
const parquet = await import(
"https://cdn.jsdelivr.net/npm/[email protected]/esm/+esm"
)
await parquet.default();
const wasmTable = parquet.readParquet(...);
This specific endpoint will minify the ESM before you receive it.
Debug functions
These functions are not present in normal builds to cut down on bundle size. To create a custom build, see Custom Builds below.
setPanicHook
setPanicHook(): void
Sets console_error_panic_hook
in Rust, which provides better debugging of panics by having more informative console.error
messages. Initialize this first if you're getting errors such as RuntimeError: Unreachable executed
.
The WASM bundle must be compiled with the console_error_panic_hook
feature for this function to exist.
Example
import * as arrow from "apache-arrow";
import initWasm, {
Compression,
readParquet,
Table,
writeParquet,
WriterPropertiesBuilder,
} from "parquet-wasm";
// Instantiate the WebAssembly context
await initWasm();
// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
Number((Math.random() * 20).toFixed(1))
);
const rainDates = Array.from(
{ length: LENGTH },
(_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);
const rainfall = arrow.tableFromArrays({
precipitation: rainAmounts,
date: rainDates,
});
// Write Arrow Table to Parquet
// wasmTable is an Arrow table in WebAssembly memory
const wasmTable = Table.fromIPCStream(arrow.tableToIPC(rainfall, "stream"));
const writerProperties = new WriterPropertiesBuilder()
.setCompression(Compression.ZSTD)
.build();
const parquetUint8Array = writeParquet(wasmTable, writerProperties);
// Read Parquet buffer back to Arrow Table
// arrowWasmTable is an Arrow table in WebAssembly memory
const arrowWasmTable = readParquet(parquetUint8Array);
// table is now an Arrow table in JS memory
const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64<MILLISECOND> }>
Published examples
(These may use older versions of the library with a different API).
Performance considerations
Tl;dr: When you have a Table
object (resulting from readParquet
), try the new
Table.intoFFI
API to move it to JavaScript memory. This API is less well tested than the Table.intoIPCStream
API, but should be
faster and have much less memory overhead (by a factor of 2). If you hit any bugs, please
create a reproducible issue.
Under the hood, parquet-wasm
first decodes a Parquet file into Arrow in WebAssembly memory. But
then that WebAssembly memory needs to be copied into JavaScript for use by Arrow JS. The "normal"
conversion APIs (e.g. Table.intoIPCStream
) use the Arrow IPC
format to get the data back to JavaScript. But this
requires another memory copy inside WebAssembly to assemble the various arrays into a single
buffer to be copied back to JS.
Instead, the new Table.intoFFI
API uses Arrow's C Data
Interface to be able to copy or view
Arrow arrays from within WebAssembly memory without any serialization.
Note that this approach uses the arrow-js-ffi
library to parse the Arrow C Data Interface definitions. This library has not yet been tested in
production, so it may have bugs!
I wrote an interactive blog post on this approach and the Arrow C Data Interface if you want to read more!
Example
import * as arrow from "apache-arrow";
import { parseTable } from "arrow-js-ffi";
import initWasm, { wasmMemory, readParquet } from "parquet-wasm";
// Instantiate the WebAssembly context
await initWasm();
// A reference to the WebAssembly memory object.
const WASM_MEMORY = wasmMemory();
const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());
const wasmArrowTable = readParquet(parquetUint8Array).intoFFI();
// Arrow JS table that was directly copied from Wasm memory
const table: arrow.Table = parseTable(
WASM_MEMORY.buffer,
wasmArrowTable.arrayAddrs(),
wasmArrowTable.schemaAddr()
);
// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're done using it
// to release the Wasm memory.
// Note that any access to the pointers in this table is undefined behavior after this call.
// Calling any `wasmArrowTable` method will error.
wasmArrowTable.drop();
Compression support
The Parquet specification permits several compression codecs. This library currently supports:
- [x] Uncompressed
- [x] Snappy
- [x] Gzip
- [x] Brotli
- [x] ZSTD
- [x] LZ4_RAW
- [ ] LZ4 (deprecated)
LZ4 support in Parquet is a bit messy. As described here, there are two LZ4 compression options in Parquet (as of version 2.9.0). The original version LZ4
is now deprecated; it used an undocumented framing scheme which made interoperability difficult. The specification now reads:
It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable
LZ4_RAW
codec.
It's currently unknown how widespread the ecosystem support is for LZ4_RAW
. As of pyarrow
v7, it now writes LZ4_RAW
by default and presumably has read support for it as well.
Custom builds
In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of parquet-wasm
to keep bundle size at a minimum. If you install the Rust toolchain and wasm-pack
(see Development), you can create a custom build with only the compression codecs you require.
The minimum supported Rust version in this project is 1.60. To upgrade your toolchain, use rustup update stable
.
Example custom builds
Reader-only bundle with Snappy compression:
wasm-pack build --no-default-features --features snappy --features reader
Writer-only bundle with no compression support, targeting Node:
wasm-pack build --target nodejs --no-default-features --features writer
Bundle with reader and writer support, targeting Node, using arrow
and parquet
crates with all their supported compressions, with console_error_panic_hook
enabled:
wasm-pack build \
--target nodejs \
--no-default-features \
--features reader \
--features writer \
--features all_compressions \
--features debug
# Or, given the fact that the default feature includes several of these features, a shorter version:
wasm-pack build --target nodejs --features debug
Refer to the wasm-pack
documentation for more info on flags such as --release
, --dev
, target
, and to the Cargo documentation for more info on how to use features.
Available features
By default, all_compressions
, reader
, writer
, and async
features are enabled. Use --no-default-features
to remove these defaults.
reader
: Activate read support.writer
: Activate write support.async
: Activate asynchronous read support.all_compressions
: Activate all supported compressions.brotli
: Activate Brotli compression.gzip
: Activate Gzip compression.snappy
: Activate Snappy compression.zstd
: Activate ZSTD compression.lz4
: Activate LZ4_RAW compression.debug
: Expose thesetPanicHook
function for better error messages for Rust panics.
Node <20
On Node versions before 20, you'll have to polyfill the Web Cryptography API.
Future work
- [ ] Example of pushdown predicate filtering, to download only chunks that match a specific condition
- [ ] Column filtering, to download only certain columns
- [ ] More tests
Acknowledgements
A starting point of my work came from @my-liminal-space's read-parquet-browser
(which is also dual licensed MIT and Apache 2).
@domoritz's arrow-wasm
was a very helpful reference for bootstrapping Rust-WASM bindings.