@eduardocalazansjr/parquetjs

v0.0.12

fully asynchronous, pure JavaScript implementation of the Parquet file format

parquet.js

fully asynchronous, pure node.js implementation of the Parquet file format

License: MIT

This package contains a fully asynchronous, pure JavaScript implementation of the Parquet file format. The implementation conforms with the Parquet specification and is tested for compatibility with Apache's Java reference implementation.

What is Parquet? Parquet is a column-oriented file format: it allows you to write a large amount of structured data to a file, compress it, and then read parts of it back out efficiently. The Parquet format is based on Google's Dremel paper.
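
To make "column-oriented" a little more concrete, here is a minimal sketch in plain JavaScript that pivots row objects into one array per column. It does not use parquet.js at all; `toColumns` and the sample rows are hypothetical names chosen for illustration, and it assumes every row has the same keys.

```javascript
// Pivot an array of row objects into one array per column.
// Illustration only: a real Parquet file also encodes, compresses
// and chunks each column, none of which is shown here.
function toColumns(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [key, value] of Object.entries(row)) {
      if (!columns[key]) columns[key] = [];
      columns[key].push(value);
    }
  }
  return columns;
}

const rows = [
  { name: 'apples', quantity: 10, price: 2.5 },
  { name: 'oranges', quantity: 7, price: 3.0 },
];

const columns = toColumns(rows);

// Reading only the `price` column touches a single contiguous array
// instead of walking every row object.
console.log(columns.price);
```

Because all values of one field sit next to each other, a reader that only needs `price` can skip the other columns entirely, which is the property the paragraph above describes.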

Forked Notice

This is a forked repository with code from various sources:

Installation

parquet.js requires node.js >= 14.16.0

  $ npm install @dsnp/parquetjs

NodeJS

To use with nodejs:

import parquetjs from "@dsnp/parquetjs"

Browser with Bundler

To use in a browser with a bundler, depending on your needs, write the appropriate plugin or resolver to point to either the Common JS or ES Module version:

// Common JS
"node_modules/@dsnp/parquetjs/dist/browser/parquetjs.cjs"
// ES Modules
"node_modules/@dsnp/parquetjs/dist/browser/parquetjs.esm"

or:

// Common JS
import parquetjs from "@dsnp/parquetjs/dist/browser/parquetjs.cjs"
// ES Modules
import parquetjs from "@dsnp/parquetjs/dist/browser/parquetjs.esm"

Browser Direct: ES Modules

To use directly in the browser without a bundler using ES Modules:

  1. Build the package: npm install && npm run build:browser
  2. Copy dist/browser/parquetjs.esm.js to the server
  3. Use it in your html or other ES Modules:
    <script type="module">
      import parquetjs from '../parquetjs.esm.js';
      // Use parquetjs
    </script>

Browser Direct: Plain Ol' JavaScript

To use directly in the browser without a bundler or ES Modules:

  1. Build the package: npm install && npm run build:browser
  2. Copy dist/browser/parquetjs.js to the server
  3. Use the global parquetjs variable to access parquetjs functions
    <script>
      // console.log(parquetjs)
    </script>

Usage: Writing files

Once you have installed the parquet.js library, you can import it as a single module:

var parquet = require('@dsnp/parquetjs');

Parquet files have a strict schema, similar to tables in a SQL database. So, in order to produce a Parquet file we first need to declare a new schema. Here is a simple example that shows how to instantiate a ParquetSchema object:

Native Schema Definition

// declare a schema for the `fruits` table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64' },
  price: { type: 'DOUBLE' },
  date: { type: 'TIMESTAMP_MILLIS' },
  in_stock: { type: 'BOOLEAN' }
});

Helper Functions

var schema = new parquet.ParquetSchema({
  name: parquet.ParquetFieldBuilder.createStringField(),
  quantity: parquet.ParquetFieldBuilder.createIntField(64),
  price: parquet.ParquetFieldBuilder.createDoubleField(),
  date: parquet.ParquetFieldBuilder.createTimestampField(),
  in_stock: parquet.ParquetFieldBuilder.createBooleanField()
});

JSON Schema

// declare a schema for the `fruits` JSON Schema
var schema = parquet.ParquetSchema.fromJsonSchema({
  "type": "object",
  "properties": {
    "name": {
      "type": "string"
    },
    "quantity": {
      "type": "integer"
    },
    "price": {
      "type": "number"
    },
    "date": {
      "type": "string"
    },
    "in_stock": {
      "type": "boolean"
    }
  },
  "required": ["name", "quantity", "price", "date", "in_stock"]
});
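
To show how the JSON Schema above lines up with the native definition, here is a rough sketch that converts the primitive JSON Schema types into native-style field definitions. The type correspondence (`string` to `UTF8`, `integer` to `INT64`, `number` to `DOUBLE`, `boolean` to `BOOLEAN`) is an assumption inferred from the examples in this README, not a statement of what `fromJsonSchema` does internally; `ASSUMED_TYPE_MAP` and `sketchNativeSchema` are hypothetical names.

```javascript
// ASSUMPTION: this mapping is inferred from the README examples,
// it is not taken from the library's source.
const ASSUMED_TYPE_MAP = {
  string: 'UTF8',
  integer: 'INT64',
  number: 'DOUBLE',
  boolean: 'BOOLEAN',
};

// Build native-style field definitions from a flat JSON Schema object.
// Properties not listed in `required` are marked `optional: true`.
function sketchNativeSchema(jsonSchema) {
  const required = new Set(jsonSchema.required || []);
  const fields = {};
  for (const [name, def] of Object.entries(jsonSchema.properties)) {
    const type = ASSUMED_TYPE_MAP[def.type];
    if (!type) {
      throw new Error(`no assumed mapping for JSON Schema type "${def.type}"`);
    }
    fields[name] = required.has(name) ? { type } : { type, optional: true };
  }
  return fields;
}

const sketchedFields = sketchNativeSchema({
  type: 'object',
  properties: {
    name: { type: 'string' },
    quantity: { type: 'integer' },
    in_stock: { type: 'boolean' },
  },
  required: ['name'],
});
console.log(sketchedFields);
```

The sketch only handles flat objects with primitive types; nested objects, arrays and string formats are left out on purpose.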

Note that the Parquet schema supports nesting, so you can store complex, arbitrarily nested records into a single row (more on that later) while still maintaining good compression.

Once we have a schema, we can create a ParquetWriter object. The writer will take input rows as JSON objects, convert them to the Parquet format and store them on disk.

// create new ParquetWriter that writes to 'fruits.parquet'
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

// append a few rows to the file
await writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});
await writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});

Once we are finished adding rows to the file, we have to tell the writer object to flush the metadata to disk and close the file by calling the close() method:

await writer.close();

Adding bloom filters

Bloom filters can be added to multiple columns as demonstrated below:

  const options = {
    bloomFilters: [
      {
        column: "name",
        numFilterBytes: 1024,
      },
      {
        column: "quantity",
        numFilterBytes: 1024,
      },
    ]
  };

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet', options);

When no additional options are passed, the optimal number of blocks is calculated from the default number of distinct values (128*1024) and the default false positive probability (0.001), which gives a filter byte size of 29,920.

The following options are provided to adjust the split-block bloom filter settings.

numFilterBytes - sets the desired size of the bloom filter in bytes. Defaults to 128 * 1024 * 1024 bits.

falsePositiveRate - sets the desired false positive rate for the bloom filter. Defaults to 0.001.

numDistinct - sets the expected number of distinct values. Defaults to 128 * 1024.

Note that if numFilterBytes is provided then falsePositiveRate and numDistinct options are ignored.
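
The precedence rule above can be sketched as a small helper that reports which settings would take effect for one `bloomFilters` entry. This is only an illustration of the rule as described here: `describeBloomOptions` is a hypothetical name, it is not part of parquet.js, and it does not compute the actual filter size.

```javascript
// Defaults as quoted in this README; the helper itself is hypothetical.
const DEFAULT_NUM_DISTINCT = 128 * 1024;
const DEFAULT_FALSE_POSITIVE_RATE = 0.001;

// Describe which settings would apply to one bloomFilters entry.
function describeBloomOptions(entry) {
  if (entry.numFilterBytes !== undefined) {
    // An explicit byte size wins; the other two options are ignored.
    return {
      column: entry.column,
      sizedBy: 'numFilterBytes',
      numFilterBytes: entry.numFilterBytes,
    };
  }
  return {
    column: entry.column,
    sizedBy: 'numDistinct+falsePositiveRate',
    numDistinct: entry.numDistinct ?? DEFAULT_NUM_DISTINCT,
    falsePositiveRate: entry.falsePositiveRate ?? DEFAULT_FALSE_POSITIVE_RATE,
  };
}

console.log(describeBloomOptions({ column: 'name', numFilterBytes: 1024, numDistinct: 500 }));
console.log(describeBloomOptions({ column: 'quantity', falsePositiveRate: 0.01 }));
```

In the first call `numDistinct` is dropped because `numFilterBytes` is present; in the second, `numDistinct` falls back to its default while the custom `falsePositiveRate` is kept.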

Usage: Reading files

A parquet reader allows retrieving the rows from a parquet file in order. The basic usage is to create a reader and then retrieve a cursor/iterator which allows you to consume row after row until all rows have been read.

You may open more than one cursor and use them concurrently. All cursors become invalid once close() is called on the reader object.

// create new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

// create a new cursor
let cursor = reader.getCursor();

// read all records from the file and print them
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

When creating a cursor, you can optionally request that only a subset of the columns should be read from disk. For example:

// create a new cursor that will only return the `name` and `price` columns
let cursor = reader.getCursor(['name', 'price']);

It is important that you call close() after you are finished reading the file to avoid leaking file descriptors.

await reader.close();

Reading a bloom filter

Bloom filters can be fetched from a parquet file by creating a reader and calling getBloomFiltersFor.

// create new ParquetReader that reads from 'fruits.parquet'
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

// fetches bloom filter for the columns provided.
const bloomFilters = reader.getBloomFiltersFor(['name']);

=> {
  name: [
    {
      rowGroupIndex: 0,
      columnName: 'name',
      sbbf: SplitBlockBloomFilter<instance>
    }
  ]
}

Calling getBloomFiltersFor on the reader returns an object with the keys being a column name and value being an array of length equal to the number of row groups that the column spans.

Given a SplitBlockBloomFilter instance, you can check whether a value is included in the filter as follows:

const sbbf = bloomFilters.name[0].sbbf;

sbbf.check('apples') ===> true
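
To illustrate what a positive or negative check means, here is a toy Bloom filter in plain JavaScript. It is a classic bit-array variant, not the split-block layout that Parquet uses, and `ToyBloomFilter` with its sizes and hash function is entirely hypothetical. What it shares with the real thing is the contract: a `false` result means the value was definitely never inserted, while `true` only means it may have been.

```javascript
// Toy Bloom filter: classic bit-array variant, NOT Parquet's split-block layout.
class ToyBloomFilter {
  constructor(numBits = 1024, numHashes = 3) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new Uint8Array(Math.ceil(numBits / 8));
  }

  // FNV-1a style string hash, varied by a seed so we get several hash functions.
  hash(value, seed) {
    let h = (2166136261 ^ seed) >>> 0;
    for (const ch of String(value)) {
      h ^= ch.codePointAt(0);
      h = Math.imul(h, 16777619) >>> 0;
    }
    return h % this.numBits;
  }

  insert(value) {
    for (let i = 0; i < this.numHashes; i++) {
      const bit = this.hash(value, i);
      this.bits[bit >> 3] |= 1 << (bit & 7);
    }
  }

  // false => definitely absent; true => possibly present.
  check(value) {
    for (let i = 0; i < this.numHashes; i++) {
      const bit = this.hash(value, i);
      if ((this.bits[bit >> 3] & (1 << (bit & 7))) === 0) return false;
    }
    return true;
  }
}

const filter = new ToyBloomFilter();
filter.insert('apples');
filter.insert('oranges');

// Inserted values are always reported as possibly present.
console.log(filter.check('apples'));
```

A reader can use this kind of check to skip a row group when the filter rules a value out, and fall back to scanning when it does not.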

Reading data from a url

Parquet files can be read from a URL without having to download the whole file. You will have to supply the request library as the first argument and the request parameters as the second argument to the function ParquetReader.openUrl.

const request = require('request');
let reader = await parquet.ParquetReader.openUrl(request,'https://domain/fruits.parquet');

Reading data from S3

Parquet files can be read from an S3 object without having to download the whole file. You will have to supply the aws-sdk client as the first argument and the bucket/key information as the second argument to the function ParquetReader.openS3.

const AWS = require('aws-sdk');
const client = new AWS.S3({
  accessKeyId: 'xxxxxxxxxxx',
  secretAccessKey: 'xxxxxxxxxxx'
});

const params = {
  Bucket: 'xxxxxxxxxxx',
  Key: 'xxxxxxxxxxx'
};

let reader = await parquet.ParquetReader.openS3(client,params);

Reading data from a buffer

If the complete parquet file is already in a buffer, it can be read directly from memory without incurring any additional I/O.

const fs = require('fs');

// load the whole file into memory, then open a reader on the buffer
const file = fs.readFileSync('fruits.parquet');
let reader = await parquet.ParquetReader.openBuffer(file);

Encodings

Internally, the Parquet format will store values from each field as consecutive arrays which can be compressed/encoded using a number of schemes.

Plain Encoding (PLAIN)

The simplest encoding scheme is the PLAIN encoding. It stores the values as they are, without any compression. The PLAIN encoding is currently the default for all types except BOOLEAN:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8', encoding: 'PLAIN' },
});

Run Length Encoding (RLE)

The Parquet hybrid run length and bitpacking encoding allows runs of numbers to be compressed very efficiently. Note that the RLE encoding can only be used in combination with the BOOLEAN, INT32 and INT64 types. The RLE encoding requires an additional bitWidth parameter that contains the maximum number of bits required to store the largest value of the field.

var schema = new parquet.ParquetSchema({
  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },
});
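
As a rough guide to picking bitWidth, the sketch below computes how many bits the largest value of a field needs, and shows the basic idea of collapsing repeated values into runs. Both `bitWidthFor` and `toRuns` are hypothetical helpers written for this example; the actual hybrid format also bit-packs short runs and has a binary layout that is not shown here.

```javascript
// Number of bits needed to represent every integer in 0..maxValue.
function bitWidthFor(maxValue) {
  if (!Number.isInteger(maxValue) || maxValue < 0) {
    throw new RangeError('maxValue must be a non-negative integer');
  }
  let width = 0;
  while (maxValue > 0) {
    width++;
    maxValue = Math.floor(maxValue / 2);
  }
  return width;
}

// Collapse consecutive repeats into [value, runLength] pairs.
function toRuns(values) {
  const runs = [];
  for (const value of values) {
    const last = runs[runs.length - 1];
    if (last && last[0] === value) {
      last[1]++;
    } else {
      runs.push([value, 1]);
    }
  }
  return runs;
}

console.log(bitWidthFor(127)); // 7, so ages up to 127 fit in bitWidth: 7
console.log(toRuns([7, 7, 7, 3, 3, 9])); // [ [ 7, 3 ], [ 3, 2 ], [ 9, 1 ] ]
```

If a field can hold a value larger than the chosen bitWidth allows, pick the width from the largest value you expect rather than from typical values.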

Optional Fields

By default, all fields are required to be present in each row. You can also mark a field as 'optional' which will let you store rows with that field missing:

var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  quantity: { type: 'INT64', optional: true },
});

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
await writer.appendRow({name: 'apples', quantity: 10 });
await writer.appendRow({name: 'banana' }); // not in stock
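
Conceptually, the Parquet format records a missing optional value through a per-row "definition level" rather than by storing a placeholder. The following sketch shows that idea for a single optional top-level column. It is a simplified illustration based on the format's general design, not the library's internal code, and `encodeOptionalColumn` is a hypothetical name.

```javascript
// Split one optional top-level column into stored values and definition levels.
// Level 1 = value present for that row, level 0 = value missing.
function encodeOptionalColumn(rows, field) {
  const values = [];
  const definitionLevels = [];
  for (const row of rows) {
    if (row[field] === undefined || row[field] === null) {
      definitionLevels.push(0); // nothing is appended to `values`
    } else {
      definitionLevels.push(1);
      values.push(row[field]);
    }
  }
  return { values, definitionLevels };
}

const encoded = encodeOptionalColumn(
  [{ name: 'apples', quantity: 10 }, { name: 'banana' }],
  'quantity'
);
console.log(encoded);
```

Only one value is stored for the two rows; the levels array is what lets a reader put the gap back in the right place.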

Nested Rows & Arrays

Parquet supports nested schemas that allow you to store rows that have a more complex structure than a simple tuple of scalar values. To declare a schema with a nested field, omit the type in the column definition and add a fields list instead.

Consider this example, which allows us to store a more advanced "fruits" table where each row contains a name, a list of colours and a list of "stock" objects.

// advanced fruits table
var schema = new parquet.ParquetSchema({
  name: { type: 'UTF8' },
  colours: { type: 'UTF8', repeated: true },
  stock: {
    repeated: true,
    fields: {
      price: { type: 'DOUBLE' },
      quantity: { type: 'INT64' },
    }
  }
});

// the above schema allows us to store the following rows:
var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');

await writer.appendRow({
  name: 'banana',
  colours: ['yellow'],
  stock: [
    { price: 2.45, quantity: 16 },
    { price: 2.60, quantity: 420 }
  ]
});

await writer.appendRow({
  name: 'apple',
  colours: ['red', 'green'],
  stock: [
    { price: 1.20, quantity: 42 },
    { price: 1.30, quantity: 230 }
  ]
});

await writer.close();

// reading nested rows with a list of explicit columns
let reader = await parquet.ParquetReader.openFile('fruits.parquet');

let cursor = reader.getCursor([['name'], ['stock', 'price']]);
let record = null;
while (record = await cursor.next()) {
  console.log(record);
}

await reader.close();

It might not be obvious why one would want to implement or use such a feature when the same can, in principle, be achieved by serializing the record using JSON (or a similar scheme) and then storing it in a UTF8 field.

Putting aside the philosophical discussion on the merits of strict typing, knowing about the structure and subtypes of all records (globally) means we do not have to duplicate this metadata (i.e. the field names) for every record. On top of that, knowing about the type of a field allows us to compress the remaining data more efficiently.
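
The point about duplicated field names can be made tangible with a rough byte count. The sketch below compares storing every record as its own JSON string against storing the field names once and only the values per record. It is a back-of-the-envelope comparison using JSON text sizes, not a measurement of Parquet's actual encoding; `jsonBytes`, `schemaAwareBytes` and `sampleRecords` are hypothetical names.

```javascript
// Total bytes when every record is serialized as its own JSON object,
// which repeats the field names in each record.
function jsonBytes(records) {
  return records.reduce(
    (sum, r) => sum + Buffer.byteLength(JSON.stringify(r), 'utf8'),
    0
  );
}

// Total bytes when field names are written once and each record
// only contributes its values, in schema order.
function schemaAwareBytes(records) {
  const names = Object.keys(records[0]);
  const header = Buffer.byteLength(JSON.stringify(names), 'utf8');
  const body = records.reduce(
    (sum, r) =>
      sum + Buffer.byteLength(JSON.stringify(names.map((n) => r[n])), 'utf8'),
    0
  );
  return header + body;
}

const sampleRecords = Array.from({ length: 100 }, (_, i) => ({
  name: 'apple',
  price: 1.2,
  quantity: i,
}));

console.log(jsonBytes(sampleRecords), schemaAwareBytes(sampleRecords));
```

The gap grows with the number of records and with the length of the field names, and that is before any type-aware compression is applied to the values.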

Nested Lists for Hive / Athena

Lists have to be annotated to be queryable with AWS Athena. See parquet-format for more detail and a full working example with comments in the test directory (test/list.js).

List of Supported Types & Encodings

We aim to be feature-complete and add new features as they are added to the Parquet specification; this is the list of currently implemented data types and encodings:

Buffering & Row Group Size

When writing a Parquet file, the ParquetWriter will buffer rows in memory until a row group is complete (or close() is called) and then write out the row group to disk.

The size of a row group is configurable by the user and controls the maximum number of rows that are buffered in memory at any given time as well as the number of rows that are co-located on disk:

var writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');
writer.setRowGroupSize(8192);
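
The buffering behaviour described above can be sketched with a small stand-in class that collects rows and hands them off whenever a group is full. `RowGroupBuffer` and its `onFlush` callback are hypothetical and only mimic the batching; the real writer also encodes the rows and writes them to disk.

```javascript
// Hypothetical buffer that mimics row-group batching; not the library's implementation.
class RowGroupBuffer {
  constructor(rowGroupSize, onFlush) {
    this.rowGroupSize = rowGroupSize;
    this.onFlush = onFlush; // called with each completed batch of rows
    this.rows = [];
  }

  appendRow(row) {
    this.rows.push(row);
    // A full group is handed off immediately, which bounds memory use.
    if (this.rows.length >= this.rowGroupSize) this.flush();
  }

  flush() {
    if (this.rows.length === 0) return;
    this.onFlush(this.rows);
    this.rows = [];
  }

  // Closing flushes whatever partial group is still buffered.
  close() {
    this.flush();
  }
}

const flushedSizes = [];
const rowGroupBuffer = new RowGroupBuffer(3, (rows) => flushedSizes.push(rows.length));
for (let i = 0; i < 7; i++) rowGroupBuffer.appendRow({ id: i });
rowGroupBuffer.close();

console.log(flushedSizes); // [ 3, 3, 1 ]
```

A larger row group size means fewer, bigger groups on disk at the cost of holding more rows in memory before each flush.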

Dependencies

Parquet uses thrift to encode the schema and other metadata, but the actual data does not use thrift.

Notes

Currently parquet-cpp doesn't fully support DATA_PAGE_V2. You can work around this by setting the useDataPageV2 option to false.