@bluehemoth/csvjsonify
v1.2.0
A package which handles data transformation from CSV format to JSON format.
csv2json
Description
A simple package for loading CSV data, transforming it to JSON format, and outputting the transformed data.
Usage
Description
A simple package which transforms data from CSV to JSON format.
Options
--sourceFile Absolute path of the file to be transformed.
--resultFile Absolute path of the file where the transformed data will be stored.
--separator The symbol used in the source file to separate values. Must be one of , | ; \t (tab).
Defaults to a comma if not provided.
Examples
csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json" --separator ","
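Starting with v1.2.0 the separator is autodetected when --separator is omitted (see the changelog), so the following shorter invocation should also work:
csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json"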
Environment variables
TRANSFORMER_CHOICE Feature flag which indicates which data transformer should be used
GOOGLE_DRIVE_STORAGE Feature flag which enables the upload of transformation result to google drive
GOOGLE_APPLICATION_CREDENTIALS_FILE Name of the Google api service account key file
SHARED_FOLDER_ID ID of the Google Drive folder that is shared with the Google API service account; the transformation result will be uploaded to this folder
DATA_DIR Absolute path of the test data directory
CREDENTIALS_DIR Absolute path of the folder which contains the credentials file
SOURCE_FILE Name of the source file
RESULT_FILE Name of the results file
LOGGING_LEVEL Application logging level
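For illustration, a .env file following env.example might look like the sketch below. All values are placeholders, and the LOGGING_LEVEL value is an assumption since the accepted levels are not documented here:
TRANSFORMER_CHOICE=optimized_csv
GOOGLE_DRIVE_STORAGE=disabled
GOOGLE_APPLICATION_CREDENTIALS_FILE=credentials.json
SHARED_FOLDER_ID=<your shared folder id>
DATA_DIR=/home/user/csv2json/testData
CREDENTIALS_DIR=/home/user/csv2json/credentials
SOURCE_FILE=source.csv
RESULT_FILE=result.json
LOGGING_LEVEL=info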
Feature flags
The package allows the customization of its operation via feature flags set in the .env file. The package supports these flags:
TRANSFORMER_CHOICE:
description: Decides which transformer will be used to transform the piped data
values:
legacy_csv: Transformer which transforms CSV to JSON by building a JSON string via a simple forEach loop
optimized_csv: Transformer which extends the legacy transformer and builds JSON strings via the .reduce() method
GOOGLE_DRIVE_STORAGE:
description: Decides if the transformed file should be stored in google drive
values:
enabled: Enables the storage service
disabled: Disables the storage service
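As an illustration of the difference between the two transformer choices, here is a minimal sketch (not the package's actual code) of building one JSON object string from a parsed CSV line with forEach versus .reduce():

// Illustrative sketch only; field and value names are invented.
const fields = ['id', 'name'];
const values = ['1', 'Ada'];

// legacy_csv style: append to the string inside a forEach loop
function buildLegacy(fields, values) {
  let json = '{';
  fields.forEach((field, i) => {
    json += `"${field}":"${values[i]}"` + (i < fields.length - 1 ? ',' : '');
  });
  return json + '}';
}

// optimized_csv style: fold the fields into the string with .reduce()
function buildOptimized(fields, values) {
  const body = fields.reduce(
    (acc, field, i) => acc + `"${field}":"${values[i]}"` + (i < fields.length - 1 ? ',' : ''),
    ''
  );
  return '{' + body + '}';
}

console.log(buildLegacy(fields, values));    // {"id":"1","name":"Ada"}
console.log(buildOptimized(fields, values)); // {"id":"1","name":"Ada"}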
Google drive storage requirements
To use the Google Drive storage service the user must provide the required authentication credentials. The following steps describe the authentication process:
- First, follow the provided steps and create a service account and a key assigned to this account
- After key creation, a credentials file should be automatically downloaded to your system - move this file to the root directory of this package
- Assign the GOOGLE_APPLICATION_CREDENTIALS_FILE environment variable the path of the credentials file (relative to the root directory of the package)
- Create a Google Drive folder and share it with the service account (Share -> type the service account email -> Editor -> Done)
- Copy the ID of the shared folder and save its value in the SHARED_FOLDER_ID environment variable
- Set the GOOGLE_DRIVE_STORAGE environment variable to enabled
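For reference, a minimal sketch of what the upload step might look like with the googleapis client, using the environment variables above; the package's actual uploadToGoogleDrive.js may differ:

// Sketch only: assumes `npm install googleapis` and the .env values above.
const fs = require('fs');
const { google } = require('googleapis');

async function uploadResult(filePath) {
  // Authenticate with the service account key file named in the .env file
  const auth = new google.auth.GoogleAuth({
    keyFile: process.env.GOOGLE_APPLICATION_CREDENTIALS_FILE,
    scopes: ['https://www.googleapis.com/auth/drive.file'],
  });
  const drive = google.drive({ version: 'v3', auth });
  // Upload the result file into the shared folder
  await drive.files.create({
    requestBody: {
      name: process.env.RESULT_FILE,
      parents: [process.env.SHARED_FOLDER_ID],
    },
    media: { mimeType: 'application/json', body: fs.createReadStream(filePath) },
  });
}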
Running the docker image
Since release v1.2.0 the package can be run in Docker containers. To run the package in a container, follow these steps:
- Create a .env file in the package directory by following the env.example file and the descriptions of the environment variables in the Environment variables section
- Run docker-compose up --build to build and run the package container
- After a successful run, the transformed result will be available in the directory specified in the DATA_DIR environment variable
Note: the built image is also available here
You can run this image via docker run with the following command:
sudo docker run -v <absolute path of source/result files directory>:/app/testData -v <absolute path of credentials directory>:/app/credentials --env-file <relative path to env> mind33z/csv2json:<version> npm run start -- --sourceFile "/app/testData/<source file name>" --resultFile "/app/testData/<result file name>"
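For example, with hypothetical host paths and file names and the v1.2.0 tag, the command might look like:
sudo docker run -v /home/user/data:/app/testData -v /home/user/credentials:/app/credentials --env-file .env mind33z/csv2json:1.2.0 npm run start -- --sourceFile "/app/testData/source.csv" --resultFile "/app/testData/result.json"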
Benchmarks
During performance measuring, two metrics were tracked: execution time and memory usage. The screenshots below demonstrate the results of converting a sample 0.8 MB test file and the full 13 GB bloated test file.
V1.1
For this version, only the execution time metric was tracked, as the results of the previous version showed that there was no need to optimise memory usage. The first screenshot shows the results of the test that was run after the _buildJSONStringFromLine function was enhanced. The second screenshot shows the results of the testing after the code in the _transform function was converted to asynchronous execution. Both tests were done with the 13 GB bloated data file.
The enhancement of _buildJSONStringFromLine had a positive influence on the execution time: the total time of the function decreased by roughly 10x, which led to the total runtime decreasing by roughly 30 seconds. Converting _transform had a detrimental effect on the package runtime: the total time of each key transform function (except _buildJSONStringFromLine) increased by 2x. This may have happened because the event loop received too many trivial task promises. Only the _buildJSONStringFromLine enhancement will be carried over into later versions.
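To illustrate the effect described above, here is a hypothetical micro-benchmark (not the package's code) comparing a plain synchronous loop with a promise-per-line variant:

// Hypothetical illustration: many trivial per-line operations wrapped in
// promises add scheduling overhead compared to a synchronous loop.
const lines = Array.from({ length: 1_000_000 }, (_, i) => `name${i},${i}`);

function buildJSONStringFromLine(line) {
  const [name, value] = line.split(',');
  return `{"name":"${name}","value":"${value}"}`;
}

console.time('synchronous loop');
let out = '';
for (const line of lines) out += buildJSONStringFromLine(line);
console.timeEnd('synchronous loop');

console.time('promise per line');
Promise.all(lines.map(async (line) => buildJSONStringFromLine(line))).then((parts) => {
  parts.join('');
  console.timeEnd('promise per line');
});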
V1.0
Execution time
Sample data (0.8 MB):
Bloat data (13 GB):
The results of the profiler show that the functions _buildJSONStringFromLine, _removeEscapeSlashes, and _splitLineToArr influence the execution time the most (apart from Node's own functions). It should be noted that on the bloated dataset the _splitLineToArr method overtakes the _buildJSONStringFromLine method in terms of execution time. The following releases should prioritize improving the highlighted methods.
Memory
Sample data (0.8 MB):
Bloat data (13 GB):
The results of memory tracking show that even though the package has to process large amounts of data, the memory used remains roughly the same. This can be attributed to the use of streams. No further improvements in memory usage are required.
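The flat memory profile comes from the ReadStream -> transform -> WriteStream pipe described in the changelog. Below is a simplified sketch of that architecture; the constructor signature and the newline-delimited output are assumptions for illustration, and the real CsvToJsonStream differs:

// Simplified sketch of a chunked CSV-to-JSON transform stream.
const fs = require('fs');
const { Transform } = require('stream');

class CsvToJsonSketch extends Transform {
  constructor(separator = ',') {
    super();
    this.separator = separator;
    this.header = null;
    this.remainder = ''; // carries an incomplete line over to the next chunk
  }

  _transform(chunk, encoding, callback) {
    const lines = (this.remainder + chunk.toString()).split('\n');
    this.remainder = lines.pop(); // the last piece may be an incomplete line
    for (const line of lines) {
      if (!line) continue; // skip empty lines
      if (this.header === null) {
        this.header = line.split(this.separator); // first line holds the field names
        continue;
      }
      const values = line.split(this.separator);
      const obj = Object.fromEntries(this.header.map((h, i) => [h, values[i]]));
      this.push(JSON.stringify(obj) + '\n'); // newline-delimited JSON for simplicity
    }
    callback();
  }
}

fs.createReadStream('source.csv')
  .pipe(new CsvToJsonSketch())
  .pipe(fs.createWriteStream('result.json'));

Because only one chunk plus the current remainder is held in memory at a time, the input file size does not affect memory usage.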
Changelog
v1.2.0 - (2022-10-24)
Added
- Upload to Google Drive functionality
- Autodetect separator if no --separator argument is provided
- docker-compose.yml, Dockerfile, and .dockerignore files
- Workflow job for building a docker image from the project and pushing it to DockerHub
- CSV to JSON transformer tests and a workflow that runs these tests on push
- Custom logger implementation
Updated
- Added Feature flags, Google drive storage requirements, and Running the docker image sections to the README.md file
- Fixed a separator symbol bug in the optimized JSON building method
- Fixed JSON formatting issues
v1.1.0 - (2022-10-19)
Added
- Feature flag toggling functionality via .env
- CsvToJsonOptimizedStream, a transform stream class which acts as an improved iteration of the previous transform stream
- Refactored the project structure
- TransformerFactory, a factory which handles the creation of different transformers
Updated
- Added benchmarks of the current version to benchmarks section in README.md
v1.0.0 - (2022-10-18)
Added
- Implemented the CsvToJsonStream class. This class:
  - Transforms CSV to JSON data in chunks
  - Handles the case of a chunk having an incomplete line
  - Checks if the CSV line was parsed into an array correctly and that no unescaped separators were used in the data itself
- Measured the execution time and the memory usage of the converter when using the bloated 13 GB data file and the sample 0.8 MB test data file
- Created a pipe out of ReadStream, CsvToJsonStream, and WriteStream and achieved the basic functionality of the package
Updated
- README.md
v0.1.1 - (2022-10-14)
Added
- Input Handling
- GitHub Actions workflow (on release: bump if the tag and package version mismatch, and publish to npm)
- Test data file generation function
- README.md
Package structure
- index.js contains the main code of the package
- handleArgs.js contains logic related to handling input arguments
- generate.js contains test data file generation logic
- transformers/CsvToJsonStream.js contains the extended transform class used for transforming data from CSV format to JSON
- transformers/CsvToJsonOptimizedStream.js contains the enhanced transform methods of the CsvToJsonStream class
- factories/TransformerFactory.js contains a factory which handles the creation of different transformers (see the sketch after this list)
- uploadToGoogleDrive.js contains a function which uploads the transformed result file to the shared folder provided in the .env file
- tests/ directory contains the tests of the package and a custom test runner
- utils/Logger.js contains a custom logger implementation
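A hypothetical sketch of how such a factory might map the TRANSFORMER_CHOICE flag to a transformer, based on the structure above; the constructor arguments and the exact mapping are assumptions:

// Sketch of factories/TransformerFactory.js; the real implementation may differ.
const CsvToJsonStream = require('./transformers/CsvToJsonStream');
const CsvToJsonOptimizedStream = require('./transformers/CsvToJsonOptimizedStream');

class TransformerFactory {
  // `choice` comes from the TRANSFORMER_CHOICE feature flag
  static create(choice, separator) {
    switch (choice) {
      case 'optimized_csv':
        return new CsvToJsonOptimizedStream(separator);
      case 'legacy_csv':
      default:
        return new CsvToJsonStream(separator);
    }
  }
}

module.exports = TransformerFactory;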