@bluehemoth/csvjsonify
v1.2.0
A package which handles data transformation from CSV format to JSON format.
csv2json
Description
A simple package for loading CSV data, transforming it to JSON format, and outputting the transformed data.
Usage
Description
A simple package which transforms data from CSV to JSON format.
Options
--sourceFile Absolute path of the file to be transformed.
--resultFile Absolute path of the file where the transformed data will be stored.
--separator The symbol used in the source file to separate values. Must be one of , | ; \t (tab).
Defaults to a comma if not provided.
Examples
csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json" --separator ","
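Starting with v1.2.0 the separator is autodetected when --separator is omitted (see the changelog), so the following shorter invocation should also work:
csvToJson --sourceFile "D:\source.csv" --resultFile "D:\result.json"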
Environment variables
TRANSFORMER_CHOICE Feature flag which indicates which data transformer should be used
GOOGLE_DRIVE_STORAGE Feature flag which enables the upload of transformation result to google drive
GOOGLE_APPLICATION_CREDENTIALS_FILE Name of the Google api service account key file
SHARED_FOLDER_ID ID of the Google Drive folder that is shared with the Google API service account; the transformation result will be uploaded to this folder
DATA_DIR Absolute path of the test data directory
CREDENTIALS_DIR Absolute path of the folder which contains the credentials file
SOURCE_FILE Name of the source file
RESULT_FILE Name of the results file
LOGGING_LEVEL Application logging level
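For illustration, a .env file following env.example might look like the sketch below. All values are placeholders, and the LOGGING_LEVEL value is an assumption since the accepted levels are not documented here:
TRANSFORMER_CHOICE=optimized_csv
GOOGLE_DRIVE_STORAGE=disabled
GOOGLE_APPLICATION_CREDENTIALS_FILE=credentials.json
SHARED_FOLDER_ID=<your shared folder id>
DATA_DIR=/home/user/csv2json/testData
CREDENTIALS_DIR=/home/user/csv2json/credentials
SOURCE_FILE=source.csv
RESULT_FILE=result.json
LOGGING_LEVEL=info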
Feature flags
The package allows the customization of its operation via feature flags set in the .env file. The package supports these flags:
TRANSFORMER_CHOICE:
description: Decides which transformer will be used to transform the piped data
values:
legacy_csv: Transformer which transforms CSV to JSON by building a JSON string via a simple forEach loop
optimized_csv: Transformer which extends the legacy transformer and builds JSON strings via the .reduce() method
GOOGLE_DRIVE_STORAGE:
description: Decides if the transformed file should be stored in google drive
values:
enabled: Enables the storage service
disabled: Disables the storage service
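As an illustration of the difference between the two transformer choices, here is a minimal sketch (not the package's actual code) of building one JSON object string from a parsed CSV line with forEach versus .reduce():

// Illustrative sketch only; field and value names are invented.
const fields = ['id', 'name'];
const values = ['1', 'Ada'];

// legacy_csv style: append to the string inside a forEach loop
function buildLegacy(fields, values) {
  let json = '{';
  fields.forEach((field, i) => {
    json += `"${field}":"${values[i]}"` + (i < fields.length - 1 ? ',' : '');
  });
  return json + '}';
}

// optimized_csv style: fold the fields into the string with .reduce()
function buildOptimized(fields, values) {
  const body = fields.reduce(
    (acc, field, i) => acc + `"${field}":"${values[i]}"` + (i < fields.length - 1 ? ',' : ''),
    ''
  );
  return '{' + body + '}';
}

console.log(buildLegacy(fields, values));    // {"id":"1","name":"Ada"}
console.log(buildOptimized(fields, values)); // {"id":"1","name":"Ada"}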
Google drive storage requirements
To use the Google Drive storage service the user must provide the required authentication credentials. The following steps describe the authentication process:
- First, follow the provided steps and create a service account and a key assigned to this account
- After key creation, a credentials file should be automatically downloaded to your system - move this file to the root directory of this package
- Assign the GOOGLE_APPLICATION_CREDENTIALS_FILE environment variable the path of the credentials file (relative to the root directory of the package)
- Create a Google Drive folder and share it with the service account (Share -> type the service account email -> Editor -> Done)
- Copy the ID of the shared folder and save its value in the SHARED_FOLDER_ID environment variable
- Set the GOOGLE_DRIVE_STORAGE environment variable to enabled
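For reference, a minimal sketch of what the upload step might look like with the googleapis client, using the environment variables above; the package's actual uploadToGoogleDrive.js may differ:

// Sketch only: assumes `npm install googleapis` and the .env values above.
const fs = require('fs');
const { google } = require('googleapis');

async function uploadResult(filePath) {
  // Authenticate with the service account key file named in the .env file
  const auth = new google.auth.GoogleAuth({
    keyFile: process.env.GOOGLE_APPLICATION_CREDENTIALS_FILE,
    scopes: ['https://www.googleapis.com/auth/drive.file'],
  });
  const drive = google.drive({ version: 'v3', auth });
  // Upload the result file into the shared folder
  await drive.files.create({
    requestBody: {
      name: process.env.RESULT_FILE,
      parents: [process.env.SHARED_FOLDER_ID],
    },
    media: { mimeType: 'application/json', body: fs.createReadStream(filePath) },
  });
}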
Running the docker image
Since release v1.2.0 the package can be run in Docker containers. To run the package in a container, follow these steps:
- Create a .env file in the package directory by following the env.example file and the descriptions of the environment variables in the Environment variables section
- Run docker-compose up --build to build and run the package container
- After a successful run, the transformed result will be available in the directory specified in the DATA_DIR environment variable
Note: the built image is also available here
You can run this image via docker run with the following command:
sudo docker run -v <absolute path of source/result files directory>:/app/testData -v <absolute path of credentials directory>:/app/credentials --env-file <relative path to env> mind33z/csv2json:<version> npm run start -- --sourceFile "/app/testData/<source file name>" --resultFile "/app/testData/<result file name>"
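For example, with hypothetical host paths and file names and the v1.2.0 tag, the command might look like:
sudo docker run -v /home/user/data:/app/testData -v /home/user/credentials:/app/credentials --env-file .env mind33z/csv2json:1.2.0 npm run start -- --sourceFile "/app/testData/source.csv" --resultFile "/app/testData/result.json"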
Benchmarks
During performance measuring, two metrics were tracked: execution time and memory usage. The screenshots below demonstrate the results of converting a sample 0.8 MB test file and the full 13 GB bloated test file.
V1.1
For this version, only the execution time metric was tracked, as the results of the previous version showed that there was no need to optimise memory usage. The first screenshot shows the results of the test that was run after the _buildJSONStringFromLine function was enhanced. The second screenshot shows the results of the testing after the code in the _transform function was converted to asynchronous execution. Both tests were done with the 13 GB bloated data file.
The enhancement of _buildJSONStringFromLine had a positive influence on the execution time: the total time of the function decreased by roughly 10x, which led to the total runtime decreasing by roughly 30 seconds. Converting _transform had a detrimental effect on the package runtime: the total time of each key transform function (except _buildJSONStringFromLine) increased by 2x. This may have happened because the event loop received too many trivial task promises. Only the _buildJSONStringFromLine enhancement will be carried over into later versions.
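To illustrate the effect described above, here is a hypothetical micro-benchmark (not the package's code) comparing a plain synchronous loop with a promise-per-line variant:

// Hypothetical illustration: many trivial per-line operations wrapped in
// promises add scheduling overhead compared to a synchronous loop.
const lines = Array.from({ length: 1_000_000 }, (_, i) => `name${i},${i}`);

function buildJSONStringFromLine(line) {
  const [name, value] = line.split(',');
  return `{"name":"${name}","value":"${value}"}`;
}

console.time('synchronous loop');
let out = '';
for (const line of lines) out += buildJSONStringFromLine(line);
console.timeEnd('synchronous loop');

console.time('promise per line');
Promise.all(lines.map(async (line) => buildJSONStringFromLine(line))).then((parts) => {
  parts.join('');
  console.timeEnd('promise per line');
});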
V1.0
Execution time
Sample data (0.8 MB):
Bloat data (13 GB):
The results of the profiler show that the functions _buildJSONStringFromLine, _removeEscapeSlashes, and _splitLineToArr influence the execution time the most (apart from Node's own functions). It should be noted that on the bloated dataset the _splitLineToArr method overtakes the _buildJSONStringFromLine method in terms of execution time. The following releases should prioritize improving the highlighted methods.
Memory
Sample data (0.8 MB):
Bloat data (13 GB):
The results of memory tracking show that even though the package has to process large amounts of data, the memory used remains roughly the same. This can be attributed to the use of streams. No further improvements in memory usage are required.
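The flat memory profile comes from the ReadStream -> transform -> WriteStream pipe described in the changelog. Below is a simplified sketch of that architecture; the constructor signature and the newline-delimited output are assumptions for illustration, and the real CsvToJsonStream differs:

// Simplified sketch of a chunked CSV-to-JSON transform stream.
const fs = require('fs');
const { Transform } = require('stream');

class CsvToJsonSketch extends Transform {
  constructor(separator = ',') {
    super();
    this.separator = separator;
    this.header = null;
    this.remainder = ''; // carries an incomplete line over to the next chunk
  }

  _transform(chunk, encoding, callback) {
    const lines = (this.remainder + chunk.toString()).split('\n');
    this.remainder = lines.pop(); // the last piece may be an incomplete line
    for (const line of lines) {
      if (!line) continue; // skip empty lines
      if (this.header === null) {
        this.header = line.split(this.separator); // first line holds the field names
        continue;
      }
      const values = line.split(this.separator);
      const obj = Object.fromEntries(this.header.map((h, i) => [h, values[i]]));
      this.push(JSON.stringify(obj) + '\n'); // newline-delimited JSON for simplicity
    }
    callback();
  }
}

fs.createReadStream('source.csv')
  .pipe(new CsvToJsonSketch())
  .pipe(fs.createWriteStream('result.json'));

Because only one chunk plus the current remainder is held in memory at a time, the input file size does not affect memory usage.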
Changelog
v1.2.0 - (2022-10-24)
Added
- Upload to Google Drive functionality
- Autodetect separator if no --separator argument is provided
- docker-compose.yml, Dockerfile, and .dockerignore files
- Workflow job for building a docker image from the project and pushing it to DockerHub
- CSV to JSON transformer tests and a workflow that runs these tests on push
- Custom logger implementation
Updated
- Added Feature flags, Google drive storage requirements, and Running the docker image sections to the README.md file
- Fixed a separator symbol bug in the optimized JSON building method
- Fixed JSON formatting issues
v1.1.0 - (2022-10-19)
Added
- Feature flag toggling functionality via .env
- CsvToJsonOptimizedStream, a transform stream class which acts as an improved iteration of the previous transform stream
- Refactored the project structure
- TransformerFactory, a factory which handles the creation of different transformers
Updated
- Added benchmarks of the current version to benchmarks section in README.md
v1.0.0 - (2022-10-18)
Added
- Implemented the CsvToJsonStream class. This class:
  - Transforms CSV to JSON data in chunks
  - Handles the case of a chunk having an incomplete line
  - Checks if the CSV line was parsed into an array correctly and that no unescaped separators were used in the data itself
- Measured the execution time and the memory usage of the converter when using the bloated 13 GB data file and the sample 0.8 MB test data file
- Created a pipe out of ReadStream, CsvToJsonStream, and WriteStream and achieved the basic functionality of the package
Updated
- README.md
v0.1.1 - (2022-10-14)
Added
- Input Handling
- GitHub Actions workflow (on release: bump if the tag and package version mismatch, and publish to npm)
- Test data file generation function
- README.md
Package structure
- index.js contains the main code of the package
- handleArgs.js contains logic related to handling input arguments
- generate.js contains test data file generation logic
- transformers/CsvToJsonStream.js contains the extended transform class used for transforming data from CSV format to JSON
- transformers/CsvToJsonOptimizedStream.js contains the enhanced transform methods of the CsvToJsonStream class
- factories/TransformerFactory.js contains a factory which handles the creation of different transformers (see the sketch after this list)
- uploadToGoogleDrive.js contains a function which uploads the transformed result file to the shared folder provided in the .env file
- tests/ directory contains the tests of the package and a custom test runner
- utils/Logger.js contains a custom logger implementation
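A hypothetical sketch of how such a factory might map the TRANSFORMER_CHOICE flag to a transformer, based on the structure above; the constructor arguments and the exact mapping are assumptions:

// Sketch of factories/TransformerFactory.js; the real implementation may differ.
const CsvToJsonStream = require('./transformers/CsvToJsonStream');
const CsvToJsonOptimizedStream = require('./transformers/CsvToJsonOptimizedStream');

class TransformerFactory {
  // `choice` comes from the TRANSFORMER_CHOICE feature flag
  static create(choice, separator) {
    switch (choice) {
      case 'optimized_csv':
        return new CsvToJsonOptimizedStream(separator);
      case 'legacy_csv':
      default:
        return new CsvToJsonStream(separator);
    }
  }
}

module.exports = TransformerFactory;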