dbcp
v1.4.10
Dump MySQL, Postgres, SQL Server, and ElasticSearch databases directly to AWS S3, Google Cloud Storage, or Azure.
$ dbcp --help
cli.js [inputFile] [outputFile]
$ dbcp data.parquet data.jsonl.gz
$ dbcp data.jsonl.gz s3://bucket/data.csv.gz
Copy from or to MySQL, PostgreSQL, SQL Server, LevelDB, MongoDB, and ElasticSearch, directly to/from files on Amazon Web Services (AWS) S3, Google Cloud Storage (GCS), Microsoft Azure, SMB, HTTP, or another database.
Automatically converts between the supported formats JSON, ND-JSON, CSV, SQL, Parquet, and TFRecord (with optional gzip compression).
CLI
Either --inputType or --inputFile, and --outputType or --outputFile, are required. Other options can be shortened, e.g. --user instead of --inputUser. Only a database-to-database copy requires both --inputUser and --outputUser. The file format and compression are inferred from the filename.
dbcp pipes Readable Node.js streams to Writable streams. No intermediate storage is required.
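Filename-based inference can be pictured as extension parsing: strip a trailing ".gz" to detect compression, then read the remaining extension as the format. The helper below is a hypothetical sketch of that idea (dbcp's real logic lives in its file-system layer), not the actual implementation:

```javascript
// Hypothetical sketch of filename-based format/compression inference.
// The function name and return shape are illustrative, not dbcp's API.
function inferFormat(filename) {
  let name = filename
  let gzip = false
  if (name.endsWith('.gz')) {
    gzip = true
    name = name.slice(0, -3) // strip ".gz"
  }
  const ext = name.slice(name.lastIndexOf('.') + 1)
  // ".jsonl" and ".ndjson" both denote newline-delimited JSON
  const format = ext === 'ndjson' ? 'jsonl' : ext
  return { format, gzip }
}

console.log(inferFormat('data.jsonl.gz')) // { format: 'jsonl', gzip: true }
console.log(inferFormat('s3://bucket/data.csv')) // { format: 'csv', gzip: false }
```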
API
- The transformObject / transformObjectStream API can be used for streaming transforms of Big Data.
- The CLI uses transformBytes / transformBytesStream to render progress updates.
Features
- dbcp supports sharding. It can split or join groups of files.
- dbcp can convert files from one format to another.
- dbcp supports compound inserts, which can insert groups of associated rows from multiple tables.
- dbcp can translate SQL dialects, e.g. dump a Postgres table to a .sql file with SQL Server CREATE and INSERT syntax.
Credits
- Database powered by knex and knex-schema-inspector
- File system, file formats, and sharding provided by @wholebuzz/fs
- External file sorting with external-sorting
- Connected using the tree-stream primitives ReadableStreamTree and WritableStreamTree
- Used to implement @wholebuzz/mapreduce
Setup
Global install
$ npm install -g dbcp
$ dbcp --help
Local setup
$ npm init
$ npm install dbcp
$ ./node_modules/.bin/dbcp --help
Examples
API
Write object stream to any source and format
import { AnyFileSystem } from '@wholebuzz/fs/lib/fs'
import { LocalFileSystem } from '@wholebuzz/fs/lib/local'
import { S3FileSystem } from '@wholebuzz/fs/lib/s3'
import { dbcp } from 'dbcp'
import StreamTree from 'tree-stream'
const fileSystem = new AnyFileSystem([
{ urlPrefix: 's3://', fs: new S3FileSystem() },
{ urlPrefix: '', fs: new LocalFileSystem() }
])
await dbcp({
fileSystem,
outputFile: 's3://foo/bar.jsonl',
// e.g. from a level (https://www.npmjs.com/package/level) database, where
// levelIteratorStream wraps the iterator in a Readable stream
inputStream: StreamTree.readable(levelIteratorStream(leveldb.iterator())),
})
Read object stream from any source and format
import { openNullWritable } from '@wholebuzz/fs/lib/stream'
import { dbcp, DatabaseCopyFormat } from 'dbcp'
import { Transform } from 'stream'
import StreamTree from 'tree-stream'
// Supply transformObject and a do-nothing Writable for outputStream.
await dbcp({
fileSystem,
inputFiles: [ { url: '/tmp/foobar.csv.gz' } ],
outputStream: [ openNullWritable() ],
transformObject: (x) => { console.log('test', x) },
})
// Or alternatively supply outputStream with outputFormat = object
await dbcp({
fileSystem,
inputFiles: [ { url: '/tmp/foobar.csv.gz' } ],
// Without outputFormat = object, transform() would receive Buffer
outputFormat: DatabaseCopyFormat.object,
outputStream: [
StreamTree.writable(new Transform({
objectMode: true,
transform(data, _, cb) {
console.log('test', data)
cb()
},
}))
],
})
CLI
- Copy PostgreSQL table to Google Cloud Storage gzipped JSON file
- Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file
- Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table
- Copy MongoDB table to four gzipped JSON-Lines shards
- Copy SQLServer table to stdout
- Output a file or database to stdout
- Copy a file from AWS to GCP
- Convert file from ND-JSON to JSON
- Download a file
- Post a file to HTTP endpoint
- Create Athena DDL from JSON sample
- Create Postgres CREATE TABLE from JSON sample
- Split the test data file into four shards
- Join the split files back into one
Tested
PASS src/index.test.ts (85.9 s)
✓ Should hash test data as string
✓ Should hash test data stream
✓ Should copy local file
✓ Should read local directory
✓ Should convert to JSON from ND-JSON and back
✓ Should convert to sharded JSON from ND-JSON and back
✓ Should convert to Parquet from ND-JSON and back
✓ Should convert to TFRecord from ND-JSON and back
✓ Should load to level from ND-JSON and dump to JSON after external sort
✓ Should restore to and dump compound data
✓ Should restore to and dump from Elastic Search to ND-JSON
✓ Should restore to and dump from MongoDB to ND-JSON
✓ Should restore to and dump from Postgres to ND-JSON
✓ Should restore to and dump from Postgres to SQL
✓ Should not hang on error
✓ Should copy from Postgres to Mysql
✓ Should copy from Postgres to SQL Server
✓ Should dump from Postgres to Parquet file
✓ Should dump from MySQL to Parquet file
✓ Should dump from SQL Server to Parquet file
API Interface
export async function dbcp(args: DatabaseCopyOptions)
export interface DatabaseCopyOptions {
batchSize?: number
columnType?: Record<string, string>
compoundInsert?: boolean
contentType?: string
copySchema?: DatabaseCopySchema
engineOptions?: any
externalSortBy?: string[]
extra?: Record<string, any>
extraOutput?: boolean
fileSystem?: FileSystem
group?: boolean
groupLabels?: boolean
limit?: number
orderBy?: string[]
probeBytes?: number
query?: string
shardBy?: string
schema?: Column[]
schemaFile?: string
tempDirectories?: string[]
transformObject?: (x: unknown) => unknown | Promise<unknown>
transformObjectStream?: () => Duplex
transformBytes?: (x: string) => string
transformBytesStream?: () => Duplex
where?: Array<string | any[]>
}
CLI Options
$ dbcp --help
cli.js [inputFile] [outputFile]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--compoundInsert Compound insert mode can insert associated rows from
multiple tables. [boolean]
--contentType Content type [string]
--dataOnly Dump only the data, not the schema (data definitions).
[boolean]
--dbname Database [string]
--externalSortBy Sort data by property(s) with external-sorting [array]
--format
[choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
"sql"]
  --group            Group inputs with equivalent orderBy           [boolean]
--host Database host [string]
--inputFile Input file [array]
--inputFormat
[choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
"sql"]
--inputHost Input host [string]
--inputName Input database [string]
--inputPassword Input database password [string]
--inputPort Input database port [string]
--inputShards Input shards [number]
--inputTable Input database table [string]
--inputType Input database type
[string] [choices: "athena", "elasticsearch", "file", "http", "level",
"mongodb", "mssql", "mysql", "postgresql", "redis", "sqlite"]
--inputUser Input database user [string]
--limit Database query LIMIT [number]
--orderBy Database query ORDER BY [array]
--outputFile Output file [string]
--outputFormat
[choices: "csv", "json", "jsonl", "ndjson", "object", "parquet", "tfrecord",
"sql"]
--outputHost Output host [string]
--outputName Output database [string]
--outputPassword Output database password [string]
--outputPort Output database port [string]
--outputShards Output shards [number]
--outputTable Output database table [string]
--outputType Output database type
[string] [choices: "athena", "elasticsearch", "file", "http", "level",
"mongodb", "mssql", "mysql", "postgresql", "redis", "sqlite"]
--outputUser Output database user [string]
--password Database password [string]
--port Database port [string]
--probeBytes Probe bytes [number]
--query Query [string]
--schemaFile Use schema file if required, instead of schema inspection.
[string]
--schemaOnly Dump only the object definitions (schema), not data.
[boolean]
--shardBy Shard (or split) the data based on key [string]
--shards The number of shards to split or join the data [number]
--table Database table [string]
--user Database user [string]
--where Database query WHERE [array]
  --whereDate        Database query WHERE, final argument parsed as JavaScript
                     date                                             [array]
CLI Examples
Copy PostgreSQL table to Google Cloud Storage gzipped JSON file
$ dbcp \
--inputType postgresql \
--host localhost \
--dbname postgres \
--port 5433 \
--user postgres \
--password postgres \
--table foobar \
--outputFile gs://bucket/file.json.gz
Copy MySQL table to Amazon Web Services S3 gzipped JSON-Lines file
$ dbcp \
--inputType mysql \
--host localhost \
--dbname mydb \
--port 8083 \
--user root \
--password wp \
--table foobar \
--format jsonl \
--outputFile s3://bucket/object.jsonl.gz
Copy Amazon Web Services S3 gzipped JSON-Lines to MySQL table
$ dbcp \
--outputType mysql \
--host localhost \
--dbname mydb \
--port 8083 \
--user root \
--password wp \
--table foobar \
--inputFile s3://bucket/object.jsonl.gz
Copy SQLServer table to stdout
$ dbcp \
--inputType mssql \
--host localhost \
--dbname mymsdb \
--port 1433 \
--user SA \
--password "MyP@ssw0rd#" \
--table foobar \
--outputFile=-
Copy MongoDB table to four gzipped JSON-Lines shards
$ dbcp \
--inputType mongodb \
--host localhost \
--port 27017 \
--user root \
--password example \
--dbname test_db \
--table dbcptest \
--outputFile output-SSSS-of-NNNN.jsonl.gz \
--outputShards 4 \
--shardBy id
$ ls output*
-rw-r--r-- 1 user staff 782701 Feb 4 10:59 output-0001-of-0004.jsonl.gz
-rw-r--r-- 1 user staff 771980 Feb 4 10:59 output-0003-of-0004.jsonl.gz
-rw-r--r-- 1 user staff 794959 Feb 4 10:59 output-0000-of-0004.jsonl.gz
-rw-r--r-- 1 user staff 788720 Feb 4 10:59 output-0002-of-0004.jsonl.gz
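As the listing shows, the SSSS and NNNN placeholders expand to a zero-padded shard index and shard count. A hypothetical expansion helper (the function itself is illustrative, not part of dbcp's API):

```javascript
// Expand dbcp's output-SSSS-of-NNNN shard filename pattern:
// SSSS -> zero-padded shard index, NNNN -> zero-padded shard count.
function expandShardPattern(pattern, shard, numShards) {
  return pattern
    .replace('SSSS', String(shard).padStart(4, '0'))
    .replace('NNNN', String(numShards).padStart(4, '0'))
}

console.log(expandShardPattern('output-SSSS-of-NNNN.jsonl.gz', 2, 4))
// output-0002-of-0004.jsonl.gz
```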
Output a file to stdout
$ dbcp gs://bucket/archive.csv.gz | jq . | less
Copy a file from AWS to GCP
$ dbcp s3://bucket/object.json.gz gs://bucket/file.json.gz
Convert file from ND-JSON to JSON
$ dbcp foobar.jsonl bazbat.json
Download a file
$ dbcp "https://www.w3.org/People/mimasa/test/imgformat/img/w3c_home.png" foo.png
Post a file to HTTP endpoint
$ dbcp "./foo.png" "http://my.api/upload" --contentType "image/png"
Create Athena DDL from JSON sample:
$ dbcp --schemaOnly --inputFile ./sample.jsonl.gz --outputType athena --outputFile ddl.sql
Create Postgres CREATE TABLE from JSON sample:
$ dbcp --schemaOnly --inputFile ./sample.jsonl.gz --outputType postgresql --outputFile ddl.sql
Split the test data file into four shards:
$ dbcp ./test/test.jsonl.gz ./split-SSSS-of-NNNN.jsonl.gz --outputShards 4 --shardBy guid
Join the split files back into one:
$ dbcp ./split-SSSS-of-NNNN.jsonl.gz ./joined.jsonl.gz --inputShards 4 --orderBy id