# datastore-to-bigquery
Dump Google Cloud Datastore Contents and load them into BigQuery.
You can run it with npx:
```
% npx datastore-to-bigquery --help
usage: datastore-to-bigquery [-h] [-b BUCKET] [-d BACKUPDIR] [-n BACKUPNAME] [-s NAMESPACE] [-p BQPROJECTID]
                             [--datasetName DATASETNAME]
                             projectId

Copy datastore Contents to BigQuery.

positional arguments:
  projectId             Datastore project ID

optional arguments:
  -h, --help            show this help message and exit
  -b BUCKET, --bucket BUCKET
                        GCS bucket to store backup. Needs to be in the same Region as datastore.
                        (default: projectId.appspot.com)
  -d BACKUPDIR, --backupDir BACKUPDIR
                        prefix/dir within bucket
  -n BACKUPNAME, --backupName BACKUPNAME
                        name of backup (default: autogenerated)
  -s NAMESPACE, --namespace NAMESPACE
                        datastore namespace
  -p BQPROJECTID, --bqProjectId BQPROJECTID
                        BigQuery project ID. (default: same as datastore)
  --datasetName DATASETNAME
                        Name of BigQuery Dataset to write to. Needs to be in the same Region as GCS
                        bucket. (default: same as projectId)

Please provide `GOOGLE_APPLICATION_CREDENTIALS` via the Environment!
```
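For example, a complete dump-and-load run against the sample project used later in this README could look like this (`sampleproj`, `samplebucket-tmp`, and `test_EU` are placeholder names, and `key.json` is a hypothetical service-account key file):

```
% export GOOGLE_APPLICATION_CREDENTIALS=key.json
% npx datastore-to-bigquery sampleproj -b samplebucket-tmp -d bak --datasetName test_EU
```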
## Loading into BigQuery
This loads Datastore data dumped by datastore-backup (or produced by other means) into BigQuery. Make sure that the bucket containing the data to be loaded and the BigQuery dataset are in the same location/region.
The BigQuery dataset will be created if it does not exist.
### CLI Usage
CLI usage is simple: you provide the bucket and path to read from, plus the BigQuery project and dataset name to write to:
```
% npx -p datastore-to-bigquery bigqueryLoad --help
usage: bigqueryLoad.ts [-h] bucket pathPrefix projectId datasetName

Load Datastore Backup into BigQuery.

positional arguments:
  bucket       GCS bucket to read backup.
  pathPrefix   Backup dir & name of backup in GCS bucket.
  projectId    BigQuery project ID.
  datasetName  Name of BigQuery Dataset to write to. Needs to be in the same Region as GCS bucket.

optional arguments:
  -h, --help   show this help message and exit

Please provide `GOOGLE_APPLICATION_CREDENTIALS` via the Environment!
```
Loading takes a few seconds per kind:
```
% yarn ts-node src/bin/bigqueryLoad.ts samplebucket-tmp bak/20211223T085120-sampleproj sampleproj test_EU
ℹ bucket samplebucket-tmp is in EU
✔ BigQuery Dataset test_EU exists
ℹ dataset sampleproj:test_EU is in unknown location
ℹ Reading samplebucket-tmp/bak/20211223T085120-sampleproj*
✔ Loading NumberingAncestor done in 1.33s
✔ Loading NumberingItem done in 4.231s
```
### Moving the dataset
In case you need the dataset in a different BigQuery location/region for reading, you can use the BigQuery Data Transfer Service, which is blazing fast:
```
bq --location=US mk --dataset sampleproj:test_US
bq mk --transfer_config --data_source=cross_region_copy --display_name='Copy Dataset' \
    --project_id=sampleproj --target_dataset=test_US \
    --params='{"source_dataset_id":"test_EU","source_project_id":"sampleproj"}'
```
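To check on the copy job, you can list the transfer configurations with the standard `bq` transfer-service commands, e.g.:

```
% bq ls --transfer_config --transfer_location=us --project_id=sampleproj
```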
### Programmatic Usage
Basically the same as command line usage:
```ts
import { BigQuery } from '@google-cloud/bigquery';
import { loadAllKindsFromPrefix } from '../lib/load-into-to-bigquery';

// bucket, pathPrefix, projectId, and datasetName correspond to the CLI arguments above
const bigquery = new BigQuery({ projectId });

await loadAllKindsFromPrefix(bigquery, datasetName, bucket, pathPrefix);
```
## Full Dump-Load Cycle
```
% npx datastore-to-bigquery datastoreProject -n production -b bucket-tmp -p bigqueryProject
```
## Hints
Permissions are a little tricky to set up: permissions for the Datastore export must exist in the source project, as well as for writing to the bucket; permissions for the load must exist on BigQuery; and permissions for listing and reading must exist on GCS.
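For example, with a dedicated service account, granting the required roles could look roughly like this (a sketch only; the service-account name is hypothetical, `datastoreProject`/`bigqueryProject`/`bucket-tmp` are the placeholders used above, and narrower roles may suffice depending on your setup):

```
% SA=backup-runner@datastoreProject.iam.gserviceaccount.com
% gcloud projects add-iam-policy-binding datastoreProject \
    --member="serviceAccount:$SA" --role=roles/datastore.importExportAdmin
% gsutil iam ch serviceAccount:$SA:roles/storage.objectAdmin gs://bucket-tmp
% gcloud projects add-iam-policy-binding bigqueryProject \
    --member="serviceAccount:$SA" --role=roles/bigquery.jobUser
% gcloud projects add-iam-policy-binding bigqueryProject \
    --member="serviceAccount:$SA" --role=roles/bigquery.dataEditor
```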
Locations/regions are also tricky to set up. Basically, the Datastore, the bucket, and the dataset should all be in the same region, e.g. EU. If you need to use BigQuery from a different region, see "Moving the dataset".
Beware of namespaces! Dumping different namespaces and loading them into the same BigQuery dataset will result in incomplete data in BigQuery.
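One way to avoid this is to give each namespace its own dataset via the documented `-s` and `--datasetName` flags (a sketch; `customerA` and `customerB` are hypothetical namespaces):

```
% npx datastore-to-bigquery sampleproj -s customerA --datasetName sampleproj_customerA
% npx datastore-to-bigquery sampleproj -s customerB --datasetName sampleproj_customerB
```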