Tables
Tables is a simple command-line tool and powerful library for importing data like a CSV or JSON file into relational database tables. The goal is to make importing data into a relational database easy, configurable, and stable, even for large datasets, so the data is ready for analysis.
Install
- To include as a library: `npm install tables`
- To use as a command-line tool: `npm install -g tables`
- Or use it directly via `npx`: `npx tables ...`
Features
- Automatic data type guessing.
- Automatic index guessing.
- Efficient memory and CPU usage utilizing Node streams.
- Supports CSV-ish, JSON, NDJSON, and HTML table data sources (utilizes tito).
- Default use of SQLite so no need to setup a database server.
- (Needs re-do) Resumable (currently only works for CSV in input mode).
- Sane defaults.
- Supports MySQL, Postgres, and SQLite.
- Untested in MSSQL
- Useful, pretty output
Command line usage
The command-line version can handle most of the simple options that the library uses. A simple example that reads in a CSV named `filename.csv` will produce an SQLite database at `filename.sqlite`.
tables -i filename.csv
CLI options
Use the `--help` option to get a full, up-to-date look at what the options are, but here are most of them for reference:

- `--help`: Shows help message.
- `--version`: Outputs version.
- `--input`: The input file to use. Tables can also use a piped-in source if this is not provided. See the Piping section below.
- `--db`: The database URI to push data into. By default, Tables will use the input option to create an SQLite database at the same path with a `.sqlite` extension, or, for piping, it will create an SQLite database named `tables-import.sqlite` where the command is run. If you are using a password to access your database, please consider using the `TABLES_DB_URI` environment variable for security reasons. Examples:
  - `--db="sqlite://./my-new-db.sql"`
  - `--db="mysql://username:password@localhost/my-database"`
  - `--db="postgres://username:@localhost:1234/my-database"`
- `--format`: Force the type of parsing. This is determined from the filename and defaults to `csv`. Valid values are `csv`, `tsv`, `json`, `ndjson`, and `html`.
- `--key`: Creates a unique key from columns if the models option in the config is not provided. This is a suggested option, as it allows data to be updated rather than only added to. You can use a comma-delimited list of columns for a multi-column key, like `"column 1,other,thing"`. Do note that this option only works if not manually defining the model, specifically in a config file.
- `--index-fields`: Defines which fields will be indexed. By default this looks for fields that end in "key", "id", or "name".
  - Provide a JS regular expression that starts with `/`, for example `/^(key|column)$/`.
  - Or provide a comma-separated list of columns or regular-expression-like values, such as `key|column_1` or `.*key|.*id`.
- `--table-name`: Specify the name of the table to import into. By default, Tables uses the file name if provided, or `tables_import` for piped data. Do note that this option only works if not manually defining the model, specifically in a config file.
- `--silent`: No output except errors.
- `--overwrite`: WARNING: DELETES DATA. Removes existing data and tables. Use this if you don't have a unique key (the `--key` option) defined or if your model has changed.
- `--batch-size`: Number of rows to import at once. Default is `500`. Use lower or higher numbers depending on the database and how it is configured.
- `--guess-size`: Number of rows to use for guessing the model. Default is `500`. Use lower or higher numbers depending on the database and how it is configured.
- `--csv-headers`: Use the keyword `false` if there are no headers, or use a comma-separated list of headers. Defaults to reading headers from the file. Forces type to CSV.
- `--csv-delimiter`: CSV delimiter character. Defaults to `,`. If you need to use a tab character, use `--csv-delimiter=$'\t'`. Forces type to CSV.
- `--csv-quote`: CSV quote character. Defaults to `"`. Forces type to CSV.
- `--json-path`: JSON path to use for parsing rows. Default is `*`. See JSONStream. Forces type to JSON.
- `--html-selector`: CSS selector to target a specific table when using the `html` type.
- `--date-format`: Date format to use when guessing date columns and parsing data. Defaults to `MM/DD/YYYY`. See moment.js for options.
- `--datetime-format`: Datetime format to use when guessing date columns and parsing data. Defaults to `MM/DD/YYYY HH:mm:ss a`. See moment.js for options.
- `--transformer`: Point to a JS module that exports a function to transform the data before it is loaded into the database. See the transformer documentation below.
- `--models`: Point to a JS module that exports a function to define the database models (table layout). See the models documentation below.
- `--no-transactions`: Turn off transactions. Each batch is a single transaction.
- `--no-optimize`: Turn off the optimizations that run before and after database inserts/updates are made. For very large datasets, optimizing can add a significant amount of time to the import.
- `--config`: Allows you to use a JS file that exports configuration for Tables. Any other options will override the values in the file. This allows for options that are not easily supported on a command line. See the Options section under Library use, and the sketch after this list.
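For reference, here is a minimal sketch of what a `--config` file could look like. The file name, paths, and values are placeholders, and the option names follow the library options documented further below.

```js
// tables-config.js — a hypothetical configuration module for the --config option.
// Paths, the table name, and the key are placeholders; option names follow the
// Library options section below.
module.exports = {
  input: "./data/filename.csv",
  db: "sqlite://./data/filename.sqlite",
  tableName: "my_table",
  key: "id",
  batchSize: 1000,
  guessSize: 1000,
  // A JS regular expression for which fields to index when guessing models
  fieldsToIndex: /(key|id|name)$/i,
  silent: false
};
```

It could then be passed with something like `tables --config="tables-config.js"`.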
Piping
Piping in data is supported. It should be noted that a couple of things happen with piping: a bit more memory is used if guessing models, and a progress bar with a time estimate cannot be shown since the total amount of data isn't known up front. Without an input file, and with certain options left undefined, the default SQLite file will be `tables-import.sqlite` with a table name of `tables_import`.
cat filename.json | tables -f json
Examples
Get data from the NYC Data Portal about water quality complaints and then create an SQLite database of the same name. This will create `nyc-water-quality-complaints.sqlite` with a `nyc_water_complaints` table that has all the data.
$ curl --silent "https://data.cityofnewyork.us/api/views/qfe3-6dkn/rows.csv" | tables --db="sqlite://nyc-water-quality-complaints.sqlite" --table-name="nyc_water_complaints";
$ sqlite3 nyc-water-quality-complaints.sqlite
sqlite> .tables
sqlite> SELECT * FROM nyc_water_complaints;
Put my GitHub followers into an SQLite database at `examples/github.sqlite` with a table named `followers`:
curl --silent "https://api.github.com/users/zzolo/followers" | tables --format="json" --db="sqlite://./examples/github.sqlite" --table-name="followers" --key="id";
Using a little script to turn a paging API into ndjson, we can get all 40k+ candidates from the FEC API.
FEC_API_KEY=xxxxx node examples/api-pager.js --uri="https://api.open.fec.gov/v1/candidates/?per_page=100&page=[[page]]&api_key=$FEC_API_KEY" --results="results" --page="pagination.page" --pages="pagination.pages" | tables --db="sqlite://examples/fec.sqlite" --table-name="candidates" --key="candidate_id"
Or, use the FEC bulk downloads to get candidates from a specific year. The format that the bulk data comes in is pipe (`|`) delimited and does not have headers, but we can account for that with specific arguments.
curl -L --silent "https://www.fec.gov/files/bulk-downloads/2020/weball20.zip" \
| funzip \
| tables --db="sqlite://examples/fec-bulk.sqlite" \
--table-name="candidates" \
--key="CAND_ID" \
--csv-delimiter="|" \
--csv-headers="CAND_ID,CAND_NAME,CAND_ICI,PTY_CD,CAND_PTY_AFFILIATION,TTL_RECEIPTS,TRANS_FROM_AUTH,TTL_DISB,TRANS_TO_AUTH,COH_BOP, COH_COP,CAND_CONTRIB,CAND_LOANS,OTHER_LOANS,CAND_LOAN_REPAY,OTHER_LOAN_REPAY,DEBTS_OWED_BY,TTL_INDIV_CONTRIB,CAND_OFFICE_ST,CAND_OFFICE_DISTRICT,SPEC_ELECTION,PRIM_ELECTION,RUN_ELECTION,GEN_ELECTION,GEN_ELECTION_PRECENT,OTHER_POL_CMTE_CONTRIB,POL_PTY_CONTRIB,CVG_END_DT,INDIV_REFUNDS,CMTE_REFUNDS"
Load in 600k+ rows of ordinance violations in Chicago. This specific example expands which fields will be indexed. Indexing will make the database loading a bit slower and will increase disk size (this produced a 500MB+ file), but will make queries faster. We also increase the model-guessing size, which helps alleviate issues with guessing field length, and increase the batch size, which may help speed things up, though this is dependent on many things.
curl -L --silent "https://data.cityofchicago.org/api/views/6br9-quuz/rows.csv?accessType=DOWNLOAD" \
| tables \
--db="sqlite://examples/chicago.sqlite" \
--table-name="ordinance_violations" \
--key="ID" \
--guess-size=1000 \
--batch-size=1000 \
--index-fields=".*date,.*name,.*number,.*disposition,.*fine,.*cost,.*code,latitude,longitude,.*borough,status" \
--overwrite \
--yes
Library use
Include like most libraries:
const { Tables } = require("tables");
Import in a CSV file:
// Define instance
let t = new Tables({
input: "./examples/nyc-water-quality-complaints.csv",
db: "sqlite://examples/nyc-water-complaints.sqlite",
// If running as a library, you probably want to suppress the output
silent: true
});
// Run import
await t.start();
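Note that `await` at the top level only works in an ES module or another async context; from a plain CommonJS script, a small wrapper like the following sketch (same options as above) does the job:

```js
const { Tables } = require("tables");

// A minimal wrapper so the import can run from a plain CommonJS script,
// where top-level await is not available. Options mirror the example above.
(async () => {
  const t = new Tables({
    input: "./examples/nyc-water-quality-complaints.csv",
    db: "sqlite://examples/nyc-water-complaints.sqlite",
    silent: true
  });

  await t.start();
})().catch((error) => {
  console.error(error);
  process.exit(1);
});
```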
Library options
- `models`: This is how the database tables are defined. If this is left empty, Tables will try to guess this from some of the data. We use Sequelize models to help abstract things for different databases. Data types are the types supported by Sequelize. The only difference is that you can attach a `tablesInputColumn` property to each field to help with auto-transforming. When given as a function, it receives the following:
  - Argument `databaseConnection`: This is the Sequelize instance.
  - Argument `Sequelize`: This is the Sequelize library for getting the data types from.
  - Argument `options`: The options that were passed to the Tables object.
  - Context: The context of the function should be the Tables object.
  - Returns: The function should return an object of Sequelize Models.
  - Example:

```js
...
models: (db, Sequelize, options) => {
  class ExampleModel extends Sequelize.Model {}
  ExampleModel.init(
    {
      id: {
        // This is our only custom field that will help the auto
        // transformer know how to match up fields
        tablesInputColumn: 'Unique Key',
        // Field describes the column name in the database
        field: 'id',
        type: Sequelize.STRING(128),
        primaryKey: true
      },
      type: {
        tablesInputColumn: 'Complaint Type',
        field: 'type',
        type: Sequelize.STRING(128)
      },
      description: {
        tablesInputColumn: 'Descriptor',
        field: 'description',
        type: Sequelize.TEXT
      }
    },
    {
      sequelize: db,
      tableName: 'example_model',
      indexes: [
        // These need to be the field name, not the
        // sequelize name
        { fields: ['type', 'description'] }
      ],
      // These are suggested, but not necessary
      timestamps: false,
      underscored: true,
      freezeTableName: true
    }
  );

  return {
    // The key here is important if writing your own
    // transformer
    exampleModel: ExampleModel
    // You could have multiple models/tables
    // anotherTable: AnotherTable,
    // lookupTable: LookupTable
  };
}
...
```
- `transformer`: Custom transform function that will be run before loading data into the database. If not provided, Tables will use an automatic transform function that uses the models to try to match up columns. The context is the Tables object and the arguments are:
  - Argument `data`: Data coming from the format pipe.
  - Argument `models`: Models configuration. If the `models` option is not provided, the transformer function gets run when guessing models, so this will be undefined when that happens.
  - Argument `options`: Options that were given or configured by the Tables object.
  - Context: The context of the function will be the Tables object.
  - Returns: Should return the transformed data, an object where each key corresponds to a model and each value is the row of data to insert/update.
  - For example:

```js
...
transformer: (data, models, options) => {
  return {
    modelName: {
      id: data['Unique Key'],
      thing: data.t + 1,
      other: data['OTHER']
    }
  };
}
...
```
- `formatOptions`: Options to pass to the stream parser. This will depend on which `inputType` option is given, and the defaults change based on that as well. See tito for full options.
  - The CSV parser is fast-csv.
  - The JSON parser is JSONStream, and the option should be in a format like `{ path: "*" }`.
  - The HTML table parser can use a CSS selector to get the table, with something like `{ selector: ".table" }`.
- `fieldsToIndex`: Used if guessing models (i.e. the `models` option is not provided). This should be a JS regular expression or a string version of a regular expression.
- `dbOptions`: Tables uses Sequelize as its ORM to more easily support multiple database backends. This option is an object that goes into `new Sequelize(uri, options)`. The defaults here are important for regular Tables usage.
- `hooks`: An object of functions that get run at certain points during the Tables lifecycle. All hooks are passed the `sequelize`, `models`, and `options` objects, with the Tables instance as the context. The following hooks are supported (see the sketch after this list):
  - `finish`: Run right before the final optimize step. Helpful for database updates that can't be performed during import.
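To make a few of these concrete, here is a hedged sketch of a Tables instance combining `format`, `formatOptions`, `fieldsToIndex`, and a `finish` hook. The file paths, JSON path, index pattern, table and column names, and SQL are placeholders, and the exact hook signature is assumed from the description above rather than confirmed.

```js
const { Tables } = require("tables");

// A sketch only: paths, the JSON path, the index pattern, and the SQL are
// illustrative, and the hook arguments (sequelize, models, options) are
// assumed from the description above.
let t = new Tables({
  input: "./examples/github.json",
  format: "json",
  db: "sqlite://examples/github.sqlite",
  // Passed to the JSON stream parser (JSONStream)
  formatOptions: { path: "*" },
  // Index fields ending in "id" or "login" when guessing models
  fieldsToIndex: /(id|login)$/i,
  hooks: {
    // Runs right before the final optimize step
    finish: async (sequelize, models, options) => {
      await sequelize.query("UPDATE followers SET login = LOWER(login);");
    }
  },
  silent: true
});

await t.start();
```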
The following are all options that correspond to command-line options; see that section for more description.
| Library option   | CLI option          | Notes                                                                                   |
| ---------------- | ------------------- | --------------------------------------------------------------------------------------- |
| `input`          | `--input`           |                                                                                         |
| `format`         | `--format`          |                                                                                         |
| `db`             | `--db`              | Suggested to use the `TABLES_DB_URI` environment variable if you are using a password. |
| `tableName`      | `--table-name`      |                                                                                         |
| `key`            | `--key`             |                                                                                         |
| `batchSize`      | `--batch-size`      |                                                                                         |
| `guessSize`      | `--guess-size`      |                                                                                         |
| `dateFormat`     | `--date-format`     |                                                                                         |
| `datetimeFormat` | `--datetime-format` |                                                                                         |
| `overwrite`      | `--overwrite`       |                                                                                         |
| `silent`         | `--silent`          |                                                                                         |
Tips
- If you run into a `Parse Error` on a "CSV" input and your input doesn't actually use quote characters, it could be that you need to set the quote character to null or an empty string, i.e. `--csv-quote=""` (see the sketch below for library use).
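If you are using Tables as a library, the equivalent would presumably be to pass the quote setting to the CSV parser through `formatOptions` — a hedged sketch, assuming fast-csv's `quote` option (which can be set to `null` to disable quote handling) is passed straight through:

```js
const { Tables } = require("tables");

// A sketch assuming formatOptions is handed directly to the CSV parser
// (fast-csv), whose quote option can be set to null to disable quote handling.
let t = new Tables({
  input: "./data/unquoted.csv",
  formatOptions: { quote: null },
  silent: true
});

await t.start();
```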
Debugging
Use the `DEBUG` environment variable to get debug information, for instance `DEBUG=tables*`.
Tests
Tests are run through mocha; after an `npm install`, do the following:
npm run test
Credits
Tables was worked on while I worked at WNYC and Star Tribune.