sql-data-profiler
v0.2.6
Published
Generate sql queries to describe and explore sql data sets
Downloads
7
Readme
sql-data-profiler
sql-data-profiler is a utility module that generates sql code to profile data sets (eg. tables) in Redshift.
Installation
npm install sql-data-profiler --save
Quick start
var sql_data_profiler = require('sql-data-profiler');
Table profiler
Generates a sql statement that provides basic stats on the data in the "contacts" table.
var data_profiler = sql_data_profiler.data_profiler;
var options = {
target_table: 'contacts',
target_columns: ['email', 'a_industry']
};
var sql_code = data_profiler(options);
which accepts the following options
- target_table
- target_columns
- results_table
- calculate_frequency
- use_perm_table
- truncate_table
Distribution analysis
** What does this do?**
var distribution_analyzer = sql_data_profiler.distribution_analyzer;
var sql_code = distribution_analyzer({
target_table: 'contacts',
target_variable: {
data_set: 'SELECT DISTINCT contact_id FROM events WHERE meta_event = \'conversion\'',
join: {
source_column: 'contact_id',
target_column: 'contact_id'
}
},
target_columns: [
{
name: 'a_name',
max_number_of_values: 10,
},
{
name: 'a_gender',
max_number_of_values: 10
},
{
name: 'a_address',
max_number_of_values: 10,
transformation: 'has_any_value'
}
]
});
Table stats
The following stats are calculated for each column.
| Stats | Description | | -------- | ---------- | | count_total | number of records in the table | | count_not_null | number of records where the value for the specified column is not null | | fill_rate | number of non-null values divided by number of records | | count_distinct | number of distinct values | | dupe_rate | number of distinct values divided by number of records where the value for the specified column is not null | | maximum_value | | | minimum_value | | | most_frequent_value_1 | | | most_frequent_value_1_frequency | | | most_frequent_value_2 | | | most_frequent_value_2 | | | most_frequent_value_3 | | | most_frequent_value_3 | |
TO DO
- handle different data type (eg. boolean)
- performance improvement