sql-data-profiler

v0.2.6

Published

2 years ago

Generate sql queries to describe and explore sql data sets

Downloads

0High
0Medium
0Low

pcothenet

anselme

sql-data-profiler

sql-data-profiler is a utility module that generates sql code to profile data sets (eg. tables) in Redshift.

Installation

npm install sql-data-profiler --save

Quick start

var sql_data_profiler = require('sql-data-profiler');

Table profiler

Generates a sql statement that provides basic stats on the data in the "contacts" table.

var data_profiler = sql_data_profiler.data_profiler;

var options = {
	target_table: 'contacts',
	target_columns: ['email', 'a_industry']
};
var sql_code = data_profiler(options);

which accepts the following options

target_table
target_columns
results_table
calculate_frequency
use_perm_table
truncate_table

Distribution analysis

** What does this do?**

var distribution_analyzer = sql_data_profiler.distribution_analyzer;

var sql_code = distribution_analyzer({
	target_table: 'contacts',
	target_variable: {
		data_set: 'SELECT DISTINCT contact_id FROM events WHERE meta_event = \'conversion\'',
		join: {
			source_column: 'contact_id',
			target_column: 'contact_id'
		}
	},
	target_columns: [
		{
			name: 'a_name',
			max_number_of_values: 10,
		},
		{
			name: 'a_gender',
			max_number_of_values: 10
		},
		{
			name: 'a_address',
			max_number_of_values: 10,
			transformation: 'has_any_value'
		}
	]
});

Table stats

The following stats are calculated for each column.

| Stats | Description | | -------- | ---------- | | count_total | number of records in the table | | count_not_null | number of records where the value for the specified column is not null | | fill_rate | number of non-null values divided by number of records | | count_distinct | number of distinct values | | dupe_rate | number of distinct values divided by number of records where the value for the specified column is not null | | maximum_value | | | minimum_value | | | most_frequent_value_1 | | | most_frequent_value_1_frequency | | | most_frequent_value_2 | | | most_frequent_value_2 | | | most_frequent_value_3 | | | most_frequent_value_3 | |

TO DO

handle different data type (eg. boolean)
performance improvement

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

sql-data-profiler

Installation

Quick start

Table profiler

Distribution analysis

Table stats

TO DO