de-identify-sql
v1.0.4
Published
De-identify data in SQL statements
Downloads
2
Maintainers
Readme
De-identify SQL
De-identify data in SQL statements.
About
De-identify SQL removes protected data from a .sql file or piped input (e.g. mysqldump) and sends results to stdout or a new .sql file. De-identify SQL replaces protected data / PII / PHI using the following techniques:
- Replacing data with a fabricated value.
- Redacting data with a constant value.
- Generating new data (customizable JavaScript).
De-identify SQL supports:
- Reading from piped input or a passed-in file.
- Writing to piped output or creating a file.
Installation
Install globally using npm i -g de-identify-sql
Usage
> de-identify-sql -h
Options:
-v, --version Show version number [boolean]
-h, --help Show help [boolean]
-o, --outputFile Output to file instead of stdout [string]
-i, --inputFile Read from file instead of stdin [string]
-s, --strategyDirectory Location of SQL strategy files
[string] [default: {installed-dir}/strategy}]
-f, --functionsFile Location of file that contains all custom functions
(optional, overwrites .js per table) [string]
Examples
# Read from piped input, write to piped output ...
> mysqldump my-database | de-identify-sql | gzip > de-id-mysqldump.sql.gz
# Read from piped input, write to .sql ...
> mysqldump my-database | de-identify-sql -o de-id-mysqldump.sql
# Read from .sql, write to piped output ...
> de-identify-sql -i mysqldump.sql | gzip > de-id-mysqldump.sql.gz
# Read from .sql, write to .sql ...
> de-identify-sql -i mysqldump.sql -o de-id-mysqldump.sql
# Read from .sql, write to screen (test de-identification) ...
> de-identify-sql -i mysqldump.sql
# or
> cat mysqldump.sql | de-identify-sql
De-identifying data within SQL
De-identify SQL uses strategy files which are mapped to tables to modify SQL statements. Strategy files should be placed in the /strategy
folder (configurable) and named after the table they act on. If the incoming SQL contains USING
the format is DATABASE_NAME.TABLE_NAME.json
otherwise the format is TABLE_NAME.json
. A CREATE TABLE
statement must be part of the input file.
Example: /strategy/user.json
{
"columns": [
{
"redactWith": "internet.email",
"columnKey": "email",
"tracked": false
},
{
"redactWith": "NAME REMOVED",
"columnKey": "name",
"tracked": false
},
{
"redactWith": "generateDatetime",
"columnKey": "last_visit",
"tracked": false
},
{
"redactWith": "{{datatype.number({\"min\":18,\"max\":90})}}",
"columnKey": "age",
"tracked": false
}
]
}
The columns
array describes how each SQL column should be modified. Omitted columns pass-through without modification.
columnKey
- (string) The column to be modified by de-identify-sql.redactWith
- (string) There are four ways to de-identify data:- A faker function - Possible functions : e.g.
name.lastName
orphone.phoneNumber
- A faker template - A mustache template of faker methods: e.g.
{{name.firstName}} {{name.lastName}}, {{name.jobTitle}}
or{{address.streetAddress}} {{address.city}}
- Custom JavaScript - You can call a function to create a value, these are defined in a
.js
file that matches the name of the.json
file - A constant value - Replace with a constant, e.g.
VALUE REMOVED
- A faker function - Possible functions : e.g.
tracked
- (boolean) Tracking preserves data relationships while de-identifying SQL.
Tracking: Data Relationships
De-identify SQL can preserve relationships within data. If original data repeats, de-identify SQL recognizes and replaces it with the same value it previously used.
This allows the generated SQL to retain its structure while removing protected data. For example, if there are multiple INSERTs which contain the same email address these would become different email addresses. However, if the tracked
parameter is true
, then every instance will be replaced with the same made-up value.