de-identify-sql

v1.0.4

Published

3 years ago

De-identify data in SQL statements

Downloads

0High
0Medium
0Low

frizzled

de-identify sql hipaa pii phi gdpr

De-identify SQL

De-identify data in SQL statements.

De-identify data in SQL example

About

De-identify SQL removes protected data from a .sql file or piped input (e.g. mysqldump) and sends results to stdout or a new .sql file. De-identify SQL replaces protected data / PII / PHI using the following techniques:

Replacing data with a fabricated value.
Redacting data with a constant value.
Generating new data (customizable JavaScript).

De-identify SQL supports:

Reading from piped input or a passed-in file.
Writing to piped output or creating a file.

Installation

Install globally using npm i -g de-identify-sql

Usage

> de-identify-sql -h
Options:
  -v, --version            Show version number                         [boolean]
  -h, --help               Show help                                   [boolean]
  -o, --outputFile         Output to file instead of stdout             [string]
  -i, --inputFile          Read from file instead of stdin              [string]
  -s, --strategyDirectory  Location of SQL strategy files
                                   [string] [default: {installed-dir}/strategy}]
  -f, --functionsFile      Location of file that contains all custom functions
                           (optional, overwrites .js per table)         [string]

Examples

# Read from piped input, write to piped output ...
> mysqldump my-database | de-identify-sql | gzip > de-id-mysqldump.sql.gz

# Read from piped input, write to .sql ...
> mysqldump my-database | de-identify-sql -o de-id-mysqldump.sql

# Read from .sql, write to piped output ...
> de-identify-sql -i mysqldump.sql | gzip > de-id-mysqldump.sql.gz

# Read from .sql, write to .sql ...
> de-identify-sql -i mysqldump.sql -o de-id-mysqldump.sql

# Read from .sql, write to screen (test de-identification) ...
> de-identify-sql -i mysqldump.sql
# or
> cat mysqldump.sql | de-identify-sql

De-identifying data within SQL

De-identify SQL uses strategy files which are mapped to tables to modify SQL statements. Strategy files should be placed in the /strategy folder (configurable) and named after the table they act on. If the incoming SQL contains USING the format is DATABASE_NAME.TABLE_NAME.json otherwise the format is TABLE_NAME.json. A CREATE TABLE statement must be part of the input file.

Example: /strategy/user.json

{
  "columns": [
    {
      "redactWith": "internet.email",
      "columnKey": "email",
      "tracked": false
    },
    {
      "redactWith": "NAME REMOVED",
      "columnKey": "name",
      "tracked": false
    },
    {
      "redactWith": "generateDatetime",
      "columnKey": "last_visit",
      "tracked": false
    },
    {
      "redactWith": "{{datatype.number({\"min\":18,\"max\":90})}}",
      "columnKey": "age",
      "tracked": false
    }
  ]   
}

The columns array describes how each SQL column should be modified. Omitted columns pass-through without modification.

columnKey - (string) The column to be modified by de-identify-sql.
redactWith - (string) There are four ways to de-identify data:
- A faker function - Possible functions : e.g. name.lastName or phone.phoneNumber
- A faker template - A mustache template of faker methods: e.g. {{name.firstName}} {{name.lastName}}, {{name.jobTitle}} or {{address.streetAddress}} {{address.city}}
- Custom JavaScript - You can call a function to create a value, these are defined in a .js file that matches the name of the .json file
- A constant value - Replace with a constant, e.g. VALUE REMOVED
tracked - (boolean) Tracking preserves data relationships while de-identifying SQL.

Tracking: Data Relationships

De-identify SQL can preserve relationships within data. If original data repeats, de-identify SQL recognizes and replaces it with the same value it previously used.

This allows the generated SQL to retain its structure while removing protected data. For example, if there are multiple INSERTs which contain the same email address these would become different email addresses. However, if the tracked parameter is true, then every instance will be replaced with the same made-up value.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme