@ocha-hdx/redact-pii
v3.2.3-47-structured
Published
Remove personally identifiable information from text.
Downloads
57
Readme
redact-pii
NOTE: Users of [email protected] please check the Changelog before upgrading .
Remove personally identifiable information from text.
Prerequesites
This library is primarily written for node.js but it should work in the browser as well.
It is written in TypeScript and compiles to ES2017. The library makes use of async
functions and hence needs node.js 8.0.0 or higher (or a modern browser). If this is a problem for you please open an issue and we may consider adapting the compiler settings to support older node.js versions.
Simple example (synchronous API)
npm install redact-pii
const { SyncRedactor } = require('redact-pii');
const redactor = new SyncRedactor();
const redactedText = redactor.redact('Hi David Johnson, Please give me a call at 555-555-5555');
// Hi NAME, Please give me a call at PHONE_NUMBER
console.log(redactedText);
Simple example (asynchronous / promise-based API)
const { AsyncRedactor } = require('redact-pii');
const redactor = new AsyncRedactor();
redactor.redactAsync('Hi David Johnson, Please give me a call at 555-555-5555').then(redactedText => {
// Hi NAME, Please give me a call at PHONE_NUMBER
console.log(redactedText);
});
Supported Features
- sync and async API variants
- ability to customize what to use as replacement value for detected patterns
- built in regex based redaction rules for:
- credentials
- creditCardNumber
- emailAddress
- ipAddress
- name
- password
- phoneNumber
- streetAddress
- username
- usSocialSecurityNumber
- zipcode
- url
- digits
NOTE: the built-in redaction rules are mostly applicable for identifying (US-)english PII. Consider using custom patterns or Google Cloud DLP if you have non-english PII to redact.
- ability to add custom redaction regex patterns and complete custom redaction functions (both sync and async)
- ability to use Google Data Loss Prevention as advanced custom redactor
Advanced usage and features
Customize replacement values
const { SyncRedactor } = require('redact-pii');
// use a single replacement value for all built-in patterns found.
const redactor = new SyncRedactor({ globalReplaceWith: 'TOP_SECRET' });
redactor.redact('Dear David Johnson, I live at 42 Wallaby Way');
// Dear TOP_SECRET, I live at TOP_SECRET
// use a custom replacement value for a specific built-in pattern
const redactor = new SyncRedactor({
builtInRedactors: {
names: {
replaceWith: 'ANONYMOUS_PERSON'
}
}
});
redactor.redact('Dear David Johnson');
// Dear ANONYMOUS_PERSON
Add custom patterns or redaction functions
Note that the order of redaction rules matters, therefore you have to decide whether you want your custom redaction rules to run before
or after
the built-in ones. Generally it's better to put very specialized patterns or functions before
the built-in ones and more broad / general ones after
.
const { SyncRedactor } = require('redact-pii');
// add a custom regexp pattern
const redactor = new SyncRedactor({
customRedactors: {
before: [
{
regexpPattern: /\b(cat|dog|cow)s?\b/gi,
replaceWith: 'ANIMAL'
}
]
}
});
redactor.redact('I love cats, dogs, and cows');
// I love ANIMAL, ANIMAL, and ANIMAL
// add a synchronous custom redaction function
const redactor = new SyncRedactor({
customRedactors: {
before: [
{
redact(textToRedact) {
return textToRedact.includes('TopSecret')
? 'THIS_FILE_IS_SO_TOP_SECRET_WE_HAD_TO_REDACT_EVERYTHING'
: textToRedact;
}
}
]
}
});
redactor.redact('This document is classified as TopSecret.')
// THIS_FILE_IS_SO_TOP_SECRET_WE_HAD_TO_REDACT_EVERYTHING
import { AsyncRedactor } from './src/index';
// add an asynchronous custom redaction function
const redactor = new AsyncRedactor({
customRedactors: {
before: [
{
redactAsync(textToRedact) {
return myCustomRESTApiServer.redactCustomWords(textToRedact);
}
}
]
}
});
Disable specific built-in redaction rules
const redactor = new SyncRedactor({
builtInRedactors: {
names: {
enabled: false
},
emailAddress: {
enabled: false
}
}
});
Use Google Data Loss Prevention
Google Data Loss Prevention (DLP) has an extensive rule set to identify and redact PII that goes beyond just simple regex patterns. Consider using DLP in-addition to the built-in patterns of redact-pii for high value / sensitive data applications.
Also we strongly advice on using DLP if you have to redact non-english data since redact-pii's built-in patterns cover mostly US english patterns only and have no support for non-latin characters, whereas DLP has extensive support for international IDs, Chinese and Korean characters etc..
redact-pii
provides a small wrapper GoogleDLPRedactor
around DLP that can be used seperately or in conjunction with redact-pii's built-in patterns.
Note that Google Cloud DLP already also provides a node.js library (https://www.npmjs.com/package/@google-cloud/dlp) that can be used directly to redact data. You have to decide yourself if you want to use the GoogleDLPRedactor
wrapper or @google-cloud/dlp
directly. The main differentiators of using redact-pii
/ GoogleDLPRedactor
are:
GoogleDLPRedactor
already instantiates@google-cloud/dlp
with a bunch of sane defaults and infoTypes- redact-pii has a bunch of built-in patterns which can run in addition to DLP infoTypes
- it is easy to add custom patterns or rules to redact-pii
GoogleDLPRedactor
uses the.inspectContent
instead of.deidentifyContent
method of@google-cloud/dlp
which has a pricing advantage for large scale redaction scenarios since you will be only charged "Inspection Units" and no additional "Transformation Units" (see https://cloud.google.com/dlp/pricing) . redact-pii only uses DLP toidentify
PII but does the replacementtransformation
by itself which saves you some 💰💰💰.
Use Google Data Loss Prevention only (this won't make use of redact-pii's built-in regex patterns)
Prequesites: You have to have a Google Cloud Project with DLP enabled and you need a serviceaccount key json-file for a service account with the
serviceusage.services.use
permission orroles/dlp.user
role. For more detailed steps on how to get a valid service account key follow the steps here: https://github.com/googleapis/nodejs-dlp#before-you-beginSet the environment variable
GOOGLE_APPLICATION_CREDENTIALS
and point it to the serviceaccount key. E.g.:export GOOGLE_APPLICATION_CREDENTIALS=./path/to/my-serviceaccount-key.json
Use redact pii
const { GoogleDLPRedactor } = require('redact-pii');
const redactor = new GoogleDLPRedactor();
redactor.redactAsync('I live at 123 Park Ave Apt 123 New York City, NY 10002').then(redactedText => {
console.log(redactedText);
// I live at STREET_ADDRESS US_STATE City, LOCATION ZIPCODE'
});
Use Google DLP AND built-in patterns AND a custom pattern
You can create an AsyncRedactor
and add a GoogleDLPRedactor
as custom redactor to the AsyncRedactor
.
That way you are combining redact-pii's built-in patterns with Google DLP. The example below additionally adds a custom regexp pattern.
const { AsyncRedactor, GoogleDLPRedactor } = require('redact-pii');
const redactor = new AsyncRedactor({
customRedactors: {
before: [
new GoogleDLPRedactor(),
{
regexpPattern: /\b(cat|dog|cow)s?\b/gi,
replaceWith: 'ANIMAL'
}
]
}
});
redactor.redactAsync('I live at 123 Park Ave Apt 123 New York City, NY 10002 and love cats').then(redactedText => {
console.log(redactedText);
// I live at STREET_ADDRESS US_STATE City, LOCATION ZIPCODE and love ANIMAL'
});
Google DLP content size limit
The Google DLP service has a content size limit of 524288 bytes. If the input is over this limit, the GoogleDLPRedactor
will
by default automatically split the content into smaller batches and then combine the results together again. If this
behavior is undesired, it can be disabled by setting the disableAutoBatchWhenContentSizeExceedsLimit
option flag to
true:
new GoogleDLPRedactor({ disableAutoBatchWhenContentSizeExceedsLimit: true })
There is no intelligence to try to prevent splitting the batches in the middle of a word. If the batch happens to be split in the middle of a sensitive word then that word may not be redacted. You can always perform your own intelligent batching prior if needed.
Contributing
Run tests
You can run the tests via npm run test
. There are are a bunch of tests which require access to Google's DLP API.
They will only be run if you set the GOOGLE_APPLICATION_CREDENTIALS
environment variable - otherwise they'll be skipped automatically. You can set it via GOOGLE_APPLICATION_CREDENTIALS=/path/to/keyfile.json npm test
.