leo-profanity
v1.7.0
Profanity filter, based on Shutterstock dictionary
leo-profanity
Profanity filter based on the "Shutterstock" dictionary. Demo page, API document page
Installation
// npm
npm install leo-profanity
npm install leo-profanity --no-optional # install only English bad word dictionary
// yarn
yarn add leo-profanity
yarn add leo-profanity --ignore-optional # install only English bad word dictionary
// Bower
bower install leo-profanity
// dictionary/default.json
// githack
<script src="https://raw.githack.com/jojoee/bahttext/master/src/index.js"></script>
const filter = LeoProfanity
filter.clearList()
filter.add(["boobs", "butt"])
Example usage for npm
// supported languages
// - en
// - fr
// - ru
var filter = require('leo-profanity');
// output: I have ****, etc.
filter.clean('I have boob, etc.');
// replace the current dictionary with the French one
filter.loadDictionary('fr');
// create a new dictionary
filter.addDictionary('th', ['หนึ่ง', 'สอง', 'สาม', 'สี่', 'ห้า'])
See more here LeoProfanity - Documentation
Algorithm
This project splits the problem into two parts, Sanitize and Filter. The more interesting approaches considered for each part are described below.
Sanitize
Attempt 1 (1.1): Convert the whole string to lowercase
Example:
- "SomeThing" to "something"
Advantage:
- Simple to understand
- Simple to implement
Disadvantage or Caution:
- Ignores words where case matters ("case sensitive" words)
Attempt 2 (1.2): Turn "similar-like" symbols into letters
Example:
- "@" to "a"
- "5" or "$" to "s"
- "@ss" to "ass"
- "b00b" to "boob"
- "a$$a$$in" to "assassin"
Advantage:
- Detect some trick words
Disadvantage or Caution:
- False positive
- Subjective: it depends on how each person reads the symbol
- Limits the user's imagination (users cannot play with words)
e.g. "[email protected]"
e.g. a user wants to try something funny like "a$$a$$in"
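The substitution described in Attempt 1.2 can be sketched as follows. This is only an illustration: the `SYMBOL_MAP` table and the `replaceSymbols` function are hypothetical names, the choice of symbols (including "0" to "o" and "1" to "i") is an assumption, and the project did not adopt this approach.

```javascript
// Hypothetical look-alike table; which symbols belong here is subjective.
const SYMBOL_MAP = { '@': 'a', '$': 's', '5': 's', '0': 'o', '1': 'i' };

// Replace each mapped symbol with its letter and leave everything else alone.
function replaceSymbols(text) {
  return Array.from(text, (ch) => SYMBOL_MAP[ch] ?? ch).join('');
}

console.log(replaceSymbols('@ss'));      // "ass"
console.log(replaceSymbols('b00b'));     // "boob"
console.log(replaceSymbols('a$$a$$in')); // "assassin"
console.log(replaceSymbols('room 105')); // "room ios"
```

The last call shows the cost: ordinary digits are rewritten too, which is one source of the false positives listed above.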
Attempt 3 (1.3): Replace "." and "," with a space to separate words
People often use "." and "," to connect words or to end a sentence
Example:
- "I like a55,b00b.t1ts" to "I like a55 b00b t1ts"
Advantage:
- Increases the chance of finding a match, e.g. "I like a55,b00b.t1ts"
Disadvantage or Caution:
- Disconnect some words e.g. "[email protected]"
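A minimal sketch of Attempt 1.3, assuming a single regular-expression replace is enough. The `separateWords` name is hypothetical and is not part of the leo-profanity API.

```javascript
// Replace every "." and "," with a space so that words joined by
// punctuation become separate tokens. The string length is unchanged.
function separateWords(text) {
  return text.replace(/[.,]/g, ' ');
}

console.log(separateWords('I like a55,b00b.t1ts')); // "I like a55 b00b t1ts"
```

Because each character is swapped one for one, positions in the sanitized string still line up with positions in the original.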
Filter
Attempt 1 (2.1): Split into an array (or use a regex)
Split the string on spaces into an array of words, then check each word against the profanity word list
Example:
- "I like ass boob" to ["I", "like", "ass", "boob"]
Advantage:
- Simple to implement
Disadvantage:
- Needs a proper list of profanity words
- Some false positives, e.g. Great tit (https://en.wikipedia.org/wiki/Great_tit)
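The split-and-check idea of Attempt 2.1 could look roughly like this. The two-word `WORD_LIST` and the `cleanBySplit` function are hypothetical stand-ins for illustration, not the library's dictionary or implementation.

```javascript
// Tiny stand-in word list; a real list would be far larger.
const WORD_LIST = new Set(['ass', 'boob']);

// Split on spaces and mask every token that appears in the word list.
function cleanBySplit(text, mask = '*') {
  return text
    .split(' ')
    .map((word) => (WORD_LIST.has(word) ? mask.repeat(word.length) : word))
    .join(' ');
}

console.log(cleanBySplit('I like ass boob')); // "I like *** ****"
console.log(cleanBySplit('I like grass'));    // "I like grass"
```

Only whole tokens are compared, so "grass" passes untouched even though it contains "ass".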
Attempt 2 (2.2): Filter words inside other text (with or without spaces)
Detect any run of characters that contains a profanity word
Example:
- "thistextisfunnyboobsanda55" which contains suspicious words: "boobs", "a55"
Advantage:
- Can detect "un-spaced" profanity words
Disadvantage:
- Many false positives, e.g. words containing "ass" (http://www.morewords.com/contains/ass/), the Clbuttic mistake (a filter mistake)
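For comparison, substring matching as in Attempt 2.2 might be sketched as below. `SUBSTRING_LIST` and `findInside` are hypothetical names, and the project did not adopt this approach.

```javascript
// Stand-in list for illustration only.
const SUBSTRING_LIST = ['boobs', 'a55', 'ass'];

// Return every listed word that occurs anywhere inside the text.
function findInside(text) {
  return SUBSTRING_LIST.filter((word) => text.includes(word));
}

console.log(findInside('thistextisfunnyboobsanda55')); // [ 'boobs', 'a55' ]
console.log(findInside('a classic example'));          // [ 'ass' ]
```

The second call is the Clbuttic mistake in miniature: "classic" is flagged only because it contains "ass".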
In Summary
- We cannot know every way to write a profanity word (e.g. how many different ways can you enter a55?)
- There is no algorithm-based approach that fully solves it (yet)
- People will always find a way to connect with each other (e.g. Leet)
So, this project goes with 1.1, 1.3 and 2.1.
(note - you can find other attempts in the "Reference" section)
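How 1.1, 1.3 and 2.1 fit together can be sketched as below. This is an illustration under stated assumptions, not the library's actual implementation: `PIPELINE_WORDS` and `cleanText` are hypothetical names, the mask is assumed to be a single character, and lowercasing is assumed not to change the string length (true for ASCII input).

```javascript
// Stand-in word list for illustration only.
const PIPELINE_WORDS = new Set(['ass', 'boob']);

function cleanText(text, mask = '*') {
  // 1.1 lowercase, then 1.3 replace "." and "," with spaces.
  // Both steps keep the length, so indexes match the original text.
  const sanitized = text.toLowerCase().replace(/[.,]/g, ' ');
  let result = text;
  // 2.1 walk the space-separated tokens and mask the listed ones
  // at the same position in the original string.
  for (const match of sanitized.matchAll(/[^ ]+/g)) {
    const word = match[0];
    if (PIPELINE_WORDS.has(word)) {
      result =
        result.slice(0, match.index) +
        mask.repeat(word.length) +
        result.slice(match.index + word.length);
    }
  }
  return result;
}

console.log(cleanText('I have BOOB, etc.')); // "I have ****, etc."
console.log(cleanText('a55,Ass.boob'));      // "a55,***.****"
```

Matching on a sanitized copy while masking the original keeps the user's punctuation and capitalization intact. "a55" survives because 1.2 is not part of the chosen combination.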
CMD
npm run test.watch
npm run validate
npm run doc.generate
# test npm publish
npm publish --dry-run
Other languages
- [x] JavaScript on npmjs.com/package/leo-profanity
- [x] PHP on packagist.org/packages/jojoee/leo-profanity
- [x] Python on pypi.org/project/leoprofanity
- [ ] Java on Maven
- [ ] WordPress on wordpress.org
Reference
- Inspired by jwils0n/profanity-filter
- Algorithm / Discussion
- "similar-like" symbol to alphabet
- Replace Bad words using Regex
- Clbuttic
- The Clbuttic Mistake
- The Clbuttic Mistake: When obscenity filters go wrong
- Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?
- How do you implement a good profanity filter?
- The Untold History of Toontown’s SpeedChat (or BlockChat™ from Disney finally arrives)
- Profanity Filter Performance in Java
- Resource bad-word list
- Bad words list (458 words) by Alejandro U. Alvarez
- DansGuardian - dansguardian.org, DansGuardian Phraselists
- Seven dirty words
- Shutterstock
- MauriceButler/badwords
- http://www.cs.cmu.edu/~biglou/resources/bad-words.txt
- Tool