@neopass/wordlist
v0.5.2
Generate a word list from various sources, including SCOWL
wordlist
Generate a word list from various sources, including system dictionaries and SCOWL.
Includes a default list of ~86,000 English words.
Additional dictionary/wordlist paths can be configured via the options. System dictionaries exist at locations such as /usr/share/dict/words, /usr/share/dict/british-english, etc.
Contents
- Installation
- Usage
- Options
- The Default List
- Generate a List From SCOWL Sources
- Create a Custom Word List File
- SCOWL License
Installation
npm install @neopass/wordlist
Usage
There are three functions available for creating word lists: wordList, wordListSync, and listBuilder. The default list is used when no options are given, so no configuration is required.
wordList builds and returns the list asynchronously:
const { wordList } = require('@neopass/wordlist')
wordList().then(list => console.log(list.length)) // 86748
wordListSync builds and returns the list synchronously:
const { wordListSync } = require('@neopass/wordlist')
const list = wordListSync()
console.log(list.length) // 86748
listBuilder calls back each word asynchronously:
const { listBuilder } = require('@neopass/wordlist')
const builder = listBuilder()
const list = []
builder(word => list.push(word))
  .then(() => console.log(list.length)) // 86748
Options
export interface IListOptions {
  /**
   * Word list paths to search for in order. Only the first
   * one found is used. This option is ignored if 'combine'
   * is a non-empty array.
   *
   * default: [
   *   '$default',
   * ]
   */
  paths?: string[]
  /**
   * Word list paths to combine. All found files are used.
   */
  combine?: string[]
  /**
   * Mutate the list by filtering on lower-case words, converting to
   * lower case, or applying a custom mutator function.
   */
  mutator?: 'only-lower'|'to-lower'|Mutator
}
paths: Allows alternate, fallback lists to be used.
combine: Allows multiple lists to be combined into one.
mutator: Mutates the list depending on the value provided.
- only-lower: Filter out words that are not strictly composed of the characters [a-z].
- to-lower: Convert words to lower case.
- Mutator: (word: string) => string|string[]|void: a custom function that receives a word and returns one or more words, or undefined. Used for custom transformation/exclusion of words in the list.
Return values:
- string: the returned string is added to the list.
- string[]: all returned strings are added to the list.
- For any other return value the word is not added.
const assert = require('assert')
const { wordList } = require('@neopass/wordlist')
/**
 * Create a custom mutator for splitting hyphenated words
 * and converting them to lower case.
 */
function customMutator(word) {
  // Will return ['west', 'ender'] for an input of 'West-ender'.
  return word.split('-').map(part => part.toLowerCase())
}
const options = {
  paths: ['/some/list/path/words.txt'],
  mutator: customMutator,
}
// Assumes the list at the path above contains 'West-ender'.
wordList(options).then((list) => {
  assert(list.includes('west'))
  assert(list.includes('ender'))
})
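The return-value rules above can be sketched as a small standalone function. This only illustrates the documented behavior and is not the library's actual implementation; applyMutator and splitAndLower are hypothetical names introduced for the example.

```javascript
/**
 * Hypothetical helper (not part of @neopass/wordlist): applies a
 * mutator to each word following the documented return-value rules.
 */
function applyMutator(words, mutator) {
  const result = []
  for (const word of words) {
    const value = mutator(word)
    if (typeof value === 'string') {
      // A returned string is added to the list.
      result.push(value)
    } else if (Array.isArray(value)) {
      // All returned strings are added to the list.
      result.push(...value)
    }
    // Any other return value (e.g. undefined): the word is not added.
  }
  return result
}

// Split hyphenated words, lower-case them, and drop words with digits.
function splitAndLower(word) {
  if (/\d/.test(word)) return undefined
  return word.split('-').map(part => part.toLowerCase())
}

const mutated = applyMutator(['West-ender', 'apple', 'mp3'], splitAndLower)
console.log(mutated) // [ 'west', 'ender', 'apple' ]
```

Returning undefined is what makes a mutator usable as a filter as well as a transformer.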
Specify Alternate Word Lists
The paths specified in options are searched in order and the first list found is used. This allows for the use of system word lists with different names and/or locations on various platforms. A common location for the system word list is /usr/share/dict/words.
const { wordList } = require('@neopass/wordlist')
// Prefer british-english list.
const options = {
  paths: [
    '/usr/share/dict/british-english', // if found, use this one
    '/usr/share/dict/american-english', // else if found, use this one
    '/usr/share/dict/words', // else if found, use this one
    '$default', // else use this one
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 101825
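The "first one found" rule can be pictured with a short sketch. It assumes that "found" simply means the file exists and it ignores aliases such as $default, which the library resolves itself; firstExisting is a hypothetical helper, not part of the package.

```javascript
const fs = require('fs')

/**
 * Hypothetical helper: returns the first path that exists, or
 * undefined if none do. The `exists` predicate is injectable so the
 * lookup can be tried without touching the file system.
 */
function firstExisting(paths, exists = fs.existsSync) {
  return paths.find(p => exists(p))
}

// Pretend only the generic system dictionary is installed.
const installed = new Set(['/usr/share/dict/words'])
const chosen = firstExisting(
  [
    '/usr/share/dict/british-english',
    '/usr/share/dict/american-english',
    '/usr/share/dict/words',
  ],
  p => installed.has(p)
)
console.log(chosen) // /usr/share/dict/words
```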
Combine Lists
Lists can be combined into one with the combine option:
const { wordList } = require('@neopass/wordlist')
// Combine multiple dictionaries.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words', // use this one
    '$default', // and use this one
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 335427
Important: Using combine with wordList/wordListSync will result in duplicates if the lists overlap. It is recommended to use combine with listBuilder to control how words are added. For example, a Set can be used to eliminate duplicates from combined lists:
const { listBuilder } = require('@neopass/wordlist')
// Combine multiple lists.
const options = {
  combine: [
    // System dictionary.
    '/usr/share/dict/words',
    // Default list.
    '$default',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// Create a set to avoid duplicate words.
const set = new Set()
// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 299569
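If an array has already been produced by wordList or wordListSync with combine, duplicates can also be removed after the fact. A minimal sketch, using a stand-in array rather than real dictionaries; dedupe is a hypothetical helper name:

```javascript
/**
 * Hypothetical helper: removes duplicate words while keeping the
 * order of first occurrence.
 */
function dedupe(words) {
  return [...new Set(words)]
}

// Stand-in for two overlapping lists that were combined.
const combined = ['apple', 'banana', 'cherry', 'banana', 'apple', 'date']
const unique = dedupe(combined)
console.log(combined.length, unique.length) // 6 4
```

This holds the full combined array in memory before deduplicating, which is why the listBuilder approach is preferable for large lists.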
The Default List
The default list is a ~86,000-word, PG-13, lower-case list taken from English SCOWL sources, with some other additions including slang.
Suggestions for additions to the default list are welcome; please submit an issue. Whole lists are definitely preferred to single-word suggestions, e.g., "notable extraterrestrials in history", "insects of upper polish honduras", or "names of horses in modern literature". Suggestions for inappropriate word removal are also welcome (curse words, coarse words/slang, racial slurs, etc.).
By default the list alias, $default, is included in the options. This allows wordlist to create a largish list without any additional configuration.
export const defaultOptions: IListOptions = {
  paths: [
    '$default'
  ]
}
/**
 * We don't need to specify a config because the `$default` alias
 * is part of the default configuration.
 */
const list = wordListSync()
The $default alias (along with other aliases) resolves to a path at run time.
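As a rough picture of what resolving an alias to a path could look like, here is a minimal sketch. The directory name, the resolveAlias helper, and the mapping from alias to file name are all assumptions made for illustration; the package's real layout and resolution logic may differ.

```javascript
const path = require('path')

// Assumed, illustrative directory; not the package's real location.
const ALIAS_DIR = path.join('/opt', 'wordlists')

/**
 * Hypothetical helper: maps '$name' to a file inside aliasDir and
 * leaves ordinary paths untouched.
 */
function resolveAlias(listPath, aliasDir = ALIAS_DIR) {
  if (listPath.startsWith('$')) {
    return path.join(aliasDir, listPath.slice(1))
  }
  return listPath
}

console.log(resolveAlias('$default'))
console.log(resolveAlias('/usr/share/dict/words'))
```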
Generate a List From SCOWL Sources
SCOWL word lists are included as aliases, and can be used to generate custom lists:
const { listBuilder } = require('@neopass/wordlist')
// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// We'll add the words to a set.
const set = new Set()
// Run the builder.
builder(word => set.add(word))
  .then(() => console.log(set.size)) // 49130
Warning: Some SCOWL sources contain words not appropriate for all audiences, including swear words, racial slurs, and words of a sexual nature. You'll most likely want to scrutinize these sources depending on your use case and intended audience.
SCOWL is primarily intended as a source for spell checkers. From the SCOWL website:
SCOWL (Spell Checker Oriented Word Lists) and Friends is a database of information on English words useful for creating high-quality word lists suitable for use in spell checkers of most dialects of English. The database primary contains information on how common a word is, differences in spelling between the dialects if English, spelling variant information, and (basic) part-of-speech and inflection information.
Note: SCOWL sources contain some words with a trailing apostrophe-s ('s) and also Unicode characters. Care should be taken to deal with these depending on your needs. For example, we can transform words to remove any trailing 's characters and then only accept words that contain the letters a-z:
const { listBuilder } = require('@neopass/wordlist')
/**
 * Remove trailing `'s` from words.
 */
function transform(word) {
  if (word.endsWith(`'s`)) {
    return word.slice(0, -2)
  }
  return word
}
/**
 * Determine if a word should be added.
 */
function accept(word) {
  // Only accept words with characters a-z (case insensitive).
  return (/^[a-z]+$/i).test(word)
}
// Combine multiple lists from scowl.
const options = {
  combine: [
    '$english-words.10',
    '$english-words.20',
    '$english-words.35',
    '$special-hacker.50',
  ]
}
// Create a list builder.
const builder = listBuilder(options)
// Create a set to avoid duplicate words.
const set = new Set()
// Run the builder.
const _builder = builder((word) => {
  word = transform(word)
  if (accept(word)) {
    set.add(word)
  }
})
_builder.then(() => console.log(set.size)) // 38714
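For the Unicode characters mentioned in the note above, one option is to fold accented letters to their base letters before applying the a-z test, rather than rejecting those words outright. This is a suggestion that may or may not suit your use case, not something the library does for you. A minimal sketch using the standard String.prototype.normalize; foldAccents is a hypothetical helper name:

```javascript
/**
 * Hypothetical helper: strips combining accents from a word.
 */
function foldAccents(word) {
  // NFD splits an accented letter into its base letter plus a
  // combining mark; the replace then removes combining marks in the
  // range U+0300 to U+036F.
  return word.normalize('NFD').replace(/[\u0300-\u036f]/g, '')
}

console.log(foldAccents('caf\u00e9')) // cafe
console.log(foldAccents('na\u00efve')) // naive
console.log(foldAccents('plain')) // plain
```

Letters without a canonical decomposition, such as 'ø' or 'ß', pass through unchanged and would still be rejected by an a-z test.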
SCOWL Aliases
A path alias is defined for every SCOWL source list. SCOWL aliases consist of the $ character followed by the source file name. Below is a representative sample of the available source aliases.
$american-abbreviations.70
$american-abbreviations.95
$american-proper-names.80
$american-proper-names.95
$american-upper.50
$american-upper.80
$american-upper.95
$american-words.35
$american-words.80
$australian-abbreviations.35
$australian-abbreviations.80
$australian-contractions.35
$australian-proper-names.35
$australian-proper-names.80
$australian-proper-names.95
$australian-upper.60
$australian-upper.95
$australian-words.35
$australian-words.80
$australian_variant_1-abbreviations.95
$australian_variant_1-contractions.60
$australian_variant_1-proper-names.80
$australian_variant_1-proper-names.95
$australian_variant_1-upper.80
$australian_variant_1-upper.95
$australian_variant_1-words.80
$australian_variant_1-words.95
$australian_variant_2-abbreviations.80
$australian_variant_2-abbreviations.95
$australian_variant_2-contractions.50
$australian_variant_2-contractions.70
$australian_variant_2-proper-names.95
$australian_variant_2-upper.80
$australian_variant_2-words.55
$australian_variant_2-words.95
$british-abbreviations.35
$british-abbreviations.80
$british-proper-names.80
$british-proper-names.95
$british-upper.50
$british-upper.95
$british-words.10
$british-words.20
$british-words.35
$british-words.95
$british_variant_1-abbreviations.55
$british_variant_1-contractions.35
$british_variant_1-contractions.60
$british_variant_1-upper.95
$british_variant_1-words.10
$british_variant_1-words.95
$british_variant_2-abbreviations.70
$british_variant_2-contractions.50
$british_variant_2-upper.35
$british_variant_2-upper.95
$british_variant_2-words.80
$british_variant_2-words.95
$british_z-abbreviations.80
$british_z-abbreviations.95
$british_z-proper-names.80
$british_z-proper-names.95
$british_z-upper.50
$british_z-upper.95
$british_z-words.10
$british_z-words.95
$canadian-abbreviations.55
$canadian-proper-names.80
$canadian-proper-names.95
$canadian-upper.50
$canadian-upper.95
$canadian-words.10
$canadian-words.95
$canadian_variant_1-abbreviations.55
$canadian_variant_1-contractions.35
$canadian_variant_1-proper-names.95
$canadian_variant_1-upper.35
$canadian_variant_1-upper.80
$canadian_variant_1-words.35
$canadian_variant_1-words.95
$canadian_variant_2-abbreviations.70
$canadian_variant_2-contractions.50
$canadian_variant_2-upper.35
$canadian_variant_2-upper.80
$canadian_variant_2-words.35
$canadian_variant_2-words.80
$english-abbreviations.20
$english-abbreviations.80
$english-contractions.35
$english-contractions.80
$english-contractions.95
$english-proper-names.35
$english-proper-names.80
$english-upper.35
$english-upper.80
$english-words.80
$english-words.95
$special-hacker.50
$special-roman-numerals.35
$variant_1-abbreviations.55
$variant_1-abbreviations.95
$variant_1-contractions.35
$variant_1-proper-names.80
$variant_1-proper-names.95
$variant_1-upper.35
$variant_1-upper.80
$variant_1-words.20
$variant_1-words.80
$variant_2-abbreviations.70
$variant_2-abbreviations.95
$variant_2-contractions.50
$variant_2-contractions.70
$variant_2-upper.35
$variant_2-upper.95
$variant_2-words.35
$variant_2-words.95
$variant_3-abbreviations.40
$variant_3-abbreviations.95
$variant_3-words.35
$variant_3-words.95
See the SCOWL Readme for a description of SCOWL sources.
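Because every alias follows the same "$name.number" shape, a list of aliases can be filtered programmatically before it is passed to combine. The sketch below assumes the numeric suffix is the SCOWL size level, as described in the SCOWL Readme; parseAlias is a hypothetical helper, and the sample array only contains aliases that appear in the list above.

```javascript
/**
 * Hypothetical helper: splits an alias such as '$english-words.35'
 * into its category and numeric suffix. Returns null if the string
 * does not have that shape.
 */
function parseAlias(alias) {
  const match = /^\$(.+)\.(\d+)$/.exec(alias)
  if (!match) return null
  return { category: match[1], size: Number(match[2]) }
}

const sample = [
  '$english-words.80',
  '$english-words.95',
  '$english-contractions.35',
  '$special-hacker.50',
]

// Keep only aliases whose numeric suffix is 50 or lower.
const upTo50 = sample.filter((alias) => {
  const parsed = parseAlias(alias)
  return parsed !== null && parsed.size <= 50
})
console.log(upTo50) // [ '$english-contractions.35', '$special-hacker.50' ]
```

The filtered array could then be used as the combine option.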
Create a Custom Word List File
A custom word list file from miscellaneous sources can be assembled with the wordlist-gen binary, or the word-gen utility in the wordlist repo.
From the @neopass/wordlist package:
npx wordlist-gen --sources <path1 path2 ...> [options]
From the wordlist repo:
git clone [email protected]:neopass/wordlist.git
cd wordlist
node bin/word-gen --sources <path1 path2 ...> [options]
First, set up a directory of book and/or word list files, for example:
root
+-- data
    +-- books
    |   -- modern steam engine design.txt
    |   -- how to skin a rabbit.txt
    +-- lists
    |   -- names.txt
    |   -- animals.txt
    |   -- slang.txt
    +-- scowl
    |   -- english-words.10
    |   -- english-words.20
    |   -- english-words.35
    |   -- special-hacker.50
    +-- exclusions
    |   -- patterns.txt
The structure doesn't really matter. The format should be utf-8 text, and can consist of one or more words per line. exclusions is optional.
npx wordlist-gen --sources data/books data/lists data/scowl --out my-words.txt
sources can specify multiple files and/or directories.
Note: only words consisting of letters a-z are added, and they're all lower-cased.
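The note above can be approximated with a short sketch. This is an assumption about the general idea only: the real generator may tokenize differently (for example, by stripping punctuation before testing a word). extractWords is a hypothetical helper.

```javascript
/**
 * Hypothetical helper: splits text on whitespace, lower-cases each
 * token, and keeps only tokens made up entirely of the letters a-z.
 */
function extractWords(text) {
  return text
    .split(/\s+/)
    .map(token => token.toLowerCase())
    .filter(token => /^[a-z]+$/.test(token))
}

console.log(extractWords('Modern Steam engine\ndesign 101 west-ender'))
// [ 'modern', 'steam', 'engine', 'design' ]
```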
Exclusions
Words can be scrubbed by specifying exclusions:
node bin/word-gen <...> --exclude data/exclusions
Much like the sources, exclusions can consist of multiple files and/or directories in the following format:
# Exclude whole words (case insensitive):
spoon
fork
Tongs
# Exclude patterns (as regular expressions):
/^fudge/i # words starting with 'fudge'
/crikey/i # words containing 'crikey'
/shazam$/ # words ending in lowercase 'shazam'
/^BLASTED$/ # exact match for uppercase 'blasted'
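To make the exclusion format concrete, here is a minimal sketch of how such a file could be interpreted. It is not the generator's actual parser: parseExclusions is a hypothetical helper, and as a simplification it treats everything after a '#' as a comment, so a pattern containing '#' would not survive.

```javascript
/**
 * Hypothetical helper: turns exclusion text into a predicate that
 * reports whether a word should be excluded.
 */
function parseExclusions(text) {
  const words = new Set()
  const patterns = []
  for (const raw of text.split(/\r?\n/)) {
    // Drop comments (everything from '#' on) and surrounding whitespace.
    const line = raw.replace(/#.*$/, '').trim()
    if (line === '') continue
    // Lines shaped like /pattern/flags are regular expressions.
    const match = /^\/(.+)\/([a-z]*)$/.exec(line)
    if (match) {
      patterns.push(new RegExp(match[1], match[2]))
    } else {
      // Whole words are matched case-insensitively.
      words.add(line.toLowerCase())
    }
  }
  return word =>
    words.has(word.toLowerCase()) || patterns.some(re => re.test(word))
}

const isExcluded = parseExclusions([
  '# Exclude whole words (case insensitive):',
  'spoon',
  'Tongs',
  '# Exclude patterns (as regular expressions):',
  "/^fudge/i # words starting with 'fudge'",
  "/shazam$/ # words ending in lowercase 'shazam'",
].join('\n'))

console.log(isExcluded('Spoon')) // true
console.log(isExcluded('Fudgesicle')) // true
console.log(isExcluded('SHAZAM')) // false
console.log(isExcluded('knife')) // false
```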
Using the Custom List
Use path.resolve or path.join to create an absolute path to your custom word list file:
const path = require('path')
const { wordList } = require('@neopass/wordlist')
const options = {
  paths: [
    // Use a path relative to the location of this module.
    path.resolve(__dirname, '../my-words.txt')
  ]
}
wordList(options)
  .then(list => console.log(list.length)) // 124030
SCOWL License
Copyright 2000-2016 by Kevin Atkinson
Permission to use, copy, modify, distribute and sell these word
lists, the associated scripts, the output created from the scripts,
and its documentation for any purpose is hereby granted without fee,
provided that the above copyright notice appears in all copies and
that both that copyright notice and this permission notice appear in
supporting documentation. Kevin Atkinson makes no representations
about the suitability of this array for any purpose. It is provided
"as is" without express or implied warranty.