remove-stopwords

v1.0.3

Published

3 years ago

A package to remove common stopwords from an array. It covers most languages and is optimized primarily for WorldBrain.

swissums

stopwords nlp remove

remove-stopwords is a Node.js module that strips stopwords from an input text. In natural language processing, "stopwords" are words so frequent that they can safely be removed from a text without altering its meaning.

This library is designed specifically for WorldBrain's use case: stripping as many words as possible from every webpage so that search indexing stays fast across several thousand documents of varying content.
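
To make the idea concrete, here is a minimal sketch of stopword filtering in plain JavaScript. It is not this module's implementation: the `TINY_STOPWORDS` list and the `stripStopwords` helper are hypothetical names used only for illustration, and the lowercasing step is an assumption of the sketch.

```javascript
// Hypothetical, deliberately tiny stopword list (not the list shipped with this module).
const TINY_STOPWORDS = new Set(['a', 'the', 'with', 'some', 'of'])

// Keep only the tokens that are not in the stopword set.
// Lowercasing before the lookup is an assumption of this sketch.
function stripStopwords (tokens, stopwords = TINY_STOPWORDS) {
  return tokens.filter(token => !stopwords.has(token.toLowerCase()))
}

const tokens = 'The index of a page with some words'.split(' ')
const kept = stripStopwords(tokens)
console.log(kept) // [ 'index', 'page', 'words' ]
```

Fewer tokens per page means fewer entries to write to and read from a search index, which is the motivation described above.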

Credits:

This module was essentially copied directly from @fergiemcdowall's stopword library. The only differences are that support for more languages was added from the stopwords json lib, and that several language lists were tweaked slightly for WorldBrain's use case. Unless otherwise specified, all the stopwords come from the stopwords json lib.

Usage

Default (English)

By default, removeStopwords strips common ("meaningless") English words from an array of words:

const sw = require('stopword')
const oldString = 'a really Interesting string with some words'.split(' ')
const newString = sw.removeStopwords(oldString)
// newString is now [ 'really', 'Interesting', 'string', 'words' ]

Other languages

You can also specify a language other than English, as a string:

const sw = require('stopword')
const oldString = 'Trädgårdsägare är beredda att pröva vad som helst för att bli av med de hatade mördarsniglarna åäö'.split(' ')
// sw.sv contains Swedish stopwords
const newString = sw.removeStopwords(oldString, 'sv')
// newString is now [ 'Trädgårdsägare', 'beredda', 'pröva', 'helst', 'hatade', 'mördarsniglarna', 'åäö' ]
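
The examples here split on single spaces, which leaves punctuation attached to words (for example `'helst,'`), so such tokens would not match a stopword entry. The following is a hedged sketch of a Unicode-aware tokenizer that could run before stopword removal. The `tokenize` helper is a hypothetical name and is not part of this module.

```javascript
// Hypothetical pre-processing helper (not part of this module).
// Normalizes to NFC so accented letters are single code points, then
// extracts runs of Unicode letters and digits, dropping punctuation.
function tokenize (text) {
  return text.normalize('NFC').match(/[\p{L}\p{N}]+/gu) || []
}

const words = tokenize('Trädgårdsägare är beredda, att pröva!')
console.log(words)
// [ 'Trädgårdsägare', 'är', 'beredda', 'att', 'pröva' ]
```

The resulting array can then be passed to the stopword removal call in place of the output of `split(' ')`.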

All languages

To remove the stopwords of every supported language at once, pass 'all':

const sw = require('stopword')
const oldString = 'Trädgårdsägare är beredda att a really Interesting string with some words ciao'.split(' ')
// 'all' iterates over every stopword list in the lib
const newString = sw.removeStopwords(oldString, 'all')
// newString is now [ 'Trädgårdsägare', 'beredda', 'really', 'Interesting', 'string', 'words' ]
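
As a rough illustration of what removing stopwords across all lists involves, the sketch below merges several lists into one `Set` and filters against it. This is an assumption about one reasonable approach, not the module's actual code. The `lists` object holds hypothetical miniature lists, and `removeAll` is a made-up name.

```javascript
// Hypothetical miniature lists keyed by language code (not the module's data).
const lists = {
  en: ['a', 'with', 'some'],
  sv: ['är', 'att'],
  it: ['ciao']
}

// Merge every list into one Set so each token needs only a single lookup.
const allStopwords = new Set(Object.values(lists).flat())

function removeAll (tokens) {
  return tokens.filter(token => !allStopwords.has(token))
}

const mixed = 'Trädgårdsägare är beredda att a really Interesting string with some words ciao'.split(' ')
const cleaned = removeAll(mixed)
console.log(cleaned)
// [ 'Trädgårdsägare', 'beredda', 'really', 'Interesting', 'string', 'words' ]
```

One consequence of this approach is that a word which is meaningful in one language is dropped if it happens to be a stopword in another.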

Custom list of stopwords

And last, but not least, it is possible to use your own, custom list of stopwords:

const sw = require('stopword')
const oldString = 'you can even roll your own custom stopword list'.split(' ')
// Just add your own list/array of stopwords
const newString = sw.removeStopwords(oldString, [ 'even', 'a', 'custom', 'stopword', 'list', 'is', 'possible' ])
// newString is now [ 'you', 'can', 'roll', 'your', 'own']
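
A custom list does not have to be written from scratch. The sketch below merges a base list with project-specific words into a single de-duplicated array, which could then be passed as the custom list. The `builtInEnglish` array is a shortened stand-in for a supplied list, and `mergeStopwords` is a hypothetical helper, not part of this module.

```javascript
// Shortened stand-in for a supplied list (hypothetical contents).
const builtInEnglish = ['a', 'is', 'the', 'with']
// Project-specific additions; 'the' is repeated on purpose.
const projectWords = ['stopword', 'list', 'the']

// Merge any number of lists and drop duplicates.
// A Set keeps first-insertion order, so the base list comes first.
function mergeStopwords (...wordLists) {
  return [...new Set(wordLists.flat())]
}

const combined = mergeStopwords(builtInEnglish, projectWords)
console.log(combined) // [ 'a', 'is', 'the', 'with', 'stopword', 'list' ]
```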

API

Language List

Arrays of stopwords for the following languages are supplied:

af - Afrikaans
ar - Modern Standard Arabic
hy - Armenian
eu - Basque
bn - Bengali
br - Brazilian Portuguese
bg - Bulgarian
ca - Catalan
zh - Chinese
hr - Croatian
cs - Czech
da - Danish
nl - Dutch
en - English
eo - Esperanto
et - Estonian
fa - Farsi
fi - Finnish
fr - French
gl - Galician
de - German
el - Greek
ha - Hausa
he - Hebrew
hi - Hindi
hu - Hungarian
id - Indonesian
ga - Irish
it - Italian
ja - Japanese
ko - Korean
la - Latin
lv - Latvian
mr - Marathi
no - Norwegian
fa - Persian
pl - Polish
pt - Portuguese
ro - Romanian
ru - Russian
sk - Slovak
sl - Slovenian
so - Somali
st - Southern Sotho
es - Spanish
sw - Swahili
sv - Swedish
th - Thai
yo - Yoruba
zu - Zulu

const sw = require('stopword')
const norwegianStopwords = sw.no
// norwegianStopwords now contains an Array of Norwegian stopwords

Languages with no space between words

Japanese (ja) and Simplified Chinese (zh) are written without spaces between words. For these languages, you need to split the text into words before passing it to the stopword module. You can check out TinySegmenter for Japanese and chinese-tokenizer for Chinese.
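
As an alternative to the two libraries named above, recent JavaScript runtimes ship a built-in `Intl.Segmenter` (available in Node.js 16 and later) that can split unspaced text into word-like segments. The sketch below uses it instead of TinySegmenter or chinese-tokenizer. The `segmentWords` helper is a hypothetical name, and the exact word boundaries depend on the ICU data bundled with the runtime, so no expected output is shown.

```javascript
// Hypothetical helper (not part of this module): split text into
// word-like segments with the runtime's built-in Intl.Segmenter.
// Boundaries depend on the ICU data bundled with the runtime.
function segmentWords (text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
  return Array.from(segmenter.segment(text))
    .filter(part => part.isWordLike) // drop punctuation and whitespace
    .map(part => part.segment)
}

const japaneseText = '私は日本語を勉強しています。'
const japaneseTokens = segmentWords(japaneseText, 'ja')
console.log(japaneseTokens)
```

The resulting array plays the same role as the output of `split(' ')` in the examples for space-delimited languages.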