stopwords-json
v1.2.0
stopwords-json
Stopwords for various languages in JSON format. Per Wikipedia:
Stop words are words which are filtered out prior to, or after, processing of natural language data [...] these are some of the most common, short function words, such as the, is, at, which, and on.
You can use all stopwords with stopwords-all.json (keyed by ISO 639-1 language code), or see the table below for the individual per-language stopword files.
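As a rough illustration of how the combined file might be consumed, here is a minimal Python sketch. It assumes stopwords-all.json is a JSON object that maps each language code to an array of strings; the README only states that the file is keyed by language code, so the array-of-strings layout is an assumption. The helper names `load_all_stopwords` and `remove_stopwords` are hypothetical, and the demo uses a tiny hand-made sample rather than the real data.

```python
import json
import tempfile
from pathlib import Path


def load_all_stopwords(path):
    """Read the combined file and return {language_code: set_of_stopwords}."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumed layout: {"en": ["the", "is", ...], "de": [...], ...}
    return {code: set(words) for code, words in data.items()}


def remove_stopwords(tokens, stopwords):
    """Drop every token whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stopwords]


if __name__ == "__main__":
    # Hand-made stand-in for stopwords-all.json (not the package's real data).
    sample = {"en": ["the", "is", "at", "which", "on"]}
    with tempfile.TemporaryDirectory() as tmp:
        sample_path = Path(tmp) / "stopwords-all.json"
        sample_path.write_text(json.dumps(sample), encoding="utf-8")
        by_language = load_all_stopwords(sample_path)
    print(remove_stopwords("The cat is on the mat".split(), by_language["en"]))
    # prints ['cat', 'mat']
```

Converting each list to a set keeps membership checks fast even for the larger lists; the simple lowercase comparison is a choice made for this sketch and may not suit every language.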
Languages
There are a total of 50 supported languages:
Language | Stopword count | Filename
--- | --- | ---
Afrikaans | 51 | af.json
Arabic | 162 | ar.json
Armenian | 45 | hy.json
Basque | 98 | eu.json
Bengali | 116 | bn.json
Breton | 126 | br.json
Bulgarian | 259 | bg.json
Catalan | 218 | ca.json
Chinese | 542 | zh.json
Croatian | 179 | hr.json
Czech | 346 | cs.json
Danish | 101 | da.json
Dutch | 275 | nl.json
English | 570 | en.json
Esperanto | 173 | eo.json
Estonian | 35 | et.json
Finnish | 772 | fi.json
French | 606 | fr.json
Galician | 160 | gl.json
German | 596 | de.json
Greek | 75 | el.json
Hausa | 39 | ha.json
Hebrew | 194 | he.json
Hindi | 225 | hi.json
Hungarian | 781 | hu.json
Indonesian | 355 | id.json
Irish | 109 | ga.json
Italian | 619 | it.json
Japanese | 109 | ja.json
Korean | 679 | ko.json
Latin | 49 | la.json
Latvian | 161 | lv.json
Marathi | 99 | mr.json
Norwegian | 172 | no.json
Persian | 332 | fa.json
Polish | 260 | pl.json
Portuguese | 408 | pt.json
Romanian | 282 | ro.json
Russian | 539 | ru.json
Slovak | 110 | sk.json
Slovenian | 446 | sl.json
Somali | 30 | so.json
Southern Sotho | 31 | st.json
Spanish | 577 | es.json
Swahili | 74 | sw.json
Swedish | 401 | sv.json
Thai | 115 | th.json
Turkish | 279 | tr.json
Yoruba | 60 | yo.json
Zulu | 29 | zu.json
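When only a few languages are needed, the per-language files can be loaded individually instead of the combined file. The sketch below assumes each file (for example en.json) holds a flat JSON array of strings and that the files sit together in one directory; both points are assumptions, and `stopword_file` and `load_languages` are hypothetical helper names.

```python
import json
from pathlib import Path


def stopword_file(directory, language_code):
    """Build the per-language path, e.g. <directory>/en.json."""
    return Path(directory) / f"{language_code}.json"


def load_languages(directory, language_codes):
    """Union the stopwords of several languages into one set."""
    combined = set()
    for code in language_codes:
        with stopword_file(directory, code).open(encoding="utf-8") as fh:
            # Assumed layout: a flat JSON array of strings.
            combined.update(json.load(fh))
    return combined
```

Merging several languages into one set can be handy for mixed-language text, at the cost of occasionally dropping a word that is a stopword in one language but meaningful in another.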
Sources
- Apache Lucene - Apache 2.0 License
- Carrot2 - License
- cue.language - Apache 2.0 License
- Jacques Savoy - BSD License
- SMART Information Retrieval System: ftp://ftp.cs.cornell.edu/pub/smart/
- ASP Stoplist Project - CC-BY and Apache 2.0
License and Copyright
Copyright (c) 2017 Peter Graham, contributors. Released under the Apache-2.0 license.