@brc-dd/lingua

v0.1.0

Published

7 months ago

brc-dd

1. What does this library do?

Its task is simple: It tells you which language some text is written in. This is very useful as a preprocessing step for linguistic data in natural language processing applications such as text classification and spell checking. Another use case is routing e-mails to the customer service department for the right region, based on the language each e-mail is written in.

2. Why does this library exist?

Language detection is often done as part of large machine learning frameworks or natural language processing applications. In cases where you don't need the full functionality of those systems or don't want to learn how to use them, a small flexible library comes in handy.

So far, the other comprehensive open source libraries for this task in the Rust ecosystem are CLD2, Whatlang and Whichlang. Unfortunately, most of them have two major drawbacks:

- Detection only works reliably with fairly long text fragments. For very short text snippets such as Twitter messages, they do not provide adequate results.
- The more languages take part in the decision process, the less accurate the detection results become.

Lingua aims to eliminate these problems. She needs almost no configuration and yields pretty accurate results on both long and short text, even on single words and phrases. She draws on both rule-based and statistical methods but does not use any dictionaries of words. She does not need a connection to any external API or service either. Once the library has been downloaded, it can be used completely offline.

3. Which languages are supported?

Compared to other language detection libraries, Lingua's focus is on quality over quantity, that is, getting detection right for a small set of languages first before adding new ones. Currently, the following 75 languages are supported:

A
- Afrikaans
- Albanian
- Arabic
- Armenian
- Azerbaijani
B
- Basque
- Belarusian
- Bengali
- Norwegian Bokmal
- Bosnian
- Bulgarian
C
- Catalan
- Chinese
- Croatian
- Czech
D
- Danish
- Dutch
E
- English
- Esperanto
- Estonian
F
- Finnish
- French
G
- Ganda
- Georgian
- German
- Greek
- Gujarati
H
- Hebrew
- Hindi
- Hungarian
I
- Icelandic
- Indonesian
- Irish
- Italian
J
- Japanese
K
- Kazakh
- Korean
L
- Latin
- Latvian
- Lithuanian
M
- Macedonian
- Malay
- Maori
- Marathi
- Mongolian
N
- Norwegian Nynorsk
P
- Persian
- Polish
- Portuguese
- Punjabi
R
- Romanian
- Russian
S
- Serbian
- Shona
- Slovak
- Slovene
- Somali
- Sotho
- Spanish
- Swahili
- Swedish
T
- Tagalog
- Tamil
- Telugu
- Thai
- Tsonga
- Tswana
- Turkish
U
- Ukrainian
- Urdu
V
- Vietnamese
W
- Welsh
X
- Xhosa
Y
- Yoruba
Z
- Zulu

4. How accurate is it?

Lingua is able to report accuracy statistics for some bundled test data available for each supported language. The test data for each language is split into three parts:

- a list of single words with a minimum length of 5 characters
- a list of word pairs with a minimum length of 10 characters
- a list of complete grammatical sentences of various lengths
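The length criteria above can be sketched as two small predicates. This is only an illustration, not the code used to build the bundled test data: the function names are hypothetical, characters are counted as Unicode scalar values, and for word pairs the separating space is assumed not to count toward the 10 characters, which the text does not specify.

```rust
/// Minimum lengths taken from the list above.
const MIN_SINGLE_WORD_CHARS: usize = 5;
const MIN_WORD_PAIR_CHARS: usize = 10;

/// True if `text` is exactly one whitespace-free token of at least
/// 5 characters. Characters are counted with `chars()`, not bytes,
/// so non-ASCII scripts are measured correctly.
fn is_single_word(text: &str) -> bool {
    let mut tokens = text.split_whitespace();
    match (tokens.next(), tokens.next()) {
        (Some(word), None) => word.chars().count() >= MIN_SINGLE_WORD_CHARS,
        _ => false,
    }
}

/// True if `text` is exactly two tokens whose combined length is at
/// least 10 characters. Assumption: the space between the two words
/// is not counted.
fn is_word_pair(text: &str) -> bool {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let total_chars: usize = tokens.iter().map(|t| t.chars().count()).sum();
    tokens.len() == 2 && total_chars >= MIN_WORD_PAIR_CHARS
}
```

For example, `is_single_word("hello")` holds, while `is_single_word("мир")` does not: the word is 6 bytes long but only 3 characters.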

Both the language models and the test data have been created from separate documents of the Wortschatz corpora offered by Leipzig University, Germany. Data crawled from various news websites have been used for training, each corpus comprising one million sentences. For testing, corpora made of arbitrarily chosen websites have been used, each comprising ten thousand sentences. From each test corpus, random unsorted subsets of 1000 single words, 1000 word pairs and 1000 sentences have been extracted.

Using the generated test data, I have compared the detection results of Lingua, CLD2, Whatlang and Whichlang on the 75 languages that Lingua supports. A language that one of the other libraries does not support is simply skipped when evaluating that library.
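The evaluation described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark code behind the reported numbers: a classifier is modeled as a hypothetical closure from text to an optional language name, and `accuracy_percent` and `evaluate` are names made up for this sketch.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Percentage of `samples` for which `detect` returns `expected`.
/// Returns 0.0 for an empty sample list.
fn accuracy_percent<F>(samples: &[&str], expected: &str, detect: F) -> f64
where
    F: Fn(&str) -> Option<String>,
{
    if samples.is_empty() {
        return 0.0;
    }
    let mut correct = 0usize;
    for &sample in samples {
        if detect(sample).as_deref() == Some(expected) {
            correct += 1;
        }
    }
    100.0 * correct as f64 / samples.len() as f64
}

/// Accuracy per language for one classifier. Languages the classifier
/// does not support are skipped, so they do not count against it.
fn evaluate<F>(
    test_data: &BTreeMap<&str, Vec<&str>>,
    supported: &BTreeSet<&str>,
    detect: F,
) -> BTreeMap<String, f64>
where
    F: Fn(&str) -> Option<String>,
{
    let mut results = BTreeMap::new();
    for (&language, samples) in test_data {
        if !supported.contains(language) {
            continue; // unsupported language: ignored for this classifier
        }
        let accuracy = accuracy_percent(samples, language, &detect);
        results.insert(language.to_string(), accuracy);
    }
    results
}
```

Calling `evaluate` once per classifier and per test set (single words, word pairs, sentences) would give one accuracy value per language, which is the kind of value a per-language bar plot shows.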

Each of the following sections contains four plots. The bar plots show the detailed accuracy results for each supported language. The box plots illustrate the distribution of the accuracy values for each classifier. Each box spans the range within which the middle 50 % of the values lie. The horizontal line inside each colored box marks the median of the distribution.
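To make the box plot description concrete, here is a small sketch that computes the three values a box is drawn from: the first quartile, the median and the third quartile. The function names are hypothetical, and linear interpolation between neighboring values is an assumption; plotting tools differ in how they define quartiles, and the text does not say which definition was used.

```rust
/// Value at quantile `q` (clamped to 0.0..=1.0) of a slice that is
/// already sorted in ascending order. Uses linear interpolation
/// between the two nearest values (an assumption, see above).
fn quantile(sorted: &[f64], q: f64) -> Option<f64> {
    if sorted.is_empty() {
        return None;
    }
    let q = q.clamp(0.0, 1.0);
    let position = q * (sorted.len() - 1) as f64;
    let lower = position.floor() as usize;
    let upper = position.ceil() as usize;
    let weight = position - lower as f64;
    Some(sorted[lower] + weight * (sorted[upper] - sorted[lower]))
}

/// Returns (first quartile, median, third quartile). The box of a box
/// plot spans from the first to the third quartile, i.e. the middle
/// 50 % of the values, and the line inside it marks the median.
fn box_stats(values: &[f64]) -> Option<(f64, f64, f64)> {
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    Some((
        quantile(&sorted, 0.25)?,
        quantile(&sorted, 0.5)?,
        quantile(&sorted, 0.75)?,
    ))
}
```

For the accuracy values 90, 60, 80, 70 and 100, this yields a box from 70 to 90 with the median line at 80.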

The first two plots in each section show the results for all languages supported by each classifier. The last two plots are restricted to the common subset of currently 16 languages that all compared classifiers support. This distinction matters because the first box plot creates the impression that Whichlang is the most accurate classifier, which it is not: Whichlang supports only 16 languages whereas Lingua supports 75. For the second box plot, the supported languages in Whatlang and Lingua have been restricted to the 16 languages supported by Whichlang. This makes for a fairer comparison and shows that, overall, Lingua is the most accurate language detection library in this comparison.
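The effect of restricting the comparison to the common subset can be sketched like this. The helper names are hypothetical, and this is not the code that produced the plots. The idea is to intersect the sets of supported languages and then average each classifier's accuracy over that intersection only.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Languages supported by every classifier (set intersection).
/// Returns an empty set if no classifiers are given.
fn common_languages(per_classifier: &[BTreeSet<String>]) -> BTreeSet<String> {
    let mut sets = per_classifier.iter();
    let first = match sets.next() {
        Some(set) => set.clone(),
        None => return BTreeSet::new(),
    };
    sets.fold(first, |acc, set| acc.intersection(set).cloned().collect())
}

/// Mean accuracy of one classifier over the given languages.
/// Languages without an accuracy value are skipped.
fn mean_over(accuracies: &BTreeMap<String, f64>, languages: &BTreeSet<String>) -> Option<f64> {
    let picked: Vec<f64> = languages
        .iter()
        .filter_map(|language| accuracies.get(language).copied())
        .collect();
    if picked.is_empty() {
        None
    } else {
        Some(picked.iter().sum::<f64>() / picked.len() as f64)
    }
}
```

As a purely made-up example, take a classifier that supports three languages with accuracies 90, 80 and 40, and another that supports only the first two with accuracies 85 and 75. Averaged over everything each one supports, the second looks better (80 versus 70). Averaged over the two common languages, the first one leads (85 versus 80).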

4.1 Single word detection

4.1.1 All languages

4.1.2 Common languages

4.2 Word pair detection

4.2.1 All languages

4.2.2 Common languages

4.3 Sentence detection

4.3.1 All languages

4.3.2 Common languages

4.4 Average detection

4.4.1 All languages

4.4.2 Common languages

4.5 Mean, median and standard deviation

The tables below show detailed statistics for each language and classifier including mean, median and standard deviation.
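For reference, the three statistics named above can be computed as in the following sketch. The function names are hypothetical. The standard deviation here is the population form (dividing by the number of values), which is an assumption, since the text does not say whether the population or the sample form is reported.

```rust
/// Arithmetic mean, or None for an empty slice.
fn mean(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    Some(values.iter().sum::<f64>() / values.len() as f64)
}

/// Median: the middle value of the sorted data, or the average of the
/// two middle values when the number of values is even.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}

/// Population standard deviation (divides by n, not n - 1).
/// Assumption: the source does not state which form is used.
fn std_dev(values: &[f64]) -> Option<f64> {
    let m = mean(values)?;
    let variance = values.iter().map(|v| (v - m).powi(2)).sum::<f64>() / values.len() as f64;
    Some(variance.sqrt())
}
```

For the values 2, 4, 4, 4, 5, 5, 7 and 9, the mean is 5, the median is 4.5 and the population standard deviation is 2.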