compendium-js

v0.0.32

Published

2 years ago

Natural Language Processing in the browser: Tokenization, Part-of-Speech tagging and more.

ulflander

Compendium

English NLP for Node.js and the browser. Try it in your browser!

35k gzipped, Part-of-Speech tagging (92% on Penn treebank), entity recognition, sentiment analysis and more, MIT licensed.

Client-side install

Step 1: get the lib

Install it with bower:

bower install --save compendium

Or clone this repo and copy the dist/compendium.minimal.js file into your project.

Step 2: include the lib in your HTML page

<script type="text/javascript"
    src="path/to/compendium/dist/compendium.minimal.js"></script>

In order to ensure that Compendium will work as intended, you must specify the encoding of the HTML page as UTF-8.

Step 3: enjoy

Call the compendium.analyse function with a string as parameter, and get a complete analysis of the text.

console.log( compendium.analyse('Hello world :)') );

Node.js install

Step 1: get the lib

npm install --save compendium-js

Step 2: enjoy

var compendium = require('compendium-js');

console.log(compendium.analyse('Hello world :)'));

API

The main function to call is analyse.

It takes the text to analyse as a string and returns an array containing one analysis per sentence. For example, calling:

compendium.analyse('My name is Dr. Jekyll.');

will return an array like this one:

[ { time: 9,                        // Time of processing, in ms
    length: 6,                      // Count of tokens
    raw: 'My name is Dr. Jekyll .', // Raw string
    stats:
     { confidence: 0.4583,          // PoS tagging confidence
       p_foreign: 0,                // Percentage of foreign PoS tags, e.g. `FW`
       p_upper: 0,                  // Percentage of uppercased tokens, e.g. `HELLO`
       p_cap: 50,                   // Percentage of capitalized tokens, e.g. `Hello`
       avg_length: 3 },             // Average token length
    profile:
     { label: 'neutral',            // Sentiment: `negative`, `neutral`, `positive`, `mixed`
       sentiment: 0,                // Sentiment score
       amplitude: 0,                // Sentiment amplitude
       types: [],                   // Types ('tags') of sentence
       politeness: 0,               // Politeness score
       dirtiness: 0,                // Dirtiness score
       negated: false },            // Is sentence mainly negated
    entities: [ {                   // List of entities
        raw: 'Dr. Jekyll',          // Raw reconstructed entity
        norm: 'doctor jekyll',      // Normalized entity
        fromIndex: 3,               // Start token index
        toIndex: 4,                 // End token index
        type: null } ],             // Type of entity: null for unknown, `ip`, `email`...
    tags:                           // Array of PoS tags
     [ 'PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.' ],
    tokens:                         // Tokens details
     [ { raw: 'My',                 // Raw token
        norm: 'my',                 // Normalized
        pos: 'PRP$',                // PoS tag
        profile:
         { sentiment: 0,            // Sentiment score
           emphasis: 1,             // Emphasis multiplier
           negated: false,          // Is negated
           breakpoint: false },     // Is breakpoint
        attr:
         { acronym: false,          // Is acronym
           plural: false,           // Is plural
           abbr: false,             // Is an abbreviation
           verb: false,             // Is a verb
           entity: -1 } },          // Entity index, `-1` if no entity
        //
        // ... Other tokens
        //
   ] } ]
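
As a minimal sketch of how such a result can be consumed, the snippet below walks a hand-written sample that mirrors the shape documented above (trimmed to a few fields). The `summarize` helper is hypothetical and not part of Compendium.

```javascript
// Hand-written sample mirroring the documented result shape.
// It is trimmed to the fields used below and is not real library output.
const sampleAnalysis = [{
  raw: 'My name is Dr. Jekyll .',
  profile: { label: 'neutral', sentiment: 0 },
  entities: [{ raw: 'Dr. Jekyll', norm: 'doctor jekyll', fromIndex: 3, toIndex: 4, type: null }],
  tags: ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.'],
  tokens: [
    { raw: 'My', pos: 'PRP$' },
    { raw: 'name', pos: 'NN' },
    { raw: 'is', pos: 'VBZ' },
    { raw: 'Dr.', pos: 'NNP' },
    { raw: 'Jekyll', pos: 'NNP' },
    { raw: '.', pos: '.' }
  ]
}];

// Hypothetical helper: builds one compact summary per sentence.
function summarize(analysis) {
  return analysis.map(function (sentence) {
    return {
      label: sentence.profile.label,
      tagged: sentence.tokens.map(function (t) { return t.raw + '/' + t.pos; }).join(' '),
      entities: sentence.entities.map(function (e) { return e.raw; })
    };
  });
}

console.log(summarize(sampleAnalysis));
// With the library loaded, the same helper would be called as:
// summarize(compendium.analyse('My name is Dr. Jekyll.'));
```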

Skipping detectors

From version 0.0.26, you can speed up the analysis by passing a skipDetectors argument to the analyse function, which skips specific detectors.

Skippable detectors are the following:

sentiment: Sentiment analysis
entities: Entity extraction
negation: Negation detection
type: Type detection (declarative, interrogative...)
numeric: Numeric values extraction

For example, the following call to analyse won't run the entity extraction detector, meaning that Dr. Jekyll won't appear in the entities section of the analysis result:

compendium.analyse('My name is Dr. Jekyll.', null, ['entities']);
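
When only one or two detectors are needed, it can be easier to state what to keep than what to skip. The sketch below builds the skip list from the detector names listed above. `skipAllExcept` is a hypothetical helper, not part of the library.

```javascript
// Detector names listed in this README as skippable.
const SKIPPABLE_DETECTORS = ['sentiment', 'entities', 'negation', 'type', 'numeric'];

// Hypothetical helper: returns every skippable detector except the ones to keep.
function skipAllExcept(keep) {
  keep.forEach(function (name) {
    if (SKIPPABLE_DETECTORS.indexOf(name) === -1) {
      throw new Error('Unknown detector: ' + name);
    }
  });
  return SKIPPABLE_DETECTORS.filter(function (name) {
    return keep.indexOf(name) === -1;
  });
}

console.log(skipAllExcept(['sentiment', 'negation']));
// The result is meant to be passed as the third argument:
// compendium.analyse(text, null, skipAllExcept(['sentiment', 'negation']));
```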

Processing overview

Decoding

Handles decoding of HTML entities (e.g. &amp; to &), and normalization of some abbreviations that involve breakpoint characters (e.g. w/ to with).
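
The decoding step can be pictured with the sketch below. It is an illustration only: the tables are tiny and hand-picked (only the w/ entry comes from this README), and the `decode` function is a hypothetical name, not Compendium's implementation.

```javascript
// Tiny illustrative tables; the library's real tables are not shown in this README.
const HTML_ENTITIES = { '&amp;': '&', '&lt;': '<', '&gt;': '>', '&quot;': '"', '&#39;': "'" };
const ABBREVIATIONS = { 'w/': 'with', 'w/o': 'without' };

function decode(text) {
  // Replace the known HTML entities in a single pass; anything else is left untouched.
  const decoded = text.replace(/&(?:amp|lt|gt|quot|#39);/g, function (entity) {
    return HTML_ENTITIES[entity];
  });
  // Expand abbreviations only when they stand alone as a whitespace-delimited word.
  // Splitting on a captured group keeps the original whitespace in the output.
  return decoded.split(/(\s+)/).map(function (chunk) {
    const key = chunk.toLowerCase();
    return Object.prototype.hasOwnProperty.call(ABBREVIATIONS, key) ? ABBREVIATIONS[key] : chunk;
  }).join('');
}

console.log(decode('Tom &amp; Jerry w/ friends'));
```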

Lexer

No good part-of-speech tagging is possible without a good lexer. A lot of effort has been put into Compendium's lexer, so that it provides the right tokens to be processed. Currently the lexer is a combination of four passes:

A first pass splits the text into sentences
A second one applies some regular expressions to extract specific parts of the sentences (URLs, emails, emoticons...)
The third pass is a character-by-character parser that splits a sentence into tokens, relying on Punycode.js to properly handle emojis
The final pass consolidates tokens such as acronyms, abbreviations, contractions..., and handles a few exceptions
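
The four passes can be sketched as follows. This is a much-simplified illustration under stated assumptions, not the library's lexer: the regular expressions and the abbreviation list are hand-picked, all function names are hypothetical, and `for...of` iteration over code points stands in for Punycode.js to keep emojis whole.

```javascript
// Abbreviations known to this sketch (illustrative, not the library's list).
const ABBREVIATIONS = ['dr', 'mr', 'mrs'];
const ENDS_WITH_ABBREVIATION = /\b(?:dr|mr|mrs)\.$/i;
// URLs, emails and a few emoticons, kept whole by pass 2.
const SPECIAL = /(https?:\/\/\S+|[\w.+-]+@[\w-]+\.[\w.-]+|[:;]-?[()\/DP])/;

// Pass 1: split after ., ! or ? followed by whitespace, then re-join
// parts that were cut right after a known abbreviation such as "Dr.".
function splitSentences(text) {
  const sentences = [];
  text.replace(/([.!?])\s+/g, '$1\n').split('\n').forEach(function (part) {
    if (!part) return;
    const last = sentences[sentences.length - 1];
    if (last && ENDS_WITH_ABBREVIATION.test(last)) sentences[sentences.length - 1] = last + ' ' + part;
    else sentences.push(part);
  });
  return sentences;
}

// Pass 3: character-by-character split. Letters, digits and apostrophes build
// up a word; whitespace ends it; any other character becomes its own token.
function splitChars(chunk) {
  const tokens = [];
  let current = '';
  for (const ch of chunk) { // iterates by code point, so an emoji is one `ch`
    if (/[\p{L}\p{N}']/u.test(ch)) { current += ch; continue; }
    if (current) { tokens.push(current); current = ''; }
    if (!/\s/.test(ch)) tokens.push(ch);
  }
  if (current) tokens.push(current);
  return tokens;
}

// Pass 4: glue a known abbreviation back to its trailing dot.
function consolidate(tokens) {
  const out = [];
  for (let i = 0; i < tokens.length; i++) {
    if (tokens[i + 1] === '.' && ABBREVIATIONS.indexOf(tokens[i].toLowerCase()) !== -1) {
      out.push(tokens[i] + '.');
      i++;
    } else {
      out.push(tokens[i]);
    }
  }
  return out;
}

function lex(text) {
  return splitSentences(text).map(function (sentence) {
    const tokens = [];
    // Pass 2: splitting on a captured group puts special chunks at odd indices.
    sentence.split(SPECIAL).forEach(function (chunk, i) {
      if (i % 2 === 1) tokens.push(chunk);
      else splitChars(chunk).forEach(function (t) { tokens.push(t); });
    });
    return consolidate(tokens);
  });
}

console.log(lex('My name is Dr. Jekyll. Mail me at a@b.com :)'));
```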

Cleaner

This small component runs after the lexer and normalizes a few slang terms (e.g. gr8 to great).
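
A minimal sketch of such a cleaner is shown below. It assumes that the slang form stays in the token's raw field while the normalized form goes to norm, following the token shape shown in the API section. Only the gr8 entry comes from this README; the other entries and the `clean` function are illustrative.

```javascript
// Illustrative slang table; only the gr8 entry comes from this README.
const SLANG = { gr8: 'great', thx: 'thanks', pls: 'please' };

// Hypothetical cleaner: keeps `raw` untouched and writes the normalized form to `norm`.
function clean(tokens) {
  return tokens.map(function (token) {
    const key = token.raw.toLowerCase();
    const norm = Object.prototype.hasOwnProperty.call(SLANG, key) ? SLANG[key] : key;
    return { raw: token.raw, norm: norm };
  });
}

console.log(clean([{ raw: 'That' }, { raw: 'was' }, { raw: 'gr8' }]));
```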

Part-of-speech tagging

Tagging is performed using a Brill tagger (i.e. a base lexicon and a set of rules), with the addition of some inflection-based rules.

The tagger was inspired by the following projects, which are worth checking out:

Eric Brill's tagger: the latest implementation, published under the MIT license, is available for download from the Plymouth University website.
Mark Watson's FastTag Java library, a very simple implementation of the Brill tagger.
NLP Compromise, another great JS NLP toolkit, with an interesting inflection-based approach

PoS tagging is tested against a set of unit tests generated with the Stanford PoS tagger, double-checked with common sense and another machine-learning oriented tagger, and is then evaluated using the Penn Treebank dataset.

In September 2015, Compendium's PoS tagging score on the Penn Treebank was 92.76% of tags recognized for the browser version, and 94.31% for the Node.js version.
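
For readers unfamiliar with Brill-style tagging, the sketch below shows the general idea: an initial tag from a lexicon, inflection-based guesses for unknown words, then contextual rules that rewrite tags. The lexicon, suffix rules and function names are toy assumptions for illustration and are not Compendium's data or code.

```javascript
// Toy lexicon: word -> most likely tag. Entries are illustrative.
const LEXICON = { the: 'DT', my: 'PRP$', name: 'NN', is: 'VBZ', to: 'TO', run: 'NN', want: 'VBP' };

// Inflection-based guesses for unknown words, tried in order.
const SUFFIX_RULES = [
  { suffix: 'ing', tag: 'VBG' },
  { suffix: 'ly', tag: 'RB' },
  { suffix: 'ed', tag: 'VBD' },
  { suffix: 's', tag: 'NNS' }
];

// Contextual rules in Brill's style:
// "change tag `from` to `to` when the previous tag is `prev`".
const CONTEXT_RULES = [
  { from: 'NN', to: 'VB', prev: 'TO' },
  { from: 'NN', to: 'VB', prev: 'MD' }
];

function initialTag(word) {
  const lower = word.toLowerCase();
  if (Object.prototype.hasOwnProperty.call(LEXICON, lower)) return LEXICON[lower];
  if (/^[A-Z]/.test(word)) return 'NNP';           // capitalized unknown word
  if (/^\d+([.,]\d+)?$/.test(word)) return 'CD';   // number
  for (const rule of SUFFIX_RULES) {
    // Require a stem of at least 3 characters before trusting a suffix.
    if (lower.length > rule.suffix.length + 2 && lower.endsWith(rule.suffix)) return rule.tag;
  }
  return 'NN';                                     // default guess
}

function tag(words) {
  const tags = words.map(initialTag);
  for (let i = 1; i < tags.length; i++) {
    for (const rule of CONTEXT_RULES) {
      if (tags[i] === rule.from && tags[i - 1] === rule.prev) tags[i] = rule.to;
    }
  }
  return tags;
}

console.log(tag(['the', 'dogs', 'want', 'to', 'run', 'quickly']));
```

Here "run" starts as NN from the lexicon and is rewritten to VB because it follows a TO tag, which is the kind of correction contextual rules are meant to provide.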

Dependency parsing

Warning: the following process has proved hard to extend, and isn't powerful enough given the amount of code it already requires. It's being replaced in v1.0 by another one currently in development [September 5th, 2015].

The dependency parsing module is still experimental and requires a lot of additional rules, but it is promising.

Inspired to some extent by Syntex from Didier Bourigault (reference in French).

Parsing is constraint-based. The constraints are:

The governor is the head of the sentence (it doesn't have a master)
When possible, the governor is the first conjugated verb of the sentence
All other tokens must have a master
A token can have one and only one master
A master can have one or many dependencies
If no master is found for a token, then its master is the governor

Parsing is done through several passes:

The first pass defines direct dependencies from left to right
The second pass defines direct dependencies from right to left
The third pass consolidates linked indirect dependencies using existing masters
The final pass consolidates unlinked indirect dependencies
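
The governor constraints listed above can be sketched as a final fix-up over a partially resolved list of masters. This is an illustration only: the `applyGovernor` function, the set of tags treated as conjugated verbs, and the fallback to the first token when no verb is found are all assumptions.

```javascript
// Tags treated as "conjugated verb" in this sketch (an assumption).
const CONJUGATED = ['VBZ', 'VBP', 'VBD'];

// `masters[i]` is the index of token i's master, or null when the earlier
// passes found none. Returns the governor index and a completed copy.
function applyGovernor(tags, masters) {
  let governor = tags.findIndex(function (t) { return CONJUGATED.indexOf(t) !== -1; });
  if (governor === -1) governor = 0; // no conjugated verb: fall back to the first token (assumption)
  const completed = masters.map(function (master, i) {
    if (i === governor) return null;            // the governor has no master
    return master === null ? governor : master; // orphans attach to the governor
  });
  return { governor: governor, masters: completed };
}

const sentenceTags = ['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.'];
// Hand-written partial result: My -> name -> is, Dr. -> Jekyll, the rest unresolved.
const partialMasters = [1, 2, null, 4, null, null];
console.log(applyGovernor(sentenceTags, partialMasters));
```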

Detectors

From this point on, detectors handle further analysis of the text. They are in charge of adding metadata to the analysis, such as the sentiment score and label.

These detectors can work at three different levels:

the token level
the sentence level
the text (global) level

Token level detectors add attributes to each token (sentiment and emphasis scores, normalized token...).

Sentence level detectors work across many tokens (negation detection, entity recognition, sentiment analysis...).

Global level detectors (there are none yet) are supposed to provide a global analysis of the whole text: topics, global sentiment labelling...
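
To illustrate how a sentence level detector can build on token level attributes, the sketch below aggregates the per-token profile fields shown in the API section (sentiment, emphasis, negated) into a sentence score and label. The formula is an assumption made for illustration and is not Compendium's scoring method; `scoreSentence` is a hypothetical name.

```javascript
// Hypothetical sentence level detector: sums token scores, weighting each
// by its emphasis and flipping its sign when the token is negated.
function scoreSentence(tokens) {
  const score = tokens.reduce(function (sum, token) {
    const p = token.profile;
    return sum + (p.negated ? -1 : 1) * p.sentiment * p.emphasis;
  }, 0);
  let label = 'neutral';
  if (score > 0) label = 'positive';
  else if (score < 0) label = 'negative';
  return { sentiment: score, label: label };
}

// Hand-written token profiles, as a token level detector might have set them.
const exampleTokens = [
  { raw: 'not', profile: { sentiment: 0, emphasis: 1, negated: false } },
  { raw: 'good', profile: { sentiment: 1, emphasis: 1, negated: true } }
];
console.log(scoreSentence(exampleTokens));
```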

Lexicons

The full lexicon for Node.js is based on the lexicon from Mark Watson's FastTag (around 90,000 terms, itself imported from the Penn Treebank).

The minimal lexicon for the browser contains only a few thousand terms extracted from the full lexicon, and filtered using:

the list of the 10,000 most common English words, an extract from Google's Trillion Word Corpus
the list of scored sentiments words
Compendium suffixes detector
Compendium embedded knowledge
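
One way to read the filtering above is sketched below: keep a term only if it is on a keep-list (common or sentiment-bearing words) and the rule-based guess would not already produce its tag. This interpretation, the `buildMinimalLexicon` function and the toy data are assumptions, not the project's actual build script.

```javascript
// Hypothetical filter: `full` maps word -> tag, `keepWords` lists the common and
// sentiment words, `guessTag` stands in for the suffix detector.
function buildMinimalLexicon(full, keepWords, guessTag) {
  const keep = new Set(keepWords);
  const minimal = {};
  Object.keys(full).forEach(function (word) {
    if (!keep.has(word)) return;                // neither common nor sentiment-bearing
    if (guessTag(word) === full[word]) return;  // the rules already get it right
    minimal[word] = full[word];
  });
  return minimal;
}

// Toy data for illustration.
const fullLexicon = { quickly: 'RB', dog: 'NN', run: 'VB', zymurgy: 'NN' };
const commonWords = ['quickly', 'dog', 'run'];
function toyGuess(word) { return word.endsWith('ly') ? 'RB' : 'NN'; }

console.log(buildMinimalLexicon(fullLexicon, commonWords, toyGuess));
```

In this toy run only "run" survives: "zymurgy" is not on the keep-list, and the tags of "quickly" and "dog" are already guessed correctly by the stand-in rules.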

Part-of-Speech tags definition

Here is the list of Part-of-Speech tags used by Compendium. Newly introduced tags are listed at the bottom.

, Comma                     ,
: Mid-sent punct.           : ;
. Sent-final punct          . ! ?
" quote                     "
( Left paren                (
) Right paren               )
# Pound sign                #
CC Coord Conjuncn           and,but,or
CD Cardinal number          one,two,1,2
DT Determiner               the,some
EX Existential there        there
FW Foreign Word             mon dieu
IN Preposition              of,in,by
JJ Adjective                big
JJR Adj., comparative       bigger
JJS Adj., superlative       biggest
LS List item marker         1,One
MD Modal                    can,should
NN Noun, sing. or mass      dog
NNP Proper noun, sing.      Edinburgh
NNPS Proper noun, plural    Smiths
NNS Noun, plural            dogs
PDT Predeterminer           all, both
POS Possessive ending       's
PRP Personal pronoun        I,you,she
PRP$ Possessive pronoun     my,one's
RB Adverb                   quickly, not
RBR Adverb, comparative     faster
RBS Adverb, superlative     fastest
RP Particle                 up,off
SYM Symbol                  +,%,&
TO 'to'                     to
UH Interjection             oh, oops
VB verb, base form          eat
VBD verb, past tense        ate
VBG verb, gerund            eating
VBN verb, past part         eaten
VBP Verb, present           eat
VBZ Verb, present           eats
WDT Wh-determiner           which,that
WP Wh pronoun               who,what
WP$ Possessive-Wh           whose
WRB Wh-adverb               how,where

Compendium also includes the following new tag:

EM Emoticon                 :) :( :/
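
Because related tags share a prefix (NN, NNS, NNP, NNPS all start with NN), they are easy to group into coarse categories. The sketch below counts categories in a tags array like the one returned by analyse. The helper names are hypothetical.

```javascript
// Maps a fine-grained tag from the table above to a coarse category.
function coarseCategory(tag) {
  if (tag.indexOf('NN') === 0) return 'noun';
  if (tag.indexOf('VB') === 0) return 'verb';
  if (tag.indexOf('JJ') === 0) return 'adjective';
  if (tag.indexOf('RB') === 0) return 'adverb';
  if (tag === 'EM') return 'emoticon';
  return 'other';
}

// Counts how many tags fall into each coarse category.
function countCategories(tags) {
  return tags.reduce(function (counts, tag) {
    const category = coarseCategory(tag);
    counts[category] = (counts[category] || 0) + 1;
    return counts;
  }, {});
}

console.log(countCategories(['PRP$', 'NN', 'VBZ', 'NNP', 'NNP', '.']));
```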

Development

Go to the wiki to get more details about the project.

License

The MIT License (MIT)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
