text-corpus
v0.0.2
Some classes to represent elements in a text corpus.
text-corpus
Some classes to represent elements in a text corpus. Currently, these are mainly used by cetem-publico, tnt-tagger and other modules, but hopefully they will be generic enough to be useful in other contexts as well.
Installation
$ npm install text-corpus
Classes
Token
Used to represent the tokens (words) in the corpus.
new Token(word, info)
- `word`: the word as it appears in the original corpus text
- `info` (all of these are optional):
  - `tokenId`: an ID for this token
  - `lemma`: the lemmatized version of `word`
  - `pos`: the part-of-speech (POS) tag for `word`
  - *other*: more information about the token
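A minimal sketch of how a token might be created. The `Token` class below is a hypothetical stand-in that only mirrors the constructor signature described above, so the snippet runs without the package installed. The property names it stores (`word`, `info`) and the example tag value are assumptions, not the module's documented API.

```javascript
// Hypothetical stand-in for the package's Token class. How the real class
// stores its arguments is an assumption made for illustration only.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = info;
  }
}

// Every field of `info` is optional, so both calls are valid in this sketch.
const bare = new Token('casas');
const tagged = new Token('casas', { tokenId: 't1', lemma: 'casa', pos: 'N' });

console.log(bare.word, bare.info);
console.log(tagged.word, tagged.info.lemma, tagged.info.pos);
```

With the real module, the stand-in class would presumably be replaced by whatever `text-corpus` exports; check the package source for the exact export shape.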
MultiWordExpression
This class provides a way to group some tokens into multi-word expressions.
MWEs can have attributes indicating the lemma and the POS tag for the whole expression.
new MultiWordExpression({lemma, pos}, tokens)
- `lemma`: the lemma for the multi-word expression
- `pos`: the POS tag for the multi-word expression
- `tokens`: an array of Token objects which make up this MWE
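As a rough illustration, the sketch below groups two tokens into one multi-word expression. Both classes are hypothetical stand-ins that copy the constructor signatures from this README; the stored property names (`lemma`, `pos`, `tokens`, `word`) and the tag value are assumptions.

```javascript
// Hypothetical stand-ins mirroring the documented constructor signatures.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = info;
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;   // lemma for the whole expression
    this.pos = pos;       // POS tag for the whole expression
    this.tokens = tokens; // the Token objects that make up the MWE
  }
}

// Two tokens that should be treated as a single unit.
const mwe = new MultiWordExpression(
  { lemma: 'Nova Iorque', pos: 'PROP' },
  [new Token('Nova'), new Token('Iorque')]
);

// Rebuild the surface form from the member tokens.
const surface = mwe.tokens.map((t) => t.word).join(' ');
console.log(surface, mwe.lemma, mwe.pos);
```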
Sentence
Sentences contain a list of tokens (the words in that sentence).
Because some words can form multi-word expressions, inside a Sentence we can find both Tokens and MultiWordExpressions (which, in turn, have Token objects inside).
new Sentence(id, tokens)
- `id`: an id for the sentence
- `tokens`: an array of tokens and MWEs which form this sentence
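Since a sentence can mix plain tokens with MWEs, code that wants a flat word list has to look inside the MWEs. The sketch below shows one way to do that. The three classes are hypothetical stand-ins following the constructor signatures in this README, and `flattenTokens` is a helper invented for this example, not something the package is known to provide.

```javascript
// Hypothetical stand-ins mirroring the documented constructor signatures.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = info;
  }
}

class MultiWordExpression {
  constructor({ lemma, pos } = {}, tokens = []) {
    this.lemma = lemma;
    this.pos = pos;
    this.tokens = tokens;
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens; // mix of Token and MultiWordExpression objects
  }
}

// Hypothetical helper: replace each MWE by its member tokens.
function flattenTokens(sentence) {
  return sentence.tokens.flatMap((item) =>
    item instanceof MultiWordExpression ? item.tokens : [item]
  );
}

const sentence = new Sentence('s1', [
  new Token('Vivo'),
  new Token('em'),
  new MultiWordExpression({ lemma: 'Nova Iorque', pos: 'PROP' }, [
    new Token('Nova'),
    new Token('Iorque'),
  ]),
]);

const words = flattenTokens(sentence).map((t) => t.word);
console.log(sentence.tokens.length, words.join(' '));
```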
Paragraph
Paragraphs are composed of a sequence of sentences.
new Paragraph(id, sentences)
- `id`: an id for the paragraph
- `sentences`: an array of sentences which form this paragraph
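To show how the pieces nest, here is a small sketch that builds a paragraph out of two sentences and walks it. As before, the classes are hypothetical stand-ins based only on the constructor signatures in this README, and the property names (`id`, `sentences`, `tokens`, `word`) are assumptions.

```javascript
// Hypothetical stand-ins mirroring the documented constructor signatures.
class Token {
  constructor(word, info = {}) {
    this.word = word;
    this.info = info;
  }
}

class Sentence {
  constructor(id, tokens = []) {
    this.id = id;
    this.tokens = tokens;
  }
}

class Paragraph {
  constructor(id, sentences = []) {
    this.id = id;
    this.sentences = sentences;
  }
}

const paragraph = new Paragraph('p1', [
  new Sentence('s1', [new Token('Bom'), new Token('dia'), new Token('.')]),
  new Sentence('s2', [new Token('Obrigado'), new Token('.')]),
]);

// Walk paragraph -> sentence -> token and print each sentence's words.
for (const s of paragraph.sentences) {
  console.log(s.id, s.tokens.map((t) => t.word).join(' '));
}

// Count the tokens across all sentences of the paragraph.
const totalTokens = paragraph.sentences.reduce(
  (count, s) => count + s.tokens.length,
  0
);
console.log(totalTokens);
```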
Bugs and stuff
Open a GitHub issue or, preferably, send me a pull request.
License
MIT