ezs-teeftfr
v3.1.2
teeftfr statements for EZS
Teeft adapted to the French language.
This package cannot be used alone: ezs has to be installed.
Usage
import ezs from 'ezs';
import teeftfr from 'ezs-teeftfr';
ezs.use(teeftfr);
process.stdin
.pipe(ezs('STATEMENT_NAME', { STATEMENT_PARAMETERS }))
.pipe(process.stdout);
Flow
The sequence of statements is determined by the structure of the expected input.
[ "/path/to/a/directory/of/documents" ] ->
[ListFiles]
pattern = *.txt
--> [ "/path1", "path2", ... ] -->
[GetFilesContent]
--> [ { path, content }, ... ] -->
[TEEFTSentenceTokenize]
--> [ { path, sentences: [ "sentence", ... ] }, ... ] -->
[TEEFTTokenize]
--> [ { path, sentences: [ ["token", ... ], ...] }, ... ] -->
[TEEFTNaturalTag]
--> [ { path, sentences: [ [
{
token: "token",
tag: [ "tag", ...]
}, ...
], ... ] }, ... ]
[TEEFTExtractTerms]
nounTag = NOM
adjTag = ADJ
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length
},
{
term: "multiterm",
frequency,
length
}, ...
] }, ... ]
[TEEFTFilterTags]
tags = NOM
tags = ADJ
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length
},
{
term: "multiterm",
frequency,
length
}, ...
] }, ... ]
[TEEFTStopWords]
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length
},
{
term: "multiterm",
frequency,
length
}, ...
] }, ... ]
[TEEFTSumUpFrequencies]
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length
},
{
term: "multiterm",
frequency,
length
}, ...
] }, ... ]
[TEEFTSpecificity]
sort = true
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length,
specificity,
},
{
term: "multiterm",
frequency,
length,
specificity
}, ...
] }, ... ]
[TEEFTFilterMonoFreq]
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length,
specificity,
},
{
term: "multiterm",
frequency,
length,
specificity
}, ...
] }, ... ]
[TEEFTFilterMultiSpec]
--> [ { path, terms: [
{
term: "monoterm",
tag: [ "tag", ...],
frequency,
length,
specificity,
},
{
term: "multiterm",
frequency,
length,
specificity
}, ...
] }, ... ]
[JSONString]
wrap = true
indent = true
Example
To use the example examples/teeftfr.ezs, you have to:
- install ezs
- install ezs-teeftfr
- install ezs-basics
- run the script
That is to say:
npm i ezs
npm i ezs-teeftfr
npm i ezs-basics
echo examples/data/fr-articles | npx ezs ./examples/teeftfr.ezs
You can even use jq to beautify the JSON in the output.
Statements
Table of Contents
- TEEFTExtractTerms
- TEEFTFilterMonoFreq
- TEEFTFilterMultiSpec
- TEEFTFilterTags
- GetFilesContent
- ListFiles
- natural-tag
- profile
- TEEFTSentenceTokenize
- TEEFTSpecificity
- TEEFTStopWords
- TEEFTSumUpFrequencies
- ToLowerCase
- TEEFTTokenize
TEEFTExtractTerms
Take an array of objects { path, sentences: [token, tag: ["tag"]] }. Regroup multi-terms when possible (noun + noun, adjective + noun, etc.), and compute statistics (frequency, etc.).
Parameters
- data (Stream): array of documents containing sentences of tagged tokens
- feed (Array<Objects>): same as data, with term replacing token, length, and frequency
- nounTag (string): noun tag (optional, default 'NOM')
- adjTag (string): adjective tag (optional, default 'ADJ')
Examples
[{
path: '/path/1',
sentences:
[[
{ token: 'elle', tag: ['PRO:per'] },
{ token: 'semble', tag: ['VER'] },
{ token: 'se', tag: ['PRO:per'] },
{ token: 'nourrir', tag: ['VER'] },
{
token: 'essentiellement',
tag: ['ADV'],
},
{ token: 'de', tag: ['PRE', 'ART:def'] },
{ token: 'plancton', tag: ['NOM'] },
{ token: 'frais', tag: ['ADJ'] },
{ token: 'et', tag: ['CON'] },
{ token: 'de', tag: ['PRE', 'ART:def'] },
{ token: 'hotdog', tag: ['UNK'] }
]]
}]
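The regrouping step can be pictured with a minimal sketch. It is not the package's implementation: it assumes that a multiterm is any run of two or more consecutive tokens tagged as noun or adjective, and the helper name extractTerms is hypothetical.

```javascript
// Hypothetical helper, not part of ezs-teeftfr: it only illustrates the idea.
// Assumption: a multiterm is a run of 2+ consecutive noun/adjective tokens.
function extractTerms(sentences, nounTag = 'NOM', adjTag = 'ADJ') {
    const terms = new Map();
    const count = (term, length, tag) => {
        const known = terms.get(term);
        if (known) {
            known.frequency += 1;
            return;
        }
        // Monoterms keep their tags, multiterms have none.
        const entry = tag ? { term, tag } : { term };
        terms.set(term, { ...entry, frequency: 1, length });
    };
    const isNounOrAdj = ({ tag }) => tag.includes(nounTag) || tag.includes(adjTag);

    sentences.forEach((sentence) => {
        let run = [];
        const flush = () => {
            if (run.length > 1) {
                count(run.map((t) => t.token).join(' '), run.length);
            }
            run = [];
        };
        sentence.forEach((tagged) => {
            count(tagged.token, 1, tagged.tag); // every token is a monoterm
            if (isNounOrAdj(tagged)) {
                run.push(tagged);
            } else {
                flush();
            }
        });
        flush(); // a run may end with the sentence
    });
    return [...terms.values()];
}

const sentences = [[
    { token: 'de', tag: ['PRE', 'ART:def'] },
    { token: 'plancton', tag: ['NOM'] },
    { token: 'frais', tag: ['ADJ'] },
    { token: 'et', tag: ['CON'] },
]];
console.log(extractTerms(sentences));
```

With this input, the sketch yields the four monoterms plus the multiterm "plancton frais", with a length of 2 and a frequency of 1.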
TEEFTFilterMonoFreq
Filter the data, keeping only multiterms and frequent monoterms.
Parameters
- data (Stream)
- feed (Array<Object>)
- multiLimit (Number): threshold for being a multiterm (in number of tokens) (optional, default 2)
- minFrequency (Number): minimal frequency to be taken as a frequent term (optional, default 7)
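As a rough illustration of how the two parameters interact, here is a minimal sketch of the filtering rule. The function name filterMonoFreq is hypothetical, and the sketch assumes both thresholds are inclusive, which the description above does not state.

```javascript
// Hypothetical sketch of the filtering rule, not the package's code.
// Assumption: both thresholds are inclusive (>=).
function filterMonoFreq(terms, multiLimit = 2, minFrequency = 7) {
    return terms.filter((term) => term.length >= multiLimit
        || term.frequency >= minFrequency);
}

const candidateTerms = [
    { term: 'plancton', frequency: 9, length: 1 },
    { term: 'hotdog', frequency: 1, length: 1 },
    { term: 'plancton frais', frequency: 2, length: 2 },
];
console.log(filterMonoFreq(candidateTerms).map((t) => t.term));
// [ 'plancton', 'plancton frais' ]
```

The rare monoterm "hotdog" is dropped, while the multiterm is kept whatever its frequency.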
TEEFTFilterMultiSpec
Filter multiterms to keep only those whose specificity is higher than the multiterms' average specificity.
Parameters
- data (any)
- feed (any)
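The rule can be sketched as follows. This is an illustration under stated assumptions, not the package's code: a multiterm is taken to be a term with a length above 1, monoterms are assumed to pass through untouched, and "higher than" is read as a strict comparison. The name filterMultiSpec is hypothetical.

```javascript
// Hypothetical sketch, not the package's code.
// Assumptions: a multiterm has length > 1, monoterms pass through untouched,
// and "higher than" is a strict comparison.
function filterMultiSpec(terms) {
    const multiterms = terms.filter((t) => t.length > 1);
    if (multiterms.length === 0) {
        return terms; // nothing to average
    }
    const total = multiterms.reduce((sum, t) => sum + t.specificity, 0);
    const average = total / multiterms.length;
    return terms.filter((t) => t.length <= 1 || t.specificity > average);
}

const scoredTerms = [
    { term: 'plancton', frequency: 9, length: 1, specificity: 0.1 },
    { term: 'plancton frais', frequency: 2, length: 2, specificity: 0.75 },
    { term: 'cadre du programme', frequency: 1, length: 3, specificity: 0.25 },
];
console.log(filterMultiSpec(scoredTerms).map((t) => t.term));
// [ 'plancton', 'plancton frais' ]
```

Here the average specificity of the two multiterms is 0.5, so only "plancton frais" survives among them.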
TEEFTFilterTags
Filter the input text, keeping only adjectives and nouns.
Parameters
GetFilesContent
Take an array of file paths as input, and return a list of objects containing the path and the content of each file.
ListFiles
Take an array of directory paths and a pattern as input, and return a list of the file paths matching the pattern in those directories.
Parameters
- pattern (String): pattern for files (ex: "*.txt") (optional, default "*")
natural-tag
POS Tagger from natural
French POS tagging using natural (and LEFFF resources)
Take an array of documents (objects: { path, sentences: [[]] })
Yield an array of documents (objects: { path, sentences: [ [{ token: "token", tag: [ "tag", ... ] }, ...] ] })
Examples
[{
path: "/path/1",
sentences: [[
{ "token": "dans", "tag": ["prep"] },
{ "token": "le", "tag": ["det"] },
{ "token": "cadre", "tag": ["nc"] },
{ "token": "du", "tag": ["det"] },
{ "token": "programme", "tag": ["nc"] }
]]
}]
profile
Profile the time a statement takes to execute.
You have to place one to initialize, and a second to display the time it takes.
Parameters
- data (any)
- feed (any)
TEEFTSentenceTokenize
Segment the content of each document (objects { path, content }) into sentences.
Yield an array of documents (objects { path, sentences: [] }).
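To make the input and output shapes concrete, here is a deliberately naive sketch. It does not reproduce the statement's actual segmentation rules, which are not described here: it simply assumes a sentence ends at ".", "!" or "?" followed by whitespace. The function name sentenceTokenize is hypothetical.

```javascript
// Naive, hypothetical sketch: only the { path, content } -> { path, sentences }
// shape comes from the description; the splitting rule is an assumption.
function sentenceTokenize(documents) {
    return documents.map(({ path, content }) => ({
        path,
        sentences: content
            .split(/(?<=[.!?])\s+/) // split on whitespace that follows . ! or ?
            .map((sentence) => sentence.trim())
            .filter((sentence) => sentence.length > 0),
    }));
}

const docs = [{
    path: '/path/1',
    content: 'Elle semble se nourrir de plancton. Et de hotdog ?',
}];
console.log(sentenceTokenize(docs));
```

The French spacing before "?" is left inside the second sentence, because the split only happens on whitespace that comes after the punctuation mark.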
TEEFTSpecificity
Take documents (with a path and an array of terms, each term being an object { term, frequency, length[, tag] }).
Process objects containing frequency, add a specificity to each object, and remove all objects with a specificity below average specificity (except when filter is false).
Can also sort the objects according to their specificity, when sort is true.
Parameters
- data (any)
- feed (any)
- weightedDictionary (string): name of the weighted dictionary (optional, default "Ress_Frantext")
- filter (Boolean): filter below average specificity (optional, default true)
- sort (Boolean): sort objects according to their specificity (optional, default false)
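The effect of the filter and sort parameters can be sketched on terms whose specificity is already known. How the specificity itself is derived from the weighted dictionary is not described here, so the sketch leaves it out. The name filterAndSort is hypothetical, and sorting by decreasing specificity is an assumption.

```javascript
// Hypothetical sketch: specificity values are taken as already computed.
// Assumption: sorting is by decreasing specificity.
function filterAndSort(terms, { filter = true, sort = false } = {}) {
    const total = terms.reduce((sum, t) => sum + t.specificity, 0);
    const average = total / terms.length;
    // Terms below the average are removed, unless filter is false.
    const kept = filter
        ? terms.filter((t) => t.specificity >= average)
        : [...terms];
    return sort
        ? kept.sort((a, b) => b.specificity - a.specificity)
        : kept;
}

const weightedTerms = [
    { term: 'a', frequency: 1, length: 1, specificity: 0.25 },
    { term: 'b', frequency: 1, length: 1, specificity: 0.5 },
    { term: 'c', frequency: 1, length: 1, specificity: 0.75 },
];
console.log(filterAndSort(weightedTerms, { sort: true }).map((t) => t.term));
// [ 'c', 'b' ]
```

The average is 0.5 here, so "a" is removed, and the remaining terms come out most specific first.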
TEEFTStopWords
Filter the input text by removing stopwords from the tokens.
Parameters
- data (Stream)
- feed (Array<Object>)
- stopwords (string): name of the stopwords file to use (optional, default 'StopwFrench')
TEEFTSumUpFrequencies
Sums up the frequencies of identical lemmas from different chunks.
Parameters
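A minimal sketch of the summing step, assuming that two entries count as identical when their term strings are equal (the description does not say how lemmas are compared). The name sumUpFrequencies is hypothetical.

```javascript
// Hypothetical sketch, not the package's code.
// Assumption: two entries are "identical" when their term strings are equal.
function sumUpFrequencies(terms) {
    const byTerm = new Map();
    terms.forEach((entry) => {
        const known = byTerm.get(entry.term);
        if (known) {
            known.frequency += entry.frequency;
        } else {
            byTerm.set(entry.term, { ...entry }); // copy, so the input stays untouched
        }
    });
    return [...byTerm.values()];
}

const chunkTerms = [
    { term: 'plancton', frequency: 2, length: 1 },
    { term: 'hotdog', frequency: 1, length: 1 },
    { term: 'plancton', frequency: 3, length: 1 },
];
console.log(sumUpFrequencies(chunkTerms));
```

The two "plancton" entries collapse into one with a frequency of 5, and "hotdog" is unchanged.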
ToLowerCase
Transform strings to lower case.
Parameters
TEEFTTokenize
Extract tokens from an array of documents (objects { path, sentences: [] }).
Yields an array of documents (objects: { path, sentences: [[]] })
Warning: results are surprising on uppercase sentences