fr-compromise
v0.2.8
Modest computational linguistics
fr-compromise is a port of compromise to French.
The goal of this project is to provide a small, basic, rule-based POS tagger.
import tal from 'fr-compromise'
let doc = tal(`Je m'baladais sur l'avenue le cœur ouvert à l'inconnu`)
doc.match('#Noun').out('array')
// [ 'je', 'avenue', 'cœur', 'inconnu' ]
or on the client side:
<script src="https://unpkg.com/fr-compromise"></script>
<script>
let txt = `J'avais envie de dire bonjour à n'importe qui`
let doc = frCompromise(txt) // global namespace
console.log(doc.sentences(1).json())
// { text: "J'avais...", terms: [ ... ] }
</script>
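To make "rule-based" concrete, here is a minimal sketch of the general idea: look each word up in a small lexicon, then fall back to suffix rules. This is not fr-compromise's implementation; the lexicon, the rules, and the names `tagWord` and `tagSentence` are all made up for illustration.

```javascript
// Toy rule-based tagger. NOT how fr-compromise works internally,
// just an illustration of "lexicon first, suffix rules second".
const LEXICON = new Map([
  ['le', 'Determiner'], ['la', 'Determiner'], ['les', 'Determiner'],
  ['je', 'Pronoun'], ['nous', 'Pronoun'],
  ['sur', 'Preposition'], ['à', 'Preposition'],
])

// Hypothetical suffix rules, checked in order.
const SUFFIX_RULES = [
  [/ons$/, 'Verb'],     // e.g. "jetons"
  [/(er|ir)$/, 'Verb'], // infinitives such as "jeter", "finir"
  [/ment$/, 'Adverb'],  // e.g. "lentement"
]

function tagWord(word) {
  const w = word.toLowerCase()
  if (LEXICON.has(w)) return LEXICON.get(w)
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(w)) return tag
  }
  return 'Noun' // default guess when no rule fires
}

function tagSentence(text) {
  return text.split(/\s+/).filter(Boolean).map((w) => [w, tagWord(w)])
}

console.log(tagSentence('Nous jetons les chaussures'))
// [ ['Nous','Pronoun'], ['jetons','Verb'], ['les','Determiner'], ['chaussures','Noun'] ]
```

A real tagger needs a far larger lexicon and rules that look at neighbouring words; the sketch only shows the shape of the approach.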
API
fr-compromise includes all the methods of compromise/one:
Output
- .text() - return the document as text
- .json() - return the document as data
- .debug() - pretty-print the interpreted document
- .out() - a named or custom output
- .html({}) - output custom html tags for matches
- .wrap({}) - produce custom output for document matches
Utils
- .found [getter] - is this document empty?
- .docs [getter] - get term objects as json
- .length [getter] - count the # of matches in the document
- .isView [getter] - identify a compromise object
- .compute() - run a named analysis on the document
- .clone() - deep-copy the document, so that no references remain
- .termList() - return a flat list of all Term objects in match
- .cache({}) - freeze the current state of the document, for speed-purposes
- .uncache() - un-freezes the current state of the document, so it may be transformed
Accessors
- .all() - return the whole original document ('zoom out')
- .terms() - split-up results by each individual term
- .first(n) - use only the first result(s)
- .last(n) - use only the last result(s)
- .slice(n,n) - grab a subset of the results
- .eq(n) - use only the nth result
- .firstTerms() - get the first word in each match
- .lastTerms() - get the end word in each match
- .fullSentences() - get the whole sentence for each match
- .groups() - grab any named capture-groups from a match
- .wordCount() - count the # of terms in the document
- .confidence() - an average score for pos tag interpretations
Match
(match methods use the match-syntax.)
- .match('') - return a new Doc, with this one as a parent
- .not('') - return all results except for this
- .matchOne('') - return only the first match
- .if('') - return each current phrase, only if it contains this match ('only')
- .ifNo('') - Filter-out any current phrases that have this match ('notIf')
- .has('') - Return a boolean if this match exists
- .before('') - return all terms before a match, in each phrase
- .after('') - return all terms after a match, in each phrase
- .union() - return combined matches without duplicates
- .intersection() - return only duplicate matches
- .complement() - get everything not in another match
- .settle() - remove overlaps from matches
- .growRight('') - add any matching terms immediately after each match
- .growLeft('') - add any matching terms immediately before each match
- .grow('') - add any matching terms before or after each match
- .sweep(net) - apply a series of match objects to the document
- .splitOn('') - return a Document with three parts for every match ('splitOn')
- .splitBefore('') - partition a phrase before each matching segment
- .splitAfter('') - partition a phrase after each matching segment
- .lookup([]) - quick find for an array of string matches
- .autoFill() - create type-ahead assumptions on the document
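As a rough sketch of how a few of the match methods above combine, the helper below uses only `.has()`, `.match()`, `.not()` and `.out('array')`. The name `summarizeNouns` is hypothetical, and the `nlp` function is passed in as a parameter so the snippet makes no assumption about how the library was loaded.

```javascript
// Hypothetical helper: `nlp` is whatever fr-compromise exports,
// e.g. the default import in Node or `frCompromise` in the browser.
function summarizeNouns(nlp, text) {
  const doc = nlp(text)
  return {
    hasNoun: doc.has('#Noun'),              // boolean: is there any noun?
    nouns: doc.match('#Noun').out('array'), // the matching terms
    rest: doc.not('#Noun').out('array'),    // everything except the nouns
  }
}
```

Usage would look like `summarizeNouns(nlp, "Je m'baladais sur l'avenue")`; the exact arrays returned depend on how the tagger interprets the sentence.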
Tag
- .tag('') - Give all terms the given tag
- .tagSafe('') - Only apply tag to terms if it is consistent with current tags
- .unTag('') - Remove this tag from the given terms
- .canBe('') - return only the terms that can be this tag
Case
- .toLowerCase() - turn every letter of every term to lower-case
- .toUpperCase() - turn every letter of every term to upper case
- .toTitleCase() - upper-case the first letter of each term
- .toCamelCase() - remove whitespace and title-case each term
Whitespace
- .pre('') - add this punctuation or whitespace before each match
- .post('') - add this punctuation or whitespace after each match
- .trim() - remove start and end whitespace
- .hyphenate() - connect words with hyphen, and remove whitespace
- .dehyphenate() - remove hyphens between words, and set whitespace
- .toQuotations() - add quotation marks around these matches
- .toParentheses() - add brackets around these matches
Loops
- .map(fn) - run each phrase through a function, and create a new document
- .forEach(fn) - run a function on each phrase, as an individual document
- .filter(fn) - return only the phrases that return true
- .find(fn) - return a document with only the first phrase that matches
- .some(fn) - return true or false if there is one matching phrase
- .random(fn) - sample a subset of the results
Insert
- .replace(match, replace) - search and replace match with new content
- .replaceWith(replace) - substitute-in new text
- .remove() - fully remove these terms from the document
- .insertBefore(str) - add these new terms to the front of each match (prepend)
- .insertAfter(str) - add these new terms to the end of each match (append)
- .concat() - add these new things to the end
- .swap(fromLemma, toLemma) - smart replace of root-words, using proper conjugation
Transform
- .sort('method') - re-arrange the order of the matches (in place)
- .reverse() - reverse the order of the matches, but not the words
- .normalize({}) - clean-up the text in various ways
- .unique() - remove any duplicate matches
Lib
(these methods are on the main nlp object)
- nlp.tokenize(str) - parse text without running POS-tagging
- nlp.lazy(str, match) - scan through a text with minimal analysis
- nlp.plugin({}) - mix in a compromise-plugin
- nlp.parseMatch(str) - pre-parse any match statements into json
- nlp.world() - grab or change library internals
- nlp.model() - grab all current linguistic data
- nlp.methods() - grab or change internal methods
- nlp.hooks() - see which compute methods run automatically
- nlp.verbose(mode) - log our decision-making for debugging
- nlp.version - current semver version of the library
- nlp.addWords(obj) - add new words to the lexicon
- nlp.addTags(obj) - add new tags to the tagSet
- nlp.typeahead(arr) - add words to the auto-fill dictionary
- nlp.buildTrie(arr) - compile a list of words into a fast lookup form
- nlp.buildNet(arr) - compile a list of matches into a fast match form
Numbers:
fr-compromise can parse numbers written as words or as digits:
import nlp from 'fr-compromise'
let doc = nlp(`j'ai moins quarante dollars`).debug()
doc.numbers().add(50)
doc.text()
// "j'ai dix dollars"
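The arithmetic behind that example can be sketched without the library: parse the written number, add to it, and write the result back as a word. The lookup table and the names `parseNumber` and `toWords` are illustrative assumptions that cover only a handful of simple French number words, not the parser fr-compromise actually uses.

```javascript
// Toy lookup table: a few round numbers only (no compounds like "quatre-vingt-dix").
const WORD_TO_VALUE = new Map([
  ['dix', 10], ['vingt', 20], ['trente', 30], ['quarante', 40], ['cinquante', 50],
])
const VALUE_TO_WORD = new Map([...WORD_TO_VALUE].map(([word, value]) => [value, word]))

// Parse "quarante" or "moins quarante" into a number; null if the word is unknown.
function parseNumber(phrase) {
  const words = phrase.trim().toLowerCase().split(/\s+/)
  const negative = words[0] === 'moins'
  if (negative) words.shift()
  if (words.length !== 1 || !WORD_TO_VALUE.has(words[0])) return null
  const value = WORD_TO_VALUE.get(words[0])
  return negative ? -value : value
}

// Write a number back as a French word, falling back to digits.
function toWords(n) {
  const word = VALUE_TO_WORD.get(Math.abs(n))
  if (!word) return String(n)
  return n < 0 ? `moins ${word}` : word
}

const result = parseNumber('moins quarante') + 50 // -40 + 50
console.log(toWords(result)) // "dix"
```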
Lemmatization:
it can reduce conjugated words to their root form:
import nlp from 'fr-compromise'
let doc = nlp('Nous jetons les chaussures')
doc.compute('root')
// .found is a getter, so use .has() to test a match
doc.has('{jeter} les {chaussure}')
// true
Date parsing:
with the fr-compromise-dates plugin, it can turn natural-language dates into ISO-format dates:
import nlp from 'fr-compromise'
import plg from 'fr-compromise-dates'
nlp.plugin(plg)
// reference date and timezone used to resolve the dates
let opts = { timezone: 'UTC', today: '2023-03-02' }
let doc = nlp('Je peux emprunter votre voiture entre le 2 mai et le 14 juillet')
let res = doc.dates(opts).json()[0]
/*
{
text: 'entre le 2 mai et le 14 juillet',
dates: [
{
start: '2023-05-02T00:00:00.000Z',
end: '2023-07-14T23:59:59.999Z'
}
]
}
*/
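Because `start` and `end` are ISO strings, the result can be handed straight to the built-in `Date`. The sketch below hard-codes the output shown above instead of calling the plugin, and `daysCovered` is a hypothetical helper name.

```javascript
// Shape copied from the example output above; only `start` and `end` are used.
const res = {
  text: 'entre le 2 mai et le 14 juillet',
  dates: [{ start: '2023-05-02T00:00:00.000Z', end: '2023-07-14T23:59:59.999Z' }],
}

const MS_PER_DAY = 24 * 60 * 60 * 1000

// Hypothetical helper: number of calendar days covered by one range, inclusive.
function daysCovered(range) {
  const start = new Date(range.start)
  const end = new Date(range.end)
  // `end` is the last millisecond of the final day, so round up to whole days
  return Math.ceil((end.getTime() - start.getTime()) / MS_PER_DAY)
}

console.log(daysCovered(res.dates[0])) // 74
```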
Contributing
Please join to help!
help with first PR
git clone https://github.com/nlp-compromise/fr-compromise.git
cd fr-compromise
npm install
npm test
npm run watch
See also
- benob/french-tagger - python french tagger
- opennlp-french - Java tagger w/ french model
- TreeTagger - Perl tagger w/ french model
MIT