fr-compromise

v0.2.8

Published

a year ago

Modest computational linguistics

spencermountain

fr-compromise is a port of compromise to French.

The goal of this project is to provide a small, basic, rule-based POS-tagger.
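To make "rule-based POS-tagger" concrete, here is a minimal sketch of the general idea: look a word up in a small lexicon, fall back to suffix rules, then to a default tag. It is not how fr-compromise is implemented; the word list, the suffix rules, the tag names and the helper names `tagWord` and `tagSentence` are all assumptions made up for illustration.

```javascript
// Toy rule-based tagger: lexicon lookup first, then suffix rules, then a default.
// The lexicon, suffixes and tag names below are illustrative assumptions only.
const LEXICON = new Map([
  ['je', 'Pronoun'],
  ['le', 'Determiner'],
  ['la', 'Determiner'],
  ['les', 'Determiner'],
  ['sur', 'Preposition'],
  ['à', 'Preposition'],
])

// Ordered list: the first matching suffix wins.
const SUFFIX_RULES = [
  [/ment$/, 'Adverb'],
  [/(er|ir)$/, 'Verb'],
  [/(tion|eur|age)$/, 'Noun'],
]

function tagWord(word) {
  const w = word.toLowerCase()
  if (LEXICON.has(w)) return LEXICON.get(w)
  for (const [pattern, tag] of SUFFIX_RULES) {
    if (pattern.test(w)) return tag
  }
  return 'Noun' // fallback guess when no rule applies
}

function tagSentence(text) {
  return text
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => ({ word, tag: tagWord(word) }))
}

console.log(tagSentence('parler lentement sur la plage').map((t) => t.tag))
// → [ 'Verb', 'Adverb', 'Preposition', 'Determiner', 'Noun' ]
```

A real tagger also needs context rules (a word's neighbours change its tag), which this sketch leaves out entirely.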

import tal from 'fr-compromise'

let doc = tal(`Je m'baladais sur l'avenue le cœur ouvert à l'inconnu`)
doc.match('#Noun').out('array')
// [ 'je', 'avenue', 'cœur', 'inconnu' ]

or on the client side:

<script src="https://unpkg.com/fr-compromise"></script>
<script>
  let txt = `J'avais envie de dire bonjour à n'importe qui`
  let doc = frCompromise(txt) // global namespace
  console.log(doc.sentences(1).json())
  // { text:'J'avais...', terms:[ ... ] }
</script>
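Because `.json()` returns plain data, the result can be post-processed without the library. The sketch below assumes each sentence object has a `terms` array whose entries carry `text` and a `tags` array, which is the shape compromise produces; treat those field names as an assumption for fr-compromise. `wordsWithTag` and the sample data are hypothetical.

```javascript
// Collect the words carrying a given tag from one `.json()` sentence object.
// Assumed shape: { text: string, terms: [{ text: string, tags: string[] }] }
function wordsWithTag(sentence, tag) {
  return (sentence.terms || [])
    .filter((term) => Array.isArray(term.tags) && term.tags.includes(tag))
    .map((term) => term.text)
}

// Hand-written sample standing in for real library output
const sample = {
  text: 'le cœur ouvert',
  terms: [
    { text: 'le', tags: ['Determiner'] },
    { text: 'cœur', tags: ['Noun'] },
    { text: 'ouvert', tags: ['Adjective'] },
  ],
}

console.log(wordsWithTag(sample, 'Noun'))
// → [ 'cœur' ]
```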

API

fr-compromise includes all the methods of compromise/one:

Output

.text() - return the document as text
.json() - return the document as data
.debug() - pretty-print the interpreted document
.out() - a named or custom output
.html({}) - output custom html tags for matches
.wrap({}) - produce custom output for document matches

Utils

.found [getter] - is this document empty?
.docs [getter] - get term objects as json
.length [getter] - count the # of matches in the document
.isView [getter] - identify a compromise object
.compute() - run a named analysis on the document
.clone() - deep-copy the document, so that no references remain
.termList() - return a flat list of all Term objects in match
.cache({}) - freeze the current state of the document, for speed-purposes
.uncache() - un-freezes the current state of the document, so it may be transformed

Accessors

.all() - return the whole original document ('zoom out')
.terms() - split-up results by each individual term
.first(n) - use only the first result(s)
.last(n) - use only the last result(s)
.slice(n,n) - grab a subset of the results
.eq(n) - use only the nth result
.firstTerms() - get the first word in each match
.lastTerms() - get the end word in each match
.fullSentences() - get the whole sentence for each match
.groups() - grab any named capture-groups from a match
.wordCount() - count the # of terms in the document
.confidence() - an average score for pos tag interpretations

Match

(match methods use the match-syntax.)

.match('') - return a new Doc, with this one as a parent
.not('') - return all results except for this
.matchOne('') - return only the first match
.if('') - return each current phrase, only if it contains this match ('only')
.ifNo('') - Filter-out any current phrases that have this match ('notIf')
.has('') - Return a boolean if this match exists
.before('') - return all terms before a match, in each phrase
.after('') - return all terms after a match, in each phrase
.union() - return combined matches without duplicates
.intersection() - return only duplicate matches
.complement() - get everything not in another match
.settle() - remove overlaps from matches
.growRight('') - add any matching terms immediately after each match
.growLeft('') - add any matching terms immediately before each match
.grow('') - add any matching terms before or after each match
.sweep(net) - apply a series of match objects to the document
.splitOn('') - return a Document with three parts for every match ('splitOn')
.splitBefore('') - partition a phrase before each matching segment
.splitAfter('') - partition a phrase after each matching segment
.lookup([]) - quick find for an array of string matches
.autoFill() - create type-ahead assumptions on the document
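As a rough illustration of how `.has()`, `.match()` and `.not()` relate, the helper below runs all three with the same pattern on one document. `describeMatch` is a hypothetical name, and the helper takes an already-parsed document as its argument, so it makes no assumption about how the library was imported.

```javascript
// Run one match pattern three ways on an already-parsed document.
// `doc` is expected to be a fr-compromise document, e.g. tal('...').
function describeMatch(doc, pattern) {
  return {
    found: doc.has(pattern), // boolean: does the pattern occur at all?
    matches: doc.match(pattern).out('array'), // the matching parts
    rest: doc.not(pattern).out('array'), // everything except the matches
  }
}

// Usage (requires the library):
//   import tal from 'fr-compromise'
//   describeMatch(tal(`le cœur ouvert à l'inconnu`), '#Noun')
```

Passing the document in, instead of importing the library inside the helper, keeps the function easy to test with a stand-in object.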

Case

.toLowerCase() - turn every letter of every term to lower case
.toUpperCase() - turn every letter of every term to upper case
.toTitleCase() - upper-case the first letter of each term
.toCamelCase() - remove whitespace and title-case each term

Whitespace

.pre('') - add this punctuation or whitespace before each match
.post('') - add this punctuation or whitespace after each match
.trim() - remove start and end whitespace
.hyphenate() - connect words with hyphen, and remove whitespace
.dehyphenate() - remove hyphens between words, and set whitespace
.toQuotations() - add quotation marks around these matches
.toParentheses() - add brackets around these matches

Loops

.map(fn) - run each phrase through a function, and create a new document
.forEach(fn) - run a function on each phrase, as an individual document
.filter(fn) - return only the phrases that return true
.find(fn) - return a document with only the first phrase that matches
.some(fn) - return true or false if there is one matching phrase
.random(n) - sample a subset of the results

Insert

.replace(match, replace) - search and replace match with new content
.replaceWith(replace) - substitute-in new text
.remove() - fully remove these terms from the document
.insertBefore(str) - add these new terms to the front of each match (prepend)
.insertAfter(str) - add these new terms to the end of each match (append)
.concat() - add these new things to the end
.swap(fromLemma, toLemma) - smart replace of root-words, using proper conjugation

Transform

.sort('method') - re-arrange the order of the matches (in place)
.reverse() - reverse the order of the matches, but not the words
.normalize({}) - clean-up the text in various ways
.unique() - remove any duplicate matches

Lib

(these methods are on the main nlp object, i.e. the library's default export, imported as tal in the first example)

nlp.tokenize(str) - parse text without running POS-tagging
nlp.lazy(str, match) - scan through a text with minimal analysis
nlp.plugin({}) - mix in a compromise-plugin
nlp.parseMatch(str) - pre-parse any match statements into json
nlp.world() - grab or change library internals
nlp.model() - grab all current linguistic data
nlp.methods() - grab or change internal methods
nlp.hooks() - see which compute methods run automatically
nlp.verbose(mode) - log our decision-making for debugging
nlp.version - current semver version of the library
nlp.addWords(obj) - add new words to the lexicon
nlp.addTags(obj) - add new tags to the tagSet
nlp.typeahead(arr) - add words to the auto-fill dictionary
nlp.buildTrie(arr) - compile a list of words into a fast lookup form
nlp.buildNet(arr) - compile a list of matches into a fast match form

Numbers:

fr-compromise can parse both written-out and numeric numbers:

import nlp from 'fr-compromise'
let doc = nlp(`j'ai moins quarante dollars`).debug()
doc.numbers().add(50)
doc.text()
// "j'ai dix dollars"

Lemmatization:

it can reduce conjugated or inflected words to their root form:

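import nlp from 'fr-compromise'
let doc = nlp('Nous jetons les chaussures')
doc.compute('root') // compute root forms so {lemma} patterns can match
doc.has('{jeter} les {chaussure}') // .found is a getter, so .has() does the check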
// true
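The compute-then-match pattern can be wrapped in a small helper. This is only a sketch: `matchesRoot` is a hypothetical name, and the parser function is passed in as `tagger` so the helper works with whatever name the library was imported under.

```javascript
// Check a root-form pattern such as '{jeter}' against a text.
// `tagger` is the library's default export (e.g. `import nlp from 'fr-compromise'`).
function matchesRoot(tagger, text, pattern) {
  const doc = tagger(text)
  doc.compute('root') // root forms must be computed before {lemma} patterns match
  return doc.has(pattern)
}

// Usage (requires the library):
//   matchesRoot(nlp, 'Nous jetons les chaussures', '{jeter} les {chaussure}')
```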

Date parsing:

with the fr-compromise-dates plugin, it can turn natural-language dates into ISO-format dates:

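import nlp from 'fr-compromise'
import plg from 'fr-compromise-dates'
nlp.plugin(plg)
let opts = { timezone: 'UTC', today: '2023-03-02' }

let doc = nlp('Je peux emprunter votre voiture entre le 2 mai et le 14 juillet')
// assumption: dates() takes the options object, as in compromise-dates
let res = doc.dates(opts).json()[0]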
/*
  {
    text: 'entre le 2 mai et le 14 juillet',
    dates: [
      {
        start: '2023-05-02T00:00:00.000Z',
        end: '2023-07-14T23:59:59.999Z'
      }
    ]
  }
*/
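Since the start and end values are ISO strings, they can go straight into the built-in `Date` constructor. As a small sketch, the hypothetical helper `spanInDays` below takes a result object shaped like the one above and returns the length of each range in days, rounding up so a range ending at 23:59:59.999 counts as a full day.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000

// Length in days of each { start, end } range in a dates() result object.
function spanInDays(result) {
  return result.dates.map(({ start, end }) => {
    const ms = new Date(end).getTime() - new Date(start).getTime()
    return Math.ceil(ms / MS_PER_DAY) // round up the final partial day
  })
}

// Same values as the example output above
const example = {
  text: 'entre le 2 mai et le 14 juillet',
  dates: [
    { start: '2023-05-02T00:00:00.000Z', end: '2023-07-14T23:59:59.999Z' },
  ],
}

console.log(spanInDays(example))
// → [ 74 ]
```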

Contributing

Please join to help!

help with first PR1

git clone https://github.com/nlp-compromise/fr-compromise.git
cd fr-compromise
npm install
npm test
npm run watch

See also

benob/french-tagger - Python French tagger
opennlp-french - Java tagger w/ French model
TreeTagger - Perl tagger w/ French model

MIT