npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

probabilistic-earley-parser

v0.9.6

Published

A parser for parsing Probabilistic Context Free Grammars

Downloads

9

Readme

Build Status npm version dev dependencies License

Probabilistic Earley parser

This is a library for parsing a string of tokens (like words) into parse trees that are weighted by probability. For example: you might want to know the probabilities for all derivations of an English sentence, or the most likely table of contents structure for a list of paragraphs. This library allows you to do so efficiently, as long as you can describe the rules as a Context-free Grammar (CFG).

The innovation of this library with respect to the gazillion other parsing libraries is that this one allows the poduction rules in your grammar to have a probability attached to them. This allows us to make a better choice in case of an ambiguous sentence: just select the derivation with the highest probability (this is called the Viterbi parse). If you do not need probabilities attached to your parse trees, you are probably better off using nearley instead.

For a theoretical grounding of this work, refer to Stolcke, An Efficient Probabilistic Context-Free Parsing Algorithm that Computes Prefix Probabilities.

Motivation

While libraries for nondeterministic grammars abound, I could not find an existing JavaScript implementation of the Probabilistic Earley Parser. I have made a stochastic CYK parser before, but I wanted something more top down that makes it easier to intervene in the parsing process, for instance when an unexpected token is encountered. In many cases Earley also parses faster than CYK (sparse grammars) and it doesn't require the grammar to be rewritten in any normal form.

Usage

import {getViterbiParse, Grammar} from 'probabilistic-earley-parser';
import treeify from 'treeify';

// Nonterminals are string
const S = "S"; // : NonTerminal 
const NP = "NP"; // : NonTerminal 
const VP = "VP"; // : NonTerminal 
const TV = "TV"; // : NonTerminal 
const Det = "Det"; // : NonTerminal 
const N = "N"; // : NonTerminal 
const Mod = "Mod"; // : NonTerminal 

// Terminals are functions that should return true when the parameter is of given type
const transitiveVerb = (token) => !!token.match(/(hit|chased)/); // : Terminal<string>
const the = (token) => !!token.match(/the/i);// : Terminal<string> 
const a = (token) => !!token.match(/a/i);// : Terminal<string> 
const man = (token) => !!token.match(/man/);// : Terminal<string> 
const stick = (token) => !!token.match(/stick/);// : Terminal<string> 
const with_ = (token) => !!token.match(/with/);// : Terminal<string> 

const grammar = Grammar.builder("test") //: Grammar<string,number> 
    .addNewRule(
        1.0,   // Probability between 0.0 and 1.0, defaults to 1.0. The builder takes care of converting it to the semiring element
        S,     // Left hand side of the rule
        [NP, VP] // Right hand side of the rule
    )
    // NP -> Det N (1.0)
    .addNewRule(
        1.0,
        NP,
        [Det, N] // eg. The man
    )
    // NP -> Det N Mod (1.0)
    .addNewRule(
        1.0,
        NP,
        [Det, N, Mod] // eg. The man (with a stick)
    )
    // VP -> TV NP Mod (0.4)
    .addNewRule(
        0.4,
        VP,
        [TV, NP, Mod] // eg. (chased) (the man) (with a stick)
    )
    // VP -> TV NP (0.6)
    .addNewRule(
        0.6,
        VP,
        [TV, NP] // eg. (chased) (the man with a stick)
    )
    .addNewRule(1.0, Det, [a])
    .addNewRule(1.0, Det, [the])
    .addNewRule(1.0, N, [man])
    .addNewRule(1.0, N, [stick])
    .addNewRule(1.0, TV, [transitiveVerb])
    .addNewRule(1.0, Mod, [with_, NP]) // eg. with a stick
    .build();

const tokens = ["The", "man", "chased", "the", "man", "with", "a", "stick"];
const viterbi = getViterbiParse(
    S,
    grammar,
    tokens
); // : ParseTreeWithScore<string>

console.log(viterbi.probability); // 0.6

function makeTree(o){
    if(o.children && o.children.length > 0){
        const obj = {
        };
        for(var i=0;i<o.children.length;i++){
            const name = o.children[i].token?o.children[i].token:o.children[i].category;
            obj[name] = makeTree(o.children[i]);
        }
        return obj;
    }else if(o.token) return o.token;
    else return o.category;
}

/*
0.6
└─ S
   ├─ NP
   │  ├─ Det
   │  │  └─ The
   │  └─ N
   │     └─ man
   └─ VP
      ├─ TV
      │  └─ chased
      └─ NP
         ├─ Det
         │  └─ the
         ├─ N
         │  └─ man
         └─ Mod
            ├─ with
            └─ NP
               ├─ Det
               │  └─ a
               └─ N
                  └─ stick
*/


console.log(treeify.asTree(makeTree(viterbi.parseTree)));

Some notes on implementation

Written in TypeScript, published as a commonjs module on NPM (ES6 with type declarations) and a single-file minified UMD module on Github in vulgar ES5.

This is an implementation of a probabilistic Earley parsing algorithm, which can parse any Probabilistic Context Free Grammar (PCFG) (also known as Stochastic Context Free Grammar (SCFG)), or equivalently any language described in Backus-Naur Form (BNF). In these grammars, rewrite rules may be non-deterministic and have a probability attached to them.

The probability of a parse is defined as the product of the probalities all the applied rules. Usually, we define probability as a number between 0 and 1 inclusive, and use common algebraic notions of addition and multiplication.

This code makes it possible to use any semiring for computing scores. My use for this is to avoid arithmetic underflow: imagine a computation like 0.1 * 0.1 * ... * 0.1. At some point, floating point arithmetic will be unable to represent a number so small. To counter, we use the Log semiring which holds the minus log of the probability. So that maps the numbers 0 and 1 to the numbers between infinity and zero, skewed towards lower probabilities:

Graph plot of f(x) = -log(x)

Graph for f(x) = -log x

Runtime complexity

The Earley algorithm has nice complexity properties. In particular, it can parse:

  • any CFG in O(n³),
  • unambiguous CFGs in O(n²)
  • left-recursive unambiguous grammars in O(n)

Note that this implementation does not apply innovations such as Joop Leo's improvement to run linearly on on right-recursive grammars as well. It might be complicated to implement this: making the parser stochastic is not as easy for Earley as it is for CYK.

For a faster parser that work on non-probabilistic grammars, look into nearley.

Limitations

  • I have not provisioned for ε-rules (rules with an empty right hand side)
  • Rule probability estimation may be performed using the inside-outside algorithm, but is not currently implemented
  • Higher level concepts such as wildcards, * and + are not implemented
  • Viterbi parsing (querying the most likely parse tree) only returns one single parse. In the case of an ambiguous sentence in which multiple dervation have the highest probability, the returned parse is not guaranteed the left-most parse (I think).

License

This software is licensed under a permissive MIT license.

References

Stolcke, Andreas. "An efficient probabilistic context-free parsing algorithm that computes prefix probabilities." Computational linguistics 21.2 (1995): 165-201. APA