Introduction of Logic in Language Modelling: The Minimum Perplexity Criterion

D. Bouchaffra


Logic, statistical language model, maximum likelihood estimation, word n-grams, minimum perplexity


Sparse configurations are inherent to any statistical model trained on data: sparse events are those that have not been encountered during the training phase. This problem remains a significant challenge to the scientific community, since the maximum likelihood (ML) estimator is sensitive to extreme values and is therefore unreliable on such data. To address this challenge, the author proposes a novel logic-based approach that uses the minimum perplexity criterion. In this approach, configurations are treated as probabilistic events, namely predicates related through logical connectives. The method is general and can be applied to any type of data; in this work it is used to estimate word trigram probabilities from a corpus. Experimental results on several test sets show that this logical approach using the minimum perplexity criterion is promising: it outperforms both the absolute discounting and the Good-Turing discounting techniques. It thus represents a significant contribution to language modelling.
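The perplexity criterion mentioned in the abstract can be illustrated with a short sketch. The following is a hedged illustration only, not the author's method: the toy corpus and the simple add-alpha smoothing are assumptions chosen to keep the example self-contained; the paper itself compares against absolute discounting and Good-Turing discounting, which are not reproduced here.

```python
import math
from collections import Counter

def trigram_counts(tokens):
    """Count trigrams and their bigram contexts in a token sequence."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bi = Counter(zip(tokens, tokens[1:]))
    return tri, bi

def perplexity(test_tokens, tri, bi, vocab_size, alpha=1.0):
    """Perplexity of an add-alpha-smoothed trigram model on test tokens.

    Lower perplexity means the model predicts the test data better;
    a minimum-perplexity criterion selects the probability estimates
    that minimize this quantity on held-out data.
    """
    log_prob = 0.0
    n = 0
    for w1, w2, w3 in zip(test_tokens, test_tokens[1:], test_tokens[2:]):
        # Smoothing assigns non-zero mass to unseen (sparse) trigrams.
        p = (tri[(w1, w2, w3)] + alpha) / (bi[(w1, w2)] + alpha * vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy corpus (an assumption for demonstration purposes only).
train = "the cat sat on the mat the cat sat on the rug".split()
test = "the cat sat on the mat".split()
tri, bi = trigram_counts(train)
vocab = len(set(train))
print(perplexity(test, tri, bi, vocab))
```

Without smoothing, any trigram absent from training data would receive zero probability and drive perplexity to infinity; this is precisely the sparse-data problem the abstract describes.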