The Use of Absolute Discounting in Similarity-based Techniques for Language Modeling

S. Dianati (Iran)


Statistical Language Modeling, Similarity-Based Model.


Statistical language models estimate the distribution of various natural language phenomena for speech recognition and other language technologies. Statistical natural language processing (NLP) methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and will be absent from any given corpus. In this work, we applied four similarity-based methods to a standard Persian database (Farsdat) and compared them with one another. The four methods combine two different similarity metrics with two different ways of avoiding division by zero when calculating the KL divergence. The results show that using absolute discounting together with KL divergence improved test perplexity by 47%.
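To illustrate why discounting matters for the KL divergence, the following is a minimal sketch and not the authors' implementation. It assumes a bigram model, the common interpolated form of absolute discounting with a unigram backoff, and a discount of 0.5; the function names and the toy corpus are hypothetical. Because every word in the vocabulary receives a nonzero probability, the ratio inside the KL divergence never has a zero denominator.

```python
import math
from collections import Counter, defaultdict


def train_counts(tokens):
    """Collect unigram counts and, for each word, counts of the words that follow it."""
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        bigrams[w1][w2] += 1
    return unigrams, bigrams


def discounted_dist(w1, unigrams, bigrams, discount=0.5):
    """P(w2 | w1) with interpolated absolute discounting (assumed variant).

    A fixed discount is subtracted from every seen bigram count, and the
    freed probability mass is redistributed according to unigram frequency.
    """
    followers = bigrams.get(w1, Counter())
    total = sum(followers.values())
    n_tokens = sum(unigrams.values())
    if total == 0:
        # No observed continuation: fall back to the unigram distribution.
        return {w: c / n_tokens for w, c in unigrams.items()}
    reserved = discount * len(followers) / total  # mass freed by discounting
    return {
        w: max(followers[w] - discount, 0.0) / total + reserved * c / n_tokens
        for w, c in unigrams.items()
    }


def kl_divergence(p, q):
    """D(p || q) in nats; q must be nonzero wherever p is nonzero."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0.0)


# Toy corpus (hypothetical); "c" is never followed by "b" or "c" in it.
unigrams, bigrams = train_counts("a b a c a b b c".split())
p = discounted_dist("a", unigrams, bigrams)
q = discounted_dist("c", unigrams, bigrams)
print(kl_divergence(p, q))  # finite, because q has no zero entries
```

With raw relative frequencies, the distribution after "c" would assign zero probability to "b" and "c", and the divergence from the distribution after "a" would be undefined. A similarity-based model would then use such divergences to weight the contributions of similar words.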
