Word N-Grams for Polish

B. Ziółko, D. Skurzok, and M. Ziółko (Poland)

Keywords

Polish, n-grams, speech recognition, language modelling

Abstract

A large collection of word n-gram statistics for Polish is described. Some details of the text analysis algorithm, which supports processing data on computer clusters, are also presented. Corpora with a total size of 267 030 267 words were used. The problems encountered due to the special Polish characters are described, as well as the impact of the rich morphology of Polish on this type of statistics. The most common n-grams are presented and commented on. This is the first publication of such statistics for Polish.
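
As an illustration only, word n-gram counting of the kind summarised above might be sketched in Python as follows. This is not the authors' algorithm, which the abstract does not specify. The function names (`tokenize`, `count_ngrams`, `merge_counts`), the letters-only tokenisation rule, and the NFC normalisation step are all assumptions made for this sketch. Per-chunk counts are merged at the end to suggest how the work could be split across cluster nodes.

```python
import re
import unicodedata
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters.
# In Python 3, [^\W\d_] matches letters including ą, ć, ę, ł, ń, ó, ś, ź, ż.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(text):
    """Lower-case the text and split it into words (hypothetical rule)."""
    # NFC normalisation composes base letter + combining mark into one
    # code point, so differently encoded diacritics compare equal.
    text = unicodedata.normalize("NFC", text)
    return WORD_RE.findall(text.lower())


def count_ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    # zip over n shifted copies of the token list yields each n-gram once.
    return Counter(zip(*(tokens[i:] for i in range(n))))


def merge_counts(partial_counts):
    """Sum n-gram counts computed independently on separate text chunks."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


if __name__ == "__main__":
    chunks = ["Ala ma kota.", "Ala ma psa, a kot ma Alę."]
    bigrams = merge_counts(count_ngrams(tokenize(c), 2) for c in chunks)
    for ngram, freq in bigrams.most_common(3):
        print(" ".join(ngram), freq)
```

Note that this sketch treats inflected forms such as "kot" and "kota", or "Ala" and "Alę", as distinct words, which is one way rich morphology spreads counts across many n-grams. It also ignores sentence boundaries and n-grams that span two chunks; a real system would need to decide how to handle both.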
