Investigating Esperanto's Statistical Proportions Relative to other Languages using Neural Networks and Zipf's Law

B. Manaris, L. Pellicoro, G. Pothering, and H. Hodges (USA)


Natural language processing, artificial neural networks, classification, Zipf’s law


Esperanto is a constructed natural language that was intended to be an easy-to-learn lingua franca. Zipf’s law models the statistical proportions of various phenomena in human ecology, including natural languages. Given Esperanto’s artificial origins, one wonders how “natural” it appears, relative to other natural languages, in the context of Zipf’s law. To explore this question, we collected a total of 283 books in six languages: English, French, German, Italian, Spanish, and Esperanto. We applied Zipf-based metrics to our corpus to extract, for each book, the distributions of words, word distances, word bigrams, word trigrams, and word lengths. Statistical analyses show that Esperanto’s statistical proportions are similar to those of the other languages. We then trained artificial neural networks (ANNs) to classify books according to language. The ANNs achieved high accuracy rates (86.3% to 98.6%). Subsequent analysis identified German as having the most distinctive proportions, followed by Esperanto, Italian, Spanish, English, and French. Analysis of misclassified patterns shows that Esperanto’s statistical proportions most closely resemble those of German and Spanish, and least resemble those of French and Italian.
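The abstract does not say how the Zipf-based metrics are computed. As a minimal sketch, assuming a metric is summarized by the slope and R² of a least-squares line fitted to the log-log rank-frequency distribution of words, one such measurement for a single book might look like the following. The function name `zipf_slope` and the tokenization rule are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def zipf_slope(text):
    """Return (slope, r_squared) of the log-log rank-frequency line.

    Hypothetical helper: fits log10(frequency) against log10(rank)
    by ordinary least squares. A slope near -1 is the classic Zipfian case.
    """
    # Assumed tokenization: lowercase runs of Unicode letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Frequencies sorted from most to least common give ranks 1, 2, 3, ...
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    # If every word has the same frequency, the flat line fits exactly.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, r_squared


if __name__ == "__main__":
    # Toy text whose word counts are 12, 6, 4, 3, i.e. proportional to 1/rank.
    sample = " ".join(["la"] * 12 + ["kaj"] * 6 + ["estas"] * 4 + ["domo"] * 3)
    print(zipf_slope(sample))
```

Under the same assumption, the other distributions named in the abstract (word bigrams, trigrams, lengths, distances) could be measured by swapping the counted unit, and the resulting slope and R² pairs would form a feature vector per book for a classifier.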
