Japanese Text Classification using N-Gram and the Maximum Ratio of Term Frequency among Categories

M. Suzuki (Japan)

Keywords

Text mining, automatic text categorization, Naive Bayes, N-gram

Abstract

In this paper, we treat automatic text classification as a sequence of information-processing steps and propose a new classification technique called the Maximum Frequency Ratio Accumulation Method (MFRAM). MFRAM is a simple technique that adds up the maximum ratios of term frequency among categories. A notable property of MFRAM is that feature terms can be used without limit. Exploiting this property, we propose using character N-grams and word N-grams as feature terms. We evaluate the proposed technique experimentally by classifying articles from the Japanese newspaper “CD-Mainichi 2002” with both the Naive Bayes method (the baseline) and the proposed method. The proposed method substantially outperforms the baseline, reaching a classification accuracy of 88.7%. Although the proposed method is simple, it offers a new viewpoint and high potential, and further development can be expected.
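The idea of accumulating maximum frequency ratios over character N-grams can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the code below is only one plausible reading: each term's frequency ratio in a category is assumed to be its count divided by the total term count of that category, each term is assumed to keep only its largest ratio and the category that produced it, and a document is assigned to the category with the largest accumulated sum. The function names char_ngrams, train, and classify, and the default n=2, are hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(labeled_docs, n=2):
    """Build a {term: (category, max_ratio)} table from (category, text) pairs.

    Assumption: the frequency ratio of a term in a category is its count
    divided by the total number of terms observed in that category.
    """
    counts = defaultdict(Counter)  # category -> term counts
    for category, text in labeled_docs:
        counts[category].update(char_ngrams(text, n))

    totals = {c: sum(tc.values()) for c, tc in counts.items()}

    best = {}
    for category, term_counts in counts.items():
        for term, freq in term_counts.items():
            ratio = freq / totals[category]
            # Keep only the maximum ratio among categories for each term.
            if term not in best or ratio > best[term][1]:
                best[term] = (category, ratio)
    return best


def classify(text, best, n=2):
    """Accumulate the stored maximum ratios per category and pick the largest.

    Returns None when no term of `text` was seen during training.
    """
    scores = Counter()
    for term in char_ngrams(text, n):
        if term in best:
            category, ratio = best[term]
            scores[category] += ratio
    return scores.most_common(1)[0][0] if scores else None
```

In this sketch no feature selection is applied: every N-gram seen in training is kept, which mirrors the abstract's remark that feature terms can be used without limit. Word N-grams would need a Japanese word segmenter in place of char_ngrams, which is not shown here.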
