Document Page Image Classification based on Similarity of Visual Appearance

C. Shin and D. Doermann (USA)

Keywords

Document image categorization and classification, databases and retrieval, visual similarity, decision tree classifiers, self-organizing maps

Abstract

Categorizing documents by type or genre is a natural way to enhance the effectiveness of document retrieval. The visual appearance of a document’s layout contains a significant amount of information that can be used to classify it by type in the absence of domain-specific models. Our approach to classification is based on “visual similarity” of layout structure and is implemented by building a supervised classifier from examples of each class. We use image features such as the percentages of text and non-text (graphics, images, tables, and rulings) content regions, column structures, relative point sizes of fonts, density of the content area, and statistics of connected-component features, all of which can be derived without class knowledge. To obtain class labels for training samples, we conducted a study in which subjects ranked document pages by their resemblance to representative page images. Class labels may also be assigned based on known document types or defined by the user. We implemented our classification scheme using decision tree classifiers as well as self-organizing maps.
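
As a rough illustration of the pipeline described above, the following minimal sketch computes a few layout features from segmented regions and trains a small decision tree on them. It is not the authors' implementation: the region tuple format, the feature names (`text_pct`, `nontext_pct`, `density`), the class labels, and the toy pages are all assumptions made for the example, and the tree uses generic Gini-impurity splitting rather than any specific algorithm named in the paper.

```python
from collections import Counter

def layout_features(page_w, page_h, regions):
    """Hypothetical features; regions are (kind, x, y, w, h) tuples."""
    text = sum(w * h for kind, x, y, w, h in regions if kind == "text")
    nontext = sum(w * h for kind, x, y, w, h in regions if kind != "text")
    content = text + nontext
    return {
        "text_pct": text / content if content else 0.0,
        "nontext_pct": nontext / content if content else 0.0,
        "density": content / (page_w * page_h),  # share of page covered
    }

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that lowers weighted Gini, else None."""
    best, best_score = None, gini(labels)
    for f in rows[0]:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=3):
    split = best_split(rows, labels) if depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth - 1),
            build_tree([r for r, _ in right], [l for _, l in right], depth - 1))

def classify(tree, row):
    while isinstance(tree, tuple):  # internal nodes are tuples, leaves are labels
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Toy training pages (invented): (label, page width, page height, regions).
pages = [
    ("article", 100, 100, [("text", 10, 10, 80, 80)]),
    ("article", 100, 100, [("text", 10, 10, 80, 70), ("table", 10, 82, 80, 8)]),
    ("cover", 100, 100, [("image", 10, 10, 80, 60), ("text", 10, 75, 80, 10)]),
    ("cover", 100, 100, [("image", 5, 5, 90, 70), ("text", 5, 80, 90, 5)]),
]
rows = [layout_features(w, h, regs) for _, w, h, regs in pages]
labels = [label for label, *_ in pages]
tree = build_tree(rows, labels)

query = layout_features(100, 100, [("text", 10, 10, 80, 75)])
predicted = classify(tree, query)
```

The features here depend only on the page geometry, not on the class labels, which mirrors the point that the features can be derived without class knowledge; the labels enter only when the tree is trained.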
