Text Extraction using Document Structure Features and Support Vector Machines

K. Zagoris and N. Papamarkos (Greece)


Page Layout, Text Extraction, Support Vector Machines, Document Structure Elements, Connected Component Analysis


To locate and retrieve document images such as technical articles and newspapers successfully, a text localization technique must be employed. The proposed method detects and extracts homogeneous text areas in document images, regardless of font type and size. It first uses connected component analysis to detect blocks of foreground objects. Next, a descriptor consisting of a set of structural features is extracted from the merged blocks and used as input to a trained Support Vector Machine (SVM). Finally, the output of the SVM classifies each block as text or non-text.
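The first two stages of the pipeline can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 8-connectivity choice, and the two features (aspect ratio and fill density) are assumptions made for illustration, and the block-merging step and the paper's actual structural descriptor are not reproduced. In a full system, the resulting feature vectors would be passed to a trained SVM for the text or non-text decision.

```python
from collections import deque


def connected_components(image):
    """Find 8-connected foreground regions (pixel value 1) in a binary image.

    Returns one tuple per region: (top, left, bottom, right, pixel_count),
    with inclusive bounds, in row-major order of discovery.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right, count = r, c, r, c, 0
            while queue:
                y, x = queue.popleft()
                count += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right, count))
    return boxes


def block_descriptor(box):
    """Hypothetical two-feature descriptor: [aspect ratio, fill density]."""
    top, left, bottom, right, count = box
    height = bottom - top + 1
    width = right - left + 1
    return [width / height, count / (width * height)]


# Toy binary image: a 2x2 square and a 3-pixel vertical bar.
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
blocks = connected_components(image)
descriptors = [block_descriptor(b) for b in blocks]
```

Each entry of `descriptors` is a fixed-length vector, which is the form an SVM classifier expects as input. A real descriptor would need features that separate text from graphics more reliably than these two.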

