Optical Character Segmentation from Structured Documents

M. Adornato, C. Scagliola and S. Dellepiane (Italy)


OCR, character segmentation, structured documents.


This work addresses character segmentation of words taken from structured documents that contain connected or broken characters. In an OCR system, the segmentation phase plays a decisive role in overall accuracy. The algorithm presented here combines region growing with an analysis of the geometrical statistics of presumed characters to recover connected or fragmented characters, exploiting information derived from the other presumed characters in the same sequence. This approach makes the algorithm independent of font type and able to segment characters set in different fonts in different fields of the same document. The regions found are classified as “large”, “regular”, or “narrow”: large regions are split using information derived from the median pitch, and fragments are merged using information derived from the overlap or closeness of regular and/or narrow regions. Each recovered presumed character must pass a dimensional test before it is accepted as a valid character.
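
The classify, merge, split, and test sequence described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the threshold factors `wide` and `thin`, the `max_gap` closeness limit, and the use of the median region width as a stand-in for the median pitch are all assumptions made for illustration. Regions are taken to be horizontal extents `(left, right)` already produced by some region-growing step, which is not shown.

```python
from statistics import median


def classify(widths, med, wide=1.5, thin=0.6):
    """Label each width relative to the median width of the sequence.

    The factors `wide` and `thin` are illustrative assumptions.
    """
    labels = []
    for w in widths:
        if w > wide * med:
            labels.append("large")
        elif w < thin * med:
            labels.append("narrow")
        else:
            labels.append("regular")
    return labels


def segment(regions, max_gap=1, wide=1.5, thin=0.6):
    """Recover presumed characters from (left, right) extents sorted by left edge."""
    widths = [r - l for l, r in regions]
    # Assumption: the median width stands in for the median pitch.
    med = median(widths)
    labels = classify(widths, med, wide, thin)

    # Merge: attach a narrow fragment to the region on its left when the two
    # overlap or lie within max_gap, provided the result is not oversized.
    # (Simplification: only the left neighbour is considered.)
    merged = []
    for (l, r), label in zip(regions, labels):
        if merged and label == "narrow":
            pl, pr = merged[-1]
            if l - pr <= max_gap and r - pl <= wide * med:
                merged[-1] = (pl, max(pr, r))
                continue
        merged.append((l, r))

    # Split: cut an oversized region into equal parts, with the number of
    # parts estimated from the median.
    pieces = []
    for l, r in merged:
        w = r - l
        if w > wide * med:
            n = max(2, round(w / med))
            step = w / n
            pieces.extend(
                (round(l + i * step), round(l + (i + 1) * step)) for i in range(n)
            )
        else:
            pieces.append((l, r))

    # Dimensional test: keep only pieces whose width is plausible.
    return [(l, r) for l, r in pieces if thin * med <= r - l <= wide * med]


# Hypothetical sequence: two touching characters (24..46) and one character
# broken into two fragments (48..52 and 53..58).
regions = [(0, 10), (12, 22), (24, 46), (48, 52), (53, 58), (60, 70)]
result = segment(regions)
```

Because every threshold is expressed relative to the median of the same sequence, nothing in the sketch depends on a particular font, which mirrors the font independence claimed in the abstract.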
