Adaptive Generic Classifier for Structured Documents

Hamam Mokayed and Azlinah Hj. Mohamed


Document classification, Feature Extraction, DTW


Structured documents such as forms, cheques, and slips are used widely in all sectors and have an inherently high error rate (ERR). The main causes are human inconsistency when documents are filled in manually, the variety of written languages used around the world to supply the required information, and the different structure and layout of each document. In document classification systems, keeping the ERR low is difficult, and finding features that differentiate nearly identical documents is a further challenge. Developing a generic solution that handles forms in different written languages while overcoming these obstacles is therefore a major challenge in building a more robust structured document classification system. In this paper, an adaptive generic document classification engine is proposed. It builds a unique sequence of discrete symbols from the features of a structured document, then applies a dynamic time warping (DTW) algorithm to calculate the similarity between the symbol sequence of the tested document and the saved symbol sequences of all templates, and makes the classification decision from these similarities. This novel technique of building a symbol sequence from unique features and classifying the input with DTW shows a higher level of robustness with an improved ERR.
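
To make the matching step concrete, the following is a minimal sketch of DTW-based template classification over symbol sequences. It is not the authors' implementation: the abstract does not specify the symbol alphabet, the feature extraction, or the local cost, so the single-letter symbols, the 0/1 mismatch cost, and the names `dtw_distance` and `classify` are all assumptions chosen for illustration.

```python
from typing import Dict, Sequence, Tuple


def dtw_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Classic DTW distance between two symbol sequences.

    Assumption: the local cost is 0 for identical symbols and 1 otherwise.
    The paper may use a different cost between symbols.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # prev holds row i-1 of the cumulative cost table; row 0 is the boundary.
    prev = [0.0] + [inf] * m
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # Extend the cheapest of: step in a, step in b, or diagonal step.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def classify(
    sequence: Sequence[str], templates: Dict[str, Sequence[str]]
) -> Tuple[str, float]:
    """Return the template name with the smallest DTW distance, and that distance."""
    if not templates:
        raise ValueError("at least one template is required")
    scores = {name: dtw_distance(sequence, t) for name, t in templates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Hypothetical symbol sequences: each letter stands for one extracted feature.
templates = {
    "cheque": ["H", "L", "B", "B", "S"],
    "slip": ["H", "T", "T", "L"],
}
label, distance = classify(["H", "L", "L", "B", "S"], templates)
print(label, distance)
```

Because DTW allows one symbol to align with a run of repeated symbols, a tested document whose sequence is a stretched or compressed version of a template can still match it at low cost, which is one plausible reason the technique tolerates layout variation.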
