Bridging the Gap: From Multi Document Template Detection to Single Document Content Extraction

T. Gottron (Germany)


Content Extraction, Template Detection, Template Clustering, Web Mining.


Template Detection algorithms use collections of web documents to determine the structure of a common underlying template. Content Extraction algorithms instead operate on a single document and use heuristics to determine the main content. In this paper we propose a way to combine the reliability and theoretical underpinning of the former with the single-document approach of the latter. Starting from a single initial document, we use the set of hyperlinked web pages to build the required training set for Template Detection automatically. By clustering the documents in this set according to their underlying templates, we remove documents based on different templates from the training set. We confirm the applicability of the approach by using an entropy-based Template Detection algorithm to build a Content Extractor.
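The pipeline outlined above can be illustrated with a minimal sketch. It is not the algorithm evaluated in this paper: the function names, the Jaccard filter over tag paths (a stand-in for template clustering), the per-term entropy score, and both thresholds are assumptions chosen for illustration. Documents are assumed to be already fetched and segmented into text blocks.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def filter_same_template(seed_paths, candidates, min_sim=0.7):
    """Keep the text blocks of linked pages whose tag paths resemble the seed's.

    Hypothetical stand-in for template clustering: `candidates` is a list of
    (tag_paths, text_blocks) pairs; `min_sim` is an assumed threshold.
    """
    return [blocks for paths, blocks in candidates
            if jaccard(seed_paths, paths) >= min_sim]


def term_entropies(docs):
    """Normalised entropy of each term's distribution across documents.

    A term spread evenly over all documents scores close to 1 (template-like);
    a term confined to one document scores 0 (content-like).
    """
    per_term = {}  # term -> Counter mapping document index to occurrences
    for i, blocks in enumerate(docs):
        for block in blocks:
            for term in block.lower().split():
                per_term.setdefault(term, Counter())[i] += 1
    n = len(docs)
    entropies = {}
    for term, counts in per_term.items():
        total = sum(counts.values())
        h = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropies[term] = h / math.log(n) if n > 1 else 0.0
    return entropies


def extract_content(doc, entropies, threshold=0.8):
    """Return the blocks of `doc` whose mean term entropy is below `threshold`."""
    content = []
    for block in doc:
        terms = block.lower().split()
        if not terms:
            continue
        score = sum(entropies.get(t, 0.0) for t in terms) / len(terms)
        if score < threshold:  # assumed cut-off between template and content
            content.append(block)
    return content
```

In use, the seed document's blocks are added to the output of `filter_same_template`, `term_entropies` is computed over that training set, and `extract_content` is applied to the seed document. Blocks made of terms shared across the whole set, such as a navigation bar, are dropped, while blocks unique to the seed are kept.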

