Semi-Automatic Parsing for Web Knowledge Extraction

D. Camacho, M.A. López, and R. Aler (Spain)


Knowledge Acquisition, Information Extraction, Wrapper and Filtering Agents


Two hard problems deserve attention when designing and implementing Web-based applications for mining knowledge. On the one hand, it is necessary to build several types of Web agents (crawlers, spiders...) specialized in the knowledge sources used by the whole application, and to decide how these agents can work semi-automatically, adapting their behaviour to the dynamic conditions of these electronic sources. On the other hand, once these agents are fully implemented and deployed, maintenance can be hard: if the sources change, the related agents must be modified. Therefore, if robust Web-based agent systems are required, it is necessary to bear in mind that several parts (or functions) of the agents may have to be replaced at any time by the engineers. This paper presents a new approach to the problem of Information Extraction from the HTML information stored on the Web. Information Extraction is achieved by a set of specialized Web agents that access, retrieve, and finally filter the information stored in HTML pages. Our approach relies on a semi-automatic HTML parser that uses both a set of rules that define the knowledge to be extracted from the HTML pages and a set of rules that represent the final structure in which the retrieved knowledge is stored.
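
As an illustration of the two rule sets mentioned above, the following is a minimal sketch and not the parser described in this paper. It assumes that extraction rules can be expressed as tag/class pairs and that structure rules map each extracted field to a conversion function. All names (EXTRACTION_RULES, STRUCTURE_RULES, RuleParser, extract) and the sample fields are hypothetical; only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser

# Hypothetical extraction rules: which tag/class holds each piece of knowledge.
EXTRACTION_RULES = [
    {"field": "title", "tag": "h2", "class": "headline"},
    {"field": "price", "tag": "span", "class": "price"},
]

# Hypothetical structure rules: how each extracted field is stored in the record.
STRUCTURE_RULES = {"title": str.strip, "price": float}


class RuleParser(HTMLParser):
    """Collects the text of every element that matches an extraction rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.current = None  # field currently being captured, if any
        self.raw = {}        # field name -> raw text found in the page

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for rule in self.rules:
            if rule["tag"] == tag and rule["class"] in classes:
                self.current = rule["field"]
                self.raw.setdefault(self.current, "")

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the capture, so text after a
        # nested element inside a matched element is not collected.
        self.current = None

    def handle_data(self, data):
        if self.current is not None:
            self.raw[self.current] += data


def extract(html, extraction_rules=EXTRACTION_RULES, structure_rules=STRUCTURE_RULES):
    """Apply the extraction rules, then shape the result with the structure rules."""
    parser = RuleParser(extraction_rules)
    parser.feed(html)
    parser.close()
    return {
        field: convert(parser.raw[field])
        for field, convert in structure_rules.items()
        if field in parser.raw
    }


if __name__ == "__main__":
    page = '<div><h2 class="headline"> Widget </h2><span class="price">9.50</span></div>'
    print(extract(page))
```

Under these assumptions, a change in the layout of a source only requires editing the rule tables, not the parsing code, which is the maintenance property the abstract argues for.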

