On-the-Fly Detection of Content-Poor Webpaths

T.-C. Hsu, H.-T. Chang, and S. Wu (Taiwan)


Content filter, Web crawler.


Web page crawling is an essential part of a web search engine. Because the number of pages on the Web is so large, it is practically impossible for a search engine to cover them all. An important question for the search engine is therefore "Which web pages should be crawled and indexed?" We observe that most of the index-worthless web pages in a web site reside in the same directory or are generated by the same CGI program. We use the term webpath to denote the set of web pages residing in the same directory or generated by the same CGI program, and we call a webpath content-poor if it contains mostly index-worthless web pages. In this paper, we present an approach that detects content-poor webpaths on the fly, so that the crawler can improve the quality of the data it collects. We take a statistical approach, analyzing URL patterns and page content structures in the crawled pages to decide whether a webpath is content-poor. Our experimental results show that, given a fixed time interval, a crawler with content-poor webpath filtering produces a search index that yields approximately 10% better search results than the original crawler without the filter. The detection precision exceeds 90%.
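
The webpath notion described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `webpath_key` and `WebpathFilter`, the rule that a URL with a query string counts as program-generated, and the `min_pages` and `poor_ratio` thresholds are all assumptions made for illustration. The abstract does not say how a single page is judged index-worthless, so that verdict is passed in by the caller.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def webpath_key(url):
    """Map a URL to a webpath key (hypothetical rule, not from the paper).

    URLs with a query string are grouped by host + script path;
    all other URLs are grouped by host + containing directory.
    """
    parts = urlsplit(url)
    host = parts.netloc.lower()
    path = parts.path or "/"
    if parts.query:
        # Assumption: a query string means the page is program-generated.
        return host + path
    directory = posixpath.dirname(path)
    if not directory.endswith("/"):
        directory += "/"
    return host + directory


class WebpathFilter:
    """Running per-webpath statistics, updated as pages are crawled."""

    def __init__(self, min_pages=5, poor_ratio=0.8):
        # Both thresholds are illustrative assumptions.
        self.min_pages = min_pages
        self.poor_ratio = poor_ratio
        self.seen = defaultdict(int)
        self.poor = defaultdict(int)

    def record(self, url, index_worthless):
        """Register one crawled page and the caller's verdict on it."""
        key = webpath_key(url)
        self.seen[key] += 1
        if index_worthless:
            self.poor[key] += 1

    def is_content_poor(self, url):
        """Return True if the URL's webpath is mostly index-worthless so far."""
        key = webpath_key(url)
        n = self.seen.get(key, 0)
        if n < self.min_pages:
            return False  # too few samples to decide
        return self.poor.get(key, 0) / n >= self.poor_ratio
```

A crawler loop would call `record` after fetching each page and consult `is_content_poor` before enqueuing a new URL, skipping URLs whose webpath has already been flagged.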

