Focused Crawling Guided by Link Context

J. Dong, W. Zuo, and T. Peng (PRC)


Link context, anchor text, focused crawling


In designing a focused crawler, the choice of strategy for prioritizing unvisited URLs is crucial. Intuitively, the text surrounding a link on an HTML page, that is, the link context, is a good summary of the target page. However, little work has been done on exploiting the link context information about the seed URLs before the actual crawl begins. Motivated by these two observations, we propose a method that collects this link context beforehand and then uses it to guide the actual crawling. Experiments show that the proposed approach is reasonable and is particularly effective for single-topic crawling, especially at the initial stage.
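
As a rough illustration of the idea of prioritizing unvisited URLs by link context, the following minimal Python sketch scores each candidate link by the cosine similarity between its surrounding text and a term profile built from link contexts gathered beforehand. It is not the method of the paper: the bag-of-words cosine scoring and the names `build_topic_profile` and `Frontier` are assumptions introduced here purely for illustration.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_topic_profile(seed_link_contexts):
    """Merge link contexts collected before crawling into one term profile.

    Hypothetical helper: the source does not specify how the collected
    link context is represented.
    """
    profile = Counter()
    for context in seed_link_contexts:
        profile.update(tokenize(context))
    return profile


class Frontier:
    """Priority queue of unvisited URLs, best-scoring link context first."""

    def __init__(self, profile):
        self.profile = profile
        self._heap = []
        self._seen = set()

    def add(self, url, link_context):
        """Queue a URL once, scored by its link context against the profile."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = cosine(Counter(tokenize(link_context)), self.profile)
        # heapq is a min-heap, so the score is negated to pop the best URL first.
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        """Return the (url, score) pair with the highest score."""
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score
```

In use, a crawler built along these lines would call `add` for every link extracted from a fetched page, passing the text around the link, and call `pop` to choose the next page to fetch. Links whose context shares no terms with the profile receive a score of zero and are visited last.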

