Anirban Kundu


Crawling, hierarchical crawling, bandwidth utilization, Y-type queue


A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. The latest crawling techniques in use are parallel crawling and hierarchical crawling. In the latter, the whole Web site is extracted by dividing it into levels. The homepage from which the crawling process starts is the first level, all the hyperlinks present on that page together form the next level, and so on. In this process, all the Web pages at a single level are downloaded simultaneously by dynamically creating multiple crawlers, with the number of crawlers depending on the number of hyperlinks on that level. In a real-life scenario, however, the available bandwidth is limited and acts as a deterrent. This paper proposes a scheduling algorithm based on the sizes of the Web pages to make full use of the available bandwidth. To achieve this, a modified type of queue (Y-type) is introduced in which the URLs of the Web pages are kept in order and released in such a way that the total size of the Web pages issued is closest to the available bandwidth.
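
The release rule described above can be illustrated with a minimal sketch. It does not reproduce the Y-type queue, whose internal structure is not given in this abstract. It only shows one plausible way to pick, from the URLs of a level, a batch whose total size is as close as possible to the available bandwidth without exceeding it. The function names `pick_batch` and `schedule`, the per-round `budget`, the treatment of "closest" as "closest without exceeding", and the assumption that page sizes are known in advance (for example from an HTTP `Content-Length` header) are all hypothetical choices made for illustration.

```python
from typing import Dict, List


def pick_batch(sizes: Dict[str, int], budget: int) -> List[str]:
    """Choose URLs whose total size is as close to `budget` as possible
    without exceeding it (a subset-sum search, not the paper's Y-type queue)."""
    # best[t] holds one list of URLs whose sizes sum to exactly t.
    best: Dict[int, List[str]] = {0: []}
    for url, size in sizes.items():
        # Iterate over a snapshot so that each URL is used at most once.
        for total, chosen in list(best.items()):
            new_total = total + size
            if new_total <= budget and new_total not in best:
                best[new_total] = chosen + [url]
    return best[max(best)]


def schedule(sizes: Dict[str, int], budget: int) -> List[List[str]]:
    """Release the URLs of one level in successive batches until none remain."""
    pending = dict(sizes)
    batches: List[List[str]] = []
    while pending:
        batch = pick_batch(pending, budget)
        if not batch:
            # Assumption: a page larger than the budget is fetched alone.
            batch = [min(pending, key=pending.get)]
        for url in batch:
            del pending[url]
        batches.append(batch)
    return batches


if __name__ == "__main__":
    # Hypothetical page sizes (arbitrary units) for the URLs of one level.
    level = {"a": 40, "b": 70, "c": 30, "d": 60}
    for batch in schedule(level, budget=100):
        print(batch, sum(level[u] for u in batch))
```

With these made-up sizes and a budget of 100 units per round, each released batch sums to exactly 100. The search tracks every reachable total up to the budget, so it suits the modest number of hyperlinks found on one level, whereas a production crawler would likely need a cheaper heuristic.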
