A Hierarchical Web Page Segmentation Algorithm using Machine Learning

T. Ito, H. Sano, T. Ozono, and T. Shintani (Japan)


Web Page Segmentation, Web Page Layout, Mobile Phone, Machine Learning


We implemented a web browsing system that facilitates navigation and reading on mobile phones with small screens. The system must divide large web pages into small blocks that fit on such a screen. Each block should be a semantically coherent part of the page, and the appropriate granularity varies with the user and the application. We propose a new web page segmentation algorithm that uses layout information obtained after rendering. The algorithm has two key components. The first segments a web page hierarchically using eight layout templates. The second divides the web page into content blocks using a support vector machine. Experimental results show that the method achieves higher precision than the existing method.
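
To make the idea of hierarchical segmentation from rendered layout concrete, the following is a minimal sketch only. It does not reproduce the eight layout templates or the support vector machine described above, since the abstract does not specify them. As a stand-in, it recursively splits a set of rendered bounding boxes at the widest empty band, in the style of a recursive XY-cut. The names `Block`, `widest_gap`, `segment`, and the `min_gap` threshold are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# (left, top, right, bottom) of a rendered element, in pixels (assumed format)
Box = Tuple[int, int, int, int]


@dataclass
class Block:
    """A node in the segmentation tree: the boxes it covers and its sub-blocks."""
    boxes: List[Box]
    children: List["Block"] = field(default_factory=list)


def widest_gap(boxes: List[Box], axis: int) -> Tuple[int, Optional[int]]:
    """Return (gap size, cut position) of the widest empty band.

    axis 0 scans along x (vertical cut), axis 1 scans along y (horizontal cut).
    The cut position is None when no empty band exists.
    """
    spans = sorted((b[axis], b[axis + 2]) for b in boxes)
    best_gap, best_cut = 0, None
    reach = spans[0][1]  # farthest end covered so far
    for start, end in spans[1:]:
        if start - reach > best_gap:
            best_gap, best_cut = start - reach, (start + reach) // 2
        reach = max(reach, end)
    return best_gap, best_cut


def segment(boxes: List[Box], min_gap: int = 10) -> Block:
    """Recursively split boxes at the widest gap until no gap reaches min_gap."""
    block = Block(boxes)
    if len(boxes) < 2:
        return block
    gap_x, cut_x = widest_gap(boxes, 0)
    gap_y, cut_y = widest_gap(boxes, 1)
    # Prefer the axis with the wider empty band (x wins ties).
    gap, cut, axis = (gap_x, cut_x, 0) if gap_x >= gap_y else (gap_y, cut_y, 1)
    if cut is None or gap < min_gap:
        return block  # treated as a leaf content block
    first = [b for b in boxes if b[axis + 2] <= cut]
    second = [b for b in boxes if b[axis] >= cut]
    block.children = [segment(first, min_gap), segment(second, min_gap)]
    return block


# Hypothetical page: a header above a navigation column and an article column.
header = (0, 0, 300, 40)
nav = (0, 60, 80, 400)
article = (100, 60, 300, 400)
root = segment([header, nav, article])
```

In this sketch the tree depth plays the role of granularity: an application that wants coarse blocks can stop at the first level (header versus body), while one that wants finer blocks can descend to the leaves. In the proposed method, the fixed `min_gap` rule would presumably be replaced by the template matching and the learned classifier.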

