Web mining over the varied data available on the World Wide Web has recently become an active research area. Because Web pages are generally semi-structured, informative blocks are difficult to identify, and content detection techniques that remove unnecessary data (e.g. advertisements) from Web pages have become important. A Web page generally consists of many blocks containing various data and structural information. In this paper, we propose a method that classifies each block of a Web page into an appropriate category by building a Tree Alignment model that represents the HTML structure and a Vector model that represents the features of the blocks. Web sites normally have their own templates, and blocks may belong to different categories even when they occupy the same position in the Web browser or are structurally similar. It is therefore difficult to classify blocks accurately with a single classifier. To solve this problem, our approach builds multiple classifiers, one for each training domain, and classifies blocks by combining them.
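
The idea of training one classifier per domain and combining their outputs can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the method of the paper: the three block features (link density, text ratio, image ratio), the category names, the nearest-centroid classifier with cosine similarity, and the combination rule (summing per-category scores across domains) are all hypothetical. The sketch covers only a vector-style representation of blocks and does not model tree alignment over the HTML structure.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DomainClassifier:
    """Nearest-centroid classifier trained on blocks from one domain.

    Hypothetical stand-in for a per-domain classifier; the abstract
    does not specify which learner is used.
    """

    def __init__(self, samples):
        # samples: list of (feature_vector, category) pairs
        sums = {}
        counts = defaultdict(int)
        for vec, cat in samples:
            if cat not in sums:
                sums[cat] = [0.0] * len(vec)
            sums[cat] = [s + v for s, v in zip(sums[cat], vec)]
            counts[cat] += 1
        # One centroid (mean feature vector) per category
        self.centroids = {
            cat: [s / counts[cat] for s in total] for cat, total in sums.items()
        }

    def scores(self, vec):
        """Similarity of a block to each category centroid of this domain."""
        return {cat: cosine(vec, c) for cat, c in self.centroids.items()}


def classify_block(vec, classifiers):
    """Combine per-domain classifiers by summing their category scores.

    The sum rule is an assumed combination scheme chosen for simplicity.
    """
    combined = defaultdict(float)
    for clf in classifiers:
        for cat, score in clf.scores(vec).items():
            combined[cat] += score
    return max(combined, key=combined.get)


# Toy training data for two hypothetical domains.
# Assumed features: [link_density, text_ratio, image_ratio]
news_site = [
    ([0.1, 0.9, 0.1], "content"),
    ([0.9, 0.1, 0.2], "navigation"),
    ([0.3, 0.1, 0.9], "advertisement"),
]
shop_site = [
    ([0.2, 0.8, 0.3], "content"),
    ([0.8, 0.2, 0.1], "navigation"),
    ([0.2, 0.2, 0.8], "advertisement"),
]

classifiers = [DomainClassifier(news_site), DomainClassifier(shop_site)]

# Classify a block from an unseen page using all domain classifiers together.
print(classify_block([0.15, 0.85, 0.2], classifiers))
```

Summing scores lets every training domain contribute to the decision, so a block from an unseen template is not forced through a single site's notion of where each category sits. Other combination rules, such as majority voting or weighting each domain by its similarity to the target page, would fit the same structure.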