A method for comparing Web pages in terms of visual similarity is proposed. Conventional Web information retrieval and gathering systems, such as search engines, extract keywords from HTML source files and calculate the similarity between pages from those keywords. The extracted keywords are treated as semantic features representing the contents of Web pages. However, the visual features of Web pages are as important as their semantic features, because HTML is designed to present a Web page in a manner that humans can understand. The proposed method compares the layouts of Web pages using image processing and graph matching. Experimental results show that the accuracy of the layout analysis is 91.6% on average, and that the visual similarity calculated by the proposed method is closer to the visual judgments of test subjects than that calculated by a color-based comparison method.
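
To illustrate why a purely color-based comparison can disagree with human judgment of layout, the following is a minimal sketch of one common color-based measure, histogram intersection. The abstract does not specify which color-based method was used as the baseline, so the function names, the quantization into 4 levels per channel, and the toy pixel data are all assumptions made for illustration only.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Quantize each RGB channel into `levels` bins and return
    the normalized frequency of every occupied bin."""
    step = 256 // levels
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum over bins of the smaller of the two frequencies."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


# Hypothetical toy "screenshots", each a flat list of RGB pixels.
page_a = [(255, 255, 255)] * 6 + [(0, 0, 200)] * 2   # white page, blue block at the end
page_b = [(250, 250, 250)] * 6 + [(200, 0, 0)] * 2   # white page, red block at the end
page_c = list(reversed(page_a))                      # same colors as page_a, block moved

hist_a = color_histogram(page_a)
sim_ab = histogram_intersection(hist_a, color_histogram(page_b))
sim_ac = histogram_intersection(hist_a, color_histogram(page_c))
```

In this sketch `page_a` and `page_c` receive the maximum similarity even though the colored block sits in a different position, because a color histogram discards all spatial information. A layout-based comparison of the kind the abstract describes is intended to capture that positional difference.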