A Direct Web Page Templates Detection Method

6 Author(s)
Xie Su-bin ; Sch. of Inf., Renmin Univ. of China, Beijing, China ; Liang Bin ; Shi Wen-chang ; Liang Zhao-hui

Currently, a large number of web sites are generated from web templates to improve the productivity of web site construction. However, the prevalence of web templates has a negative impact on the efficiency of search engines in many respects, including the relevance judgment of web IR and the resource usage of analysis tools. In this paper, we present a direct and fast method to detect pages generated from the same template using DOM tree characteristics. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash digest, also called a fingerprint, for each page to identify its DOM structure. We also introduce some other page features to help judge the page template type. Through experimental evaluations over thirty thousand sub-domains, we show that our approach obtains analysis results rapidly while maintaining an accuracy rate above 95 percent.
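
The fingerprinting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how DOM nodes are compressed or which hash function is used, so everything below is an assumption made for illustration only: the names `TagPathCollector` and `dom_fingerprint` are hypothetical, the compression rule (keep each distinct root-to-node tag path once, ignoring text and attributes) is a stand-in for the paper's own, and SHA-256 is an arbitrary choice of digest.

```python
import hashlib
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not deepen the path.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the distinct root-to-node tag paths of an HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open tags from the root down to the current node
        self.paths = {}   # dict used as an insertion-ordered set of paths

    def handle_starttag(self, tag, attrs):
        # Assumed compression rule: a path is recorded only the first time
        # it appears, so repeated rows or list items do not alter the result.
        self.paths.setdefault("/".join(self.stack + [tag]), None)
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def dom_fingerprint(html):
    """Returns a hex digest that depends only on the page's tag structure."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    compressed = "\n".join(collector.paths)
    return hashlib.sha256(compressed.encode("utf-8")).hexdigest()
```

Under these assumptions, two pages that share a layout but differ in text or in the number of repeated list items produce the same digest, while a page with a different tag structure produces a different one. Grouping pages by `dom_fingerprint` is then a single dictionary lookup per page, which is one plausible reading of why a hash-based approach can be fast.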

Published in:

Internet Technology and Applications (iTAP), 2011 International Conference on

Date of Conference:

16-18 Aug. 2011