Notice of Violation of IEEE Publication Principles
"An Efficient Method of Eliminating Noisy Information in Web Pages for Data Mining,"
Tripathy, A. K.; Singh, A. K.
The Fourth International Conference on Computer and Information Technology (CIT'04),
September 14-16, 2004, Wuhan, China, pp. 978-985.
After careful and considered review of the content and authorship of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles.
The first author (A.K. Tripathy) has taken full responsibility and this violation was done without the knowledge of the second author (A.K. Singh).
This paper is a duplication of the original text from the paper cited below. The original text was copied without attribution (including appropriate references to the original author(s) and/or paper title) and without permission. Due to the nature of this violation, reasonable effort should be made to remove all past references to this paper, and future references should be made to the following article:
Lan Yi, Bing Liu, and Xiaoli Li.
"Eliminating Noisy Information in Web Pages for Data Mining,"
Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (KDD-2003),
Washington, DC, USA, August 24-27, 2003.
A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements. We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this work, we propose a noise elimination technique. We propose a tree structure, called pattern tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a pattern tree can be built for the site, which we call the site pattern tree (SPT). We then introduce an information-based measure to determine which parts of the SPT represent noises and which parts represent the main contents of the site. The SPT is employed to detect and eliminate noises in any Web page of the site by mapping this page to the SPT. The proposed technique is evaluated on a data mining task, namely Web clustering.
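
The abstract does not give the pattern tree construction or the exact information-based measure, so the following is only a rough sketch of the general idea, not the paper's method. It assumes each sampled page has already been flattened into a mapping from a block path (a hypothetical string such as "body/table[0]") to that block's text. It scores each path by the normalized entropy of its contents across the sample: blocks whose content repeats on every page score near 0 and are treated as noise, while blocks whose content varies score near 1. The names block_scores and strip_noise and the 0.5 threshold are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def block_scores(pages):
    """Score each block path by normalized content entropy over sampled pages.

    pages: list of dicts mapping a block path to that block's text.
    Returns a dict: path -> score in [0, 1]; low scores suggest noisy blocks.
    This is an assumed stand-in for the paper's information-based measure.
    """
    contents = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            contents[path].append(text)

    scores = {}
    for path, texts in contents.items():
        total = len(texts)
        if total < 2:
            # A path seen once gives no evidence of repetition; keep it.
            scores[path] = 1.0
            continue
        counts = Counter(texts)
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        # Divide by the maximum possible entropy so scores are comparable.
        scores[path] = entropy / math.log(total)
    return scores


def strip_noise(page, scores, threshold=0.5):
    """Keep only blocks whose path scored at or above the threshold."""
    return {p: t for p, t in page.items() if scores.get(p, 1.0) >= threshold}


sample = [
    {"nav": "Home | Products | Contact", "main": "Review of camera A"},
    {"nav": "Home | Products | Contact", "main": "Review of phone B"},
    {"nav": "Home | Products | Contact", "main": "Review of laptop C"},
]
scores = block_scores(sample)
cleaned = strip_noise(sample[0], scores)
```

In this toy sample the navigation block is identical on all three pages, so it is dropped, and only the main block is kept. A real implementation would work on the DOM tree and would also take presentation styles into account, as the abstract describes.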