By Topic

Recognition of common areas in a Web page using visual information: a possible application in a page classification

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

4 Author(s)
M. Kovacevic ; Sch. of Civil Eng., Belgrade Univ., Serbia ; M. Diligenti ; M. Gori ; V. Milutinovic

Extracting and processing information from Web pages is an important task in many areas like constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on such a flat representation. We propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using visual information one is able to define heuristics for the recognition of common page areas such as header, left and right menu, footer and center of a page. We show in initial experiments that using our heuristics defined objects are recognized properly in 73% of cases. Finally, we show that a Naive Bayes classifier, taking into account the proposed representation, clearly outperforms the same classifier using only information about the content of documents.

Published in:

Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on

Date of Conference:

2002