Notification:
We are currently experiencing intermittent issues impacting performance. We apologize for the inconvenience.
By Topic

Reverse engineering for Web data: from visual to semantic structures

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Chung, C.Y. ; Verity Inc., Sunnyvale, CA, USA ; Gertz, M. ; Sundaresan, N.

Despite the advancement of XML, the majority of documents on the Web is still marked up with HTML for visual rendering purposes only, thus building a huge amount of legacy data. In order to facilitate querying Web based data in a way more efficient and effective than just keyword based retrieval, enriching such Web documents with both structure and semantics is necessary. We describe a novel approach to the integration of topic specific HTML documents into a repository of XML documents. In particular, we describe how topic specific HTML documents are transformed into XML documents. The proposed document transformation and semantic element tagging process utilizes document restructuring rules and minimum information about the topic in the form of concepts. For the resulting XML documents, a majority schema is derived that describes common structures among the documents in the form of a DTD. We explore and discuss different techniques, and rules for document conversion and majority schema discovery. We finally demonstrate the feasibility and effectiveness of our approach by applying it to a set of resume HTML documents gathered by a Web crawler

Published in:

Data Engineering, 2002. Proceedings. 18th International Conference on

Date of Conference:

2002