Schema inference and data extraction from templatized Web pages | IEEE Conference Publication | IEEE Xplore

Abstract:

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. Many websites consist largely of pages that contain structured data, so an important step in Web information integration is extracting that information from the sites' documents. This paper presents an unsupervised approach to page-level data extraction that automatically detects the schema of web pages. Web pages are compared based on visual clues to determine whether they follow a fixed or a variant template. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
Date of Conference: 08-10 January 2015
Date Added to IEEE Xplore: 16 April 2015
Electronic ISBN: 978-1-4799-6272-3
Conference Location: Pune, India
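
The abstract mentions tree matching and alignment but does not give the algorithms. As a rough illustration of the general idea only, the sketch below uses Simple Tree Matching, a standard dynamic-programming technique for comparing DOM-like trees. It is a substitute chosen for illustration, not the paper's variant tree matching algorithm, and it compares tag structure only, not visual clues. The names `Node`, `simple_tree_matching`, `tree_size`, and `similarity` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A minimal DOM-like node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Count matched node pairs in a maximum top-down, order-preserving matching."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # best[i][j]: best score using the first i children of a and first j children of b
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def tree_size(node: Node) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(tree_size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Normalized score in [0, 1]; 1.0 means structurally identical trees."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two pages that plausibly share a template: same layout, different list length.
page1 = Node("div", [Node("h1"), Node("ul", [Node("li"), Node("li")])])
page2 = Node("div", [Node("h1"), Node("ul", [Node("li"), Node("li"), Node("li")])])
print(simple_tree_matching(page1, page2), round(similarity(page1, page2), 3))
```

A high similarity score between two pages would suggest a shared (fixed) template, while a low score would suggest heterogeneous templates; any threshold for that decision is an assumption and is not stated in the abstract.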
