By Topic

Crawling for domain-specific hidden Web resources

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Bergholz, A. ; Xerox Res. Center Eur., Meylan, France ; Childlovskii, B.

The Hidden Web, the part of the Web that remains unavailable for standard crawlers, has become an important research topic during recent years. Its size is estimated to 400 to 500 times larger than that of the publicly indexable Web (PIW). Furthermore, the information on the hidden Web is assumed to be more structured, because it is usually stored in databases. In this paper, we describe a crawler which starting from the PIW finds entry points into the hidden Web. The crawler is domain-specific and is initialized with pre-classified documents and relevant keywords. We describe our approach to the automatic identification of Hidden Web resources among encountered HTML forms. We conduct a series of experiments using the top-level categories in the Google directory and report our analysis of the discovered Hidden Web resources.

Published in:

Web Information Systems Engineering, 2003. WISE 2003. Proceedings of the Fourth International Conference on

Date of Conference:

10-12 Dec. 2003