Cart (Loading....) | Create Account
Close category search window

What's there and what's not?: focused crawling for missing documents in digital libraries

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Ziming Zhuang ; Sch. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA ; Wagle, R. ; Giles, C.L.

Some large scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives, university and author homepages, and authors' self-submissions. While these approaches have so far built reasonable size libraries, they can suffer from having only a portion of the documents from specific publishing venues. We propose to use alternative online resources and techniques that maximally exploit other resources to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over a time frame of 1998 to 2004. We then identify the missing papers that are not indexed by CiteSeer. Using a fully automatic heuristic-based system that has the capability of locating authors' homepages and then using focused crawling to download the desired papers, we demonstrate that it is practical to harvest using a focused crawler academic papers that are missing from our digital library. Our harvester achieves a performance with an average recall level of 0.82 overall and 0.75 for those missing documents. Evaluation of the crawler's performance based on the harvest rate shows definite advantages over other crawling approaches and consistently outperforms a defined baseline crawler on a number of measures

Published in:

Digital Libraries, 2005. JCDL '05. Proceedings of the 5th ACM/IEEE-CS Joint Conference on

Date of Conference:

7-11 June 2005

Need Help?

IEEE Advancing Technology for Humanity About IEEE Xplore | Contact | Help | Terms of Use | Nondiscrimination Policy | Site Map | Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest professional association for the advancement of technology.
© Copyright 2014 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.