
What's there and what's not?: focused crawling for missing documents in digital libraries

Authors:
Ziming Zhuang (Sch. of Inf. Sci. & Technol., Pennsylvania State Univ., University Park, PA); Wagle, R.; Giles, C.L.

Some large-scale topical digital libraries, such as CiteSeer, harvest online academic documents by crawling open-access archives and university and author homepages, and by accepting authors' self-submissions. While these approaches have so far built reasonably sized libraries, they can end up holding only a portion of the documents from specific publishing venues. We propose using alternative online resources, and techniques that exploit them as fully as possible, to build the complete document collection of any given publication venue. We investigate the feasibility of using publication metadata to guide the crawler towards authors' homepages to harvest what is missing from a digital library collection. We collect a real-world dataset from two Computer Science publishing venues, involving a total of 593 unique authors over the period 1998 to 2004. We then identify the papers that are not indexed by CiteSeer. Using a fully automatic, heuristic-based system that locates authors' homepages and then applies focused crawling to download the desired papers, we demonstrate that it is practical to use a focused crawler to harvest academic papers that are missing from our digital library. Our harvester achieves an average recall of 0.82 overall and 0.75 for the missing documents. Evaluation based on harvest rate shows definite advantages over other crawling approaches, and the crawler consistently outperforms a defined baseline crawler on a number of measures.
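
To make the terms in the abstract concrete, the following is a minimal Python sketch of a best-first (focused) crawl over a toy in-memory link graph, together with the two evaluation measures the abstract mentions. It is not the authors' system: the toy graph `TOY_WEB`, the keyword list `KEYWORDS`, and the anchor-text scoring are hypothetical stand-ins for the heuristics, and recall and harvest rate are computed with the definitions commonly used for focused crawlers (fraction of wanted documents retrieved, and fraction of fetched pages that were wanted), which the abstract itself does not spell out.

```python
import heapq

# Hypothetical toy "web": page -> list of (anchor text, target page).
TOY_WEB = {
    "home": [("Publications", "pubs"), ("Teaching", "teaching"), ("Photos", "photos")],
    "pubs": [("Paper A (PDF)", "a.pdf"), ("Paper B (PDF)", "b.pdf")],
    "teaching": [("Syllabus", "syllabus.html")],
    "photos": [],
    "a.pdf": [],
    "b.pdf": [],
    "syllabus.html": [],
}

# Assumed heuristic: anchor text mentioning these words is more promising.
KEYWORDS = ("publication", "paper", "pdf")


def score(anchor):
    """Count how many keywords appear in the anchor text."""
    text = anchor.lower()
    return sum(kw in text for kw in KEYWORDS)


def focused_crawl(web, seed, budget):
    """Visit up to `budget` pages, always expanding the best-scored link first."""
    frontier = [(0, 0, seed)]  # (negated score, insertion order, page)
    seen = {seed}
    visited = []
    counter = 1
    while frontier and len(visited) < budget:
        _, _, page = heapq.heappop(frontier)
        visited.append(page)
        for anchor, target in web.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(anchor), counter, target))
                counter += 1
    return visited


def recall(visited, wanted):
    """Fraction of the wanted documents that the crawl retrieved."""
    return len(set(visited) & set(wanted)) / len(wanted)


def harvest_rate(visited, wanted):
    """Fraction of the fetched pages that were wanted documents."""
    return len(set(visited) & set(wanted)) / len(visited)


wanted = ["a.pdf", "b.pdf"]
visited = focused_crawl(TOY_WEB, "home", budget=4)
print(visited, recall(visited, wanted), harvest_rate(visited, wanted))
```

With a budget of four page fetches, the scored frontier reaches both papers through the publications page before spending any fetches on the teaching or photo pages, which is the behaviour a focused crawler is meant to have relative to an unprioritised baseline.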

Published in:

Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '05), 2005

Date of Conference:

7-11 June 2005