An increasing number of data sources are now available on the Web, but their contents are often accessible only through query interfaces. For a given domain of interest, accessing this deep Web content has been a long-standing challenge. In this paper, we propose a deep Web crawling approach based on an ordinal regression model. We divide pages into three levels and treat the feedback from the page classifier as an ordinal regression problem. We also take into account the delayed benefit of links: related links are limited to three layers or fewer. Experimental results demonstrate that the feedback-based crawling strategy can effectively improve crawling speed and accuracy.
Published in:
Internet Technology and Applications (iTAP), 2011 International Conference on
Date of Conference: 16-18 Aug. 2011
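
The crawling strategy summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it only shows one plausible reading of the idea under stated assumptions. The three page levels are encoded as 0 (irrelevant), 1 (related) and 2 (target), the page classifier is replaced by a lookup table (`TOY_LEVELS`) rather than a trained ordinal regression model, and the names `crawl`, `TOY_LINKS`, `TOY_LEVELS` and `max_delay` are hypothetical. The level of a page is fed back as the priority of its outgoing links, and a path is abandoned once it has passed through three consecutive irrelevant pages without reaching a relevant one.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Toy link graph and page levels (assumed data, for illustration only).
TOY_LINKS: Dict[str, List[str]] = {
    "seed": ["a", "b"],
    "a": ["form"],
    "b": ["c"],
    "c": ["d"],
    "d": ["e"],
    "e": ["f"],
    "f": [],
    "form": [],
}
TOY_LEVELS: Dict[str, int] = {"seed": 1, "a": 1, "form": 2}  # others default to 0


def crawl(
    seed: str,
    get_links: Callable[[str], List[str]],
    classify: Callable[[str], int],
    max_delay: int = 3,
) -> List[Tuple[str, int]]:
    """Visit pages in priority order and return (url, level) pairs.

    Links inherit the ordinal level of the page they were found on as
    their priority (classifier feedback). `delay` counts the consecutive
    irrelevant pages on the path leading to a page.
    """
    # Heap entries: (negated priority, insertion order, url, delay).
    frontier: List[Tuple[int, int, str, int]] = [(-2, 0, seed, 0)]
    seen = {seed}
    visited: List[Tuple[str, int]] = []
    counter = 1
    while frontier:
        _, _, url, delay = heapq.heappop(frontier)
        level = classify(url)
        visited.append((url, level))
        # A relevant page resets the delay; an irrelevant one extends it.
        next_delay = 0 if level > 0 else delay + 1
        if next_delay > max_delay:
            continue  # too many irrelevant layers: do not expand this page
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-level, counter, link, next_delay))
                counter += 1
    return visited


if __name__ == "__main__":
    for url, level in crawl("seed", TOY_LINKS.__getitem__,
                            lambda u: TOY_LEVELS.get(u, 0)):
        print(url, level)
```

In this toy graph the link found on the related page `a` is visited before the link found on the irrelevant page `b`, and page `f` is never fetched because it lies beyond the three-layer limit.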