By Topic

A cognitive crawler using structure pattern for incremental crawling and content extraction

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Shijia Xi ; Tsinghua Univ., Beijing, China ; Fuchun Sun ; Jianmin Wang

In this paper, we design a cognitive crawler to dramatically reduce the website crawling cost and extract useful content from web pages in an unsupervised procedure. The main idea of reducing the crawling cost is to retrieving those lately modified pages and newly added pages only. However, in reality, it is impossible for traditional crawler to judge whether a page has been modified or newly added without doing a whole crawling. We propose a method to predict those lately modified pages and newly added pages without do any actual crawling; we also find a feasible and stable feature "structure pattern" to better indicates the modified probability of certain page. In the meanwhile, we develop a hybrid clustering method combined with K-means and agglomerative hierarchical clustering to automatically find all the structure patterns in certain website. Using structure pattern, we developed an unsupervised algorithm to generate website's templates; using templates, crawler can extract useful information of web pages much more easily and precisely. We also introduce feasible formulas to predict pages' modified probabilities and crawling time intervals. To evaluate the performance of an incremental crawling algorithm, we proposed three new indicators. Using the algorithm proposed, we could extract content of pages with high performance. The experimental results illustrate that structure pattern is very useful and the performance of this cognitive crawler is quite promising and it can save huge amount of bandwidth and is qualified for different websites of various scales.

Published in:

Cognitive Informatics (ICCI), 2010 9th IEEE International Conference on

Date of Conference:

7-9 July 2010