HisTrace: A system for mining on news-related articles instead of web pages

2 Author(s)
Lian'en Huang ; Shenzhen Grad. Sch., Internet Res. & Eng. Center, Peking Univ., Shenzhen, China ; Xiaoming Li

The Web now plays an important part in people's real-life activities. Researchers not only in computer science but also in sociology and economics may be interested in mining information directly related to real-life events, that is, news-related information on the Web. In this paper we propose a system that enables mining of news-related articles instead of raw web pages. Functionally, the system has two tasks: 1) mining for news-related articles and 2) duplicate elimination. For the first task, a novel approach for determining the titles, contents and publication-times of news-related articles is presented. Anchor texts are first used to extract titles from HTML bodies, and contents are then extracted immediately after the titles. After that, crawl-times and are used to initially compute publication-times for all articles. Finally, times extracted from HTML bodies, URLs and anchor texts are used to determine precise publication-times for possible articles. For the second task, a duplicate detection algorithm for news-related articles is described, which is based on LCS (longest common subsequence) and achieves both high precision and high recall. The framework of this algorithm was presented as a general-purpose algorithm for web pages in a previously published paper. In this paper we explain why the algorithm is particularly suitable for news-related articles and present the corresponding implementation details. Evaluations have been conducted which show the effectiveness of our approaches.
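To make the LCS idea in the second task concrete, here is a minimal sketch of a similarity score built on the longest common subsequence of two articles' word sequences. It is an illustration only, not the authors' algorithm: the abstract does not give the framework's details, so the whitespace tokenization, the normalization by the longer article, the 0.8 threshold, and the function names `lcs_length`, `lcs_similarity` and `is_duplicate` are all assumptions.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of a and b.

    Standard dynamic programming, keeping only two rows so memory is
    proportional to the shorter sequence.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer one, store rows for the shorter
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Assumed score: LCS length over words, divided by the longer length."""
    tokens_a = text_a.split()
    tokens_b = text_b.split()
    if not tokens_a or not tokens_b:
        return 0.0
    return lcs_length(tokens_a, tokens_b) / max(len(tokens_a), len(tokens_b))


def is_duplicate(text_a, text_b, threshold=0.8):
    """Flag two article bodies as duplicates; the threshold is hypothetical."""
    return lcs_similarity(text_a, text_b) >= threshold
```

Because an LCS preserves word order but tolerates insertions and deletions, a score of this kind stays high when a republished article differs only by a few added or removed words, which is one plausible reason such a measure suits news articles. The quadratic cost of the dynamic program means a real system would likely need to limit which pairs it compares.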

Published in:

Web Society (SWS), 2010 IEEE 2nd Symposium on

Date of Conference:

16-17 Aug. 2010