PEBL: Web page classification without negative examples

This paper appears in:
Knowledge and Data Engineering, IEEE Transactions on
Date of Publication: Jan. 2004
Author(s): Hwanjo Yu (Dept. of Comput. Sci., Illinois Univ., Urbana, IL, USA); Jiawei Han; Chang, K.C.-C.
Volume: 16, Issue: 1
Page(s): 70 - 81
Product Type: Journals & Magazines

Abstract

Web page classification is one of the essential techniques for Web mining, because classifying Web pages of an interesting class is often the first step of mining the Web. However, constructing a classifier for an interesting class requires laborious preprocessing, such as collecting positive and negative training examples. For instance, to construct a "homepage" classifier, one needs to collect a sample of homepages (positive examples) and a sample of non-homepages (negative examples). Collecting negative training examples in particular requires arduous work and caution to avoid bias. This paper presents a framework, called positive example based learning (PEBL), for Web page classification that eliminates the need to manually collect negative training examples in preprocessing. The PEBL framework applies an algorithm, called mapping-convergence (M-C), that achieves classification accuracy with positive and unlabeled data as high as that of a traditional SVM trained on positive and negative data. M-C runs in two stages: the mapping stage and the convergence stage. In the mapping stage, the algorithm uses a weak classifier to draw an initial approximation of "strong" negative data. Starting from this initial approximation, the convergence stage iteratively runs an internal classifier (e.g., SVM) that maximizes margins to progressively improve the approximation of negative data. Thus, the class boundary eventually converges to the true boundary of the positive class in the feature space. We present the M-C algorithm with supporting theoretical and experimental justifications. Our experiments show that, given the same set of positive examples, the M-C algorithm outperforms one-class SVMs and is almost as accurate as traditional SVMs.
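
The two stages described in the abstract can be illustrated with a minimal one-dimensional sketch. This is not the authors' implementation, and several parts are assumptions made for illustration: the function names, the toy data, the fixed cutoff that stands in for the weak classifier of the mapping stage, and the stopping rule (stop when no remaining unlabeled point is classified as negative), which the abstract does not spell out. In one dimension, with negatives lying below positives, the hard-margin linear separator is simply the midpoint between the closest negative and the closest positive, so that midpoint stands in for the SVM here.

```python
def max_margin_threshold(positives, negatives):
    """Hard-margin linear separator in 1-D (assumes negatives < positives):
    the midpoint between the largest negative and the smallest positive."""
    return (max(negatives) + min(positives)) / 2.0


def mapping_convergence(positives, unlabeled, strong_negative_cutoff):
    """Toy M-C loop on 1-D data. All names here are hypothetical."""
    # Mapping stage: a deliberately weak rule (a fixed cutoff, assumed for
    # this sketch) marks only far-away unlabeled points as "strong" negatives.
    negatives = [x for x in unlabeled if x < strong_negative_cutoff]
    remaining = [x for x in unlabeled if x >= strong_negative_cutoff]
    if not negatives:
        raise ValueError("mapping stage found no strong negatives")

    thresholds = []
    while True:
        # Convergence stage: retrain the max-margin classifier on the
        # positives and the negatives accumulated so far.
        threshold = max_margin_threshold(positives, negatives)
        thresholds.append(threshold)
        # Unlabeled points on the negative side join the negative set.
        newly_negative = [x for x in remaining if x < threshold]
        if not newly_negative:
            break  # assumed stopping rule: nothing new was labeled negative
        negatives.extend(newly_negative)
        remaining = [x for x in remaining if x >= threshold]
    return threshold, thresholds


# Toy data: labeled positives, plus unlabeled points from both classes.
positives = [5.0, 6.0, 7.5, 9.0]
unlabeled = [0.5, 1.0, 2.0, 3.0, 3.5, 4.0, 4.5, 5.5, 6.5, 8.0]

boundary, history = mapping_convergence(
    positives, unlabeled, strong_negative_cutoff=1.5
)
print("thresholds per iteration:", history)
print("final boundary:", boundary)
```

On this toy data the threshold moves monotonically from the initial, loose approximation toward the smallest labeled positive, which mirrors the abstract's claim that the boundary converges toward the boundary of the positive class. A real setting would use high-dimensional page features and an actual SVM as the internal classifier.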
