In many applications, such as information retrieval, it is often difficult and time-consuming to provide a large number of positive and negative examples for training a classification system. Instead, users often find it easier to indicate just a few positive examples of what they like, and these are then the only labeled examples available to the learning system. A large amount of unlabeled data, by contrast, is easy to obtain. How to make use of positive and unlabeled data for learning is a critical problem in machine learning and information retrieval. Several approaches to this problem have been proposed, but most do not work well when only a small amount of labeled positive data is available. In this paper, we propose a novel algorithm called Topic-Sensitive pLSA to solve this problem. The algorithm extends the original probabilistic latent semantic analysis (pLSA), which is a purely unsupervised framework, by injecting a small amount of supervision from the user. This supervision takes the form of indicating which documents fit the users' interests, and it is encoded as a set of constraints. By introducing penalty terms for these constraints, we propose an objective function that trades off the likelihood of the observed data against the enforcement of the constraints. We develop an iterative algorithm that can obtain a local optimum of the objective function. Experimental evaluation on three data corpora shows that the proposed method improves performance, especially when only a small amount of labeled positive data is available.
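For orientation, the unsupervised pLSA baseline that the abstract builds on can be sketched with the standard EM updates. This is a minimal sketch of plain pLSA only: it does not include the constraint penalty terms of Topic-Sensitive pLSA, whose exact form is not stated in the abstract. The function names (`plsa_em`, `log_likelihood`) and the toy count matrix are illustrative assumptions, not taken from the paper.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so it sums to 1 (uniform if all zero)."""
    total = sum(row)
    if total == 0.0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit plain (unsupervised) pLSA with EM.

    counts[d][w] is the frequency of word w in document d.
    Returns (p_z_d, p_w_z): P(z|d) per document and P(w|z) per topic.
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random strictly positive initialisation, normalised to distributions.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: posterior P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                # Accumulate expected counts n(d,w) * P(z|d,w) for the M-step.
                for z in range(n_topics):
                    mass = n_dw * post[z]
                    new_z_d[d][z] += mass
                    new_w_z[z][w] += mass
        # M-step: renormalise the expected counts.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def log_likelihood(counts, p_z_d, p_w_z):
    """Data log-likelihood: sum over (d, w) of n(d,w) * log sum_z P(z|d) P(w|z)."""
    total = 0.0
    for d, row in enumerate(counts):
        for w, n_dw in enumerate(row):
            if n_dw:
                p = sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z)))
                total += n_dw * math.log(max(p, 1e-300))
    return total


if __name__ == "__main__":
    # Hypothetical toy corpus: 4 documents over a 4-word vocabulary.
    toy_counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
    doc_topics, topic_words = plsa_em(toy_counts, n_topics=2)
    print(doc_topics)
    print(log_likelihood(toy_counts, doc_topics, topic_words))
```

A supervised variant in the spirit of the abstract would add penalty terms for the user-supplied constraints to this likelihood before optimizing; how those terms are defined and weighted is left to the full paper.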