High-dimensional concept spaces have various applications in Web search, including personalized search, related page computation, diversity preservation, user interest inference, similarity computation, and advertisement targeting. Clustering and classification methods are common means of mapping documents and users into concept spaces. In most classification algorithms, precision (accuracy) and recall (coverage) tend to be competing aspects. In this paper, we introduce query smearing, an algorithm that can significantly improve both the accuracy and coverage of an existing classifier by leveraging information contained in fully anonymized search engine logs. Starting with a potentially incomplete seed classification, it expands the classification information to cover various items in search engine logs using a weighted majority voting scheme. The technique is similar to semi-supervised learning approaches and may be classified as one, but it differs notably from most such approaches. In particular, initial labels are not fully trusted for accuracy or completeness (hence, after the first stage, they can be discarded), and additional relationships between classified items are used extensively to guide the process. Empirical evaluation shows that our algorithm performs well under the following assumptions: (i) the search engine log contains a sufficiently large number of query transactions, (ii) most results of most queries are relevant and on-topic, and (iii) a sufficient fraction of search results are classified in the seed classification, and those classifications are reasonably accurate (but not necessarily complete).
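The abstract does not spell out the voting scheme, so the following is only a minimal sketch of how weighted majority voting over a query log might propagate labels. The log format of (query, URL, weight) tuples, the function names `winner` and `smear`, the `min_share` threshold, the two-stage layout, and the toy data are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict


def winner(votes, min_share):
    """Return the label holding more than min_share of the vote weight, else None."""
    total = sum(votes.values())
    if total <= 0:
        return None
    label, top = max(votes.items(), key=lambda kv: kv[1])
    return label if top / total > min_share else None


def smear(log, seed, min_share=0.5):
    """One hypothetical smearing round: seed URL labels -> queries -> URLs.

    log  : list of (query, url, weight) tuples (assumed log format)
    seed : dict mapping url -> label, possibly incomplete or partly wrong
    """
    # Stage 1: every query collects weighted votes from its seed-labeled results.
    query_votes = defaultdict(lambda: defaultdict(float))
    for query, url, weight in log:
        if url in seed:
            query_votes[query][seed[url]] += weight
    query_labels = {}
    for query, votes in query_votes.items():
        label = winner(votes, min_share)
        if label is not None:
            query_labels[query] = label

    # Stage 2: every URL collects votes from the queries that returned it.
    # The seed labels are not consulted here, so a seed error can be outvoted.
    url_votes = defaultdict(lambda: defaultdict(float))
    for query, url, weight in log:
        if query in query_labels:
            url_votes[url][query_labels[query]] += weight
    url_labels = {}
    for url, votes in url_votes.items():
        label = winner(votes, min_share)
        if label is not None:
            url_labels[url] = label
    return url_labels


# Toy data (invented): one seed label is wrong, one URL has no seed label.
log = [
    ("jaguar speed", "a.example/cats", 3.0),
    ("jaguar speed", "b.example/wildlife", 2.0),
    ("jaguar speed", "c.example/jaguar-habitat", 1.0),
    ("big cats", "a.example/cats", 2.0),
    ("big cats", "d.example/lions", 2.0),
]
seed = {
    "a.example/cats": "animals",
    "b.example/wildlife": "animals",
    "c.example/jaguar-habitat": "autos",  # deliberately mislabeled seed entry
}
labels = smear(log, seed)
```

In this toy run the unlabeled page `d.example/lions` picks up a label from the query that returned it (coverage), and the mislabeled seed entry is outvoted by its co-occurring results (accuracy). This only works when most results of a query are on-topic and enough of them carry seed labels, which mirrors assumptions (ii) and (iii) above.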
Date of Conference: 14-16 Sept. 2009