By Topic

Query smearing: Improving classification accuracy and coverage of search results using logs

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Oztekin, B.U. ; Google Inc., Mountain View, CA, USA ; Chiu, A.

High dimensional concept spaces have various applications in Web search including personalized search, related page computation, diversity preservation, user interest inference, similarity computation, and advertisement targetting. Clustering and classification methods are common means to map documents and users into concept spaces. In most classification algorithms, precision (accuracy) and recall (coverage) tend to be competing aspects. In this paper, we introduce query smearing, an algorithm that can significantly improve both the accuracy and coverage of an existing classifier by leveraging information contained in fully anonymized search engine logs. Starting with a potentially incomplete seed classification, it expands the classification information to cover various items in search engine logs using a weighted majority voting scheme. The technique is similar to semi-supervised learning approaches and may be classified as one, but we have notable differences from most such examples. In particular, initial labels are not fully trusted for accuracy or completeness (hence, after the first stage, they can be thrown away), and additional relationships between classified items are used extensively to guide the process. Empirical evaluation shows that our algorithm performs well under the following assumptions: (i) the search engine log contains a sufficiently large number of query transactions, (ii) most results of most queries are relevant and on-topic, and (iii) sufficient fraction of search results are classified in the seed classification, and those classifications are reasonably accurate (but not necessarily complete).

Published in:

Computer and Information Sciences, 2009. ISCIS 2009. 24th International Symposium on

Date of Conference:

14-16 Sept. 2009