Improving Query Quality for Transductive Learning in Learning to Rank

In traditional transductive learning, all queries are used in learning to rank to generate pseudo-labels when sufficient training data are not available. However, low quality queries may degrade retrieval performance in transductive learning. We therefore consider it important to improve the quality of queries in transductive learning in order to train an effective ranking model. Using a small number of reliable samples and data close to the classification boundaries, we propose building a query quality estimator that relates performance-related benefits to features based on the normalized query commitment, which reflect query quality. In our proposed transduction model, all available queries are filtered by the query quality estimator, and only high quality queries that enhance the effectiveness of retrieval, and thus yield performance-related benefits, are used to generate pseudo-labels for learning to rank. Queries that may degrade performance are discarded when creating the pseudo-labels. The pseudo-labels aggregated from high quality queries in transductive learning are then leveraged in learning to rank scenarios without sufficient training data. The results of extensive experiments on the standard LETOR 4.0 dataset show that our proposed method can outperform strong baselines, and that the average normalized discounted cumulative gain is improved by up to 7.77% in some cases.
INDEX TERMS Transductive learning, query quality, retrieval performance, learning to rank


I. INTRODUCTION
Several techniques have been proposed in recent decades to construct ranking models for information retrieval, including traditional heuristic methods, probabilistic methods, and machine learning methods [1][2]. Of these, learning to rank is among the most popular weighting schemes [3]. However, the need for a sufficient amount of training data in learning to rank renders its use expensive and unfeasible when it is difficult to gather labels for the data [4].
Transductive learning [5][6][7], a semi-supervised mode of learning, is often used to iteratively aggregate pseudo-labels for learning to rank in information retrieval when a sufficient amount of training data is not available [8]. In information retrieval, it is often assumed that the top-ranking documents in the initial retrieval results, namely the results of content-based retrieval, are highly relevant to the given query while the bottom-ranking documents are irrelevant [9][10]. Thus, the top-ranking documents of each query are taken as positive samples and the bottom-ranking documents as negative ones in transduction. However, this assumption is not always reliable. For example, only 13% of the five top-ranking documents in the initial retrieval results on MQ2008, a subset of LETOR 4.0 [11], were found to be relevant to the queries. In this case, noise is introduced into the pseudo-positive examples when transductive learning is applied, which significantly degrades retrieval performance. Therefore, to guarantee the effectiveness of learning to rank in information retrieval, it is necessary to enhance the quality of the queries and their associated pseudo-labels during transductive learning.
Considering that a small number of labeled samples can be easily obtained in most application scenarios, we aim to improve query quality by selecting pseudo-data on a per-query basis. We consider these labeled samples reliable as they are manually annotated rather than generated by a semi-supervised algorithm. Moreover, examples with a low degree of confidence (unconfident examples), located close to the classification boundaries, significantly affect the effectiveness of retrieval [12], and these unconfident examples are incorporated in our method. To improve the quality of pseudo-labels in transduction, we estimate the quality of each given query by determining whether it can enhance retrieval performance with a limited number of reliable labels and data with low confidence close to the classification boundaries. High quality queries are extracted by building a query quality estimator that relates quality-related features based on the normalized query commitment (NQC) to performance-related benefits. The queries used for training are classified into two groups, a high quality group and a low quality group. Only queries that can improve performance are incorporated into the transduction to create pseudo-labels for learning to rank, while those belonging to the low quality group are discarded.
The major contributions of this paper are two-fold. First, we propose improving the retrieval effectiveness of transductive learning by building a query quality estimator that uses a small number of reliable examples as well as examples with low confidence located close to the classification boundaries. In contrast to traditional transductive learning, only high quality queries are leveraged to aggregate pseudo-examples during the iterative process of transduction.
Second, experiments on a standard dataset showed that the automatically learned features for ranking may not be sufficient to improve the effectiveness of retrieval in complex ranking models while the improvement in the quality of the training data played an important role in enhancing the effectiveness of information retrieval.
The remainder of this paper is organized as follows: Section II explains the application of transductive learning in information retrieval and surveys research on improving training data for learning to rank. Section III presents the proposed transductive learning approach in detail, and Section IV introduces the experimental settings and explains the baselines employed in the evaluation for comparison with the proposed method. Section V shows the experimental results on LETOR 4.0 and Section VI gives discussions. Section VII offers the conclusions of this work.

A. TRANSDUCTIVE LEARNING IN INFORMATION RETRIEVAL
Semi-supervised learning has been applied to many domains, such as indoor localization, visual ranking, speech recognition, scintillation detection, and web classification [7], [13][14][15][16][17]. For example, to enhance recommendation performance, Zhang et al. proposed a graph-based semi-supervised learning algorithm for indoor localization by utilizing crowd-sourced data [13]. To reduce the word error rate in speech recognition, semi-supervised learning was used to improve model generalization with limited available data [14]. For visual ranking, a semi-supervised learning algorithm, the SSLPP, was developed by incorporating information into the degree of relevance [15]. For scintillation detection, a semi-supervised detection system based on the DeepInfomax approach was developed [16]. In addition, semi-supervised learning can be combined with active learning to improve the efficiency of classification. Stikic et al. proposed combining self-training, co-training, and active learning for human activity recognition [17], and Wang et al. proposed a semi-supervised learning algorithm that combines active learning with transductive SVM for applications without sufficient labeled data [7]. For learning to rank without sufficient labeled data in information retrieval, many studies have verified the benefits of using transductive learning [17][18][19][20].
Transductive learning, a self-training-based semi-supervised learning approach, has been widely used in applications where only a small amount of labeled data is available, or none at all [5]. In such scenarios, transductive learning is often utilized to create pseudo-labels for learning to rank in information retrieval, where it only needs to train a learning model to predict the pseudo-labels of the given set of test examples, instead of all unobserved examples, as explained in [21]. The algorithm iteratively generates pseudo-examples from the remaining unlabeled data to construct a pseudo-training set for learning to rank algorithms. This section briefly explains the general setup of transductive learning in information retrieval, as described in [7], [18], and [19].
• Extraction of initial pseudo-labels. The top-ranking and bottom-ranking documents in the initial results of content-based retrieval are chosen as the initial positive and negative labels, respectively. The remaining documents in the initial results of retrieval are taken as unlabeled examples.
• Construction of a learning model. A classifier is built by making use of the initial pseudo-labels.
• Reconstruction of a learned model. A new learned model is trained by utilizing the most recently updated pseudo-labels.
The selection of new pseudo-labels and the reconstruction of the learning model are iterative, and continue until the halting criterion has been satisfied. For learning to rank applications where human labels are rare or unavailable, the pseudo-labels are generated iteratively by the transduction-based approach described above [22][23].
The effectiveness of the transductive learning approach depends mainly on the quality of the pseudo-examples, which in turn is affected directly by query quality. However, the traditional transductive method does not consider the quality of individual queries, and its retrieval performance is affected if queries with poor quality are included in the training query set for learning to rank.

B. IMPROVING QUALITY OF TRAINING DATA
The quality of the training data has a significant impact on retrieval performance in information retrieval [24][25][26][27]. Previous research on the quality of the training data has focused mainly on supervised learning, including supervised classification and supervised learning to rank [28][29][30].
Research on handling noisy data in classification can be divided into three categories: methods of feature selection, data selection, and establishing noise-tolerant models. Here, we review research on improving the training data in learning to rank. Given that the click logs of a web user can accurately reflect the relevance of a given document to the corresponding query, Xu et al. used document click logs to correct noisy data for learning to rank [27]. Arguing that not all labels of the training data are reliable, they proposed two dependent models (a sequentially dependent model and a fully dependent model) to predict the labels of the training data again. If the predicted label and the original label of a document were significantly inconsistent, the document was judged to be noisy and its label was corrected manually. Carvalho and Elsas proposed a sigmoid loss function to replace the hinge loss in RankSVM to reduce the impact of noise on document pairs [28]. Geng et al. proposed a feature selection method in which the documents relevant to a given query are re-ranked to ensure the hierarchy of the query and the documents in learning to rank [29]. Geng et al. also proposed a probabilistic graphical model that introduces an implicit variable to identify the true annotation of a document [30].
In the context of semi-supervised learning, some research has been published on improving the quality of the training data. Mallapragada et al. proposed an active query selection method using the min-max criterion for semi-supervised clustering [31]. Considering that graph-based methods are limited in their ability to jointly model graph structures and data features, Wu et al. proposed a graph-filtering framework that injects graph similarity into the data features by treating them as signals on the graph and applying a low-pass graph filter to extract useful data representations for classification, where labels can be efficiently assigned by adjusting the strength of the graph filter [32]. Some research has also considered filtering the training queries for semi-supervised learning in learning to rank [22][23][24]. Rahangdale and Raut proposed a clustering-based semi-supervised learning method and combined it with a non-measure-specific listwise approach for learning to rank when no labeled data are available [22]. Zhang et al. proposed a two-step clustering approach to filter queries by quality in semi-supervised learning for learning to rank when no labeled data are available [23]. Muandet et al. selected the example that led to the largest perturbation in the labels of the other examples, and used active learning to query a label for an unlabeled example [24].
We think that the quality of pseudo-labels should be enhanced in transductive learning. Besides, considering that a small number of reliable samples and unconfident examples are obtained in our application, we propose to improve the quality of the pseudo-labels by constructing a query quality estimator using semi-supervised learning algorithms instead of unsupervised learning.

III. FRAMEWORK OF PROPOSED TRANSDUCTIVE LEARNING WITH LIMITED RELIABLE LABELS AND EXAMPLES WITH LOW CONFIDENCE
In this section, we provide details of the proposed transductive learning method, which learns a query quality estimator from a few reliable examples and from examples with low confidence located close to the classification boundaries (unconfident examples) in order to select high quality queries. All available queries are filtered by the query quality estimator, so that high quality queries are selected from the candidate queries. The selected queries are then used to iteratively generate pseudo-labels for learning to rank. In general, our proposed algorithm is composed of three phases:
• estimating query quality with a few reliable examples;
• tagging the unlabeled data with transductive learning, where only high quality queries are utilized;
• retrieving documents by learning to rank, that is, generating the retrieval results using the tagged pseudo-labels.
The last phase is a common step in semi-supervised learning to rank, in which the pseudo-labels generated by transduction are used as input in learning to rank to train a ranking model for predicting the relevance of each document to a given query. The framework of our proposed approach is illustrated in Fig. 1.

A. ESTIMATING QUERY QUALITY
The quality of queries in a dataset differs. For some queries, the relevant documents are ranked highly in the initial (content-based) retrieval results, whereas other queries may have only a few relevant documents scattered across the initial results. Queries with high-ranking relevant documents are expected to have a high mean average precision, and are defined as high quality in this paper. In contrast, queries with few relevant documents are considered low quality as they have a low mean average precision. Our proposed method aims to improve transductive learning by selecting high quality queries.

1) FEATURES OF QUERY QUALITY
To estimate query quality, we needed to extract features that could indicate it. Query quality prediction, the task of estimating the effectiveness of a retrieval system given a search query in the absence of any feedback from the searcher [33], has been proven to be very challenging [33][34][35][36].
Methods of estimating query quality are diverse, and include clarity [34], weighted information gain (WIG) [35], NQC [36], and QPP for microblog search [33]. As in [24], [25], we used NQC in our method because it is easy to compute while still being effective in estimating query performance, as reported in [36]. The NQC measures the amount of query drift in the result list. As shown in [36], NQC is computed as follows:
$$\mathrm{NQC}(Q) = \frac{\sigma_R}{\lvert \mathrm{Score}(Q, C) \rvert}$$
where R is the set of documents retrieved for query Q, $\sigma_R$ is the standard deviation of the retrieval scores of the documents in R, and Score(Q, C) is the retrieval score of the collection.
Considering that the datasets we used had different fields, including title, body, anchor, URL, and the entire given document, we extracted query features for each of these fields. We collected the content-based scores of the corresponding fields by applying TF*IDF, BM25, LMIR.ABS, LMIR.DIR, and LMIR.JM, and then computed the NQC score of each combination of field and retrieval model. Therefore, in our proposed method, 25 query features (five fields by five retrieval models) were used to indicate query quality, as listed in Table 1.
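The construction of the 25 NQC-based features can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the dictionary layout of the inputs, and the use of the population standard deviation are assumptions made for the example.

```python
import math
from itertools import product

FIELDS = ["title", "body", "anchor", "url", "document"]
MODELS = ["TF*IDF", "BM25", "LMIR.ABS", "LMIR.DIR", "LMIR.JM"]

def nqc(result_scores, collection_score):
    """Std. dev. of the result-list scores divided by |collection score|."""
    n = len(result_scores)
    mean = sum(result_scores) / n
    sigma = math.sqrt(sum((s - mean) ** 2 for s in result_scores) / n)
    return sigma / abs(collection_score)

def query_quality_features(scores, collection_scores):
    """Return the 25 NQC features of one query, keyed by (field, model).

    scores[(field, model)]            -> retrieval scores of the result list
    collection_scores[(field, model)] -> retrieval score of the collection
    Both dictionaries are hypothetical inputs for this sketch.
    """
    return {
        key: nqc(scores[key], collection_scores[key])
        for key in product(FIELDS, MODELS)
    }

# Toy usage: identical scores for every (field, model) combination.
toy_scores = {key: [1.0, 2.0, 3.0, 4.0] for key in product(FIELDS, MODELS)}
toy_collection = {key: 2.0 for key in product(FIELDS, MODELS)}
features = query_quality_features(toy_scores, toy_collection)
```

Keying the features by (field, model) keeps the mapping to the entries of the feature table explicit; a flat 25-dimensional vector can be obtained by iterating over the keys in a fixed order.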

2) SELECTING HIGH QUALITY QUERIES
Queries used to represent different users' information needs have unique characteristics [37]. Some queries are popular and their top-ranking retrieval results supply informative content, whereas other queries are not popular and their retrieval results are not satisfactory. Intuitively, popular queries may rank documents with high precision, and thus have a high mean average precision. We call these queries high quality queries; they are expected to provide pseudo-examples with little noisy data. During the selection of high quality queries, all available queries were divided into three non-overlapping categories: training, validation, and testing queries. For each query-document pair, we used query features based on the NQC to represent its quality. Note that we assumed that a small number of reliable examples could be obtained for the query-document pairs of the training queries. We first trained a ranker using a few reliable samples, and then added a small number of randomly sampled low-confidence examples to the original training data to retrain the ranker. Because the LETOR 4.0 dataset, with about 1,700 queries on MQ2007 and 800 queries on MQ2008, is large, enumeration-based sampling of examples with low confidence located close to the classification boundaries was infeasible. Thus, random sampling was conducted 10 to 15 times, which achieved near-optimal results. Finally, the ranker with the highest mean average precision was chosen, and its effectiveness was tested on the validation queries. Applying this ranker to the validation queries gave the mean average precision α. Before the ranker was applied to the validation dataset, a content-based retrieval model was used on the validation queries to obtain a mean average precision β. We thus obtained the difference in mean average precision, δ = α - β.
A query quality estimator was then built by using the query-related features and the difference in mean average precision (δ) between the ranker and the content-based retrieval model on the set of validation queries.
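To make the role of δ concrete, the following sketch fits a simple estimator on toy validation data. It rests on several assumptions that are not taken from the experiments: each validation query is labeled as beneficial when its δ is positive, the estimator is a logistic regression trained by plain gradient descent, and all names (`train_quality_estimator`, `is_high_quality`) and numbers are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_quality_estimator(features, deltas, lr=0.5, epochs=500):
    """Fit weights so that P(delta > 0 | features) is modeled by a sigmoid."""
    labels = [1.0 if d > 0 else 0.0 for d in deltas]   # assumption: sign of delta
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def is_high_quality(w, b, x):
    """A query is kept when the estimator predicts a positive benefit."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

# Toy validation data: one NQC-like feature per query (hypothetical values).
alpha = [0.60, 0.50, 0.30, 0.20]          # AP of the learned ranker
beta = [0.40, 0.40, 0.35, 0.30]           # AP of content-based retrieval
deltas = [a - b_ for a, b_ in zip(alpha, beta)]
w, b = train_quality_estimator([[0.9], [0.8], [0.2], [0.1]], deltas)
```

With the 25 NQC features in place of the single toy feature, the same routine would score each test query, and only queries for which `is_high_quality` returns true would be passed on to transduction.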
The query quality estimator was then applied to each query in the collection of test queries to assess its ability to enhance retrieval performance. If a query enhanced the effectiveness of retrieval on the test queries, namely, if the gain in mean average precision predicted by the query quality estimator was greater than zero, it was considered a high quality query. Queries that enhanced the mean average precision were added to the set of high quality queries. These queries were expected to enhance the effectiveness of retrieval when the top-ranking and bottom-ranking documents were extracted. Because documents ranked high for a high quality query were likely to be closely related to the query topic, these queries created effective pseudo-labels through the iterative procedure of transductive learning. Table 2 shows the query types and their corresponding roles while selecting high quality queries. To balance the distributions of high quality and low quality queries in the training, validation, and testing sets, three-fold cross-validation experiments were conducted while selecting high quality queries. Note that during the process of building the ranker to predict retrieval performance in terms of mean average precision, each query-document pair was represented by the predefined document features. When constructing the query quality estimator, each query-document pair was represented by the quality-related features, namely the NQC-based features.

B. TAGGING UNLABELED DATA BY TRANSDUCTIVE LEARNING
It is common for documents relevant to different queries to have diverse distributions, and ignoring the quality of queries used for training may hurt the effectiveness of retrieval in learning to rank because it leads to the aggregation of low quality pseudo-labels during the iterations of transductive learning. After the selection of the high quality queries, all queries were divided into two classes, high quality and low quality queries. The former enabled learning to rank to improve its retrieval effectiveness because they generated pseudo-labels with a higher average quality. In contrast, the latter brought about only a minor improvement in performance, or could even degrade retrieval performance. Thus, the pseudo-labels created by these low quality queries were unreliable. In the proposed method, the 100 top-ranked documents in the initial content-based retrieval were treated as unlabeled examples. For high quality queries, we extracted the highest and lowest ranked documents as input, and trained a learner by applying learning to rank algorithms. The remaining unlabeled documents among the top 100 were re-ranked by this learner, and we picked the most relevant and most irrelevant examples from this list to update the training data. A new learner was then retrained on the most recent training data to re-rank the remaining unlabeled examples once again.
The procedure of training and selecting pseudo-labels was iterated until the halting criterion was met. The algorithm for tagging the unlabeled data is shown in Fig. 2. Finally, the pseudo-examples iteratively generated by the transductive learning algorithm were provided as input for learning to rank.
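The iterative tagging procedure for a single high quality query can be sketched as follows. This is an illustrative outline under stated assumptions: the number of documents taken per step (`k_init`, `k_step`), the iteration cap used as the halting criterion, and the toy `centroid_learner` (a stand-in for the learning to rank algorithm) are all hypothetical choices and not the settings of the experiments.

```python
def tag_unlabeled(docs, train, k_init=5, k_step=2, max_iter=10):
    """Iteratively pseudo-label the top-ranked documents of one query.

    docs  : feature vectors ordered by the initial content-based ranking
    train : callable (pos, neg) -> scoring function doc -> float
    """
    pos = list(docs[:k_init])                 # initial pseudo-positives
    neg = list(docs[-k_init:])                # initial pseudo-negatives
    unlabeled = list(docs[k_init:-k_init])
    for _ in range(max_iter):                 # assumed halting criterion
        if len(unlabeled) < 2 * k_step:
            break
        score = train(pos, neg)               # retrain on current labels
        ranked = sorted(unlabeled, key=score, reverse=True)
        pos.extend(ranked[:k_step])           # most relevant examples
        neg.extend(ranked[-k_step:])          # most irrelevant examples
        unlabeled = ranked[k_step:-k_step]
    return pos, neg

def centroid_learner(pos, neg):
    """Toy learner: score by projection on (positive - negative) centroid."""
    dim = len(pos[0])
    mu_p = [sum(d[i] for d in pos) / len(pos) for i in range(dim)]
    mu_n = [sum(d[i] for d in neg) / len(neg) for i in range(dim)]
    w = [p - n for p, n in zip(mu_p, mu_n)]
    return lambda d: sum(wi * di for wi, di in zip(w, d))

# Toy usage: 100 one-dimensional documents in initial ranking order.
docs = [[1.0 - i / 100.0] for i in range(100)]
pos, neg = tag_unlabeled(docs, centroid_learner)
```

Passing the learner as a callable keeps the loop independent of the ranking algorithm, so a pairwise ranker could be substituted for the toy learner without changing the tagging logic.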

C. RETRIEVING DOCUMENTS BY LEARNING TO RANK
To examine the retrieval performance of our proposed method, the pseudo-labels aggregated iteratively by it were used in learning to rank. We now introduce the query partitions in the learning to rank procedure. Four types of queries were used in our experiments: training, validation, target, and test queries. Both relevant and irrelevant examples were extracted from the documents retrieved by the training queries as initial pseudo-training data to train the ranking model. The queries in the validation set were used to determine the settings of the tunable parameters for learning to rank. The optimal parameters of the learning to rank algorithm obtained in this way were then used for the target queries to iteratively generate the final pseudo-examples. Finally, the ranking model learned from the pseudo-training examples generated by the target queries was used to generate document rankings for each query in the test set.
In our experiments, the high quality queries selected by the proposed transductive learning method were randomly split into three parts: a training query set, a validation query set, and a test query set. The low quality queries were in turn arbitrarily split into two parts: the remaining part of the validation set and the test queries. There was no intersection among the training, validation, and testing sets, but there was an overlap between the target and the test queries. Although manually assigned labels were used for the validation set during the optimization of the parameters, there was no overlap between the validation query set and the training or testing query sets. Such an experimental setup is common for semi-supervised learning to rank approaches, such as in [19], [43].

D. MACHINE LEARNING ALGORITHMS FOR LEARNING TO RANK AND CLASSIFICATION
In this section, we introduce the classification algorithms used to select high quality training queries, and the learning to rank algorithm used to tag the unlabeled data and to produce the retrieval results for a given query.

1) LEARNING TO RANK ALGORITHMS
Three major families of learning to rank algorithms, the pointwise, pairwise, and listwise approaches, are commonly used in information retrieval, and the pairwise and listwise methods have been shown to outperform the pointwise approach in terms of retrieval performance [38][39][40][41]. In this paper, RankSVM [42], a classical pairwise approach, was chosen to train the ranking models. The literature has shown the benefits of RankSVM in search tasks [42], [43].
Given a query Q and the set of documents D relevant to it, RankSVM transforms the ranking task into a classification task by leveraging the relative relevance of each document pair in D. As a classical pairwise learning to rank approach, RankSVM takes a pair of documents in D as input, and its optimization is similar to that of SVM: it adds an SVM regularization term for margin maximization to the objective, together with a parameter C that allows the margin to be traded off against the training error. This leads to the following optimization problem:
$$\min_{w,\,\xi}\ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i,j} \xi_{ij} \quad \text{s.t.} \quad w^{\top}(x_i - x_j) \ge 1 - \xi_{ij},\ \ \xi_{ij} \ge 0 \ \ \text{for all pairs with } y_i > y_j$$
where w is a weight vector adjusted during the learning procedure, $x_i$ is the feature vector of document i, $y_i$ is the relevance of that document to the given query, and $\xi_{ij}$ is a slack variable.
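The pairwise transformation behind RankSVM can be illustrated with a small sketch. It is not the solver used in the experiments: it minimizes the hinge-loss form of the objective, 0.5·||w||² + C·Σ max(0, 1 − w·(x_i − x_j)), by plain subgradient descent, for the documents of a single query, and the function names, learning rate, and toy data are assumptions.

```python
def pairwise_differences(X, y):
    """Difference vectors x_i - x_j for every pair with y_i > y_j."""
    return [
        [a - b for a, b in zip(X[i], X[j])]
        for i in range(len(X))
        for j in range(len(X))
        if y[i] > y[j]
    ]

def train_ranksvm(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - w.d))."""
    diffs = pairwise_differences(X, y)
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = list(w)                                    # regularizer term
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:                              # pair violates margin
                grad = [g - C * di for g, di in zip(grad, d)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def rank_score(w, x):
    """Ranking score of one document; higher means more relevant."""
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy usage: four documents of one query with graded relevance labels.
X = [[3.0, 0.0], [2.0, 1.0], [1.0, 0.5], [0.0, 1.0]]
y = [2, 1, 1, 0]
w = train_ranksvm(X, y)
```

Documents are then ranked by `rank_score`; with several queries, the difference vectors would be formed within each query only, since relevance labels are not comparable across queries.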

2) CLASSIFICATION ALGORITHMS
In information retrieval, when labeled data are available, many classification algorithms, such as neural networks, gradient boosting, logistic regression, and SVM, can be applied [44]. Because only a small amount of reliable data was available in our application, we used classification algorithms to predict the retrieval-related benefits of each given query while estimating query quality. Both logistic regression and naive Bayes were used to classify the queries. We now briefly introduce the two classifiers.
The idea of logistic regression is nearly identical to that of linear regression. However, logistic regression adds a step that uses the sigmoid function to convert the output of linear regression into a probability value that can then be mapped to two or more discrete classes. The core of classification in logistic regression is thus the sigmoid function. As with linear regression, the final solution is obtained by minimizing a cost function [45]. When logistic regression is applied to binary classification, the parametric model is as follows:
$$P(y = 1 \mid X) = \frac{1}{1 + e^{-w^{\top} X}}$$
where w is the weight vector applied to the feature vector X.
Logistic regression achieves classification by learning hyperplanes. In contrast, naive Bayes carries out classification by considering the probabilities of features. The Bayesian method uses knowledge of probability to classify the sample dataset. Because of its sound mathematical foundation, the false positive rate of the Bayesian classification algorithm is low. Its characteristic is to combine the prior probability with the posterior probability, which helps avoid subjective bias and over-fitting. The Bayesian classification algorithm yields high accuracy when the dataset is large, and the algorithm itself is relatively simple. Naive Bayes classifiers have worked well in many complex situations [46][47][48]:
$$P(C_k \mid x) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$
where $C_k$ is a class, x is a feature vector with n features, and $x_i$ is the i-th feature of x.
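As an illustration of the naive Bayes decision rule, the sketch below assumes a Gaussian likelihood for each feature. The naive Bayes variant is not specified in the text, so the Gaussian form, the variance smoothing constant, and the function names are assumptions made for the example.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate the prior and per-feature mean/variance of each class."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        cols = list(zip(*rows))
        means = [sum(col) / len(rows) for col in cols]
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9  # smoothing
            for col, m in zip(cols, means)
        ]
        model[c] = (len(rows) / len(X), means, variances)
    return model

def predict_nb(model, x):
    """Return the class maximizing log P(C_k) + sum_i log P(x_i | C_k)."""
    def log_posterior(c):
        prior, means, variances = model[c]
        log_lik = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)

# Toy usage: two quality features per query, label 1 = high quality.
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]]
y = [1, 1, 0, 0]
model = fit_gaussian_nb(X, y)
```

The product over features in the decision rule is computed as a sum of logarithms to avoid numerical underflow; the independence assumption is visible in the fact that each feature contributes its own term.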

A. DATASET
To examine the retrieval effectiveness of our proposed method, a series of extensive experiments was conducted on LETOR 4.0, which is a standard dataset for learning to rank [11]. LETOR 4.0 contains 25 million pages from the Gov2 dataset, and two query sets from the Million Query track of TREC 2007 and TREC 2008. The two query sets are hereinafter denoted by MQ2007 and MQ2008, respectively. MQ2007 contains 1,692 queries with labeled documents and MQ2008 contains 784 queries with labeled documents. In the experiments, we removed queries having no relevant documents when evaluating the effectiveness of our proposed transductive learning approaches, as suggested in [22], [23]. Documents in LETOR 4.0 are represented by predefined features. Each row in the dataset is a query-document pair, in which the first column is the relevance label of the document to the given query, the second column is the query id, and the subsequent columns form a 46-dimensional feature vector. In each query-document pair, the larger the relevance label, the more relevant the document is to the given query [11].
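A row in this layout can be read with a few lines of code. The sketch below assumes the usual LETOR text format, in which the query id is written as `qid:<id>`, each feature as `<index>:<value>`, and anything after `#` is a comment; the function name and the sample line are illustrative.

```python
NUM_FEATURES = 46

def parse_letor_line(line):
    """Parse 'label qid:ID 1:v1 ... 46:v46 # comment' into its parts."""
    body = line.split("#", 1)[0].split()      # drop the trailing comment
    label = int(body[0])                      # relevance label
    qid = body[1].split(":", 1)[1]            # query id as a string
    features = [0.0] * NUM_FEATURES
    for token in body[2:]:
        index, value = token.split(":", 1)
        features[int(index) - 1] = float(value)   # indices are 1-based
    return label, qid, features

# Toy usage with a shortened, made-up row (unlisted features stay 0.0).
sample = "2 qid:10032 1:0.056537 2:0.000000 46:0.076923 #docid = GX000-00-0000000"
label, qid, features = parse_letor_line(sample)
```

Grouping the parsed rows by `qid` then yields, for each query, the list of documents and labels needed by the per-query steps described earlier.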
In the tables provided below, an asterisk (*) indicates that a statistically significant difference between the baseline and our proposed method was observed. Considering that the t-test is commonly used as a standard statistical test for information retrieval on LETOR 4.0, our significance test was conducted using it at the 0.05 significance level.

B. EXPERIMENTAL SETUP AND EVALUATION MEASURE
The aim of our experiments was to evaluate the effectiveness of the proposed transductive learning, in which pseudo-examples generated iteratively by utilizing high quality queries were used to learn a ranking model for learning to rank. Five-fold cross-validation experiments were conducted using our proposed method. To examine the retrieval performance of the proposed method, four strong baselines were used.
• Semi-supervised learning approaches. In this approach, as suggested in [18], [20][21], all queries regardless of quality were utilized and the pseudo-examples were extracted from the documents ranked high and low. In traditional transductive learning, all queries were divided into three partitions: test queries, training queries, and validation queries.
• Clustered-based transductive learning, proposed in [22]. In this study, the authors presented a semi-supervised learning algorithm that used the clustered-based transductive method combined with a non-measure-specific listwise approach for learning to rank.
• A training query-filtering approach for semi-supervised learning to rank, proposed in [23]. In this method, queries were classified by utilizing only sparse labeled data.
• The state-of-the-art deep neural ranking models proposed in [49], [50]. DeepRank is a deep learning architecture that models the relevance of a document to a given query by simulating the process of human judgment [49]. Another neural ranking model is HiNT, a hierarchical neural matching model used to capture diverse patterns of relevance in ad-hoc retrieval. In HiNT, a deep neural network is employed to support high-quality relevance signal generation and flexible relevance assessment strategies [50]. For DeepRank and HiNT, we used the code released by the authors.
Standard evaluation measures in information retrieval, namely the mean average precision (MAP) and the average normalized discounted cumulative gain (AVG_NDCG), were used in our experiments. The MAP is computed as the mean of the average precisions over all queries, where precision is the fraction of relevant documents among the documents retrieved for a given query. For the given queries, a high MAP is obtained if the relevant documents of the queries are mostly ranked high. The MAP is given as follows:
$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})$$
where Q is the set of all queries available, |Q| is the total number of queries, $m_j$ is the number of relevant documents for query j, and $R_{jk}$ is the set of results ranked from the top down to the relevant document $d_k$.
The normalized discounted cumulative gain (NDCG) is also an evaluation measure commonly used in information retrieval. It leverages relevance judgments given in multiple ordered categories. It is given as:
$$\mathrm{NDCG}(Q, k) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} Z_{j,k} \sum_{m=1}^{k} \frac{2^{R(j,m)} - 1}{\log_2(1 + m)}$$
where Q is the set of all queries available, |Q| is the total number of queries, k is the rank cutoff, R(j,m) is the relevance label of document m for the given query j, and $Z_{j,k}$ denotes the normalization factor. In our experiments, the mean NDCG (AVG_NDCG) was also used as an evaluation measure.
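Both measures can be computed directly from the relevance labels of a ranked list. The sketch below is a minimal illustration with assumed function names; it takes the labels in ranked order, treats any label greater than zero as relevant for average precision, and assumes that all relevant documents of a query appear in its list.

```python
import math

def average_precision(relevances):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            total += hits / rank          # precision at this relevant document
    return total / hits if hits else 0.0

def ndcg_at_k(relevances, k):
    """NDCG@k with gain 2^rel - 1 and discount log2(1 + rank)."""
    def dcg(rels):
        return sum(
            (2 ** r - 1) / math.log2(1 + i)
            for i, r in enumerate(rels[:k], start=1)
        )
    ideal = dcg(sorted(relevances, reverse=True))   # normalization factor
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def mean_over_queries(metric, runs):
    """Average a per-query metric over the ranked lists of all queries."""
    return sum(metric(r) for r in runs) / len(runs)

# Toy usage: two queries with binary relevance labels in ranked order.
map_value = mean_over_queries(average_precision, [[1, 0], [0, 1]])
```

The ideal DCG plays the role of the inverse normalization factor, so a list that is already sorted by relevance has an NDCG of 1 at every cutoff.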

A. PERFORMANCE COMPARISON BETWEEN PROPOSED TRANSDUCTIVE LEARNING AND TRADITIONAL TRANSDUCTIVE LEARNING
We first evaluated the retrieval effectiveness of our proposed method (PTL) in comparison with traditional transductive learning (TTL), which used all available queries to generate pseudo-labels for learning to rank. In our proposed method, two classification algorithms were used, naive Bayes and logistic regression. In the details of the experiments provided below, the proposed methods with the two classifiers are denoted by PTL_NB and PTL_LR, respectively. Table 3 shows that a significant improvement over the baseline was obtained when the proposed method was applied to MQ2007 and MQ2008. An increase of 0.85% in terms of AVG_NDCG on MQ2007 was obtained, although a small decrease in terms of MAP was noted when naive Bayes was used to classify the test queries. On MQ2008, the proposed method, PTL, outperformed the baseline both in terms of MAP and AVG_NDCG. Improvements of 0.81% in terms of MAP and 1.69% in terms of AVG_NDCG were observed when PTL_NB was applied to MQ2008. The proposed method also showed significant advantages over the conventional method on LETOR 4.0 when logistic regression was used as the classification algorithm. For example, with MAP as the evaluation metric, the proposed method PTL_LR outperformed the baseline TTL by 1.47%. Furthermore, PTL_LR showed notable advantages in terms of AVG_NDCG. On MQ2008, an improvement of 1.03% in terms of AVG_NDCG was obtained. The results thus showed that query quality had a significant impact on the quality of pseudo-labels in transductive learning, and that it was necessary to distinguish high quality queries from low quality ones.

B. COMPARISON BETWEEN PROPOSED TRANSDUCTIVE LEARNING AND CLUSTERED-BASED TRANSDUCTIVE LEARNING
In this section, we compare our proposed method (PTL) with the clustered-based transductive learning (CTL) proposed in [22]. As shown in Table 4, when naive Bayes was used to classify the queries, the proposed method PTL_NB outperformed CTL_LISTMLE by 7.77% and CTL_LambdaRank by 2.57% in terms of AVG_NDCG. However, no significant advantage was obtained in the other cases. When logistic regression was used to assess query quality, the improvements on LETOR 4.0 were notable on both the MQ2007 and the MQ2008 in most cases. For example, in terms of the MAP, the proposed method PTL_LR outperformed the strong baselines significantly, by 0.82% compared with CTL_LISTMLE and 1.42% compared with CTL_LambdaRank. However, no notable improvement in terms of the MAP was obtained between PTL_LR and CTL_LambdaRank. The table shows that the proposed method PTL performed well on the MQ2008. By contrast, on the MQ2007, the effectiveness of PTL_LR was superior to that obtained by PTL_NB. We think that this is because naive Bayes assumes that the features are independent of one another, whereas the document features in LETOR 4.0 are closely related because they are scores obtained by applying different retrieval models. Thus, PTL_NB did not perform as well as PTL_LR.
When a few labeled examples are available, the proposed approach, PTL, is more suitable than CTL for creating pseudo-labels for learning to rank because the selected queries are of higher quality. For scenarios involving no labeled data, however, the clustering-based transduction method is recommended for generating pseudo-labels for learning to rank.

C. COMPARISON BETWEEN PROPOSED TRANSDUCTIVE LEARNING AND CLASSIFICATION-BASED TRANSDUCTIVE LEARNING
Thirdly, we compared our proposed method with classification-based transductive learning (STL), as shown in Table 5. Our method, PTL, outperformed the classification-based method in most cases, although the improvements were not notable on the MQ2008 when naive Bayes was applied to distinguish query quality. A significant advantage was obtained on the MQ2007 in terms of AVG_NDCG, where PTL_NB outperformed STL by 1.05%. The exception was the MQ2008 with logistic regression as the classification algorithm and AVG_NDCG as the evaluation measure, where the retrieval performance of PTL was notably worse: a significant decrease of 1.06% was observed with PTL_LR. In general, PTL_NB performed well, whereas PTL_LR did not perform as well as expected. A possible explanation is that the sampled low-confidence data lie mainly around the classification boundaries, which may bring little benefit to logistic regression because it classifies by learning a separating hyperplane. By contrast, the parameters of naive Bayes are optimized by maximum likelihood estimation, so the additional pseudo-labeled data may increase its precision.
The comparison with deep neural ranking models shows that, in terms of information retrieval, although more complex models were trained using deep neural ranking, the automatically learned features for ranking may not be sufficient to improve the effectiveness of retrieval. Our PTL significantly outperformed the deep neural ranking models in terms of MAP, which shows that the improvement in the quality of the training data played an important role in enhancing the effectiveness of information retrieval.
Table 7 gives the precision and recall of the selected high-quality queries measured against the ground truth.
If a query yielded a positive benefit for the test queries during query quality estimation, it was considered a high-quality query; otherwise, it was considered a low-quality one. On the MQ2007, a high precision of 93.2% was obtained in this way. Although a moderately low accuracy was obtained on the MQ2008, the high recall (85.82%) indicated that most high quality training queries were correctly identified by the proposed transductive learning method. Therefore, an effective and robust retrieval performance was still achieved by using the appropriate classification algorithm.

VI. DISCUSSION
The above results showed that PTL exhibited significant advantages in terms of retrieval performance compared with traditional transduction (TTL) and clustered-based transduction (CTL). Although enhancements in the effectiveness of retrieval were also noted relative to STL, PTL was not always superior to it. We thus investigated the effectiveness of classification-based transduction, STL, through iterative tests. Fig. 3 compares PTL and STL in terms of the MAP at each iteration from 1 to 15 on the MQ2007 when logistic regression was used to estimate query quality. It is clear that PTL outperformed STL in most cases. Note also that it yielded the highest values in terms of the MAP in the third and fourth rounds. We also calculated the MAP and AVG_NDCG of PTL and STL in each iteration from 1 to 15 on the MQ2008 when logistic regression was used for classification. Although PTL did not yield as high an MAP as STL in the second round, a high retrieval performance was obtained when it was applied to the MQ2008. Fig. 3 and Fig. 4 show that, regardless of the dataset, the retrieval performance of both PTL and STL exhibited a downward trend as the number of iterations increased. A high value of the MAP was obtained around the second to fourth rounds for both PTL and STL, which suggests that the pseudo-labels in these rounds were more likely to supply useful information for semi-supervised learning when applied to ranking queries.
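The precision and recall of the selected high-quality queries, as reported in Table 7, compare the set of queries chosen by the estimator with the set of queries that truly yield a positive benefit. The following minimal sketch shows that computation; the function name and the query identifiers are hypothetical, and the numbers are toy values unrelated to the reported results.

```python
def precision_recall(selected, truly_high_quality):
    """Precision and recall of the selected queries against the ground truth."""
    selected_set = set(selected)
    truth_set = set(truly_high_quality)
    true_positives = len(selected_set & truth_set)
    # Precision: fraction of selected queries that are truly high quality.
    precision = true_positives / len(selected_set) if selected_set else 0.0
    # Recall: fraction of truly high-quality queries that were selected.
    recall = true_positives / len(truth_set) if truth_set else 0.0
    return precision, recall


# Toy example: four queries selected, five queries truly beneficial.
selected_queries = ["q1", "q2", "q3", "q4"]
ground_truth = ["q1", "q2", "q3", "q5", "q6"]
precision, recall = precision_recall(selected_queries, ground_truth)
```

A high recall with a lower precision, as observed on the MQ2008, means that most beneficial queries were kept while some low quality queries also passed the filter.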

VII. CONCLUSIONS
In this paper, we proposed a transductive learning approach that distinguishes high quality queries from low quality ones to iteratively generate pseudo-labels for learning to rank. Specifically, to improve the recognition of relevant and irrelevant pseudo-labels, we estimated the quality of each given query by judging whether it could enhance retrieval performance, using a limited number of reliable labels along with low-confidence data close to the classification boundaries. By building a classifier that relates the features influencing query quality to performance-related benefits, our proposed approach extracted high quality queries. The proposed transductive method differs significantly from traditional transduction in that only the selected queries are used to iteratively aggregate pseudo-labels for learning to rank. Experiments were conducted on the standard LETOR 4.0 dataset, and the results showed that our proposed transductive learning approaches achieved a statistically significant improvement over the baselines in most of the cases considered. In addition, our study showed that, when a limited number of labeled examples is available, the proposed approach, PTL, is suitable for creating pseudo-labels for learning to rank because the selected queries supply useful information.