PeerRank: Robust Learning to Rank with Peer Loss over Noisy Labels

User-generated data are extensively utilized in learning to rank because they are easy to collect and up-to-date. However, such data inevitably contain noisy labels caused by users' annotation mistakes, lack of domain knowledge, system failures, etc., which makes building a robust model challenging. Owing to the remarkable capacity of deep neural networks to fit datasets, noisy labels significantly degrade the performance of learning-to-rank algorithms. To cope with this problem, previous studies have put forward several methods for label de-noising. However, they are either susceptible to the noise distribution of datasets, demand clean data, or incur additional computational cost. Moreover, most of them are difficult to extend to different scenarios. This paper proposes a simple yet effective framework named PeerRank that can be applied to a broad range of learning-to-rank applications, such as click-through rate prediction and commercial web search. PeerRank is a robust, effective, and adaptable framework that can be coupled with numerous models with theoretical guarantees. Extensive experiments on three public real-world datasets with thirteen point-wise base models and four semi-synthetic datasets with four pair-wise base models show the consistent improvement of PeerRank. The results comparing PeerRank with seven classic and state-of-the-art de-noising methods validate the advantages of the PeerRank framework for learning to rank over noisy labels.


I. INTRODUCTION
Learning to rank (LTR) approaches heavily rely on large-scale labeled data to build ranking models. Editorially labeled training data that require experts to annotate are time-consuming and costly to obtain. Thus, more and more research works use user-generated data, which are easily accessible and up-to-date, to train ranking models [1]. However, user-generated data inevitably contain noise for many reasons [2]. For example, the vague definition of relevance levels or the lack of domain knowledge makes it difficult for users to give a reliable label to each data point. Xu et al. [3] proved that label noise in training data, whether randomly generated or existing in real-world data, can significantly degrade the performance of LTR algorithms. Fig. 1 shows an example of how noisy labels affect ranking results. Suppose there are four items in the candidate pool, where x_1, x_2, x_3 are relevant items while x_4 is not. However, x_2 is observed as irrelevant from user interactions, i.e., it is a noisy data point. The ranking model learned from the noisy data will probably rank the items as π̃ : ⟨x_1, x_4, x_3, x_2⟩, which decreases metric values compared with the ranking result π : ⟨x_1, x_2, x_4, x_3⟩ obtained from the clean data. If we consider a longer ranking list, a noisy interaction on a single item will lead to a large number of mislabeled pairs, which deteriorates overall ranking performance [4]. Making matters worse, the state-of-the-art (SOTA) performance of LTR is achieved by deep models [5]-[7], and deep neural networks are more likely to fit noisy labels than traditional lower-capacity ranking algorithms [8]. A robust model, defined as one that can tolerate perturbations in the data, is urgently needed to address these issues.
In the literature, several solutions have been proposed for designing robust models and de-noising labels. Recent works train models that are invulnerable to outliers by developing robust loss functions [9]-[13], applying regularization techniques [14]-[17], or selecting reliable samples [3], [18]-[22]. However, these methods are either significantly affected by changes in the noise distribution [23], improve performance only marginally [24], or carry the risk of eliminating clean data [2], [23]. This paper proposes a novel framework, named PeerRank, that can be easily applied to a broad class of applications with noisy labels in LTR. The PeerRank framework is built on Peer Loss [25], which deals with noisy labels in binary classification tasks. LTR can be mapped to classification tasks if we consider whether an item is relevant to a user or not, corresponding to point-wise LTR approaches, or whether an item is more relevant than another one, corresponding to pair-wise LTR approaches. PeerRank constructs a peer sample for each training instance and trains a ranking model based on Peer Loss with both the observed data samples and the peer samples. The principle is that a model trained with Peer Loss on observed samples and peer samples is equivalent to a model trained via empirical risk minimization (ERM) on clean data.
PeerRank inherits a number of merits. Firstly, it adapts well to different data distributions and varying degrees of data noise without prior knowledge of the noise rate. Secondly, it easily fits different LTR algorithms, including point-wise, pair-wise, and some list-wise approaches. The framework is simple and general, requiring no dedicated architecture design for each algorithm or dataset. Thirdly, it does not restrict the choice of ERM method, so a wide range of loss functions is applicable. Finally, we theoretically prove that PeerRank has the properties of robustness and effectiveness, which are essentially desired by all de-noising approaches.
We conduct extensive quantitative experiments to verify that PeerRank easily couples with thirteen SOTA point-wise and four pair-wise approaches and achieves better performance than those LTR approaches without PeerRank. We empirically demonstrate the advantages of PeerRank over seven classic and SOTA de-noising methods. We also establish the valid range of the only hyper-parameter of PeerRank, explore the effect of the noise rate on its performance, and conduct further experiments comparing PeerRank with several SOTA de-biasing methods.

B. DE-NOISING APPROACHES
Several robust training methods have been proposed for label de-noising [24]. However, most of them are applied in computer vision, and few are used in information retrieval (IR).
Most works try to design a robust loss function. GCE [10], TCE [12], and SCE [11] are built on the mean absolute error and categorical cross-entropy to combine their respective advantages of robustness, fast convergence, and generalization [9]. Some adjustments to the loss functions also improve robustness; for example, BootStrap (BS) [13] uses label refurbishment to update the training labels. Regularization [14]-[17] is utilized to prevent a model from over-fitting noisy labels; e.g., Label Smoothing (LS) [17] estimates the marginalized effect of label noise during training to prevent the neural network from fully fitting noisy training samples.
Sample selection is another widely used approach that tries to distinguish and remove noisy data samples to pursue robust learning. For example, Co-teaching (CT) [20] selects samples with low losses and feeds them to another network for further training; the noise rate (denoted τ in CT [20]) is required as a hyper-parameter. Reweight [2] contains two dedicated steps: it first calculates the probability of a label being noisy and then reweights the loss. Similar ideas can be found in [3], [18], [19], [21]-[23].
However, the methods mentioned above have several drawbacks. The robust-loss methods are sensitive to changes in the noise distribution [23], which greatly reduces their applicability. Since the definition of a clean sample is vague [2], [23], sample selection methods may eliminate numerous clean and sound samples in the attempt to exclude noisy and unreliable ones. Other works applying meta-learning [23], [43], [44] or semi-supervised learning [45] either require a certain amount of clean data that may be unavailable in real-world scenarios or bring about an inevitable increase in computational cost [24].
In view of all these limitations, we develop a more effective and more applicable method, named PeerRank, to deal with noisy training data for LTR based on Peer Loss [25], a family of loss functions that copes with noisy labels in binary classification tasks without prior knowledge of the noise rate. A wide range of algorithms can be further improved with our proposed PeerRank framework. In this paper, we examine how PeerRank takes effect on three branches of LTR approaches. We prove theoretically and empirically that PeerRank is more robust and effective for ranking with noisy labels than the existing approaches.

A. PROBLEM SETUP
We denote the clean dataset as D = (X, Y), where X represents the features of the instances and Y represents their clean labels. Each sample in D is independently and identically drawn from an implicit data distribution 𝒟, i.e., D ∼ 𝒟. The objective of LTR is to produce a permutation of n items in which the items the user is interested in are ranked ahead. Denote the permutation that optimally coincides with the user's interests as π*, and let π(·) be the permutation produced by a ranking model over n input items. The ranking objective is to minimize the ranking risk [47] measuring the gap between π(·) and π*:

R(π) = E_𝒟[ℓ(π(X), π*)],   (1)

where ℓ is a loss function measuring the distance between π(·) and π*. The objective can be instantiated in three different LTR approaches. We mainly focus on the point-wise and pair-wise approaches widely used in commercial search engines to illustrate our proposed PeerRank framework. PeerRank is also applicable to specific list-wise algorithms that optimize the entire list but are trained in a pair-wise mode, such as LambdaRank [39]. In the real world, however, only the observed data are available, which contain both clean and noisy labels, i.e., D̃ = (X, Ỹ) ∼ 𝒟̃, where Ỹ can be clean or noisy. Following [25], we define the error transition probabilities as

e₊ = Pr(Ỹ = −1 | Y = +1),  e₋ = Pr(Ỹ = +1 | Y = −1),   (2)

where e₊ (e₋) represents the probability that a sample which should be positive (negative) is observed as negative (positive).
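When clean labels are available for a held-out sample, the error transition probabilities e₊ and e₋ defined above can be estimated empirically. A minimal sketch (the function name and the ±1 label encoding are our own conventions):

```python
def error_rates(clean_labels, observed_labels):
    """Empirical estimates of the error transition probabilities:
    e+ = Pr(observed = -1 | clean = +1),
    e- = Pr(observed = +1 | clean = -1), for labels in {+1, -1}."""
    pos = [obs for y, obs in zip(clean_labels, observed_labels) if y == +1]
    neg = [obs for y, obs in zip(clean_labels, observed_labels) if y == -1]
    e_plus = sum(1 for obs in pos if obs == -1) / len(pos)
    e_minus = sum(1 for obs in neg if obs == +1) / len(neg)
    return e_plus, e_minus
```

This is how the rates e₋ = 0.0373, e₊ = 0.5037 reported for the semi-synthetic Yahoo (PBM) dataset in Section IV can be computed, since the expert relevance labels are known there.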
In the field of IR, the observed data for each user is D̃ = {(x_i, c_i)}, where a sample x_i ∈ X contains the features of the user, an item, and the context, and c_i ∈ {0, 1} indicates whether the item is clicked (1) or not (0).

1) Point-wise Approach
The point-wise approach learns a scoring function f(x; Θ) that takes the feature vector of an instance as input, where Θ denotes the model parameters. f predicts the relevance of the current item to the user, and all items are then ranked according to the scores inferred by the learned scoring function. Without loss of generality, we adopt the widely used logistic regression to calculate the probability h(x_i) that the target item i is relevant:

h(x_i) = 1 / (1 + exp(−σ f(x_i))),   (4)

where σ is the shape parameter. A higher h(x_i) yields a higher ranking of the item. The objective of the model is to minimize the ranking risk point-wisely, where ℓ denotes the loss function. As the click c_i serves as the label in the point-wise approach, i.e., ỹ_i = c_i, noise in clicks directly influences the performance of the algorithm.
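The point-wise scoring step can be sketched in a few lines (a toy illustration with scalar scores standing in for f(x; Θ); the item names are hypothetical):

```python
import math

def relevance_prob(f_score, sigma=1.0):
    """Logistic probability h(x_i) that item i is relevant, given the
    raw score f(x_i; Theta) and the shape parameter sigma."""
    return 1.0 / (1.0 + math.exp(-sigma * f_score))

# A higher raw score yields a higher relevance probability,
# and hence a higher position in the ranked list.
scores = {"item_a": 2.0, "item_b": -0.5, "item_c": 0.3}
ranking = sorted(scores, key=lambda i: relevance_prob(scores[i]), reverse=True)
```

Since the logistic function is monotone, ranking by h(x_i) and ranking by f(x_i) produce the same permutation; the probability form matters only for the loss computation.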

2) Pair-wise Approach
The goal of the pair-wise LTR approach is to minimize the number of misclassified item pairs [46], [48]. Any two candidate items, i_1 and i_2, are paired to form a training instance (x_{i1}, x_{i2}, ỹ_i), where ỹ_i indicates the pair-wise preference, and we use Y_i to denote the pair labels. The classifier h(x_{i1}, x_{i2}) outputs the probability of item i_1 being more relevant than item i_2 to the user as

h(x_{i1}, x_{i2}) = 1 / (1 + exp(−σ (f(x_{i1}) − f(x_{i2})))),

where f is a linear [37] or non-linear [7], [39], [46] scoring function as in (3), excepting some algorithms like GreedyOrder [49]. The ranking model tries to minimize the ranking risk pair-wisely. The influence of noisy clicks is more severe in the pair-wise setting than in the point-wise setting, since one noisy click incurs O(n) noisy pair labels.
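The pair classifier and the O(n) noise amplification can be illustrated directly (scalar scores stand in for the scoring-function outputs; a toy sketch, not the paper's implementation):

```python
import math

def pair_prob(f1, f2, sigma=1.0):
    """Probability that item i1 is more relevant than item i2,
    computed from the difference of their scoring-function outputs."""
    return 1.0 / (1.0 + math.exp(-sigma * (f1 - f2)))

def noisy_pairs_from_one_flip(n):
    """One flipped click on a single item corrupts every pair that item
    participates in: with n candidate items, that is n - 1 pair labels."""
    return n - 1
```

This is why a single noisy interaction is far more damaging to pair-wise training than to point-wise training, where it corrupts only one label.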

B. PEER LOSS FUNCTION
Peer Loss, proposed in [25], serves as a robust loss function that deals with noisy training data without prior knowledge of the noise rate.
Definition 1. Given classifier h and instance (x_i, ỹ_i), we randomly sample two additional instances (x_j, ỹ_j) and (x_k, ỹ_k) from D̃ to form a peer sample (x_j, ỹ_k). The peer loss function is defined as

ℓ_peer(h(x_i), ỹ_i) = ℓ(h(x_i), ỹ_i) − ℓ(h(x_j), ỹ_k).

Peer Loss draws inspiration from peer prediction, a method to truthfully elicit information from different sources without ground-truth verification. The noisy labels and the classifier outputs are treated as two sources of information, and the clean labels are treated as the information to elicit. Peer Loss serves as a scoring function from the peer prediction literature [50], [51] to evaluate the quality of an information source, i.e., the model outputs. Intuitively, the second term measures how well the model predicts artificially created noisy labels and consequently "punishes" models that predict noise well.
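Definition 1 can be sketched as a higher-order function, shown here with the 0-1 loss (a minimal illustration; the helper names are ours):

```python
def zero_one(pred, y):
    """0-1 loss between a hard prediction and a label."""
    return 0.0 if pred == y else 1.0

def peer_loss(loss_fn, h, x_i, y_i, x_j, y_k):
    """Peer loss (Definition 1): loss on the observed instance minus the
    loss on the peer sample (features of x_j paired with label y_k)."""
    return loss_fn(h(x_i), y_i) - loss_fn(h(x_j), y_k)
```

Note that the peer term can make the total negative: a model that classifies the observed sample correctly but "fails" on the artificially mismatched peer sample is rewarded.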

IV. PEERRANK METHOD AND ANALYSIS
This section presents our framework, PeerRank, which is robust to unexpected noise in the training data. We first explain the general idea of PeerRank; thereafter, the robustness, effectiveness, and adaptability of PeerRank are proven theoretically.
The structure of our framework is displayed in Fig. 2. A traditional LTR algorithm feeds the feature vector x_i into a model, which might include hidden layers and a prediction layer. Then, the model outputs the ranking score for each item. In traditional LTR, the model is learned by optimizing the loss function ℓ(h(x_i), ỹ_i). In PeerRank, we construct a peer instance (x_peer, ỹ_peer) for each input in the batch training data. Both the features of the batch training data x_i and the features of the generated peer instances x_peer are fed into the network, whose predictions are h(x_i) and h(x_peer), respectively. The model is learned by optimizing the peer loss function as shown in Fig. 2, where the peer instances are used to compute the second term of the loss function.
The generation of peer instances is illustrated in Fig. 3. For each sample (x_i, ỹ_i) used in a point-wise approach, we randomly and uniformly sample another two instances (x_j, ỹ_j) and (x_k, ỹ_k) from the batch data, where the feature vector of the first instance, x_j, and the label of the second instance, ỹ_k, are assembled as the peer instance (x_j, ỹ_k). In the pair-wise approach, for each input (x_{i1}, x_{i2}, ỹ_i), the peer instance (x_{j1}, x_{j2}, ỹ_k) is assembled from two instances, (x_{j1}, x_{j2}, ỹ_j) and (x_{k1}, x_{k2}, ỹ_k), randomly sampled pair-wise from the batch data.
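The point-wise peer-instance construction above amounts to two independent uniform draws per batch element. A minimal sketch (the batch representation as parallel lists is our own simplification):

```python
import random

def make_peer_batch(features, labels, rng):
    """Build one peer instance per batch element: pair the feature vector
    of one uniformly sampled batch member with the label of another,
    independently sampled member (point-wise case of Fig. 3)."""
    n = len(features)
    peer_x = [features[rng.randrange(n)] for _ in range(n)]
    peer_y = [labels[rng.randrange(n)] for _ in range(n)]
    return peer_x, peer_y
```

Because features and labels are drawn independently, a peer label carries no information about its peer features, which is exactly what makes the peer term an "artificial noise" baseline.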

A. PEERRANK LOSS FUNCTIONS 1) Point-wise PeerRank
Referring to the definition of the peer loss function in Section III-B, we perform ERM on (5) as

min_f (1/n) Σ_i [ℓ(h(x_i), ỹ_i) − ℓ(h(x_j), ỹ_k)],

where (x_j, ỹ_j) and (x_k, ỹ_k) are randomly sampled from the observed data D̃, and only the feature vector x_j and the label ỹ_k are used. ℓ can be the 0-1 loss or any surrogate loss function; e.g., ℓ usually refers to the cross-entropy loss in click-through rate (CTR) prediction tasks.
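With cross-entropy as the surrogate loss, the point-wise PeerRank objective over a batch can be sketched as follows (a simplified, non-vectorized illustration; names are ours):

```python
import math

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for one predicted probability and {0,1} label."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def pointwise_peer_risk(probs, labels, peer_probs, peer_labels):
    """Empirical point-wise PeerRank objective: mean cross-entropy on the
    observed samples minus mean cross-entropy on the peer samples."""
    n = len(probs)
    observed = sum(bce(p, y) for p, y in zip(probs, labels)) / n
    peer = sum(bce(p, y) for p, y in zip(peer_probs, peer_labels)) / n
    return observed - peer
```

In a deep learning framework, both terms would be computed from the same forward pass over the concatenation of observed and peer features, so the extra cost is roughly one additional forward/backward per batch.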

2) Pair-wise PeerRank
We unify pair-wise approaches such as RankNet [38] and LambdaRank [39] into a single framework in which each pair loss is weighted by △Goal. The interpretation of △Goal varies across algorithms: for RankNet [38], △Goal(i_1, i_2) equals 1; for LambdaRank [39], △Goal(i_1, i_2) refers to the change of the ranking metric, such as NDCG, after swapping the positions of the two items. Suppose there are m item pairs from a list of n items. Taking (8) and (11) into consideration, we perform ERM on (7) as

min_f (1/m) Σ_i △Goal(i_1, i_2) [ℓ(h(x_{i1}, x_{i2}), ỹ_i) − ℓ(h(x_{j1}, x_{j2}), ỹ_k)],   (13)

where (x_{j1}, x_{j2}) is randomly sampled from the paired feature space and ỹ_k is randomly sampled from the set of pair labels.
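A toy sketch of the pair-wise PeerRank objective follows (applying the △Goal weight uniformly to the observed and peer terms is our reading of the objective; pairs are given as (predicted probability, label) tuples):

```python
def pairwise_peer_risk(obs_pairs, peer_pairs, loss_fn, dgoal):
    """Empirical pair-wise PeerRank objective: DeltaGoal-weighted loss on
    observed pairs minus that on peer pairs.  dgoal maps a pair index to
    its DeltaGoal weight; dgoal(i) == 1.0 for all i recovers RankNet,
    while an NDCG-swap weight recovers LambdaRank."""
    m = len(obs_pairs)
    obs = sum(dgoal(i) * loss_fn(p, y)
              for i, (p, y) in enumerate(obs_pairs)) / m
    peer = sum(dgoal(i) * loss_fn(p, y)
               for i, (p, y) in enumerate(peer_pairs)) / m
    return obs - peer
```

The peer pairs are assembled exactly as in the point-wise case, except that the "feature" unit is a feature pair and the "label" unit is a pair label.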

B. THEORETICAL ANALYSIS
The point-wise PeerRank, inherited from [25], has the properties of robustness, effectiveness, and adaptability. Here, we prove that pair-wise PeerRank also satisfies these properties: robustness (Theorem 1), effectiveness (Theorems 2 and 3), and adaptability. With all these properties, PeerRank is able to withstand errors in the input and derives an optimal or near-optimal model regardless of how the label noise is distributed in the data.

1) Robustness
Robustness of PeerRank refers to its ability to learn a model whose performance is stable despite the existence of noise in the data. We demonstrate that pair-wise PeerRank maintains this resistance to noise.
Theorem 1. Optimizing the PeerRank loss in (13) over the observed data is equivalent to optimizing it over the clean data. That is, (13) is invariant to label noise in D̃ in expectation:

E_D̃[ℓ_peer(h(X), Ỹ)] = (1 − e₊ − e₋) E_D[ℓ_peer(h(X), Y)].

The proof is given in Appendix A. The inequality 0 ≤ e₊ + e₋ < 1 holds because massive errors are unlikely to happen in real life, given the rationality of the vast majority of users. Theorem 1 shows that the PeerRank loss on noisy training data is proportional to that on clean data. Therefore, the model trained with loss function (13) is robust, since it is invariant to noise in the training data.
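The proportionality in Theorem 1 can be sanity-checked exactly for the 0-1 loss. The sketch below restricts itself to symmetric noise e₊ = e₋ = e (our simplification; the theorem covers the general case) and computes the expectation over label flips in closed form:

```python
def zero_one(pred, y):
    """0-1 loss for labels in {+1, -1}."""
    return 0.0 if pred == y else 1.0

def expected_loss_under_flip(pred, y, e_plus, e_minus):
    """Expectation of the 0-1 loss when a clean label y in {+1,-1} is
    flipped with probability e+ (if positive) or e- (if negative)."""
    flip = e_plus if y == +1 else e_minus
    return (1 - flip) * zero_one(pred, y) + flip * zero_one(pred, -y)

def expected_noisy_peer_loss(pred_i, y_i, pred_jk, y_k, e):
    """Expected peer loss under symmetric noise e+ = e- = e: observed
    term minus peer term, each averaged over the flip distribution."""
    return (expected_loss_under_flip(pred_i, y_i, e, e)
            - expected_loss_under_flip(pred_jk, y_k, e, e))
```

For symmetric noise, the constant terms contributed by flipping cancel between the observed and peer parts, leaving exactly (1 − 2e) times the clean peer loss, which matches the (1 − e₊ − e₋) factor in the theorem.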

2) Effectiveness
Effectiveness of pair-wise PeerRank refers to the optimization guarantee that pair-wise PeerRank can produce an optimal or near-optimal model as if performing ERM on the clean data.
Denote the true risk of model f on clean data as

R_D(f) = E_D[ℓ(h(X), Y)].

When trained on clean data that is large enough, the empirical risk converges to the true risk. We now illustrate the connection between PeerRank's loss function and the true risk R_D(f). For convenience of the derivation, we take ℓ(·) as the 0-1 loss, which has the property

ℓ(h(x), y) + ℓ(h(x), −y) = 1.

Theorem 2 is put forward for balanced datasets, where the numbers of positive and negative instances are almost the same; Theorem 3 is put forward for unbalanced datasets. The proofs are given in Appendices B and C, respectively. In the more general case where the dataset is unbalanced, i.e., p ≠ 0.5, the gap between the risk of a model trained by PeerRank and the optimal true risk is still bounded. Denote by R_D(f*) the optimal true risk, where f* = arg min_f R_D(f). Let C₂ = 4|p − 0.5| max_{X_i,X_j} △Goal(i, j) and ℓ(·) = 1(·). The discrepancy between the risk R_D(f*_peer) and R_D(f*) is bounded by C₂, i.e.,

R_D(f*_peer) − R_D(f*) ≤ C₂.

Theorem 2 is a special case of Theorem 3 in which C₂ equals zero. In this special case, PeerRank is strongly guaranteed by Theorem 2 to produce an optimal model. When the dataset is unbalanced, Theorem 3 guarantees that the model optimized empirically using PeerRank is near-optimal in minimizing the risk on clean data.

3) Adaptability
Adaptability of pair-wise PeerRank refers to its ability to adapt to different datasets and different degrees of noise. For datasets that are severely unbalanced, i.e., p distant from 0.5, Theorem 3 provides only a loose bound. In this case, α-weighted PeerRank can be adopted to optimize the ranking risk.

α-weighted PeerRank
When p = Pr(Y = 1) ≠ 0.5, we adopt the α-weighted PeerRank loss function in (13), where a parameter α is added to the second term:

ℓ_α-peer(h(x_i), ỹ_i) = ℓ(h(x_i), ỹ_i) − α · ℓ(h(x_j), ỹ_k).   (15)

The hyper-parameter α can be regarded as a control parameter modulated by the label distribution of the dataset. Typically, when △Goal(i_1, i_2) = 1, the optimal value α* can be calculated as claimed in [25], given in (16). Denote by R_D(f̂*_{α*-peer}) the true risk of the scoring function f̂*_{α*-peer} optimized empirically by (15) with α*. f̂*_{α*-peer} is proven to converge to the optimal scoring function by the lemma below.
Lemma 2. Let ℓ(·) = 1(·). By Hoeffding's inequality, with probability at least 1 − δ, the gap between R_D(f̂*_{α*-peer}) and the optimal true risk is bounded by a term that vanishes as the sample size grows. Lemma 2 ensures that the model trained by α-weighted PeerRank converges to the implicit optimal model. The value of α* depends on the error transition probabilities e₊ and e₋ as in (16), which might not be available in real-world data.
In the following, we discuss the valid range of α; we find that in most IR scenarios, α* ∈ (0, 1). Firstly, since we work in IR settings, the number of negative (non-clicked) samples is often much larger than the number of positive (clicked) ones. Thus, we consider δ_p and δ_p̃ in (16) to have the same sign. Also, 1 − e₊ − e₋ is positive if we assume the majority of users are rational and give clean labels. Then, we have α* < 1. Next, we show that α* > 0 holds, where #e₊ (#e₋) stands for the number of samples that flip from positive (negative) to negative (positive), and #Y₊ (#Y₋) stands for the number of positive (negative) samples in the clean data. The condition indicated by the inequalities in the last line is easy to satisfy on large-scale datasets. As #Y₋ is extremely large, #e₋ would have to be very large in the real world to reverse the inequality, which is a rare scenario in IR since most user behaviors are sensible; mistakes on a large scale can hardly happen. On the other hand, #e₊ might be relatively large in IR, since users often overlook relevant items that are not on the top pages. This is also validated by our experiment detailed in Section V: in the semi-synthetic Yahoo dataset using PBM as the click model with a 0.05 noise rate, e₋ = 0.0373 and e₊ = 0.5037, i.e., e₊ ≫ e₋.
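The α-weighted loss in (15) is a one-line modification of the plain peer loss; a minimal sketch (names are ours, and α = 1 recovers the unweighted case):

```python
def alpha_peer_loss(loss_fn, obs_pred, obs_label, peer_pred, peer_label, alpha):
    """Alpha-weighted peer loss from (15): the peer term is scaled by
    alpha to account for class imbalance; alpha = 1 recovers the plain
    peer loss and alpha = 0 recovers ordinary ERM on observed labels."""
    return loss_fn(obs_pred, obs_label) - alpha * loss_fn(peer_pred, peer_label)
```

In practice, since α* depends on the unknown e₊ and e₋, α would be tuned on a validation set within the (0, 1) range derived above.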

V. EXPERIMENTS
In this section, we conduct experiments based on two applications, CTR prediction and web search ranking, which are common scenarios of point-wise and pair-wise LTR approaches, respectively, to evaluate the performance of PeerRank. We mainly focus on answering the following research questions (RQs).
• RQ1: Does PeerRank easily couple with SOTA point-wise and pair-wise LTR approaches and make significant improvements?
• RQ2: Does PeerRank achieve better performance than other SOTA de-noising methods?
• RQ3: How does the noise rate affect the performance of PeerRank?
• RQ4: How does PeerRank perform compared to de-biasing methods?
• RQ5: Does PeerRank work on data involving noise caused by biases?

1) Datasets
We start with the introduction of datasets and pre-processing details.
CTR Prediction. We use three large-scale, publicly available real-world datasets for CTR prediction to evaluate the performance of PeerRank on point-wise approaches. These datasets naturally contain noisy clicks.
• We follow UBR [52] to process the datasets.
Web Search Ranking. Our experimental data for web search ranking are derived from two widely used expert-annotated LTR datasets. Both datasets supply 5-level relevance labels (0-4).
• Yahoo! LTR set 1 contains 29,921 queries and 701k documents, with 700 features extracted from each query-document pair.
• Istella-S is composed of 33,018 queries and 3,408k documents, where each query-document pair has 220 features.
To simulate the real-world scenario where only implicit click feedback is available, we first generate click behaviors following the user click models PBM [53] and CCM [54]. The probability of document i being relevant to the user is calculated as

P(r_i = 1) = (2^{s_i} − 1) / (2^{s_max} − 1),   (17)

where s_i is the expert-annotated relevance score with maximum value s_max. We then add noise manually to the generated clicks by randomly flipping the click labels at the noise rate ϵ. In our experiments, we choose ϵ = 0.05. We also run extra experiments under the setting of [55], where noise is added by transforming (17) to P(r_i = 1) = ϵ + (1 − ϵ)(2^{s_i} − 1)/(2^{s_max} − 1), with ϵ = 0.1 following [55]. This introduces feature-dependent label noise, which does not conform to the assumption in Peer Loss [25]; nevertheless, we conduct experiments in this setting and empirically show that PeerRank still takes effect. More details can be found in Section V-F.
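The label-generation step above can be sketched as follows (only the relevance-sampling and flipping steps from (17); the examination modelling of PBM/CCM is deliberately omitted, and the function name is ours):

```python
import random

def simulate_click(s_i, s_max, epsilon, rng):
    """Sample a click from the graded relevance score s_i via
    P(r_i = 1) = (2^s_i - 1)/(2^s_max - 1), then flip the click
    label with probability epsilon to inject label noise."""
    p_rel = (2 ** s_i - 1) / (2 ** s_max - 1)
    click = 1 if rng.random() < p_rel else 0
    if rng.random() < epsilon:
        click = 1 - click
    return click
```

With ϵ = 0, a document with the minimum score never produces a click and one with the maximum score always does; the flip with rate ϵ is exactly what creates the noisy labels the experiments study.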

2) Evaluation Metric
We evaluate the performance of point-wise PeerRank with the area under the ROC curve (AUC), as it is a universal criterion for judging the merits of CTR prediction. For pair-wise PeerRank, we report Mean Average Precision (MAP; document i is regarded as relevant if s_i ≥ 1 [55]) and Normalized Discounted Cumulative Gain (NDCG@10, abbreviated as NDCG), calculated with the original 5-level relevance labels.
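For reference, NDCG@k over graded labels can be computed as below (a standard formulation with the 2^rel − 1 gain and log2 discount; a sketch, not the evaluation code used in the paper):

```python
import math

def ndcg_at_k(ranked_labels, k=10):
    """NDCG@k over 5-level relevance labels (0-4): DCG of the produced
    ranking divided by the DCG of the ideal (label-sorted) ranking."""
    def dcg(labels):
        return sum((2 ** rel - 1) / math.log2(pos + 2)
                   for pos, rel in enumerate(labels[:k]))
    ideal = dcg(sorted(ranked_labels, reverse=True))
    return dcg(ranked_labels) / ideal if ideal > 0 else 0.0
```

A value of 1.0 indicates a perfect ranking of the list; queries with no relevant documents are conventionally scored 0 here.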

3) Base Models
Multiple algorithms introduced in Section II serve as the base models of PeerRank. We name the model PeerX for an algorithm X coupled with PeerRank. For the CTR prediction task, we couple PeerRank with 13 point-wise LTR algorithms as in Table 1. We adopt the same hyper-parameter settings for the models and their peer versions as in [52]. The exclusive parameters for every model are tuned according to the best performance on the validation set. Specifically, we set an additional 2-layer cross net for DCN [28] and DCN-M [6]. Values marked with * are taken from [52]. All results of PeerRank are significant with p-value < 0.05 compared with the base model.
The stacked mode is chosen for DCN-M as it performs better than the parallel mode. We set a 3-layer compressed interaction network with 7 vectors per layer for xDeepFM [29]. We set 3 interacting layers with 2 heads per layer for AutoInt [5], and the dimension of Q, K, V is 6. For the web search ranking task, we combine PeerRank with 4 pair-wise algorithms as in Table 2. The initial learning rate is searched from {0.0005, 0.001, 0.005, 0.01}, and the batch size is set to 256. All multi-layer perceptron models are configured with a 4-layer network of [512, 256, 128, 1].

4) State-of-the-art De-noising Methods
To demonstrate the superiority of PeerRank in de-noising, we compare our framework with several classic and widely recognized de-noising methods introduced in Section II, as in Table 3. Two types of BS [13], "soft" and "hard", are implemented. Table 1 shows the performance of the 13 point-wise base models on the 3 real-world datasets ("base" columns) and the performance of PeerRank on these base models ("+peer" columns). Since the experimental settings are identical, some results in Table 1 marked with * are directly taken from [52]. Table 2 shows the performance of the 4 pair-wise base models on the 4 semi-synthetic datasets and the performance of PeerRank on these base models.

B. OVERALL PERFORMANCE (RQ1)
Three properties of PeerRank can be revealed from Table 1 and Table 2. (i) PeerRank can easily couple with base models and significantly improves their performance, particularly on the Alipay dataset (improvements of 0.20% to 14.4%). In the web search ranking experiments, where noise is added only to the training data and the clean relevance labels are used for evaluation, the superior performance of PeerRank in Table 2 demonstrates that it is invariant to label noise in the training data and achieves better results in both MAP and NDCG over the base models. (ii) In the CTR prediction experiments, the noise distribution remains the same in the training and test sets, since they are segmented from the same observed real-world dataset. PeerRank is insensitive to noise in the test data and beats the base models, verifying its robustness. (iii) PeerRank improves all the base models on multiple datasets. The improvement of PeerRank coupled with both linear [37] and non-linear [7], [38], [39] models in web search ranking also proves its adaptability. We also investigate why PeerRank takes effect. We compare PeerRankNet with RankNet on Yahoo (PBM) and plot NDCG during training in Fig. 4. We find that the performance of RankNet first climbs but drifts downward later on, whereas the performance of PeerRankNet remains at a high level as training goes on. This is probably because, with no de-noising method applied, RankNet is sensitive to noise in the data and fits the outliers, while PeerRank prevents over-fitting to such noise and behaves well.

C. COMPARISON WITH DE-NOISING MODELS (RQ2)
In this experiment, we fix DIEN, regarded as the SOTA model from Alibaba Group, as the base model for the point-wise approach, and RankNet, which has stable performance, for the pair-wise approach. We compare PeerRank with the other de-noising methods introduced in Section V-A4 on these two models. We test on the Alipay dataset for CTR prediction and the Yahoo semi-synthetic dataset with PBM for web search ranking, as displayed in Table 3. We specify the exclusive hyper-parameters for each method in the "Par." columns, with which those methods achieve their best performance.
From Table 3, we can observe that: (i) In both the point-wise and pair-wise settings, PeerRank performs the best among all SOTA de-noising approaches, which testifies that it is effective when training with noisy labels. (ii) Not every de-noising method takes effect; e.g., Reweight [2] performs poorly. This is because it calculates the probability of labels being noisy based on expert-annotated relevance labels, which might not be available in many IR scenarios where only a binary click label is obtainable. (iii) We find it difficult or impossible for some de-noising methods to apply to many scenarios. For example, the CT [20] algorithm demands prior knowledge of the noise rate, which is not available in many real-world datasets, such as the point-wise cases in our experiment. Reweight [2] assumes the features of an item are independent of each other, which makes it inapplicable to methods such as DIEN [36], where the sequential features are correlated. GCE [10] and TCE [12] are restricted to cross-entropy only, hindering them from extending to ranking models like SVM-Rank. PeerRank is not subject to these limitations: it requires no prior knowledge of the noise rate and can work with click labels. Furthermore, as displayed in Table 1 and Table 2, PeerRank is easy to couple with many loss functions despite the complexity of the base models.

D. EFFECT OF NOISE RATE (RQ3)
We conduct extra experiments on Yahoo (PBM) with other noise rates, i.e., ϵ = 0.01, 0.03, to explore how PeerRank works under different noise rates. The results are shown in Table 4. Overall, the performance of both RankNet and PeerRankNet drops as the noise rate increases. Still, PeerRankNet achieves significantly better performance than RankNet and degrades more slowly, as reflected in the rise of the improvement recorded in the "Impv." column, which refers to the relative NDCG improvement of PeerRankNet over RankNet. This indicates that PeerRank adapts well to different noise rates and helps to alleviate noisy-label issues.

E. COMPARISON WITH DE-BIASING MODELS (RQ4)
Since we are working with click data, we conduct extra experiments on Yahoo (PBM) to compare with some de-biasing methods, IPW [56], DLA [57], and PairDebias [58], which deal with the click noise caused by biases, e.g., position bias, as shown in Table 5. To make the comparison fair and reasonable, we adjust IPW and DLA to the pair-wise manner. We find that the model with IPW is inferior to RankNet. This is because IPW relies heavily on the generated click data to learn user examination propensity weights in order to learn a robust and effective ranking model. As our click noise simulates real noise that is not caused only by position bias, the IPW approach might over-fit these noises and perform poorly. The other two SOTA methods improve RankNet through their intelligent de-biasing, via either dual learning or pair-wise updating, but both consider only the bias correlated with position while neglecting other noise factors such as user randomness or system failure, so they do not perform as well as PeerRank.

F. PERFORMANCE ON DATA WITH BIASES (RQ5)
We also conduct extra experiments on semi-synthetic datasets following the setting in [55] to prove that PeerRank also works in the case where the data involve biases. The results are presented in Table 6.
The observations from Table 6 are similar to those from Table 2. This is partly attributed to the data distributions of these datasets being similar according to the statistics in Table 7. The distributions differ slightly because the two processing methods set up different noise distributions. Nevertheless, the statistics reflect no apparent difference between the data distributions on the Yahoo datasets, which verifies the reasonableness of our setting ϵ = 0.05 when flipping the clicks.
As bias can be regarded as one kind of noise, dealing with bias should, in theory, fall within the scope of dealing with noise. Since PeerRank is a general framework for dealing with noisy data and is able to handle various data distributions, it is reasonable that PeerRank behaves well under noise caused by typical ranking biases.

VI. CONCLUSION
This paper proposes an easy-to-extend framework, PeerRank, for LTR from noisy data. Specifically, PeerRank randomly samples feature vectors and labels to construct a peer sample for each training instance. We propose loss functions in the PeerRank framework for both point-wise and pair-wise approaches. We theoretically prove that PeerRank inherits the properties of robustness, effectiveness, and adaptability. Extensive experiments are conducted on two real-world applications in LTR. Results on three real-world datasets for the CTR prediction task and four semi-synthetic datasets for web search ranking show the superiority of our work. We leave the investigation of other user click models for generating click labels in the pair-wise experiments to future studies. More experiments combining de-noising and de-biasing methods also remain to be conducted.