Multi-task Fine-tuning for Passage Re-ranking Using BM25 and Pseudo Relevance Feedback

Passage re-ranking is a machine learning task that estimates relevance scores between a given query and candidate passages. Passage re-ranking models have traditionally used keyword features based on the lexical similarity between queries and passages. However, such approaches have a limitation: it is difficult to capture semantic and contextual features beyond word-matching information. Recently, several studies based on neural pre-trained language models such as BERT have overcome the limitations of traditional keyword-based models and shown significant performance improvements. Such ranking models capture the contextual features of queries and documents better than traditional keyword-based methods. However, these deep learning-based models require large amounts of training data. Such data is usually labeled manually at high cost, so using it efficiently is an important issue. This paper proposes a fine-tuning method for efficient training of the neural re-ranking model. The proposed model achieves a data augmentation effect by learning the ranking and MLM tasks simultaneously during the fine-tuning stage. For the MLM task, different parts of a passage are masked at each training epoch. Even if only one query-passage pair is given, the model is exposed to diverse cases because the passage is dynamically masked in different ways. In addition, the model learns the probability distribution of term importance. We calculate term importance weights with two novel methods using BM25 and pseudo relevance feedback. Terms are sampled and masked according to the importance weights. By executing the MLM task, the ranking model learns representations based on the term weight distribution. The method based on pseudo relevance feedback enables the neural ranking model to form representations according to feedback from an initial search stage.
The proposed model is trained with data from the MS MARCO leaderboard for the re-ranking task. Our model achieves the state-of-the-art MRR@10 score in the leaderboard except for the ensemble-based method. In addition, our model demonstrates significant performance in three different evaluation metrics: MRR@10, Mean Rank, and Hit@(5,10,20,50).

FIGURE 1: Architecture of the proposed model
placing the feature-based models in the re-ranking task [5,6]. In particular, rankers based on pre-trained language models demonstrate significant performance improvements [7,8,9]. Such approaches utilize semantic and contextual information between the query and passages without depending on word-matching features.
Nevertheless, large amounts of data are needed to train the pre-trained models for passage re-ranking. In practice, the relevance between a query and a passage in information retrieval has to be labeled manually, which is a costly process. We propose a new fine-tuning method based on the masked language model (MLM) task that is typically used in pre-trained language models. The proposed multitask-learning-based fine-tuning method for the ranking model jointly trains the relevance prediction and MLM tasks. We employ the MLM task as a self-supervised learning approach for passage re-ranking that needs no additional training data; the MLM task is usually trained on a large general corpus when pre-training language models [10,11]. There are recent reports of training MLM tasks on pre-trained models to enhance language understanding for specific domains or tasks [12]. Our multitask-based fine-tuning method differs from these methods in that it does not need much training time for the additional task. In addition, we show empirically that MLM training in the fine-tuning stage has a data augmentation effect.
One of the recent changes in re-ranking models is that word embedding methods have replaced handcrafted features as model input [5,6,9]. The advantage is that the models can learn deep representations beyond exact term matching. However, those methods take the query and the passage as a concatenated sequence without considering the relation between them. This paper suggests methods that train the model on information related to the ranking task while sustaining the representation ability of deep learning-based word embeddings. We devised two novel methods that calculate the importance weight for terms in the passage. The model learns the probability distribution of these weights. For the MLM task during the fine-tuning stage, the proposed method masks terms with masking probabilities based on the importance weights. This differs from previous studies with pre-trained language models, which mask all terms with equal probability [10,11]. To calculate the term importance scores, we refer to two statistical models frequently used in information retrieval: BM25 and pseudo relevance feedback. BM25 is a Poisson distribution-based model representing lexical similarities between the query and the passages [13]. Pseudo relevance feedback is a query expansion method that uses the results of a previous ranking stage as feedback, treating the top-ranked passages as relevant [14].
Our model obtained a state-of-the-art MRR@10 score on the MS MARCO passage re-ranking task leaderboard [15], except for one ensemble-based model. In addition, the proposed model demonstrated a 3.4%p higher MRR@10 score and an 8.3 higher mean rank than the baseline. The model achieved about 5%p higher scores at Hit@(5,10,20,50). The improvement was significant in all three metrics, which use different evaluation standards. Our model also has the advantage of efficient processing time in both the training and testing stages, even though a self-supervised approach is used. This paper is an expanded version of our previous work [16] on neural re-ranking models. In this study, we calculated term importance scores using pseudo relevance feedback as well as BM25. Our model was evaluated by MRR@10, Mean Rank, and Hit@(5,10,20,50), and it achieved significant improvements. The expanded contributions in this version are as follows:
• Term importance scores are calculated using a method based on pseudo relevance feedback. This approach is especially well suited to information retrieval tasks where user feedback is important.
• Through additional experiments, we show that our multitask fine-tuning is more effective than existing approaches when the model needs to understand specific domains or tasks.
• Our model with the new method achieved significant performance in diverse metrics with different evaluation standards.

A. PASSAGE RE-RANKING MODELS
Passage re-ranking involves finding passages that are relevant to a given query among the candidates from an initial ranking stage. The aim of re-ranking is to balance ranking performance and efficiency. Traditional re-ranking models mainly depended on machine learning-based methods known as "learning to rank." Such methods used handcrafted features designed by humans. Ranking SVM [1] used a support vector machine to solve ordinal regression problems, in which a ranking of the labels has to be constructed according to their values. RankNet [2] was a ranking model based on a two-layer neural network. The neural network architecture was also used in LambdaRank [3], which was trained with an implicit loss function because it was hard to optimize ranking models directly for the evaluation metrics. LambdaMART [4] computed an optimal combination of multiple ranking models based on LambdaRank and gradient boosted decision trees.
Although deep learning methods reduced the need for feature engineering, several works tried to leverage useful information about query-passage relevance while employing the powerful learning ability of the neural networks. DUET [5] was a hybrid neural ranking model including layers for learning query-passage exact matching and other layers for semantic matching. The term interaction matrix between a query and a passage was entered into exact matching layers and term embeddings of input sequence into semantic matching layers. KNRM [6] also used query-passage matching features. In this model, a matrix of cosine similarity scores between query embedding and passage embedding was computed. The similarity matrix was converted to soft match features and it was entered into a linear layer for calculating ranking scores.

B. PRE-TRAINED LANGUAGE MODELS
In natural language processing, the goal of pre-trained language models [10,11,17] is to learn distributed representations of words. For this purpose, training proceeds in two stages: pre-training and fine-tuning. In the pre-training stage, the models learn general language representations from a massive corpus using self-supervised methods such as the masked language model (MLM) and next sentence prediction (NSP). In the fine-tuning stage, the parameters of the pre-trained models are updated again for a specific target task. It is a transfer learning-based approach to enhance the ability for a downstream task while leveraging language informa- [9]. This work was different from our proposed model in that it needed additional pre-training steps with the MLM and NSP tasks on the training corpus. Boualili et al. added new marking tokens to the input sequences in the BERT model to mark exactly matched words between a query and a passage [18]. This method was similar to the proposed one because it enabled the model to learn which input tokens are more important for ranking. However, our model estimates the term importance in a passage using a probabilistic model instead of exact matching features. We utilize BM25, a probabilistic model widely used in information retrieval. The probability that a specific term appears in the passage is normalized and used as the importance score. A more detailed explanation of BM25, including a formula, is given later in this section.
The cross-encoder method and the dual-encoder method are two pre-trained language model architectures for ranking. In the cross-encoder method, a query and a passage are concatenated as one input sequence, and full cross attention is computed between the token representations of the query and the passage. This approach generally achieves strong ranking performance, although it depends on a costly full attention operation. In the dual-encoder method, the query and the passage are embedded separately, and the similarity between the two embedding vectors is computed. This method can process the query quickly. However, the ranking performance can suffer because the interaction between the query and the passage is not learned. For the re-ranking task, the cross-encoder model has mainly been used because of its performance [8,16,19]. The model in this paper also has an architecture based on the cross-encoder method. The dual-encoder method is used for the full-ranking task, which processes a large number of passages in the corpus. Recently, there have been several works on efficient ranking using the dual-encoder method [20,21]. However, we do not cover that architecture in detail in this paper.

C. BM25 AND PSEUDO RELEVANCE FEEDBACK
This paper proposes a term importance weighting method to fine-tune the ranking model. We adopt statistical methods traditionally used in information retrieval systems to decide whether each term is essential. BM25 is a probabilistic model that computes the relevance between a given query and a passage. It is based on the Poisson distribution, which assumes that two events cannot occur at the same time. BM25 treats the appearance of a specific query term in a passage as an event.
In this method, query terms are evaluated by the TF-IDF method. The probability that a specific query term appears in a passage increases with the passage length. Therefore, BM25 normalizes the TF-IDF scores of query terms by the passage length. Eq. 1 gives the BM25 score. In the equation, a query $Q$ with terms $\{t_1, t_2, \ldots, t_i, \ldots, t_n\}$ and a passage $P$ are given. $TF(t_i, P)$ is the term frequency of query term $t_i$ in the passage. $IDF(t_i)$ is the inverse document frequency of the term $t_i$. $L(P)$ is the length of passage $P$ in terms divided by the average passage length. Lastly, $k$ and $b$ are hyperparameters of the BM25 scoring model. See [22] for a more detailed explanation of the BM25 scheme.
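The quantities defined for the BM25 score can be combined as in the following minimal Python sketch. It uses the widely published BM25 form with $L(P)$ as the length ratio; the smoothed IDF variant, the function names, and the default values of `k` and `b` are assumptions chosen for illustration and are not taken from this paper.

```python
import math
from collections import Counter


def idf(term, doc_freq, num_passages):
    """Smoothed inverse document frequency (an assumed variant, not the paper's)."""
    df = doc_freq.get(term, 0)
    return math.log(1.0 + (num_passages - df + 0.5) / (df + 0.5))


def bm25(query_terms, passage_terms, doc_freq, num_passages, avg_len, k=1.2, b=0.75):
    """Sum the length-normalized TF-IDF contribution of every query term."""
    tf = Counter(passage_terms)
    rel_len = len(passage_terms) / avg_len  # L(P): passage length / average length
    score = 0.0
    for term in query_terms:
        f = tf[term]
        if f == 0:
            continue  # a query term absent from the passage contributes nothing
        norm_tf = f * (k + 1) / (f + k * (1 - b + b * rel_len))
        score += idf(term, doc_freq, num_passages) * norm_tf
    return score
```

With the same term frequency, a longer passage receives a lower score because `rel_len` grows in the denominator, which is the length normalization described above.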
Pseudo relevance feedback is a method for query expansion that utilizes the search results from an initial query. In this term distribution-based method, passage terms that satisfy the following condition are rated highly for query expansion: the terms appear frequently in passages relevant to the initial query and rarely in non-relevant passages. There are diverse criteria for selecting relevant passages in information retrieval. In pseudo relevance feedback, passages ranked high in the initial search results are judged as relevant. Eq. 2 is used to estimate the passage term weight [23], where $w_t$ is the relevance weight of a term $t$,
$p$ denotes the probability that the term $t$ is assigned within the set of relevant sentences for the initial query, $q$ the probability that the term is assigned within the set of non-relevant sentences, $r$ the number of relevant sentences with term $t$, $R$ the total number of relevant sentences, $s$ the number of non-relevant sentences with term $t$, and $S$ the total number of non-relevant sentences.
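The symbols defined above match the classical Robertson-Sparck Jones relevance weight, $w_t = \log \frac{p(1-q)}{q(1-p)}$ with $p$ estimated from $r/R$ and $q$ from $s/S$. The sketch below assumes that form; the 0.5 smoothing constant and the function name `rsj_weight` are illustrative assumptions, and the exact estimator used in [23] may differ.

```python
import math


def rsj_weight(r, R, s, S, eps=0.5):
    """Relevance weight of a term from relevant / non-relevant counts.

    r: relevant items containing the term, R: all relevant items,
    s: non-relevant items containing the term, S: all non-relevant items.
    eps is an assumed smoothing constant that avoids log(0) and division by zero.
    """
    p = (r + eps) / (R + 2 * eps)  # smoothed estimate of r / R
    q = (s + eps) / (S + 2 * eps)  # smoothed estimate of s / S
    return math.log(p * (1 - q) / (q * (1 - p)))
```

A term that occurs in every relevant item and in no non-relevant item gets a large positive weight, while a term that is equally frequent in both sets gets a weight of zero.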

III. PROPOSED METHOD
In our work, a neural ranker based on the pre-trained language model learns the passage ranking and MLM tasks simultaneously, as shown in Figure 1. In the MLM task, the proposed model is exposed to sequences with masked tokens and learns the target corpus by predicting the original term tokens. Pre-trained language models tokenize each term in the input sequence into separate tokens, and we refer to these tokens as "term tokens" in this study. When the ranking model is trained with the MLM task during the fine-tuning stage, the model estimates the relevance to a query using masked passages, which are created by selecting different tokens to mask in each epoch. This allows the model to learn from more diverse text sequences without additional data. Thus, such a multitask-learning method emulates the effects of data augmentation.
Figure 3 is an example of data augmentation using the proposed masking method. In the original ranking task, the model has to predict the relevance between a query "school location?" and a passage "next to the police office." In our method, the model is trained to predict relevance even if some parts of the passage are masked. The positions of the masked tokens differ at each epoch, so the model is exposed to diverse sequences.
In the MLM task, masking each term token with equal probability is the common approach for existing pre-trained language models. However, we suggest a new method that masks the term tokens according to their importance weights. This method masks the less important terms in the passage more frequently instead of masking each term token uniformly. We assume that masking contextually important terms too often could undermine training because these terms commonly play key roles in retrieving relevant passages for the given queries. This new weighted-masking method is devised to improve the model's ranking performance. Given the terms $\{t_1, t_2, \ldots, t_i, \ldots, t_n\}$ in a passage $P$, we first calculate the BM25 score for the $i$-th term $t_i$ and use it to compute the importance score $score(t_i)$.
Here, $t \in P$ denotes any term in the given passage. About 15% of the term tokens in each input passage are sampled for masking. The $i$-th term token $t_i$ is sampled with the probability $P_{mask}(t_i)$ shown in Eq. 4. The probability that a term token is sampled increases as its importance score decreases, according to the following equation.
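The weighted, per-epoch masking described above can be sketched as follows. The text fixes only two properties: about 15% of the term tokens are masked, and the sampling probability grows as the importance score shrinks. The concrete normalization (each BM25 score divided by the passage total) and the form "proportional to one minus the normalized score" are assumptions made for this sketch, as are all function names and the seeding scheme.

```python
import random

MASK = "[MASK]"


def mask_probabilities(bm25_scores):
    """Turn per-token BM25 scores into sampling probabilities (assumed form)."""
    n = len(bm25_scores)
    if n == 0:
        return []
    total = sum(bm25_scores)
    if total == 0:
        return [1.0 / n] * n
    normalized = [s / total for s in bm25_scores]   # assumed importance score
    inverted = [1.0 - x for x in normalized]        # less important -> larger weight
    z = sum(inverted)
    if z == 0:
        return [1.0 / n] * n
    return [x / z for x in inverted]


def sample_mask_positions(probs, ratio=0.15, rng=random):
    """Draw about ratio * n distinct positions, weighted by probs."""
    n = len(probs)
    k = min(n, max(1, round(ratio * n)))
    remaining = list(range(n))
    weights = list(probs)
    chosen = []
    for _ in range(k):
        if sum(weights) <= 0:
            break
        idx = rng.choices(range(len(remaining)), weights=weights)[0]
        chosen.append(remaining.pop(idx))
        weights.pop(idx)
    return sorted(chosen)


def mask_passage(tokens, bm25_scores, epoch, seed=0):
    """Mask a passage; a different epoch gives a different random draw."""
    rng = random.Random(seed * 100003 + epoch)
    positions = set(sample_mask_positions(mask_probabilities(bm25_scores), rng=rng))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]
```

Calling `mask_passage` with the same tokens and scores but a different `epoch` re-draws the masked positions, which is how a single query-passage pair yields several distinct training sequences.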
In addition, we use an embedding vector that represents whether each sentence plays a core role in the passage. To calculate the importance weight of a sentence, we use the BM25 scores already calculated for the terms in the sentence and the position of the sentence in the passage. The use of sentence position is inspired by related studies. Ko et al. proposed a method that gives more weight to sentences at the beginning of a passage for snippet generation in information retrieval [23]. Jeong et al. scored the passages at the beginning higher when selecting salient sentences in a summarization system [24]. For a passage $P$ with $n$ sentences $\{s_1, s_2, \ldots, s_n\}$, the importance score of the $k$-th sentence with $m$ terms $\{t_1, t_2, \ldots, t_i, \ldots, t_m\}$ is computed as shown in Eq. 5 and Eq. 6. The importance score is calculated by adding the sentence location weight to the sum of the BM25 scores of the words in the sentence, normalized via min-max normalization. The sentence location weight is higher when the sentence is at the beginning of the passage. The sentences are then sorted by their scores, and about 10% of the sentences with the highest scores are selected as the core sentences. The ratio of core sentences is a hyperparameter, and we chose 10% because it gave the best experimental result.
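The core-sentence selection can be illustrated with the sketch below. The text states that the sentence score adds a location weight to min-max normalized BM25 sums and that about 10% of the sentences are selected, but it does not give the location weight itself. The linear decay `1 - k / n`, the choice to normalize the per-sentence sums across the passage, and the function names are therefore hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def core_sentences(sentence_term_scores, ratio=0.10):
    """Return (flags, scores): flags[k] is 1 if sentence k is a core sentence.

    sentence_term_scores[k] holds the BM25 scores of the terms in sentence k.
    """
    n = len(sentence_term_scores)
    bm25_part = min_max([sum(term_scores) for term_scores in sentence_term_scores])
    # Hypothetical location weight: highest for the first sentence, decaying linearly.
    location = [1.0 - k / n for k in range(n)]
    scores = [a + b for a, b in zip(bm25_part, location)]
    num_core = max(1, round(ratio * n))
    ranked = sorted(range(n), key=lambda k: scores[k], reverse=True)
    core = set(ranked[:num_core])
    return [1 if k in core else 0 for k in range(n)], scores
```

The 0/1 flags could then index a two-row embedding table, so that every token of a core sentence receives the "core" vector in addition to its usual input embeddings.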
where $t_i$ denotes the $i$-th term in a passage, $s_k$ the $k$-th sentence in the passage, and $s \in P$ any sentence in the passage. Our BM25-based method mainly depends on term matching information. We suggest an additional term weighting method that leverages the results from the initial search, based on pseudo relevance feedback. In Figure 4, the terms "capital" and "Korea" have the same importance score when evaluated by inverse document frequency. However, when the search results are considered, the term "Korea" has a higher probability of appearing in the two relevant passages than the term "capital." In the non-relevant passages, the probability for the term "Korea" is zero. Therefore, it is possible to conclude that "Korea" is more important for the ranking task than "capital." In the passage re-ranking task, candidate passages are provided for each query. Our method treats the candidate passages as the results from the initial search in pseudo relevance feedback. We use high-ranked passages among the candidates as relevant cases and low-ranked passages as non-relevant cases. We refer to [23] to calculate term importance weights based on pseudo relevance feedback.
Given a query $Q$ and a passage $P$ with terms $\{t_1, t_2, \ldots, t_i, \ldots, t_n\}$, the importance weight $PRF(Q, t_i)$ for the $i$-th term $t_i$ is given by the following equation. Among the candidate passages for $Q$, we consider passages ranked from 1st to $k$-th as relevant and those ranked below $k$-th as non-relevant. For the input sequence, 15% of the term tokens are sampled for masking. The probability that each token is sampled is proportional to the importance score $score_{PRF}(Q, t_i)$.
where $r$ denotes the number of relevant passages with term $t_i$, $R$ the total number of relevant passages, $s$ the number of non-relevant passages with term $t_i$, and $S$ the total number of non-relevant passages.

$$score_{PRF}(Q, t_i) = \frac{\mathrm{softmax}(BM25(t_i)) + \mathrm{softmax}(PRF(Q, t_i))}{2} \tag{8}$$
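As a small worked example of Eq. 8, the sketch below averages two softmax distributions over the terms of one passage. It assumes that each softmax is taken over the terms of the passage; the function names and the sample numbers are illustrative only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]


def combined_importance(bm25_scores, prf_weights):
    """Eq. 8: mean of the softmaxed BM25 scores and softmaxed PRF weights."""
    a = softmax(bm25_scores)
    b = softmax(prf_weights)
    return [(x + y) / 2 for x, y in zip(a, b)]


# Two terms with equal BM25 scores, where only the second has a high PRF weight.
scores = combined_importance([1.0, 1.0], [0.0, 2.0])
```

Because each softmax sums to one, the averaged scores also sum to one, so they can be used directly as sampling weights. In this example the second term ends up with the larger combined score even though the two BM25 scores are tied.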

C. SELF-SUPERVISED FINE-TUNING WITH MLM
Our architecture includes a pre-trained language model as a shared layer, a regression layer $W_{RP} \in \mathbb{R}^{1 \times H}$ for the ranking task, and another layer $W_{MLM} \in \mathbb{R}^{V \times H}$ for the MLM task, as described in Figure 1. We used RoBERTa [11] as the pre-trained language model in the experiments, but other models can also be used in our architecture. Pre-trained language models represent the input sequence as hidden vectors, and the vocabulary size of the language model needs to be defined before training. In our notation, $V$ is the vocabulary size and $H$ is the dimension of the hidden vector. For the input sequence to the model, a query and a passage are concatenated following the input format of RoBERTa, with a [CLS] token at the head and [SEP] tokens at the ends. The hidden vectors from the shared layer are fed into the ranking layer and the MLM layer separately. Each layer calculates its training loss, and the two task losses are combined into the total loss $L_{total}$ as in Eq. 9. In the equation, a hyperparameter $\lambda$ controls the impact of the MLM task on training the ranking model.
In information retrieval systems, loss functions can be categorized into three types: pointwise, pairwise, and listwise. The proposed model is trained with the listwise approach, which optimizes a ranking of several input passages for one given query. In this method, a query and several candidate passages are given, and each query-passage pair is fed into the model as a concatenated sequence. With $H_{CLS}$ denoting the hidden vector of the [CLS] token of the sequence, the cross-entropy loss is minimized according to Eq. 10 for each query-passage pair $x$. In this notation, $X$ is the set of pairs with the relevance label $y \in \{0, 1\}$. The cross-entropy loss increases as the difference between the actual label and the predicted one increases. Therefore, the model parameters are updated by back-propagation to minimize the loss.
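The two losses described in this section can be sketched in plain Python as follows. The sketch assumes a softmax cross-entropy over the candidate group of one query, which is one common listwise formulation, and a weighted sum of the form "ranking loss plus $\lambda$ times MLM loss". Neither form is confirmed by the text, and the function names are hypothetical.

```python
import math


def listwise_cross_entropy(scores, positive_index):
    """Softmax cross-entropy over one query's candidate scores (assumed form).

    scores: one ranking score per candidate passage.
    positive_index: position of the relevant passage in scores.
    """
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def total_loss(ranking_loss, mlm_loss, lam=1.0):
    """Assumed combination: the MLM loss is scaled by lam and added."""
    return ranking_loss + lam * mlm_loss


# One positive and seven negative passages for a single query.
group_scores = [2.0] + [0.0] * 7
loss = total_loss(listwise_cross_entropy(group_scores, 0), mlm_loss=0.5, lam=1.0)
```

Under this assumed form, setting `lam` to zero recovers a ranking-only objective, and raising it gives the MLM task more influence on the shared encoder.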

A. DATASETS
MS MARCO [15] is an information retrieval dataset sampled from large-scale query logs of the Microsoft Bing search engine. It includes about 8.8 million text passages extracted from web documents. The relevance labels between queries and passages were annotated manually by human editors. Results on the dataset can be evaluated through online leaderboard submissions. The passage ranking leaderboard includes a full ranking task and a re-ranking task. For the re-ranking task, the leaderboard organizers officially provide 1,000 candidate passages for each query. For more detailed information about the dataset, please refer to [15].

B. EXPERIMENTAL SETTINGS
For the pre-trained language model in the proposed architecture, both the base and large versions of RoBERTa were used in the experiments. The learning rate of the Adam optimizer was 5e-5 for the base model and 1e-5 for the large model. One TITAN X GPU was used for the base model, and one Tesla V100 GPU was used for the large model. The training batch size was 16, and the maximum input length was 512.
One positive passage and seven negative passages for the given query were used to calculate the listwise loss. Fine-tuning was performed without additional pre-training steps, and the development set performance was evaluated at the end of each epoch. The final value of the hyperparameter λ, which is the proportion of the MLM loss in the total loss, was 1. In the term weighting method using pseudo relevance feedback, the hyperparameter k divides the candidates for each query into relevant and non-relevant passages. We set k to 100. The BM25 scores for the experiments were calculated with Pyserini, a Python interface to the Lucene-based retriever Anserini [26]. The evaluation metrics were MRR@10 and Mean Rank (MR). MR was used only for the development set because the test results are provided only as total MRR@10 scores. For the methods with pseudo relevance feedback, Hits@k was additionally used for evaluation. It indicates whether relevant passages are ranked among the top-k retrieved results. We adopted a listwise ranking model without MLM fine-tuning as the baseline. MS MARCO is a massive dataset that requires about a week for training and testing with our experimental settings. Therefore, we sampled 10% of the training set and 2,000 queries from the development set (about 30%) for the baseline comparison with the proposed model. This amount was sufficient to show that our method achieved a significant improvement over the baseline. However, the final model for the leaderboard submission was trained and evaluated with the entire dataset to maximize performance. Table 1 shows the comparison between the baseline and the proposed models using BM25 for the MLM task. LR Baseline was the listwise ranking baseline trained by fine-tuning without MLM. LR + MLM was the model fine-tuned with both the ranking and random MLM tasks.
The term tokens in the input passage were sampled for masking with a probability of
Table 1 showed that the MRR@10 score improved by 3.5 percentage points over the baseline when the weighted MLM was adopted for fine-tuning. Additionally, the approach with sentence embedding helped improve MR. We used the paired t-test with a p-value threshold of 0.01, where * indicates a significant improvement over the baseline model.
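The three evaluation metrics named in the experimental settings can be computed as in the sketch below. It assumes that each query is summarized by the 1-based rank of its first relevant passage; the function names and sample ranks are illustrative.

```python
def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank; a relevant passage below rank k contributes zero."""
    return sum(1.0 / r if r <= k else 0.0 for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the relevant passage; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose relevant passage appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Rank of the first relevant passage for four example queries.
example_ranks = [1, 2, 12, 5]
```

The third query illustrates the difference between the metrics: its rank of 12 adds nothing to MRR@10, but it still affects Mean Rank and counts as a hit at k = 20.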

B. THE PROPORTION OF THE MLM TASK
The hyperparameter λ in Eq. 9 determined the proportion of the MLM task in the entire training loss. A high value increased the impact of data augmentation. Conversely, decreasing the value enlarged the proportion of the ranking task and let the model focus on the original task. Table 2 shows how the performance on the development set changed with the value of the hyperparameter. Our model achieved the best performance in both MRR@10 and MR when the hyperparameter value was 1. Decreasing the hyperparameter value also decreased the performance. This indicates that the impact of data augmentation was significant even though the original task was trained less during multitask learning.

C. COMPARISON WITH POST-TRAINING
Post-training is a self-supervised learning method that uses a corpus from a specific domain or task [9]. It mainly uses the MLM task and the next sentence prediction (NSP) task, just like the pre-training method. However, the post-training method is for learning frequent language patterns in a certain domain or task, while the pre-training method enhances a broad range of language understanding. Our work is similar to previous works with the post-training method in that self-supervised learning such as MLM is used to train for a certain task. However, unlike the post-training models, our proposed model does not need any additional training steps before the fine-tuning stage.
TABLE 3: Comparisons between the proposed model and the post-trained models using 10% of the training set for fine-tuning
To compare the post-training method with the fine-tuning method proposed in this paper, we post-trained the RoBERTa base model on the MS MARCO passages and then fine-tuned it on the listwise ranking task. Table 3 compares the performance of the post-trained model (LR + PT) and our model without post-training (LR + WMLM + SE). We used only the random MLM task for post-training, following the training method of RoBERTa. In Table 3, executing post-training for one epoch meant exposing the 8.8 million MS MARCO passages to the model once. The experimental results showed that the ranking model fine-tuned after post-training achieved better performance than the baseline model. However, this improvement was small compared with the result of our proposed model. The proposed model obtained an MRR@10 score 3.5 percentage points higher than the baseline, while the post-trained model was at most 0.8 percentage points higher than the baseline.

D. COMPARISON OF BM25 AND PSEUDO RELEVANCE FEEDBACK FOR MASKING
We proposed an additional term weighting method using pseudo relevance feedback. The method used information from both relevant and non-relevant passages for a given query. We calculated the importance score according to Eq. 8 and masked terms with high scores more frequently. In Table 4, LR + WMLM_PRF denotes a model with the MLM strategy using pseudo relevance feedback. LR + WMLM_PRF + SE denotes a model that uses sentence embedding and the MLM method with pseudo relevance feedback simultaneously. These models obtained significantly better Mean Rank and Hits scores than the models using only BM25. This means that information from pseudo relevance feedback ranked relevant passages higher among many candidates. The key difference between these two metrics and MRR@10 was that they evaluate the quality of the entire ranking, whereas MRR@10 does not consider passages below the 10th rank. Even though the

E. COMPARISON WITH MASKING PRIORITIES
The comparison between three weighted MLM models with different masking strategies demonstrated that masking less significant terms was helpful for training. In Table 5, LR + WMLM_DOWN was a model in which the important terms were masked less frequently, while LR + WMLM_UP had the important terms masked more frequently. LR + WMLM_MEAN computed the average value of the token weights in each passage and frequently masked terms near the average. In all three models, the proportion of masked tokens was 15%. The MRR@10 score of LR + WMLM_DOWN was the highest, while that of LR + WMLM_UP was the lowest.
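The three masking priorities compared here can be expressed as different ways of turning the same importance weights into sampling weights. The sketch below is an assumption-laden illustration: the text describes the strategies only qualitatively, so the exact weighting formulas, in particular the one for the mean strategy, and the function name are hypothetical.

```python
def masking_weights(importance, strategy):
    """Map token importance to sampling weights for one masking strategy.

    strategy: "down" favors unimportant tokens, "up" favors important tokens,
    "mean" favors tokens whose importance is close to the passage average.
    """
    n = len(importance)
    total = sum(importance)
    norm = [x / total for x in importance] if total > 0 else [1.0 / n] * n
    if strategy == "down":
        raw = [1.0 - x for x in norm]
    elif strategy == "up":
        raw = list(norm)
    elif strategy == "mean":
        mean = sum(norm) / n
        max_dev = max(abs(x - mean) for x in norm)
        raw = [max_dev - abs(x - mean) for x in norm]  # closest to mean is largest
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    z = sum(raw)
    return [r / z for r in raw] if z > 0 else [1.0 / n] * n


example_importance = [0.1, 0.5, 2.0, 0.4]
```

For the example weights, the "down" strategy puts the most mass on the first token, "up" on the third, and "mean" on the second, so the same 15% masking budget is spent on very different tokens.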

F. EFFICIENCY OF TEST TIME
The proposed model did not require the MLM task during evaluation, just as in the baseline.

G. COMPARISON WITH THE LEADERBOARD MODELS
For the MS MARCO leaderboard submission, we selected a model with BM25-based MLM and sentence embedding. Models using pseudo relevance feedback demonstrated better performance on other metrics. However, we submitted the model with the best performance on MRR@10, which was the official metric of the leaderboard. On the MS MARCO passage re-ranking leaderboard, our model outperformed the state of the art, with the exception of an ensemble-based model of BERT, ELECTRA, and RoBERTa [19] from Google Research. Even though that model used a combination of three large-scale neural language models, the difference between the MRR@10 scores of the ensemble model and the proposed single model was only 1.3 percentage points. The performance records from the leaderboard are summarized in Table 7.² DUET V2 [27] was the official baseline of the leaderboard. OpenMatch [28] was a library for several tasks in IR, whose implementation of ELECTRA Large was the best re-ranking model, except for the ensemble-based model [19], before the proposed model was submitted. Interestingly, our model with the base version of RoBERTa also achieved significant performance, as described in Table 7. High performance and low query latency were available simultaneously, given that the base version of RoBERTa requires a lower computational cost than the larger models.
² The leaderboard records in this paper are based on the ranking of June 8, 2021. Those results are posted on the website https://microsoft.github.io/MSMARCO-Passage-Ranking-Submissions/leaderboard/. Our model's name in the leaderboard is "SSFT", an abbreviation of self-supervised fine-tuning. The team name of the model is anonymous.

VI. CONCLUSION AND FUTURE WORK
This study proposes a fine-tuning method using the MLM task for neural re-ranking models. The importance scores of the term tokens and sentences are estimated to select the tokens to be masked and the core sentences in the model. The goal of the proposed model is to maintain a balance between ranking performance and efficiency. The ranking performance of the proposed model is the state of the art among single models on the MS MARCO passage re-ranking leaderboard. We analyzed the data augmentation effect and time efficiency of the proposed model through various experiments. This study expands our previous work by using the pseudo relevance feedback method to calculate term importance, which is a more adaptive approach for ranking systems.
Self-supervised learning in the fine-tuning phase can be applied to other tasks in natural language processing and IR that use pre-trained language models. Developing efficient neural models using self-supervised learning remains a topic for future discussion.