ML-Augmented Automation for Recovering Links Between Pull-Requests and Issues on GitHub

GitHub provides a distributed and collaborative platform for developing and maintaining open-source projects. This social coding platform supports collaborative development, with or without coordination, through two artefacts: pull-requests and issues. When the number of daily submitted issues grows rapidly, especially in popular repositories, managing them becomes more complicated. To help a repository's developers process issues, external contributors fix issues by submitting pull-requests. On GitHub, a pull-request is frequently linked to a submitted issue to show that a solution is in progress. Unfortunately, contributors may neglect or forget to link pull-requests to their corresponding issues. As a result, only a small share of these links is established, while a large portion is missing from the development history. Moreover, even for senior developers, manually recovering the links between pull-requests and issues from the development history is a time-consuming, challenging, and error-prone task. In this article, we propose building ML models that recover links between pull-requests and their issues using two machine learning algorithms (KMeans and BIRCH) based on lexical and semantic weighting measures. These models are evaluated using the PI-Link ground-truth dataset. The results show that pull-request-issue links can be recovered with an accuracy of 91.5% using the BIRCH clustering algorithm.


I. INTRODUCTION
In social coding platforms such as GitHub, when a contributor finds an Issue, they inform other contributors of the hosted repository via the issue tracking system [1]. An Issue is a bug, a feature enhancement, a new feature, or any maintenance task. On GitHub, the integrated Issue tracking system allows the Issue submitter to provide textual information about the detected Issue, containing a title and a description. After Issues are reviewed and assigned to internal (responsible) developers, those developers are required to promptly respond to the Issues identified in the Issue tracker and resolve them with the least amount of work. When the number of daily submitted Issues grows rapidly, especially in popular repositories, managing Issues becomes more complicated, adding to developers' already heavy workloads [2].
To help the internal developers process Issues, external contributors or volunteers help fix Issues by submitting pull-requests (PRs). Contributors clone (fork) a hosted repository on GitHub and make their changes locally on the cloned repository [3]. The changes include development tasks (such as adding new functional features), resolving Issues, or making enhancements (in terms of usability, reliability, performance, and so on). Then, they create PRs to notify the repository owners about these changes and to merge them into the original repository after the submitted PRs are reviewed [4]. Each PR includes a title and a body (description) to explicitly clarify the goal of the PR submitter.
On GitHub, a PR should be linked with a submitted open Issue to show that a solution is in progress [5]. When the PR is ready to merge, the open Issue is closed automatically. The nature of the distributed and parallel process of the PR mechanism lets many contributors submit separate PRs to tackle the same Issue. As a result, the same Issue may be linked to one or more PRs where each submitted PR represents a proposed solution. Therefore, many PRs include in their titles and descriptions the linked Issue identifier (Issue number).
The established links between Issues and their corresponding PRs are valuable, as such links can be exploited to learn from previous Issues/PRs to predict potential future solutions. Also, such links are valuable assets from the development and maintenance point of view where they have different semantics. For example, these are used to link features with their corresponding implementations at different levels of abstraction [6], [7], [8]. In bug localization, they are used for training to build prediction models [9], [10], [11]. In Issues assignments, they are used to build a recommender system to propose relevant reviewers for upcoming Issues [2], [12], [13].
On GitHub, developers (or contributors) typically create PR-Issue links manually: they specify the Issue identifier in the PR title or description. Unfortunately, contributors may forget or be too busy to link a PR to its Issue, especially in large projects. As a result, a significant number of links are missed, while only a small fraction of PR-Issue links are established. At the same time, manually recovering such PR-Issue links is an error-prone, time-consuming, and challenging task, even for senior developers.
Nowadays, machine learning (ML) and deep learning (DL) achieve promising results in different natural language processing (NLP) tasks, such as clustering, prediction, sentiment analysis, etc. [15]. Many ML and DL algorithms can build models fed with datasets to automatically discover the representations needed for detection or prediction. To explore the possibility of automatically recovering the missing PR-Issue links using ML, we conducted an empirical study to identify the characteristics of explicit PR-Issue links based on their textual representations. We found that the links between PRs and Issues exhibit certain features. For example, the titles, body descriptions, and comments of Issues and their corresponding PRs share various textual similarities.
In this article, we propose building ML models to recover PR-Issue links using two ML algorithms (KMeans and BIRCH) based on lexical and semantic weighting measures. We selected these algorithms over others for two reasons. The first is their capability to predict the clusters of new datasets. The second is that the number of clusters is known in advance: in our case, it is the number of PR-Issue links. This overcomes the problem of determining the best number of clusters faced by other clustering algorithms (the local optima problem [14]). The effectiveness of these algorithms is evaluated using a ground-truth dataset called PI-Link [15]. The experimental results show that the BIRCH algorithm outperforms the KMeans algorithm in terms of the metrics most widely used in this domain: homogeneity, completeness, and V-measure. Moreover, three features of PRs and Issues are used to build the clustering models: Title, Body, and Comment. According to the experimental results, the Title and Body features are more important for this recovery task than the Comment feature.
The rest of this paper is organized as follows. Section II provides the necessary background for understanding the proposed approach. Section III explains in detail how the clustering models are built. In Section IV, the clustering models are experimentally evaluated against a ground-truth dataset. Related work is discussed in Section V. Finally, the article's conclusions and future work are drawn in Section VI.

II. BACKGROUND
In this section, we present the essential background needed to understand the contribution of this article. This background covers the GitHub flow, PR-Issue links, and clustering algorithms.

A. GitHub FLOW
GitHub is the most widespread and popular social coding platform for hosting open-source projects. By November 2022, more than 100 million developers contributed to over 352 million repositories (more than 41 million public repositories [16]) [17], [18]. It is built on top of a version control system called Git [19]. Therefore, GitHub provides a distributed version control in addition to other services, such as Issue tracking, parallel development, incremental development, integration, etc.
GitHub follows a branch-based development model. This lets developers collaborate on development activities independently, without impacting other developers' branches in the same repository. Each repository has one default main branch, called Master, and other custom branches created by developers. Figure 1 shows the GitHub flow. The flow starts by creating a new branch for adding new features, fixing bugs, or making enhancements. Developers then make their changes on the new branch (adding, deleting, and editing files) as commits. To help the owner or future developers easily understand the committed changes (e.g., a performance enhancement, a bug fix, etc.), each commit includes a message describing it. Next, developers create a PR when their changes (commits) on that branch are complete. Developers should also provide a summary of each PR (including a title and a description body) describing the accomplished task; this additionally maintains a record of the code changes. If the submitted PR is a solution to an open Issue, it is linked to that Issue to show that a solution is in progress. After several rounds of review and discussion among developers and reviewers (e.g., questions, comments, and suggestions on the PR timeline), a decision is made to approve, enhance, or reject the PR. Finally, when the PR is approved and well-tested, it is merged back into the Master branch to save the changes permanently, and its state changes from open to closed.

B. PR-ISSUE LINKS IN GitHub
PRs and Issues are the main artefacts of collaborative distributed development and maintenance on GitHub. On the one hand, Issues are used for different purposes, such as asking questions, signalling bugs, and other maintenance tasks; they are typically stored in the Issue tracking system. On the other hand, PRs enable collaborators and reviewers to submit their code changes; they are managed by the version control system. Issues and PRs represent artefacts of different domain spaces: Issues represent the problem space (requirements, features, bugs, etc.), while PRs represent the solution space (fixing bugs, implementing a new feature, etc.). Linking PRs and Issues is therefore also a link between the solution and problem spaces. As a result, such links have different semantics. A link can document the relationship between features or requirements and their corresponding code at different levels of abstraction [20], or it can be a maintenance link between bugs and their solutions [21], [22]. Links between PRs and Issues may be established within the same repository or across different repositories. When PRs are approved and merged into the main branch, the state of their linked Issues changes to closed (solved).
There are two ways to link PRs and Issues: implicitly and explicitly [23]. In the implicit way (shown in Figure 2), GitHub provides keywords to link PRs and Issues (e.g., resolve(s,d), close(s,d), and fix(es,ed)). These keywords are placed in the PR text (either the description or a commit message). This way has two main concerns. Firstly, removing or changing PR-Issue links is a time-consuming and tedious process, as the owner must manually modify the PR's description to remove or change the link keywords. Secondly, a PR may include many commits, with link keywords placed in the messages of only some of them. When the PR is merged into the branch, the linked Issue is closed even though some commits are not linked to that Issue. This can obscure the link between the solution space and the problem space. In the explicit way, these limitations are eliminated: PRs and Issues are linked manually via the PR sidebar or Issue sidebar, as shown in Figure 3. In this way, reviewers and contributors can monitor the progress of development and maintenance activities on the linked Issues and their corresponding PRs, and any authorized contributor can easily change or remove the PR-Issue links.
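As an illustrative sketch (not GitHub's exact link grammar), implicit linking keywords can be detected in PR text with a simple regular expression; the `extract_implicit_links` helper below is hypothetical:

```python
import re

# A subset of GitHub's closing keywords, followed by a "#123"-style reference
# (illustrative sketch; GitHub's real parser handles more forms)
LINK_PATTERN = re.compile(
    r"\b(close[sd]?|fix(?:e[sd])?|resolve[sd]?)\b\s*:?\s*#(\d+)",
    re.IGNORECASE,
)

def extract_implicit_links(pr_text):
    """Return the Issue numbers implicitly linked in a PR description."""
    return [int(num) for _, num in LINK_PATTERN.findall(pr_text)]

print(extract_implicit_links("This PR fixes #42 and closes #7."))
```

A text without any closing keyword yields an empty list, which corresponds to a missing implicit link.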

C. CLUSTERING ALGORITHM
Clustering is the process of grouping similar items (based on patterns or structural properties) into clusters. Clustering algorithms are unsupervised learning algorithms that work on unlabeled datasets. The literature offers many clustering algorithms, such as KMeans, BIRCH, DBSCAN, spectral clustering, and hierarchical clustering [24]. Each has its own considerations. One of the main considerations is whether the number of clusters must be provided in advance. For example, the KMeans and BIRCH algorithms require a predefined number of clusters, whereas DBSCAN and hierarchical clustering do not. Another important consideration is whether the algorithm can predict clusters for an untrained (unseen) dataset. For example, the KMeans and BIRCH algorithms can predict the clusters of new datasets, whereas DBSCAN cannot. In any case, a ground-truth dataset is important for unsupervised learning techniques in order to compare the prediction accuracy of different machine-learning (ML) models and select the appropriate one. In this article, we rely on the KMeans and BIRCH algorithms to recover (cluster) PR-Issue links, as they best fit the requirements of the selected ground-truth dataset.
KMeans is an unsupervised ML algorithm used to group data points into a predetermined number of clusters (K). It starts with K randomly chosen items acting as the clusters' centroids. The algorithm then iteratively computes the sum of squared distances between the data points and the centroids of the current clusters, assigning each data point to the cluster whose centroid is closest. It then recomputes each cluster's centroid, repeating until the best centroids are found [25]. Like KMeans, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies [26]) is driven by a predefined number of clusters. It converts data points into a tree representation with centroids; these centroids can be the final cluster centroids or the input for another clustering round. BIRCH starts by clustering a given dataset into small summaries in a single scan. The summaries can then be clustered by other clustering algorithms, such as agglomerative hierarchical clustering.
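A minimal sketch of both algorithms with scikit-learn, on toy two-dimensional data (not the paper's dataset), illustrates the property that motivated their selection: both fitted models can assign clusters to unseen points via `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans, Birch

# Toy points forming two obvious groups (illustrative data only)
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)
birch = Birch(n_clusters=2, threshold=0.5).fit(X)

# Unlike DBSCAN, both fitted models can label unseen points
new_points = np.array([[0.15, 0.1], [5.05, 5.1]])
print(kmeans.predict(new_points))
print(birch.predict(new_points))
```

Each new point is assigned to the group it is closest to, without refitting the model.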

III. CLUSTERING MODEL FOR PR-ISSUE LINKS
In this study, the recovery of PR-Issue links is driven by three features that are always available in PR and Issue content: Title, Body, and Comment. Each of these features works as a clue to recover the links between PRs and their corresponding Issues. Similar Titles are a strong indicator that a PR and an Issue address the same problem. Similar Bodies and Comments give details about the type of Issue the current PR is addressing and how it is being addressed. Other features, such as PR and Issue labels, are not considered because they are missing or repetitive in many PRs and Issues.
In this section, we build the clustering models that link PRs with their relevant Issues. The clustering models take the raw data of Issues and PRs, without their links, as input. The textual content of PRs and Issues is then parsed to extract only the features of interest. Next, PRs and Issues go through pre-processing steps for cleaning. Then, lexical and semantic similarities are computed for the PR and Issue vectors. Finally, these vectors are clustered using the KMeans and BIRCH algorithms to recover PR-Issue links. The recovered links are evaluated against the ground-truth PR-Issue links in the dataset. Figure 4 gives an overview of this recovery process. Below, we detail the steps required to build the clustering models using the KMeans and BIRCH algorithms.

A. FEATURE EXTRACTION
Feature extraction is the process of eliminating irrelevant and noisy data from the textual content of the input PRs and Issues. Therefore, only relevant features should remain after this process, ready for building the clustering models. As mentioned above, we are interested in three features as clues to link PRs with their corresponding Issues: the Title, Body, and Comment of both PRs and Issues.

B. DATA PRE-PROCESSING
The following common and standard pre-processing steps are applied to prepare and clean the dataset.

1) REMOVING PUNCTUATION MARKS AND SPECIAL CHARACTERS
In this task, we remove punctuation marks, numbers, parentheses, and any special characters. A suitable regular expression is used to carry out this cleaning task.
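A minimal sketch of such a cleaning step (the regular expression shown is an assumption; the paper does not give its exact pattern):

```python
import re

def clean(text):
    """Strip punctuation, digits, parentheses, and special characters."""
    # Keep only letters and whitespace; everything else becomes a space
    text = re.sub(r"[^A-Za-z\s]+", " ", text)
    # Collapse the runs of spaces left behind by the substitution
    return " ".join(text.split())

print(clean("Crash on startup (v2.1) -- see log #123!"))
```

Note that tokens fused with digits (e.g., "v2.1") lose their numeric part, which is acceptable here since numbers carry little lexical signal for similarity.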

2) TOKENIZATION
We tokenize (split) the text of input PRs and Issues into words. This text includes the Title, Body, and Comment of both PRs and Issues. The tokenization strategy splits the data text into tokens using the space delimiter. Moreover, unnecessary symbols and punctuation marks are eliminated. To achieve that, the ''spaCy'' library [27] is used to end up with valuable and useful words.

3) REMOVING STOP WORDS
For this step, we eliminate all stop-words that frequently appear in the token list (obtained from the previous step) but carry no meaning and cannot be used to compute the similarity between a PR and an Issue (e.g., the, on, between, etc.). We use ''NLTK'' (Natural Language Toolkit [28]) to remove all stop words.

4) STEMMING
Stemming is the process of reducing tokens to their base forms (for example, makes to make, raising to raise). We utilize one of the most famous stemming algorithms, the ''Porter stemmer'', provided by the NLTK library. We also convert all tokens to lowercase.
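Putting the four pre-processing steps together, a minimal sketch might look as follows. This is a simplified stand-in: whitespace tokenization replaces spaCy, scikit-learn's built-in English stop-word list replaces NLTK's, and NLTK is assumed to be installed for the Porter stemmer.

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

stemmer = PorterStemmer()

def preprocess(text):
    """Clean, tokenize, remove stop words, and stem a PR/Issue text field."""
    # 1) strip punctuation, numbers, and special characters
    text = re.sub(r"[^A-Za-z\s]+", " ", text)
    # 2) lowercase and tokenize on whitespace
    tokens = text.lower().split()
    # 3) drop English stop words
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    # 4) Porter stemming
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Fixing the crash (#123) when raising exceptions on startup"))
```

The output token list is what feeds the weighting step described next.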

C. COMPUTING SIMILARITY FOR EACH FEATURE
In this step, we compute the similarity between pairs of PR and Issue by using each feature, individually. This similarity is computed using lexical and semantic weighting measures.
For the lexical measure, we rely on the well-known TF-IDF (Term Frequency-Inverse Document Frequency) weighting measure [29]. A weighting (TF-IDF) score is assigned to each token in a feature's corpus. This score increases with the frequency of a token within a given PR/Issue vector but decreases with the token's frequency across the rest of the corpus. The score thus indicates the importance of a token to a given PR/Issue vector: the content of a PR/Issue is better represented by tokens with higher TF-IDF scores. In this step, three TF-IDF matrices are created, one per feature (the Title, Body, and Comment of both PRs and Issues). In each TF-IDF matrix, rows represent PR and Issue vectors, columns represent all tokens in the feature's corpus, and the cells contain TF-IDF weights. This is performed using the ''TfidfVectorizer'' method provided by the ''scikit-learn'' library. The similarity between PR and Issue vectors is computed using the cosine similarity in Equation 1, where V1 and V2 refer to the PR and Issue vectors:

cos(V1, V2) = (V1 · V2) / (||V1|| ||V2||)    (1)
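A minimal sketch of this lexical pipeline with scikit-learn, on invented titles (not drawn from the dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative PR/Issue titles (hypothetical examples)
issue_title = "App crashes when opening the settings screen"
pr_titles = [
    "Fix crash on opening settings screen",
    "Add dark mode support",
]

vectorizer = TfidfVectorizer(stop_words="english")
# Rows of the TF-IDF matrix are the Issue and PR vectors
tfidf = vectorizer.fit_transform([issue_title] + pr_titles)

# Cosine similarity between the Issue vector and each PR vector (Equation 1)
scores = cosine_similarity(tfidf[0], tfidf[1:])[0]
print(scores)
```

The crash-fix PR shares tokens (opening, settings, screen) with the Issue and scores above the unrelated PR, which shares none and scores zero.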
For the semantic weighting measure, we use a recent semantic model in natural language processing (NLP) named BERT (Bidirectional Encoder Representations from Transformers). It is a pre-trained language transformer model developed by Devlin et al. at Google [30]. BERT is a deep-learning-based semantic representation that learns contextual associations between words placed in the same text. Therefore, it can embed the semantics of texts as vector representations. In this study, we exploit a BERT transformer based on the pre-trained model ''bert-base-nli-mean-tokens'' in combination with a similarity measure to compute the semantic similarity score between PRs and Issues. In our case, BERT is applied three times, once per feature. The BERT transformer converts the textual contents of a PR and an Issue into vectors V1 and V2, respectively. After that, the similarity score is calculated using the cosine similarity in Equation 1.
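Given two embedding vectors, Equation 1 reduces to a few lines of NumPy. The vectors below are stand-ins, since producing real embeddings would require loading the ''bert-base-nli-mean-tokens'' model:

```python
import numpy as np

def cosine_sim(v1, v2):
    """Cosine similarity between two embedding vectors (Equation 1)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

# Stand-in vectors; in the actual pipeline these would be sentence embeddings
# of a PR text and an Issue text
v_pr = np.array([0.2, 0.8, 0.1, 0.5])
v_issue = np.array([0.25, 0.75, 0.05, 0.6])

print(cosine_sim(v_pr, v_issue))
```

Scores close to 1 indicate semantically similar texts; identical vectors score exactly 1.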

D. CLUSTERING ALGORITHMS
In this step, we apply the clustering algorithms that best fit our case: KMeans and BIRCH (refer to the background in Section II for further details about these algorithms). We adapted them as follows. In both algorithms, the number of clusters is predefined as the number of linked Issues, and the data points are the Issue and PR vectors built in the previous step. For each Issue vector, the clustering algorithms find the PR vectors closest to that Issue according to the considered features (i.e., Title, Body, and Comment). The quality of the predicted clusters can then be assessed: the obtained Issue-PR clusters are compared with their corresponding ground-truth Issue-PR clusters (i.e., Issue-PR links). We relied on sklearn.cluster to implement the KMeans [31] and BIRCH [32] algorithms.

IV. EVALUATION OF CLUSTERING MODEL
In this section, we aim to evaluate the effectiveness of the KMeans and BIRCH algorithms for recovering PR-Issue links.

A. GROUND-TRUTH DATASET
The considered clustering algorithms are applied to a ground-truth dataset (available on Kaggle). This dataset includes 5742 Android projects that have explicit PR-Issue links. Table 1 shows statistics of the ground-truth dataset: the total number of Issues and PRs across all projects, together with the min, max, mean, and standard deviation (Std) at the project level. For example, the dataset has 34732 Issue titles across all projects. Some projects have just one title (i.e., contain just one Issue), whereas the largest project has 974 titles (i.e., contains 974 Issues). The average and Std of the number of Issues over all projects are 6.04 and 30.13, respectively. The table also shows that the dataset's 34732 Issues are linked to 50369 PRs. The difference between the counts of PRs and Issues is due to the cardinality of the relationships between linked PRs and Issues: a PR may be linked to many Issues, and an Issue may be linked to many PRs. Table 2 reports the amount of empty data in the ground-truth dataset. Every PR and Issue has a title, but a few lack a body (141 out of 34732 Issues, and 129 out of 50369 PRs). Besides, 1296 Issues and 8 PRs do not have comments. The reason is that PRs usually go through a review and discussion process via PR comments, whereas this process is not mandatory for Issues. Finally, around 8% of Issues and 9% of PRs are without labels.

B. RESEARCH QUESTIONS AND EVALUATION METRICS
To quantify the accuracy of the recovered links, we designed experiments to answer the following research questions.

1) RQ1: HOW FAR ARE THE RECOVERED PR-ISSUE LINKS CORRECT?
The goal of this research question is to measure the correctness of the recovered links (clusters) between PRs and their associated Issues. To do so, we apply the clustering algorithms to our dataset after removing all links between PRs and Issues; the aim of the clustering algorithms is then to recover these removed links. We measure the correctness using the Homogeneity metric, which measures the extent to which the resulting clusters contain only similar members. The homogeneity score is calculated based on Equation 2 [33]:

homogeneity = 1 - H(C|K) / H(C)    (2)
where H(C|K) is the conditional entropy of the classes given the clusters (i.e., the PR-Issue links from the ground truth), obtained by Equation 3:

H(C|K) = - Σ_c Σ_k (n_{c,k} / N) · log(n_{c,k} / n_k)    (3)
and H(C) is the entropy of the classes, obtained by Equation 4:

H(C) = - Σ_c (n_c / N) · log(n_c / N)    (4)
where N is the total number of items, n c is the number of items belonging to class c, and n c,k refers to the number of items from class c that belong to cluster k.
The homogeneity score is 100% when all items grouped in a cluster k have the same class c (see Figure 5). In our case, it measures whether all PRs and Issues belonging to a recovered cluster are associated with the same link. For example, the homogeneity is 100% if all PRs and Issues in a cluster are members of the same PR-Issue link.
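The behaviour of the homogeneity metric can be checked directly with scikit-learn on tiny, made-up label assignments:

```python
from sklearn.metrics import homogeneity_score

# Ground-truth link ids (classes) vs. recovered cluster ids (illustrative)
truth   = [0, 0, 1, 1, 2, 2]
perfect = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
impure  = [0, 0, 0, 0, 1, 1]  # links 0 and 1 merged into one cluster

print(homogeneity_score(truth, perfect))  # 1.0: every cluster is pure
print(homogeneity_score(truth, impure))   # below 1.0: a mixed cluster hurts purity
```

Note that the metric is invariant to cluster label names: only the grouping matters.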

2) RQ2: HOW FAR ARE THE RECOVERED PR-ISSUE LINKS COMPLETE?
This RQ aims to measure the completeness of the recovered link members (PRs and Issues). We measure it using the completeness metric, which is complementary to the homogeneity metric: it measures the extent to which all members of a given ground-truth link are assigned to the same recovered cluster. The completeness score is formally obtained from Equation 5 [33]:

completeness = 1 - H(K|C) / H(K)    (5)

where H(K|C) is the conditional entropy of the clusters given the classes, obtained by Equation 6:

H(K|C) = - Σ_c Σ_k (n_{c,k} / N) · log(n_{c,k} / n_c)    (6)
and H(K) is the entropy of the clusters, obtained by Equation 7:

H(K) = - Σ_k (n_k / N) · log(n_k / N)    (7)
where N is the total number of items, n_k is the number of items belonging to cluster k, and n_{c,k} refers to the number of items from class c that belong to cluster k. The completeness score equals 100% when all items of a class c are assigned to the same cluster k (see Figure 6). In our case, the completeness is 100% if all PRs and Issues that are members of the same PR-Issue link are recovered into the same cluster.
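Completeness behaves as the mirror image of homogeneity, which a small scikit-learn example makes concrete:

```python
from sklearn.metrics import completeness_score, homogeneity_score

truth = [0, 0, 1, 1]
split = [0, 1, 2, 2]  # link 0's members were split across two clusters

# Splitting a link's members lowers completeness, while each resulting
# cluster can still be perfectly homogeneous
print(homogeneity_score(truth, split))
print(completeness_score(truth, split))

# Completeness is homogeneity with classes and clusters swapped
assert abs(completeness_score(truth, split)
           - homogeneity_score(split, truth)) < 1e-12
```

This is why the two metrics are reported together: each penalizes a different failure mode of the recovered clusters.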

3) RQ3: HOW FAR ARE THE RECOVERED PR-ISSUE LINKS ACCURATE?
We measure the accuracy based on both the homogeneity and the completeness of the recovered links. For this aim, we use the V-measure metric, which is the harmonic mean of the homogeneity and completeness scores [33], formally given by Equation 8:

V-measure = 2 · (homogeneity · completeness) / (homogeneity + completeness)    (8)
The V-measure equals 100% when both the homogeneity and the completeness equal 100% (see Figure 7). On the other hand, if a cluster satisfies neither homogeneity nor completeness (i.e., either score equals zero), its V-measure is zero. For example, the V-measure of a recovered link is 100% if all, and only, the PR and Issue members of the same PR-Issue link belong to it.
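The harmonic-mean definition of Equation 8 can be verified against scikit-learn's implementation on a small, made-up example:

```python
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

truth = [0, 0, 1, 1, 2, 2]
pred  = [0, 0, 1, 2, 2, 2]  # one member of link 1 landed in cluster 2

h = homogeneity_score(truth, pred)
c = completeness_score(truth, pred)
v = v_measure_score(truth, pred)

# V-measure is the harmonic mean of homogeneity and completeness (Equation 8)
print(h, c, v)
assert abs(v - 2 * h * c / (h + c)) < 1e-10
```

Because it is a harmonic mean, the V-measure is dragged down by whichever of the two scores is lower.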

C. CLUSTERING CONFIGURATION
To answer the RQs, we applied the KMeans and BIRCH algorithms to the ground-truth dataset seven times. The first time, we trained on one-seventh of the dataset, and each subsequent run added another seventh (i.e., the second run uses 2/7, the third 3/7, etc.). This method monitors the clustering models' performance over diverse dataset sizes and helps guard against biased datasets and overfitting. For each experiment, we compute the homogeneity, completeness, and V-measure scores for the selected features (the texts of the PR and Issue titles, bodies, and comments). These scores are measured using the sklearn.metrics module [34], [35], [36].

Table 3 lists the parameters used for the KMeans and BIRCH configurations. We used the English stop-word list provided by the sklearn library [37]. Furthermore, we build a vocabulary that considers only the top max-features tokens (determined by TfidfVectorizer) ordered by term frequency across the dataset. Since we know the number of PR-Issue links, regardless of the cardinality among them, the number of clusters is set to the number of PR-Issue links. For example, if an Issue is linked to many PRs, these links are considered one cluster consisting of that Issue and its corresponding PRs (and vice versa). Moreover, we fixed the randomness of the initial cluster centers to obtain consistent results for comparison. Finally, the threshold used by the BIRCH algorithm denotes the maximum distance between a cluster's members and its center; after evaluating many thresholds, we found that a value of 0.03 satisfied our goal.

Figure 8 illustrates the homogeneity scores for KMeans and BIRCH using the lexical weighting (TF-IDF), based on the selected features, over different sizes of the dataset.
We can observe that the homogeneity scores of KMeans increase markedly with the size of the dataset, while BIRCH is almost consistent (roughly flat). Moreover, the homogeneity scores obtained by the two algorithms for the trained titles and bodies (between 73% and 90%) are better than those for the trained comments (31%-50%). The highest homogeneity score is 90.4%, obtained by the BIRCH algorithm.

Figure 9 shows the homogeneity scores when KMeans and BIRCH are applied to the dataset using the semantic weighting (BERT), based on the selected features. Both KMeans and BIRCH have roughly flat curves regardless of the data size and considered features. Again, the Title and Body features in both algorithms achieve higher homogeneity scores than the Comment feature: Title and Body fall in the range 76%-89%, while Comment falls in the low range 30%-32%. Comparing the homogeneity results of KMeans and BIRCH across all features, BIRCH is again the best, with a maximum homogeneity score of 89% when using the semantic weighting.

Figure 10 illustrates the completeness scores for KMeans and BIRCH for all selected features, over different sizes of the dataset, using the lexical weighting. All selected features have high completeness scores (83.3%-92.7%); the highest, 92.7%, is obtained by the BIRCH algorithm.

Figure 11 displays the completeness results when using the semantic weighting in both KMeans and BIRCH, based on the selected features, across all dataset sizes. In this figure, there is a minor incremental growth of the KMeans and BIRCH curves with dataset size. Both algorithms have high completeness scores for all features, in the range 84%-92%. BIRCH outperforms KMeans with a maximum completeness score of 92%, versus 91% for KMeans.

Figure 12 illustrates the V-measure scores for KMeans and BIRCH for all selected features on different sizes of the dataset. We can observe that both the Title and Body features give high scores (76.8%-91.5%), while the Comment feature gives low ones. The highest V-measure obtained by the BIRCH algorithm is 91.5%.
In Figure 13, the V-measure scores are presented for both KMeans and BIRCH using the semantic weighting, across all selected features and dataset sizes. As shown in this figure, the BIRCH curves are flatter than the KMeans curves across all features and dataset sizes. The Title and Body features give high V-measure results (81%-90%). This figure also confirms that BIRCH, as usual, outperforms KMeans in terms of V-measure, with a maximum value of 90% versus 88% for KMeans.

Regardless of the weighting measure used (semantic or lexical), the Title and Body features achieve promising homogeneity, completeness, and V-measure scores over different dataset sizes in both KMeans and BIRCH. Firstly, the homogeneity scores obtained by KMeans and BIRCH are, respectively, 72.9%-89.5% and 88%-90.4% for the Title, and 69%-85% and 81.1%-82.6% for the Body. Secondly, the completeness scores obtained by KMeans and BIRCH are, respectively, 88.3%-91.3% and 90%-92.3% for the Title, and 86.4%-89.6% and 87.8%-89% for the Body. Finally, the V-measure scores obtained by KMeans and BIRCH are, respectively, in the ranges 79.9%-90.4% and 89%-91.5% for the Title, and 76.8%-85% and 84.9%-85.5% for the Body.
The Comment feature performs poorly in homogeneity, whether using the semantic or the lexical weighting, over different dataset sizes in KMeans and BIRCH, as shown in Figure 8. The homogeneity scores lie in the ranges 30%-50% and 30%-32.7% for KMeans and BIRCH, respectively. This means that the PRs and Issues of the links recovered from comments come from different PR-Issue links. On the other hand, the completeness scores of the Comment feature are acceptable: in Figure 10, the completeness scores obtained by KMeans and BIRCH are 83.3%-86.9% and 83.3%-86%, respectively. This indicates that most of the PRs and their associated Issues are placed together in the same cluster. Low homogeneity combined with high completeness indicates that a recovered cluster contains many PR-Issue links. Nonetheless, the V-measure scores reveal the poor overall performance of the Comment feature, as shown in Figure 12: 44.9%-63.5% for KMeans and 44.9%-45.2% for BIRCH.

In summary, we observe that BIRCH outperforms KMeans in terms of homogeneity, completeness, and V-measure under both the semantic and lexical weighting. BIRCH outperforms KMeans with the Title and Body features, but not with the Comment feature; in any case, the scores of both algorithms for the Comment feature are not promising. Moreover, BIRCH has demonstrated its effectiveness over different dataset sizes using the Title feature, with maximum homogeneity, completeness, and V-measure scores of 90.4%, 92.3%, and 91.5%, whereas KMeans's scores vary with the dataset size for the same feature (89.5%, 91.3%, and 90.4%). Consequently, the BIRCH algorithm is the better fit for our case.
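The evaluation protocol above (training on growing sevenths of the data and scoring with sklearn.metrics) can be sketched on synthetic stand-in vectors; the real experiments use the PI-Link TF-IDF/BERT vectors instead:

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import (homogeneity_score, completeness_score,
                             v_measure_score)

rng = np.random.default_rng(42)

# Synthetic stand-in for PR/Issue vectors: 70 points in 10 tight groups,
# one group per PR-Issue link (illustrative, not the PI-Link data)
n_links, per_link = 10, 7
centers = rng.normal(size=(n_links, 8)) * 10
X = np.vstack([c + rng.normal(scale=0.1, size=(per_link, 8)) for c in centers])
y = np.repeat(np.arange(n_links), per_link)

# Evaluate on growing slices: 1/7, 2/7, ..., 7/7 of the dataset
for i in range(1, 8):
    n = len(X) * i // 7
    model = Birch(n_clusters=len(set(y[:n])), threshold=0.03)
    pred = model.fit_predict(X[:n])
    print(f"{i}/7: h={homogeneity_score(y[:n], pred):.3f} "
          f"c={completeness_score(y[:n], pred):.3f} "
          f"v={v_measure_score(y[:n], pred):.3f}")
```

On such well-separated synthetic groups the scores stay near 1.0 across all slice sizes; the real dataset is, of course, far noisier.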

E. DISCUSSION
The evaluation revealed that the clustering models were unable to recover some PR-Issue links (8%) based on the Title feature. During the evaluation, we inspected the links mispredicted by the models. As a result, we found three categories of mispredicted PR-Issue links: common topics, different levels of abstraction, and short titles.

1) COMMON TOPICS
We found some instances of Issues that have common descriptions (similar text). Since the dataset is extracted from Android projects, most of these instances are related to the Android platform. For example, some of these Issues arise when a new Android API level is released: the new API might provide new features that Android developers should implement, and such Issues are raised when developers implement those features in their projects. Because these Issues address the same topic, the probability of their having similar titles increases. Consequently, these similar Issues are grouped in the same cluster, which leads the model to mispredict.

2) DIFFERENT LEVELS OF ABSTRACTION
We found some instances where the Title of an Issue has a different level of abstraction than its associated PRs. In most of these instances, the Issue describing a new feature or a new bug uses text close to the feature requirements or bug use cases (conceptual level), whereas the PR associated with that Issue uses text close to the technical implementation (implementation level). Consequently, such an Issue and its linked PRs are placed in different clusters due to their textual dissimilarity, which leads the model to mispredict.
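This abstraction gap can be illustrated with a small sketch, assuming a lexical (TF-IDF) representation. Both titles below are invented examples, not drawn from the dataset: the Issue is phrased at the requirement level and the PR at the implementation level, so they share almost no vocabulary.

```python
# Hedged sketch of the abstraction-gap failure mode under a lexical
# (TF-IDF) representation. Both titles are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

issue_title = "App should remember the user's preferred theme"      # conceptual
pr_title = "Persist dark mode flag in SharedPreferences on toggle"  # implementation

tfidf = TfidfVectorizer().fit_transform([issue_title, pr_title])
sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# No overlapping terms, so the lexical similarity is zero and the
# pair lands in different clusters.
print(sim)  # prints 0.0
```

A semantic weighting can partly compensate for this mismatch, but only when the conceptual and implementation vocabularies are related in the underlying semantic model.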

3) SHORT TITLES
We found some instances of PRs and Issues with very short titles (i.e., only one or two words). When the inputs are sufficient for comparison, the prediction of the clustering model is more accurate; short texts provide insufficient input to the clustering model, which leads to misprediction.
As shown in Figures 8, 10, and 12, the Comment feature performs poorly in recovering PR-Issue links, for two reasons. Firstly, many Issues (1296 Issues) do not have any Comments. Secondly, comments are usually written by different types of contributors with different points of view, including reviewers, developers, and other involved people, so each Comment addresses a different concern at a different level of abstraction. This yields comments with widely varying textual similarity, which provides misleading information to the ML prediction model.

F. THREATS TO VALIDITY
We identified a set of issues as potential threats to the validity of this research work. These threats are as follows.
As an internal threat, the clustering algorithms used are sensitive to the vocabulary used in the Titles, Bodies, and Comments of linked PRs and Issues. If different vocabularies are used, the accuracy of the prediction model may degrade. However, this threat is shared by all research works that employ textual matching to compute similarity.
As an external threat, the effectiveness of the KMeans and BIRCH algorithms is evaluated using an Android-based ground-truth dataset. This gives a first-glance impression that these clustering algorithms cannot be applied to other types of applications. In principle, however, these algorithms work with any type of application hosted on GitHub.
Another external threat concerns the extensibility of our approach to recovering PR-Issue links from projects hosted on other social coding platforms. Our approach can be applied to other social coding platforms that provide PR-Issue linking features, such as GitLab and the integrated Bitbucket/JIRA. However, there are slight differences between GitHub and these platforms, arising from the different feature nomenclatures and metadata they support. For example, we can assume that GitLab's ''merge-request'' feature (Git merge) is equivalent to GitHub's ''pull-request'' feature (Git pull). Like GitHub, GitLab and Bitbucket/JIRA support linking Issues with pull-requests/merge-requests. However, GitLab supports only implicit links, by mentioning Issues in merge-request comments, whereas Bitbucket/JIRA supports only explicit links, using the integrated features provided by the JIRA issue tracking system. Moreover, metadata names differ between these social coding platforms; for example, we can assume that the Issue ''Body'' on GitHub is equivalent to the ''Description'' on GitLab and JIRA. In this paper, we focused on GitHub because we have a ground-truth dataset of explicit pull-request/issue links extracted from GitHub projects.

V. RELATED WORK
To the best of our knowledge, there is no research work in the literature on recovering the links between PRs and Issues on GitHub or other social coding platforms. The closest works in the literature aim to recover Issue-Commit links rather than PR-Issue links, and they are evaluated using different, non-benchmark datasets. In this section, we present the most relevant and recent works related to ours. These research works fall into two categories, depending on the type of linked software artefacts: recovering links between Issues and Commits, and recovering links between other software artefacts.

A. RECOVERING LINKS BETWEEN ISSUE-COMMIT
In the literature, many approaches have been proposed for identifying and recovering Issue-Commit links based on issue tracking systems (such as Bugzilla) and version control systems (such as Git). The traditional approach for recovering such links involves searching for Issue identifiers in version control system logs [38], [39], [40], [41]. Many researchers, however, found that these traditional approaches are ineffective because they can only retrieve a limited number of Issue-Commit links while missing a significant number, since developers do not always provide Issue IDs in version control system logs.
Heuristic-based approaches have been proposed to identify or recover Issue-Commit links. Bachmann et al. [42] and Bird et al. [43] introduced a semi-automated tool called LINKSTER to facilitate recovering Issue-Commit links. Wu et al. [44] proposed a tool called ReLink that recovers such links using three heuristics: the Issue report and commit should be textually similar, their resolution times should be the same or close to each other, and their contributors should be the same. ReLink recovered Issue-Commit links with comparable precision and substantially higher recall than traditional heuristic approaches (89% precision and 78% recall, versus 91% precision and 64% recall). In [45], Bissyande et al. proposed an improvement to ReLink. In [46], Nguyen et al. proposed another heuristic-based approach called MLink, which uses code heuristics to recover Issue-Commit links based on the similarity of code snippets in Issue reports and Commits. Their empirical comparison with related approaches showed that MLink improved Issue-Commit link recovery by 17%, 13-17%, and 11-18% in precision, recall, and F-score, respectively.
Another direction in the literature recovers Issue-Commit links using machine learning algorithms. In [47], Le et al. implemented a clustering tool (called RCLinker) based on the textual similarity and metadata of Issues and Commits to predict Issue-Commit links. The accuracy results of RCLinker, reported on six software projects, outperform MLink in F-measure by 138.66%. Similarly, Rath et al. [48] employed textual similarity and metadata to recover such links. In [49], Sun et al. proposed two approaches (called FRLink and PULink) to recover Issue-Commit links. FRLink was evaluated on the same software projects used by RCLinker, and the results showed that FRLink outperforms RCLinker in F-measure by 40.75%.

B. RECOVERING LINKS BETWEEN OTHER SOFTWARE ARTIFACTS
The research in this category focuses on recovering links with different semantics between different software artefacts like documentation, source code, requirements, design components, etc.
In [50], Antoniol et al. introduced a probabilistic approach for recovering Code-Documentation links. In [51], Hayes et al. proposed a vector space model (VSM) based approach to recover Code-Requirement links. In [52], Eyal Salman et al. proposed an approach to link functional features with the source code elements that implement them, using a combination of information retrieval (IR) and formal concept analysis (FCA). In [53], Ali et al. introduced an approach, called COPARVO, for identifying links between code and its requirements; COPARVO divides the code into different sources of information (i.e., class names, comments, method names, and class variables) using VSM. In [54], Poshyvanyk et al. proposed an approach that integrates latent semantic indexing with a dynamic technique called scenario-based probabilistic ranking (SPR), which is used to link features with their corresponding source code methods.

VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed an approach to recover links between PRs and Issues in collaborative software development repositories on GitHub, based on the textual content of Issues and PRs. The proposed approach investigates the results of two clustering algorithms (KMeans and BIRCH) in recovering these links based on semantic and lexical measures. To do so, we trained KMeans and BIRCH using different features on the ground-truth dataset for this subject. The results show that PR-Issue links can be recovered with an accuracy of 91.5% using the BIRCH clustering algorithm.
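The overall pipeline can be summarised as a minimal sketch, assuming TF-IDF weighting of Title texts and scikit-learn's BIRCH implementation; the titles, cluster count, and parameters below are illustrative placeholders rather than the exact configuration used in our evaluation.

```python
# Minimal sketch of a TF-IDF + BIRCH link-recovery pipeline.
# Titles and n_clusters are illustrative placeholders.
from sklearn.cluster import Birch
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "Fix crash when rotating the screen",                     # Issue
    "Handle configuration change to stop crash on rotation",  # PR
    "Add dark theme support",                                 # Issue
    "Implement dark theme toggle in settings",                # PR
]

# Lexical weighting: each title becomes a TF-IDF vector.
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)

# One cluster per candidate PR-Issue link (2 links in this toy example).
labels = Birch(n_clusters=2).fit_predict(vectors.toarray())

# A PR and an Issue falling into the same cluster is treated as a
# recovered link.
print(labels)
```

In this toy run, the crash-related Issue and PR share one cluster and the theme-related pair share the other, which is the grouping the approach interprets as recovered links.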
We plan to employ deep-learning techniques to recognize the rules and patterns between similar Issues and similar PRs that reside in different GitHub repositories. Then, when a developer faces a new Issue, the PR associated with a similar Issue could be found in another repository, allowing the developer to benefit from other developers' experience. Ultimately, this would help create a recommendation system for software engineering.