Using Tri-Relation Networks for Effective Software Fault-Proneness Prediction

Software modules and developers are two core elements during the process of software development. Previous studies have shown that analyzing relations between (1) software modules; (2) developers; and (3) modules and developers is critical to understanding how they interact with each other, which ultimately affects software quality. Specifically, relations such as the developer contribution relation, the module dependency relation, and the developer collaboration relation have been used independently or in pairs to build networks for software fault-proneness prediction. However, none of these studies investigates the joint effect of these three relations. Bearing this in mind, in this paper, we propose a tri-relation network, a weighted network that integrates developer contribution, module dependency, and developer collaboration relations to study their combined impact on software quality. Four network node centrality metrics are further derived from the proposed network to predict the fault-proneness of a given software module at the file level. Moreover, we have explored a mechanism to refine current networks in order to further improve the effectiveness of software fault-proneness prediction.

Meanwhile, social network analysis has been frequently applied in software engineering. Ghosh [24] reports that many open source projects at SourceForge are organized as social networks. Xu et al. [85] classify people working on an open source project at SourceForge into project leader, core developer, co-developer, and active user. Ohira et al. [52] apply social network analysis and collaborative filtering to identify experts across different projects. Howison et al. [30] also use data collected from SourceForge to investigate how the social structures in projects are changing.
The associate editor coordinating the review of this manuscript and approving it for publication was Hui Liu.
¹ In this paper, we use ''bugs'' and ''faults'' interchangeably. We also use the terms ''software'' and ''program'' interchangeably.
With regard to software quality control, a study conducted by Cataldo et al. [16] indicates that logical dependency (i.e., two files are modified in the same commit) is a more accurate representation of product dependency affecting the development effort, and it also explains most of the variance in fault-proneness [17]. Bird et al. [12] and Meneely [46] point out that ownership (e.g., a developer contributes a commit on a software module) can have a strong relationship to software defects. Ell [21] and Simpson [64] use the Failure Index (FI) to determine the failure-inducing possibility of developer pairs in developer social networks. In general, these studies characterize software quality from a specific relation between either developers, modules, or developers and modules.
In this paper,² three types of relations during the process of software development are investigated: the developer contribution relation (who works on which software modules), the module dependency relation (which modules are dependent on others), and the developer collaboration relation (which developers work together on the same modules). These relations have been used independently or in pairs in social network analysis to construct different networks to predict which modules are likely to contain faults at different levels, such as the developer contribution network (DCN) [59], module dependency network (MDN) [87], socio-technical network (STN) [11], [62], and developer collaboration network (DN) [46]. Encouraging results in prior research indicate that software modules that play key roles and are central in these networks tend to be more fault-prone than modules in the surrounding areas of the network [11], [46], [50], [59], [87]. Although these networks are useful for fault-proneness prediction, they are built either from a single relation or from a pair of the relations mentioned above. We therefore propose the Tri-Relation Network (TRN), a weighted social network that integrates all three types of relations. Four network node centrality metrics are correspondingly derived from TRN. The design of TRN not only merges the features of DCN, MDN, STN, and DN, but also includes an additional relationship (i.e., logical dependency). Moreover, a calibration mechanism based on developer quality [40] for edge weights on TRN and the other four networks is explored for further enhancement. After all, it is developers who make mistakes and introduce faults into software. Case studies are conducted on six software projects to evaluate the effectiveness of TRN-based metrics in predicting software fault-proneness. In our study, we answer the following three research questions, which are thoroughly discussed later in Section V.
R1 Are centrality metrics derived from TRN important indicators for the number of post-released bugs in a file?³
R2 Do centrality metrics derived from TRN effectively improve software fault-proneness prediction models?
R3 Will the fault-proneness prediction effectiveness be improved by applying the proposed edge calibration mechanism on TRN and the other four networks?
The remainder of the paper is structured as follows. Section II presents related work from which the proposed TRN arises. Section III explains the proposed TRN. Four network node centrality metrics used in our study and ten software metrics that are commonly used for predicting fault-proneness are introduced in Section IV. Our case studies are detailed in Section V. Section VI discusses some threats to the validity of our study. Finally, our conclusions and plans for future work appear in Section VII.

II. RELATED WORK
The TRN arises from four existing types of networks that have been developed for predicting software fault-proneness: the developer contribution network (DCN), the module dependency network (MDN), the socio-technical network (STN), and the developer collaboration network (DN).
³ Although we use the term ''software module'' in previous paragraphs, we want to emphasize that a software module is a generic term to represent a unit of a software system. It can have different representations depending on how a software system is described on a particular architectural level. For example, it can represent a single function, a single class, or a single file. In our case studies, a software module refers to a file.

A. DEVELOPER CONTRIBUTION NETWORK (DCN)
A contribution network is an undirected graph G that is formally defined as G = (D, N, E). D and N are the two sets of vertices (D the set of developers and N the set of software modules), and E is a set of edges between vertices. An edge e ∈ E denotes a contribution of a developer d ∈ D to a module n ∈ N. A contribution refers to a commit of a developer to a module. Edges are always between a developer and a module, and there are no self-loops (i.e., neither modules nor developers can contribute to themselves). Edge weights are used to denote the number of commits a developer has made to a module. Figure 1 depicts a sample developer contribution network. Circles represent developers, rectangles represent software modules, and edges represent developer contributions to modules. For example, developer Bob has made 6 commits to Module A. Developer Dan has made 3, 1, and 4 commits to Modules A, B, and C, respectively.
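As an illustration of the definition G = (D, N, E), the following minimal sketch stores a contribution network as a weighted adjacency map. It only encodes the edges spelled out in the text for Figure 1 (the figure itself may contain more), and the function name `build_dcn` and the node tagging scheme are assumptions made for this example.

```python
from collections import defaultdict

# Commit counts per (developer, module), as described in the text for Figure 1.
commits = {
    ("Bob", "A"): 6,
    ("Dan", "A"): 3,
    ("Dan", "B"): 1,
    ("Dan", "C"): 4,
}


def build_dcn(commit_counts):
    """Return an undirected, weighted bipartite graph as an adjacency map.

    Nodes are tagged ("dev", name) or ("mod", name) so that developers and
    modules stay in separate vertex sets; every edge joins one of each.
    """
    graph = defaultdict(dict)
    for (developer, module), count in commit_counts.items():
        dev_node = ("dev", developer)
        mod_node = ("mod", module)
        graph[dev_node][mod_node] = count
        graph[mod_node][dev_node] = count
    return dict(graph)


dcn = build_dcn(commits)
module_a = dcn[("mod", "A")]
# Number of contributing developers and total commits received by Module A.
print(len(module_a), sum(module_a.values()))
```

With these four edges, Module A is connected to two developers and carries a total edge weight of nine commits.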

B. MODULE DEPENDENCY NETWORK (MDN)
Zimmermann and Nagappan [87] construct a network from dependency information for software modules in Windows Server 2003. They also find that social network analysis-based metrics derived from the dependency network are good indicators of the number of post-release faults and module fault-proneness, which is consistent with the results presented in [50] and [59]. Generally, a dependency network models the dependency relationships (e.g., call graphs, class inheritance, class coupling, etc.) between software modules within a software system. It is a directed graph that is formally defined as G = (N, E), where N is the set of software modules and E is the set of directed edges such that (n1, n2) ∈ E if Module n1 has a dependency on Module n2. Figure 2 shows a simple dependency network where rectangles represent software modules and directed edges represent module dependency relationships. For example, Module A has a dependency on

C. SOCIO-TECHNICAL NETWORK (STN)
In [11], Bird et al. argue that the dependency relations and contribution history should be used together for fault-proneness prediction. They construct a socio-technical network by combining the developer contribution network and the module dependency network.
In the socio-technical network, there is a bidirectional dashed edge (denoted as the contribution edge) between a developer and a software module if the developer has made a commit to the module. The weight on the contribution edge is set as the number of commits from a developer to a module, and the weight of module dependencies is set to 1. Figure 3 shows a sample socio-technical network with 2 developers and 5 modules. For example, developer Bob has made 6, 2, and 3 commits to Modules A, E, and B, respectively. Module E has dependencies on Modules B, C, and D.

D. DEVELOPER COLLABORATION NETWORK (DN)
Meneely et al. [46] construct a developer collaboration network consisting solely of developers in which edges between developers are based on collaboration on common modules. The authors use social network analysis to assign values of metrics to developers. The value of a metric for a module is based on the values of the developers that contributed to that module (e.g., the sum of a metric for developers for a module).
Figure 4 depicts a sample developer network with 4 developers. Circles represent developers and edges represent common files that two developers have both worked on in a particular release. For example, developers Bob and Pan have both worked on Module A during Release R1.
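The construction described above can be sketched as follows: two developers are linked whenever they have worked on at least one common module in a release. The contribution data, the developer name `Eve`, and the function name `build_dn` are hypothetical and are not taken from Figure 4.

```python
from itertools import combinations


def build_dn(module_developers):
    """Map each unordered developer pair to the set of modules they share."""
    shared = {}
    for module, developers in module_developers.items():
        # Sorting gives every pair one canonical (alphabetical) orientation.
        for dev_a, dev_b in combinations(sorted(developers), 2):
            shared.setdefault((dev_a, dev_b), set()).add(module)
    return shared


# Hypothetical record of who touched which module during one release.
module_developers = {
    "A": {"Bob", "Pan"},
    "B": {"Bob", "Pan", "Eve"},
    "C": {"Eve"},
}

dn = build_dn(module_developers)
for pair, modules in sorted(dn.items()):
    print(pair, sorted(modules))
```

The size of each shared-module set is the natural raw edge weight, and a module-level value can then be obtained by aggregating a developer-level metric (for example, summing it) over the developers of that module.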
As mentioned above, DCN, MDN, STN, and DN are built either from a single relation or from a pair of relations. These networks also seem to miss an important factor, developer quality, as it is developers who make mistakes and create bugs during software development. Therefore, we propose an enhanced social network that integrates the features of these four networks with additional adjustments.

III. THE PROPOSED TRI-RELATION NETWORK (TRN)
The motivation behind TRN is that a network integrating developer contribution, module dependency, and developer collaboration can provide a more comprehensive insight into the interactions between developers and modules than networks based on either a single or a paired relation. This insight is expected to ultimately enhance the effectiveness of software fault-proneness prediction.
In a TRN, there is a directed edge (denoted as the developer contribution) between a developer and a software module if the developer has made a commit to the module between two consecutive releases (e.g., between Release R and Release R + 1). The weight on the contribution edge is set as the normalized⁴ number of commits made from a developer to a module between Release R and Release R + 1. Dependencies between modules are represented as directed dash-dot edges with arrows pointing to the modules upon which other modules depend. It is worth noting that we consider two types of dependency: functional dependency (i.e., a function in a file calls another function in a different file) and logical dependency (i.e., two files are modified in the same commit). We believe the use of both dependency types provides a more accurate representation of module dependencies affecting the development effort. The well-known commercial tool Understand from SciTools [69] is used to quantify the normalized dependencies between two modules. The weights for logical dependency between two modules are computed as the normalized number of times that these two modules are modified in the same commit. The resulting module dependency is computed as the sum of normalized functional dependency and normalized logical dependency. In addition, there is a bidirectional dotted edge between one developer and another if these two developers have made at least one commit on the same module between Release R and Release R + 1. The weight on the collaboration edge is computed as the normalized number of modules two developers have worked on together between Release R and Release R + 1. Figure 5 presents a TRN with 3 developers and 4 modules. For example, the weight on the developer contribution edge Bob-to-Module A is 0.3. The weight on the module dependency edge Module A-to-Module C is 0.2 (normalized functional dependency) + 0.4 (normalized logical dependency) = 0.6. The weight on the developer collaboration edge Bob-to-Dan is 0.1.
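The three edge types and their weights can be sketched as below. The normalization scheme is not specified in this section, so the sketch assumes division by the largest raw count within each relation. The raw counts are hypothetical values chosen only so that the resulting weights match the ones quoted for Figure 5, and `normalize` and `build_trn` are illustrative names.

```python
def normalize(counts):
    """Scale raw counts by the largest count (an assumed normalization)."""
    peak = max(counts.values())
    return {key: value / peak for key, value in counts.items()}


def build_trn(commits, functional, logical, shared_modules):
    """Combine the three relations into one weighted edge map."""
    edges = {}
    # Developer contribution edges: developer -> module.
    for (dev, mod), weight in normalize(commits).items():
        edges[("contribution", dev, mod)] = weight
    # Module dependency edges: normalized functional + normalized logical.
    func_n = normalize(functional)
    logi_n = normalize(logical)
    for pair in set(func_n) | set(logi_n):
        edges[("dependency",) + pair] = func_n.get(pair, 0.0) + logi_n.get(pair, 0.0)
    # Developer collaboration edges are bidirectional, so store both directions.
    for (dev_a, dev_b), weight in normalize(shared_modules).items():
        edges[("collaboration", dev_a, dev_b)] = weight
        edges[("collaboration", dev_b, dev_a)] = weight
    return edges


# Hypothetical raw counts between Release R and Release R + 1.
commits = {("Bob", "A"): 3, ("Dan", "A"): 10, ("Dan", "C"): 5}
functional = {("A", "C"): 1, ("B", "C"): 5}
logical = {("A", "C"): 2, ("A", "B"): 5}
shared_modules = {("Bob", "Dan"): 1, ("Dan", "Jim"): 10}

trn = build_trn(commits, functional, logical, shared_modules)
print(round(trn[("contribution", "Bob", "A")], 2))
print(round(trn[("dependency", "A", "C")], 2))
print(round(trn[("collaboration", "Bob", "Dan")], 2))
```

Keeping the relation name in the edge key makes it straightforward to treat the edge types differently later, for instance when each type is calibrated with its own rule.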

IV. METRICS
In this section, we first introduce four network node centrality metrics that are used in our study. Then we present another ten software metrics that are commonly used for predicting software fault-proneness.
Network node centrality metrics stem from social network theory and are used to quantify the location of a node relative to the rest of the network. There are three types of network node centrality⁵: (1) degree centrality, (2) closeness centrality, and (3) betweenness centrality. Degree centrality metrics are computed based on the number of edges that a node has. The more edges a node has, the more central the node is. Two degree centrality metrics are used in our study: Freeman degree centrality (denoted as M_FDC) and Bonacich's power (denoted as M_BP). M_FDC here is calculated as the number of direct edges a node has to its neighbors. M_BP is based on the adjacencies. It takes into account the connections of one's connections, in addition to one's own connections. For a module, M_FDC and M_BP focus on the number of developers and other modules it directly connects to, i.e., the impact of direct interactions on the module. The more people that are working on the module, the higher the probability of introducing faults due to inconsistent coding style, especially when these people have never worked or communicated with each other before. Due to the direct dependency relationship, the more changes that are made on its neighbor modules, the higher the probability that appropriate changes should be made on the module accordingly, and thus the more difficult it is to maintain the module. Closeness centrality emphasizes the distance of a node to other nodes in the network. In this paper, we use one such node distance measure: eigenvector of geodesic distances (denoted as M_EGD). M_EGD finds the most central nodes (i.e., those with the smallest farness from others) in terms of the ''global'' or ''overall'' structure of the network, and pays less attention to patterns that are more ''local''. Specifically, M_EGD applies factor analysis to identify ''dimensions'' of the distances among nodes. The location of each node with respect to each dimension is called an ''eigenvalue'', and the collection of such values is called the ''eigenvector''. Usually, the first dimension captures the ''global'' aspects of distances among nodes; second and further dimensions capture more specific and local sub-structures. Betweenness centrality denotes the extent to which information flows through a node to get from one node to another. The more information flows through a node, the higher its betweenness centrality. For betweenness centrality, we use one such metric, Freeman node betweenness (denoted as M_FNB). It counts how frequently each node falls in the geodesic paths between all pairs of nodes. M_EGD and M_FNB focus on the connection strength of the module to all modules and developers that it either directly or indirectly connects to. The closer the module is to other modules and developers, the stronger the connection, and the more likely the module is to be affected by other modules and developers in some way. In total, these four centrality metrics (i.e., M_FDC, M_BP, M_EGD, and M_FNB), which are also widely used in previous studies [11], [46], [50], [59], [87], are used in our study. We use a tool, Ucinet [8], to compute the values of these four centrality metrics based on the instructions given by Hanneman and Riddle [29].
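To make two of these definitions concrete, the sketch below computes Freeman degree centrality and Freeman node betweenness on a small unweighted, undirected graph. It is not the Ucinet computation used in the study: edge weights are ignored, betweenness is obtained with Brandes' algorithm, and the toy graph simply reuses the developer and module names from the DCN description.

```python
from collections import deque


def degree_centrality(graph):
    """Freeman degree: the number of direct edges of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def betweenness_centrality(graph):
    """Freeman node betweenness via Brandes' algorithm (unweighted, undirected)."""
    score = {node: 0.0 for node in graph}
    for source in graph:
        stack = []
        preds = {node: [] for node in graph}
        sigma = {node: 0 for node in graph}   # number of shortest paths from source
        dist = {node: -1 for node in graph}
        sigma[source] = 1
        dist[source] = 0
        queue = deque([source])
        while queue:                          # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {node: 0.0 for node in graph}
        while stack:                          # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair is visited from both endpoints, so halve the totals.
    return {node: value / 2 for node, value in score.items()}


# Toy graph: Bob - A, Dan - A, Dan - B, Dan - C.
graph = {
    "Bob": ["A"],
    "Dan": ["A", "B", "C"],
    "A": ["Bob", "Dan"],
    "B": ["Dan"],
    "C": ["Dan"],
}

degree = degree_centrality(graph)
betweenness = betweenness_centrality(graph)
print(degree["Dan"], betweenness["Dan"], betweenness["A"])
```

In this toy graph Dan lies on the shortest path of five node pairs and Module A on three, which matches the intuition that nodes bridging otherwise separate parts of the network score highest.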
Additionally, we introduce another ten software metrics, as shown in TABLE 1, that are commonly used for predicting software fault-proneness, including Lines of Code [51], McCabe Complexity [45], all six CK metrics [18], number of commits [47], and number of developers [58], all of which will be used later in our case studies.We use a tool, Understand [69], to compute the values of these ten metrics.
For the sake of both simplicity and consistency, we use the notations provided in TABLE 2. For example, X represents a generic weighted network such as TRN, DCN, MDN, STN, or DN. M_Cen represents a generic network node centrality metric (e.g., M_FDC, M_BP, M_EGD, or M_FNB). M_X-Cen represents an M_Cen derived from X. For example, M_TRN-FDC represents the FDC network node centrality metric derived from a TRN. Meanwhile, all four network node centrality metrics derived from a TRN (i.e., M_TRN-FDC, M_TRN-BP, M_TRN-EGD, and M_TRN-FNB) can now be simplified to M_TRN. We use M_CO to denote a metric set that contains the ten software metrics described in TABLE 1. In addition, we use (M_X) and (M_CO) to denote a software fault-proneness prediction model using all four network node centrality metrics derived from X and a prediction model using the ten commonly used metrics, respectively. Since for each of the five networks we have four centrality metrics, a total of 20 metrics can be derived, as shown in TABLE 3.

V. CASE STUDIES
In this section, we examine the three research questions related to our study, followed by a discussion of the software programs and the data analysis techniques used in our case studies. Results are presented at the end of this section.

A. THREE RESEARCH QUESTIONS
Here we simply revisit the three research questions from Section I.
R1 Are centrality metrics derived from TRN important indicators for the number of post-released bugs in a file?
R2 Do centrality metrics derived from TRN effectively improve software fault-proneness prediction models?
R3 Will the fault-proneness prediction effectiveness improve if the proposed edge calibration mechanism is applied to TRN and the other four networks?
Answers to these three questions can help determine whether TRN-based centrality metrics are more powerful in building fault-proneness prediction models than not only DCN-, MDN-, STN-, or DN-based centrality metrics, but also software metrics that are commonly used for fault-proneness prediction. Moreover, valuable insights will be gained regarding the contributing factors that can be used to refine current networks in order to further enhance the prediction effectiveness.

B. SIX SOFTWARE PROGRAMS STUDIED
Our experiments use six programs: Camel [1], Flume [2], Tika [3], Gedit [23], Nginx, and Redis. Camel is an open-source integration framework to define routing and mediation rules in a variety of domain-specific languages. Flume is a distributed service for collecting, aggregating, and moving log data from different sources to a centralized data store. Tika detects and extracts metadata and text from different file types such as PPT, XLS, and PDF. Gedit is the GNOME text editor. Nginx is an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server. Redis is an open source, in-memory data structure store, used as a database, cache, and message broker. TABLE 4 summarizes the information for these six programs used in our case studies. The columns, starting from the left, give project name, release version, lines of code (including blanks and comments), number of files, and number of faulty files. Each program contains two consecutive releases. The values of all metrics are collected at file level.

C. EXPERIMENTAL METHODOLOGY
In order to answer R1, we use the Spearman rank correlation coefficient [65] to measure the correlation between each metric (described in TABLE 5 through TABLE 10) and the number of post-released bugs for Camel 1.4.0, Flume 1.5.0, Tika 1.6, Gedit 2.25.4, Nginx 1.3.0, and Redis 2.6.0.8, respectively. The coefficient is between +1 and −1, inclusive, in which +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.
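For readers unfamiliar with the coefficient, the sketch below computes Spearman's rank correlation in plain Python as the Pearson correlation of average ranks, which handles tied values. The per-file centrality values and bug counts are hypothetical, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-file centrality values and post-release bug counts.
centrality = [0.9, 0.4, 0.7, 0.1, 0.5]
bugs = [5, 1, 3, 0, 1]
rho = spearman(centrality, bugs)
print(round(rho, 4))
```

Because only ranks are compared, the coefficient is insensitive to the scale of the centrality metric, which is convenient when metrics from differently weighted networks are compared against the same bug counts.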
In order to answer R2, we use a data mining tool, Weka [72], to construct different fault-proneness prediction models. For each program, a total of six datasets are formed from two consecutive software releases (say Release 1 and Release 2): five based on TRN, DCN, MDN, STN, and DN, and one consisting of the ten commonly used metrics (denoted as the CO-based dataset). We use all data points collected from Release 1 as the training set. Because of the class imbalance issue in the training set, we apply SMOTE [43], [70], [86] to oversample the minority class (i.e., classified as fault-prone) so that the number of fault-prone samples is equal to the number of non-fault-prone samples. We then randomly split all fault-prone samples in the training set into 30 equal-sized groups, and do the same for all non-fault-prone samples. Later, we randomly combine one fault-prone group with one non-fault-prone group to make one training subset. In this way, we have 30 training subsets. ...
⁶ We use BayesNet in our experiment because it is robust to overfitting and does not assume data independence. As a matter of fact, many machine learning techniques such as neural networks [7], [37], [67], decision trees [6], [27], [28], case-based reasoning [36], [38], [55], Naïve Bayes [15], [31], [44], fuzzy logic [56], logistic regression [5], [9], [16], SVM [20], [25], [26], random forests [39], [63], and so on have been used for predicting software fault-proneness in the past. We want to emphasize that the focus of this study is to evaluate the prediction effectiveness of the metrics derived from newly designed social networks. All these techniques can be used to train our dataset. However, the selection of the training algorithm used in the experiment is beyond the scope of this study.
... (3) and (4), the precision is 25/(25+10) ≈ 71.43% and the F1 is (2 × 71.43% × 83.33%)/(71.43% + 83.33%) ≈ 76.92%. For two different fault-proneness prediction models, Model 1 and Model 2, if Model 1 has a higher recall or F1 than Model 2, then it can be said that Model 1 is more effective than Model 2 with respect to recall or F1. If Model 1 has a lower FPR than Model 2, then it can be said that Model 1 is more effective than Model 2 with respect to FPR.
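The worked numbers above can be reproduced with the standard confusion-matrix definitions of precision, recall, FPR, and F1. In the sketch below, 25 true positives and 10 false positives come from the precision computation in the text, 5 false negatives are inferred from the quoted recall of 83.33% (25/30), and the true-negative count of 60 is purely hypothetical, so the resulting FPR is illustrative only.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard definitions computed from confusion-matrix counts."""
    precision = tp / (tp + fp)          # predicted fault-prone files that are faulty
    recall = tp / (tp + fn)             # faulty files that are correctly predicted
    fpr = fp / (fp + tn)                # non-faulty files flagged as fault-prone
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "fpr": fpr, "f1": f1}


# tp and fp follow the text; fn is inferred from the recall; tn is hypothetical.
metrics = classification_metrics(tp=25, fp=10, fn=5, tn=60)
for name, value in metrics.items():
    print(f"{name}: {value * 100:.2f}%")
```

Recall and F1 reward finding faulty files, while FPR penalizes false alarms, which is why a model is considered more effective when its recall and F1 are higher and its FPR is lower.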
For each program, we compute and compare the respective average recall and FPR of 30 (M_TRN), 30 (M_DCN), 30 (M_MDN), 30 (M_STN), 30 (M_DN), and 30 (M_CO). For example, regarding R2, if (M_TRN) has a higher average recall than (M_DCN), then (M_TRN) is more effective than (M_DCN) with respect to average recall.
In addition, we employ the paired Wilcoxon signed-rank test [60] to investigate R1 and R2. For example, regarding R1, we can make the following null hypothesis with respect to the computed Spearman rank correlation coefficient using M_TRN-Cen and the same coefficient using M_DCN-Cen: H0: The computed Spearman rank correlation coefficient using M_TRN-Cen is equal to or smaller than the computed Spearman rank correlation coefficient using M_DCN-Cen. If H0 is rejected (i.e., the alternative hypothesis is accepted), then it implies that M_TRN-Cen is more correlated with the number of post-released bugs than M_DCN-Cen.
Regarding R2, we can make the following null hypothesis with respect to the recall of (M_TRN) and (M_DCN): H0: (M_TRN) has equal or lower recall than (M_DCN). If H0 is rejected (i.e., the alternative hypothesis is accepted), then it implies that (M_TRN) will correctly predict more fault-prone files than (M_DCN). This also implies that (M_TRN) is more effective than (M_DCN) with respect to recall.
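A one-sided paired Wilcoxon signed-rank test of the kind described above can be sketched in plain Python as follows. This is an illustrative implementation with an exact p-value obtained by enumerating sign assignments through dynamic programming, not the procedure or tool used in the study; zero differences are dropped, ties receive average ranks, and the paired samples are hypothetical. The confidence values reported in such tests are commonly read as one minus the p-value, which is an assumption here.

```python
def wilcoxon_greater(x, y):
    """One-sided paired Wilcoxon signed-rank test (H1: x tends to exceed y).

    Returns (W+, exact p-value), where W+ is the sum of ranks of the
    positive differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    doubled = [0] * n          # twice the average rank, so values stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            doubled[order[k]] = i + j + 2
        i = j + 1
    w_plus = sum(rank for rank, diff in zip(doubled, diffs) if diff > 0)
    # counts[s]: number of sign assignments whose doubled positive-rank sum is s.
    counts = [1] + [0] * sum(doubled)
    for rank in doubled:
        for s in range(len(counts) - 1, rank - 1, -1):
            counts[s] += counts[s - rank]
    p_value = sum(counts[w_plus:]) / 2 ** n
    return w_plus / 2, p_value


# Hypothetical paired measurements (e.g., one value per metric or per run).
sample_a = [8, 6, 7, 5, 9]
sample_b = [5, 4, 6, 1, 3]
statistic, p_value = wilcoxon_greater(sample_a, sample_b)
print(statistic, p_value)
```

With five pairs that all favor the first sample, the exact one-sided p-value is 1/32, so H0 would be rejected at the 0.05 level.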
In order to answer R3, we first propose an approach to calibrate the edge weights of TRN and the other networks. The modified networks with calibrated edges are denoted as CaX (e.g., CaTRN). Later, we investigate the relationship between the centrality metrics derived from CaX and the number of post-released bugs, as well as the performance of predicting fault-proneness using CaX-based metrics.

D. RESULTS
To answer R1, we use the Spearman rank correlation coefficient to measure the correlation between each metric and the number of post-released bugs in a file.⁷ The results are shown in TABLE 5 through TABLE 10. Each entry in the tables gives the coefficient between a metric and the number of bugs. For example, let us look at the first row of TABLE 5. The correlation between metric M_TRN-FDC and the number of bugs is 0.79, and the correlation between M_DCN-FDC and the number of bugs is 0.73. The corresponding correlations between metrics M_MDN-FDC, M_STN-FDC, and M_DN-FDC and the number of bugs are 0.62, 0.71, and 0.63, respectively. Therefore, the centrality metric M_FDC derived from TRN (i.e., M_TRN-FDC) has the strongest correlation with the number of bugs compared to the corresponding M_FDC derived from DCN (i.e., M_DCN-FDC), MDN (i.e., M_MDN-FDC), STN (i.e., M_STN-FDC), and DN (i.e., M_DN-FDC). Let us now look at the second column of the same table. The correlation between M_TRN-BP and the number of bugs is 0.55, the correlation between M_TRN-EGD and the number of bugs is 0.54, and the correlation between M_TRN-FNB and the number of bugs is 0.44. Therefore, the M_FDC derived from TRN (i.e., M_TRN-FDC) has the strongest correlation with the number of bugs compared to M_TRN-BP, M_TRN-EGD, and M_TRN-FNB, which are derived from the same TRN.
In general, from TABLE 5 to TABLE 10, we observe that: (1) M_TRN-FDC has the strongest correlation (i.e., 0.79) with the number of bugs among all metrics; (2) for any M_Cen derived from TRN, M_TRN-Cen has the strongest correlation with the number of bugs compared to the corresponding M_Cen derived from DCN (i.e., M_DCN-Cen), MDN (i.e., M_MDN-Cen), STN (i.e., M_STN-Cen), and DN (i.e., M_DN-Cen).
In addition, we use the paired Wilcoxon signed-rank test to investigate R1 from a statistical point of view. TABLE 11 presents the results of a Wilcoxon signed-rank test showing the confidence with which it can be claimed that M_TRN-Cen is more correlated with the number of bugs than the corresponding M_DCN-Cen, M_MDN-Cen, M_STN-Cen, and M_DN-Cen. Each entry in the table strengthens the conviction that the alternative hypothesis stands. Furthermore, for each program in TABLE 11, at the 0.05 level, the ... Let us look at the third row of TABLE 11; for Flume 1.5.0, it can be said with 97.97%, 99.98%, 98.97%, and 99.97% confidence that M_TRN-Cen is more correlated with the number of bugs than the corresponding M_DCN-Cen, M_MDN-Cen, M_STN-Cen, and M_DN-Cen. In general, from TABLE 11 it can be claimed with high confidence (at least 97%) that M_TRN-Cen is more correlated with the number of bugs than the corresponding M_DCN-Cen, M_MDN-Cen, M_STN-Cen, and M_DN-Cen for all six programs. If we change our alternative hypothesis to ''M_TRN-Cen is equally/more correlated with the number of bugs as/than the corresponding M_DCN-Cen, M_MDN-Cen, M_STN-Cen, and M_DN-Cen,'' then the confidence is 100% for almost every scenario.
Summary With Respect to R1:
• Metrics derived from the proposed TRN are significant indicators for the number of bugs in a file.
• Metrics derived from the proposed TRN are generally more correlated to the number of bugs than corresponding metrics derived from DCN, MDN, STN, and DN.
• The FDC metric derived from TRN, M_TRN-FDC, has the strongest correlation with the number of bugs among all metrics used in our case studies. This also indicates that for a software module (a file in our case), (1) the number of direct interactions with its contributing software developers, (2) the contribution frequency of these developers, (3) the number of modules with which it has a direct dependency relationship (both functional and logical), and (4) their mutual dependence intensity, jointly have a significant impact on the quality of the module itself.
To answer R2, for each program we compute and compare the average recall, FPR, and F1 score of (M_TRN), (M_DCN), (M_MDN), (M_STN), (M_DN), and (M_CO). The results are shown in TABLE 12 through TABLE 18. For example, in TABLE 17, the average recall, FPR, and F1 score of (M_TRN) are 69.79%, 4.60%, and 43.86%, respectively. In the same table, the average recall, FPR, and F1 score of (M_DCN) are 62.95%, 4.86%, and 36.07%, respectively. Therefore, (M_TRN) has a higher average recall, a lower average FPR, and a higher F1 score compared to (M_DCN). From TABLE 17, we observe that (M_TRN) has the highest average recall (i.e., 69.79%), the lowest average FPR (i.e., 4.60%), and the highest average F1 score (i.e., 43.86%) among all fault-proneness prediction models in the table. The same also applies to TABLE 18 and TABLE 17 where (M_TRN) has the highest average recall (i.e., 87.01%, 64.60%, 78.40%, 61.09%, and 72.20%), the lowest average FPR (i.e., 4.51%, 2.10%, 4.56%, 3.05%, and 3.15%), and the highest F1 score (i.e., 72.19%, 59.42%, 58.03%, 46.95%, and 62.67%) among all fault-proneness prediction models in these two tables.
Once again, from a statistical point of view, we employ the paired Wilcoxon signed-rank test to compare the recall and FPR of (M_TRN) against (M_DCN), (M_MDN), (M_STN), (M_DN), and (M_CO). TABLE 18 presents the results of a Wilcoxon signed-rank test showing the confidence with which it can be claimed that (M_TRN) is more effective (in terms of recall, FPR, and F1 score) than (M_DCN), (M_MDN), (M_STN), (M_DN), and (M_CO). Each entry in the table gives the assurance with which the alternative hypothesis stands. Furthermore, for each program in TABLE 18, at the 0.05 level, the recall/FPR/F1 distributions are different between (1) (M_TRN) and (M_DCN), (2) (M_TRN) and (M_MDN), (3) (M_TRN) and (M_STN), (4) (M_TRN) and (M_DN), and (5) (M_TRN) and (M_CO), respectively. To take an example from TABLE 18, it can be said with 99.98%, 99.99%, and 99.64% confidence that (M_TRN) has a higher recall, lower FPR, and higher F1 score, respectively, than (M_DCN) for Camel 1.4.0. It also implies that (M_TRN) is more effective than (M_DCN) in terms of recall, FPR, and F1 score, respectively. In general, from TABLE 18 we observe that it can be said with at least 99% confidence that (M_TRN) has a higher recall, lower FPR, and higher F1 score than the corresponding (M_DCN), (M_MDN), (M_STN), (M_DN), and (M_CO) for all six programs. If we change our alternative hypothesis to consider equalities, then the confidence is 100% for almost every scenario.
Summary With Respect to R2: Fault-proneness prediction models using network node centrality metrics derived from the proposed TRN are more effective than prediction models using the same metrics derived from DCN, MDN, STN, and DN as well as prediction models using the ten common metrics, in terms of recall, FPR, and F1 score.
To answer R3, we propose CaTRN.The motivation behind the construction of the CaTRN is to investigate whether integrating additional factors that describe the development effort in the current TRN will better present the interactions between developers and modules and therefore further improve the fault-proneness prediction using the metrics derived from CaTRN.Consequently, in order to construct a CaTRN, for each type of relation in a TRN, a particular mechanism is applied to further calibrate the corresponding relation strength (i.e., the weight on the corresponding edges).
Specifically, we introduce the developer risk score (DRS) [40], which quantifies the risk of a developer working on the modules, and use it for further edge weight calibration. DRS is based on two heuristics: (1) with respect to a given program, the more frequently a developer has introduced bugs in past releases, and the greater the severity of those bugs, the higher the risk that this program will contain a bug if the same developer makes a commit in the current release; and (2) the greater the complexity of a program, the more difficult it is for a developer to work on it and the higher the risk that the developer will introduce a bug into it.
For a given software system, assume that m j is the jth module in the system, and c k is the kth bug-introducing commit made by developer d in the jth module. We retrieve the bug severity of each bug-introducing commit (i.e., critical, major, minor, or trivial) from JIRA [33] and use the function SeverityScore(f j , c k , d, R−1) to map it to one of the following scores: 4 (critical), 3 (major), 2 (minor), and 1 (trivial). A score of 4 is assigned to the variable MaxSeverityScore. The bug severity ratio of the kth bug-introducing commit made by developer d in the jth module in release R−1 is defined as the ratio of SeverityScore(f j , c k , d, R−1) to MaxSeverityScore. The overall complexity value of the jth module in release R−1 is computed by Complexity(m j , R−1) as the sum of normalized LOC [51], McCabe Complexity [45], and all six CK metrics [18]. TotalCommits(d, R−1) gives the total number of commits made by developer d in release R−1. DRS(d, R), the developer risk score of developer d at release R, is defined in terms of these quantities: the bug severity ratios of the developer's bug-introducing commits, Complexity(m j , R−1), and TotalCommits(d, R−1).
To calibrate the weight of a developer contribution edge, we multiply the original weight (i.e., the number of commits made by a developer) by the DRS value of this developer (footnote 8). To calibrate the weight of a module dependency edge, we multiply the original weight (i.e., the quantified dependency value for a pair of modules) by the sum of the DRS values of the distinct developers who have worked on the two modules. To calibrate the weight of a developer collaboration edge, we multiply the original weight (i.e., the number of modules on which two developers have both worked) by the sum of the DRS values of these two developers. Let us assume the DRS values for developers Bob, Dan, and Jim are 0.5, 1.2, and 3, respectively. The resulting CaTRN is shown in Figure 7. Compared to TRN, CaTRN contains additional information by considering developer risk, program complexity, and bug severity, thus describing the development effort from a more comprehensive perspective.
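The three calibration rules can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the data layout, and the original edge weights in the example are hypothetical, and "distinct developers who have worked on the two modules" is read here as the union of both modules' developers, which is an assumption. Only the DRS values for Bob, Dan, and Jim and the median rule for new developers come from the text.

```python
from statistics import median

# DRS values from the running example in the text.
drs = {"Bob": 0.5, "Dan": 1.2, "Jim": 3.0}


def drs_of(dev, drs):
    """Return a developer's DRS. A newly joined developer (no DRS yet) gets
    the median of the currently available DRS values, as in footnote 8."""
    return drs[dev] if dev in drs else median(drs.values())


def calibrate_contribution(commits, dev, drs):
    """Developer-module edge: number of commits times the developer's DRS."""
    return commits * drs_of(dev, drs)


def calibrate_dependency(dep_weight, devs_a, devs_b, drs):
    """Module-module edge: dependency weight times the summed DRS of the
    distinct developers of the two modules (union; an assumption)."""
    distinct = set(devs_a) | set(devs_b)
    return dep_weight * sum(drs_of(d, drs) for d in distinct)


def calibrate_collaboration(shared_modules, dev_a, dev_b, drs):
    """Developer-developer edge: number of shared modules times the sum of
    the two developers' DRS values."""
    return shared_modules * (drs_of(dev_a, drs) + drs_of(dev_b, drs))


# Hypothetical original weights (not taken from Figure 7).
w_contrib = calibrate_contribution(4, "Dan", drs)
w_dep = calibrate_dependency(2, ["Bob", "Dan"], ["Dan", "Jim"], drs)
w_collab = calibrate_collaboration(1, "Bob", "Jim", drs)
```

Because every rule is a multiplication of the original weight, the topology of the network is unchanged; only the edge weights, and hence the weighted centrality metrics, differ between TRN and CaTRN.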
The same calibration strategy can also be applied to DCN, MDN, STN, and DN. As a result, a total of five modified networks are obtained (i.e., CaTRN, CaDCN, CaMDN, CaSTN, and CaDN). Once we have these modified networks, the corresponding network centrality metrics are derived from each of them. Then, we investigate R3 by repeating the data analysis used to investigate R1 and R2. TABLE 19 through TABLE 24 present the Spearman rank correlation coefficients between each metric derived from the corresponding modified networks and the number of bugs in a file. For example, in the first row of TABLE 19, the correlation between the FDC metric derived from CaTRN (i.e., M CaTRN−FDC ) and the number of bugs is 0.87. By comparison, the correlation for the same FDC metric derived from TRN (i.e., M TRN−FDC ) in TABLE 5 is 0.79. This indicates that the FDC metric derived from the modified TRN, which considers calibrated edge weights (i.e., M CaTRN−FDC ), has a stronger correlation with the number of bugs than the same FDC metric derived from the original TRN, which does not.
In general, from TABLE 19 to TABLE 24, we observe that: (1) the metrics derived from modified networks (i.e., M CaX−Cen ) have a stronger correlation with the number of bugs than the corresponding metrics derived from the original networks (i.e., M X−Cen ) as shown from TABLE 5 to TABLE 10; (2) M CaTRN−FDC has the strongest correlation with the number of bugs among all metrics derived from modified networks; and (3) each M Cen derived from CaTRN (i.e., M CaTRN−Cen ) has a stronger correlation with the number of bugs than the corresponding M Cen derived from CaDCN (i.e., M CaDCN−Cen ), CaMDN (i.e., M CaMDN−Cen ), CaSTN (i.e., M CaSTN−Cen ), and CaDN (i.e., M CaDN−Cen ).
8 For a newly joined developer, the DRS value is set to the median of the currently available DRS values.
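The comparison above rests on the Spearman rank correlation between a per-file metric and the per-file bug count. As a minimal sketch (pure Python, with ties given average ranks), it can be computed as the Pearson correlation of the two rank vectors. The data below are hypothetical and serve only to show the mechanics; they are not values from TABLE 5 or TABLE 19.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-file values: one centrality metric from an original
# network, the same metric from its calibrated counterpart, and bug counts.
metric_original = [0.10, 0.40, 0.35, 0.80, 0.20]
metric_calibrated = [0.05, 0.30, 0.45, 0.90, 0.25]
bugs = [0, 1, 2, 5, 1]

rho_original = spearman(metric_original, bugs)
rho_calibrated = spearman(metric_calibrated, bugs)
```

Since only ranks matter, rescaling all edge weights by a common factor would leave the coefficient unchanged; calibration affects the correlation only when it reorders files by their metric values.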
In addition, the results of the paired Wilcoxon signed-rank test in TABLE 25 indicate that, with high confidence (at least 98%), M CaX−Cen is more correlated with the number of bugs than M X−Cen . The confidence increases to 100% for almost every scenario when equalities are considered. In general, metrics derived from the modified networks that consider calibrated edge weights are more correlated with the number of bugs than the same metrics derived from the corresponding networks that do not.
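A paired one-sided Wilcoxon signed-rank test of this kind can be sketched as follows. This is an illustrative exact test in pure Python, not the authors' procedure: the function name and the paired correlation values are hypothetical, and reading the reported confidence as 1 − p is an assumption.

```python
def wilcoxon_one_sided(x, y):
    """Exact paired Wilcoxon signed-rank test of the alternative x > y.

    Zero differences are dropped and tied absolute differences share
    average ranks. Returns (W_plus, p_value)."""
    diffs = [round(a - b, 6) for a, b in zip(x, y)]  # rounding guards float noise
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled ranks, so tied (half-integer) ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = i + j + 2  # twice the average of positions i+1 .. j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    # Null distribution: each rank enters W+ independently with probability 1/2.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = sum(counts[w_plus2:]) / 2 ** n
    return w_plus2 / 2, p_value


# Hypothetical correlations of one metric across six programs.
calibrated = [0.87, 0.80, 0.75, 0.82, 0.78, 0.81]
original = [0.79, 0.74, 0.73, 0.77, 0.70, 0.80]

w_plus, p = wilcoxon_one_sided(calibrated, original)
confidence = 1 - p  # assumed reading of "confidence"
```

With only six paired observations, the smallest attainable one-sided p-value is 1/64, so under this reading the confidence for a six-program comparison cannot exceed about 98.4%.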
Similarly, we also compute the average recall, the average FPR, and the average F1 score of (M CaTRN ), (M CaDCN ), (M CaMDN ), (M CaSTN ), and (M CaDN ). The results are shown in TABLE 26 through TABLE 31.
• Metrics derived from the modified networks that consider edge weights calibrated using DRS are generally more correlated with the number of bugs than the same metrics derived from the corresponding networks that do not consider calibrated edge weights.
• Software fault-proneness prediction models using metrics derived from the modified networks that consider calibrated edge weight are more effective than prediction models using the same metrics derived from the corresponding networks that do not.
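For reference, the three effectiveness measures used in this comparison (recall, FPR, and F1 score) follow directly from the confusion matrix of a binary fault-proneness prediction. The sketch below uses a hypothetical function name and hypothetical labels, where 1 marks a fault-prone file and 0 a non-fault-prone file.

```python
def recall_fpr_f1(actual, predicted):
    """Compute recall, false positive rate, and F1 score for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # faulty, flagged
    fn = sum(1 for a, p in pairs if a and not p)      # faulty, missed
    fp = sum(1 for a, p in pairs if not a and p)      # clean, flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # clean, not flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, fpr, f1


# Hypothetical labels for ten files.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

recall, fpr, f1 = recall_fpr_f1(actual, predicted)
```

A more effective model has a higher recall and F1 score and a lower FPR, which is the sense in which the calibrated models are compared with the original ones.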

VI. THREATS TO VALIDITY

A. INTERNAL VALIDITY
The data analysis techniques used in Section V are suitable and commonly adopted for measuring the effectiveness of metrics in predicting software fault-proneness, but by themselves they do not provide a complete picture of the prediction performance. In particular, we can only observe correlation through statistical measures, not causation. To investigate possible causal effects, a root-cause analysis along each variable still needs to be carried out. During the experiment we apply oversampling using SMOTE, which may lead to overfitting. We consider this threat limited for three reasons. First, unlike traditional oversampling methods, which simply duplicate minority-class samples, SMOTE operates in the ''feature space'' rather than the ''data space'' and generates synthetic samples of the minority classes. In this way, it effectively forces the decision region of the minority class to become more general, partially solving the generalization/overfitting problem. Second, we apply SMOTE only to the training data, while the test data remain untainted. Third, we have tried random oversampling, and its prediction performance is not ideal due to the impact of overfitting. In other words, SMOTE is more appropriate and effective for handling data imbalance and overfitting in our case. Our work lays the foundation for future investigations by identifying potential connections between social interactions and software quality.
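The difference between duplicating minority samples and generating synthetic ones in the feature space can be made concrete with a simplified sketch of the SMOTE idea: each synthetic sample is a random interpolation between a minority sample and one of its nearest minority neighbours. The function name, parameters, and data are hypothetical; the tool and settings used in the experiments are not stated in the text.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (squared Euclidean distance). Needs at least two samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Hypothetical feature vectors of fault-prone files in the TRAINING split only;
# the test split is never oversampled.
train_minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0]]
synthetic = smote(train_minority, n_new=4)
```

Every synthetic point lies on a line segment between two existing minority samples, so the minority region is filled in rather than the same points being repeated.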

B. EXTERNAL VALIDITY
Only six open-source programs are used to investigate the four research questions. Therefore, our conclusions may not generalize either to projects developed in other programming languages or to commercial projects. Our program selection is based on the availability of the information needed to construct the TRN. To mitigate these threats, our study needs to be repeated on a wider variety of programs. Additionally, all data are collected from an issue tracking system, JIRA [33]. However, the information extracted from the issue tracking system may be incomplete. This potential threat can be alleviated by conducting more thorough data collection in future work. The developers of these programs may not be as well trained and experienced as average professional programmers. In this paper, we also use consecutive releases, which may prompt replication of the same bugs. However, using consecutive releases has been a common and successful practice in SFP. Our intention has never been to catch similar or identical bugs (according to some static code metrics extracted from modules) in the next release. Instead, by building a TRN, we study how module fault-proneness is affected by the interaction between connected developers and modules, as well as by the interaction among connected developers and among connected modules. This guides us in identifying fault-prone modules whose faults do not necessarily belong to the same type but result from a series of combined social and technical influences during development.

VII. CONCLUSIONS AND FUTURE WORK
Previous studies have used the developer contribution relation, module dependency relation, and developer collaboration relation to build networks for software fault-proneness prediction. However, none of these studies considers the combined influence of all three relations. Motivated by this, we integrate all three relations into one comprehensive network, our proposed tri-relation network (TRN). In addition, four network node centrality metrics (i.e., M FDC , M BP , M EDG , and M FNB ) are derived from the network to predict the fault-proneness of a given file on six programs. The results of our study indicate that (1) TRN-based centrality metrics are more correlated with the number of bugs than the corresponding DCN-, MDN-, STN-, and DN-based centrality metrics as well as the ten software metrics that are commonly used for software fault-proneness prediction; (2) fault-proneness prediction models using TRN-based centrality metrics outperform the models using DCN-, MDN-, STN-, and DN-based centrality metrics as well as the models based on the ten commonly used software metrics; (3) centrality metrics derived from a modified network that considers edge weights calibrated using the developer risk score are more correlated with the number of bugs than those derived from the same network without calibration; and (4) fault-proneness prediction models using centrality metrics derived from a modified network outperform the models using centrality metrics derived from the same network without calibration. In the future, we plan to repeat our study on a wider variety of programs and include additional software metrics for comparison to further validate the effectiveness of our TRN-based metrics. We also intend to search for potential intelligent algorithms for better prediction model training and further network refinement. In addition, it would be interesting to investigate whether our TRN-based centrality metrics can be used for cross-project software fault-proneness prediction when historical information is limited or unavailable. Last but not least, we also plan to tailor TRN to the needs of industrial enterprises.

FIGURE 1. A DCN with 2 developers and 3 software modules.

FIGURE 5. A TRN with 3 developers and 4 software modules.

TABLE 11. Confidence that M TRN−Cen is more correlated with the number of bugs than the corresponding M DCN−Cen , M MDN−Cen , M STN−Cen , and M DN−Cen .

FIGURE 7. A CaTRN with three developers and four modules.

TABLE 1. Ten commonly used software metrics.

TABLE 2. Notations relevant to metrics, networks, and fault-proneness prediction models.

TABLE 5. Correlation analysis using Spearman rank correlation coefficient for Camel 1.4.0, where correlation is significant at the 0.05 level.

TABLE 6. Correlation analysis using Spearman rank correlation coefficient for Flume 1.5.0, where correlation is significant at the 0.05 level.

TABLE 7. Correlation analysis using Spearman rank correlation coefficient for Tika 1.6, where correlation is significant at the 0.05 level.

TABLE 8. Correlation analysis using Spearman rank correlation coefficient for Gedit 2.25.4, where correlation is significant at the 0.05 level.

TABLE 9. Correlation analysis using Spearman rank correlation coefficient for Nginx 1.3.0, where correlation is significant at the 0.05 level.

TABLE 19. Correlation analysis using Spearman rank correlation coefficient for Camel 1.4.0, where correlation is significant at the 0.05 level.

TABLE 20. Correlation analysis using Spearman rank correlation coefficient for Flume 1.5.0, where correlation is significant at the 0.05 level.

TABLE 21. Correlation analysis using Spearman rank correlation coefficient for Tika 1.6, where correlation is significant at the 0.05 level.

TABLE 22. Correlation analysis using Spearman rank correlation coefficient for Gedit 2.25.4, where correlation is significant at the 0.05 level.

TABLE 23. Correlation analysis using Spearman rank correlation coefficient for Nginx 1.3.0, where correlation is significant at the 0.05 level.

TABLE 25. Confidence that M CaX−Cen is more correlated with the number of bugs than the corresponding M X−Cen .

TABLE 26 through TABLE 31.
For example, in TABLE 26, the average recall, FPR, and F1 score of (M CaTRN ) are 71.50%, 4.59%, and 44.82%, respectively. As stated previously, the average recall, FPR, and F1 score of the corresponding (M TRN ) in TABLE 12 are 69.79%, 4.60%, and 43.86%, respectively. This indicates that the fault-proneness prediction models based on the modified TRN that considers calibrated edge weights (i.e., (M CaTRN )) are more effective than the models based on the corresponding TRN that does not (i.e., (M TRN )) in terms of average recall, FPR, and F1 score. In general, from TABLE 26 to TABLE 31, we observe that: (1) fault-proneness prediction models based on modified networks that consider calibrated edge weights (i.e., (M CaX )) are more effective than the prediction models based on the corresponding networks that do not (i.e., (M X )) as shown from TABLE 12 to TABLE 17 in terms of average recall, FPR, and F1 score; and (2) (M CaTRN ) has the

TABLE 32. Confidence that (M CaX ) is more effective than the corresponding (M X ).