Extending Developer Experience Metrics for Better Effort-Aware Just-In-Time Defect Prediction

Developers use defect prediction models to efficiently allocate limited resources for quality assurance and appropriately make a plan for software quality improvement activities. Traditionally, defect predictions are conducted at the module level, such as the class or file level. However, a more recent trend is to perform defect prediction for a single or consecutive commits to the repository, which is known as just-in-time (JIT) defect prediction. JIT defect prediction finds error-prone changes instead of error-prone modules, and as a result, the developer only needs to investigate error-prone changed lines instead of the entire module. When building JIT defect prediction models, researchers used various metrics, including developer experience metrics which measure the developer’s experiences. Despite the fact that software defectiveness is likely to be affected by the experience of developers, developer metrics were understudied in the literature. In this work, we investigate the impact of various novel developer experience metrics and their combinations on JIT defect prediction. Our experimental results are positive. We found that it is possible to improve the cost-effectiveness of defect prediction models consistently and statistically significantly by using our new developer experience metrics.


I. INTRODUCTION
Defect prediction helps developers identify software artifacts that are likely to have defects. It provides a sorted list of the potentially defective artifacts (those that are likely to be more error-prone appear earlier in the list), based on which developers can make a quality improvement plan considering limited resources [1]. To predict the error-proneness of each software module, most defect prediction techniques use a prediction model trained with various code metrics, such as the code complexity [2] and change histories [3] known to be effective in estimating error-proneness.

The associate editor coordinating the review of this manuscript and approving it for publication was Francisco J. Garcia-Penalvo.
Although a defect prediction technique helps developers narrow down the modules to inspect, module inspection still requires a large effort because the module predicted to be error-prone may contain many lines of code (LOCs). To overcome this drawback of module-level defect prediction, recent studies have introduced just-in-time (JIT) defect prediction models. Given a set of code changes committed to the repository, JIT defect prediction models predict whether this set of code changes contains a defect. Because the number of changed lines at each commit is typically much smaller than the size of a potentially defective file, predicting defect-inducing code changes is known to be more effective in locating the defective lines. Moreover, JIT defect prediction techniques can provide feedback at an earlier stage of development than file-level defect prediction because defect prediction can be conducted as soon as a change is committed to the source code repository, rather than waiting until a new version of the software is ready for inspection [4], [5].
To predict whether a code change induces a defect, researchers have proposed various change-level metrics, some of which are shown in Table 1. These metrics measure, for example, how big a code change is (Size), how many modules are changed (Diffusion), which developer modified the code and how experienced the developer is (Experience).
To obtain accurate and useful prediction results from a defect prediction model, it is important to use metrics that are effective for defect prediction. To improve defect prediction performance, we propose novel metrics extending the existing developer experience metrics. Despite the fact that software defectiveness is likely to be affected by the experience of developers, developer metrics were understudied in the literature. In this work, we investigate the impact of various novel developer experience metrics and their combinations on JIT defect prediction. In particular, we compare the cost-effectiveness of different metrics. That is, we aim to find metrics that can reveal as many defects as possible after investigating a limited number of changed lines.
This paper makes the following contributions:
1) We propose new developer experience metrics extending the existing ones. We extend developer metrics along two different dimensions. For the first dimension, we consider various granularities of modules (e.g., systems, subsystems, and files). For the second dimension, we consider various ways to measure how recently the developer made a change on the module (e.g., commit time and version numbers). We also consider the combinations of both dimensions.
2) We empirically evaluate our new developer experience metrics. We compare the performance (i.e., the cost-effectiveness of trained models) of our metrics with the existing ones. We find that our new experience metrics improve the cost-effectiveness of defect prediction models. In particular, we report which combination of our metrics results in consistent and statistically significant improvements.
The rest of this article is organized as follows. Section II discusses the background of our study. Section III describes the new developer experience metrics that we propose. Section IV discusses the design of our experiments, and Section V presents the results of the experiments. Section VI reports the threats to validity, and Section VII provides the related works. Finally, Section VIII presents the conclusions and future work of this study.

II. BACKGROUND A. DEFECT PREDICTION
When developing software, software quality is one of the most critical aspects to the stakeholders of software development (e.g., customers, developers, and managers). Given limited resources, it is usually not feasible to take a close look at all parts of the software. Thus, it is important to know which part of the software is likely to be buggy so that developers can improve the software quality as much as possible within a limited time. For this purpose, defect prediction was proposed [6].
Defect prediction is typically performed using a machine learning model. Figure 1 shows the typical workflow of defect prediction consisting of two phases. In the first phase (the upper right box of the figure), a defect prediction model M is trained with data where each item consists of the values of n metrics (where n represents the number of metrics used to train M) and a label indicating whether the item is defective or not. As for metrics, diverse data such as code size, code complexity, and developer experience have been used in the literature, as shown in Table 1. The "Name" and "Description" columns give the names of the metrics and their descriptions, respectively. In defect prediction research, metrics are often categorized into several groups, as shown in the "Group" column of the table. Training data is typically mined from version-control systems. Once a defect prediction model M is trained, in the second phase (the lower right box of the figure), M is used to locate modules that are likely to be defective.

B. JUST-IN-TIME DEFECT PREDICTION
In a modern software development environment where software is constantly changing, developers continue to commit their code changes to a code repository. In this environment, developers need to know whether their code changes are defective or not. For this reason, JIT (Just-In-Time) defect prediction was introduced: it predicts how likely the current code changes are to be buggy.

C. CHANGE-LEVEL DEVELOPER EXPERIENCE METRICS USED IN JIT DEFECT PREDICTION
Metrics related to developer experience have been used in previous work [7], [8], [9], [10], [11], with the rationale that more experienced developers are less likely to write faulty code. While developer experience is difficult to measure, it is commonly approximated by counting the number of changes made by the developer; typically, one commit to a version control system is considered one change.
Specifically, the following three metrics have been used widely in the literature on defect prediction to measure developer experience: EXP (developer experience metric), SEXP (subsystem developer experience metric), and REXP (recent developer experience metric).

1) EXP
First, EXP for a change c is defined as:
$$\mathrm{EXP}(c) = |\mathrm{Changes}(d_c)|$$
where $d_c$ and $\mathrm{Changes}(d_c)$ represent the developer who made the change c, and the changes $d_c$ made on the project until c is made (including c), respectively.
Consider Figure 2 where the developer has made <Change #4> after making three changes from <Change #1> to <Change #3>. Then, EXP(<Change #4>) is 4 since the developer made a total of 4 changes, including <Change #4>. See the EXP column of Table 2.

2) SEXP
In a modern distributed software development environment, developers typically work on specific subsystems. Thus, a developer usually has different levels of expertise on different subsystems. A developer who has substantial expertise on a subsystem S1 may not be familiar with another subsystem S2 and is more likely to introduce a fault in S2 than in S1. However, the EXP metric does not distinguish the developer experience between different subsystems. In contrast, the second metric, SEXP, measures the developer's experience with the subsystems under change. The following is the definition of SEXP:
$$\mathrm{SEXP}(c) = |\mathrm{Changes}(d_c, S(c))|$$
where $S(c)$ and $\mathrm{Changes}(d_c, S(c))$ represent the subsystems modified by the change c, and the changes $d_c$ made on a subsystem $s \in S(c)$ until c is made (including c), respectively.
In our running example, SEXP(<Change #4>) is 3 since with <Change #4>, the developer has made a change on two subsystems, subsys1 and subsys2, and she also changed one of those two subsystems at <Change #1> and <Change #2>. See the SEXP column of Table 2.
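As a minimal sketch, EXP and SEXP can be computed by scanning a developer's commit history. The `Change` type, the path convention `<subsystem>/<module>`, and the four-commit history are assumptions made for illustration; the history is not the content of Figure 2, only a toy input chosen so that the values quoted above (EXP = 4, SEXP = 3) are reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One commit: its author and the files (modules) it touches."""
    author: str
    files: frozenset  # hypothetical paths of the form "<subsystem>/<module>"


def subsystems(change):
    """Subsystems touched by a change (assumed to be the top-level directory)."""
    return {path.split("/", 1)[0] for path in change.files}


def exp(history, i):
    """EXP: number of changes by the author of history[i], including history[i]."""
    author = history[i].author
    return sum(1 for c in history[: i + 1] if c.author == author)


def sexp(history, i):
    """SEXP: number of the author's changes (including history[i]) that touch
    at least one subsystem modified by history[i]."""
    current = history[i]
    touched = subsystems(current)
    return sum(
        1
        for c in history[: i + 1]
        if c.author == current.author and subsystems(c) & touched
    )


# Hypothetical history in chronological order (not the data of Figure 2).
history = [
    Change("dev", frozenset({"subsys1/moduleA"})),                     # Change #1
    Change("dev", frozenset({"subsys1/moduleB"})),                     # Change #2
    Change("dev", frozenset({"subsys3/moduleC"})),                     # Change #3
    Change("dev", frozenset({"subsys1/moduleA", "subsys2/moduleE"})),  # Change #4
]

print(exp(history, 3))   # 4
print(sexp(history, 3))  # 3
```

Counting a past change once, even when it touches several of the current subsystems, matches the set-based reading of Changes(d_c, S(c)).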

3) REXP
As developers accumulate experience on the system under development, they are less likely to write faulty code. REXP, defined as follows, gives a larger weight to the changes the developer made more recently:
$$\mathrm{REXP}(c) = \sum_{c' \in \mathrm{Changes}(d_c)} \frac{1}{1 + Y(c) - Y(c')}$$
where $Y(c)$ refers to the year when change c is made.
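The following sketch illustrates a year-weighted experience score, assuming the commonly used form in which each of the developer's changes contributes 1 / (1 + age in years). The function name `rexp` and the sample years are hypothetical.

```python
def rexp(change_years, current_year):
    """Year-weighted experience: each change by the developer contributes
    1 / (1 + number of years between that change and the current change).

    change_years must include the year of the current change itself.
    """
    return sum(1.0 / (1 + (current_year - year)) for year in change_years)


# Hypothetical developer: three changes last year and the current change this year.
years = [2021, 2021, 2021, 2022]
print(rexp(years, 2022))  # 3 * 1/2 + 1 = 2.5
```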

III. NEW DEVELOPER EXPERIENCE METRICS
In this section, we describe our new developer experience metrics.

A. MODULE-BASED METRICS 1) MEXP
Although SEXP considers developer experience at a finer granularity level (i.e., a subsystem) than EXP, it may not be fine-grained enough. We propose a more fine-grained metric, MEXP, which considers developer experience at the module level (in this work, we define a module as a file). We define MEXP as follows:
$$\mathrm{MEXP}(c) = |\mathrm{Changes}(d_c, M(c))|$$
where $M(c)$ and $\mathrm{Changes}(d_c, M(c))$ represent the modules modified by the change c, and the changes $d_c$ made on a module $m \in M(c)$ until c is made (including c), respectively. In our running example, MEXP(<Change #4>) is 2 since at <Change #4>, moduleA and moduleE are modified, and one of these two modules is modified at <Change #1>. See the MEXP column of Table 2.

2) AVG_MEXP AND AVG_SEXP
Consider the following two scenarios. In the first scenario, the developer has created the initial version of 100 modules and committed them to the repository. In the second scenario, the developer has modified a single module she previously modified 99 times. Although the first change would be more error-prone than the second one, MEXP does not distinguish between them: in both cases, MEXP is 100. While this example is an extreme case, it reveals the limitation of MEXP. Motivated by this problem, we suggest a new metric AVG_MEXP defined as follows, which computes the average value of MEXP:
$$\mathrm{AVG\_MEXP}(c) = \frac{\mathrm{MEXP}(c)}{|M(c)|}$$
where $M(c)$ refers to the modules modified by change c. In our running example, AVG_MEXP(<Change #4>) is 1 since MEXP(<Change #4>) is 2 and <Change #4> modifies two modules.
Similar to AVG_MEXP, we also use AVG_SEXP defined as follows:
$$\mathrm{AVG\_SEXP}(c) = \frac{\mathrm{SEXP}(c)}{|S(c)|}$$
where $S(c)$ refers to the subsystems modified by change c.
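A minimal sketch of MEXP and AVG_MEXP follows, assuming that MEXP counts the developer's changes (including the current one) that touch at least one of the currently modified modules, which is the reading that reproduces the running-example values. The history below is a hypothetical input, not the data of Figure 2.

```python
def mexp(history, i):
    """history: chronological list of (author, set_of_modules) pairs.
    Counts the author's changes up to index i that share a module with change i."""
    author, modules = history[i]
    return sum(1 for a, m in history[: i + 1] if a == author and m & modules)


def avg_mexp(history, i):
    """MEXP divided by the number of modules modified by change i."""
    return mexp(history, i) / len(history[i][1])


# Hypothetical history (module names follow the running example).
history = [
    ("dev", {"moduleA"}),             # Change #1
    ("dev", {"moduleB"}),             # Change #2
    ("dev", {"moduleC"}),             # Change #3
    ("dev", {"moduleA", "moduleE"}),  # Change #4
]

print(mexp(history, 3))      # 2
print(avg_mexp(history, 3))  # 1.0
```

AVG_SEXP can be sketched in the same way by replacing modules with subsystems.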

3) SimEXP
Suppose that a developer d made the following two code changes. In the first code change c1, she modified module m1. In the second code change c2, she modified two modules, m1 and m2. The developer may gain different experiences when performing these two changes. While the task performed at c1 is concentrated on a single module m1, the second change c2 is likely to involve interaction between the two modified modules. When the same developer d later makes a new change only on module m1, the experience she gained from c1 is more likely to be relevant to the current task than the experience gained from c2. Conversely, when d modifies the two modules m1 and m2, the opposite is more likely to be true. More generally, we conjecture that the more similar the past experience of the developer is to the current change, the bigger the influence that experience is likely to have on the current change. Here, the similarity between two changes c1 and c2 is proxied by the similarity between the two sets of modules modified in c1 and c2, which we formally describe as follows:
$$\mathrm{Sim}(c_1, c_2) = \frac{|M(c_1) \cap M(c_2)|}{|M(c_1) \cup M(c_2)|}$$
where $M(c_i)$ represents the modules modified in change $c_i$. Based on this notion, we define a new metric SimEXP as follows:
$$\mathrm{SimEXP}(c) = \sum_{c' \in \mathrm{Changes}(d_c)} \mathrm{Sim}(c, c')$$
In our running example, SimEXP(<Change #4>) is 1.5, as shown in Table 2. For example, the similarity between <Change #4> and <Change #1> is |{moduleA}|/|{moduleA, moduleE}|, which equals 0.5. The rest of the changes can be handled similarly.
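The sketch below illustrates a similarity-weighted experience score. It assumes that the set similarity is the Jaccard index (intersection size over union size), which is consistent with the 0.5 value in the running example but is an assumption here, and that the current change is included in the sum with similarity 1. The history is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of modules; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sim_exp(history, i):
    """history: chronological list of (author, set_of_modules) pairs.
    Sums the similarity between change i and each change by the same author
    up to and including change i."""
    author, modules = history[i]
    return sum(jaccard(modules, m) for a, m in history[: i + 1] if a == author)


# Hypothetical history (not the data of Figure 2).
history = [
    ("dev", {"moduleA"}),             # Change #1: similarity 1/2 to Change #4
    ("dev", {"moduleB"}),             # Change #2: similarity 0
    ("dev", {"moduleC"}),             # Change #3: similarity 0
    ("dev", {"moduleA", "moduleE"}),  # Change #4: similarity 1 to itself
]

print(sim_exp(history, 3))  # 1.5
```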

B. TEMPORAL METRICS 1) RVEXP AND RvEXP
The motivation of REXP is to assign a larger weight to a more recent change. To define how recent a past change c is, REXP compares the time when c was made and the time when the current change is made.
There is another way to measure how recently the past changes were made. Many software products are maintained using the semantic versioning scheme [12], n 1 .n 2 .n 3 , where n 1 , n 2 and n 3 represent the major version number, the minor version number, and the patch version number, respectively. By comparing the versions of two different changes (i.e., commits), one can measure the interval between those two changes.
In this study, we experiment with two new metrics (RVEXP and RvEXP), which compute the interval at the granularity of the major versions (RVEXP) and the minor versions (RvEXP), respectively. We define these two new metrics as follows:
$$\mathrm{RVEXP}(c) = \sum_{c' \in \mathrm{Changes}(d_c)} \frac{1}{1 + V(c, c')}, \qquad \mathrm{RvEXP}(c) = \sum_{c' \in \mathrm{Changes}(d_c)} \frac{1}{1 + v(c, c')}$$
where $V(c, c')$ and $v(c, c')$ represent the version difference between c and c' at the granularity of major versions and minor versions, respectively.
RvEXP is computed similarly based on minor versions, as shown in Table 2. For example, the minor version difference between <Change #3> and <Change #1> is 5 (the version change history is shown in the bottom part of Figure 2), and the value 1/(1+5) is obtained.
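A version-weighted score can be sketched as follows, assuming each past change contributes 1 / (1 + version difference), as in the 1/(1+5) term above. How the version difference is derived from release tags is not specified here, so the sketch takes a precomputed, hypothetical `release_index` (the number of major or minor releases that preceded each change of one developer).

```python
def version_weighted_exp(release_index, i):
    """release_index[k]: number of releases (major for an RVEXP-style score,
    minor for an RvEXP-style score) made before the developer's k-th change.
    Each change up to index i contributes 1 / (1 + version difference)."""
    current = release_index[i]
    return sum(1.0 / (1 + (current - r)) for r in release_index[: i + 1])


# Hypothetical minor-release indices for Change #1..#3 of one developer,
# chosen so that the difference between Change #3 and Change #1 is 5.
minor_index = [0, 2, 5]
score = version_weighted_exp(minor_index, 2)
print(score)  # 1/(1+5) + 1/(1+3) + 1/(1+0)
```

Passing major-release indices instead of minor ones gives the coarser-grained variant with the same code.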

2) RSEXP, RMEXP, RVSEXP, RVMEXP, RvSEXP AND RvMEXP
We also define temporal metrics at the subsystem and module levels as follows, similar to before.
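The six definitions are not reproduced in this text. One natural reading, stated here as an assumption, restricts each temporal sum to the developer's changes that touch the subsystems S(c) or the modules M(c) of the current change, with Y, V, and v denoting the year, major-version, and minor-version differences:

```latex
\begin{align*}
\mathrm{RSEXP}(c)  &= \sum_{c' \in \mathrm{Changes}(d_c, S(c))} \frac{1}{1 + Y(c) - Y(c')} &
\mathrm{RMEXP}(c)  &= \sum_{c' \in \mathrm{Changes}(d_c, M(c))} \frac{1}{1 + Y(c) - Y(c')} \\
\mathrm{RVSEXP}(c) &= \sum_{c' \in \mathrm{Changes}(d_c, S(c))} \frac{1}{1 + V(c, c')} &
\mathrm{RVMEXP}(c) &= \sum_{c' \in \mathrm{Changes}(d_c, M(c))} \frac{1}{1 + V(c, c')} \\
\mathrm{RvSEXP}(c) &= \sum_{c' \in \mathrm{Changes}(d_c, S(c))} \frac{1}{1 + v(c, c')} &
\mathrm{RvMEXP}(c) &= \sum_{c' \in \mathrm{Changes}(d_c, M(c))} \frac{1}{1 + v(c, c')}
\end{align*}
```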

3) AVG METRICS
For the six metrics shown in Section III-B2, we define the AVG metrics. For example, AVG_RSEXP and AVG_RMEXP are defined as follows:
$$\mathrm{AVG\_RSEXP}(c) = \frac{\mathrm{RSEXP}(c)}{|S(c)|}, \qquad \mathrm{AVG\_RMEXP}(c) = \frac{\mathrm{RMEXP}(c)}{|M(c)|}$$

IV. EXPERIMENTAL DESIGN A. RESEARCH QUESTIONS
In this paper, our main goal is to see whether using our new developer experience metrics improves the performance of defect prediction. We accordingly ask the following research questions.
1) Do our module-based metrics improve the performance of defect prediction?
2) Do our temporal metrics improve the performance of defect prediction?
3) Does the performance of defect prediction improve when our module-based and temporal metrics are combined?

B. DATASET
We extracted the dataset from the five open-source projects listed in Table 3. The table shows meta-data about the projects, including (from left to right) the domain of the projects, the programming languages used in the projects, LOC,² the number of commits, the number of developers who contributed to the projects, and the ratio of defective commits. Our dataset covers diverse domains and programming languages. Note that the datasets used in the previous studies [4], [9], [13], [14] do not contain our new metrics and cannot be directly used for our experiments. We extracted all 14 change-level metrics shown in Table 1.³ In addition, we extracted our new experience metrics described in Section III. To label whether a change is defective or not, we use the standard SZZ algorithm [16].

C. JIT DEFECT PREDICTION MODELS
We train our JIT defect prediction models using random forest [17]. Random forest is a classification (and regression) technique using multiple decision trees (we used 100 decision trees in our experiments). A classification decision (e.g., whether a change is defective or not) is made by majority voting among the trained decision trees. Random forest is commonly used in defect prediction research due to its high effectiveness [18], [19], [20].

D. PERFORMANCE MEASURES
To measure the performance of effort-aware JIT defect prediction, we use the Area Under the Cost Effectiveness Curve (AUCEC). AUCEC is commonly used in the literature on defect prediction to measure the cost-effectiveness of defect prediction [20], [21], [22]. Figure 3 illustrates AUCEC. In the figure, the X- and Y-axes represent the ratio of inspected changed lines and the ratio of detected defects, respectively. Each point (x, y) of the curve denotes the portion of the detected defects (represented with the y value) after investigating the x portion of the changes (represented with the x value). Note that the ratio of the inspected changed lines is used as a proxy for the effort the developers put in.

² We obtained LOC using CLOC (https://github.com/AlDanial/cloc).
³ We used the CodeRepoAnalyzer tool [15].
When measuring AUCEC, we assume that the developers investigate changes c in the order of their cost-effectiveness scores ce(c) computed using the following formula.
where p(c) represents the error-proneness of c returned from the trained JIT defect prediction model,⁴ e(c) represents the effort proxied by the number of changed lines of c, and Changes represents a set of changes to investigate. Notice that as p(c) increases (i.e., c is predicted to be more error-prone) and e(c) decreases (i.e., c modifies a smaller number of lines), the value of ce(c) increases.
If a defect prediction model A shows a higher AUCEC value than another model B, this implies that after investigating the same number of lines of code, more defects are detected by A than by B. In the literature on defect prediction, it is often assumed that developers usually investigate only N% of the changed lines within a limited time. As for the value of N, 20 is most often used, and we also use the same. We use the notation AUCEC 20 to denote the AUCEC score obtained after inspecting 20% of SLOC (source lines of code).
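The ranking and the area computation can be sketched as follows. Because the exact formula for ce(c) is not reproduced in this text, the sketch assumes ce(c) = p(c) / e(c); any positive normalization over the set of changes would leave the ordering unchanged. The trapezoidal area up to the cutoff, without further normalization, and the requirement that every change has positive effort are also assumptions of this sketch.

```python
def aucec(changes, cutoff=0.2):
    """changes: list of (p, effort, is_defective) with effort > 0.
    Returns the area under the cost-effectiveness curve for x in [0, cutoff],
    where x is the fraction of inspected lines and y the fraction of defects found."""
    # Inspect changes in decreasing order of the assumed score p / effort.
    ranked = sorted(changes, key=lambda c: c[0] / c[1], reverse=True)
    total_effort = sum(e for _, e, _ in ranked)
    total_defects = sum(1 for _, _, d in ranked if d)
    if total_effort == 0 or total_defects == 0:
        return 0.0

    area = 0.0
    x_prev = y_prev = 0.0
    for _, effort, defective in ranked:
        x = x_prev + effort / total_effort
        y = y_prev + (1 / total_defects if defective else 0.0)
        if x >= cutoff:
            # Interpolate linearly to the cutoff and stop.
            frac = (cutoff - x_prev) / (x - x_prev)
            y_cut = y_prev + frac * (y - y_prev)
            return area + (cutoff - x_prev) * (y_prev + y_cut) / 2
        area += (x - x_prev) * (y_prev + y) / 2
        x_prev, y_prev = x, y
    return area


# Hypothetical predictions: (error-proneness, changed lines, actually defective)
good_model = [(0.9, 10, True), (0.2, 10, False), (0.8, 80, True)]
poor_model = [(0.1, 10, True), (0.2, 10, False), (0.9, 80, True)]
print(aucec(good_model) > aucec(poor_model))  # True
```

In this toy input, the first model ranks the small defective change first, so it finds a defect within the first 10% of inspected lines and obtains the larger area.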

E. EVALUATION METHODS
To assess how useful our extended metrics are, we perform defect prediction with two different datasets, M base and M target, where M base contains the common existing metrics used in previous studies while M target is defined as M base ∪ {extended metric(s)}. More specifically, we first add all metrics shown in Table 1 except for two developer experience metrics, SEXP and REXP, into M base. To refer to these common base metrics, we use the notation M common. Then, depending on which extended metrics are evaluated, we add either SEXP or REXP into M base. When evaluating module-based metrics, we define M base as M common ∪ {SEXP}. Meanwhile, when evaluating temporal metrics, we define M base as M common ∪ {REXP}. This is to evaluate module-based (or temporal) metrics separately without them being affected by temporal (or module-based) metrics. When assessing RQ3, where we consider both module-based and temporal metrics, we accordingly define M base as M common ∪ {SEXP, REXP}.

⁴ Random forest computes the error-proneness score as the ratio of the number of decision trees that determine the given change to be defective over the total number of decision trees.

TABLE 4. AUCEC 20 of our module-based metrics; positive improvement rates (shown in the "rate" column), p-values less than or equal to 0.5, and effect sizes (shown in the "effect" column) larger than or equal to 0.1 are highlighted in yellow.
Given M base and M target , we compare their performance using the two validation methods described in the following.

1) 30 TIMES 10-FOLD TIME-AWARE CROSS VALIDATION
The 10-fold cross-validation method is commonly used to evaluate machine-learning models. This method splits the dataset into 10 folds and uses 9 for training and the remaining one for testing. In total, 10 different pairs of training/testing sets can be obtained, and all of them are used for validation. We repeat this process 30 times.
Considering that a defect prediction model is trained with past data, we make sure that all commits in the training set S_train are made before the commits in the testing set S_test, using the following method. For a given testing dataset S_test, we sort S_test in reverse chronological order. We also prepare two lists, S'_test and S'_train, initialized with an empty set and S_train, respectively. Then, we perform the following two tasks in a row. 1) We move the first item M_test from S_test to S'_test. 2) We find the items M_train ∈ S'_train committed later than M_test and then remove M_train from S'_train. We repeat these two steps as long as |S'_train|/|S'_test| is larger than 9.
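The trimming procedure can be sketched as below. Commits are represented only by comparable timestamps, and the function name `time_aware_split` and the sample fold are hypothetical. The loop moves the newest remaining test commit into the kept test list, drops every training commit made after it, and repeats while the train/test ratio exceeds 9.

```python
def time_aware_split(train, test, max_ratio=9):
    """train, test: lists of commit timestamps for one cross-validation fold.
    Returns (kept_train, kept_test) such that no kept training commit is
    later than any kept test commit."""
    pool = sorted(test, reverse=True)  # newest test commit first
    kept_train = list(train)
    kept_test = []
    while pool and (not kept_test or len(kept_train) / len(kept_test) > max_ratio):
        newest = pool.pop(0)
        kept_test.append(newest)
        # Remove training commits made later than the test commit just moved.
        kept_train = [t for t in kept_train if t <= newest]
    return kept_train, kept_test


# Hypothetical fold: timestamps 1..100, three of which fall into the test fold.
test_fold = [95, 60, 30]
train_fold = [t for t in range(1, 101) if t not in test_fold]
kept_train, kept_test = time_aware_split(train_fold, test_fold)
print(len(kept_train), len(kept_test))  # 29 3
```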

2) TIME-AWARE HOLD-OUT CROSS VALIDATION
We found that the 30 times 10-fold time-aware cross-validation often results in low statistical power. To compensate for this problem, we also apply hold-out cross-validation. We sort our dataset in chronological order and use the first 90% of the data for training and the last 10% for testing. We train and test a model 300 times for each metric we evaluate.
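A chronological hold-out split can be sketched as follows. The representation of a commit as a `(timestamp, commit_id)` pair and the sample data are assumptions made for illustration.

```python
def holdout_split(commits, train_fraction=0.9):
    """commits: list of (timestamp, commit_id) pairs in any order.
    Returns (train, test): the oldest train_fraction of commits for training
    and the remaining, most recent commits for testing."""
    ordered = sorted(commits, key=lambda c: c[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical dataset of 20 commits, given in reverse chronological order.
commits = [(t, f"commit-{t}") for t in range(20, 0, -1)]
train, test = holdout_split(commits)
print(len(train), len(test))  # 18 2
```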

V. EXPERIMENTAL RESULTS A. RQ1: DO OUR MODULE-BASED METRICS IMPROVE THE PERFORMANCE OF DEFECT PREDICTION?
Tables 4(a) and 4(b) show the results for RQ1 from 30 times 10-fold time-aware cross-validation and time-aware hold-out cross-validation, respectively. The first column of the table shows the subjects under evaluation, and the second column shows the median AUCEC 20 score obtained when the base metrics M base are used. Recall that for RQ1, we define M base as M common ∪ {SEXP}.
To assess our extended module-based metrics, we measure the AUCEC 20 score after replacing SEXP with each of those extended metrics. The third to fifth columns of the table show the results. For each extended metric, we report the median AUCEC 20 score, the improvement rate (which we describe shortly), the p-value, and the effect size. We compute the p-value and effect size using the Wilcoxon-Mann-Whitney test [23]. The improvement rate shows how much the AUCEC 20 score improves when SEXP is replaced with the metric under consideration. We define the improvement rate as ((median target − median base) / median base) × 100.
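The improvement rate can be computed directly from the scores of repeated runs, as in the following sketch. The score lists are hypothetical values for illustration and are not results from this study.

```python
from statistics import median


def improvement_rate(base_scores, target_scores):
    """((median_target - median_base) / median_base) * 100, in percent."""
    m_base = median(base_scores)
    m_target = median(target_scores)
    return (m_target - m_base) / m_base * 100


# Hypothetical AUCEC scores from repeated runs with the base and target metrics.
base = [0.10, 0.20, 0.30]    # median 0.20
target = [0.22, 0.25, 0.21]  # median 0.22
print(round(improvement_rate(base, target), 2))  # 10.0
```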
The results of the first validation (30 times 10-fold time-aware cross-validation) show that our module-based metrics tend to have a positive effect, although statistical significance is not observed except for one case (AVG_SEXP for React). However, the hold-out validation results show that in most cases, statistically significant improvement is observed. Our three metrics, MEXP, AVG_MEXP, and SimEXP, outperform SEXP across all subjects. In particular, AVG_MEXP outperforms SEXP with statistical significance (i.e., p-value ≤ 0.05) across all subjects except for Notepad++, where the p-value is 0.058. Figure 4 illustrates the observation that MEXP, AVG_MEXP, and SimEXP outperform SEXP.

Table caption: AUCEC 20 comparison between REXP and the metrics behind the double vertical bar (||); positive improvement rates (shown in the "rate" column), p-values less than or equal to 0.5, and effect sizes (shown in the "effect" column) larger than or equal to 0.1 are highlighted in yellow.

B. RQ2: DO OUR TEMPORAL METRICS IMPROVE THE PERFORMANCE OF DEFECT PREDICTION?
For RQ2, we define M base as M common ∪ {REXP}. Similar to RQ1, we measure the AUCEC 20 score after replacing REXP with each of our temporal metrics. Tables 5 and 6 show the results for RQ2 from 30 times 10-fold time-aware cross-validation and time-aware hold-out cross-validation, respectively. Compared to the module-based metrics, positive effects are observed less often. Nonetheless, we can observe from Table 6 that AVG_RVMEXP outperforms REXP across all subjects. Also, performance improvement is observed for all subjects except for React when RSEXP, RVSEXP, RvSEXP, RvMEXP, or AVG_RvMEXP is used.

C. RQ3: DOES THE PERFORMANCE OF DEFECT PREDICTION IMPROVE WHEN OUR MODULE-BASED AND TEMPORAL METRICS ARE COMBINED?
Since combining 6 module-based metrics and 18 temporal metrics results in too many combinations (108), we combine the three best module-based metrics with which performance improvement is observed across all subjects (i.e., MEXP, AVG_MEXP, and SimEXP) and the six temporal metrics with which performance improvement is observed in at least 4 subjects (i.e., AVG_RVMEXP, RSEXP, RVSEXP, RvSEXP, RvMEXP, and AVG_RvMEXP). We compare each of the 18 combinations with the base case where we define M base as M common ∪ {SEXP, REXP}. Note that SEXP and REXP are the existing module-level and temporal metrics, respectively. Tables 7 and 8 show the results. Table 7 shows the result of the 30 times 10-fold time-aware cross-validation, and it is observed that SimEXP+AVG_RvMEXP outperforms M base across all subjects. Table 8 shows the result of the time-aware hold-out cross-validation. It is observed that in most cases, p-values are less than 0.05, and effect sizes are larger than 0.1, indicating that statistically significant and non-negligible results are obtained. Compared to RQ1 and RQ2, combining our module-based and temporal metrics tends to cause more visible changes in performance.
SimEXP+AVG_RVMEXP outperforms M base across all subjects, with statistical significance. In addition, in three more combinations (i.e., AVG_MEXP+AVG_RVMEXP, AVG_MEXP+RSEXP, and AVG_MEXP+AVG_RvMEXP), performance improves across all subjects. The box plots in Figure 5 illustrate the observation that these four combinations outperform the base case using the combination of SEXP and REXP.

VI. THREATS TO VALIDITY A. CONSTRUCT VALIDITY
We collected independent variables (i.e., the change-level metrics) using CodeRepoAnalyzer [15], and the dependent variable (i.e., the variable indicating whether a commit is defective or not) using the SZZ algorithm. Although these techniques have been widely used in defect prediction studies [4], [24], they may produce incorrect results (e.g., a non-defective change may be labeled defective). REXP and its extended metrics (e.g., RSEXP, RMEXP) are computed based on the commit history. There is a potential threat to validity in case the developers "squash" (merge) multiple commits, since by doing so, the order between commits is lost.

B. INTERNAL VALIDITY
When measuring AUCEC, we used 20% as the cutoff point, as commonly conducted in the literature on defect prediction. Nonetheless, it is unknown which cutoff point is best. To mitigate this threat, we also evaluated the performance with the 10% cutoff point and observed the same general tendency.

C. EXTERNAL VALIDITY
We conducted the experiments with data from five open-source projects. Although we carefully chose various projects with different sizes, domains, and programming languages, our subjects may not represent all software projects. Nonetheless, to the best of our knowledge, this is the first study that investigates the impact of extended developer experience metrics on defect prediction. We expect our positive results to foster further studies on developer experience metrics.

VII. RELATED WORKS A. IDENTIFICATION OF BUGGY PATTERNS BASED ON DEVELOPER EXPERIENCE FACTORS
Matsumoto et al. [25] defined five metrics that characterize a developer's activities for a specific version of the software and analyzed the correlation between those metrics and the ratio of the buggy commits authored by a developer. The five metrics they defined for each developer for a specific software version are 1) the number of commits made by a developer, 2) the number of lines revised by a developer, 3) the number of unique modules revised by a developer, 4) the number of unique packages revised by a developer, and 5) the ratio of buggy commits by a developer for the previous version. Analysis results showed that the number of unique modules revised and the ratio of buggy commits for the previous version significantly correlated with the ratio of buggy commits for the chosen version. Although the authors used version information in collecting a developer's experience, those developer experience metrics are defined per developer for a specific version rather than per change. Moreover, whether these metrics are good predictors for defect prediction was not determined, even though the authors showed that developer experience may have an impact on software quality.
Bird et al. [26] examined the effects of code ownership on software quality. For each file, they counted the number of contributors; the number of minor or major contributors, which is distinguished by the ratio of the contribution, whether it is higher than 5%; and the ownership, which is computed by using the top contributor's contribution ratio. To evaluate the effects of four ownership metrics on software quality, they conducted a correlation analysis of the pre- and post-release failures and built linear regression models using code metrics and ownership metrics as the independent variables and failures as the dependent variable. They specified that their purpose for building the linear regression models was not to predict whether a file contains any defect but to check whether the ownership metrics can be effectively used in classification models. Based on their experimental results, they recommended that developers should review the changes made by minor contributors more carefully since their limited experiences may induce defects. Unlike our work, this work is conducted at the file level, not at the change level.

Table caption: AUCEC 20 comparison between SEXP+REXP and the metrics behind the double vertical bar (||); positive improvement rates (shown in the "rate" column), p-values less than or equal to 0.5, and effect sizes (shown in the "effect" column) larger than or equal to 0.1 are highlighted in yellow.
Eyolfson et al. [27] studied the correlation between the error-proneness of a commit and the developer's experience, which was proxied by the days passed after the first commit the developer made on the Linux kernel and PostgreSQL projects. They reported that there are several threats that may affect the interpretation of the relationship between the developer experience metric and the bugginess of a commit, such as more experienced developers working on more complex source code or inflation of the developer experience metric value caused by his/her extremely low commit frequency. Nonetheless, the authors observed that data from both projects showed that the error-proneness of a commit decreases as the author's experience increases in general, and they reported that this correlation could be exploited in predicting the locations of buggy code. Although the authors showed the possibility of using developer experience metrics in defect prediction, they used a very basic method in quantifying a developer's experience. Moreover, the performance impact of the developer experience metric they proposed for defect prediction models was not evaluated.
Tufano et al. [28] analyzed the effect of the experience level of developers on the bugginess of a commit on five Java open-source projects. They defined four different developer experience metrics at the change level that consider the lexical experience and the frequency of experience on modified files. More specifically, the lexical experience metric is calculated by using the textual similarity between the texts in a modified file and the concatenated texts from all files modified by an author of the change. After obtaining the lexical experience on the files that were modified in a commit, the authors computed the mean value of the lexical experience on multiple files to ensure that the metric is defined at the change level. The frequency of experience was computed by counting the number of commits that were made by the author on the file modified in the target commit, and then dividing that counted number by the number of the commits the author made in the past. Furthermore, two additional developer experience metrics were defined in the same manner as the previous two metrics, except that these metrics only consider the commits from the past six months. The authors concluded that the mean value of the four developer experience metrics from fix-inducing commits and clean commits was significantly different. Although they defined four new developer experience metrics and showed the possibility of the usefulness of these metrics in defect prediction models, they used fixed time windows for calculating the developer's recent experience, and the performance impact of these metrics on JIT defect prediction was not shown.

B. DEVELOPER EXPERIENCE METRIC ON JIT DEFECT PREDICTION
Mockus and Weiss [8] suggested various change-level metrics, including EXP, SEXP, and REXP described in Section II. Kamei et al. [4] used various metrics, including the developer experience metrics of Mockus and Weiss [8], to evaluate the performance of JIT defect prediction. However, they did not extend the developer experience metrics. McIntosh and Kamei [9] proposed the author awareness metric, which is defined as the proportion of past changes that were made to a subsystem that the reviewer has authored or reviewed. They did not find this metric useful in improving the performance of JIT defect prediction. In this work, we propose additional developer experience metrics that have a positive effect on the performance of JIT defect prediction.
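For orientation, the three baseline metrics can be sketched as below, following their commonly cited definitions: EXP counts the author's prior changes, SEXP counts prior changes to the subsystems touched by the current change, and REXP weights each prior change by its age in years. The `Change` record, the 365-day year, and the rule that a prior change counts toward SEXP if it shares any subsystem with the target are assumptions of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    # Hypothetical record type for one commit.
    author: str
    when: datetime
    subsystems: frozenset


def experience_metrics(history, target):
    """Return (EXP, SEXP, REXP) for the author of `target`."""
    past = [
        c for c in history
        if c.author == target.author and c.when < target.when
    ]
    # EXP: number of prior changes by the author.
    exp = len(past)
    # SEXP: prior changes sharing at least one subsystem with the target.
    sexp = sum(1 for c in past if c.subsystems & target.subsystems)
    # REXP: each prior change weighted by 1 / (age in whole years + 1).
    rexp = sum(
        1.0 / ((target.when - c.when).days // 365 + 1) for c in past
    )
    return exp, sexp, rexp


history = [
    Change("alice", datetime(2019, 1, 1), frozenset({"net"})),
    Change("alice", datetime(2020, 6, 1), frozenset({"fs"})),
    Change("bob", datetime(2020, 1, 1), frozenset({"net"})),
]
target = Change("alice", datetime(2021, 1, 1), frozenset({"net"}))
exp, sexp, rexp = experience_metrics(history, target)
```

In this toy history, the older change contributes less to REXP than the recent one, while SEXP counts only the change that touched the same subsystem as the target.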

VIII. CONCLUSION AND FUTURE WORK
In this work, we have proposed novel developer experience metrics. In particular, we extended two widely used experience metrics, SEXP and REXP. SEXP is defined at the granularity of subsystems, and we have proposed MEXP, which is defined at the file granularity. We also proposed SimEXP, which measures the similarities between commits. Regarding REXP, which measures how recently the developer made a change in units of years, we have suggested RVEXP and RvEXP, which measure the same in units of major and minor versions, respectively. We also suggested variations of those metrics (e.g., AVG_RVMEXP) that average the experiences instead of summing them up. We also combined these metrics when conducting experiments.
Our experimental results show that our new metrics often improve the cost-effectiveness of defect prediction models. When we combined module-based metrics and temporal metrics, we obtained stronger results. In particular, when combining SimEXP and AVG_RVMEXP, a statistically significant performance improvement was observed across all five subjects. In future work, we plan to experiment with more subjects to study how general our findings are.

YEONGJUN CHO received the master's degree in computer science from the Korea Advanced Institute of Science and Technology (KAIST), in 2019. He is currently a Site Reliability Engineer at Naver Corporation, South Korea. His research interests include software engineering and site reliability engineering. More information about him is available at: https://www.linkedin.com/in/yeongjuncho/.

His research interests include services computing, web engineering, and software engineering. His recent research focuses on service-oriented software development in large-scale and distributed system environments, such as web, the Internet of Things (IoT), and edge-cloud environments. More information about him is available at: https://webeng.kaist.ac.kr/webengpress/professor/.

VOLUME 10, 2022