What Ruined Your Cake: Impacts of Code Modifications on Bug Distribution

In the current decade, software systems have been more intensively employed in every aspect of our lives. However, it is disappointing that the quality of software is far from satisfactory. More importantly, the complexity and size of today’s software systems are increasing dramatically, which means that the number of required modifications is also increasing exponentially. Therefore, it is necessary to understand how function-level modifications impact the distribution of software bugs. In addition, other factors such as a function’s structural characteristics as well as attributes of functions themselves may also serve as informative indicators for software fault prediction. In this paper, we perform statistical methods and logistic regression to analyze the possible factors that are related to the distribution of software bugs. We demonstrate our study from the following five perspectives: 1) the distribution of bugs in time and space; 2) the distribution of function-level modifications in time and space; 3) the relationship between function-level modifications and functions’ fault-proneness; 4) the relationship between functional attributes and functions’ fault-proneness; and 5) the relationship between software structural characteristics and functions’ fault-proneness.


I. INTRODUCTION
Currently, we are in an era in which software plays an essential role in everyday life. The complexity and scale of software systems have increased tremendously, especially for those safety-or security-critical systems. According to [1], in the past 40 years, the size of software has increased exponentially. Yet the disappointing fact is that various qualityrelated issues still exist in the entire life cycle of software development. According to a National Institute of Standards and Technology report [2], software bugs cost the U.S. economy an estimated $59.5 billion annually. More importantly, due to the size and complexity of modern software systems, the situation even deteriorates, and there is also pressure from tighter schedules and reduced development budgets. In addition, the report also states that more than one-third of the costs would be eliminated if more proper quality assurance approaches were applied.
Software has been widely applied in various complex systems in every aspect of our lives. More importantly, the application of software in safety-critical areas, such as medicine The associate editor coordinating the review of this manuscript and approving it for publication was Steve Li . and nuclear power generation, has a great impact on people's daily lives. When utilized in these and other, similar industries, software systems are usually used to control the behaviors of electromechanical components and monitor the interactions between them to ensure the smooth operations of the systems. Ideally, software should operate as expected and not contribute to hazards [3]. However, with the increasing complexity and size, software systems are becoming unexpectedly large with advanced algorithms, interactions with external systems, as well as the huge amounts of data transmitted. For example, if the system turns into an unsafe state or the operations performed by engineers directly or indirectly lead the system into hazard, the outcome can be catastrophic. In a recently published paper [4], the authors examined 104 accidents in recent decades to address the roles bugs played in those accidents and demonstrate lessons learned to improve our development practices in order to prevent similar accidents from happening in the future.
Boehm [5] carefully analyzed the costs to fix potential errors or bugs in different phases of the software life cycle and suggested that the later a bug is detected, the costlier it will be. In other words, it is optimal that developers can detect the potential defects in their systems under development as early as possible. Hence, it is beneficial to identify fault-prone software modules in order to alleviate the cost of software debugging and bug fixing ( [6]- [9]). Encouraging results in prior research indicate that it is possible to predict which modules are likely to be locations of defect occurrence using Random Forest ( [10], [11]), Neural Network ( [12], [13]), Support Vector Machines ( [14], [15]), Logistic Regression ( [16]- [18]), and Naive Bayes ( [19]- [21]). In addition, there has been research on defect classification ( [22], [23]), justin-time defect prediction [24], and so on.
However, shortcomings still exist with respect to these existing approaches. First, most of these approaches assume functional-level modifications to be the indicators of bug locations. Undoubtedly, bugs are the direct outcomes of these modifications; however, other issues with respect to software systems can also serve as the reasons why specific modifications are very likely to induce bugs. For example, suppose that the complexity of one component is significantly higher than its normal acceptable level. In this case, the likelihood that the modifications towards this component introducing bugs also increases, not only because of its own complexity but also due to the complex interactions among the component and other components within the system.
Another issue is that no study focused on exploring how closely functional-level modifications are related to software bug distributions currently exists. To what degree can these modifications serve as an informative indicator of bug locations? Do all the modifications towards different functions have the same likelihood of inducing bugs? These questions all remain unanswered. In addition, most studies merely focus on file and class level, so it is impossible to provide more accurate prediction results for bug locations.
Many studies focus primarily on the analysis of code level measurements. However, software development is a dynamic process, and changes along with this process can have either positive or negative impacts on the distribution of software defects. Although researchers are aware of this, a close examination of process-related information from new perspectives and a careful evaluation of their relationship with fault proneness are still needed [25].
To solve these issues, we propose a thorough study on various factors that could have an impact on the distribution of software bugs. There are three novel aspects of our study.
First, we introduce the concept of software evolution into the area of software fault prediction. Unlike existing studies, we take the evolution of software systems into account in the process of determining influential factors of fault distribution. Instead of predicting bug locations based on one or several specific versions, we analyze more than 1,000 consecutive versions of open source projects to understand the impacts of these factors as well as how they change during the entire development process.
Second, we closely analyze the relationships between functional-level factors and bug distributions. Unlike existing metrics focusing on either class or file level, we introduce different functional-level factors that can be used as indicators of bug distributions.
Third, we also analyze how structural attributes of software systems can relate to fault distributions. Three functionallevel attributes as well as 24 network metrics are included in our study to represent the software structures and then serve as the candidate factors for predicting software fault distributions.
The remainder of the paper is organized as follows: Section II describes in detail the research questions we are trying to solve. What follows in Section III contains the experimental setups used in this study. In Section IV, we introduce the evaluation methods used to demonstrate the correlation relationships. Our results and discussions are presented in Section V, followed by the threats to validity in Section VI. Other existing studies related to this paper are discussed in Section VII, while we draw our conclusions in the final section.

II. RESEARCH QUESTIONS
In this paper, we carefully examined the following five research questions to elicit informative and influential factors of software fault distribution as well as the degree to which they are related to fault proneness.

A. WHAT IS THE DISTRIBUTION OF DEFECTS IN TIME AND SPACE?
In this paper, we represent software from two unique dimensions: time and space. On one hand, we use time to represent the evolution process of software projects. As introduced in Section I, we consider software development as a continuous process and recognize that fault distribution may also vary during this process. For example, with respect to the 492 versions of Nginx, we analyze how bugs distribute within these versions, such as how many bugs each version contains and how the number of bugs changes during the software project development process. On the other hand, space is used to evaluate how bugs are distributed among functions within software projects. Unlike most existing projects, which only focus on file-or classlevel fault distributions, we examine how many bugs are contained within distinct functions. Exploring the distribution of software defects in time and space helps us better understand software defects and serves as guidance in subsequent research.

B. WHAT IS THE DISTRIBUTION OF MODIFICATIONS IN TIME AND SPACE?
The terms time and space here are the same as in RQ 1. In this paper, the functional-level modifications are divided into six types, and the distribution of them on the releases (or versions) and the associated functions are studied respectively. To solve RQ 2, the distribution of modification types is carefully examined, and the results will serve as the fundamentals of the remaining research questions. VOLUME 8, 2020 C. WHAT IS THE CORRELATION BETWEEN  FUNCTION-LEVEL MODIFICATIONS AND THEIR  CORRESPONDING FAULT-PRONENESS? Though we acknowledge that modifications of software source code may induce bugs, it is still unknown to what degree these two aspects are related. Which type(s) of modifications are more likely to induce defects? To what degree are these modifications related to functions' defect-proneness? With the answers to these questions, we will have more detailed and effective indicators for software fault prediction analyses.

D. WHAT IS THE CORRELATION BETWEEN ATTRIBUTES OF FUNCTIONS AND THEIR CORRESPONDING FAULT-PRONENESS?
Instead of considering all functions equal, we elicit different attributes of functions to represent the characteristics of functions in a more demonstrative way. For example, in this paper, the duration of a software function is defined as how many consecutive versions have the specific function. Are the functions with longer or shorter durations more likely to contain bugs? We also examined several other attributes of functions and analyzed their relationships with functions' fault-proneness.

E. WHAT IS THE CORRELATION BETWEEN SOFTWARE STRUCTURAL CHARACTERISTICS AND THE FAULT-PRONENESS OF FUNCTIONS?
Software systems have been widely employed in various areas of our lives, which results in different software having unique structural characteristics. Instead of treating different software systems the same, we also performed experiments to demonstrate the correlation between the structural characteristics and functions' fault-proneness. This information can also help us better customize the fault-proneness analysis approach to produce more effective and accurate prediction results.

III. EXPERIMENTAL SETUPS
In the following section, we first briefly introduce the subject open source projects that have been used in this paper.

A. SUBJECT PROJECTS
In this paper, we analyzed four open source projects from GitHub. Table 1 presents the detailed information of these projects. The Total Functions and Total Calls in Table 1 are calculated based on the most recent version.
Nginx [26] is a free and open source project that provides a HTTP server, a mail server, and a generic TCP/UDP proxy server. The software has been widely used in a large number of websites and is known for its stability and performance as well as its ease of usage and configuration.
Gedit [27] is a small and lightweight UTF-8 text editor that is part of the GNOME environment. The project uses the lastest GNOME and GTK+ libraries with support for Drag  and Drop (DnD) from Nautilus (the GNOME file manager), the use of the GNOME help system, the Virtual File System GVfs, and the GTK+ print framework.
Nagioscore [28] is the core of an open source project Nagios from GitHub, which can be utilized to monitors systems, networks, and infrastructures. In addition, the project also provides supports for monitoring and alerting services for servers, switches, and applications.
Redis [29] is another open source project that can be used as cache or databases. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, and geospatial indexes with radius queries.

B. SUBJECT PROJECTS
In this subsection, we describe the procedures used in this paper to collect data related to bugs, functions, and software structural characteristics.

1) CONSTRUCTION OF SOFTWARE NETWORKS
In this paper, we use Function Call Graphs (FCGs) to form software complex networks, which helps to more explicitly illustrate the structures of software systems. The networks are then used as foundations for the follow-up analyses. An example of FCG based on Nagioscore V1.0 is presented in Figure 1. In the FCG, we use nodes to represent unique functions, while edges stand for the calling relationships between pairs of functions. Necessary information to establish the FCGs is extracted using Doxygen [30], a de facto standard tool for generating documentation of multiple programing languages, including C, C++, Java, Python, and so on.

2) EXTRACTION OF DEFECT-RELATED INFORMATION
In order to locate bugs in open source projects, information related to commits for fixing bugs is extracted from the issue tracking system. In our study, we first collect information related to bug fixing commits. With these commits, a list of functions that are likely to contain bugs can be generated. Then, we try to pinpoint the exact location of each bug as well as their introduction time/version. In our study, we use the issue tracking system provided by GitHub and the BugTracer implemented by our team to help us retrieve information related to bugs. Details of the extraction process are described in the following steps: Step 1. Similar to previous works ( [31], [32]), we use keywords such as ''fix,'' ''bug,'' ''defect,'' ''fault,'' and ''patch'' in the commit comments to filter the commits that are related to bug fixing. These commits are closely related to bugs, since their goal is to fix potential bugs in their previous releases. By doing this, potential bugs within each version can be collected. For instance, consider the following comment of a commit: ''Fixing the segmentation faults during the execution of garbage collections.'' It is obvious that this commit is closely related to a bug, and the modified code should be responsible for fixing this bug.
Step 2. With Step (1) performed, a number of records that contain the information related to bugs would be stored in our database. In addition, the versions with the code introduced to fix bugs can also be identified, and we call them ''bug-fixing versions,'' denoted as With the bug-fixing version set achieved in Step 2, we then determine the locations of bugs within versions of software systems based on commits that corrected a failure in the program's source code. Ideally, with respect to each bug, we can get two unique versions, V intro and V fix , representing the version introducing the corresponding bug and the version fixing it, respectively.
Though developers evidently marked these version as bugfixing versions, it is possible that the modifications made in this version are not purely related to fixing a specific bug. In other words, the faulty statements should be isolated from other modified statements that focus on implementing features. Therefore, the procedures listed below are used to derive the version set {V intro , V fix } and the corresponding modified functions set {F fix−1 , F fix−2 , . . . , F fix−k } to fix a particular bug. A similar approach has also been utilized in [33].
Step 3. Locate potential bug-fixing statements. With respect to a specific version in U fix , say V fix−m , which fixes the mth bug, we compare the current version with its prior version and locate statements that are modified between the two versions. For a specific bug-fixing version, the statements form another set, denoted as Step 4. Determine the last versions that modify each statement in S fix−m . These versions are denoted as With the bugs locating and introducing versions, we can determine the faulty functions of each version during the entire life cycle of software systems as well as the introduced and fixed time of each bug.

3) EXTRACTION OF INFORMATION RELATED TO FUNCTION-LEVEL MODIFICATIONS
With respect to the function-level modifications, we focus primarily on the following six types of modifications: Addition, denoted as AD. Addition of newly implemented functions can be observed frequently during the development of software systems. We determine the addition of new functions by comparing the network of a version with its previous version. If a specific node does not exist in the current version, we define it as a newly added function, and the AD value for this function is one; otherwise, the value is zero.
Modified Code, denoted as MC. This category of modification is used to represent whether the code within a function has been modified when compared to its prior version. In our study, we extract the code within each function and evaluate the corresponding md5 value. If the md5 values remain the same between two consecutive versions, it represents that the function is not modified in these two versions, and the MC value is zero. Otherwise, the contents of the function have been modified, and the MC value is one.
In-Degree Change, denoted as Deg-In. Once the function calls from other functions to a specific function, the in-degree of this function changes. In this study, we use a Boolean value to represent whether this set between two consecutive versions has changed. In other words, Deg-In equals one if the set of functions calling the corresponding function changes; otherwise, Deg-In equals zero. VOLUME 8, 2020 Out-Degree Change, denoted as Deg-Out. Contrary to In-Degree change, for a particular function, Deg-Out represents changes in the set of functions called by the corresponding function. Similar to Deg-In, we also use a Boolean value to represent it.
Degree Change, denoted as Deg-Ch. Deg-Ch represents when either the Deg-In or Deg-Out is one. If one of the two measurements is one, it means that the calling relationships between this function and other functions are changed. Therefore, the Deg-Ch should also be one.
Overall Modification, denoted as OM. If any of the previously discussed modifications exists, the OM value for a specific version is one. Otherwise, the OM value is zero.

4) EXTRACTION OF FUNCTION ATTRIBUTES
In the process of software development, when analyzing a newer release, we should not only consider the modifications of the current version, but also consider the influence of the historical changes made to the functions. Therefore, in addition to different types of modifications, we also consider several other attributes with respect to each function. The following three unique types of attributes, namely Time, DiffCount, and HisBug, are included in our study and are described in more detail as follows: Time: This attribute measures the duration during which a function has existed in the project. The Time of a particular function equals the total number of versions from the version in which it is introduced to the version where it is deleted. If the function exists in the most recent version of the project, this version is used as the last version of the function's duration.
DiffCount: With the function-level modification information collected, we use DiffCount to represent the total number of function-level modifications that have been made to a particular function during the life cycle of the project. From a mathematical point of view, the DiffCount of a function equals the sum of the AD, MC, Deg-In, and Deg-Out of this function in all previous versions.
HisBug: In this paper, we also consider the factor of whether a function contains bug(s) in any of its previous versions. Our intuition is that a function that had bugs in the past may also contain bugs in the subsequent releases. For example, fixing one bug may introduce additional new bugs in the same functions.

5) NETWORK MEASURES
For each function node, we use an ego network and a global network to analyze its structural characteristics from two angles: local and global. For the ego network, its nodes include a sole central node (''ego'') and all the nodes that the ego connects directly (''alters''), and its edges include the edges between the ego and alters and the edges among the alters. The ego network focuses on the study of the characteristics of a single node in the local network. At the same time, each alter and its alters can also form an ego network, and the ego network of all nodes is combined to form the complete  ([38], [39]). global network. The global network contains all the nodes and edges.
Similar to [34], we use the following 24 network measures shown in Table 2, including 12 ego network measures and 12 global network measures, to analyze the relationship between software structural characteristics and defects from two unique perspectives: global and local. Based on the FCGs, a widely-used library Networkx [35] is applied to help us obtain the measurements of the software network structures.

IV. EVALUATING METHODOLOGIES
In this section, we will describe in detail all the methods used to demonstrate whether a specific factor is related to the distribution of bugs. In Section A, we discuss the basic knowledge with respect to the power law distribution. What follows in Section B is a conditional probability-based method to analyze the correlation between modifications and bug distributions. In Section C, the logistic regression is introduced.

A. POWER LAW DISTRIBUTION
In statistics, the power law is a functional relationship between two quantities, where a relative change in one quantity results in a proportional relative change in the other quantity, independent of the initial size of those quantities; one quantity varies as a power of another. For example, we use the following formula to calculate the area of a circle: Suppose that the radius r is doubled; the area S is therefore multiplied by a factor of four. In this case, we conclude that S and r conform to the power law distribution.
In this study, we find that the distribution of bugs and function-level modifications conforms to the power law distribution, which can be defined as follows: The above definition can be utilized to verify if the distribution under analysis is consistent with the power law distribution. In the above equation, p(x) represents the number of bugs or the value of a particular type of modifications on software element (function or version) with an index of x. Then, Plfit [36], which is a Python implementation of a power law distribution fitter, is used to check the consistency with respect to the power law distribution. Plfit examines if the power law is an acceptable fit based on the ''p-value'' calculated using the Monte-Carlo test. If the p-value is higher than 0.1, we would say that the data conforms to the power law distribution [37].

B. CONDITIONAL PROBABILITY-BASED METHOD
The following method is also employed to help us further explore the correlation between a specific type of modifications and the faulty functions: For each type of modification, we divide the functions in a software system into four disjoint groups, which can be represented as Func XB , Func XB , FuncX B , and FuncXB, respectively. Func XB indicates the groups of faulty functions that have one or several modifications of type X; Func XB contains the functions that do not contain any bugs but have been modified by X modifications; FuncX B possesses faulty functions with no X modifications; and FuncXB represents the groups of correct functions with no X modifications.
With the four groups divided, N X ,B , N X ,B , NX ,B , and NX ,B are used to represent the number of functions within each group. In addition, N represents the number of functions included in a specific version. We have the following: We define the P X as the proportion of newly-added functions in the total functions, which can be calculated with the following formula: P X = N X N Also, we use P X ,B|B to demonstrate the percentage of faulty functions that have at least one modification of type X: Following the same evaluation criteria utilized in [40], if P XB|B is significantly greater than P X , the bugs are more likely to be in functions with X modifications.
Take the addition of new functions as an example. Without loss of generality, assume that the P AD,B|B of a specific version is 0.85, and it has a P AD value of 0.25. These two numbers indicate that 85% of bugs in the software system are actually from those 25% newly added functions.

C. LOGISTIC REGRESSION
The following method is also employed to help us further explore the correlation between a specific type of modifications and the faulty functions: In our study, we used logistic regression to help us determine if a specific network measure discussed in Section III B serves as a predictive variable for fault-proneness. Logistic regression is a widely-used classification model that can be utilized to determine which of two values the dependent variable should be. Therefore, it is appropriate for our study, since a function of a software system either contains a bug or is bug-free.
When performing logistic regression, an important aspect that needs to be considered is the identification of influential observations, which can change the regression results [41]. If a small subset of influential observations is included, it could have a disproportionate impact on the estimation. In other words, the model estimates are more likely to be calculated based on these outliers than the rest of data [41].
Similar to other studies [34], we applied the method described in [41] to exclude influential observations (whose Cook's distance is equal to or larger than 1 [42]) to ensure an accurate analysis of the correlation between various factors and their abilities to predict fault-proneness.
Two different types of logistic regression models exist. The first is the univariate logistic regression model, which can be used to determine the individual impacts brought by the independent variable on the dependent variable. Another type of model is the multivariate logistic regression model, which tries to figure out the combinatorial effects brought by multiple independent variables.
Obviously, the univariate logistic regression is a better fit for our study to identify the relationships between each software structural measure and functions' fault-proneness.
The univariate logistic regression model is based on the following: where X is the independent variable, and Y is the dependent variable, which can only be equal to either zero or one. β 0 and β 1 are the regression coefficients and can be estimated by maximizing the following: where x i is the ith value of independent variable X , and y i is the corresponding value of dependent variable Y .
With the logistic regression model, we use the following statistics to evaluate whether each network measure is closely related to bug location: • p-value. The p-value is the most widely-used statistic to determine if the dependent variable Y significantly depends on the independent variable X . The null hypothesis is H 0 : β 1 = 0, which means the independent variable X has no impact on the dependent variable Y . If the null hypothesis is rejected, it represents that Y is dependent on X . Usually, we set the cutoff for significance as 0.05. We reject the null hypothesis if we achieve a p-value smaller than 0.05 and conclude that Y is dependent on X .
• Odds ratio. The odds ratio is another commonly used statistic to quantify the correlation between two variable X and Y . The independent variable is considered to have no impact on the dependent variable if the odds ratio is equal to 1. A positive relationship between X and Y can be established if the odds ratio is greater than 1, while an odds ratio smaller than 1 indicates a negative relationship.

V. RESULTS AND DISCUSSIONS
In this section, we will discuss the results regarding the five research questions proposed previously.

A. RQ1: WHAT IS THE DISTRIBUTION OF DEFECTS IN TIME AND SPACE?
The cumulative distributions of software bugs with respect to software versions and functions for the projects under analyses are shown in Figure 2. In this part, we only include the figures for the distribution of bugs based on data collected from Redis and Nginx. However, similar results can also be observed in figures for other software systems used in this study.
In the figures, the X-axis represents the number of bugs in each version (see Figure 2  free. For example, with respect to Redis in Figure 2(a), the y value for x = 1 is 0.325, which represents that only (0.325 × 100% =) 32.5% of all versions of Redis contain at least one bug and the remaining 67.5% of versions are bugfree. Similarly, as illustrated in Figure 2(b), the y value for x = 1 is 0.317, which means that (0.317 × 100% =) 31.7% of all versions of Nginx are defected, while the rest are bugfree. Consequently, our first observation is that bugs are more likely to exist in a relatively small portion of versions.
When it comes to the distribution of bugs in space, similar findings can be observed. It is noteworthy that most bugs concentrate on a relatively small percentage of functions, instead of being evenly distributed in all functions. As demonstrated in Figure 2(c), the data point (2, 0.032) represents that only (0.032 × 100% =) 3.2% of all functions contain at least two bugs. Moreover, when x = 1, the y value is 0.105, which means that (1 − 0.105 × 100% =) 89.5% of all functions in Redis are bug-free. Similar observations can also be made in Figure 2 In addition, statistical validation is also performed to verify that the distribution of bugs in time and space fits to the power law distribution. The fitting parameters are shown in Table 3. According to [37], if the p-value is greater than 0.1, the data sets satisfactorily fits the power-law distribution. As can be   Table 3, all p-value for the eight scenarios are greater than 0.1, which means that the distribution of bugs in time and space can be fitted using the power law distribution.
In summary, with respect to RQ1, our observations are as follows: The distribution characteristics of bugs and the distribution characteristics of modifications are also discussed. Figure 3   to Figure 5 illustrate data regarding the distribution of different types of modifications. Here, we only present the data for Gedit, Nagioscore, and part of Nginx. However, we would like to emphasize that similar phenomena can also be observed for other projects used in this study. In these figures, the X-axis represents the number of a specific type of modification in each version (see Figure 3 and Figure 4) or in each function (see Figure 5), while the Y-axis VOLUME 8, 2020 stands for the cumulative percentage of versions or functions, respectively. For example, the data point (50, 0.248) in Figure 3(a) represents that (0.248 × 100% =) 24.8% versions of Gedit contain more than 50 AD modifications.
If we consider the distribution from the time point of view, it can be observed that all the curves in Figure 3 and Figure 4 roughly follow similar trends: • For most of the curves, the beginning part is relatively steep. This can be interpreted as the fact that the number of modifications for most of the versions is relatively small. For example, as demonstrated in Figure 3(a), when x = 10, the y value is 0.515; however, when x increases to 100, the y value drops dramatically to 0.206. This indicates that for the majority of versions, the number of modifications is relatively small. However, there do exist exceptions. Compared with other figures, the first part of the curve in Figure 4(b) exhibits a linear-style decrease. In this case, the MC modifications with respect to Nagioscore tend to be more evenly distributed in all versions when compared with all other modification types.
• The remaining parts of these curves resemble each other and show similar trends. After the first part discussed above, the remaining parts follow a gentler decreasing pattern. For example, as demonstrated in Figure 4(d), the data point (74, 0.255) represents that (0.255 × 100% =) 25.5% of versions of Nagioscore contains more than 74 Deg-Out modifications; when the x value increases to 220, the y value only decreases by less than 13%, from 0.255 to 0.194. Therefore, we can conclude that a large number of modifications are deployed on a small portion of versions.
Similar observations can also be noted from the space point of view. Without loss of generality, we only include the figures for Nginx, as illustrated in Figure 5. Except for Figure 5(d), all other figures exhibit similar characteristics as those in Figure 3 and Figure 4, which means that a large number of modifications focus on a small number of functions. Though Figure 5(d) is slightly different from the other three figures, we can still achieve the same conclusion.
Statistical validation is also performed to verify that the distribution of modifications in time and space fits the power law distribution. The fitting parameters are shown in Table 4. As can be observed in Table 4, all p-values for AD, Deg-In, Deg-Out, and Deg-Ch are greater than 0.1, which means that the distribution of modifications in time and space can be fitted using the power law distribution. However, based on Figure 3 to Figure 5, MC and OM do not conform to the power law distribution significantly. Therefore, MC and OM are not included in Table 4.
In summary, with respect to RQ2, our observations are as follows: • Modifications are more likely to concentrate on a relatively small number of versions In this subsection, we further investigate the correlation between different function-level modifications and faultproneness of functions. Here, we use the conditional probability-based method discussed in Section IV B. Based on the description in Section IV B, P X ,B|B represents the percentage of defected functions that go through the modification X. Generally speaking, with respect to a functionlevel modification type X, if P X ,B|B is significantly greater than P X ,B|B , it indicates that the modification X is closely correlated with whether a specific function contains bug(s).
In Figure 6, we only include the figures for AD, Deg-In, Deg-Out, and Deg-Ch collected from Nagioscore. Note that similar phenomena can also be observed for other types of modifications with respect to other projects. In these figures, we only include those versions that contain at least one bug, and other bug-free versions are ignored in this subsection. In  these figures, we use orange bars to represent P X ,B|B and blue bars for P X ,B|B .
As seen in Figure 6, we find that for most of the scenarios, P X ,B|B is significantly greater than P X ,B|B . As illustrated in Figure 6(a), P AD,B|B is greater than P AD,B|B in 35 of 36 versions. For example, with respect to version 2.1, P AD,B|B is 0.932, while P AD,B|B is only 0.005. Taking version 3.0b2 in Figure 6(a) as another example, P AD,B|B is 0.670, whereas P AD,B|B is 0.057.
However, several exceptions exist. For example, as presented in Figure 6(c), for version 2.1, P Deg−In,B|B is 0.024 while P Deg−In,B|B is 0.102. This indicates that for this particular version, bugs distributed more on the functions without Deg-In modifications.
For further verification, we use logistic regression discussed in Section IV C to analyze the correlation between function-level modifications and defects. The results are shown in Table 5. Based on the table, all the p-values for AD, Deg-In, Deg-Out, and Deg-Ch are smaller than 0.005, which shows that those function-level modifications are closely correlated with whether the function contains bug(s) or not. In addition, their corresponding odds ratios are all greater than 1, which represents that these modifications are positively related to the functions' fault-proneness. In other words, the increase in these four types of modifications will very likely induce defects into the functions being modified.
We would also like to mention that the correlation between MC/OM and bug distribution is not remarkably significant and is not included in Figure 6 and Table 5.
In summary, with respect to RQ3, our observations are as follows: In this subsection, we will explore the correlation between attributes of functions and their corresponding faultproneness. As discussed in Section III B, the following three attributes will be considered: Time, DiffCount, and HisBug.

1) DISTRIBUTION OF BUGS WITH TIME
Intuitively, our conjecture is that the shorter a function exists in a software system, the more likely that the function may contain bug(s). If a function is removed very shortly after it is added into the system, it is extremely likely that the function is not compatible with other functions and therefore can be rather difficult to implement. In such cases, the function could induce more bugs than other functions normally do. In Figure 7, It is noteworthy that the percentage of defected functions decreases significantly with the increase in a function's Time. For example, the average percentage of defected functions with Time values between zero and 15 is (0.548 × 100% =) 54.8%. However, the value decreases significantly to (0.083 × 100% =) 8.3% for functions with Time values greater than 15. The downward trend indicates that a function is more likely to contain bug(s) if its Time value is small. Therefore, if the lifetime of a function is relatively short, the possibility of the function containing bug(s) is high.

2) DISTRIBUTION OF BUGS WITH DIFFCOUNT
As discussed in Section III B, we use DiffCount to represent the total number of function-level modifications made to a particular function during the life cycle of the project. Intuitively speaking, DiffCount should be positively related to the possibility of the function being defected. This is because the more modifications that are made to a particular function, the more likely programmers may make errors and introduce bugs.
In Figure 8, the X-axis represents the total number of function-level modifications in different functions. The Y-axis on the left shows the number of functions, and the color bar on the right represents the number of bugs in those functions in the same vertical bar. For example, when 84030 VOLUME 8, 2020 DiffCount is equal to three, the grey vertical bar at x = 3 has a length of 202, indicating that 202 functions with three function-level modifications do not contain any bug. Similarly, the orange bar underneath has a length of 37, which means that 37 functions with three modifications contain one bug. It can be observed that with the increase in DiffCount, the number of functions with the number of modifications decreases. The same applies to the number of bug-free functions and of defected functions.
We also include a red line indicating the percentage of defected functions with a specific number of DiffCount. For example, the data point (12, 0.260) represents that (0.260 × 100% =) 26.0% of functions with a DiffCount of 12 are defected. Unlike the number of functions, the percentage of defected functions increases significantly with the increase in DiffCount. In other words, the more function-level modifications that are made to a function, the higher the possibility that the function is defected.

3) DISTRIBUTION OF BUGS WITH HISBUG
We also use the mosaic plots (as in Figure 9) to describe the correlation between HisBug and the functions being defected. In these plots, we use the rectangular area to represent the percentage of functions with a specific property. For example, the area of the rectangle marked with (0, 0) represents the percentage of bug-free functions that are also bug-free in all its previous versions. Similarly, the area of the rectangle marked with (1, 1) shows the percentage of defected functions which also had bugs in their previous versions.
Without loss of generality, we only include the mosaic plots for Nagioscore 2.6 and 3.0b2. However, most versions of projects share similar characteristics. According to these plots, most of the functions do not contain bugs. For example, as illustrated in Figure 9(a), 72.5% of all functions are bugfree, not only in their current version but also in their previous versions.  As shown in Figure 9(b), the rectangular area (0, 0) is 0.710, whereas the area is 0.200 for rectangle (0, 1). This indicates that for functions without bugs in their previous versions, only one-third of them are defected in the current version. However, the areas of rectangles (1, 0) and (1, 1) are 0.025 and 0.065, respectively. In other words, for the functions that were defected in previous versions, almost 61.5% are still defected.
Therefore, the possibility of a function containing defects increases if it contains bugs in a previous version.
For further verification, we use logistic regression discussed in Section IV C to analyze the correlation between functions' attributes and fault-proneness. The results are shown in Table 6. It can be observed that all the p-values are less than 0.05, which represents that these three attributes are closely correlated to functions' fault-proneness. In addition, the odds ratios of these attributes also indicate that Time is negatively related to functions' fault-proneness, while DiffCount and HisBug are positively related to faultproneness.
In summary, with respect to RQ4, our observations are as follows: • Time is negatively related to whether a function contains bug(s) or not In this section, we use logistic regression to analyze the relationship between FCNs structural characteristics and function's defect-proneness. The results of global network measures are shown in Table 7, while Table 8 presents the results of ego network measures. VOLUME 8, 2020 In Table 7 and Table 8, the rows marked gray indicate that the measures do not satisfy the ''relevant'' standard. In other words, these measures are not related to whether a function contains bugs or not.
In summary, with respect to RQ5, our observations are listed in Table 9.

VI. CROSS-COMPARISON
For comparison, we use logistic regression to analyze the relationship between 19 traditional source code metrics shown in Table 10  The results of traditional source code metrics are shown in Table 11. In Table 11, the cells marked gray indicate that the metrics satisfy the ''relevant'' standard, that is to say, these metrics can be considered as ''relevant'' to whether a function contains bugs or not. Obviously, only (10 / 152 =) 6.7% of the scenarios can prove the ''relevance''. Moreover, the same metric has different correlation with bugs in different projects, so it has no universal significance. Compared to the metrics we proposed in this paper, these metrics are of no statistical significance.

VII. THREATS TO VALIDITY
The first threat involves whether the software projects selected in this paper is representative enough to serve as the basis of a solid study. In our study, we analyzed more than 1,000 versions from four open source projects, which is far more than the numbers of versions used in other studies. For example, the authors of [34] selected 39 versions, which is far less than the number of versions we selected. In the next stage, we will include more projects to make the results more universal.
Second, the defects used in this study are extracted using the steps described in Section III B. However, in order to determine the exact locations of defects, manual inspection is required. As a result, there is a possibility that the locations of these defects are incorrect. However, we did our best to prevent this from happening. For the bugs that cannot be reproduced, they will be dropped without further investigation. In addition, several engineers are included in this process. With respect to each bug, they should all agree on the actual location of the bug. If no agreement can be reached, one more engineer will be asked to investigate the bug and make the final decision.
Finally, the network measures used in this study may not be sufficient enough. There is the possibility that more attributes should be included, and some of the included attributes only apply to the selected projects. To make our study more representative of most software projects today, more open source projects as well as network measures will be included in the next step.

VIII. RELATED STUDIES
In the last few decades, several studies have been conducted to explore the empirical principles with respect to the distribution of faults in software systems ( [16], [45]- [48]). The Pareto principle is the most frequently cited. It suggests that VOLUME 8, 2020 most of the faults in a software system lie in a small number of modules. Fenton and Ohlsson [45] not only verified the Pareto principle but also reported a ''counter-intuitive relationship between pre-and postrelease faults.'' The least fault-prone modules in postrelease are the most fault-prone in prerelease, and the most fault-prone modules in postrelease are the least fault-prone in prerelease. The authors of [49] also observed the same with respect to a large industrial software system. Other studies ( [16], [47], [48]) also verified the Pareto principle.
Aside from the Pareto principle, studies such as [47] have also been proposed to examine the quality assurance efforts spent on various software modules. Grbac et al. [47] suggested that defects are unevenly distributed in software systems, making quality assurance efforts spent on different parts of a software system also varied. This finding needs to be further validated by more empirical data.
Various software metrics have also been proposed to help developers better predict possible locations of bugs. They can be divided into two types: product metrics and process metrics.
Product metrics can be further divided into two categories: object-oriented metrics and source code metrics. Many studies focus on object-oriented metrics for fault prediction, among which the CK metrics are most frequently referenced [50].
Cross-comparisons among different object-oriented metrics have also been conducted. Goel and Singh [51] compared ten class level metrics with respect to their abilities in fault prediction for object-oriented software. Gyimthy et al. [52] evaluated the performance of seven object-oriented metrics, including Weighted Methods per Class (WMC) and Depth of Inheritance Tree (DIT). Briand et al. [6] used 28 coupling measures, 10 cohesion measures, 11 inheritance measures, and a small selection of size measures to show that many coupling and inheritance measures are strongly related to the probability of fault detection in a class. Models derived by some of the coupling and inheritance measures can accurately predict most of the faults in some classes. Similar studies such as [53] were also reported. However, fault prediction based on object-oriented metrics can only help developers predict bugs at a class or file level. Moreover, they cannot be replicated across different environments and systems to obtain generalized conclusions.
Studies ( [47], [54]- [56]) have shown the limitation of using source code metrics (e.g., lines of code) to predict fault-proneness. Zhou et al. [56] reported that many source code metrics have ''moderate or almost moderate'' ability to detect fault-prone classes. Xu et al. [54] examined the risk prediction abilities with respect to twelve metrics and showed that these metrics are not strong enough to establish a precise prediction model. Similar results can also be found in ( [47], [55]).
Unlike product metrics focusing on the static aspects of software, process metrics emphasize the lack of considering aspects related to the development process. Graves et al. [57] conducted a study to predict fault incidence based on the change history of software. Illes-Seifert et al. [58] examined the relationship between a file's history and its corresponding fault-proneness. They also used open source programs to validate their findings. In recent years, software network measurements with good fault prediction ([59]- [61]) have attracted attention from both academia and industry.
The theory of complex networks has been widely used in the area of software engineering to help engineers analyze software systems, especially for systems with extremely large scale and complexity. Li et al. [62] considered software systems as complex networks and proved that the power consumption of software is non-linearly related to its network characteristics. Chong and Lee [63] used a weighted complex network to represent object-oriented software systems and to capture the structural characteristics of these systems. The maintainability and reliability of these systems were then carefully evaluated based on the networks. Later, they [64] provided an approach that can help practitioners automatically achieve the clustering constraints from the implicit structure of software systems based on graph theory. In [65], Myers discussed the relationships between several network topological measurements to software engineering practices. Qian et al. [66] applied the widely-used community detection algorithm in the clustering of software modules and demonstrated that the clustering performance is better than that of the Bunch method. In addition, Le and Panchal [67] proposed an analysis method for product structures at both the product level and module level. In [68], the authors presented an approach that models software packages as nodes and dependencies among them as edges. Then, methods in the field of complex networks were applied to explore properties of network models. In [69], community structure detection was proposed to evaluate the cohesion of object-oriented software programs.
In recent years, software network measurements have also been attracting practitioners from both academia and industry. Zimmermann and Nagappan [70] suggested that by analyzing the dependencies exist among various pieces of code, engineers could identify around 60% of the critical binaries, which is twice as many as the percentage identified by complexity metrics. Tosun et al. [60] indicated that the network measurements are closely related to defective modules, whereas the correlation is not significant for small-scale projects. T.H.D. Nguyen et al. [59] demonstrated that several of the dependency network measurements are strongly correlated with post-release failures, while other measurements are not. They also indicated that testing can benefit from bug predictions and reduce costs during the entire testing process. In [61], the authors proposed an MHCP model to help engineers analyze open source projects based on software complex network measures.

IX. CONCLUSIONS AND FUTURE WORK
In this paper, we first analyze the distribution of bugs from both time and space points of view. Different types of function-level modifications are proposed, and further investigations are performed to determine the correlation between these modifications and functions' fault-proneness. Other attributes related to functions as well as software networks are also utilized as possible indicators for fault prediction.
Our results indicate that: 1) bug distributions in both time and space conform to the power law distribution; 2) of the six types of function-level modifications, four of them (AD, Deg-In, Deg-Out, and Deg-Ch) fit the power law distribution, while MC and OM show no such pattern; 3) AD, Deg-In, Deg-Out, and Deg-Ch are all positively related to functions' fault-proneness, while MC and OM's correlations with functions' fault proneness are not remarkable; 4) of the three functional attributes, two of them (DiffCount and HisBug) are positively related to functions' fault-proneness, while Time exhibits significant negative correlation with the possibility that a function contains bug(s); 5) several software network measures are also proven to be closely related to functions' fault proneness. Overall, we have identified a set of critical factors that can serve as valuable indicators for existing fault prediction approaches to help increase their prediction accuracy.
In the next step, analyses towards more software projects will be performed to achieve more in-depth knowledge with respect to factors that have an influence on functions' faultproneness. We would also like to establish a novel fault prediction model based on not only the factors we have proposed in this paper but also the existing metrics to provide better precision in fault prediction. In addition, other software network measures will also be introduced into the area of software fault prediction so that engineers can achieve better fault prediction strength.