A Survey of Challenges in Spectrum Based Software Fault Localization

In software debugging, fault localization is the most difficult, expensive, tedious, and time-consuming task, particularly for large-scale software systems. This is because it requires significant human participation and its sub-tasks are difficult to automate. Therefore, there is a high demand for automatic fault localization techniques that can help software engineers effectively find the locations of faults with minimal human intervention, and different types of such techniques have been proposed and implemented. Spectrum Based Fault Localization (SBFL) is considered amongst the most prominent techniques in this respect due to its efficiency and effectiveness. In SBFL, the probability of each program element (e.g., statement, block, or function) being faulty is calculated by executing test cases and then using their results and corresponding code coverage information. However, SBFL techniques are not yet widely adopted in industry, because they pose a number of issues and their performance is affected by several influential factors. For example, the characteristics of bugs, target programs, test suites, and supporting tools make their effectiveness differ dramatically from one case to another. There is a massive body of studies on SBFL covering its usage, formulas, performance, etc., but so far no dedicated survey comprehensively points out the issues of SBFL. In this paper, various SBFL challenges and issues are identified, categorized, and discussed. The paper also raises awareness of the work being done to address the identified issues and suggests some potential solutions.


I. INTRODUCTION
Software covers many aspects of our everyday life, as it is used in different application domains such as healthcare, military, automobile, and transportation; our modern life cannot be imagined without it. The extensive use of different software products in our day-to-day activities has led to a significant increase in their size and complexity [1]. As a result, the number and types of software faults have also increased. Software faults lead not only to financial losses but also to loss of lives. Finding the locations of faults in software systems has historically been a manual task known to be tedious, expensive, and time-consuming, particularly for large-scale software systems [2]. Besides, manual fault localization depends on the developer's experience to find and prioritize code elements that are likely to be faulty, and developers spend half or more of their programming time on finding faults alone [3]. Therefore, there is a serious need for automatic fault localization techniques that can help developers effectively find the locations of faults in software systems with minimal human intervention. Different types of such techniques have been proposed and implemented by researchers and developers. Among them, Spectrum Based Fault Localization (SBFL) is considered one of the most prominent due to its lightweight and language-agnostic nature [4], its ease of use [5], and its relatively low overhead in test execution time [6].
In SBFL, the probability of each program element (e.g., statement, block, or function) being faulty is calculated based on executing test cases and then using their results and corresponding code coverage information [7]. However, SBFL techniques are not yet widely adopted in industry, as they pose a number of issues and their performance is affected by several influential factors [8], [9]. For example, the characteristics of bugs, target programs, test suites, and supporting tools make their effectiveness differ dramatically from one case to another. In the literature, there is a massive body of studies on SBFL covering its formulas, performance, and applications. However, no dedicated survey comprehensively points out the issues of SBFL. Thus, it is crucial to present and categorize the various SBFL challenges and issues in a comprehensive survey on the topic.
The main contributions of this paper can be summarized as follows:
1) Conducting a systematic literature survey on the challenges of SBFL.
2) Identifying and presenting 18 SBFL challenges and issues.
3) Raising awareness of the work being done to address the identified challenges and issues, and suggesting potential solutions in order to help those working on this topic and those interested in contributing to it.
The study begins with the formulation of a research question (RQ) that addresses several aspects of the considered topic. It then identifies the related papers that should be read in order to answer the RQ. Finally, it discusses potential research opportunities in the field. To accomplish the aforementioned goals, relevant papers were collected and thoroughly analyzed in a systematic manner.
The remainder of this paper is organized as follows: Section II briefly introduces the background of SBFL and its main concepts. Section III presents the most relevant works. Section IV describes the research methodology employed to perform the study. Section V presents the study's findings. Section VI outlines the threats to validity and the steps considered to overcome them. Finally, in Section VII, the conclusions of the study are given.

II. BACKGROUND OF SBFL
SBFL is a dynamic program analysis technique that is performed through program execution [10], [11]. Its goal is to address the problem of finding the root causes of bugs by utilizing information from program elements executed by test cases, in particular the outcomes of the tests and their code coverage [12], in order to identify and locate the elements most likely to be faulty. In SBFL, code coverage information (also called program spectra), obtained by executing a set of test cases while recording their results, is used by a ranking formula to calculate the probability of each program element (e.g., statement, block, or function) being faulty [13]. Code coverage indicates which program elements have been executed and which have not during the execution of each test case, while test results are classified as passed or failed. Passed test cases are executions of a program whose outputs are as expected, whereas failed test cases are executions whose outputs are unexpected [14].
The idea of program spectra was mentioned for the first time in 1987 [15]. However, the use of program spectra for fault localization was first proposed in a study on the year 2000 problem (also known as the Y2K problem), aimed at discovering errors in calendar data formatting and storage for dates in and after the year 2000 [16]. It is worth mentioning that Tarantula, proposed in 2002 [17], is one of the first approaches that uses a ranking formula to calculate elements' suspiciousness in SBFL [12]. Afterwards, many other formulas have been introduced, and the roots of many of them are in biology, such as Ochiai [18] and Binary [19].
To illustrate how SBFL works, assume that there is a Python mid() function that takes three numbers as input and returns the median value. The mid() function, a well-known code segment used for describing fault localization [20], comprises twelve statements Si (1 ≤ i ≤ 12) and six test cases Tj (1 ≤ j ≤ 6), as shown in Figure 1, that have been executed with the spectra (the execution information of statements in passed and failed test cases) recorded as presented in Table 1. There is a fault in statement 7 (the correction is m = x), and only two test cases, T1 and T6, hit that faulty statement. An entry of 1 in the cell corresponding to statement Si and test case Tj means that the statement Si has been executed by the test case Tj, and it is 0 otherwise. Also, an entry of 1 in the row labeled R, which represents test results, means that the corresponding test case failed, and it is 0 otherwise. Intuitively, a statement that is executed by more failed test cases is more likely to be faulty.
TABLE 1: Program spectra of the mid() function and the resulting counters

Statement  T1  T2  T3  T4  T5  T6 | ef  ep  nf  np
    1       1   1   1   1   1   1 |  1   5   0   0
    2       1   1   1   1   1   1 |  1   5   0   0
    3       1   1   1   1   1   1 |  1   5   0   0
    4       1   1   0   0   0   1 |  1   2   0   3
    5       0   1   0   0   0   0 |  0   1   1   4
    6       1   0   0   0   0   1 |  1   1   0   4
    7       1   0   0   0   0   1 |  1   1   0   4
    8       0   0   1   1   1   0 |  0   3   1   2
    9       0   0   1   0   1   0 |  0   2   1   3
   10       0   0   0   1   0   0 |  0   1   1   4
   12       1   1   1   1   1   1 |  1   5   0   0

The spectra information is then used by a spectra formula (also called a ranking metric [21], a suspiciousness metric [22], a risk evaluation metric [23], or a fault locator [24]) to compute how suspicious each element is of being faulty. Table 2 presents a selection of the spectra formulas proposed in the literature.
Often, a formula is expressed in terms of four counters [25] that are calculated from the spectra as follows:
• ef: the number of times a statement is executed (e) in failed (f) tests.
• ep: the number of times a statement is executed (e) in passed (p) tests.
• nf: the number of times a statement is not executed (n) in failed (f) tests.
• np: the number of times a statement is not executed (n) in passed (p) tests.
Finally, the statements are ranked based on their computed suspiciousness scores and then examined by developers in descending order, from the most suspicious to the least suspicious. Statements with the highest scores are considered the most likely to be buggy. Table 3 presents the scores and ranks obtained by applying different spectra formulas to the spectra information of our running example. In the case of the Tarantula formula, for example, statements 6 and 7 are ranked as the most suspicious elements as they have the highest scores; the third most suspicious element is 4, and so forth. It is worth mentioning that statement 11 has not been included in the scoring as it has not been executed by any test case. Figure 2 shows the steps of the SBFL process and how the developer examines the suggested list of suspicious program elements.
In the previous example, we used statements as the basic code elements for fault localization. However, it is important to note that different granularities are frequently used as well, such as functions and code blocks. Technically, the granularity is determined by the granularity of code coverage measurement.
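To make the computation concrete, the following is a minimal Python sketch of the counter and Tarantula calculations described above, applied to part of the spectra in Table 1. Since the counters alone cannot distinguish whether T1 or T6 is the failed test, the sketch assumes T1 failed; all names are illustrative and not tied to any particular tool.

```python
# Illustrative SBFL core: derive (ef, ep, nf, np) from hit vectors and
# test results, then score each statement with the Tarantula formula.

def spectra_counters(hits, results):
    """Return (ef, ep, nf, np) for one statement's hit vector."""
    ef = sum(1 for h, r in zip(hits, results) if h == 1 and r == 1)
    ep = sum(1 for h, r in zip(hits, results) if h == 1 and r == 0)
    nf = sum(1 for h, r in zip(hits, results) if h == 0 and r == 1)
    np_ = sum(1 for h, r in zip(hits, results) if h == 0 and r == 0)
    return ef, ep, nf, np_

def tarantula(ef, ep, nf, np_):
    """Tarantula suspiciousness: higher means more likely faulty."""
    fail_ratio = ef / (ef + nf) if ef + nf else 0.0
    pass_ratio = ep / (ep + np_) if ep + np_ else 0.0
    if fail_ratio + pass_ratio == 0:
        return 0.0
    return fail_ratio / (fail_ratio + pass_ratio)

# A subset of the mid() spectra from Table 1 (assumption: T1 is the failed test).
results = [1, 0, 0, 0, 0, 0]          # T1..T6; 1 = failed
coverage = {
    1: [1, 1, 1, 1, 1, 1],
    4: [1, 1, 0, 0, 0, 1],
    7: [1, 0, 0, 0, 0, 1],            # the faulty statement
    8: [0, 0, 1, 1, 1, 0],
}

scores = {s: tarantula(*spectra_counters(h, results)) for s, h in coverage.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions, statement 7 receives the highest score among the sketched statements (1/1.2 ≈ 0.83), so it would be examined first.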

III. RELATED WORKS
SBFL has been an important and active research field for decades. However, a survey study on the issues and challenges in this field has been lacking. Only a few general survey studies on software fault localization were found in the literature as the most relevant publications. In this section, these studies are briefly presented.
The authors in [26] provided an overview of coverage-based testing and compared 17 coverage-based testing tools, including a tool called "eXVantage" developed by the authors. The comparison was based on several factors but focused mainly on coverage measurement. They then discussed various features (e.g., test case generation, test report customization, and automation) that should make tools more useful and practical. They also briefly mentioned that some tools have scalability issues, which makes them suitable only for small-scale software systems; many others provide fine testing granularity, but their performance overhead prevents them from being useful for testing. Nevertheless, the study helps developers pick the right tool that suits their requirements and development environment.
VOLUME 4, 2022
In [27], the authors presented evidence that the empirical evaluation of the accuracy of coverage-based fault locators depends on many factors. They summarized the problems that they encountered during their own empirical evaluation of the accuracy of fault locators and classified them into two main categories: threats to validity and threats to value. Each category comprises its own set of issues and their consequences for accuracy, including fault injection, instrumentation, multiple faults, and unrealistic assumptions.
In [28], the authors briefly reviewed previous studies on software fault localization in a table, in terms of techniques, evaluation methods, and the datasets used. However, their results are very abstract; no details were provided, and no issues or challenges were discussed.
In [29], the authors surveyed fault localization techniques from 1977 to 2014. They classified the techniques into eight categories: program slicing, spectrum-based, statistics, program state, machine learning, data mining, model-based debugging, and additional techniques. They also listed popular subject programs used to study the effectiveness of different fault localization techniques, and addressed the fault localization tools developed by the presented studies. Additionally, they presented some research challenges for fault localization techniques, such as fault interference, programs with multiple faults, and granularity level selection.
In [12], the authors conducted a survey on the state of the art of SBFL research, including the proposed techniques, the type and number of faults they address, the types of spectra they use, the programs they utilize in their validation, and their use in industrial settings. They also highlighted some challenges in SBFL research (e.g., tied entities, faults introduced by missing code, and coincidental correctness) that have to be tackled before SBFL can be used in real development settings.
In [30], the authors briefly discussed two issues, granularity levels and entities having the same suspiciousness, based on what they encountered in their collaboration with industry. They highlighted that many different granularity levels can be employed to generate a spectrum, but there is no guide to help practitioners select the spectrum granularity they require. They also discussed ties within rankings, caused by entities having the same suspiciousness, which need more attention so that new tie-breaking strategies can be proposed.
In [31], [32], the authors presented the issue of multiple fault localization (MFL) in the software fault localization domain. They identified three prominent MFL debugging approaches, i.e., the one-bug-at-a-time, parallel, and multiple-bug-at-a-time debugging approaches. They also presented some challenges with the identified approaches and provided some directions for future work.
All the survey studies mentioned above were general surveys that did not focus in detail on the issues in SBFL. Some of them briefly highlighted a very limited number of issues, and most of them were not conducted systematically.
In contrast, our paper provides a thorough and systematic survey based on a detailed research methodology to examine different issues in the SBFL alongside possible solutions or research gaps for further investigations. As a result, our paper extends the details of the aforementioned studies by identifying, categorizing, and discussing 18 important issues comprehensively.

IV. RESEARCH METHODOLOGY
The systematic process followed in this study is based on the guidelines provided by [33] and [34]. It consists of several stages as presented in the following subsections.

A. IDENTIFICATION OF RESEARCH OBJECTIVE
The objective of this paper is to answer the following research question: "What are the challenges and issues posed by spectrum-based fault localization (SBFL)?" Answering this question is achieved by providing a comprehensive survey via reviewing the publications on the topic, thus helping developers and researchers to better understand SBFL and contribute to its development and research.

B. SEARCH STRATEGY

1) Literature Sources
Five well-known online literature sources indexing publications of software engineering and computer science were used. Table 4 lists these sources as well as links to their websites.

2) Search String
The following search string was used to find the relevant publications from the literature sources:

("spectrum" OR "statistical" OR "coverage") AND ("fault") AND ("localization")

In the defined search string, Boolean operators were employed to link the selected terms with each other [35]: the "OR" operator was used to link synonyms or related terms, and the "AND" operator was used to link the major terms. It is worth mentioning that no publication time span was set during the search.

C. PAPER SELECTION
1) Inclusion and Exclusion Criteria
Several criteria for including and excluding papers (based on titles, abstracts, and full-text readings) were considered to decide whether a publication is relevant to our study, as follows:
Inclusion criteria:
• Publications related directly to the topic of this study. This is ensured by reading the title of each obtained paper; when the title was not enough, the abstract or full text was also read. It is worth mentioning that in full-text reading/filtering, we eliminated those papers that do not discuss issues or could not be used to identify challenges.
• Papers published online between 2002 and 2021.
Exclusion criteria:
• Publications not available in English.
• Duplicated publications.

2) Snowballing
In this paper, the snowballing technique [36] was also used to reduce the risk of missing some relevant papers. The newly found papers are then subjected to the paper selection process recursively. Figure 3 shows the paper selection process and its outcome at each stage. In addition, Table 10 lists all papers (with their references, titles, and publication years) obtained after applying the paper selection process.

V. RESULTS
To answer the identified research question of this study, all the related publications were extensively read and analyzed. As a result, several challenges and issues posed by SBFL have been identified, classified into several categories as shown in Figure 4, and discussed in the following subsections.

A. STATISTICAL ANALYSIS
In SBFL, statistical analysis is used to correlate program elements with failures [37]: similarity formulas from the statistics and data mining domains are used to measure the likelihood of a program element being faulty. The issue here is that software testers and researchers are not statisticians. Worse yet, most of them do not have access to statisticians or cannot afford to send their data to one. As a result, they often select SBFL formulas without statistical justification. Another issue arises when they evaluate their contributions: they apply their technique and a state-of-the-art technique to one or more faulty programs and then use statistics to demonstrate that the proposed technique locates faults "significantly" better than the state of the art. As they are not statisticians and do not have statisticians readily available, this may lead to incorrect statistical analyses and conclusions about the importance of their SBFL results [38].
To address these issues, more studies are required to evaluate whether some SBFL formulas are statistically significantly better than others. Besides, statistical tools are needed to help developers evaluate their results. For example, the authors in [38], [39] presented the first such tool, called "MeansTest". The tool automates some aspects of the statistical analysis of results by checking whether the statistical methods used and the results obtained are both plausible. It examines the data under consideration for several properties, including normality and distribution, and then uses that information to determine which statistical method to use. The tool has been applied to the works presented in the papers at the 6th International Conference on Software Testing, Verification, and Validation (ICST'13).
Six papers were discovered to have potentially misstated the significance of their findings because of the selection of inappropriate statistical techniques.
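As an illustration of a statistically safer comparison that avoids normality assumptions altogether, the following hedged Python sketch uses a paired permutation test on two formulas' Exam scores over the same faulty versions. This is not the MeansTest tool, just one defensible alternative, and all data below is invented for illustration.

```python
import random

# Paired permutation test: under the null hypothesis that two formulas
# perform the same, the sign of each per-version Exam-score difference
# is arbitrary, so we compare the observed mean difference against
# sign-flipped resamplings. No normality assumption is needed.

def permutation_test(a, b, rounds=10000, seed=0):
    """Approximate two-sided p-value for the mean of paired differences."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(rounds):
        # Randomly flip the sign of each paired difference.
        permuted = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(permuted) / len(permuted)) >= observed:
            extreme += 1
    return extreme / rounds

exam_a = [4.0, 12.5, 3.1, 20.0, 7.2, 5.5, 9.8, 15.0]   # formula A Exam scores (%)
exam_b = [6.5, 14.0, 3.0, 25.0, 9.1, 8.0, 11.2, 18.5]  # formula B Exam scores (%)
p_value = permutation_test(exam_a, exam_b)
```

A small p_value would suggest that the mean Exam difference is unlikely under the null hypothesis; with identical samples the test returns 1.0.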

B. COVERAGE TYPES
Since the granularity of fault localization is determined by the granularity of code coverage, the selection of which coverage type to use in SBFL is crucial, as each coverage type influences the performance of SBFL techniques in one way or another [29]. Program coverage elements can be divided into several common types as follows:
• Statement coverage: Different lines of code can be considered for statement coverage, and the issue is which lines are the most suitable choice. In [40], for example, all the lines of code in the target program are considered for statement coverage, while in [41], lines of code that are preprocessor directives, variable declarations, and function declarations are not considered. The number and type of lines of code considered for statement coverage may have a notable impact on the performance of any spectra formula, depending on the location of the buggy line. To illustrate this, consider two versions of the same target program: (a) version A with 1000 lines of code, including all types of lines, and (b) version B with 100 lines of code, excluding variable declarations, function declarations, etc. Suppose that the buggy line of code is at the 500th position in the ranking list of version A and at the 4th position in the ranking list of version B. Using the Exam measure in Equation 1 [42], which measures the percentage of statements that the programmer needs to examine before the actual bug is found, version B gives 4% whereas version A gives 50%. This indicates that the fewer ranked statements a program has, the fewer statements the programmer has to examine to find the buggy statement.

Exam = (E / N) × 100%    (1)

where E is the position of the faulty statement in the ranking list and N is the total number of statements in the ranking list.
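As a minimal sketch, the Exam measure can be computed as follows; the two hypothetical versions from the text are used as input.

```python
# The Exam measure: the percentage of ranked statements a developer must
# examine before reaching the faulty one (lower is better).

def exam_score(fault_position, total_ranked):
    """E / N * 100, where E is the rank position of the faulty statement
    and N is the total number of ranked statements."""
    return fault_position / total_ranked * 100.0

# The two hypothetical versions from the text:
score_a = exam_score(500, 1000)  # version A: every line instrumented
score_b = exam_score(4, 100)     # version B: declarations etc. excluded
```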
Therefore, comprehensive experimental studies have to be conducted to distinguish between different types of lines of code and their impact on fault localization performance. For example, an interesting investigation could be giving an importance score to each line of code in the target program, computed from the influence of that line on the behavior of the target program. Nevertheless, statement coverage is one of the most used coverage types, as it often provides the exact locations of faults [43].
• Branch coverage: Each of the possible branches from each decision point is considered for branch coverage. The issue with this type of coverage is that a fault in the condition of an if-then-else may lead to the execution of the else branch in all failed test cases. Thus, the statements in that branch are ranked higher than the faulty condition, which is also executed by passing test cases [27], [44], [45].
• Block coverage: A number of program statements are grouped together for block coverage [46]. Block size is determined by the compiler and depends on the program size and structure; the standard size of a block is 5-7 statements [47]. Using statement coverage may result in ties of scores between the statements within the same block of a program, while this issue is reduced in block-based spectra coverage.
• Function coverage: Function (or method)-level granularity can also be employed as a program spectra or coverage type. Compared to statement-level granularity, it has several advantages [48], [49]: (a) it provides more global contextual information about the investigated program entity, (b) it is scalable to large programs and executions, (c) some studies report that it is a better granularity for the users [50], (d) it reduces the number of tied program elements, and (e) it is one of the most commonly adopted choices of basic program elements [43]. However, the number of statements in some functions is sometimes huge; thus, it would not be easy to locate a faulty statement in such functions.
• Data-flow coverage: This concerns how variables are defined and then used in a target program, as well as the relationships among them. Data-flow coverage provides more details than the standard coverage types, but it requires more execution time and memory overhead during test case execution [51].
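To make the granularity trade-off concrete, the following hedged sketch collapses statement-level hit vectors into function-level spectra by OR-ing them per test; the statement-to-function mapping is invented for illustration.

```python
# Illustrative sketch: aggregating statement-level spectra to function-level
# granularity. A function counts as executed by a test if any of its
# statements was executed by that test.

def aggregate_to_functions(stmt_coverage, stmt_to_func):
    """Return {function: hit vector} from {statement: hit vector}."""
    func_cov = {}
    for stmt, hits in stmt_coverage.items():
        func = stmt_to_func[stmt]
        acc = func_cov.setdefault(func, [0] * len(hits))
        func_cov[func] = [a | h for a, h in zip(acc, hits)]
    return func_cov

# Made-up statement spectra over four tests and a made-up mapping.
stmt_coverage = {
    1: [1, 1, 0, 0],
    2: [0, 1, 0, 0],
    3: [0, 0, 1, 1],
}
stmt_to_func = {1: "mid", 2: "mid", 3: "main"}
func_coverage = aggregate_to_functions(stmt_coverage, stmt_to_func)
```

Statements 1 and 2 with distinct hit vectors collapse into one function-level row, which is how this granularity trades localization precision for fewer, coarser elements.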

C. ELEMENTS TIE
In SBFL, program elements are ranked in order of their suspiciousness, from the most suspicious to the least. To decide whether an element is faulty or not, developers examine each element starting from the top of the ranking list. To help developers discover the faulty element early in the examination process and with minimal effort, the faulty element should be placed as high as possible in the ranking list. However, ranking only by the suspiciousness scores computed by spectra formulas causes an issue called elements tie [52]: more than one program element in the target program having the same suspiciousness score [30]. Tied elements are usually ranked based on three approaches [53] as follows:
• MIN measure (also known as the worst case): it refers to the bottom-most position of the elements that share the same suspiciousness score.
• MAX measure (also known as the best case): it refers to the topmost position of the elements that share the same suspiciousness score.
• MID measure (also known as the average case): it refers to the average of the positions of the elements that share the same suspiciousness score, calculated as MID = S + (E − 1)/2, where S is the tie starting position and E is the tie size.
It is worth noting that the MID measure considers the average of all possible positions and is therefore more widely used than the other two approaches. Also, the MID measure must be applied to suspiciousness scores sorted in ascending order. It is quite frequent that ties include faulty elements, and this is not limited to any particular SBFL technique or target program. Such elements are tied for the same position in the ranking list, which indicates that the used technique cannot distinguish between the tied elements in terms of their likelihood of being faulty. Thus, no guidance is provided to developers on what to examine first [54], [55].
In addition, the greater the number of ties involving faulty elements, the more difficult it is to predict at what point the faulty element will be found during the examination.
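The three measures can be computed directly from a ranking's suspiciousness scores, as in the following illustrative Python sketch (the scores are made up).

```python
# MIN / MID / MAX tie measures for a target element, derived from a
# {element: suspiciousness score} map (higher score = examined earlier).

def tie_positions(scores, target):
    """Return the target's 1-based examination position under the
    MAX (best), MID (average), and MIN (worst) measures."""
    better = sum(1 for s in scores.values() if s > scores[target])
    tied = sum(1 for s in scores.values() if s == scores[target])
    best = better + 1                  # MAX measure: top of the tie
    worst = better + tied              # MIN measure: bottom of the tie
    mid = better + (tied + 1) / 2      # MID measure: average position
    return best, mid, worst

# Element 3 is tied with elements 2 and 4 at the top of this ranking.
scores = {1: 0.5, 2: 0.8, 3: 0.8, 4: 0.8, 5: 0.3}
best, mid, worst = tie_positions(scores, target=3)
```

For element 3 in this example, the measures give positions 1 (MAX), 2 (MID), and 3 (MIN), spanning the whole three-element tie.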
Ties among program elements can be divided into two types as follows:
• Non-critical ties: Only non-faulty elements are tied together for the same position in the ranking list. If the tied elements have a higher suspiciousness score than the actual faulty element, every one of them will be examined before the faulty element is found. On the other hand, if the tied elements have a lower suspiciousness score than the actual faulty element, the faulty element will be examined before them, and there is no need to continue examining the ranking list. In both cases, the internal order in which the tied elements are examined does not affect the performance of fault localization in terms of the number of elements that must be examined before finding the faulty element.
• Critical ties: A faulty element is tied with other non-faulty elements [54]. In this type, the internal order of examination does affect SBFL performance. It is worth mentioning that critical ties are not rare in fault localization; a significant portion of the elements in the program under consideration might be critically tied. Therefore, there is a need for tie-breaking strategies.
Many approaches can be used to handle the tie problem in the ranking list of program elements. In [56], the authors proposed a tie-breaking strategy that first sorts program statements according to their suspiciousness scores and then breaks ties by sorting them according to a confidence formula. Such a formula is designed to measure the degree of confidence in a given suspiciousness score: when two or more statements have the same suspiciousness score, the score assigned to the statements with higher confidence is more reliable, and thus those statements are more likely to be faulty.
In [52], the authors proposed a grouping-based strategy that employs another influential factor alongside statements' suspiciousness scores. This strategy groups program statements by the number of failed tests that execute each statement and orders the groups so that those containing statements executed by more failed tests come first. Afterward, it ranks the statements within each group by their suspiciousness scores to generate the final ranking list. Thus, the statements are examined firstly based on their group order and secondly based on their suspiciousness scores. Their results show that ranking based on several factors can improve SBFL effectiveness; thus, the grouping-based strategy could be effective in tie-breaking as well.
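A minimal sketch of this grouping idea follows; the per-statement failed-test counts and suspiciousness scores are invented for illustration, not taken from [52].

```python
# Grouping-based ordering: statements are sorted first by the number of
# failed tests that execute them (descending), then by suspiciousness
# score (descending) within each group.

def grouped_ranking(stats):
    """stats: {statement: (failed_test_count, suspiciousness)}.
    Returns statements in examination order."""
    return sorted(stats, key=lambda s: (-stats[s][0], -stats[s][1]))

# Made-up data: s2 has the highest score but is executed by fewer
# failed tests, so the grouping demotes it below s3 and s1.
stats = {
    "s1": (2, 0.50),
    "s2": (1, 0.90),
    "s3": (2, 0.70),
    "s4": (0, 0.99),
}
order = grouped_ranking(stats)
```

The second sort key breaks ties inside a group, so two statements with equal scores but different failed-test counts are no longer interchangeable.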
In [54], the authors proposed several tie-breaking approaches and also suggested using dynamic program slicing as a promising solution to break ties. To illustrate this, consider the test() function in Figure 5. The function has nine statements (1-9) and a fault in statement 3 (it should be b = y). Table 5 presents the function spectra after executing four test cases (T1-T4) and the suspiciousness scores for all the statements after applying Tarantula. It can be noted that eight statements are critically tied (i.e., they share the same suspiciousness score of 0.5). Here, dynamic slicing can be used for tie-breaking based on the following steps:
• Construct the dynamic slice of each failed test case.
• Take the intersection set of statements from the constructed slices. From Table 5, this set comprises statements 3, 6, and 9.
• Give higher priority to the statements in the obtained intersection set.
• Examine the statements with higher priority (i.e., 3, 6, and 9) before the other statements (i.e., 1, 2, 4, 7, and 8).
Since the higher-priority set does include the faulty statement, dynamic slicing has reduced the size of the critical tie from eight to three statements.
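The steps above can be sketched as follows. Dynamic slices are represented here as plain statement sets (computing real slices requires a slicing tool), and the two failed-run slices are illustrative choices whose intersection matches the {3, 6, 9} set from the example.

```python
# Slicing-based tie-breaking: split a critical tie into high- and
# low-priority statements using the intersection of the dynamic slices
# of all failed test cases.

def slice_priority(tied, failed_slices):
    """tied: tied statement ids; failed_slices: one statement set per
    failed test. Returns (high_priority, low_priority) lists."""
    common = set.intersection(*failed_slices)
    high = sorted(s for s in tied if s in common)
    low = sorted(s for s in tied if s not in common)
    return high, low

tied = [1, 2, 3, 4, 6, 7, 8, 9]                # the eight tied statements
failed_slices = [{1, 3, 6, 9}, {3, 6, 7, 9}]   # illustrative slices of failed runs
high, low = slice_priority(tied, failed_slices)
```

Statements in `high` are examined first, shrinking the effective tie from eight candidates to three.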

D. DIVISION BY ZERO
There is always a possibility that the denominator of some spectra formula is zero, which produces error messages. For example, when the Overlap formula is applied to the information presented in Table 1, the error message "Division By Zero" is printed for each program statement. Therefore, we considered the value zero as a score for each statement, as can be noted from Table 3. To overcome this issue, several possible solutions have been proposed in the literature:
• Considering zero as a result: the value zero is assigned to each program entity whose denominator is zero [41], [57], [58].
• Adding a small fixed constant, such as 10^-6, to the denominator [7], [59].
• Adding a larger value, such as the number of tests plus 1, to the denominator. Such a value is larger than any value that can be returned with a non-zero denominator [7], [58].
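To make the three work-arounds concrete, the sketch below applies each to the Overlap formula, ef / min(ef, ep, nf), whose denominator is zero for every statement of Table 1. The function names are ours, and applying the larger constant only when the denominator is zero is an assumption about the third strategy.

```python
# Three zero-denominator work-arounds applied to the Overlap formula.

def overlap_zero(ef, ep, nf):
    """Strategy 1: assign the score 0 when the denominator is zero."""
    denom = min(ef, ep, nf)
    return ef / denom if denom else 0.0

def overlap_epsilon(ef, ep, nf, eps=1e-6):
    """Strategy 2: add a small fixed constant to the denominator."""
    return ef / (min(ef, ep, nf) + eps)

def overlap_large(ef, ep, nf, num_tests=6):
    """Strategy 3 (assumed form): use (tests + 1) as the denominator
    when the real denominator is zero."""
    denom = min(ef, ep, nf)
    return ef / (denom if denom else num_tests + 1)

# Statement 1 of Table 1 (ef=1, ep=5, nf=0): the plain formula would raise
# ZeroDivisionError; each strategy yields a very different substitute score.
s_zero = overlap_zero(1, 5, 0)
s_eps = overlap_epsilon(1, 5, 0)
s_large = overlap_large(1, 5, 0)
```

Note how divergent the substitutes are (0, roughly 10^6, and 1/7 here), which is one reason the choice of work-around can reshape the ranking list.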
However, the aforementioned solutions may introduce undesired issues as well. For example, more program elements will share the same suspiciousness score in the ranking list, forming new ties. Often, scores generated using these solutions are not considered by researchers in the literature; they are simply removed from the ranking list and thus not displayed to the developer. More studies are required here to determine the rate of program elements that share the same score under these solutions, and whether a faulty element could be among them.

E. NEGATIVE SUSPICIOUSNESS SCORES
In SBFL, most of the formulas used to compute suspiciousness scores of program elements produce positive scores. However, a few formulas (e.g., Wong2 and Goodman) produce both positive and negative scores. This may cause a critical issue when a weighting method is applied to the generated scores. For example, the final score of each element in the whole program, or in a group of elements, can be multiplied by a weighting value to determine which element is more important than the others, based on criteria such as which one contributes most to the behavior of the program, which one appears more often in failed test cases, or which one appears less often in passed test cases. Applying such a weighting method to the negative scores produced by these formulas will change the original rank order of the scored elements. In other words, the rank order of the suspicious elements after applying the weighting method will differ from their rank order before applying it.
To illustrate this, consider the scores produced by the Wong2 formula in Table 3. It can be noted that the statements 4, 5, and 10 are assigned the same score (i.e., -1) and the same rank (i.e., 3); but we would like to consider statement 4 the most suspicious element because it has been executed by a failed test case while the other two statements were not. So, we apply a weighting method that multiplies the score of statement 4 by the weighting value 0.9 (more suspicious) and the scores of statements 5 and 10 by the weighting value 0.1 (less suspicious). Applying this weighting method decreases the score of statement 4 and thus puts it in the worst position in the ranking list (i.e., rank 5 instead of 3), while it does the opposite for statements 5 and 10. A possible solution to this issue is to take the absolute value of each score generated by such formulas before applying the weighting method and ranking the scores.
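The rank inversion described above can be demonstrated in a few lines of Python. The scores and weights reproduce the Wong2 example; rank is a hypothetical helper, not part of any SBFL tool.

```python
def rank(scores):
    """Order element ids from most to least suspicious (hypothetical helper)."""
    return sorted(scores, key=scores.get, reverse=True)

# Wong2-style scores from Table 3: statements 4, 5, and 10 all score -1.
scores = {4: -1.0, 5: -1.0, 10: -1.0}
weights = {4: 0.9, 5: 0.1, 10: 0.1}   # statement 4 should matter more

# Naive weighting: -1 * 0.9 = -0.9, so statement 4 sinks to the bottom.
naive = {s: scores[s] * weights[s] for s in scores}
print(rank(naive))   # → [5, 10, 4]

# Taking absolute values first restores the intended ordering.
fixed = {s: abs(scores[s]) * weights[s] for s in scores}
print(rank(fixed))   # → [4, 5, 10]
```

The naive variant demotes exactly the element the weighting was meant to promote, which is the rank-order distortion the text warns about.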

F. SOURCE OF BUGS
In the software development process, it is common to break the code of a program into several source code files, for example, putting the functions in one file and the classes using these functions into another. This practice is useful for structuring source code files and for reusing existing code. However, it also has its drawbacks in the context of software fault localization. In Figure 6, for example, File B includes two functions, LessThanFunction() and GreaterThanFunction(), with a bug in statement 6 of the first function (it should be m = x instead of m = y). These two functions have been imported into File A; thus, File B propagated its bug to File A. As a result, File A will also have a bug, in statement 4. When File A is tested at the statement granularity/coverage level, the results will show that it has a bug in statement 4. The developer will then examine statement 4, only to find that it calls a function from File B. The issue here is that s/he will not be able to know which statement in the called function caused the bug in order to fix it.
In the literature, there is a lack of experimental studies that try to distinguish between propagated/imported bugs and non-propagated bugs. Therefore, it would be very useful to study this issue along many directions, such as deciding whether a bug is imported or not, specifying where it is imported from, locating it in its original place, and measuring its impact on the whole fault localization process and its performance.

G. SINGLE AND MULTIPLE BUGS
A bug is a program element that shows unexpected behavior when executed by a test case. In general, program failures are caused either by a single bug or by multiple bugs [31], [32]. A single-bug problem is one where all the failures of test cases are caused by just one bug. In other words, whenever a test case fails, the same buggy element must have been executed in that test case. On the other hand, a multi-bug problem is one where the failures of test cases are caused by more than one bug. Sometimes, a bug could be in a preprocessor directive or an initialization element that is used in multiple places in the target program; this makes the target program appear to have multiple bugs. Another issue here is that, since the bug is in a statement (e.g., an initialization statement) executed by all passed and failed test cases, that statement is mostly not ranked high, making it difficult to identify [44].
To address this issue, further studies are required to determine whether a program really has multiple different bugs or a single buggy element that is used in multiple places. In the latter case, it would be useful to specify the location of the first appearance of the bug and consider it in the fault localization process while ignoring the other places where it has been used. It is worth mentioning that many SBFL techniques are designed for programs with a single bug only. Therefore, it would be interesting to study the impact of multiple bugs on the performance of SBFL. A good starting point is what has been performed in [60], where an empirical investigation on multiple-fault versions of different open-source programs has been conducted in order to study the negative impact of multiple faults on SBFL and to explore the fundamental causes of this negative impact. It has also been found that some SBFL formulas are more robust to multiple faults and showed the best performance among all others. In general, pure SBFL is not always sufficient for effective fault localization in multi-fault programs [61], [62]. Other ways to address the issue of multiple bugs in a program are to design novel suspiciousness formulas, as in [63], or to divide the failed test cases into different clusters, where the test cases in a cluster fail due to the same bug. In other words, each test cluster represents a different bug. Then, the failed test cases in each cluster, combined with all passed test cases, are used to localize a single fault, as in [64]-[66].
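The clustering idea can be sketched as follows. This is a simplified greedy scheme based on coverage similarity; the Jaccard threshold and the test data are illustrative assumptions of ours, not the method of [64]-[66].

```python
def jaccard(a, b):
    """Similarity between two coverage sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def cluster_failed_tests(failed_coverage, threshold=0.5):
    """Greedily group failed tests whose coverage is similar; each cluster
    is assumed to correspond to one bug (a simplified sketch)."""
    clusters = []  # list of (representative coverage, [test ids])
    for test_id, cov in failed_coverage.items():
        for rep, members in clusters:
            if jaccard(cov, rep) >= threshold:
                members.append(test_id)
                break
        else:
            clusters.append((set(cov), [test_id]))
    return [members for _, members in clusters]

# Two hypothetical bugs: T1/T2 cover one code region, T3 another.
failed = {"T1": {1, 2, 3}, "T2": {1, 2, 4}, "T3": {8, 9}}
print(cluster_failed_tests(failed))  # → [['T1', 'T2'], ['T3']]
```

Each cluster, together with all passed test cases, would then be fed to a standard single-bug SBFL run.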

1) The ranked list of elements is huge
Mostly, a large number of program elements are included in the ranking list generated by SBFL techniques [44], [67]. This is not preferable for the following main reasons:
• The more ranked elements, the more ties are produced, as many program elements exhibit the same execution patterns.
• It may increase the number of elements having a suspiciousness score of 0 due to the issue of division by zero.
• A huge number of elements that are unrelated as suspects of a bug get considered in the ranking list.
Possible ways to address this issue are either combining the ranking with other suspiciousness factors derived from the testing and program element contexts, such as using program slicing as mentioned in a previous section, or reducing the length of the target programs by applying code optimization and transformation techniques. To illustrate this, consider a Java function called match() which takes two inputs s and w and returns whether the sentence s contains the word w or not. The function is written in two ways: an unoptimized version with a bug in statement 9 and an optimized version with the same bug in statement 8, as shown in Figures 7 and 8. The unoptimized code of the function has 15 statements, which will all be included in the ranking list, while the optimized code of the same function has only 10 statements to be included. Table 6 presents the spectra and test case information of all the statements alongside their suspiciousness scores before optimizing the code, and Table 7 presents the same information after applying code optimization. It can be noted that code optimization reduces the number of ranked statements. Besides, we can see that it eliminates some ties completely and reduces others, and no statement has been scored with the value 0.
However, it would be interesting to study code optimization and its impact on performance along many directions. Many code optimization techniques reduce the length of programs without changing their outputs. Thus, the effects of these techniques have to be investigated experimentally and their feasibility reported with evidence. A possible solution to the issue that the suspicious elements are not logically related as code is to group them into different logically related categories, to at least understand why these elements were considered suspicious. Software module clustering could be employed as a potential solution in this respect. More studies are required to evaluate the usage of other potential factors and to measure their impact on SBFL performance. Here, we list some factors that we believe will have a positive impact on the ranking effectiveness, as follows:
• The sequence, number, and coverage of executed failed test cases.
• The importance of each failed test case in the used test suite.
• The importance of each element in the target program.
For example, the statements that directly have an impact on the program's output should be given more importance than others.
• Using various software metrics (e.g., the complexity of functions, relationships, element types, etc.) to group the elements sharing similar metrics into different categories and then relate them to the faulty element.
• All the elements near the faulty element may be given more importance than others when being ranked.
• Using the union of the dynamic slices of failed test cases to reduce the number of elements included in the ranking list.

2) The ranked list of elements is practically arbitrary
In SBFL, the ranked list of program elements is formed as follows: one statement may come from function a(), the next one from function b(), and so forth. As a result, developers do not follow the ranking list suggested by SBFL linearly [67], because they have trouble understanding the context of the bug, since they are only given each bug location in isolation [68]. Instead, they examine the statements ranked high in the list and then look for the location of the actual fault in the surrounding function, class, or file. This suggests that pointing developers towards good starting points with SBFL is more important than only improving the ranking of program elements in the ranking list.
In [68], the authors proposed a technique that reports the most suspicious program regions instead of a single program element which is likely to be faulty. In other words, each faulty element is reported together with its context. This is useful because the contexts can assist developers in identifying and comprehending the infection flow of each faulty program element. This is performed by extracting the execution traces of each program element in different failed and passing runs. Then, a final execution sequence for each element is formed as a graph that represents the faulty element and its context. Figure 9 shows what such suspicious program regions look like.

FIGURE 9: Suspicious program regions
Recently, the authors in [69] proposed a hierarchical ranked list of elements to solve this issue as well. This is achieved by putting all the statements of each function under the corresponding function's name and then putting all the functions of each class under the corresponding class's name. Thus, each function will have its own set of statements, and each class its own set of functions, including the statements. Afterward, the classes are sorted based on their suspiciousness scores, then the functions, and finally the statements. For example, the statement at line 37 will not be examined before the statement at line 11 because the latter belongs to a function of higher rank in the ranking list. This hierarchical grouping of program elements gives the user additional useful information about the suspiciousness scores on all layers; users can exclude whole methods or even classes from the list. Figure 10 shows what the hierarchical ranked list of program elements looks like.

SBFL techniques require suitable tools to automatically collect spectra data and testing information from the target programs [29]. However, the currently available tools [17], [70]-[77] suffer from some limitations, as follows [78]:
• Mostly, they only collect abstract and trivial testing information, such as whether a program element is executed by a specific test case or not.
• Some of them collect more and different types of information (e.g., control flow and data flow), which may be time-consuming, may not scale well to large target programs, and cannot be used in practice.
• Most of them are developed for programs written in the Java or C/C++ programming languages. This is because these languages have been used widely in the past decades compared to other languages. Another possible reason is that the choice of programming languages reflects the target industries of each company.
For example, companies providing tools for embedded and real-time software vendors focus more on supporting C/C++ [26]. Tools helping Python developers in their debugging process have not previously been proposed by researchers. Therefore, tools that target programs written in Python, which is considered one of the most popular programming languages, are strongly needed.
• They suffer from inaccuracies in their results.
The inaccuracy of a tool's recorded coverage data can lead to various problems. For example, a code element that is falsely reported as covered by one tool and as not covered by another may introduce false trust in the results. Therefore, further studies are needed to guide how to avoid such tool inaccuracies. This can then help testers determine the degree of risk that measurement inaccuracies pose to fault localization performance [79].
• Proposing and developing tools or plug-ins for specific IDEs is a practical limitation of usage, as not all developers use the same IDE and many developers use more than one. It is assumed that developers do the debugging during/within the development phase itself, but this is not always true and it is not always the preferred practice. Therefore, developing standalone software tools that do not depend on a specific IDE is a good option in this respect. Perhaps the best option is to have a generic tool that can be invoked from the command line or through an API, and then to develop different plug-ins for various IDEs that call this generic tool.
In order to make SBFL tools more useful and practical, they should be developed with some important features, as follows [26]:
• A user-friendly graphical interface is a crucial feature nowadays, as such interfaces act as the gates to using software systems interactively and efficiently [80]. Thus, a proposed fault localization tool should run in a GUI mode besides a command-line mode to meet the requirements of different users. For example, developers usually prefer the GUI mode, while integrators usually prefer the command-line mode.
• The results generated by a tool should be storable in various file formats according to the user's needs (e.g., XML, CSV, XLS, or JSON). As a result, the results will be useful for further processing or even for other testing tools.
• A tool should give the user control to change the settings and configuration of its functionality, such as where to store the results, which tasks should be automated, which results should be displayed first, etc.

J. BUGS DUE TO MISSING CODE
Generally, software bugs appear due to wrongly written code (e.g., using a wrong variable instead of another one or using a wrong arithmetic operator instead of another one) or due to missing code (e.g., a missing element that performs a specifically required operation or a missing conditional element) [12]. In some open source projects, it has been found that missing code faults form the majority of the total faults [81]. Locating a bug introduced by missing code is a challenging task in SBFL. This is due to the fact that the code responsible for the bug is not in the program, while SBFL is designed to locate a faulty element whose execution triggers the failure [82]. However, missing code will have an impact on some other elements in the target program. For example, some elements exhibit undesired behavior, get executed before other program elements, or get executed where they should not be. This issue could be addressed by analyzing the undesired behavior or the unexpected execution of the elements impacted by the missing code. Such elements could be identified by their high suspiciousness scores. Thus, the high scores of some elements may indicate that some elements in their neighborhood (i.e., preceding or succeeding elements) are missing [53], [83]. However, more work is needed to propose techniques that address bugs caused by missing code.
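As a rough illustration of the neighborhood idea, one could flag the lines surrounding each highly suspicious element as candidate sites of omitted code. This is a heuristic sketch of ours, not an algorithm from [53] or [83], and the threshold is an arbitrary assumption.

```python
def missing_code_candidates(scores, threshold=0.7):
    """For every line whose suspiciousness is at or above `threshold`,
    report its neighborhood (preceding line, succeeding line) as a
    candidate location of missing code. Purely heuristic sketch."""
    return {line: (line - 1, line + 1)
            for line, s in scores.items() if s >= threshold}

# Hypothetical per-line SBFL scores.
scores = {3: 0.2, 4: 0.9, 5: 0.3}
print(missing_code_candidates(scores))  # → {4: (3, 5)}
```

A developer would then inspect the gaps before and after line 4 for an omitted statement, rather than line 4 itself.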

K. SIMULATION OF SBFL
Implementing and using SBFL requires target programs, test cases, and different types of coverage data. Providing these requirements is challenging for many reasons, as follows:
• Executing test cases on the collected target programs requires that all the programs be provided with proper execution environments. Some programs depend on external libraries to execute properly; many others require certain configuration settings to be set.
• There is a lack of tools that extract various spectra data from the target programs.
Therefore, advanced SBFL simulation tools would be very useful to propose and implement to support researchers in this respect [84]. They should be able to simulate various program structures and their behaviors, relationships among elements, different coverage types and test cases, and different numbers and types of faults, and to calculate suspiciousness scores using various ranking formulas. Such tools can be used to validate new ideas or concepts before starting actual, concrete experiments and development.

L. TEST FLAKINESS
SBFL depends on the results of executing several test cases. Sometimes, a test case may exhibit an issue called "test flakiness", which refers to a test case with a non-deterministic result. In other words, it sometimes passes and sometimes fails on the same code, depending on unknown circumstances [85]. This issue negatively impacts the effectiveness of SBFL techniques as it provides misleading signals during the fault localization process [86]. It has been found that the flakiness of individual test cases influences fault localization scores and ranks, and that some SBFL spectra formulas (e.g., Tarantula) are more sensitive to this issue than others (e.g., Ochiai and DStar).
The dominant approach to addressing this issue is to detect and then remove all the identified flaky test cases from the test execution. However, it has been found that the number of flaky test cases is sometimes so high that removing them is not a practical solution [87]. Therefore, proposing new approaches that perform well even in the presence of flaky test cases is preferable. Flaky test cases can be detected in the following ways:
• Re-run a test case several times after it has failed. If some re-runs pass, the test case is considered flaky. One issue here is how many times a failed test case has to be re-run. Different studies used different numbers: in [88], each test case was re-run 10 times; in [89], 30 times; in [85], 100 times; in [90], 4000 times; and in [91], 10000 times, and even with this huge number of re-runs, the authors interestingly found that some of the previously identified flaky tests were still not detected. The re-run approach suffers from several issues [92], such as: (a) flaky test cases are non-deterministic, so there is no guarantee that re-running a flaky test case will change its outcome; (b) there is no guidance on how many times a failed test case has to be re-run to maximize the likelihood of detecting flakiness; (c) a delay may need to be injected between re-runs to allow the cause of failure (e.g., a network outage) to reoccur; (d) the performance overhead of re-runs scales with the number of failed tests.
• Monitor only the coverage of the most recent code changes rather than the entire target program, and mark as flaky any newly failed test case that did not execute any of the changes, without re-running and with minimal runtime overhead. In other words, a test case is considered flaky if it passes in the previous version of the code but fails in the current version without executing any of the changed code [92].
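The re-run approach can be sketched in a few lines. This is a minimal illustration: run_test is an assumed callable returning True on pass, and the deterministic stand-in below merely simulates a non-deterministic outcome.

```python
def is_flaky(run_test, reruns=10):
    """Re-run a failed test up to `reruns` times; if any re-run passes,
    mark the test flaky (a sketch of the re-run detection approach)."""
    return any(run_test() for _ in range(reruns))

# A deterministic stand-in for a flaky test: fails twice, then passes.
outcomes = iter([False, False, True])
print(is_flaky(lambda: next(outcomes), reruns=3))  # → True
```

Note that `any` short-circuits on the first passing re-run, which keeps the overhead proportional to how quickly the flakiness manifests, but a genuinely deterministic failure still costs all the re-runs.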

M. SEEDED AND REAL BUGS
Artificial faults (also called seeded faults) are created when a researcher intentionally seeds a fault into a program's source code to break its functionality. This is performed with the expectation that the SBFL techniques under study will be able to identify the location of the seeded fault in the modified source code. Seeded faults are often used to replicate real fault behavior, especially when real faults cannot be reproduced for technical or other reasons, or when they are not available for programs written in certain programming languages. They can also be used to address the issue of unbalanced test suites in real-fault datasets such as Defects4J [93] for Java programs, BugsJS [94] for JavaScript programs, and BugsInPy [95] for Python programs, where passed test cases are much more common than failed test cases. It is worth mentioning that seeded faults are widely used in fault localization studies, with about 70.91% of the selected studies utilizing them. However, the issues with these faults are as follows:
• They may be picked arbitrarily.
• There is a potential for bias in the selection of the faults.
• They may not be representative of real industry faults.
To overcome these issues, it is recommended to use real faults, such as those presented in the Defects4J and BugsInPy datasets, or to seed faults in well-known and complex software systems and provide all the created faulty versions publicly online, which legitimizes the experimental results by reducing bias and enhancing result generalization [31].
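Fault seeding can be as simple as a mutation-style operator replacement, sketched below. This is a minimal illustration of ours; real studies use systematic mutation tools rather than plain string replacement.

```python
def seed_operator_fault(source, original="+", mutant="-"):
    """Seed an artificial fault by replacing the first occurrence of an
    arithmetic operator in the source text (a minimal mutation sketch)."""
    return source.replace(original, mutant, 1)

correct = "def add(a, b):\n    return a + b\n"
faulty = seed_operator_fault(correct)
print(faulty)  # the seeded version computes a - b instead of a + b
```

Any test case exercising add() with b != 0 now fails, giving SBFL a known ground-truth fault location to evaluate against.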

N. SPECTRA FORMULAS SELECTION
Many SBFL formulas have been proposed in the literature. However, there is still a lack of guidance on how to select the right formula for a specific purpose. In [96], SBFL formulas were divided into three groups based on how the formulas of each group are affected by the number of failed test cases. It has been found that some formulas (e.g., Ochiai and Tarantula) are more sensitive to the number of failed test cases than others. In [97], several formulas generated by genetic algorithms have been evaluated, and it has been found that the GP13 formula is one of the best-performing formulas of its kind. In [85], it has been found that some SBFL spectra formulas (e.g., Tarantula) are more sensitive to the issue of test flakiness than others.
None of the previous studies evaluated even half of the SBFL formulas proposed in the literature. Besides, many other aspects are not yet evaluated; for example, which formula is more sensitive to the tie issue, or which formula performs better with a specific type of fault. It is worth mentioning that multiple spectra formulas can be combined into a single new formula. The resulting formula is called a hybrid formula; it combines the advantages of the existing formulas used in the combination. As a result, a hybrid formula can outperform the existing formulas, as in [20].
To produce an effective hybrid formula, more experimental studies are required to understand the behavior and characteristics of each existing formula, as each has both strengths and weaknesses. Thus, providing a detailed guideline, with experimental evidence, to help researchers select the right formulas for the combination would help a lot in this respect. Also, the computed suspiciousness differs for every formula according to its peculiarity for the same target program. Thus, it would be interesting to investigate the relationship between the used formula and the target program. This may lead to the introduction of some improvements in the combination process. All the aforementioned issues are possible avenues worthy of further exploration.
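A hybrid formula can be as simple as a weighted combination of existing formulas, as sketched below. The weight w and the averaging scheme are our illustrative assumptions, not the actual combination method of [20]; the Tarantula and Ochiai definitions follow their standard forms over the spectra counters ef, ep, nf, np.

```python
import math

def tarantula(ef, ep, nf, np):
    """Standard Tarantula score, guarded against zero denominators."""
    f = ef / (ef + nf) if ef + nf else 0.0
    p = ep / (ep + np) if ep + np else 0.0
    return f / (f + p) if f + p else 0.0

def ochiai(ef, ep, nf, np):
    """Standard Ochiai score, guarded against a zero denominator."""
    d = math.sqrt((ef + nf) * (ef + ep))
    return ef / d if d else 0.0

def hybrid(ef, ep, nf, np, w=0.5):
    """Weighted average of two formulas; `w` is an assumed tuning knob."""
    return w * tarantula(ef, ep, nf, np) + (1 - w) * ochiai(ef, ep, nf, np)

# Element covered by 2 failed and 1 passed test (1 passed test misses it).
print(round(hybrid(ef=2, ep=1, nf=0, np=1), 3))  # → 0.742
```

Sweeping w over a benchmark would be one concrete way to study how the constituents' strengths and weaknesses trade off.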

O. NO INTERACTIVITY
Often, SBFL techniques compute the suspiciousness scores of program elements without involving the user. In other words, only the statistical analysis of program spectra is used for this purpose. Thus, the user's prior knowledge about the program under test is not utilized to improve the fault localization performance [98]. This issue can be addressed by involving user interactivity: considering the user's feedback on the suspicious elements and their ranks can help to re-rank them, thus improving the fault localization process. Figure 11, adopted from [99], shows the difference between static SBFL (i.e., without user interactivity) and interactive SBFL (i.e., with user interactivity).

FIGURE 11: Static vs. interactive SBFL
In [100], the authors proposed and implemented an approach called Interactive Fault Localization (iFL) to support user interactivity in the SBFL process. Their approach allows the user to interact with the output of the SBFL process based on his/her understanding of the system elements and their contexts by considering the following three feedback actions: (a) the user decides that a proposed suspicious element is really faulty; thus, the SBFL process stops, as the faulty element is found. (b) the user decides that a proposed suspicious element and its context are not faulty; thus, it can be given low importance and moved lower in the ranking list. (c) the user decides that a proposed suspicious element is not faulty but its context is suspicious; thus, it can be given high importance and moved higher in the ranking list.
In [72], [99], the authors also proposed an interactive fault localization approach that leverages simple user feedback. The user can interact with their approach by labeling a suggested suspicious element as faulty or not. Following that, the proposed approach utilizes such simple user feedback and re-orders the rest of the suspicious program elements based on that, intending to put truly faulty elements higher in the ranking list.
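Such a feedback loop can be sketched as follows. This is a simplified illustration of re-ranking on a "not faulty" label; grouping by the enclosing function is our assumption, not the exact strategy of [72], [99].

```python
def apply_feedback(ranking, element, faulty):
    """Re-order the remaining suspicious elements after the user labels
    `element` as faulty or not (a simplified sketch of such feedback loops).
    Elements are (function, line) pairs ordered by suspiciousness."""
    if faulty:
        return [element]                      # fault found: stop here
    # Not faulty: drop the element and demote elements sharing its
    # context (here, hypothetically, the same enclosing function).
    rest = [e for e in ranking if e != element]
    same_fn = [e for e in rest if e[0] == element[0]]
    other = [e for e in rest if e[0] != element[0]]
    return other + same_fn

ranking = [("f", 3), ("f", 5), ("g", 2)]
print(apply_feedback(ranking, ("f", 3), faulty=False))  # → [('g', 2), ('f', 5)]
```

After the user rejects ("f", 3), the other statement of f() is demoted below g()'s statement, mimicking how truly faulty elements are meant to rise as feedback accumulates.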
In [101], the authors proposed an approach called "Enlighten" which is similar to the previous approach except that it uses dynamic program slicing to form a Dynamic Dependence Graph (DDG) for every failed test in the test suite. In the DDG, nodes represent occurrences of statements in the program, whereas edges represent dynamic (data or control) dependencies between these statements. This information will then be used to create queries for the user to interact with. Each query consists of a method invocation, together with its input and output values, which the user can mark as correct or not. This approach is also iterative and in each iteration, it updates the debugging data and the ranking list based on the user feedback until the fault is found.
In [102], the authors proposed an interactive approach to use the user feedback about the correctness of a set of statements to estimate the number of coincidentally correct test cases (those that execute faulty statements but do not cause failures).
Despite the attempts to propose and improve interactive fault localization approaches, many issues are still not addressed comprehensively in the literature. For example, more studies are required to investigate the effectiveness of the different proposed approaches and to compare them. Performing user studies to evaluate the usability of the tools implemented in this context is also required. It would also be interesting to investigate cases where developers or users make wrong estimations and give incorrect feedback, whether due to mistakes or to not being familiar with the faulty program because they are not its actual developers. This could be addressed by proposing methods that allow users to roll back their feedback if they made mistakes. Enabling users to provide multiple pieces of feedback at the same time, rather than one by one following the recommended list, is also recommended, especially in scenarios where multiple bugs exist.

P. TOP-N RANKING
Several studies [103], [104] report that developers consider inspecting the first 5 program elements in the ranking list produced by an SBFL technique acceptable, and the first 10 elements the upper bound for inspection before ignoring the ranking list. Therefore, the performance of SBFL can also be evaluated by focusing on these rank positions, collectively called Top-N, as follows:
• Top-1: the rank of the faulty element is first in the ranking list.
• Top-3: the rank of the faulty element is less than or equal to three.
• Top-5: the rank of the faulty element is less than or equal to five.
• Top-10: the rank of the faulty element is less than or equal to ten.
• Other: the rank of the faulty element is more than ten.
Also, there is a special non-accumulating variant of the Top-N categories, in which the cases where the bug fell into the non-overlapping intervals [1], (1,3], (3,5], (5,10], or (10,...] are counted. These categories show in how many cases an SBFL approach moves a bug into a better (for example, from (5,10] to (1,3]) or a worse (for example, from [1] to (1,3]) category. In other words, in how many cases the bugs get into a higher-ranked category (this kind of improvement is also known as enabling improvement [43]) and in how many cases they downgrade. Thus, an SBFL approach that achieves improvements in all categories by moving many bugs to higher-ranked categories has better performance.
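The non-accumulating variant is a direct transcription of the intervals above into a lookup:

```python
def top_n_category(rank):
    """Map a faulty element's rank to the non-overlapping intervals
    [1], (1,3], (3,5], (5,10], (10,...] described in the text."""
    if rank <= 1:
        return "[1]"
    if rank <= 3:
        return "(1,3]"
    if rank <= 5:
        return "(3,5]"
    if rank <= 10:
        return "(5,10]"
    return "(10,...]"

print([top_n_category(r) for r in (1, 2, 7, 42)])
# → ['[1]', '(1,3]', '(5,10]', '(10,...]']
```

Counting how many bugs change category between two SBFL variants then directly measures enabling improvements and downgrades.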
However, due to the nature of SBFL, the faulty element cannot always be ranked in the higher Top-N categories. This issue is the biggest obstacle to the usefulness of SBFL in practice [22]. It is worth mentioning that, compared to the other issues, many SBFL studies in the literature have specifically addressed this crucial one. Therefore, we list them in Table 8 with a brief description of each proposed solution.

Q. RESULTS VISUALIZATION
While testing a program, software developers gather a large amount of testing data. These data can be used for the following two main purposes [17]:
• To identify failures and to help developers locate the faults causing them.
• To identify program elements that were not executed by the used test suite; as a result, more test cases can be added to cover these elements.
SBFL uses such data to compute the suspiciousness of the program elements under test and often displays the results in a table with many fields, as in Table 9. This form of output helps users know which program elements are suspicious, their locations in the source files, their suspiciousness scores, and their ranks.
However, there are two main issues with this approach of displaying the results of SBFL:
• The huge amount of displayed results is unattractive and difficult to interpret when large-scale programs and test suites are used.
• It causes developers to focus their attention locally rather than providing a global view of the target program.
Therefore, there is a need for different approaches that provide users with a global view of the program.

TABLE 8. Proposed solutions for improving the ranking of faulty elements in SBFL.
• Improving the fault's absolute ranking: Some non-faulty elements ranked higher than the fault are excluded from the ranking list of a target program based on the failed test cases. [22], [105]
• Categorizing program elements: The ranking list of SBFL can be improved if program elements are categorized into a "suspicious group" and an "unsuspicious group". Under such a categorization, risks need to be calculated only for the suspicious statements, while the unsuspicious statements are simply assigned the lowest risk value. [106]
• Using program slicing: Program elements that have no dependence on the faulty elements are deleted to improve the precision of locating faults. [107]-[113]
• Introducing new ranking formulas: New risk evaluation formulas are proposed that outperform the existing ones. [23], [114]-[124]
• Combining existing ranking formulas: Multiple formulas are combined into a single hybrid formula that has the advantages of the formulas used in the combination. [20], [125], [126]
• Optimizing test cases: Optimization methods can maximize SBFL performance using a minimal (e.g., by removing redundant test cases) or balanced number of test cases used by the SBFL formulas. [127]-[135]
• Weighting and prioritizing test cases: The performance of SBFL can be improved by differentiating the importance of test cases; not all test cases are equally important. [136]-[140]
• Mitigating the impact of coincidental correctness: Coincidentally correct test cases execute faulty program elements but do not cause failures, which reduces the effectiveness of SBFL; removing or down-weighting such test cases can therefore improve it. [141]
• Increasing failed test cases: Some SBFL formulas may become less accurate when there are very few failed test cases; cloning the whole set of failed test cases, or adding more to enlarge it, can improve their performance. [142]-[145]

To address the aforementioned visualization issues, two main visualization approaches for the results of SBFL have been proposed in the literature:
• The discrete coloring scheme. In this simple scheme, a program element executed only by failed test cases is colored red; an element executed only by passed test cases is colored green; and an element executed by both passed and failed test cases is colored yellow. The red, green, and yellow colors were selected because they are the most natural and the best for viewing [17]. Elements are colored yellow (i.e., neither clearly suspicious nor clearly safe) when they are executed by nearly equal percentages of passed and failed test cases. The problem with this approach is that it is not very informative: the majority of program elements end up yellow, so the developer is given few helpful hints about the location of faults. The visualization based on this scheme can be displayed to the user in many forms, as shown in Figure 12: (a) coloring program elements in the source code itself [17], [73], [76], [77]; (b) visualizing the results as a Sunburst [74], [146], [147]; (c) visualizing the results as a Treemap [74], [146]; (d) visualizing the results as a Bubble Hierarchy [146].
However, more studies are required to propose new approaches or to improve the usability and effectiveness of the existing ones along several directions. For example, providing a zoomable user interface that lets the user view the results at various abstraction levels is essential, especially for large-scale software systems. Providing users with interactive visualization filtering options is another interesting area to investigate.
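The discrete coloring scheme described above can be sketched as a simple mapping from an element's hit counts to the three colors. The function name and the extra "uncolored" case for unexecuted elements are our own assumptions here, not part of any particular tool.

```python
def discrete_color(ef, ep):
    """Map an element's hit spectrum to the discrete coloring scheme:
    red    -> executed only by failed test cases,
    green  -> executed only by passed test cases,
    yellow -> executed by both (neither clearly suspicious nor safe).
    ef/ep: number of failed/passed tests executing the element."""
    if ef > 0 and ep == 0:
        return 'red'
    if ep > 0 and ef == 0:
        return 'green'
    if ef > 0 and ep > 0:
        return 'yellow'
    return 'uncolored'  # element not executed by any test

print(discrete_color(3, 0))  # red
print(discrete_color(0, 5))  # green
print(discrete_color(2, 4))  # yellow
```

Because most elements of a realistic program are executed by both passed and failed tests, the third branch dominates in practice, which is exactly why the scheme is criticized as uninformative.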

R. NO CONTEXTUAL INFORMATION
In SBFL, ranking is performed based solely on the suspiciousness score of each element: an element with a high score is positioned near the top of the ranking list, and vice versa. Thus, SBFL cannot distinguish between program elements that exhibit the same execution patterns. The reason behind this issue is that SBFL techniques leverage only hit spectra (i.e., whether an element is executed or not) as the abstraction for program executions, without considering any other useful contextual information [148]. In other words, they represent a program's behavior in a highly abstract form [44].
Recently, the authors in [149] addressed this issue by using method call frequency. The frequency with which the investigated methods occur in call stack instances during the execution of failed test cases is used to modify the standard SBFL formulas. The basic idea is that if a method is called multiple times in a failed test case, it is more likely to be faulty than others. Thus, the e_f term of each formula was replaced by a frequency-based e_f. Their experimental results showed that adding this new information to the existing formulas can improve the effectiveness of SBFL. However, this approach can only be applied to formulas that have e_f in the numerator. It is also considered heavyweight, as it requires tracing the execution of each method call, as caller or callee, in the failed test cases.
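To illustrate the idea of replacing the plain e_f count with a call frequency, here is a small sketch. The Ochiai formula is used only as an example carrier, and the counts are invented; the exact modified formulas and tracing mechanism of [149] may differ.

```python
import math

def ochiai(ef, ep, nf):
    """Ochiai suspiciousness: ef / sqrt(nf * (ef + ep))."""
    denom = math.sqrt(nf * (ef + ep))
    return ef / denom if denom > 0 else 0.0

def frequency_ef(call_counts_in_failed):
    """Frequency-based variant of e_f: instead of counting each failed
    test once, sum how many times the method was called during each
    failed execution (one entry per failed test that executes it)."""
    return sum(call_counts_in_failed)

# Hypothetical method m: executed in two failed tests (called 4 and
# 2 times respectively) and in one passed test; nf = 2 failed tests.
plain = ochiai(2, 1, 2)                        # standard e_f = 2
weighted = ochiai(frequency_ef([4, 2]), 1, 2)  # frequency e_f = 6
print(round(plain, 2), round(weighted, 2))
```

A method called repeatedly in failed runs thus receives a higher score than under the plain counting, which is the intuition the approach exploits; note that this only changes the result for formulas where e_f appears in the numerator.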
In [24], [150], the authors also utilized the relations among software methods. Particularly, they investigated the fault influence propagation implied in method calls. The basic idea is that a caller method often calls several callee methods under complex logical controls, making the complexity of the caller method usually higher than that of the callee methods. According to this complexity degree, fault influence may often propagate from the callee method to the caller method. Also, the callee's influence is statistically the most crucial factor, and this influence can be utilized to improve the suspiciousness estimation. From the caller's perspective, the caller's suspiciousness evaluation often contains multiple callees' behaviors and influences, and propagating redundant fault influence reduces the accuracy of the suspiciousness computation. Therefore, the authors extended the basic intuition of SBFL (i.e., a program element executed in more failed test cases is more likely to be faulty) with the hypothesis that a method linked with more, and more highly suspicious, methods is more likely to be the root cause. Based on these intuitions, a heuristic approach called Fault Centrality was proposed in these works to capture the local faulty-suspiciousness influence of the callee method on the caller for boosting SBFL.
Method call sequence mining with a sliding-window method has been used in [151] to boost the performance of SBFL. The authors achieved this by splitting each method call sequence into different sub-sequences. Then, they computed the hit spectra for each sub-sequence. After that, they took the maximum suspiciousness score over the sub-sequences that contain the target method as its final score. In [152]-[155], method call sequences have also been employed to highlight the methods that are more often related to other methods in the failed executions of test cases. However, more such studies are needed, and other kinds of contextual information could be exploited to improve the effectiveness of SBFL.
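The sliding-window idea can be sketched as follows. The window size, the toy scoring function, and the method names are all hypothetical; in [151] the per-window score would come from the hit spectra of the sub-sequences themselves rather than from a hard-coded rule.

```python
def window_suspiciousness(call_sequences, score_fn, window=3):
    """Sketch of the sliding-window approach: split each method call
    sequence into fixed-size sub-sequences, score each sub-sequence,
    and give each method the maximum score among the sub-sequences
    that contain it."""
    best = {}
    for seq in call_sequences:
        # One window per starting position; short sequences yield
        # a single window covering the whole sequence.
        for i in range(max(1, len(seq) - window + 1)):
            sub = seq[i:i + window]
            score = score_fn(sub)  # suspiciousness of this window
            for method in sub:
                best[method] = max(best.get(method, 0.0), score)
    return best

# Toy scorer: pretend windows containing 'parse' come from failed runs.
toy = lambda sub: 0.9 if 'parse' in sub else 0.1
seqs = [['main', 'read', 'parse', 'emit'], ['main', 'read', 'emit']]
result = window_suspiciousness(seqs, toy, window=2)
print(result)
```

Note how 'read' and 'emit' inherit the high score of the windows they share with 'parse', while 'main' keeps a low score: the window acts as a cheap proxy for call context that plain hit spectra cannot express.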

VI. THREATS TO VALIDITY
There are different threats that might affect the validity of any survey study. For this study, internal and external threats were mitigated by the following actions:
• Internal validity
  - Finding related papers: There is no guarantee that all papers related to the topic of this study have been found. Therefore, a search string containing different term synonyms was applied to various literature sources to obtain the related publications. Despite that, some important relevant papers may have been missed. To address this threat, the snowballing search technique was used to lower the possibility of missing them.
  - Paper inclusion/exclusion criteria: Applying paper selection criteria can pose a threat of personal bias. Thus, papers were included in or excluded from this study only after the authors reached an agreement.
• External validity
  - Study reproducibility: Another threat to consider is whether other researchers would be able to replicate this study and obtain similar results. This threat is addressed by providing the details of how the study was conducted: Section IV thoroughly describes each step of the research methodology used.

VII. CONCLUSIONS
Software covers many aspects of our day-to-day life, and our world cannot be imagined without the different types of software products that automate most of our activities. Therefore, developing high-quality software is crucial. However, faults are almost unavoidable in software products, even with all the current advancements in software development, and locating faults in software is a difficult, time-consuming, tedious, and costly task.
To overcome this issue, many fault localization techniques have been proposed in the literature. Compared to the other available techniques, SBFL is considered among the most prominent: it computes the suspiciousness of each program entity being faulty based on information gathered from test cases, their results, and the corresponding code coverage. Several important issues and challenges have been identified and categorized in this study. In each category, the most important issues have been briefly presented, together with some possible ideas to address them.
In conclusion, this survey aims to provide a clearer understanding of the most important challenges and issues in spectrum based fault localization, so that additional studies can be carried out to overcome these issues or possible avenues can be suggested for further exploration. The results of this paper may also be of great interest to novice testers and researchers who would like to contribute to this interesting topic. We hope that this paper will be regarded as a primary source of useful and relevant information on the issues and challenges in SBFL.

QUSAY I. SARHAN received his B.Sc. degree in Software Engineering from the University of Mosul, Iraq, in 2007 and his M.Tech. degree in Software Engineering from Jawaharlal Nehru Technological University, India, in 2011. Qusay has been a lecturer at the University of Duhok, Iraq since 2012. Currently, he is working at the Department of Software Engineering, University of Szeged, Hungary. Qusay has several national and international publications, and his research interests include software engineering, the internet of things, and embedded systems.
ÁRPÁD BESZÉDES obtained his Ph.D. degree in computer science from the University of Szeged in 2005 and is currently an associate professor at the same institution. His active research area is static and dynamic program analysis with special emphasis on software testing and debugging applications. Dr. Beszédes has over 100 publications and is regularly invited to serve in the Program Committees of various software engineering conferences, and as a reviewer and editor for software engineering and computer science journals.