Advanced Crowdsourced Test Report Prioritization Based on Adaptive Strategy

Crowdsourced testing is an emerging trend in software testing that takes advantage of the efficiency of crowd and cloud platforms, and it has gradually been applied in many fields. In crowdsourced software testing, workers submit their results as test reports after completing test tasks, so inspecting a large number of test reports is an arduous but unavoidable software maintenance task. Because crowdsourced test reports are numerous and complex, they need to be prioritized to improve inspection efficiency. However, no systematic method for crowdsourced test report prioritization has been proposed, whereas test case prioritization in regression testing has matured. Therefore, we migrate test case prioritization methods to crowdsourced test report prioritization and evaluate their effectiveness. We use natural language processing technology and word segmentation to process the text in the test reports, and then prioritize the reports with four methods: the total greedy algorithm, the additional greedy algorithm, the genetic algorithm, and ART. The results show that all of these methods perform well in prioritizing crowdsourced test reports, with an average APFD of more than 0.8.


I. INTRODUCTION
Crowdsourcing technology has been widely favored in software engineering research [1] in recent years. In crowdsourced software testing, the main participants are task requesters, crowdsourced workers, and crowdsourcing platforms [2]. Crowdsourced workers are required to perform test tasks and submit test reports on the system's behavior. The test reports usually consist of natural language descriptions, sometimes with screenshots. Since crowdsourced testing usually involves many users, the number of test reports can be large, and reviewing them is time-consuming and expensive. Therefore, developers are seeking methods to prioritize and filter valuable reports.
Previous research has produced mature test case prioritization techniques. Test case prioritization arranges the execution order of the test cases in a regression test suite so that faults are detected earlier. Existing prioritization algorithms can be divided into the following groups: greedy algorithms, search-based algorithms, information-retrieval-based algorithms, integer-linear-programming-based algorithms, and machine-learning-based algorithms [3]. In the field of crowdsourced test report prioritization, no mature technique has been proposed, whereas test case prioritization technology has matured. This motivates us to migrate and apply test case prioritization technology to the prioritization of crowdsourced test reports.
In this paper, we apply four test case prioritization techniques to crowdsourced test report prioritization and evaluate their effectiveness. The four methods are the total greedy algorithm, the additional greedy algorithm, the genetic algorithm, and ART [4]. We collect a total of 723 crowdsourced test reports for five different types of applications. After obtaining the crowdsourced test reports, we use natural language processing (NLP) technology and word segmentation to process the texts in the test reports. First, we gather all the texts of each report. Then, the Jieba library is used to perform word segmentation on the gathered texts. After word segmentation, we filter out the stop words based on a stop word list. Finally, we use a one-hot encoded vector to represent each report. After obtaining the vectors, we use the above four algorithms to prioritize the test reports. For the ART method, in addition to the word segmentation method, we also use the Word2Vec model [5] and TF-IDF [6] to encode the texts. The average percentage of faults detected (APFD) [7] is used to compare the four test report prioritization methods. The experimental results show that all four methods perform well in crowdsourced test report prioritization, with an average APFD above 0.8.
In this paper, our main contributions are as follows: 1) We are the first to apply test case prioritization methods to the prioritization of crowdsourced test reports, adopting corresponding text encoding strategies for different methods and exploring the feasibility of these methods. 2) We employ diverse strategies to rank test reports and evaluate these strategies on five crowdsourced test report datasets.
The rest of this paper is structured as follows. In Section II, we introduce the relevant background and motivation. Section III describes the four prioritization methods. In Sections IV and V, we conduct empirical experiments to evaluate the effectiveness of crowdsourced test report prioritization. Related work is reviewed in Section VI. Finally, in Section VII, we conclude this paper.

II. BACKGROUND AND MOTIVATION
In this section, we describe the background of crowdsourced software testing to motivate the prioritization of test reports. We also show some sample test reports. Prioritization of test cases is also introduced in this section.
Crowdsourced Software Testing. In 2006, Howe first proposed the concept of crowdsourcing [8]. Subsequently, many scholars defined crowdsourcing from different angles [9]. In crowdsourcing software engineering, researchers have proposed a large number of related application technologies and application scenarios. Zhang et al. derived the definition of crowdsourcing software testing from the definition of crowdsourcing software engineering [10]. Figure 1 shows the procedure of crowdsourcing testing. Testers prepare the software under test and test tasks. Test tasks are divided into sub-tasks. These tasks are released on a crowdsourced platform, and workers bid for test tasks. When a bug is detected, the worker submits a test report online.
In crowdsourced software testing, workers submit a test report after completing a task. Because crowdsourced testing involves many tasks, workers usually submit thousands of test reports. Testers manually check all test reports to judge the performance of workers, which is a time-consuming and tedious process. This motivates us to prioritize crowdsourced test reports.

Test Case Prioritization. Test case prioritization orders the test cases that need to be re-executed in regression testing. Test cases are executed in the prioritized order to detect more faults earlier. Test case prioritization can be defined as the following process: given a test suite T and the set PT of all possible permutations of T, find a permutation T′ ∈ PT such that f(T′) ⩾ f(T″) for every T″ ∈ PT, where f is a function from PT to the real numbers that evaluates an ordering [11].
Test case prioritization algorithms can be divided into several groups: greedy algorithms, search-based algorithms, information-retrieval-based algorithms, integer-linear-programming-based algorithms, and machine-learning-based algorithms [12]. The greedy algorithm is widely used to solve the test case prioritization problem; it searches for a locally optimal prioritization. Since the greedy algorithm does not always find the global optimum in the solution space, search-based algorithms are also used to solve the prioritization problem. Integer-linear-programming-based algorithms transform test case prioritization into a formula construction and solution process. Information-retrieval-based algorithms use each test case's execution information or source code to construct a corresponding document collection for each test case. Machine-learning-based algorithms build a model from sample inputs to make predictions on new data.

III. METHODOLOGY
In this section, we introduce our crowdsourced test report prioritization method in detail. Figure 2 shows the overall framework of our test report prioritization.

A. PREPROCESSING
After collecting the crowdsourced test reports, we first preprocess them. Each test report includes four parts: Environment, Input, Description, and Screenshot. Our preprocessing consists of three steps: word segmentation, stop word removal, and synonym replacement.
Word Segmentation. Word segmentation is a natural language processing (NLP) task, and many effective word segmentation tools exist for different languages [13]. Since our reports are written in Chinese, we use the Jieba library to segment sentences into words. Jieba is a word segmentation algorithm designed for Chinese. It uses a prefix dictionary to achieve efficient word graph scanning, generates a directed acyclic graph of all possible word formations of the Chinese characters in a sentence, and then uses dynamic programming to find the maximum-probability path and the most likely segmentation based on word frequency. For out-of-vocabulary words, an HMM (Hidden Markov Model) [14] based on the character-forming ability of Chinese characters is applied, using the Viterbi algorithm.
Stop Word Removal. In crowdsourced testing, workers usually come from different workplaces and have different language preferences and expression habits. Therefore, a test report may contain meaningless words, usually called "stop words" in the NLP literature, and these words should not be part of the features of the test report. We filter out these words according to a stop word list.
Synonym Replacement. In crowdsourced testing, workers may use different words to express the same concept in the test report. For example, "turn on" and "open" refer to the same semantics. To alleviate this problem, we perform the synonym replacement operation. We adopt the synonym library of Language Technology Platform (LTP) [15], which is considered to be one of the best Chinese NLP platforms.
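To make the stop word removal and synonym replacement steps concrete, the following is a minimal Python sketch. The stop-word list and synonym map are toy placeholders (the paper uses a stop word list and the LTP synonym library, and segments Chinese text with Jieba); the input here is assumed to be already segmented into tokens.

```python
# Illustrative preprocessing: stop-word removal, then synonym replacement.
# STOP_WORDS and SYNONYMS are hypothetical placeholders, not the paper's lists.
STOP_WORDS = {"the", "a", "is", "of"}
SYNONYMS = {"open": "turn on"}   # map each variant to a canonical form

def preprocess(tokens):
    """Drop stop words, then replace each remaining token by its canonical synonym."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [SYNONYMS.get(t, t) for t in kept]

tokens = ["the", "app", "is", "unable", "to", "open", "the", "camera"]
print(preprocess(tokens))  # → ['app', 'unable', 'to', 'turn on', 'camera']
```

With real reports, the same two passes run after Jieba segmentation, so that "turn on" and "open" style variants collapse to one feature.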

B. TEXT FEATURE EXTRACTION
We extract three types of report text features, i.e., TF-IDF (Term Frequency and Inverse Document Frequency) feature, word embedding feature, and one-hot encoding feature.
TF-IDF Feature. TF-IDF is a statistical method used to evaluate the importance of a word to a document in a corpus or document collection. The main idea of TF-IDF is that if a word or phrase appears with high frequency in one document but rarely in other documents, it has good discriminating ability and is suitable as a feature. Specifically, given a term T and a report R, TF(T, R) is the frequency of the term T in the report R, and IDF(T) is obtained by dividing the total number of reports by the number of reports containing the term T, and then taking the logarithm of the quotient. The calculation formula of TF-IDF is:

TF-IDF(T, R) = TF(T, R) × IDF(T)

Through the above formula, the text description of report R can be expressed as a TF-IDF vector, i.e., R = (W1, W2, ..., Wn), where Wi represents the TF-IDF value of the i-th term in report R.
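The TF and IDF definitions above can be sketched in a few lines of Python; this is an illustrative computation over toy token lists, not the paper's implementation.

```python
import math

def tf_idf_vectors(reports):
    """Compute one TF-IDF vector per report (a report is a list of tokens).
    TF(t, R) = frequency of t in R; IDF(t) = log(N / df(t)), where N is the
    number of reports and df(t) the number of reports containing t."""
    vocab = sorted({t for r in reports for t in r})
    n = len(reports)
    df = {t: sum(1 for r in reports if t in r) for t in vocab}
    vectors = []
    for r in reports:
        tf = {t: r.count(t) / len(r) for t in vocab}
        vectors.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, vectors

reports = [["login", "crash"], ["login", "slow"], ["crash", "screen", "crash"]]
vocab, vecs = tf_idf_vectors(reports)
# "login" occurs in 2 of 3 reports, so its IDF is log(3/2) > 0
```

A term appearing in every report gets IDF log(1) = 0 and therefore contributes nothing, which matches the intuition that such terms carry no discriminating power.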
Word Embedding Feature. Word embedding is a feature learning technique in natural language processing in which each word is mapped to a real-valued vector in a predefined vector space [16]. We use the Word2Vec model to obtain the word embeddings of a report. Word2Vec is one method of word embedding. The trained word embedding model converts each word into a 100-dimensional vector. A crowdsourced test report contains multiple words and can therefore be converted into a matrix, in which each row represents a term in the report. We then convert the report matrix into a vector by averaging all word vectors contained in the report.
Specifically, given a report matrix with a total of M rows, where the i-th row of the matrix is denoted r_i, the converted report vector R_w is generated as follows:

R_w = (1/M) × (r_1 + r_2 + ... + r_M)

According to the above formula, each crowdsourced test report can be expressed as a word embedding vector.
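The averaging step is straightforward; in this sketch, 3-dimensional toy vectors stand in for the 100-dimensional Word2Vec embeddings described above.

```python
def report_vector(word_vectors):
    """Average the M word vectors of a report: R_w = (1/M) * sum(r_i)."""
    m = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / m for d in range(dim)]

# A two-word report whose words have (toy) 3-dimensional embeddings:
words = [[1.0, 0.0, 2.0],
         [3.0, 4.0, 0.0]]
print(report_vector(words))  # → [2.0, 2.0, 1.0]
```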
One-hot Encoding Feature. One-hot encoding is also known as one-bit effective encoding. It represents a word as a vector of length V in which exactly one position is 1 and the rest are 0, where V is the size of the vocabulary of the corpus. We construct a corpus from the collected crowdsourced test reports and convert each report into a report vector encoded with 0s and 1s.
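The per-report 0/1 encoding can be sketched as follows: build the vocabulary from all reports, then mark, for each report, which vocabulary words it contains (an illustrative sketch, not the paper's implementation).

```python
def one_hot_reports(reports):
    """Encode each report as a 0/1 vector over the corpus vocabulary:
    position i is 1 iff the i-th vocabulary word occurs in the report."""
    vocab = sorted({t for r in reports for t in r})
    return vocab, [[1 if t in r else 0 for t in vocab] for r in reports]

reports = [["crash", "login"], ["screen", "crash"]]
vocab, vecs = one_hot_reports(reports)
print(vocab)  # → ['crash', 'login', 'screen']
print(vecs)   # → [[1, 1, 0], [1, 0, 1]]
```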

C. PRIORITIZATION STRATEGY
In this section, we use four prioritization strategies to prioritize test reports. These four strategies are total greedy algorithm, additional greedy algorithm, genetic algorithm, and ART.
Total Greedy Algorithm. The total greedy algorithm is widely used to solve the test case prioritization problem; it always selects the current best test case during prioritization. The algorithm favors test cases that cover more sentences and prioritizes test cases in descending order of the number of sentences each covers. It directly calculates the program entity coverage of each test case and sorts from high to low. Assuming there are m test cases and n program entities to be covered, the time complexity of this strategy is O(mn). For this strategy, we use the one-hot encoding feature to prioritize reports.
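Over one-hot report vectors, the total greedy strategy reduces to sorting by the number of covered terms; the sketch below is our own illustration of that ordering.

```python
def total_greedy(reports):
    """Order report indices by descending total coverage (number of 1s in the
    one-hot vector); ties keep the original submission order (stable sort)."""
    return sorted(range(len(reports)), key=lambda i: -sum(reports[i]))

reports = [[1, 0, 0, 0],   # covers 1 term
           [1, 1, 1, 0],   # covers 3 terms
           [0, 1, 1, 0]]   # covers 2 terms
print(total_greedy(reports))  # → [1, 2, 0]
```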
Additional Greedy Algorithm. The additional greedy algorithm introduces a feedback mechanism. It assumes that once the executed test cases cover some program entities, the remaining test cases need not be credited for covering those entities. Specifically, each time a test case is selected, the coverage information of the remaining test cases is updated. When all program entities are covered, the entities are reset to uncovered and the process is applied iteratively to the remaining test cases. The time complexity of this strategy is O(m²n). This algorithm uses the one-hot encoding feature to prioritize reports.
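A minimal sketch of the additional greedy loop over one-hot vectors follows; the reset-and-continue behavior is our reading of the procedure described above.

```python
def additional_greedy(reports):
    """Repeatedly pick the report covering the most not-yet-covered terms;
    when no remaining report adds coverage, reset coverage and continue."""
    remaining = list(range(len(reports)))
    covered, order = set(), []
    while remaining:
        def gain(i):
            return sum(1 for t, v in enumerate(reports[i]) if v and t not in covered)
        best = max(remaining, key=gain)
        if gain(best) == 0 and covered:
            covered = set()   # all coverable terms covered: reset and re-evaluate
            continue
        order.append(best)
        remaining.remove(best)
        covered |= {t for t, v in enumerate(reports[best]) if v}
    return order

reports = [[1, 1, 0, 0],
           [0, 0, 1, 1],
           [1, 1, 1, 0]]
print(additional_greedy(reports))  # → [2, 1, 0]
```

In the example, report 2 adds 3 new terms, then report 1 adds the last uncovered term; coverage is reset and report 0 is appended.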
The greedy method is a standard approach to the test case prioritization problem. After collecting the historical coverage information of test cases, a weight is set for each test case according to its coverage of program entities (such as statements, branches, or functions), and the greedy method guides the ordering of test cases. The assumption is that improving the early coverage of program entities helps improve the early detection rate of defects. The total greedy and additional greedy algorithms search for a locally optimal prioritization, so the result may not be globally optimal.
Genetic Algorithm. The genetic algorithm is an evolutionary algorithm. Its basic principle imitates the evolutionary law of "natural selection and survival of the fittest" in Darwin's theory of biological evolution. The genetic algorithm encodes the problem parameters into chromosomes. It then iteratively performs selection, crossover, and mutation to exchange chromosome information within the population, eventually evolving chromosomes that meet the optimization goal. In our setting, each test report sequence is encoded as an array of size n representing a chromosome. In the initial step, a set of test report sequences is randomly generated as the initial population. Selected individuals, guided by the fitness function, are combined to generate a new generation, and this process iterates until the stopping requirements are met. This algorithm uses the one-hot encoding feature for prioritization.
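A compact, mutation-only sketch of such a genetic algorithm over report orderings follows. The fitness function (cumulative coverage over prefixes) and the operators are our illustrative assumptions rather than the paper's exact configuration, and a full implementation would also include crossover.

```python
import random

def fitness(order, reports):
    """Reward early coverage: sum, over prefix positions, of the number of
    distinct terms covered so far (an illustrative fitness function)."""
    covered, score = set(), 0
    for i in order:
        covered |= {t for t, v in enumerate(reports[i]) if v}
        score += len(covered)
    return score

def mutate(order):
    """Swap two random positions (a simple mutation operator)."""
    a, b = random.sample(range(len(order)), 2)
    child = order[:]
    child[a], child[b] = child[b], child[a]
    return child

def genetic_prioritize(reports, pop_size=20, generations=50, seed=0):
    """Evolve orderings: keep the fitter half each generation and refill the
    population with mutated copies of the survivors."""
    random.seed(seed)
    n = len(reports)
    pop = [random.sample(range(n), n) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda o: -fitness(o, reports))
        survivors = pop[:pop_size // 2]            # selection (elitist)
        pop = survivors + [mutate(random.choice(survivors))
                           for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda o: fitness(o, reports))
```

Because the fitter half survives unchanged each generation, the best ordering found so far is never lost.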
The genetic algorithm belongs to search-based prioritization, characterized by finding the optimal solution under the guidance of a predefined fitness function in the search space.
ART. ART was proposed by Jiang et al. in 2009. Its primary purpose is to prioritize a given set of test cases. It iteratively builds a candidate set of test cases and then selects one test case from the candidate set, until all the given test cases are selected. A generation procedure builds the candidate set: it iteratively constructs a set of unselected test cases by randomly adding remaining test cases to the candidate set, as long as they can increase the program's coverage and the candidate set is not yet complete. A selection procedure then decides which candidate test case to choose. The latter requires a function f1 to calculate the distance between a pair of test cases and a function f2 to return the index of the candidate farthest from the already-prioritized set. The function f1 uses the Jaccard distance between two test cases based on their coverage. For a test suite with m test cases and n sentences, the time complexity of the algorithm is O(m²) in the best case and O(m³n) in the worst case. We use the TF-IDF feature, the word embedding feature, and the one-hot encoding feature for the ART algorithm.
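A simplified sketch of the ART selection loop over one-hot vectors follows; for brevity the candidate set here is drawn by plain random sampling rather than the coverage-increasing construction described above, and the farthest-candidate rule uses the nearest-neighbour distance to the already-prioritized set.

```python
import random

def jaccard_distance(a, b):
    """f1: Jaccard distance between the covered-term sets of two reports."""
    sa = {t for t, v in enumerate(a) if v}
    sb = {t for t, v in enumerate(b) if v}
    return 1 - len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def art_prioritize(reports, candidate_size=3, seed=0):
    """Pick a random first report, then repeatedly choose, from a random
    candidate set, the report farthest (f2: max of min distances) from the
    reports already prioritized."""
    random.seed(seed)
    remaining = list(range(len(reports)))
    order = [remaining.pop(random.randrange(len(remaining)))]
    while remaining:
        candidates = random.sample(remaining, min(candidate_size, len(remaining)))
        best = max(candidates,
                   key=lambda c: min(jaccard_distance(reports[c], reports[s])
                                     for s in order))
        order.append(best)
        remaining.remove(best)
    return order
```

Maximizing the minimum distance to the selected set spreads the ordering across dissimilar reports, so diverse failures surface early.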

IV. EXPERIMENT
In this section, we evaluate the effectiveness of four test report prioritization algorithms. We introduce the experimental settings and the baseline method compared with the four prioritization algorithms. We also set up two research questions (RQ). Finally, we introduce the evaluation metric.

A. EXPERIMENT SETUP
To evaluate the four prioritization algorithms, we design an empirical experiment. We collect 723 crowdsourced test reports from 5 different mobile applications to complete the experiment. Table 2 shows the detailed information. The labels of these applications range from A1 to A5, and the number of test reports for different applications ranges from 39 to 338.

B. BASELINE
We select two baselines for comparison to verify the effectiveness of the four prioritization methods. The first baseline is the Random strategy, which is widely used in software testing; it orders the test reports randomly, forming a priority ranking list. The second baseline is the Best strategy, an ideal prioritization technique.
To compare these prioritization methods fairly, we repeat the experiment 10 times to collect experimental data.

C. RESEARCH QUESTION
In the experiment, we design the following two research questions.
• RQ1: How effective are these four prioritization algorithms?
• RQ2: How far are these four methods from the Best strategy?

When no prioritization method is applied, testers fall back on the Random strategy to order the test reports. Testers will only adopt a prioritization method if it outperforms the Random strategy. RQ1 evaluates the effectiveness of the four prioritization methods.
In practice, it is difficult to design effective methods in all situations. Therefore, it is valuable to understand the gap between the current four methods and the theoretically Best method. RQ2 evaluates the room for improvement in the four prioritization methods.

D. EVALUATION METRIC
In order to measure the effectiveness of the four prioritization methods, we adopt APFD (Average Percentage of Faults Detected) [17]. Test report prioritization mirrors test case prioritization, and APFD is an evaluation metric widely used in test case prioritization [18]; therefore, we use it to evaluate the four test report prioritization methods. For each bug, we record the index of the first test report that reveals it. Given the order of the test reports and which reports reveal which faults, we can calculate the APFD value to measure the effectiveness of the prioritization method. The value of APFD ranges from 0 to 1; the higher the APFD value, the better the prioritization result. The calculation formula of APFD is as follows, where n is the number of test reports, M is the total number of bugs revealed in the test reports, and TFi is the index of the first test report revealing bug i:

APFD = 1 − (TF1 + TF2 + ... + TFM) / (n × M) + 1 / (2n)
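The standard APFD formula can be computed as follows; representing the fault data as a mapping from each fault to the set of report indices that reveal it is our assumed input format.

```python
def apfd(order, faults):
    """APFD for a report ordering. `faults` maps each fault id to the set of
    report ids that reveal it; TF_i is the 1-based position in `order` of the
    first report revealing fault i."""
    n, m = len(order), len(faults)
    pos = {r: i + 1 for i, r in enumerate(order)}
    tf_sum = sum(min(pos[r] for r in revealing) for revealing in faults.values())
    return 1 - tf_sum / (n * m) + 1 / (2 * n)

# 4 reports, 2 faults: fault "a" is first shown by report 2, fault "b" by report 0
faults = {"a": {2, 3}, "b": {0}}
print(apfd([2, 0, 1, 3], faults))  # → 1 - (1+2)/8 + 1/8 = 0.75
```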

V. RESULT ANALYSIS
In this section, we analyze the experimental results to answer RQ1 and RQ2. The results of all prioritization methods are shown in Figure 3. Figure 3 shows a box plot of APFD results for ten experimental runs of five applications. The application label is displayed on the horizontal axis, and the APFD value is displayed on the vertical axis.

A. ANSWER TO RQ1
RQ1: How effective are these four prioritization algorithms? According to the results shown in Figure 3, all four prioritization methods outperform Random. For the greedy strategies, the average APFD of the additional algorithm is slightly better than that of the total algorithm on these five applications. For the ART strategy, there is little difference among the results obtained with the three different features: on applications A2 and A5 the average APFD values are identical, and on the remaining three applications they differ by less than 3%. Among these strategies, the genetic algorithm achieves an average APFD above 0.9 on every application. In addition, the box plot and Table 3 show that these four methods are more stable than the random method. In summary, these four prioritization methods can improve the effectiveness of test report prioritization.

B. ANSWER TO RQ2
RQ2: How far are these four methods from the Best strategy?
In order to avoid contingency in the experimental results, we run the random method ten times on the test report dataset of each application and calculate the average APFD. Tables 4 to 9 show the comparison results of the four prioritization strategies and the baseline methods. In all the tables, "Improvement" means the improvement of a method over Random, calculated as (X − Random) / Random, where X is our method. "Gap" means the gap between a method and the Best strategy, calculated as (Best − X) / Best. As can be seen from the tables, the results of these four strategies are better than Random and close to the Best results. Of the two greedy algorithms, the additional greedy algorithm performs better than the total greedy algorithm on these five applications, and its gap with Best is smaller. For the ART algorithm, we used three features, and the resulting APFD values differ little. Overall, the APFD values of the four strategies are similar, and all are better than randomly ordered test reports.
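The two comparison metrics are simple ratios; the Gap formula below assumes it mirrors the Improvement formula relative to Best, since the source text leaves that formula implicit.

```python
def improvement(x, random_apfd):
    """Relative improvement over the Random baseline: (X - Random) / Random."""
    return (x - random_apfd) / random_apfd

def gap(x, best_apfd):
    """Relative gap to the Best strategy (assumed form): (Best - X) / Best."""
    return (best_apfd - x) / best_apfd

print(improvement(0.90, 0.75))  # ≈ 0.2 (20% better than Random)
print(gap(0.90, 0.95))          # ≈ 0.053 (about 5% below Best)
```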

C. THREATS TO VALIDITY
In our research, there are some general threats to the validity of our results.
Crowdsourced worker capacity. The capacity of the crowdsourced workers is not controlled, so there may be some low-quality reports. However, the four strategies identified the errors described in the reports even when some reports were of low quality. If a test report contains no errors, these strategies place it in a single category without affecting the priority of other reports.
Experimental software selection. We selected five applications widely used on the Internet that are not specifically designed for our research. Due to time and cost constraints, we only collected crowdsourced test reports for these five applications. The amount and type of software may threaten the generalization of our conclusions. However, there are different types of software in our choice, which may reduce the threat to a certain extent.
Test report text. In our experiment, all test reports are written in Chinese, which may threaten to apply these methods to other natural language test reports. However, the current natural language processing technology is very mature, and there are different toolkits for processing different languages. Moreover, our preprocessing operations do not limit the type of language. In summary, we adopt different toolkits for different languages to deal with them, which can reduce this threat.

VI. RELATED WORK
In this section, we introduce the work related to our research, including crowdsourced testing, crowdsourced test report processing, and test case prioritization.

A. CROWDSOURCED TESTING
Crowdsourcing refers to a company or organization that outsources work tasks performed by employees in the past to non-specific and usually large mass networks in a free and voluntary manner. Since crowdsourcing can simulate different conditions of use and the economic cost is relatively low [19] [20], crowdsourcing has been applied in many fields.
Crowdsourced testing is a relatively new trend in software engineering [21]. Liu et al. [22] studied the results of applying crowdsourcing to usability testing and empirically compared crowdsourced usability testing with more traditional face-to-face usability testing. They found that crowdsourcing exhibits some limitations compared to traditional lab environments, but its applicability and value in usability testing are clear. Dolstra et al. [23] applied crowdsourcing to GUI testing by instantiating virtual machines running the system under test, giving testers access through their web browsers, and outsourcing the testing tasks to a large number of workers on the Internet. Nebeling et al. [24] developed a toolkit for crowdsourced testing of web interfaces that not only efficiently recruits large numbers of test users but also evaluates websites under many different conditions.
Another line of research focuses on the selection of crowdsourced workers, since not all workers are qualified to perform a particular test task. Cui et al. [35] proposed a multi-objective crowd worker selection method (MOOSE), which utilizes the widely used multi-objective evolutionary algorithm NSGA-II to optimize three objectives when selecting workers. ExReDiv [36], a novel hybrid method combining an experience strategy, a relevance strategy, and a diversity strategy, selects a set of workers for testing tasks. Furthermore, Xie et al. [37] propose a new research problem called Crowdsourced Test Quality Maximization under Context Coverage Constraints.
The studies above apply crowdsourcing to traditional software testing activities to solve problems in traditional software testing. In contrast, we propose migrating test case prioritization methods to test report prioritization to solve a problem in crowdsourced testing.

B. CROWDSOURCED TEST REPORT PROCESSING
In crowdsourced testing, a large number of test reports are generated, and developers must spend considerable time checking them; the key issue is improving the efficiency with which developers review test reports. There has been extensive research on processing crowdsourced test reports to better assist developers in reviewing them.
Wang et al. [25] proposed a clustering-based classification method that first clusters similar reports together and then uses an ensemble method to build a classifier based on the most similar clusters, which can reduce the effort required for manual inspection and facilitate project management in crowdsourced testing. This is the first work to address the classification of test reports in industrial crowdsourced testing practice. Wang et al. [26] proposed Local-based Active Classification (LOAF) to accurately classify faults in crowdsourced test reports. LOAF recommends a small set of the most informative instances in the local neighborhood, asks the user for their labels, and then learns a classifier based on the local neighborhood. Furthermore, Chen et al. proposed a framework named Test Report Augmentation Framework (TRAF) to improve the quality of inspected test reports [27]. They enhance test reports with additional useful information from duplicate test reports, which can help developers better understand and fix bugs. Sun et al. [28] used an information retrieval model to more accurately detect duplicate bug reports.
The above methods, including report classification and duplicate detection, all select part of the test reports to represent the whole. In addition, some studies attempt to assist the generation of test reports. Liu et al. used the spatial pyramid matching technique to measure similarity and extract features from screenshot images in order to generate descriptions for screenshots, which can help developers better understand mobile crowdsourced test reports [29]. Yu et al. generate crowdsourced bug reports by analyzing uploaded error screenshots with image understanding techniques to provide operational guidance for crowdsourced workers [30].

C. TEST CASE PRIORITIZATION
Test case prioritization is proposed in regression testing to handle the execution order of test cases by scheduling to detect more failures earlier [31]. Rothermel et al. [32] proposed various basic test case prioritization techniques, including total techniques and additional techniques. Zhang et al. [33] [34] proposed a unified priority model that uses a probabilistic model to bridge the gap between total and additional algorithms.
Li et al. [38] applied meta-heuristic search algorithms, namely the steepest-ascent hill-climbing algorithm and the genetic algorithm, to test case prioritization. Zhang et al. [39] first applied integer linear programming to solve the time-aware test case prioritization problem. Nguyen et al. [40] proposed an information-retrieval-based method to prioritize test cases for web services, which represents each test case by an identifier document extracted from its execution trace and uses web service change descriptions as queries for information retrieval. Tonella et al. [41] proposed a machine-learning-based test case prioritization method that trains a model to predict test case priority, using priority indicators defined from user use cases together with test case information such as coverage and failure-proneness indicators as features. Ledru et al. [42] used string distance comparisons on test case texts to prioritize test cases and provided empirical results for different distances. In addition, [43] proposed ESBS, which iteratively selects test cases from each cluster and uses spectral information in cluster test selection.
Test case prioritization techniques have matured, which inspired us to migrate test case prioritization techniques into test report prioritization. Most test case prioritization techniques use execution profiles, whereas we use test reports in natural language. Test case prioritization enables more failures to be detected earlier by executing test cases. Prioritizing test reports can improve tester review efficiency by checking test reports.

VII. CONCLUSION
We employ diverse strategies to prioritize test reports to help developers prioritize and filter useful report information. We process the text in the test reports using NLP techniques and use three different features to represent each report: one-hot encoding, TF-IDF, and word embedding. After obtaining the feature vectors, we use the total greedy algorithm, the additional greedy algorithm, the genetic algorithm, and ART to prioritize the test reports. Five mobile crowdsourced test report datasets are used to evaluate the effectiveness of the test report prioritization methods. The results of the empirical study show that all four prioritization methods perform well and can effectively help developers reduce inspection costs.

In the future, we will continue to explore adaptive test report prioritization methods beyond borrowing techniques from test case prioritization. We will also try to design a new test report prioritization technique that starts from the test report itself, analyzing the images and text information contained in the report and extracting the corresponding features for prioritization. Finally, we will apply these techniques in production to solve test-report-related tasks.