DBSCAN-Based Automatic De-Duplication for Software Quality Inspection Data

Software quality inspection generates large volumes of data, and removing duplicate records can improve the efficiency of software quality inspection. This paper studies an automatic de-duplication method for software quality inspection data based on density-based spatial clustering of applications with noise (DBSCAN). An intelligent optimization algorithm is used to generate software quality inspection data by initializing individuals, calculating fitness function values, improving individuals, and splitting the individuals that meet the termination condition. A local linear embedding algorithm is selected to extract features from the software quality inspection data by searching neighborhood points and calculating reconstruction weights and projection vectors. Based on the extracted features, a region-partitioned DBSCAN multi-density clustering algorithm is applied, and automatic de-duplication of software quality inspection data is realized through grid division, data bin division, and grid merging. The experimental results show that the precision and recall of this method are higher than 99% and the resource consumption rate is low, so the method can effectively improve the efficiency of software quality inspection.


I. INTRODUCTION
With the rapid development of information technology, the scale of data has grown explosively, so mining valuable information from complex data has great practical significance [1]. As an important method in the field of data mining, clustering is widely used in data analysis. DBSCAN is a typical clustering algorithm that uses data density as its measure and can identify classes of arbitrary shape as well as noise points in data sets. Some statistics show that by 2020, the data accumulated by China accounted for 20% of global data. Nowadays, both governments and enterprises have accumulated large amounts of data [2], and mining useful information from these data has great research value. As an important technology in the field of data mining, clustering has been applied in many fields, such as pattern recognition, information retrieval, network public opinion prediction, and so on.
Clustering is an unsupervised learning method that divides objects into several classes according to their similarity, so that objects in the same class are as similar as possible while objects in different classes differ as much as possible. Driven by the diversity of data sets and the wide range of applications, the types and scale of clustering algorithms have developed greatly. According to the clustering principle, clustering algorithms are mainly divided into partition-based, hierarchy-based, density-based, grid-based, and model-based methods. Among them, the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a classical density-based clustering algorithm that can not only find classes of arbitrary shape in a data set, but also find noise points and outliers. At present, DBSCAN has been applied to e-commerce, network public opinion prediction, and other fields, where it has played a significant role. For example, in e-commerce, users can be grouped by their characteristics with DBSCAN, and managers can set different marketing strategies for different types of users to improve the user experience [3]-[5]. In the field of network public opinion, the development of some events can be predicted from the characteristics of similar events. At the same time, because DBSCAN can identify outliers, it can provide technical support in the field of network crime.
For example, network users who provide false or fraudulent information can be identified as key investigation targets based on how their behavior trails differ from those of ordinary users. In earlier work, the naive Bayes algorithm was applied to multiple news data sets used as training samples, combined with data preprocessing, Chinese word segmentation, and other methods, to build a mathematical model for classifying news text. However, naive Bayes is a supervised learning algorithm that requires labeled samples as a training set, so it is not suitable as a de-duplication classification algorithm. Other methods reduce the linear and nonlinear impairments in multi-input single-output (MISO) visible light communication (VLC) systems: by introducing a novel dimming control scheme at the transmitter and generating various superimposed constellation structures at the receiver, spatial multiplexing was realized. However, the K-means clustering algorithm requires the number of clusters to be given in advance. To address the difficulty of determining the parameters of DBSCAN, the classic OPTICS algorithm was proposed. This algorithm does not generate explicit classes for the data set; instead, it computes a cluster ordering that reflects the density distribution structure of the data set and provides sufficient reference information for parameter setting. However, the cluster ordering of OPTICS is computed over the whole data set, which leads to low scalability and high costs for parallelization and incremental processing. A method based on multi-dimensional parameter setting was then proposed to solve the parameter-setting problem when DBSCAN processes high-dimensional data. To improve the time efficiency of DBSCAN, the class fusion criterion of the algorithm was changed and a DBC algorithm based on similarity weight measurement was designed; it greatly reduces time complexity by changing the class fusion strategy. Other improvements include using a graph-based index to speed up neighborhood queries in high-dimensional data sets and using data partitioning to accelerate DBSCAN to a certain extent.
With the increasing scale and complexity of software applications, software quality and reliability have become particularly important, and computer software quality testing has attracted wide attention. In the process of using computer software, defects in the software itself affect its normal use. Software defects exist objectively in the development process; if they are not found in time, the software may become unusable or even cause serious, irreparable consequences. In related research, Gokilavani et al. [6] proposed using the K-means clustering algorithm for software testing. After clustering, the method applies a sorting algorithm to prioritize the software test data within each cluster. Finally, the software performance is cross-validated by analyzing the priority ranking of clusters, the number of distinct faults detected relative to the number of test cases, the adjacency matrix between test cases, and the failure detection rate. Although this approach enables software testing after data prioritization, its efficiency in handling duplicate data is not high and needs further improvement. Guo et al. [7] proposed an improved oversampling method for unbalanced data: a SMOTE algorithm based on Canopy and K-means. The Synthetic Minority Oversampling Technique (SMOTE) is a preferred method for the unbalanced data classification problem. Their method, called ''C-K-SMOTE,'' is a hybrid clustering algorithm combining Canopy and K-means: to obtain approximately balanced data, Canopy is first used for approximate clustering, K-means is then used for accurate clustering, and the data are reclassified. This method achieves a certain de-duplication efficiency, but the accuracy of its results still needs improvement. Zhang et al. [8] proposed RMD, a de-duplication scheme based on similarity and merging that aims to provide quick responses to fingerprint queries. The key idea of RMD is to significantly reduce the query scope by using Bloom filter arrays and data similarity algorithms. When data are ingested, RMD uses a similarity algorithm to detect similar data segments and place them in the same bin; at query time, duplicate content can be detected by searching only the corresponding bin, which greatly speeds up the query. However, the resource consumption rate of this method is high.
In order to further improve the efficiency of software quality inspection, this paper proposes an automatic de-duplication method for software quality inspection data based on DBSCAN clustering. Considering that software quality inspection data have distinctive characteristics, the algorithm clusters and de-duplicates the data according to those characteristics.

II. METHODOLOGY

A. SOFTWARE QUALITY INSPECTION DATA GENERATED BY AN INTELLIGENT OPTIMIZATION ALGORITHM
When obtaining the quality inspection data of computer software, appropriate mapping rules should first be determined. Binary coding is a suitable mapping form: it is simple to express and easy to operate, and it can improve the computational efficiency of the intelligent optimization algorithm. Different data types require different encoding mappings. Because other parameters may be affected when testing computer software data, each input parameter must be encoded independently into a binary string, and all the parameter strings are then concatenated to form a single individual; this is called multi-parameter concatenated coding. Before concatenation the parameters are $X_1, X_2, \ldots, X_n$; after concatenation they form one individual $X = X_1 X_2 \cdots X_n$. In decoding, the total code is cut into $n$ segments, each a data chain of length $m$, and each data chain is decoded separately.
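As a minimal illustration of multi-parameter concatenated coding, the following Python sketch encodes several real-valued parameters into m-bit binary segments, concatenates them into one individual, and decodes them again. The bit width m, the value range, and the function names are assumptions made for the example, not part of the original method.

```python
def encode_params(params, lo, hi, m=8):
    """Map each real-valued parameter to an m-bit binary string and concatenate."""
    chromosome = ""
    for p in params:
        # Scale the parameter into the integer range [0, 2^m - 1].
        level = round((p - lo) / (hi - lo) * (2**m - 1))
        chromosome += format(level, f"0{m}b")
    return chromosome

def decode_params(chromosome, n, lo, hi, m=8):
    """Cut the total code into n segments of length m and decode each separately."""
    params = []
    for i in range(n):
        segment = chromosome[i * m:(i + 1) * m]
        level = int(segment, 2)
        params.append(lo + level / (2**m - 1) * (hi - lo))
    return params

x = encode_params([0.25, 0.5, 0.75], lo=0.0, hi=1.0, m=8)   # one 24-bit individual
print(x, decode_params(x, n=3, lo=0.0, hi=1.0, m=8))
```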
The intelligent optimization transformation equation is adopted:

$$U(\theta) = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \quad (1)$$
In equation (1), $\theta$ is the rotation angle. The steps of using the intelligent optimization algorithm to generate test data are as follows:

Start algorithm

Step 1. Initialize individuals. Scan the given path, find the variables for which test data must be generated, assign a random 0/1 string to each variable, and form individuals according to the principle of multi-parameter concatenated coding.
Step 2. Calculate the fitness function values. According to the constructed fitness function, the fitness function value of each group of variables is calculated. If the termination condition is met, go to Step 4.
Step 3. Improve individuals. If no individual meets the conditions, perform the following operations:
a) Selection: select the next-generation individuals according to their fitness function values.
b) Crossover: randomly select two individuals and perform single-point crossover to obtain new individuals [9]; repeat until all individuals have been selected.
c) Mutation: randomly introduce variations into the crossed individuals to produce new individuals.
d) Go to Step 2.
Step 4. Split the individuals that meet the conditions. Convert the 0/1 string corresponding to each variable into a decimal number; these values are the generated test data.
End algorithm
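A compact sketch of Steps 1 through 4 follows. The fitness function here is a placeholder that treats the target path as covered when a simple condition on the decoded inputs holds; a real fitness function would be built from the branch conditions of the path under test. The population size, bit width, and mutation rate are likewise illustrative assumptions.

```python
import random

M, N_PARAMS, POP = 8, 2, 20          # bits per parameter, parameter count, population size

def fitness(values):
    # Placeholder: assume the target path is reached when x + y == 100.
    x, y = values
    return abs(x + y - 100)           # 0 means the termination condition is met

def decode(ind):
    # Step 4 style decoding: cut the chromosome into M-bit segments -> decimal values.
    return [int(ind[i * M:(i + 1) * M], 2) for i in range(N_PARAMS)]

# Step 1: random 0/1 strings concatenated per multi-parameter coding.
pop = ["".join(random.choice("01") for _ in range(M * N_PARAMS)) for _ in range(POP)]

for gen in range(200):
    scored = sorted(pop, key=lambda ind: fitness(decode(ind)))   # Step 2
    if fitness(decode(scored[0])) == 0:                          # termination condition
        break
    parents = scored[:POP // 2]                                  # Step 3a: selection
    children = []
    while len(children) < POP:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, M * N_PARAMS)                  # Step 3b: single-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.05:                               # Step 3c: mutation
            i = random.randrange(len(child))
            child = child[:i] + ("1" if child[i] == "0" else "0") + child[i + 1:]
        children.append(child)
    pop = children

print("generated test data:", decode(scored[0]))                 # Step 4
```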

B. FEATURE EXTRACTION OF SOFTWARE QUALITY INSPECTION DATA
The local linear embedding (LLE) algorithm is a local unsupervised algorithm. Its basic idea is to assume that the data have a local linear structure [10]-[12]. The local geometric structure of the data is described by linearly reconstructing each sample point from its nearest neighbors, and the local pieces of information are then combined to describe the global structure. In essence, the algorithm maps the high-dimensional data matrix to a low-dimensional embedding. The algorithm can be divided into three steps.

Step 1: search neighborhood points.
The Euclidean distance

$$d_{ij} = \sqrt{\sum_{m=1}^{D} (x_{im} - x_{jm})^2} \quad (2)$$

from each sample point $x_i$ to every other point in the data set is calculated to form the distance matrix; then, for each sample point $x_i$, the $k$ points closest to it are selected from the distance matrix to form its neighborhood set.

Step 2: calculate the reconstruction weights. First, the sample point $x_i$ is expressed linearly, namely $x_i \approx \sum_{j=1}^{k} w_{ij} x_j$, where $w_{ij}$ is the weight with which sample point $x_j$ reconstructs $x_i$, and $x_j$ belongs to the neighborhood of $x_i$. The weights $w_{ij}$ are then obtained by minimizing the reconstruction error:

$$\varepsilon(w) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{k} w_{ij} x_j \Big\|^2 \quad (3)$$

The constraint condition of the above equation is $\sum_{j=1}^{k} w_{ij} = 1$. Obviously, if $x_j$ does not belong to the neighborhood of $x_i$, then $w_{ij} = 0$. To solve equation (3) more easily, the local covariance matrix $C_i$, with entries $(C_i)_{js} = (x_i - x_j)^{T}(x_i - x_s)$, is introduced, and the Lagrange multiplier method yields the locally optimal reconstruction weight matrix:

$$w_{ij} = \frac{\sum_{s=1}^{k} (C_i^{-1})_{js}}{\sum_{l=1}^{k}\sum_{s=1}^{k} (C_i^{-1})_{ls}} \quad (4)$$
Step 3: calculate the $d$-dimensional projection vectors. With $w_{ij}$ fixed, the embedding reconstruction error is defined as:

$$\Phi(Y) = \sum_{i=1}^{N} \Big\| y_i - \sum_{j=1}^{k} w_{ij} y_j \Big\|^2 \quad (5)$$

The projection vectors $y_i$ are obtained by minimizing this quantity. To solve equation (5), $w_{ij}$ is extended to a sparse matrix $w \in \mathbb{R}^{N \times N}$, and equation (5) can be rewritten as:

$$\Phi(Y) = \operatorname{tr}\big(Y M Y^{T}\big) \quad (6)$$

where $M \in \mathbb{R}^{N \times N}$. Therefore, the solution that minimizes equation (5) consists of the eigenvectors corresponding to the $d$ smallest nonzero eigenvalues of the symmetric sparse cost matrix $M = (I - w)^{T} (I - w)$.
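The three LLE steps are implemented, for example, by scikit-learn's LocallyLinearEmbedding. The sketch below runs it on synthetic stand-in data; the neighborhood size k and target dimension d that would suit real inspection data are assumptions here.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Synthetic stand-in for the inspection data matrix (N samples, D features).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# k nearest neighbours and target dimension d are illustrative choices.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
Y = lle.fit_transform(X)              # steps 1-3: neighbours, weights w_ij, projections y_i

print(Y.shape)                        # (500, 3): low-dimensional feature vectors
print(lle.reconstruction_error_)      # residual of the embedding cost tr(Y M Y^T)
```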

C. AUTOMATIC DATA DEDUPLICATION BASED ON DBSCAN MULTI-DENSITY CLUSTERING ALGORITHM
As a density-based clustering algorithm, DBSCAN can cluster data sets of arbitrary shape. The algorithm requires a pair of global density parameters, MinPts and Eps, which the user chooses appropriately: MinPts is the minimum number of data points required in a certain neighborhood of an object, and Eps is the radius of that neighborhood. Based on this pair of globally unique density parameters, DBSCAN can detect classes and noise points in the data set [13]. Generally, within a class the density of boundary points is lower than that of interior points, and the density of noise points is lower still. The DBSCAN algorithm uses the density parameters to identify all high-density classes and low-density noise points in the data set.
DBSCAN introduces intuitive definitions of ''class'' and ''noise point'' in a data set. Its density criterion is that each point must have at least MinPts points within the given neighborhood radius Eps; that is, the density in the neighborhood must exceed a certain threshold. In general, a class contains two types of points: points in the interior of the class are called core points, and points on the boundary of the class are called boundary points [14]. In DBSCAN, if the number of points in the Eps-neighborhood of a data point $p$ is at least the threshold MinPts, that is, $N_{Eps}(p) \geq MinPts$, then $p$ is called a core point. If the number of points in the Eps-neighborhood of a data point is less than MinPts but its neighborhood contains a core point, it is called a boundary point.
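The core/boundary/noise distinction can be illustrated with scikit-learn's DBSCAN, as in the following sketch; the Eps and MinPts values and the synthetic two-class data are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),     # one dense class
               rng.normal(5, 0.3, (100, 2)),     # a second dense class
               rng.uniform(-2, 7, (10, 2))])     # scattered noise points

# Eps and MinPts are illustrative values, not tuned for real inspection data.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True              # >= MinPts neighbours within Eps
noise = db.labels_ == -1                          # density below threshold, no nearby core
border = ~core & ~noise                           # in a class but below the core threshold

print(f"core: {core.sum()}, border: {border.sum()}, noise: {noise.sum()}")
```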

1) GRID DIVISION
Given a $d$-dimensional data set $D$ with $N$ data points, in which every dimension attribute $A_i$ ($i = 1, 2, \cdots, d$) is bounded, let the values of the $i$-th dimension lie in the interval $Rg_i = [l_i, h_i]$; then $S = Rg_1 \times Rg_2 \times \cdots \times Rg_d$ is the $d$-dimensional data space. Each dimension of the data space is divided into equal, disjoint intervals to form grid cells; these cells are left-closed and right-open in each dimension. In this way, the data space is divided into equal-volume hyper-rectangular grid cells, with $num_i$ intervals in the $i$-th dimension.
The grid side length $len$ is set according to equation (7), in which $a$ is the grid control factor used to control the size of the grid; all experiments in this paper use $a = 1.5$. From the grid side length, the number of intervals in each dimension is calculated as:

$$num_i = \left\lceil \frac{h_i - l_i}{len} \right\rceil \quad (8)$$

2) DATA BIN DIVISION

Each object in the data set is mapped to its corresponding grid. For each data object $X = (x_1, x_2, \cdots, x_d)$, the subscript of the corresponding grid in the $i$-th dimension is:

$$ind_i(X) = \left\lfloor \frac{x_i - l_i}{len} \right\rfloor + 1 \quad (9)$$

Each object $X$ in the data set is mapped to its grid $g$ according to equation (9), and the number of objects in grid $g$ is denoted $den(g)$.
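A sketch of the grid mapping follows. The side length len is taken as given (by equation (7)); the data, value ranges, and the dictionary used to count den(g) are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def grid_index(X, lows, length):
    """Map each object to its per-dimension grid subscript, equation (9) style."""
    return np.floor((X - lows) / length).astype(int)

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(1000, 2))             # stand-in d-dimensional data set

lows, highs = X.min(axis=0), X.max(axis=0)
length = 1.0                                        # side length from equation (7); assumed here
num = np.ceil((highs - lows) / length).astype(int)  # intervals per dimension, equation (8) style

idx = grid_index(X, lows, length)
den = Counter(map(tuple, idx))                      # den(g): number of objects in each grid g
print(num, len(den), max(den.values()))
```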

3) GRID MERGING AREA
Adjacent grids are defined as follows: grids $g_1$ and $g_2$ are adjacent if $|ind_i(g_1) - ind_i(g_2)| \leq 1$ ($i = 1, 2, \cdots, d$), so a grid has at most $3^d - 1$ adjacent grids. Using the relative density difference [15], for two grid cells $g_1$ and $g_2$ with densities $den(g_1)$ and $den(g_2)$, the relative density difference of $g_2$ relative to $g_1$ is defined as:

$$rgdd(g_1, g_2) = \frac{|den(g_1) - den(g_2)|}{den(g_1)} \quad (10)$$

This is the condition for grid merging. First, the grid with the highest density is selected as the initial grid cell $g_0$, and the relative density difference $rgdd(g_0, g)$ between each adjacent grid $g$ and $g_0$ is calculated according to equation (10). If $rgdd(g_0, g) < \varepsilon$ ($\varepsilon$ is a given parameter), then $g_0$ and $g$ are merged. The initial cell $g_0$ then becomes the merged large grid area, and its density is updated as

$$den(g_0) = \frac{den(g_0) + den(g)}{G_{num}},$$

where $G_{num}$ represents the number of cells that have been merged; that is, the density of the initial cell is dynamic.
The merged region continues to expand outward until none of the boundary grids satisfies $rgdd(g_0, g) < \varepsilon$. Because some points in a boundary grid may be boundary points of the region, such as boundary points of a cluster, which cannot be regarded as noise [16], [17], the boundary grids are also merged into the region, but they do not continue to expand outward. The merged grids form a region whose data set is $D_1$. Then, among the remaining unprocessed grids, the one with the highest density is taken as the next initial grid cell, and the above steps are repeated until the remaining data points can no longer be clustered [18]. In this way, the data set $D$ is divided into $D_1, D_2, \ldots, D_K$ and preliminary noise points.
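The region-growing merge can be sketched as follows. The relative-density-difference test mirrors equation (10) as reconstructed above, and the toy density table den is an assumption made for the example.

```python
from itertools import product

def merge_region(den, eps_rel):
    """Grow one region from the densest grid; the initial-cell density den(g0)
    is updated dynamically as cells are merged."""
    g0 = max(den, key=den.get)                    # initial cell: highest density
    region, frontier = {g0}, [g0]
    total, count = den[g0], 1                     # running den(g0) over merged cells
    while frontier:
        g = frontier.pop()
        for off in product((-1, 0, 1), repeat=len(g)):   # at most 3^d - 1 neighbours
            nb = tuple(a + b for a, b in zip(g, off))
            if nb == g or nb in region or nb not in den:
                continue
            d0 = total / count                    # dynamic initial-cell density
            if abs(d0 - den[nb]) / d0 < eps_rel:  # rgdd(g0, g) < epsilon: merge and expand
                region.add(nb); frontier.append(nb)
                total += den[nb]; count += 1
            else:                                 # boundary grid: merged, not expanded
                region.add(nb)
    return region

# Toy density table den(g), keyed by grid subscripts (illustrative values).
den = {(0, 0): 60, (0, 1): 50, (1, 0): 45, (1, 1): 8, (5, 5): 20}
print(merge_region(den, eps_rel=0.5))             # grows around (0, 0); (5, 5) stays apart
```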

4) DBSCAN CLUSTERING BASED ON REGION DIVISION
The above method has roughly divided the data set D into K different density data regions and noise points. The next step is to cluster each different density region by DBSCAN.
According to the grid division method, in each round the remaining grid with the highest density is taken as the initial grid cell [19], so the densities of regions $D_1$ through $D_K$ are essentially decreasing. During clustering, the parameters Eps and MinPts are input for the first region $D_1$ and DBSCAN clustering is performed. The Eps parameters of $D_2$ through $D_K$ are obtained automatically from equation (11), in which $num_i$ represents the number of data points in $D_i$ ($i = 2, 3, \cdots, K$) and $G_i$ represents the number of grids merged in the $i$-th region.

5) ALGORITHM DESCRIPTION
First, based on the extracted features of the software quality inspection data, the data space is roughly divided into regions by grid partitioning; then an Eps parameter suited to each region is obtained automatically from its density, and DBSCAN clustering is performed. Region partitioning reduces unnecessary density-connectivity queries in the DBSCAN algorithm [20] and improves efficiency. Because the Eps parameter of DBSCAN is obtained automatically from the density of each region, the method adapts better to the data, especially multi-density data, and achieves a better effect.
The algorithm flow designed in this paper is shown in Figure 1. The main steps, sketched in code after this list, are as follows.

Input: data set D, Eps, MinPts, ε.

Step 1. According to equations (7)-(9), map the data set D into the divided grid and count the density of each grid.

Step 2. According to equation (10) and the grid density difference parameter ε, merge grids with similar density and adjacent positions. The data space is thus divided into regions, generating data blocks D_1, D_2, …, D_K and preliminary noise points.

Step 3. Using the parameters Eps and MinPts together with equation (11), perform DBSCAN clustering on the data regions of different densities.

Step 4. Output the clustering results and obtain the areas that need de-duplication from them. The automatic de-duplication of the software quality inspection data is thereby completed.
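A sketch of the per-region clustering step (Step 3) follows. Since the exact form of equation (11) is not reproduced here, the scaling of Eps by relative region density is an assumption, as are the Region structure and the synthetic regions.

```python
import numpy as np
from collections import namedtuple
from sklearn.cluster import DBSCAN

# Region holds one data block D_i and its merged grid count G_i (names assumed).
Region = namedtuple("Region", ["data", "n_grids"])

def cluster_regions(regions, eps1, min_pts):
    """Run DBSCAN per density region; Eps for later regions is derived from
    relative density as a stand-in for equation (11)."""
    base_density = len(regions[0].data) / regions[0].n_grids
    results = []
    for r in regions:
        density = len(r.data) / r.n_grids
        eps_i = eps1 * np.sqrt(base_density / density)   # sparser region -> larger Eps (assumed)
        labels = DBSCAN(eps=eps_i, min_samples=min_pts).fit_predict(r.data)
        results.append(labels)
    return results

rng = np.random.default_rng(3)
regions = [Region(rng.normal(0, 0.2, (300, 2)), 4),    # dense region D_1
           Region(rng.normal(5, 1.0, (200, 2)), 16)]   # sparser region D_2
for labels in cluster_regions(regions, eps1=0.3, min_pts=5):
    print(np.unique(labels))
```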

III. RESULTS
To verify the effectiveness of the proposed automatic de-duplication method for software quality inspection data, Winzip, ABI-Coder, Readbook, FlashGet, StreamBox, Winamp, and VoptMe are selected as test software. The generated software quality inspection data set is used as the experimental test set, and the 485,664 records in it are automatically de-duplicated by the proposed method. To verify the effectiveness of the method intuitively, the K-means method and the feature iteration method are selected for comparison, since these two methods are typical and representative in the field of data de-duplication.
In this paper, the within-cluster sum-of-squares distance (WCSSD) of the three methods is calculated. WCSSD is an indicator of clustering quality that measures the distance of data points from the central point of their cluster: the smaller the WCSSD, the better the clustering result. The WCSSD comparison results of the three methods are shown in Table 1. The results in Table 1 show that, on the same data sets, the WCSSD values of the K-means method and the feature iteration method are relatively close, while the WCSSD values of the proposed method are significantly lower than those of the other two methods. This indicates that the proposed method has high computational performance and that the DBSCAN algorithm achieves a strong clustering effect: DBSCAN clusters quickly and handles noise points effectively, so the proposed method significantly improves the clustering effect.
The performance of the three methods is next compared in terms of execution time and number of loop computations. To analyze the clustering speed in detail, the time spent in the data preprocessing, feature extraction, and clustering stages is listed separately. For the same test set, the less execution time and the fewer loop computations an algorithm needs, the more efficient it is.
The comparison results of the calculation times of automatic de-duplication for each software quality inspection data set with three methods are shown in Figure 2.
As can be seen from the experimental results in Figure 2, the number of loop computations required to automatically de-duplicate each software quality inspection data set with the proposed method is significantly lower than with the other two methods: fewer than 100 iterations for the proposed method, versus more than 250 iterations for the K-means method and the feature iteration method. These results show that the proposed method computes quickly and effectively reduces the number of loops of the automatic de-duplication algorithm, because an unsupervised local linear embedding algorithm is used when the features of the software quality inspection data are extracted. Dimensionality reduction that preserves the reconstruction weights reduces the number of loop computations needed for data de-duplication.
Three methods are used to compare the execution time of automatic de-duplication of each software quality inspection data set, and the results are shown in Figure 3.
As can be seen from the experimental results in Figure 3, the execution time of automatic de-duplication with the proposed method is significantly lower than with the other two methods: less than 150 ms for every data set with the proposed method, versus more than 400 ms with the K-means method and the feature iteration method. These results show that the proposed method completes automatic de-duplication of software quality inspection data in less execution time and improves de-duplication efficiency. This is because the proposed method uses the local linear embedding algorithm to extract features from the data; as an unsupervised algorithm, it eliminates the time needed for manual labeling and further improves de-duplication efficiency. After the data are de-duplicated, software quality inspection can be carried out with noticeably better efficiency.
The three methods are used to de-duplicate the seven kinds of software quality inspection data, and the numbers of samples remaining in each data set after de-duplication are counted; the results are shown in Table 2. The results in Table 2 show that all three methods can automatically de-duplicate the software quality inspection data sets with high effectiveness. In the proposed method, the DBSCAN algorithm is applied on the basis of the extracted data features; because the algorithm finds both arbitrary-shaped classes and noise points and outliers in the data set, efficient automatic data de-duplication is achieved.
In order to evaluate the accuracy and effectiveness of the proposed method, precision, recall, and F-measure are used for further evaluation: the higher the F-measure, the higher the accuracy of automatic de-duplication. A class identified in the clustering results is called a result class for short, and a class in the original data set is called an original class for short. F-measure combines the precision and recall of automatic data de-duplication. The precision and recall of result class $j$ with respect to original class $i$ are:

$$precision(i, j) = \frac{N_{ij}}{N_j} \quad (12)$$

$$recall(i, j) = \frac{N_{ij}}{N_i} \quad (13)$$

In the above equations, $N_{ij}$ is the number of objects of original class $i$ in result class $j$; $N_j$ is the number of all objects in result class $j$; $N_i$ is the number of all objects in original class $i$. The F-measure of original class $i$ is defined as:

$$F(i) = \max_j \frac{2 \cdot precision(i, j) \cdot recall(i, j)}{precision(i, j) + recall(i, j)} \quad (14)$$

For original class $i$, the higher the F-measure of the clustering algorithm, the better the automatic de-duplication effect and the more faithfully the result reflects the mapping of original class $i$; in other words, $F(i)$ can be used as the evaluation score of original class $i$. For the overall de-duplication result, the total F-measure of the algorithm is obtained as the weighted average of the F-measures of the original classes:

$$F = \sum_i \frac{N_i}{N} F(i) \quad (15)$$

In equation (15), $N_i$ is the number of all objects in original class $i$ and $N$ is the total number of objects.
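The following sketch computes the total F-measure from ground-truth and clustering labels, per equations (12)-(15) as reconstructed above; the best-matching result class for each original class is selected via the max in equation (14), and the sample labels are illustrative.

```python
import numpy as np

def total_f_measure(true_labels, pred_labels):
    """Weighted F-measure over original classes i, equations (12)-(15) style."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    N, total = len(true_labels), 0.0
    for i in np.unique(true_labels):
        n_i = np.sum(true_labels == i)                  # size of original class i
        best = 0.0
        for j in np.unique(pred_labels):
            n_j = np.sum(pred_labels == j)              # size of result class j
            n_ij = np.sum((true_labels == i) & (pred_labels == j))
            if n_ij == 0:
                continue
            p, r = n_ij / n_j, n_ij / n_i               # equations (12) and (13)
            best = max(best, 2 * p * r / (p + r))       # equation (14)
        total += n_i / N * best                         # equation (15): weighted average
    return total

print(total_f_measure([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2]))   # ~0.83
```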
The precision comparison results for automatic de-duplication of the software quality inspection data sets are shown in Figure 4. As can be seen from the experimental results in Figure 4, the precision of automatic de-duplication with the proposed method is significantly higher than with the other two methods: higher than 99.4% for the proposed method, versus lower than 98.1% for the K-means method and the feature iteration method. These results show that the proposed method has high de-duplication performance and high applicability. On the basis of the extracted data features, the method improves precision through grid division, data bin division, and the other data processing steps.
The recall comparison results of automatic de-duplication of each software quality inspection data set are shown in Figure 5.
As can be seen from the experimental results in Figure 5, the recall of automatic de-duplication with the proposed method is significantly higher than with the other two methods: higher than 99% for the proposed method, versus lower than 98% for the K-means method and the feature iteration method. The results show that the proposed method not only de-duplicates software quality inspection data automatically but also does so with high accuracy. The F-measure comparison results of the three methods are shown in Figure 6. As can be seen from the experimental results in Figure 6, the F-measure values of the proposed method are significantly higher than those of the other two methods: higher than 0.96 for the proposed method, versus lower than 0.92 for the K-means method and the feature iteration method. The higher F-measure values show that the clustering algorithm used in this paper achieves a better automatic de-duplication effect.
The resource consumption of automatic de-duplication of the software test data by the three methods is measured, and the comparison results are shown in Figure 7. As can be seen from the experimental results in Figure 7, when the proposed method is used to automatically de-duplicate the software test data, the CPU usage and memory usage are the lowest of the three methods, both below 15%. The results show that the resource consumption of de-duplication with the proposed method is significantly lower than that of the other two methods: the method can process massive data at a low resource consumption rate, which verifies that it has high processing efficiency and can effectively improve the efficiency of software quality inspection.
The experimental results in Figures 5-7 arise because the proposed method takes the characteristics of the software quality inspection data into account. On this basis, grid division and data bin division are carried out, the grids are merged into regions, and the regions containing duplicate data are delineated. Combined with DBSCAN's ability to cluster data of arbitrary shape, the method overcomes the limitation of the K-means method and the feature iteration method to clustering data of specified form, and realizes region-by-region de-duplication, thereby improving the operating efficiency of the algorithm.

IV. CONCLUSION
At present, society has entered the information age of big data, and mining valuable information from complex data has great practical significance: effective data mining technology can reveal the hidden information and rules in complex data sets. The DBSCAN algorithm, a classical density-based clustering method in the data mining field, can identify arbitrary-shaped classes and outliers in a data set. Software quality is one of the key factors restricting the further development of computer applications, and software testing is an important means of ensuring software quality and improving software reliability. Software test data sets contain a large amount of highly repetitive data, so automatic de-duplication of software test data is very important. In this paper, the DBSCAN clustering algorithm is applied to the automatic de-duplication of software quality inspection data to improve de-duplication effectiveness. The experimental results show that the method achieves highly effective and accurate automatic de-duplication: its loop computation counts are all below 100, its execution times are all below 150 ms, its precision and recall are both above 99%, and its resource consumption rate is low. Because the local linear embedding algorithm extracts features from the software quality inspection data without supervision, the efficiency of data de-duplication is improved. The proposed method applies the DBSCAN multi-density clustering algorithm on the basis of the extracted features and improves the accuracy of the inspection data set through grid division, data bin division, and other processing steps, so it can be applied in practical software quality inspection data de-duplication. However, as the dimensionality of software defect data increases, handling data imbalance becomes more difficult, so the next research focus is to extend this method to high-dimensional software defect data.