An Improved Numerical DBSCAN Algorithm Based on Non-IIDness Learning

In clustering algorithm research, the objects, attributes and other aspects of a data set are usually assumed to be independent and identically distributed (IID); that is, each object is treated as an independent individual drawn from the same distribution, with no interaction between objects. However, objects in real life are often neither independent nor identically distributed; that is, they are non-IID, which leads to complex coupling relationships and interactions between objects. The results of a clustering algorithm under the IID assumption may therefore be incomplete or even misleading. To make the results of the DBSCAN algorithm as accurate as possible, this paper proposes an improved numerical DBSCAN algorithm based on non-IIDness learning. The algorithm calculates the coupling relationships between objects to capture their potential interactions and determines the parameters Eps and MinPts from the distribution characteristics of the data. Experiments on large-scale real and synthetic data sets show that the algorithm achieves higher accuracy than the original DBSCAN algorithm and its main improved variants.


I. INTRODUCTION
Clustering refers to the grouping of abstract or physical objects, without labelled samples, in accordance with the principle that objects within a group should be as similar to each other as possible and that objects in different groups should be as different as possible; the ultimate purpose of clustering is to discover the natural structure of data [1]. Clustering is an unsupervised machine learning method and is therefore well suited to data sets with unknown distributions. It has become an important component of machine learning, pattern recognition, data mining and other fields.
Over many years of research and development, clustering algorithms have advanced greatly. According to their clustering principles and ideas, they can be divided into several types: model-based, partition-based, hierarchical, density-based, grid-based, and so on. Clustering techniques are diverse, and different data show different characteristics; therefore, different types of data place different requirements on clustering techniques. For numerical data, model-based clustering usually uses a Gaussian mixture model, as represented by the EM algorithm [2]. Partition-based clustering algorithms divide the data set into several groups according to a certain target, maximizing or minimizing the value of an objective function; for numerical data, partition-based clustering is represented by k-means [3], [4], k-medoids [5], and k-medians [6]. Hierarchical clustering algorithms aim to form a tree-shaped clustering structure, often using a ''bottom-up'' agglomerative strategy or a ''top-down'' divisive strategy to divide data sets; for numerical data, representative hierarchical algorithms include HC [7], CURE [8], and BIRCH [9]. Density-based clustering algorithms interpret the density of a data point as the number of data points contained in a designated neighbourhood; such algorithms consider the connectivity between samples according to sample density and continuously expand clusters based on connected samples to obtain the clustering result. Density clustering thus determines the clustering structure from the density of the sample distribution. Classic density-based clustering methods for numerical data are DBSCAN [10], OPTICS [11], and DENCLUE [12].
(The associate editor coordinating the review of this manuscript and approving it for publication was Senthil Kumar.)
Grid-based clustering can be regarded as the implementation of a density-based clustering algorithm, which takes the number of data points in each grid as the density and splices adjacent dense grids to form clusters. Grid-based representative clustering algorithms for numerical data include GRIDCLUS [13], BANG [14], and STING [15]. DBSCAN (density-based spatial clustering of applications with noise) is a classic density-based clustering algorithm. Its advantage is that it can not only find clusters of arbitrary shapes and sizes in data but also find noise points and outliers in data. To date, DBSCAN algorithms have been applied to consumer market analysis [16], vehicle identification [17], time series processing [18], blending images [19], detection of abnormal population behaviour [20], abnormal flight detection [21] and so on.
Although the DBSCAN algorithm has great advantages and extensive applications, its accuracy is not high enough for certain applications, which limits its use to a certain extent. Therefore, it is necessary to further improve the accuracy of the algorithm. To improve the accuracy of DBSCAN clustering results, scholars worldwide have improved and innovated the algorithm in different directions. The Greedy-DBSCAN algorithm [22] adaptively determines a locally optimal Eps and detects noise based on the Eps value; however, it has difficulty determining the threshold parameters required for noise determination. The AD-DBSCAN algorithm [23] adaptively determines the parameters of the algorithm, reducing the error caused by manual parameter settings and thus improving the accuracy of the clustering results; however, it cannot automatically identify the number of clusters, which must therefore be specified in advance. A new method [24] for calculating the Eps and MinPts parameters of the DBSCAN algorithm improved the parameter selection accuracy; however, if the knee of the sorted-distance curve is not accurately located, incorrect clustering results may follow. The FDBSCAN algorithm [25] improved the calculation speed by reducing the number of distance measurements when searching for core objects while ensuring the clustering accuracy; however, when the data set is small, its clustering accuracy is low. The 3W-DBSCAN algorithm [26] represents a cluster through a pair of nested sets called the lower bound and end segment, which makes the clustering results explainable and in line with human cognitive thinking; however, for some data sets, the clustering results of 3W-DBSCAN overlap. The AA-DBSCAN algorithm [27] uses a new tree structure based on a quadtree to define the density layers of the data set, which allows it to find clusters of different densities more accurately.
However, the use of an inflexible εDL can lead to incorrect clustering results. The KLS-DBSCAN algorithm [28] establishes a mathematical model and automatically selects the globally optimal Eps and MinPts parameters through mathematical statistics. Although KLS-DBSCAN realizes adaptive parameter selection, its effect is only moderate on data whose clusters differ greatly in density. Aiming at the problem that the traditional distance formula cannot accurately calculate the similarity of categorical data in a non-IID environment, the non-IIDDBSCAN algorithm [29] replaces the traditional distance formula with a coupling similarity. The accuracy of the algorithm is improved, but non-IIDDBSCAN is applicable only to categorical data.
At present, apart from the non-IIDDBSCAN algorithm, other accuracy improvements to the DBSCAN algorithm are based on the assumption of independent and identical distributions. This assumption ignores the internal relations between data points. In reality, data are often neither independently nor identically distributed [30]: the objects in a data set interact with each other to a greater or lesser extent, i.e., they are non-IID, and different types of data exhibit different interaction characteristics. Therefore, for numerical data, we propose the NDBSCAN algorithm, which improves the DBSCAN algorithm under the non-IID setting. The algorithm focuses on the non-IID characteristics of the objects in a data set and uses the principle of coupling similarity to quantify the relationships between them. The neighbourhood parameter Eps and the density threshold MinPts are then selected according to the distribution of the data. Experimental results show that the clustering results of this algorithm are more accurate than those of the comparison algorithms.

A. DBSCAN CLUSTERING THEORY
DBSCAN is a classic density-based clustering algorithm that obtains the final clustering result by finding the maximal sets of density-connected data points.
For the data set D = {x_1, x_2, ..., x_n}, the definitions are as follows:
Definition 1 (Neighbourhood Eps) [10]: The Eps neighbourhood of x_i is the hypersphere region with centre x_i and radius Eps: N_Eps(x_i) = {x_j ∈ D | dist(x_i, x_j) ≤ Eps}, where dist(x_i, x_j) is a distance measure, for example, the Euclidean distance between x_i and x_j.
Definition 2 (Density Threshold MinPts) [10]: For x_i ∈ D, MinPts is the minimum number of objects in its Eps neighbourhood required for x_i to become a core point.
Definition 3 (Core and Boundary Points) [10]: x_i is a core point if its Eps neighbourhood contains at least MinPts objects, i.e., |N_Eps(x_i)| ≥ MinPts. If a point is not a core point but lies in the Eps neighbourhood of a core point, it is called a boundary point.
Definition 4 (Directly Density-Reachable) [10]: If x_j is a core object and x_i is in the Eps neighbourhood of x_j, then x_i is said to be directly density-reachable from x_j with respect to Eps and MinPts. Direct density-reachability is not symmetric.
Definition 5 (Density-Reachable) [10]: If there exists a chain P_1, P_2, ..., P_n with P_1 = x_i and P_n = x_j such that P_{i+1} is directly density-reachable from P_i, then x_j is density-reachable from x_i.
Definition 6 (Density-Connected) [10]: For x_i and x_j, if there exists an x_k such that both x_i and x_j are density-reachable from x_k, then x_i and x_j are density-connected; density-connectedness is symmetric.
Definition 7 (Cluster and Noise Points) [10]: A cluster is a maximal set of objects that are density-reachable from a core point, and points that do not belong to any cluster are marked as noise points.
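As a concrete illustration of Definitions 1-7, the core DBSCAN procedure can be sketched as follows. This is a minimal Python sketch for illustration only, not the authors' MATLAB implementation; all function and variable names are ours.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id, or -1 for noise."""
    n = len(X)
    labels = np.full(n, -1)  # -1 marks noise points (Definition 7)
    # Pairwise Euclidean distances; N_Eps(x_i) = {x_j : dist(x_i, x_j) <= eps}
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    # Core points have at least MinPts objects in their neighbourhood (Def. 3)
    core = np.array([len(nb) >= min_pts for nb in neighbours])
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        # Grow a cluster from core point i via density-reachability (Def. 5)
        labels[i] = cluster
        seeds = list(neighbours[i])
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster
                if core[j]:  # only core points expand the cluster further
                    seeds.extend(neighbours[j])
        cluster += 1
    return labels
```

Each resulting cluster is a maximal density-connected set; points reachable from no core point keep the label -1 and are reported as noise.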

B. THE THEORY OF NUMERICAL DATA COUPLING ATTRIBUTE ANALYSIS
Cao put forward the idea of non-IIDness in 2014, which holds that data do not exist independently but have more or less explicit or implicit relationships with other data. On the basis of this idea, Wang proposed a coupled representation scheme [31] for numerical objects that integrates the internal couplings of continuous attributes and the coupling interactions [32] between attributes. For a numerical data set D, the coupling attribute analysis framework is shown in Fig. 1.

C. REASONS TO IMPROVE THE COUPLING PROCESS IN DBSCAN
The DBSCAN clustering algorithm is a process that derives the maximal density-connected sample sets from the density-reachable relation. A process diagram of clustering a data set using the DBSCAN algorithm is shown in Fig. 2. In the figure, Eps = d and MinPts = 4; red dots are core objects, because their Eps neighbourhoods contain at least four samples, and black dots are non-core objects. The samples that are directly density-reachable from a red core object lie inside the hypersphere centred at that object; samples outside the hypersphere cannot be directly density-reachable from it. The core objects connected by green arrows form density-reachable sample sequences, and each maximal set of density-connected samples derived from the density-reachable relation is one cluster of the final clustering. Furthermore, all samples in the Eps neighbourhoods of these density-reachable sample sequences are connected to each other.
In Fig. 2, each point is an object. The NDBSCAN algorithm proposed in this paper first calculates the relationships between the points to obtain a coupled representation of the original data. This step increases the computational load of the DBSCAN algorithm. Therefore, the coupling calculation over the data set is streamlined to reduce this load.

D. THE OPTIMIZED COUPLING ATTRIBUTE ANALYSIS OF NUMERICAL DATA
For the data set D, the improved coupling calculation process is as follows. Write D as an n × m matrix whose entry a_nm is the mth attribute value of the nth object. Each attribute value in D is extended to its second power, generating the extended data set D1 = [D, D^2], where a^2_nm is the square of the mth attribute value of the nth object. To simplify the following formulas, a_nm is written as a_m, and a^l_nm is written as a^l_m. Pearson correlation coefficients [33] are used to quantify the interaction between attribute values. The Pearson coefficient between attributes a_i and a_j is formalized as shown in Formula 1:
Cor(a_i, a_j) = Σ_u (a_i^u − µ_i)(a_j^u − µ_j) / ( √(Σ_u (a_i^u − µ_i)^2) · √(Σ_u (a_j^u − µ_j)^2) ) (1)
where a_i^u and a_j^u are the attribute values of object u on attributes a_i and a_j, and µ_i and µ_j are the average values of a_i and a_j, respectively. When quantifying the interaction between attributes, this paper considers only the important attribute couplings rather than simply involving all attributes, to avoid overfitting the coupling relationships. Therefore, 0.05 is used as the cut-off for the p-value, and the modified formula is shown in Formula 2.
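The power extension of the data set and the significance-thresholded Pearson coefficient can be sketched as follows. This is a hedged illustration: the exact form of Formula 2 is not reproduced in the text, so zeroing the coefficient when the p-value exceeds 0.05 is our reading of it, and the function names are ours.

```python
import numpy as np
from scipy.stats import pearsonr

def extend_with_squares(D):
    """Build the extended data set D1 by appending the element-wise square of
    every attribute column, so each attribute a_m gains a companion a_m^2."""
    D = np.asarray(D, dtype=float)
    return np.hstack([D, D ** 2])

def cor(x, y, alpha=0.05):
    """Pearson coefficient between two attribute columns, zeroed out when its
    p-value exceeds alpha, so only significant couplings are retained."""
    r, p = pearsonr(x, y)
    return r if p <= alpha else 0.0
```

With this `cor` in place, every pairwise coupling computed later automatically ignores statistically insignificant attribute interactions.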
According to the revised Pearson correlation coefficient, the internal coupling and the inter-attribute coupling of the attributes are calculated separately.
The internal coupling is quantified as the correlation between attribute a_i and its quadratic power attribute value, expressed as R_Ia(a_i).
where φ_12(i) = φ_21(i) and φ_jk(i) = Cor(a_i^j, a_i^k), with a_i^j and a_i^k representing the jth and kth powers of attribute a_i, respectively. The coupling between attributes is quantified as the correlation between attributes a_i and a_j, expressed as R_Ie(a_i | {a_j}_{j≠i}).
In Formula 4, ε_pq(i|j_k) = Cor(a_i^p, a_{j_k}^q). The coupled representation of numerical objects is obtained by integrating the internal coupling and the inter-attribute coupling of the continuous attributes, as shown in Formula 5.

III. PROPOSED ALGORITHM
To cluster numerical data effectively, we propose an improved DBSCAN algorithm based on the idea of a non-independent and identical distribution, namely, the NDBSCAN algorithm. The algorithm first quantifies the relationship between data by finding the coupling relationship between data points and then selects the EPS and MinPts parameters according to the distribution of data. The flowchart of the NDBSCAN algorithm is shown in Fig. 3.

A. COUPLING ATTRIBUTE ANALYSIS
The improved coupling analysis process in Section II is used to analyse and calculate the numerical data set to obtain the coupling data set that can reflect the relationship between the data.

B. SELECTING THE EPS PARAMETER
First, the distance between objects in the coupled data set needs to be calculated. The distance formula is shown in Formula 6, and the distance distribution matrix DIST n×n is obtained as shown in Formula 7.
where X and Y are objects in the coupled data set, and DIST_n×n is a real symmetric matrix with zeros along the diagonal. The elements in each column of DIST_n×n are sorted in increasing order, so that the element in row m is the distance from each object to its (m−1)th nearest neighbour. After sorting, the k-distance graph is plotted, as shown in Fig. 4. In Fig. 4, lines of different colours represent different objects; each line shows the Euclidean distances from that object to the other objects, arranged in increasing order. In the region where the KNN lines are most densely distributed, that is, at the red box in Fig. 4, the ordinate of the point with the largest slope change is taken as the Eps value.
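The k-distance-based selection of Eps can also be approximated programmatically. The paper locates the knee visually in Fig. 4; locating it by the largest second difference of the sorted curve is our own simplification of that visual step, and the function name is ours.

```python
import numpy as np

def estimate_eps(X, k=4):
    """Read Eps off the sorted k-distance curve: return the ordinate at the
    point where the slope changes most sharply (the 'knee'), approximated by
    the largest second difference of the curve."""
    # Pairwise Euclidean distances between all objects
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Distance from each object to its k-th nearest neighbour, sorted ascending
    kdist = np.sort(np.sort(dist, axis=1)[:, k])
    # Index of the sharpest change in slope along the sorted curve
    knee = np.argmax(np.diff(kdist, 2)) + 1
    return kdist[knee]
```

On well-separated data, the curve stays flat over the dense clusters and rises sharply at sparse points, so the detected knee separates typical within-cluster distances from outlier distances.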

C. SELECTING THE MINPTS PARAMETER
The density threshold is set according to the size of the data set: MinPts = ⌈n/25⌉, where n is the number of samples in the data set. If n/25 is an integer, this value is taken directly; if there is a fractional part, the result is rounded up, that is, the integer part is increased by 1 and the fractional part is discarded. The proposed NDBSCAN clustering algorithm is described in Algorithm 1.
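The MinPts rule described above can be written directly as a ceiling division:

```python
import math

def select_minpts(n):
    """MinPts = ceil(n / 25): n/25 itself when it divides evenly,
    otherwise the integer part plus one."""
    return math.ceil(n / 25)
```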

IV. EXPERIMENTAL RESULTS AND ANALYSIS
The NDBSCAN algorithm is implemented in MATLAB and runs on the Windows 10 system. The machine used has an AMD Ryzen 5 4600U processor (2.10 GHz, with integrated Radeon graphics) and 16 GB of memory.

A. DATA SETS
We used four synthetic data sets downloaded from the website ''Clusters Datasets'' [34]: Flame, R15, Aggregation and S1. The other synthetic data sets were downloaded from GitHub [35]. Eleven real data sets were downloaded from the UCI machine learning repository [36] to determine the ability of the algorithm to process data sets of different densities. Table 1 shows the attributes of the data sets, including their different data sizes and dimensions.

B. EVALUATION INDICATORS
To reflect the quality of the final clustering results, this paper uses the unsupervised clustering accuracy [37], F-measure [38], ARI [39] and NMI [40] as the evaluation indexes of the algorithms. The following is a brief introduction to these indexes.

1) ACCURACY
Accuracy compares whether the obtained labels are consistent with the real labels of the data set, with a value range of [0, 1]; the closer the value is to 1, the more consistent the result is with the real clustering. It is defined as
ACC = Σ_{i=1}^{N} δ(s_i, map(r_i)) / N
where s_i is the true label, r_i is the label after clustering, map represents the optimal reassignment of class indexes, N is the total number of samples, and δ is the indicator function:
δ(x, y) = 1 if x = y, and 0 otherwise.
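A common way to realize the optimal label mapping `map` in the accuracy definition is the Hungarian algorithm on the contingency table of the two labelings. This sketch is ours, not the paper's code; the function name is an assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Unsupervised clustering accuracy: find the one-to-one relabelling of
    clusters that maximises agreement with the true labels, then report the
    fraction of matching points."""
    _, t = np.unique(true_labels, return_inverse=True)
    _, p = np.unique(pred_labels, return_inverse=True)
    w = np.zeros((p.max() + 1, t.max() + 1), dtype=int)
    for pi, ti in zip(p, t):
        w[pi, ti] += 1                      # contingency table
    rows, cols = linear_sum_assignment(-w)  # maximise matched samples
    return w[rows, cols].sum() / len(t)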

2) F-MEASURE
As a comprehensive measure of precision and recall, the F-measure can effectively reflect the performance of a clustering algorithm; the larger the value is, the better the clustering effect. It is defined as
F = 2 × Precision × Recall / (Precision + Recall)
where Recall is the proportion of the objects of a predefined cluster that the clustering algorithm correctly places in the corresponding found cluster, and Precision is the proportion of the objects in a found cluster that belong to the correct predefined cluster. They are calculated as
Recall = n_jk / n_j (12)
Precision = n_jk / n_k (13)
where n_jk is the number of samples correctly placed in cluster j that belong to true cluster k, n_j is the size of cluster j generated by the clustering algorithm, and n_k is the size of the actual correct cluster k of the data set.
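Given the counts from Formulas 12 and 13, the F-measure for one (found cluster j, true cluster k) pair follows directly; the function name is ours.

```python
def f_measure(n_jk, n_j, n_k):
    """F-measure from the paper's counts: Recall = n_jk/n_j and
    Precision = n_jk/n_k (Formulas 12 and 13), combined harmonically."""
    recall = n_jk / n_j
    precision = n_jk / n_k
    return 2 * precision * recall / (precision + recall)
```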

3) ARI
The ARI reflects the degree of overlap between two partitions, and the use of this metric requires that the data be labelled with categories. Among its quantities, a represents the number of sample pairs placed in the same group by both the real labels and the clustering result, and b represents the number of sample pairs placed in different groups by both. The Rand index is RI = (a + b) / C(n, 2), and the ARI corrects it for chance agreement: ARI = (RI − E[RI]) / (max(RI) − E[RI]).
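A pair-counting implementation of the ARI, computed from the contingency table of the two labelings (the standard Hubert-Arabie formulation), might look like the following sketch; the function name is ours.

```python
from math import comb
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI: the Rand index corrected for chance agreement."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    n = len(a)
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    for i, j in zip(a, b):
        table[i, j] += 1                 # contingency table of the labelings
    sum_ij = sum(comb(int(v), 2) for v in table.ravel())
    sum_i = sum(comb(int(v), 2) for v in table.sum(axis=1))
    sum_j = sum(comb(int(v), 2) for v in table.sum(axis=0))
    expected = sum_i * sum_j / comb(n, 2)   # chance-level agreement
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)
```

Identical partitions score 1.0 regardless of how the cluster labels are numbered, and random partitions score close to 0.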

4) NMI
NMI relates the real class labels A of a data set to its clustering result B. Let X and Y be the vectors of unique values of A and B, respectively. The NMI of A and B can then be defined as
NMI(A, B) = I(A; B) / √(H(A) · H(B)),
where I(A; B) = Σ_x Σ_y p(x, y) log( p(x, y) / (p(x) p(y)) ), A is the true labelling, B is the clustering result, p(x) represents the probability of x in A, p(y) represents the probability of y in B, and p(x, y) represents the joint probability of x and y. The NMI value ranges from 0 to 1; the closer it is to 1, the higher the clustering quality.
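The NMI as defined above can be sketched from the joint label distribution; the square-root normalisation used here is one common choice, and the function name is ours.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """NMI = I(A;B) / sqrt(H(A) * H(B)), computed from the joint distribution
    of the two labelings (assumes each labelling has at least two classes,
    so neither entropy is zero)."""
    _, a = np.unique(labels_a, return_inverse=True)
    _, b = np.unique(labels_b, return_inverse=True)
    n = len(a)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1
    joint /= n                               # joint probabilities p(x, y)
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)  # marginals p(x), p(y)
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    ha = -np.sum(pa * np.log(pa))            # entropy H(A)
    hb = -np.sum(pb * np.log(pb))            # entropy H(B)
    return mi / np.sqrt(ha * hb)
```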

C. EXPERIMENTAL RESULTS AND ANALYSIS
This paper discusses the clustering performance of each algorithm within a reasonable parameter search range. DBSCAN, FDBSCAN and NDBSCAN use the global parameters Eps and MinPts. The selection of these parameters depends on the data set, and the clustering results are sensitive to both. In this study, the parameters that yield a high F-measure value for each data set are selected through multiple iterations. 3W-DBSCAN represents a cluster through a pair of nested sets, called the lower bound and end segment. KLS-DBSCAN establishes a mathematical model through mathematical statistics and automatically selects the globally optimal Eps and MinPts parameters. Since the parameter settings of each algorithm differ for each data set, this paper presents the parameters and search ranges of each algorithm except the 3W-DBSCAN algorithm, as shown in Table 2.

V. COMPLEXITY ANALYSIS
In this paper, the time complexity of the algorithm is used to describe its computational load, where m represents the number of attributes in the numerical data set and n represents the number of samples it contains. Under non-IIDness, the internal coupling with quadratic powers has a time complexity of O(4mn), the inter-attribute coupling has a time complexity of O(2mn), and generating the coupled representation has a time complexity of O(2mn). Generating the clustering results has a time complexity of O(n^2), so the overall time complexity of this algorithm is O(n^2). The time complexity of the DBSCAN algorithm is also O(n^2). Although FDBSCAN reduces the computation time, its time complexity is still O(n^2). On the test data sets used in this paper, the running time of AA-DBSCAN is approximately equal to that of the DBSCAN algorithm. The time complexity of 3W-DBSCAN is dominated by computing the distance matrix, which is also O(n^2). Each part of the KLS-DBSCAN algorithm has a time complexity of O(n^2), so its total time complexity is O(n^2). Therefore, compared with the comparison algorithms used in this paper, NDBSCAN increases the amount of computation but does not increase the asymptotic time complexity. Table 3 shows comparisons of the F-measure, accuracy, ARI and NMI on the 10 synthetic data sets under the proposed NDBSCAN algorithm and the comparison algorithms. Fig. 5 shows the plane distributions and category labels of the 10 synthetic data sets used in this paper, where different colours represent different categories. Because this algorithm first processes the data, turning low-dimensional data into high-dimensional data, a clear cluster scatter diagram cannot be drawn for the multidimensional data. Therefore, the clustering effects of this algorithm and the comparison algorithms are compared intuitively in Fig. 6. Table 3 and Fig.
6 show that the values of the four evaluation indexes of the proposed algorithm are higher than those of the comparison algorithms on Smile, 2circles, Size5, and S1. The values of the F-measure, accuracy and ARI obtained by the NDBSCAN algorithm on the Moon and Flames data sets are all higher than those obtained by the comparison algorithms. The values of the F-measure, accuracy and NMI obtained by the NDBSCAN algorithm on R15 are higher than those obtained by the comparison algorithms. The values of the accuracy, ARI and NMI obtained by the NDBSCAN algorithm on Aggregation and Square are higher than those obtained by the comparison algorithms, indicating that the proposed algorithm can achieve good clustering effects on different types of synthetic data sets. Table 4 shows comparisons of the four evaluation indexes of the clustering results generated by the NDBSCAN algorithm proposed in this paper and the comparison algorithms for the 11 real data sets. Fig. 7 shows the data comparison in Table 4 more intuitively. In a few cases, the evaluation values of our algorithm are slightly lower than those of the comparison algorithms. The F-measure values of the proposed algorithm on the Iris and Seeds data sets are slightly lower than those of AA-DBSCAN. The NMI value of the proposed algorithm on the Ecoli data set is slightly lower than that of 3W-DBSCAN, and the accuracy value on Banknote is slightly lower. The values of the four evaluation indexes on Sobar, Wine, Blood, Ecoli, Banknote, Magic, Phoneme and Robotnavigation of the proposed algorithm are higher than those of the comparison algorithms.
The experimental results on synthetic data sets and real data sets show that the proposed algorithm not only is suitable for small-scale low-dimensional data sets but also can achieve a better clustering effect than the comparison algorithms on large-scale low-dimensional, small-scale high-dimensional, and large-scale high-dimensional data sets. This is because the algorithm in this paper considers the internal relationship between objects and attributes of the data set and discovers the real relationship between the data by exploring the coupling relationship between the data to obtain more complete and accurate clustering results. However, this paper only focuses on numerical data, and further research is needed for categorical and mixed data.

VI. CONCLUSION
In view of the fact that there is always an interactive relationship between data and the fact that both the original and the improved DBSCAN algorithms ignore this relationship, this paper proposes a numerical DBSCAN algorithm based on the idea of a non-independent identical distribution, namely, the NDBSCAN algorithm. This algorithm calculates the relationship between data according to an improved coupling formula for calculating the relationship between numerical data, generates the coupling representation data set that can reflect the relationship between the data in the original data set, and then selects the appropriate neighbourhood parameter EPS and density threshold MinPts according to the distribution of the data in the coupling data set. Experimental results show that the proposed NDBSCAN algorithm is effective and superior to the comparison algorithms overall.
YANMEI WANG received the bachelor's degree from the Qilu University of Technology (Shandong Academy of Sciences), in 2019, where she is currently pursuing the master's degree. Her research interests include data mining and big data processing.
HE JIANG received the B.Sc. degree from Liaocheng University, in 1985, and the M.E. degree in computer applications from Beijing Jiaotong University, in 1997. He was an Associate Professor with the Qilu University of Technology, in 2001, and a Professor with the Qilu University of Technology, in 2005. He served as the Director of the School of Computer Science and Technology, Qilu University of Technology, for a long time. He was also the Director of the Computer Teaching Professional Research Institute, Shandong Higher Education Association. At Qilu University of Technology, he was the first Famous Teacher and a Teacher Morality Model. His research interests include data mining technology, big data analysis, and other aspects of research.
HUIXIN ZHOU received the bachelor's degree from the Qilu University of Technology (Shandong Academy of Sciences), in 2019, where she is currently pursuing the master's degree. Her research interests include data mining and big data processing.
CHENGQIANG WANG received the bachelor's degree from the Qilu University of Technology (Shandong Academy of Sciences), in 2019. His research interests include data mining and data security.