A Novel Hybrid PSO-K-Means Clustering Algorithm Using Gaussian Estimation of Distribution Method and Lévy Flight

Clustering is an important data analysis technique, which has been applied to many practical scenarios. However, many partitioning based clustering algorithms are sensitive to the initial state of cluster centroids, may get trapped in a local optimum, and have poor robustness. In recent years, particle swarm optimization (PSO) has been regarded as an effective solution to the problem. However, it has the possibility of converging to a local optimum, especially when solving complex problems. In this paper, we propose a hybrid PSO-K-means algorithm, which uses the Gaussian estimation of distribution method (GEDM) to assist PSO in updating the population information and adopts Lévy flight to escape from the local optimum. The proposed algorithm is named a GEDM and Lévy flight based PSO-K-means (GLPSOK) clustering algorithm. Firstly, during initialization, a few particles are initialized using the cluster centroids generated by K-means, while other particles are randomly initialized in the search space. Secondly, GEDM and PSO are selected with different probability to update the population information at different optimization stages. Thirdly, Lévy flight is adopted to help the search escape from the local optimum. Finally, the greedy strategy is carried out to select the promising particles from the parents and the newly generated candidates. Experimental results on both synthetic data sets and real-world data sets show that the proposed algorithm can produce better clustering results and is more robust than existing classic or state-of-the-art clustering algorithms.


I. INTRODUCTION
In recent years, data mining is widely used to find useful patterns and knowledge which are hidden inside Large-scale data from different sources [1]. Clustering is one of the research hotspots in the field of data mining. It is an unsupervised learning method and aims to group the objects based on the The associate editor coordinating the review of this manuscript and approving it for publication was N. Ramesh Babu .
principle of maximizing the intra-cluster similarity and minimizing the inter-cluster similarity [2]. In the past few decades, various clustering algorithms [3]- [6] have been proposed. So far, clustering has been used in a number of applications, such as image segmentation [7], bioinformatics-gene expression data analysis [8], and object recognition [9]. Partitioning based clustering algorithms, such as K-means [10], Fuzzy C-means (FCM) [6], and K-Harmonic Means (KHM) [11] are widely used because of their simplicity and efficiency.
However, these algorithms also present some serious inconveniences, i.e., they are sensitive to the initial state of cluster centroids, may get trapped in a local optimum, and have poor robustness. The above inconveniences often lead to unsatisfactory clustering results. Since clustering tasks can be modeled as optimization problems, it is a natural choice to solve the clustering problems with optimization algorithms. So far, many intelligent optimization algorithms such as particle swarm optimization (PSO) [12], [13], ant colony optimization (ACO) [14], cuckoo search algorithm (CSA) [15], genetic algorithm (GA) [16], and differential evolution (DE) [17] have been regarded as effective solutions to clustering problems.
Among all the optimization algorithms, PSO has received much attention owing to its simplicity and competitiveness in finding a better solution [18]. In 2015, Liang et al. [19] proposed an adaptive clustering-based PSO, which considered the population topology and individual behavior control together to balance local and global search in an optimization process and proved the superiority of the algorithm. Furthermore, among the clustering approaches based on optimization algorithms, PSO has proven to be a strong competitor [20]. For example, the PSO-based clustering technique was firstly proposed in [21], where the initial swarm was fed by the clustering results of K-means. In 2011, Izakian and Abraham [22] proposed a clustering algorithm based on FCM and PSO, which applied FCM to the particles in each generation to improve the fitness of each particle. However, as with almost all optimization algorithms, the basic PSO algorithm also has the possibility of converging to a local optimum, especially when solving complex multimodal problems. In addition, most PSO-based clustering algorithms have the problem of low convergence efficiency.
K-means is one of the most classic clustering algorithms and is widely used for its simplicity and low computation cost. PSO is an effective global optimization algorithm and has a strong ability to search for solutions. To take full advantage of both algorithms and solve the above problems, we propose a hybrid PSO-K-means algorithm, which adopts the Gaussian estimation of distribution method (GEDM) and Lévy flight to improve the performance of the algorithm. The proposed algorithm is named a GEDM and Lévy flight based PSO-K-means (GLPSOK) clustering algorithm. Firstly, during initialization, a few particles are initialized using the cluster centroids generated by K-means to ensure that there are some relatively good particles in the initial swarm, while other particles are randomly initialized in the search space to ensure the diversity of the initial swarm. Secondly, we adopt GEDM to assist PSO in updating the population information. At the early stage of the optimization process, the GEDM is selected with a high probability to estimate a better evolution direction using promising particles. The purpose is to accelerate the convergence speed. At the later stage of the optimization process, PSO is used with a high probability to make full use of the great local search ability of PSO. The purpose is to improve the convergence accuracy. Thirdly, when the search is stagnant, the Lévy flight strategy is adopted to generate stagnation disturbance, which can increase the diversity of the population to help the search escape from the local optimum. Finally, the greedy strategy is carried out to select the promising particles from the parents and the newly generated candidates according to their fitness.
The rest of this paper is organized as follows. Section II reviews the related work of data clustering based on PSO. Section III includes the problem definition of clustering. In Section IV, the proposed clustering algorithm GLPSOK is described. Section V describes the experimental results and analysis. In Section VI, we conclude this work and indicate directions for future research.

II. RELATED WORK
In the past few decades, researchers made significant progress in PSO-based data clustering. Depending on the research method, the research carried out in this respect can be divided into two categories. The first category is the hybrid clustering approaches, which combine PSO with traditional clustering algorithms. The second category is the effective PSO variants for data clustering.
The goal of the hybrid clustering approaches is to take full advantage of PSO and traditional clustering algorithms to get better clustering results. As early as 2003, Merwe and Engelbrecht [21] proposed the first PSO-based clustering algorithm, denoted as HPSOK-means in our paper, where the initial swarm was fed with the clusters generated by K-means. In 2008, Ahmadyfard and Modares [23] proposed a clustering algorithm based on PSO and K-means. In this paper, PSO was adopted for global search at the initial stages of the optimization process and K-means was used to achieve faster convergence to an optimum solution when around global optimum. A hybrid algorithm based on PSO, ACO and kmeans for cluster analysis was proposed in [24] to solve nonlinear partitioning problems in data clustering. In order to process the large-scale gene expression data generated by microarray experiments, Deepthi and Thampi [25] proposed a clustering algorithm, which adopted PSO to search for the best subset and then used k-means as a wrapper algorithm to evaluate the obtained subsets. In 2019, Xu et al. [26] proposed an accelerated two-stage particle swarm optimization (ATPSO) for clustering not-well-separated data. In ATPSO, K-means is utilized to accelerate particles convergence during the population initialization. Recently, Liu et al. [27] proposed an effective algorithm based on K-means and randomly occurring distributed delayed PSO (RODDPSO) to group patients from emergency center. To avoid falling into a local optimum, this algorithm introduced randomly occurring time-delays to the velocity updating model using a distributed form. Yang et al. [28] proposed a hybrid clustering algorithm (PSOKHM) combining K-Harmonic Means (KHM) with PSO and the experimental results indicated the superiority of the PSOKHM algorithm. A hybrid clustering algorithm based on KHM, PSO, and GA was proposed in [29], where PSO was combined with another evolutionary algorithm such as GA to enhance the standard PSO. Bouyer and Hatamlou [1] proposed a new clustering algorithm which combined KHM with an Improved Cuckoo Search (ICS) and PSO where ICS was used to help PSO escape from the local optimum. An improved FCM clustering algorithm based on PSO was proposed in [30], in which, firstly, each object in the data set was distributed on the basis of distance to meet the constraints of FCM. And then, PSO was adopted to search for a global optimal solution. Zhao et al. [4] proposed an alternate PSO-based adaptive interval type-2 intuitionistic Fuzzy C-means clustering algorithm (A-PSO-IT2IFCM) and used it to solve a problem of image segmentation. With all, most hybrid clustering algorithms only combine PSO with traditional clustering algorithms in a simple way. Because PSO also has the possibility of converging to a local optimum, this kind of approaches cannot completely solve the problems that the traditional clustering algorithms possess, like being easily trapped in local optima and having low accuracy.
In recent years, researchers have proposed various PSO variants for data clustering. In 2008, Alam et al. [31] proposed a new algorithm called Evolutionary Particle Swarm Optimization (EPSO), which was based on the evolution of swarm generations for clustering. The swarm attempted to dynamically update itself to optimal positions during each iteration. In 2010, Szabo et al. [32] proposed a Modified Particle Swarm Clustering (mPSC) algorithm on the basis of Particle Swarm Clustering (PSC). This work modified the metaphor of natural social order to decrease the input parameters of the system and particles velocity's memory. The purpose of this paper is to reduce the computational complexity of PSC, but Szabo has acknowledged that this improvement is not significant. A version of PSC-based algorithm was proposed by Yuwono et al. [33], in which the rapid centroid estimation (RCE) was adopted to simplify the update rules of PSC for clustering. In this algorithm, which is denoted as PSC-RCE in this paper, each particle represents the centroid of a cluster, and all particles form a solution of the problem. By introducing New Substitution Strategy, Particle Reset, Swarm Strategy, and White Noise Update Scheme, this algorithm improves the efficiency of the clustering process and could better approximate the global optimal solution. In general, the PSO variants for data clustering can reduce the possibility of getting trapped in local optima. However, this kind of algorithms also has some problems. For example, in PSO, the update of the population mainly depends on the current global optimal solution and the individual historical optimal solution, which often leads to low convergence efficiency at the early stage of the search and a significant decrease in population diversity at the later stage of the search.
In summary, although clustering algorithms based on PSO, have emerged endlessly, none of them can simultaneously solve all the three problems of premature convergence, low convergence efficiency and low clustering accuracy in a satisfactory manner.

III. PROBLEM DEFINITION
Clustering is an unsupervised learning method that aims to organize each data point in the data set into the corresponding cluster. It is based on the principle of maximizing the intracluster similarity and minimizing the inter-cluster similarity. For a given data set where N is the number of data points and D is the dimension of the data to be clustered, K-means clustering partitions it into K clusters C j K j=1 by minimizing the sum of the intra-cluster variances defined as Eq.(1). K-means seeks better solutions by alternately optimizing cluster centroids and the assignment of data points to clusters.
where µ i is the centroid of the cluster C i , x are objects that belong to the cluster C i . In the context of clustering, the proposed GLPSOK algorithm is used to optimize the cluster centroids. A single particle represents K cluster centroid vectors. That is, each particle X t (t = 1, 2, . . . S, where S is the population size) is constructed as follows: where C t k refers to the k-th cluster centroid vector of the particle X t , M t k,d is the d-th dimensional position of the centroid C t k , K is the number of clusters, D is the dimension of the data to be clustered. For example, considering a search space with K = 3 and D = 3, the particle [ Fig.1 shows the encoding of a solution that is formed by K centroids, where D = 3. Therefore, a swarm represents a number of candidate clustering results for the current data vectors.
A variety of validity metrics have been proposed as the objective functions of clustering algorithms, such as sum of Euclid Distance (SED) [34], Mean Squared Error (MSE) [1], and J m -index [35]. In this paper, SED has been adopted as the fitness function of GLPSOK following the suggestion of Kaufman [34]. For each particle X t , its fitness function SED is defined as follows: where µ i refers to the centroid vector of cluster C i within particle X t , x are objects belonging to the cluster C i .

IV. PROPOSED GLPSOK ALGORITHM A. POPULATION INITIALIZATION
For most PSO-based clustering algorithms, two initialization methods [36] are often adopted. The first method randomly locates all initial particles throughout the search space. The other one randomly chooses K objects from the data set as the cluster centroids to form the initial particles.
In this paper, a hybrid approach is used. The specific description is as follows: 5% of the particles in the initial swarm are initialized with the cluster centroids generated by K-means. The remaining 95% of the particles are initialized using the second method described above, i.e., K objects are randomly chosen from the data set as the centroids to form the initial particles.
We use K-means' clustering results to initialize a few particles in the initial swarm for two reasons: (1) Clustering results of K-means may not be ideal, but they are much better than the randomly obtained initialization results in most cases. Therefore, this method can guarantee there are some relatively good particles in the initial swarm; (2) K-means is one of the most classic, simplest, and most efficient clustering algorithms. Initializing particles with the clustering results of K-means can significantly improve the performance of clustering at the cost of a little more computational time.

B. UPDATE RULES OF THE POPULATION
In basic PSO, the update of the population mainly depends on the current global optimal solution and the individual historical optimal solution, which will lead to low convergence efficiency at the early stage of the search and a significant decrease in population diversity at the later stage of the search. GEDM, whose core is the weighted covariance matrix, can make full use of promising particles to estimate a better evolution direction and has the ability to avoid local optimum. Therefore, we adopt GEDM to improve the performance of the proposed clustering algorithm.

1) PARTICLE SWARM OPTIMIZATION
PSO was first proposed by Kennedy and Eberhart in 1995 [12]. This algorithm imitates the process of bird foraging and guides optimization search by swarm intelligence generated by cooperation and competition among particles in a swarm.
In PSO, the swarm of particles is considered as a set of potential solutions, and the fly process of the particles can be regarded as a search process [37]. Each particle i is associated with two vectors, i.e., the position vector Particles search for new solutions by constantly adjusting their positions x i,d . For each particle, it can remember its individual historical optimal position that it has searched for, as pbest i , and the current global optimal position found by the entire particle swarm, as gbest. When the two optimal positions are found, each particle updates its velocity and position according to Eq.(6) and Eq. (7), respectively.
where t represents the t-th iteration, d represents the d-th dimension of the particle, ω is the inertia weight, c 1 and c 2 are the acceleration constants, rand 1 and rand 2 are random numbers randomly distributed in the interval [0, 1]. It has been proved that the performance of PSO can be improved if the inertia weight ω decreases linearly [38]. The linearly decreased inertia weight is given as follows: where ω max and ω min are the maximal and minimal weights, respectively; t is the number of the current iteration, and t max is the number of the maximum iteration.

2) GAUSSIAN ESTIMATION OF DISTRIBUTION METHOD
The estimation of distribution algorithm (EDA) can estimate the evolution direction of the promising population using probabilistic model learning and sampling. Some studies have shown that EDA has promising performance when dealing with complex optimization problems [39] [40]. GEDM is the core component of EDA. Therefore, it has been introduced into PSO to estimate the better evolution direction for the purpose of improving the convergence speed and enhancing the local optimal avoidance ability of PSO. The model of GEDM based on the weighted covariance matrix is as follows.
In Eq. (9), m is the number of solutions that are selected as the promising solutions to estimate the evolutionary direction. The solution, which has a high rank, would have a great weight when calculating the weighted mean using Eq.(10). The set of {X 1 , X 2 , . . . , X m } represents the promising solutions with fitness values ranked from high to low in Eq.(10). In Eq. (11), Cov(t) is the weighted covariance matrix of the promising solutions. When using GEDM, the population update its position using Eq. (12). The first addend of Eq. (12) corresponds to Gaussian random walk. In order to integrate the GEDM and PSO together effectively, the GEDM and PSO are selected with different probabilities at different stages of the optimization process. At the early stage of the optimization process, the GEDM is selected with a high probability to estimate the better evolution direction using promising particles, for the purpose of accelerating the convergence speed. At the later stage of the optimization process, PSO is used with a high probability to make full use of its great local search ability, for the purpose of improving the convergence accuracy of GLPSOK. The model of the probability is as follows: where γ max and γ min are the maximal and minimal probabilities to select GEDM, respectively.

3) STAGNATION DISTURBANCE STRATEGY
The Lévy process is a stochastic process of continuous time, which was first proposed by the French scientist Paul Lévy.
Researchers have found that the behavior of many animals in nature is consistent with the characteristics of the Lévy process and have proposed the Lévy flight theory [41] based on the Lévy process. Lévy flight is an optimal exploration behavior to search for randomly distributed objects. It is characterized by long-term random walks with small steps and occasionally jumping with a large step, which is similar to the global search and local search features in the intelligent optimization algorithm. Thus, Lévy flight is widely used by researchers in various optimization algorithms to generate random step sizes. In this paper, once the search is stagnant, the Lévy flight has been adopted to generate disturbance for the purpose of escaping from the local optimum. The random walk step of Lévy flight follows a heavy tail probability distribution, called Lévy distribution, whose exponential form is as follows: where s is the random step size, β is the exponential parameter that determines the shape of the Lévy distribution. The value of β is inversely proportional to the generated random step size. Because the exponential form of Lévy distribution is difficult to implement using MATLAB programming language, the method of generating the Lévy flight random search path proposed by Mantegna [42] is used to generate Lévy flight random step size in this paper. The model of Lévy flight is as follows: where s is the random walk step size, and (x) is the gamma function.
In the search process, the average fitness value of the promising solutions is used to determine whether the search is stagnant. If the average fitness value does not change in three consecutive iterations, the search would be regarded as stagnant. Once the search process is stagnant, in order to escape from the local optimum and overcome the premature convergence of GLPSOK, the Lévy flight strategy is adopted to generate disturbance to update the population information. In addition, the stagnation disturbance strategy can also increase the diversity of the population. The model of stagnation disturbance strategy based on Lévy flight is as follows: where randn is a random number following a normal distribution, and α ∈ [−1, 1] is a scale factor.

C. BOUNDARY HANDLING
Some works [43] have shown that boundary handling has a significant impact on the performance of PSO algorithms, especially when solving complex problems. In recent years, many boundary handling schemas [36] have been proposed, including those based on either periodic, absorbing, invisible, damping, reflecting, random or zoom. Among the boundary handling schemas above, the most fundamental and widely used are random and absorbing schemas. A brief description of these two fundamental schemas [43] is addressed below. 1) Random Schema: In general, this schema is adopted as the default setting in PSO programs. If a particle flies outside any dimension j of the search space, a random value drawn from a uniform distribution between the lower and upper boundaries of the dimension j would be assigned as the corresponding component for the particle, as follows. (21) where t = 1, 2, . . . , S; S is the population size; b l is the lower boundary of the j-th dimension of the t-th particle;b u is the upper boundary of the j-th dimension of the t-th particle; U (a; b) is a random value drawn from a uniform distribution between a and b.
2) Absorbing Schema: In this schema, when a particle flies outside any dimension j of the search space, the component corresponding to the j-th dimension of the particle is assigned the boundary of the dimension j. [x t,1 , . . . , x t,j−1 , b l (or b u ), x t,j+1 , . . . , x t,n ] (22) In this paper, the random schema is adopted to handle the boundary of position to prevent particles from flying out of the search space. And the absorbing schema is adopted to handle the boundary of velocity to prevent particles from moving too fast.
Finally, the greedy strategy is adopted to select the promising particles from the parents and the newly generated candidates according to the fitness. This step can guarantee the global convergence efficiency of the proposed GLPSOK algorithm. This mechanism can fully retain the dominant particles, which can improve the convergence speed of the algorithm and obtain better clustering results.
The pseudo code of the proposed GLPSOK is described as Algorithm 1.

Algorithm 1
The Procedure of GLPSOK 1. Initialize the population (see Sect.IV-A). 2. For each particle, assign each data point to its nearest cluster C j (j = 1, 2, . . . K ); 3. Calculate the fitness of each particle using Eq. (5) Population is updated using GEDM; End if End if 7. Boundary control; 8. For each newly generated candidate particle, assign each data point to its nearest cluster C j (j = 1, 2, . . . K ); 9. Calculate the fitness of each generated candidates; 10. Greedy strategy is adopted to select the promising particles; 11. Update the values for both pbest and gbest; 12. t = t +1; Execute step 4;

D. COMPUTATION COMPLEXITY ANALYSIS
The computation complexity of GLPSOK is described below. At the initialization stage (from Step 1 to Step 3), the computation complexity is O(N ·K ·D·NP), where N is the number of data points to be clustered, K is the number of clusters, D is the dimension of data points, NP is the number of particles. The computation complexity of updating the population (Step 5 and Step 6) is O(K 2 ·D 2 ·NP), where the computation complexity of computing X(t) mean and Cov(t) is O(K ·D·NP) and O(K 2 ·D 2 ·NP), respectively. The computation complexity of stage from Step 7 to Step 11 is O(N ·K ·D·NP). The computation complexity of stage from Step 4 to Step 12 is O(K ·D·NP·(K ·D+N )·t max ), where t max is the maximum number of iterations of GLPSOK. Let MaxFEs be the maximum number of the fitness evaluations of PSObased Clustering algorithm. Then, in GLPSOK, MaxFEs = NP·t max . Therefore, the overall computation complexity of GLPSOK can be expressed as O(K ·D·(K ·D+N )·MaxFEs). Table 1 summarizes the computation complexity of GLPSOK and eight related algorithms, which are introduced in detail in Section V as comparison algorithms. In non-PSO-based clustering algorithms, such as K-means, MinMaxK-means, and K-Multiple-Means (KMM), MaxFEs represents the maximum number of calculations of the objective function of algorithms. In PSC-RCE, n m is the number of groups in the swarm. In K-Multiple-Means, m is the number of sub-clusters and t 1 is the number of iterations of the sub-alternating system. In the actual program running, the running time of PSObased clustering algorithms is much longer than that of non-PSO-based clustering algorithms. The number of fitness evaluations is close or equal to MaxFEs for PSO-based clustering algorithms and the number of calculations of the objective function is usually much less than MaxFEs for non-PSObased clustering algorithms when the algorithm terminates after satisfying the algorithm termination condition.

V. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed GLPSOK is a hybrid clustering algorithm based on PSO and K-means. In order to evaluate its effectiveness, we tested it on six synthetic data sets. For the purpose of proving its superiority, we compared it with eight classic or state-of-the-art algorithms on five real-word data sets. The experiments were conducted using the Matlab R2016a platform on a computer with 1.99 GHz Inter(R) Core (TM) i7-8565U CPU, 16 GB memory and Windows 10 operating system.

A. EXPERIMENTAL DATA SETS
Six synthetic data sets and five real-world data sets were used to evaluate the proposed algorithm. The synthetic data sets (S 1 − S 6 ), which are presented in [47], [48], are generated using the Gaussian model and shown in Fig. 2a-7a. These data sets were selected for the following reasons: S 1 VOLUME 8, 2020 shows the asymmetric case where the clusters have different shapes and different numbers of data points. S 2 − S 4 shows the approximately symmetric case where the clusters take the similar shape and have the same numbers of data points. The difference between S 2 , S 3 , and S 4 is that the distance between cluster boundaries decreases in order, which can help in evaluating the performance of the proposed algorithm on data sets with unclear boundaries. S 5 and S 6 have been selected to evaluate the performance of the proposed algorithm on multi-dimensional data sets and data sets with multiple clusters, respectively. The real-world data sets (Iris, Wine, Glass, WDBC, and CMC) are obtained from the famous UCI Machine Learning Repository [49]. These data sets have been selected because they come from various domains and have been widely used in the field of machine learning [1] [50]. The information of the synthetic data sets and real-world data sets are described in Table 2 and Table 3, respectively. A detailed description of all data sets is shown below.  This data set consists of three classes, those ones having 400,300, and 200 data points, respectively. These data points are drawn from three different bivariate Gaussian distributions with the following parameters: where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.2a. Different colors in the illustration represent different classes (ground truths) of the data points. This data set consists of four classes, each containing 300 data points. These data points are drawn from four different bivariate Gaussian distributions with the following parameters: where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.3a. This data set consists of four classes, each containing 300 data points. These data points are drawn from four different bivariate Gaussian distributions with the following parameters: where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.4a. This data set consists of four classes, each containing 300 data points. These data points are drawn from four different bivariate Gaussian distributions with the following parameters: where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.5a.  where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.6a. 6) SYNTHETIC DATA SET6 (n = 500, d = 2, k = 10) This data set consists of ten classes, each containing 50 data points. These data points are drawn from ten different bivariate Gaussian distributions with the following parameters: where µ i is the mean vector, and i is the covariance matrix. The overall data distribution is shown in Fig.7a.  Iris is one of the most widely used data sets in the pattern recognition literature. This data set consists of three classes, each of which contains 50 data points. Each class refers to a type of Iris plant. M.Forina et al. [51] analyzed the chemical composition of three wines grown in the same region of Italy, hoping to determine the origins of the wines using chemical analysis. Wine data set was the analysis result of this experiment which determined the quantities of 13 constituents found in each of the three types of wines. In criminology, some research has been done on the classification of glass types, because the glass left at the crime scene can be used as evidence. The Glass Identification Data Set contains a collection of attributes that are useful for categorizing glass types. The first attribute in the instances (objects), the ID number from 1 to 214, is ignored in our research, so the dimension of the data set is 9 instead of 10. This data set is denoted as WDBC in this paper. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image. This data set is a subset of the 1987 National Indonesia Contraceptive Prevalence Survey and is denoted as CMC in this paper. The woman investigated is either not pregnant or not sure if she is pregnant. The survey is to predict the current contraceptive approach (no use, long-term methods, or shortterm methods) of a woman based on her demographic and socio-economic characteristics.

B. COMPARISION ALGORITHMS
Our proposed algorithm is a hybrid clustering one based on K-means and PSO. Therefore, we have selected several classic or state-of-the-art algorithms related to K-means or PSO to carry out a convenient comparative analysis. In addition, we also chose a classic hierarchical clustering algorithm, named BIRCH. Below further details of the comparison algorithms are briefly explained.
1) K-means is one of the most popular clustering algorithms and the basis of the proposed GLPSOK algorithm. This algorithm partitions the data set into K clusters by minimizing the sum of the intra-cluster variances. 2) MinMaxK-means [45] is an effective K-means variant.
This algorithm assigns weights to the corresponding clusters according to their variance and optimizes a weighted version of the objective function of K-means. Weights are learned together with the cluster assignments through an iterative procedure. 3) K-Multiple-Means (KMM) [46] is the state-of-the-art multi-prototype clustering algorithm based on K-means, which organizes the data set with multiple sub-cluster centroids into specified k clusters. 4) HPSOK-means [21] is the first work to apply PSO to data clustering. This algorithm uses PSO to find the centroids of specified k clusters, where the clusters generated by K-means are used to initialize the initial swarm of PSO. 5) PSC-RCE [33] is a famous variant of particle swarm clustering (PSC). This algorithm simplifies the update rules of PSC, and significantly reduces computational complexity by improving the efficiency of the particle trajectories. 6) PSOLF [52] is an Enhanced Particle Swarm Optimization with Lévy Flight for global optimization. In order to prove that our proposed GLPSOK clustering can outperform other partitioning based clustering methods that are assisted by present state-of-art PSO variants, we combine PSOLF and K-means in the same way used in this paper. The combination of PSOLF and K-means is denoted as PSOLFK in this paper. 7) PSOSCALF [53] is a state-of-the-art PSO variant. It is based on sine cosine algorithm and Lévy flight for solving optimization problems. Similarly, we combine PSOSCALF and K-means in the same way used in this paper for the purpose of proving that our proposed GLPSOK clustering can outperform other partitioning based clustering methods that are based on present stateof-art PSO variants. The combination of PSOSCALF and K-means is denoted as PSOSCALFK in this paper. 8) BIRCH [5] is a classic hierarchical clustering algorithm which can typically find a good clustering result only with a single data scan and improve the quality of the clustering result further with a few additional scans. The reason that we select BIRCH as the comparison algorithm is to further prove that GLPSOK can achieve superior performance over different types of clustering algorithms.

C. PARAMETER SETTINGS
To ensure the fairness of the experiments, for all PSObased algorithms, the maximum number of fitness evaluations (MaxFEs) was set to 30000. As for other algorithms such as K-means, BIRCH, MinMaxK-means, and KMM, the maximum number of calculations for their objective functions was also set to 30000. K-means has no additional parameters. Other parameters of the other seven comparison algorithms were set as indicated in the corresponding original papers or as default values. The parameter settings of the eight algorithms are shown in Table 4. In order to avoid experimental deviations, each algorithm was run 30 times on each data set independently.

D. EVALUATION METRIC
In order to evaluate and compare the performance of the proposed algorithm with the other six comparison algorithms, four metrics were selected. The first metric is SED as defined in Eq. (5). The other three metrics are Normalized Mutual Information [54], F-measure [1] and Accuracy [54]. Next, these metrics are described.

1) NORMALIZED MUTUAL INFORMATION (NMI)
Suppose C represents the set of classes obtained from the ground truth and C represents the set of clusters obtained from the algorithm. Their mutual information measure MI(C,C ) is defined as follows: where p (c i ) and p c j are the probabilities that an object randomly selected from the data set belongs to the class c i and cluster c j , respectively; and p c i , c j is the probability that an object randomly selected from the data set belongs to c i and also belongs to c j . In our experiments, NMI is defined as follows: where H (C) and H C are the entropy of C and C , respectively; NMI ranges from 0 to 1. NMI = 1 if the C and C are identical; and NMI = 0 if C and C are mutually independent.

2) F-MEASURE
Precision and recall are two metrics that have been widely used in information retrieval and statistics. It is desirable that the precision and recall are as high as possible. However, in some cases, the two metrics contradict to each other. F-measure is an approach to consider precision and recall together. In clustering, the higher the F-measure is, the better the clustering performance is. Formally, each class i (as given by the ground truth of the input data set) is regarded as the set of n i instances desired for a query; each cluster j (obtained from the algorithm) is regarded as the set of n j instances retrieved from a query; n ij represents the number of instances of class i within cluster j. For each class i and cluster j, F-measure (F), precision (p), and recall (r) are defined by means of the following equations: p(i, j) = n ij n j (27) r(i, j) = n ij n i (28) In Eq.(26), b is assigned 1 to keep equal weights for precision and recall.

3) ACCURACY (AC)
Given an object x i , let s i and r i be the ground truth of x i and the cluster label obtained by algorithms, respectively. The definition of AC is as follows: where n is the number of total objects; δ(x, y) is the delta function that equals 1 if x = y, 0 otherwise; and map (r i ) is the permutation mapping function which maps each cluster label r i to the equivalent ground truth.

E. RESULTS ON SYNTHETIC DATA SETS
In this section, the experimental results of applying the proposed algorithm GLPSOK on six synthetic data sets are described.  Table 5 shows the overall experimental results (including the mean and standard deviation (SD) of F-measure and AC) of the proposed algorithm on six synthetic data sets. The proposed algorithm was run independently 30 times on each data set. From the table we can see that the proposed algorithm achieves good clustering results on all synthetic data sets, where the F-measure and AC on all data sets are greater than 0.92, on data set 3 are greater than 0.98, and on data set 1 and data set 2 are even greater than 0.99.
In addition, Fig.2b, Fig.3b, Fig.4b, Fig.5b, Fig.6b, and Fig.7b illustrate the clustering results of the proposed algorithm on six synthetic data sets, respectively. Each color in each figure corresponds to a cluster. Since we only care about which data points belong to the same cluster and do not care what color is used to represent a cluster, we have unified the color representation of the clustering results and ground truth of each data set to more intuitively observe the clustering performance. From the figures, we can see that the proposed algorithm can achieve good clustering results.

F. RESULTS ON REAL-WORD DATA SETS
In this section, we test the performance of GLPSOK on five real-world data sets from the UCI Machine Learning Repository and compare it to eight comparison algorithms in terms of four evaluation metrics, namely, SED, NMI, F-measure, and AC. For each data set, we ran each algorithm 30 times independently. Tables 6-10 show the statistical results (Mean and standard deviation (SD)) obtained by all algorithms on data set Iris, Wine, Glass, WDBC, CMC, respectively. The best results are shown in bold. Table 6 shows the experimental results on data set Iris. From this table we can see that GLPSOK outperforms the        other eight algorithms for all evaluation metrics. In addition, the standard deviation of experimental results obtained by GLPSOK is very small, which proves that the proposed algorithm is stable on this data set. Table 7 shows the experimental results on data set Wine. For metrics F-measure and AC, the proposed algorithm outperforms all other eight algorithms. For metric SED, PSOSCALFK achieves the best results, slightly better than our proposed algorithm. For metric NMI, HPSOK-means achieves the best results, slightly better than our proposed algorithm. Table 8 shows the VOLUME 8, 2020  experimental results on data set Glass. For metrics SED, NMI and AC, the proposed algorithm outperforms all other eight algorithms. For metric F-measure, KMM achieves the best results, followed by BIRCH, but only a little better than our proposed algorithm. Other algorithms all get worse results than our proposed algorithm. Table 9 shows that the proposed algorithm outperforms all other comparison algorithms for all evaluation metrics on data set WDBC. Table 10 shows that GLPSOK underperforms some other algorithms in most metrics on the date set CMC. This is because the data in CMC are categorical values. For example, one of the attributes is ''Wife's now working?'' where ''0'' represents ''not working'' and ''1'' represents ''working''. However, GLPSOK is a distance-based clustering algorithm, which is suitable for continuous numeric data and not suitable for categorical data. The traditional way to treat categorical attributes as numeric does not always produce meaningful results because many categorical domains are not ordered.
To better observe the overall results and robustness of all algorithms, box plots of SED, NMI, F-measure and AC obtained by eight algorithms on data set Iris, Wine, Glass, WDBC, CMC, are shown. According to Fig.8, Fig.9, and Fig.11, compared with the eight comparison algorithms, the overall results obtained by GLPSOK are the best on the data sets Iris, Wine and WDBC. In addition, the whole distribution of experimental results obtained by our proposed algorithm is very concentrated, with almost no outliers, indicating that our proposed algorithm has strong robustness and stability on these three data sets; From Fig. 10, we can see that GLPSOK also obtains the best overall results on the data set Glass. Fig.12 shows that the clustering result of GLPSOK is at a medium level on data set CMC. In summary, GLPSOK achieves the best overall results and proved to be very robust.
In order to compare the experimental results more scientifically and comprehensively, the Wilcoxon signed rank test [55], a non-parametric hypothesis test method, was used for a paired difference test. This test can quantitatively indicate whether there is a significant difference in the performance between algorithms. It is noted that the smaller the value of SED is, the better the quality of the clustering results will be. In contrast, the larger the values of NMI, F-Measure and AC are, the better the quality of the clustering results will be. With the significance level α = 0.05, the results of the Wilcoxon signed rank test are shown in Table 11. The meaning of each symbol in the table is explained as follows: 'p-value' is the probability that the random variable has values more 'extreme' than the currently observed value under the null hypothesis; 'w+' is the sum of the rank greater than 0; 'w−' is the sum of the rank less than 0; R is the result of the Wilcoxon signed rank test, where '+' means GLPSOK outperforms other algorithms, '−' means GLPSOK is worse than other algorithms and '=' means GLPSOK is similar to other algorithms. As can be seen from Table 11, on the whole, GLPSOK is significantly better than the other eight algorithms.

VI. CONCLUSION AND FUTURE WORK
In this paper, we propose a hybrid PSO-K-means clustering algorithm that adopts GEDM and Lévy flight strategy to improve the performance of the algorithm. The experimental results on six synthetic data sets show that the proposed algorithm can obtain good clustering results on all synthetic data sets. In addition, the proposed algorithm was compared with several classic or state-of-the-art clustering algorithms on five real-word data sets from the UCI Machine Learning Repository. The experimental results show that the proposed algorithm can achieve better performance than the six comparison algorithms on real-world data sets and has strong robustness. However, GLPSOK consumes a lot of time when dealing with large-scale data set. Future work can be summarized into two aspects: (1) how to further improve the convergence efficiency of GLPSOK; (2) how to apply the proposed GLPSOK algorithm to practical scenarios like image segmentation, and applications including clustering in the financial, ecological, and educative domains from data pressed in natural language texts.