Data Clustering Method Based on Improved Bat Algorithm With Six Convergence Factors and Local Search Operators

Clustering, as an unsupervised learning method, divides data objects or observations into subsets; that is, it classifies data through learning by observation rather than learning by example, without the guidance of prior class label information. The bat algorithm (BA) is a swarm intelligence optimization algorithm inspired by the ultrasonic echolocation foraging behavior of bats, but it is easily trapped in local minima and has limited accuracy. An improved bat algorithm is therefore proposed. In the global search, a Gaussian-like convergence factor is added, and five further convergence factors are proposed to improve the global optimization ability of the algorithm. In the local search, the hunting mechanism of the whale optimization algorithm (WOA) and the sine position updating strategy are adopted to improve the local optimization ability of the algorithm. This paper compares the clustering performance of the improved bat algorithm, under six different convergence factors, with that of the bat algorithm, flower pollination algorithm (FPA), harmony search (HS) algorithm, whale optimization algorithm and particle swarm optimization (PSO) algorithm on seven real data sets. The simulation results show that the clustering performance of the improved bat algorithm is superior to that of the other intelligent optimization algorithms.


I. INTRODUCTION
Swarm intelligence algorithms based on bionics have attracted wide attention. Inspiration drawn from the biological world has been successfully applied to practical problems, yielding a series of meta-heuristic swarm intelligence algorithms based on biological behavior. Examples include the whale optimization algorithm (WOA) based on whale predation [1], the particle swarm optimization (PSO) algorithm based on the swarm behavior of birds and fish [2], the harmony search (HS) algorithm, which simulates the behavior of musical instruments [3], the bee colony algorithm (BCA) [4], the flower pollination algorithm (FPA) [5], which models the self-pollination and cross-pollination of flowers in nature, and the gray wolf optimizer (GWO) [6]. The bat algorithm (BA) is a swarm intelligence algorithm proposed by Prof. Yang in 2010 based on the ultrasonic echolocation foraging behavior of bats [7]. It has been widely used because it has few parameters, a simple model, and is easy to code. However, like other random search algorithms, it is prone to premature convergence and low convergence accuracy, especially on high-dimensional data.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
Guo proposed an improved bat algorithm based on multiple swarm strategies and a chaotic bat swarm algorithm to improve the convergence speed and accuracy of the bat algorithm. A chaos factor and a second-order oscillation mechanism are introduced to improve the update speed and the dynamic parameter mechanism of the system [8]. Zhu et al. designed new pulse emission rate, loudness, velocity, and position update functions to avoid premature convergence, together with a new one-dimensional perturbed local search strategy to improve the efficiency and accuracy of local search [9]. Yuan proposed an improved bat algorithm based on a weighted method to solve the multi-objective optimal power flow problem, and the experimental results showed its effectiveness [10]. Meng introduced bat habitat selection and adaptive compensation for the Doppler effect into the basic BA, yielding a new bat algorithm (NBA) whose effectiveness was verified experimentally against BA and other algorithms [11]. Yaseen proposed a hybrid bat swarm algorithm that combines the bat algorithm with particle swarm optimization; its main idea is to run PSO in parallel to replace the suboptimal solutions generated by BA. This speeds up convergence and helps avoid the local optima in which BA tends to be trapped [12]. Selim enhanced the local and global search characteristics of the bat algorithm through three different methods. The enhanced bat algorithm (EBA) was evaluated on standard test functions and constrained practical problems, and the results indicate that EBA outperforms the standard BA [13]. Miodragović introduced bat families, which continuously repeat the process of finding the optimal solution by including a loop search in the solution area.
For each bat in each family, a fine search based on Levy flight is performed to find an improved solution until the given constraints are met [14]. An improved adaptive bat algorithm (SABA) was proposed, which has adaptive step control and a mutation mechanism. The step control mechanism uses two frequencies to adapt the step sizes used for global search and local search, and the mutation mechanism improves the algorithm's ability to escape local optima [15]. A bat algorithm based on iterative local search and stochastic inertia weight (ILSSIWBA) was proposed in [16]. Its iterative local search (ILS) gives ILSSIWBA a strong ability to jump out of local optimal solutions. A random inertia weight update method is also introduced, and the pulse rate and loudness updates are modified to better balance global search and local search. Al-Betar applied the island model strategy to the bat algorithm to strengthen its control of population diversity [17]. A sensitivity analysis of the main parameters of the island bat algorithm was conducted and their influence on convergence was studied; the comparison with other algorithms on benchmark functions was favorable. A binary cooperative bat search algorithm (BCBA) was proposed in [18]. Unlike the original bat search algorithm, the cooperative bat search algorithm (CBA) adds a consensus term to the velocity equation. A numerical comparison with four binary algorithms from the literature is provided to demonstrate the superior performance of BCBA. A chaotic enhanced bat algorithm was proposed to solve global optimization problems [19]. It controls the steps of the chaotic mapping through thresholds and uses velocity inertia weights to synchronize the velocities of the agents.
These mechanisms are designed to improve the stability and convergence speed of the bat algorithm. Yildizdan first proposed an advanced modified BA (MBA) and then a hybrid system (MBADE) that combines MBA with DE to further increase exploitation potential and deliver good performance on various groups of test problems. Compared with published data for existing algorithms, the hybrid system performs better than the standard BA on all test problem sets and produces more acceptable results [20]. Hong proposed a chaotic bat algorithm based on chaos, niche search, and evolutionary mechanisms to optimize the parameters of a mixed-kernel support vector regression model [21]. To overcome the low search capability and possible premature convergence of the bat algorithm, Chakri introduced directional echolocation into the standard bat algorithm to enhance its exploration and exploitation capabilities [22]. In addition to directional echolocation, three other improvements are embedded in the standard bat algorithm to improve its performance. To improve the search ability of the bat algorithm, an improved bat algorithm based on the covariance adaptive evolution process was proposed [23]. The information contained in the covariance adaptive evolution diversifies the search directions and sampling distributions of the population, which benefits the search process. Dhar proposed an image threshold segmentation method based on interval type-2 fuzzy sets (IT2FS) together with an improved bat algorithm, which improved the computational efficiency of the thresholding technique [24].
As an unsupervised learning method, clustering needs no prior class label information and classifies data through learning by observation rather than learning by example [25]. Clustering is the process of dividing data objects or observations into subsets, each of which is a cluster. Its purpose is to make objects within a cluster similar to each other and objects in different clusters dissimilar. Swarm intelligence optimization algorithms have good optimization ability, and the clustering problem can be regarded as an optimization problem of finding the optimal cluster centers in the solution space. The combinations of different cluster centers constitute the solution space of the clustering problem, and the goal is to find the cluster centers that optimally divide the data. The optimization mechanism of a swarm intelligence algorithm lets individuals move continuously through the solution space to find a better combination of cluster centers. Swarm intelligence optimization is therefore an efficient way to solve the clustering problem. Kuo proposed a dynamic clustering method based on the particle swarm algorithm and the genetic algorithm, which clusters data automatically without requiring the number of clusters to be specified in advance [26]. Yang proposed a Chinese text clustering optimization algorithm based on hybrid differential evolution optimization and invasive weed optimization; experimental results show that the method performs better [27]. The bee mating optimization algorithm was applied to clustering with good results [28]. To overcome the disadvantages of the K-means method, which depends heavily on the initial solution and easily falls into local optima, a flower pollination algorithm with bee pollination was proposed [29].
An improved differential evolution (DE) algorithm was proposed that uses the Archimedean spiral, Mantegna Levy flight, and neighborhood search (NS). These strategies achieved good convergence speed and better local and global search [30]. To address the sensitivity of the EM algorithm with the Gaussian model to its initial values, a robust Gaussian mixture model EM clustering algorithm was proposed, which is robust to initialization and to different cluster capacities and can automatically obtain the optimal number of clusters [31]. A K-means clustering method based on the shuffled frog leaping algorithm (SFLKmeans) was proposed and compared with other heuristic algorithms (such as GAK, SA, TS, and ACO) on multiple simulated and real data sets. The results show that the algorithm has better performance [32].
VOLUME 8, 2020
This paper proposes an improved bat algorithm to solve cluster optimization problems. In the global search stage, a convergence factor in the form of a Gaussian function is added, and on this basis five further convergence factors are proposed to improve the algorithm's global optimization capability. In the local search stage, the hunting mechanism of the whale optimization algorithm and the sine position updating strategy are added to improve the local exploration ability of the algorithm. The improved bat algorithm, the bat algorithm, the flower pollination algorithm (FPA), the harmony search algorithm, the whale optimization algorithm, and the particle swarm optimization algorithm are used in clustering experiments on seven real data sets to verify the effectiveness of the proposed algorithm.

II. BAT ALGORITHM
Bats use echolocation to detect prey, avoid obstacles, and find habitat in dark surroundings. They emit very loud pulses and listen to the echoes that bounce back from surrounding objects. From the arrival time and intensity of the echoes at the ears, they can determine the direction and position of an object. They can also emit pulses with different properties according to the characteristics of the target prey or obstacle. The frequency of the sound waves emitted by bats is usually in the range of 25-100 kHz. Each emission usually lasts a few thousandths of a second (5-20 ms), and a microbat emits sound waves about 10-20 times per second. When hunting for prey, bats emit about 200 pulses per second. Bat calls can be as loud as 110 dB, and the loudness varies from the loudest when hunting for prey to the quietest when approaching it. A bat detects the distance and orientation of the target, the type of prey, and the speed of the prey [5] from the time difference between emitting a pulse and receiving its echo. If the echolocation characteristics of bats are idealized, the bat algorithm becomes easier to simulate. In analyzing the bat algorithm, the following approximate, idealized rules are adopted.
1) All bats use echolocation to sense distance, and they also know the difference between food/prey and background obstacles in some magical way.
2) The bats fly randomly at position $x_i$ with velocity $v_i$. They can automatically adjust the frequency (wavelength) of the emitted pulses and adjust the pulse emission rate r ∈ [0, 1] according to the proximity of the target.
3) Although the loudness can vary in many ways, we assume that it changes from a large (positive) value $A_0$ to a minimum value $A_{\min}$.
In simulating the bat algorithm, the search space is assumed to have D dimensions, and the position $x_i^t$ and velocity $v_i^t$ of each bat in each generation are updated by Eq. (1)-(3):
$$f_i = f_{\min} + (f_{\max} - f_{\min})\beta \tag{1}$$
$$v_i^{t+1} = v_i^t + (x_i^t - x^*) f_i \tag{2}$$
$$x_i^{t+1} = x_i^t + v_i^{t+1} \tag{3}$$
where $x^*$ is the current global optimal solution, β ∈ [0, 1] is a random number, and $f_i$ is the sonic frequency of the bat, which lies in $[f_{\min}, f_{\max}]$.
For the local search, once a solution is selected among the current best solutions, a local random walk is used to generate a new solution for each bat:
$$x_{new} = x_{old} + \varepsilon A^t \tag{4}$$
where ε ∈ [−1, 1] is a random number and $A^t$ is the average loudness of the entire population in the same generation. It is assumed that once a bat finds its prey, it gradually reduces the loudness of its pulses while increasing its pulse emission rate. The loudness $A_i$ and pulse emission rate $r_i$ of the bat are adjusted according to Eq. (5) and (6):
$$A_i^{t+1} = \alpha A_i^t \tag{5}$$
$$r_i^{t+1} = r_i^0 \left[1 - \exp(-\gamma t)\right] \tag{6}$$
where α ∈ (0, 1) is the loudness attenuation coefficient, γ > 0 is the pulse rate enhancement coefficient, and $r_i^0$ is the initial pulse emission rate of bat i. Based on the above analysis, the procedure of the basic bat algorithm is summarized as follows:
Step 1: Parameter initialization. Set the bat population size m, the number of iterations N, the objective function f(X), the bat positions $X_i$ (i = 1, 2, ..., m) and velocities $V_i$, the sound wave frequency $f_i$, the loudness $A_i$, and the pulse emission rate $r_i$.
Step 2: Find the optimal bat position $x^*$ in the current population, and update the velocity and position according to Eq. (1)-(3).
Step 3: Generate a random number rand1 in [0, 1]. If rand1 > $r_i$, choose an individual among the best bats and generate a local solution near it by Eq. (4); otherwise update the bat position according to Eq. (3).
Step 4: Generate a random number rand2 in [0, 1]. If rand2 < $A_i$ and the new solution from Step 3 has a better objective function value, accept the new position, then decrease $A_i$ and increase $r_i$ according to Eq. (5)-(6).
Step 5: Sort the fitness values of all individuals in the population and find the current best $x^*$.
Step 6: Repeat Steps 2-5 until the maximum number of iterations is reached, and then output the global optimal value.
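The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bat_algorithm`, the default parameter values, the clipping of positions to a box, the constant initial pulse rate `r0 = 0.5`, and the acceptance test against each bat's own previous fitness are all choices made here for the example.

```python
import math
import random


def bat_algorithm(objective, dim, bounds, n_bats=20, n_iter=200,
                  f_min=0.0, f_max=2.0, alpha=0.9, gamma=0.9, seed=0):
    """Minimise `objective` with a basic bat algorithm (illustrative sketch)."""
    rng = random.Random(seed)
    lo, hi = bounds

    def clip(value):
        # Assumption: positions are kept inside a simple box constraint.
        return max(lo, min(hi, value))

    # Step 1: initialise positions, velocities, loudness A_i and pulse rate r_i.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    loud = [1.0] * n_bats
    r0 = 0.5                      # assumed initial pulse rate r_i^0
    rate = [r0] * n_bats
    fit = [objective(p) for p in x]
    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = x[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Step 2: frequency, velocity and position updates, Eq. (1)-(3).
            freq = f_min + (f_max - f_min) * rng.random()
            v[i] = [v[i][d] + (x[i][d] - best[d]) * freq for d in range(dim)]
            cand = [clip(x[i][d] + v[i][d]) for d in range(dim)]
            # Step 3: local random walk around the best solution, Eq. (4).
            if rng.random() > rate[i]:
                cand = [clip(best[d] + rng.uniform(-1.0, 1.0) * mean_loud)
                        for d in range(dim)]
            cand_fit = objective(cand)
            # Step 4: conditional acceptance, then Eq. (5) and Eq. (6).
            if rng.random() < loud[i] and cand_fit < fit[i]:
                x[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1.0 - math.exp(-gamma * t))
            # Step 5: track the current best position.
            if fit[i] < best_fit:
                best, best_fit = x[i][:], fit[i]
    return best, best_fit
```

Calling `bat_algorithm(lambda p: sum(c * c for c in p), dim=2, bounds=(-5.0, 5.0))` returns a position and its objective value; because the best solution is only ever replaced by a better one, the returned value never exceeds the best value in the initial population.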

III. IMPROVED BAT ALGORITHM
The bat algorithm relies on cooperation and interaction between individual bats, and there is no mutation mechanism for individuals within the population. Once an individual finds a local optimum, it becomes trapped there and draws other individuals toward it, which causes premature convergence and greatly reduces population diversity. To address the shortcomings of the basic bat algorithm, namely its tendency to fall into local extrema, its low optimization accuracy, and its slow convergence in the later stages, this paper introduces a non-linear mutation factor into the velocity update equation in the global search phase. The factor keeps the bat population highly diverse, thereby enhancing the global exploration ability of the algorithm. The position update equation in the local search stage is also changed: the shrinking encircling mechanism of the whale optimization algorithm and the sine position updating strategy of the sine cosine algorithm are adopted to improve the deep exploration ability of the algorithm.

A. GLOBAL SEARCHING BASED ON CONVERGENCE FACTORS
During the global search stage, each bat updates its position using its velocity as the moving step, so as to keep approaching the prey [33]. The velocity update in Eq. (2) shows that $x_i^t - x^*$, the distance from the i-th bat in generation t to the current optimal position, has an important effect on the velocity update and therefore on the bat's step size. Because the bat is constrained by this distance in the global search, the population cannot spread out well for global exploration, which reduces the global optimization ability of the algorithm. The velocity update in Eq. (2) thus determines the global exploration ability of the bat population. To enhance the global search ability, this paper adds a non-linear mutation factor D to Eq. (2), giving the modified velocity update of Eq. (7). The resulting velocity update strategy non-linearly expands the search range and maintains population diversity, thereby increasing the global search capability of the bat algorithm. In the definition of D, a is a random number in [0, 1] and c is given by Eq. (9), where t is the current iteration number and Maxiter is the maximum number of iterations. The convergence factor D decreases gradually as the number of iterations increases. Early in the run, D decays slowly and the bats can move with a larger amplitude, which helps locate the global optimal solution. In later iterations, D decays faster and the range of movement shrinks, which allows the optimal solution to be located more accurately and balances exploitation and exploration during global search. The decay of the convergence factor D with the number of iterations is shown in Fig. 1 (a). Eq. (9) shows that the expression for c has the form of a Gaussian function.
Therefore, this paper also proposes non-linear factors in cosine, sine, tangent, power function, and exponential function forms. The five convergence factors, each with its own coefficient, are $D_1$ with the cosine form (coefficient $c_1$), $D_2$ with the sine form ($c_2$), $D_3$ with the tangent form ($c_3$), $D_4$ with the power function form ($c_4$), and $D_5$ with the exponential function form ($c_5$). The trends of these convergence factors as the number of iterations increases are shown in Fig. 1.

B. LOCAL SEARCHING BASED ON HUNTING MECHANISM AND SINUSOIDAL POSITION UPDATING STRATEGY
In the local search stage, the bat algorithm uses the complete perturbation of Eq. (4): to generate a new solution, every component of the current optimal solution is changed, so the search efficiency is low and the search accuracy is poor. Therefore, in this paper, the shrinking encircling mechanism of the whale optimization algorithm and the sine position update strategy of the sine cosine algorithm are combined to enhance the local search ability of the bat algorithm.

1) HUNTING MECHANISM
The whales use the bubble-net attack method (exploitation stage), which includes two mechanisms: the shrinking encircling mechanism and the spiral position update.

a: SHRINKING ENCIRCLING MECHANISM
The WOA assumes that the current best candidate solution is the target prey or is close to the optimal solution. After the best search agent is defined, the other search agents try to update their positions toward it. This strategy is expressed as follows.
$$\vec{D} = |\vec{C} \cdot \vec{X}^*(t) - \vec{X}(t)|$$
$$\vec{X}(t+1) = \vec{X}^*(t) - \vec{A} \cdot \vec{D}$$
A and C are calculated by:
$$\vec{A} = 2\vec{a} \cdot \vec{r} - \vec{a}$$
$$\vec{C} = 2\vec{r}$$
where a decreases linearly from 2 to 0 over the iterations, and r is a random number in [0, 1]. Fig.2 (a) illustrates the principle for a two-dimensional WOA. The location (X, Y) of a search agent can be updated based on the location of the current best record (X*, Y*). By adjusting the values of the A and C vectors, different positions around the best agent can be reached from the current position. The same concept extends to an n-dimensional search space, where the search agent moves within a hypercube around the best solution obtained so far. The fluctuation range of A shrinks as a decreases from 2 to 0 during the iterations, since A is a random value in [−a, a]. When A is given random values in [−1, 1], that is, when |A| ≤ 1, the new location of the search agent can be anywhere between the original location of the agent and the location of the current best agent. Fig.2 (b) shows the possible positions from (X, Y) toward (X*, Y*) that can be reached in a two-dimensional space for 0 ≤ A ≤ 1.

b: POSITION SPIRAL UPDATING METHOD
As shown in Fig.2 (c), this method first calculates the distance between the whale at (X, Y) and the prey at (X*, Y*), and then creates a spiral equation between the positions of the whale and the prey to simulate the spiral motion of the humpback whale, which is described as follows.
$$\vec{X}(t+1) = \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^*(t)$$
where $\vec{D'} = |\vec{X}^*(t) - \vec{X}(t)|$ is the distance from whale i to the prey (the best solution currently obtained), b is a constant defining the shape of the logarithmic spiral, and l is a random number in [−1, 1]. The humpback whale swims around its prey in a shrinking circle while simultaneously following a spiral path. To simulate this simultaneous behavior, it is assumed that the shrinking encircling mechanism and the spiral model are each chosen with 50% probability to update the position of the whales. The mathematical model is described as follows.
$$\vec{X}(t+1) = \begin{cases} \vec{X}^*(t) - \vec{A} \cdot \vec{D}, & p < 0.5 \\ \vec{D'} \cdot e^{bl} \cdot \cos(2\pi l) + \vec{X}^*(t), & p \ge 0.5 \end{cases}$$
where p is a random number in [0, 1]. In the search for prey (exploration stage), the same approach based on varying the vector A can be used. In fact, humpback whales search randomly according to each other's locations. Therefore, A is given random values greater than 1 or less than −1 to force the search agent to move away from a reference whale. When |A| > 1, exploration is emphasized, which allows the WOA to perform a global search. Its mathematical model is expressed as follows.
$$\vec{D} = |\vec{C} \cdot \vec{X}_{rand} - \vec{X}|$$
$$\vec{X}(t+1) = \vec{X}_{rand} - \vec{A} \cdot \vec{D}$$
where $\vec{X}_{rand}$ is a randomly selected position vector (random whale) from the current population.
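The hunting mechanism described above can be sketched as a single position update. The sketch below is illustrative only: the names `woa_update` and `linear_a` are hypothetical, and for brevity A and C are drawn once per whale as scalars, whereas the WOA formulation treats them as vectors with one random component per dimension.

```python
import math
import random


def linear_a(t, max_iter):
    """Control parameter a, decreasing linearly from 2 to 0."""
    return 2.0 * (1.0 - t / max_iter)


def woa_update(x, best, population, a, b=1.0, rng=random):
    """Return the next position of one whale as a list of floats."""
    p = rng.random()
    if p < 0.5:
        A = 2.0 * a * rng.random() - a   # random value in [-a, a]
        C = 2.0 * rng.random()
        # |A| < 1: shrink toward the best agent (exploitation).
        # |A| >= 1: move relative to a random whale (exploration).
        ref = best if abs(A) < 1.0 else rng.choice(population)
        return [ref[d] - A * abs(C * ref[d] - x[d]) for d in range(len(x))]
    # Spiral update around the best agent.
    l = rng.uniform(-1.0, 1.0)
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [abs(best[d] - x[d]) * spiral + best[d] for d in range(len(x))]
```

When a reaches 0 at the last iteration, A is 0 and the encircling branch returns the best position itself, which matches the described shrinking of the fluctuation range of A.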

2) SINUSOIDAL POSITION UPDATING STRATEGY
The sine cosine algorithm uses simple mathematical functions (the sine and cosine functions) to explore and exploit the space between two solutions in order to find a better solution. Its position updating principle can be expressed as:
$$X_i^{t+1} = \begin{cases} X_i^t + r_1 \sin(r_2)\,|r_3 P_i^t - X_i^t|, & r_4 < 0.5 \\ X_i^t + r_1 \cos(r_2)\,|r_3 P_i^t - X_i^t|, & r_4 \ge 0.5 \end{cases}$$
where $X_i^t$ is the position of the current solution in the i-th dimension at the t-th iteration, $P_i^t$ is the position of the destination point in the i-th dimension, $r_1 = 2(1 - t/Maxiter)$, $r_2$ is a random number in [0, 2π], $r_3$ is a random number in [0, 2], and $r_4$ is a random number in [0, 1].
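As a brief illustration of this update rule, the following sketch applies it dimension by dimension. The function name `sca_update` and the argument name `dest` (the destination point P) are assumptions introduced for the example.

```python
import math
import random


def sca_update(x, dest, t, max_iter, rng=random):
    """One sine cosine position update toward the destination point `dest`."""
    r1 = 2.0 * (1.0 - t / max_iter)            # amplitude, shrinks to 0
    new_x = []
    for xi, pi in zip(x, dest):
        r2 = rng.uniform(0.0, 2.0 * math.pi)   # phase of the sine/cosine
        r3 = rng.uniform(0.0, 2.0)             # weight on the destination
        r4 = rng.random()                      # switch between sine and cosine
        wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
        new_x.append(xi + r1 * wave * abs(r3 * pi - xi))
    return new_x
```

Because $r_1$ falls linearly to zero, the step size is largest at the start of the run and vanishes at the final iteration, where the update leaves the position unchanged.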

3) LOCAL SEARCH BASED ON HUNTING MECHANISM AND SINUSOIDAL POSITION UPDATING STRATEGY
The local search combines the shrinking encircling mechanism of the whale optimization algorithm with the sinusoidal position updating strategy of the sine cosine algorithm. The strategy works as follows. When |A| < 1, the new position (X, Y) is updated based on the position of the current best record (X*, Y*). The position is updated around the current optimal solution, so that its neighborhood can be explored more thoroughly. When |A| ≥ 1, the sine position update principle is adopted to expand the search range and balance the global search and local search capabilities more effectively.
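One possible reading of this combined strategy is sketched below. It is an interpretation, not the paper's formula: the name `hybrid_local_search` is hypothetical, A and C are drawn as scalars, the current best position is used as the destination point of the sine update, and a follows the same linear schedule 2(1 − t/Maxiter) as $r_1$.

```python
import math
import random


def hybrid_local_search(x, best, t, max_iter, rng=random):
    """Local update: WOA encircling if |A| < 1, sine update otherwise."""
    a = 2.0 * (1.0 - t / max_iter)
    A = 2.0 * a * rng.random() - a
    if abs(A) < 1.0:
        # Shrinking encircling around the current best position.
        C = 2.0 * rng.random()
        return [b - A * abs(C * b - xi) for xi, b in zip(x, best)]
    # Sine position update with the best position as destination (assumption).
    new_x = []
    for xi, b in zip(x, best):
        r2 = rng.uniform(0.0, 2.0 * math.pi)
        r3 = rng.uniform(0.0, 2.0)
        new_x.append(xi + a * math.sin(r2) * abs(r3 * b - xi))
    return new_x
```

Early in the run a is close to 2, so |A| ≥ 1 occurs about half of the time and the wider sine step is used often; as a drops below 1 only the encircling branch remains, which concentrates the search around the best solution.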

C. PSEUDO CODE OF IMPROVED BAT ALGORITHM
The pseudo-code of the improved bat algorithm is as follows.
Initialize the bat population x_i and v_i (i = 1, 2, ..., n)
Initialize pulse frequency f_i, pulse rates r_i, and loudness A_i
while (t < Max number of iterations)
    Generate new solutions by adjusting frequency, and updating velocities and locations/solutions [Eq. (1), (7) and (3)]
    if (rand > r_i)
        Select a solution among the best solutions
        Update the position according to Eq. (29)
        Accept the new solutions
        Increase r_i and reduce A_i
    end if
    Rank the bats and find the current best x*
    t = t + 1
end while
According to the above pseudo-code, the time complexity of the algorithm is O(log n). The inner nested algorithm needs to loop over all individuals, so its time complexity is O(n log n). When Eq. (29) is calculated, the position is updated for each dimension of each individual, so its time complexity is O(n^2 log n). The overall time complexity is therefore O(log n + n log n + n^2 log n), which is O(n^2 log n).

D. GLOBAL SEARCHING BASED ON CONVERGENCE FACTORS
In this section, the numerical efficiency of the proposed improved algorithm is verified on 10 mathematical optimization problems. The expressions of the ten benchmark functions are shown in Table 1. To assess the algorithm from several aspects, the test functions are divided into three groups. The first group contains the unimodal functions F1 ∼ F3 [34], each of which has only one global optimal solution. The second group contains the multimodal functions F4 ∼ F6 [35], which have more than one extreme point and therefore have local optimal values. The last group contains the combination functions F7 ∼ F10 [34], which are formed by rotating, shifting, and offsetting various benchmark test functions. It can be seen from Table 2 that the six improved bat algorithms are superior to the original bat algorithm. For the unimodal and multimodal functions, the improved bat algorithm finds the optimal value of 0 every time, whereas the bat algorithm falls into a local optimum. For the fixed-dimension test functions, the improved algorithm also shows superior performance. It can also be seen from the figure that the convergence rate of the improved bat algorithm is greatly improved, and the results are stable over multiple experiments. The improved bat algorithm therefore improves both the convergence speed and the convergence accuracy of the original bat algorithm.

IV. DATA CLUSTERING METHOD BASED ON IMPROVED BAT ALGORITHM
Based on unsupervised learning, a clustering method is proposed to divide objects into groups or classes. In the unsupervised technique, the training data set is first grouped based only on the numerical information in the data (the cluster centers), and the groups are then matched to the classes. The adopted data sets contain class information for each data item. The main goal is therefore to find the cluster centers by minimizing the objective function (the sum of the distances of the patterns from their centers). Given N patterns, the purpose of clustering is to minimize the objective function in Eq. (30) [36], where K is the number of clusters, d is the Euclidean distance, $c_k$ (k = 1, 2, ..., K) is the center of the k-th cluster, and $x_i$ (i = 1, 2, ..., N) is a pattern assigned to the k-th cluster. Clustering assigns the patterns in the data to clusters so that the patterns within a cluster are similar under a certain similarity measure. The most common measure is distance. This paper uses the Euclidean distance between each cluster center and the data belonging to that center as the objective function, Eq. (31) [37], [38], where i = 1, ..., K, $D_{Train}$ is the number of training data, $c_i$ is the i-th cluster center, Bl is the instance to which $c_i$ belongs, and $x_j^{Bl(c_i)}$ is the training data belonging to cluster i.
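The verbal definition of the objective, the sum of the distances of the patterns from their centers, can be sketched as follows. The sketch assumes the plain (unsquared) Euclidean distance and no normalization; the function names and the flat encoding of K centers in one position vector are illustrative assumptions rather than details taken from the paper.

```python
import math


def decode_centers(position, k, d):
    """Split a flat position vector of length k*d into k centers (assumed encoding)."""
    return [position[i * d:(i + 1) * d] for i in range(k)]


def clustering_cost(data, labels, centers):
    """Sum of Euclidean distances from each pattern to the center of its cluster."""
    return sum(math.dist(x, centers[k]) for x, k in zip(data, labels))
```

For example, with patterns (0, 0), (0, 2), (10, 0), labels 0, 0, 1 and centers (0, 1) and (10, 0), the cost is 1 + 1 + 0 = 2.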
In this paper, the cluster centers are the decision variables. The objective function in Eq. (31) is minimized to obtain the optimal cluster centers. 75% of the data in each data set are randomly selected as the training set used to obtain the optimal cluster centers, and the remaining 25% (the test set) are then used to measure the accuracy of the clustering result. The F-measure and ARI indexes below are computed by classifying all the data according to the optimal cluster centers obtained from the training set. The specific procedure of the clustering algorithm is described as follows.
Step 2: Input the data of each cluster.
Step 3: Randomly select 75% of each type of data as training data.
Step 4: Calculate the fitness values according to the objective function, and denote the smallest fitness value as $f_{\min}$ and its position as the global optimal position.
Step 5: In the iterative process, train on the training data with the improved bat algorithm and update the population positions.
Step 6: Calculate the fitness values of the updated positions after each iteration, and compare the minimum fitness value with $f_{\min}$. If it is less than $f_{\min}$, update the minimum fitness value and the optimal position; otherwise continue the iteration process.
Step 7: At the end of the iteration, the final global optimal position is obtained, which is the optimal cluster center.
Step 8: Repeat Steps 4-7 until the maximum number of iterations is reached, and then output the global optimal value.
Step 9: Repeat Steps 2-8 to find the optimal cluster center for the next cluster of data.
Step 10: Classify the data sets according to the distance from each data item to each cluster center.
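The data handling around the optimizer (the 75%/25% split of each class in Step 3 and the distance-based classification in Step 10) can be sketched as below. The helper names `split_by_class` and `nearest_center`, the use of `round` for the split size, and the dictionary layout are assumptions made for this example.

```python
import math
import random


def split_by_class(data, labels, train_frac=0.75, seed=0):
    """Randomly split each class into training and test parts."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(data, labels):
        by_class.setdefault(y, []).append(x)
    train, test = {}, {}
    for y, points in by_class.items():
        points = points[:]                 # copy before shuffling
        rng.shuffle(points)
        cut = round(train_frac * len(points))
        train[y], test[y] = points[:cut], points[cut:]
    return train, test


def nearest_center(x, centers):
    """Return the label whose center is closest to x (Euclidean distance)."""
    return min(centers, key=lambda y: math.dist(x, centers[y]))
```

Here `centers` maps each class label to the center found for that class, so `nearest_center((0.1, 0.0), {'a': (0.0, 0.0), 'b': (5.0, 5.0)})` returns `'a'`.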

A. EVALUATION INDEX
There are three methods for testing the clustering effect: F-Measure, adjusted Rand index, and accuracy.

1) F-MEASURE
The F-Measure is the harmonic mean of the precision and recall of the clustering over all classes [39]. Given the number of samples $n_i$ in the known class i, the number of samples $n_j$ in cluster j, and the number of samples $n_{ij}$ in cluster j that belong to the known class i, the precision can be defined as:
$$P(i, j) = \frac{n_{ij}}{n_j}$$
The recall can be defined as:
$$R(i, j) = \frac{n_{ij}}{n_i}$$
Then the overall F-Measure of the data set can be defined as:
$$F(i, j) = \frac{(b^2 + 1)\, P(i, j)\, R(i, j)}{b^2 P(i, j) + R(i, j)}, \qquad F = \sum_i \frac{n_i}{n} \max_j F(i, j)$$
where b = 1 and n is the total number of samples. The value range of the F-measure is [0, 1]. The larger the value, the better the clustering effect.
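A short sketch of this index is given below, assuming the commonly used class-weighted form in which each known class is matched to the cluster that maximizes its F value. The function name `f_measure` and its argument names are illustrative.

```python
from collections import Counter


def f_measure(true_labels, cluster_labels, b=1.0):
    """Class-weighted F-measure of a clustering against known classes."""
    n = len(true_labels)
    n_i = Counter(true_labels)                        # class sizes
    n_j = Counter(cluster_labels)                     # cluster sizes
    n_ij = Counter(zip(true_labels, cluster_labels))  # overlaps
    total = 0.0
    for i, size_i in n_i.items():
        best = 0.0
        for j, size_j in n_j.items():
            overlap = n_ij[(i, j)]
            if overlap == 0:
                continue
            precision = overlap / size_j
            recall = overlap / size_i
            f_ij = (b * b + 1) * precision * recall / (b * b * precision + recall)
            best = max(best, f_ij)
        total += size_i / n * best
    return total
```

A clustering that reproduces the classes exactly, up to renaming of the cluster labels, scores 1, while placing two equally sized classes in a single cluster gives precision 0.5 and recall 1 for each class and therefore a score of 2/3.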

2) ADJUSTED RAND INDEX
The Rand index (RI) [40] requires the actual category information C. Let K be the clustering result, let a be the number of element pairs that are in the same category in both C and K, and let b be the number of element pairs that are in different categories in both C and K. The Rand index is defined as:
$$RI = \frac{a + b}{C_{n_{samples}}^{2}}$$
where $C_{n_{samples}}^{2}$ is the total number of element pairs that can be formed in the data set. The value range of RI is [0, 1]. A larger value means that the clustering result is more consistent with the real situation.
For random results, RI does not guarantee a score close to zero. To ensure that the index is close to zero when the clustering results are randomly generated, the adjusted Rand index (ARI) [41] was proposed, which has a higher degree of discrimination. In its general form, ARI = (RI − E[RI]) / (max(RI) − E[RI]), where E[RI] is the expected value of RI under random labeling. The value range of ARI is [−1, 1]. A larger value means that the clustering result is more consistent with the real situation. In a broad sense, ARI measures how well the two data distributions agree.

3) ACCURACY
Accuracy is defined as the ratio of the number of correctly classified samples to the total number of samples for a given test data set; that is, it is the accuracy on the test data set when the 0-1 loss function is used [42].
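Because cluster indices are arbitrary, computing accuracy for a clustering requires some rule for matching clusters to known classes. The text does not state which rule is used, so the sketch below assumes one possible choice: try every one-to-one mapping of clusters to classes and keep the best. This is only practical for a small number of clusters, and the function name `clustering_accuracy` is illustrative.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """Accuracy under the best one-to-one cluster-to-class mapping (sketch)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    if len(clusters) > len(classes):
        raise ValueError("sketch assumes no more clusters than known classes")
    best = 0
    # Assumed matching rule: exhaustive search over one-to-one mappings.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(mapping[k] == c for c, k in zip(true_labels, cluster_labels))
        best = max(best, correct)
    return best / len(true_labels)

acc = clustering_accuracy([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 0])
```

With K = 2 or K = 3, as in the data sets considered here, the search covers at most 6 mappings.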

B. DATA SETS
Seven real data sets from the UCI library are used for the clustering experiments: Iris, Wine, Bupa, Seeds, Heartstatlog, WDBC, and Wisconsin breast cancer.
(1) Iris (N = 150, d = 4, K = 3) is a widely used data set covering three types of iris plants. Each type contains 50 samples, giving 150 samples with four attributes: sepal length, sepal width, petal length, and petal width. Two of the types overlap heavily, and the third is linearly separable from the other two.
(2) Wine (N = 178, d = 13, K = 3) data are the result of a chemical analysis of wines from the same region of Italy but from three different varieties. The analysis determined the amounts of 13 ingredients in each of the three wines.
(3) BUPA (N = 345, d = 6, K = 2). The BUPA liver disorders data set contains 345 samples of 2 types with a total of 7 attributes, each sample being a record of a male individual. The first five variables are blood tests that are considered sensitive to liver disease that can be caused by excessive drinking.
(4) Seeds (N = 210, d = 7, K = 3) includes seeds belonging to three different wheat varieties: Kama, Rosa, and Canadian wheat, with 70 randomly selected elements of each variety.
(6) WDBC (N = 569, d = 32, K = 2). Wisconsin Diagnostic Breast Cancer is a diagnostic breast cancer data set that contains 569 samples of 2 types with 32 attributes. Features were calculated from digital images of fine needle aspiration (FNA) of breast masses.

C. SIMULATION EXPERIMENT AND RESULT ANALYSIS
The five improved bat algorithms are compared with the original bat algorithm, the flower pollination algorithm (FPA) [5], the harmony search algorithm (Harmony) [3], the whale optimization algorithm (WOA) [1], the particle swarm optimization (PSO) algorithm [2], a clustering algorithm combining the whale algorithm and the grey wolf optimizer (WEGWO) [43], chaotic particle swarm optimization (CPSO) [44] and a hybrid PSO and SA clustering algorithm (PSO_SA) [45]. The main parameter settings of each algorithm are shown in Table 3. Clustering experiments were performed on seven real data sets from the UCI database, namely Iris, Wine, Bupa, Seeds, Heartstatlog, WDBC and Wisconsin breast cancer. The effectiveness of a stochastic algorithm depends to a large extent on the choice of the initial solution, so all algorithms in this paper start from randomly generated initial solutions, and each algorithm is executed ten times on each data set to test its validity. The clustering results were evaluated with the F-measure, ARI, and Accuracy performance indicators. The results are shown in Tables 4-10.

It can be seen from Table 4 that the five proposed improved schemes are better than the original bat algorithm in clustering effect and accuracy, and are also somewhat more stable. Compared with the seven swarm intelligent optimization algorithms, the improved bat algorithm improves the clustering effect and accuracy, but it has some shortcomings in stability. For example, on the Iris data set, the standard deviation of the ARI index of the bat algorithms based on the power function and the exponential function is larger than that of FPA, Harmony, WOA, PSO, CPSO and PSO_SA, which shows that the stability of the algorithm is insufficient.
However, in the comparison of the three indicators, the six improved algorithms have better clustering results than the other five typical algorithms, and the best of them is the bat algorithm based on the power function.

Table 5 shows that, on the Wine data set, the six improved BAT algorithms are better than BAT, FPA, Harmony, WOA, WEGWO, CPSO and PSO_SA in clustering results and stability, among which the improved BAT algorithm based on the tangent function has the best effect. The PSO algorithm is superior to some of the improved BAT algorithms in terms of stability, but in terms of clustering effect the six improved methods are better than the BAT, FPA, Harmony, WOA and PSO algorithms, which also reflects the superiority of the improved BAT algorithm to some extent.

In Table 6, on the Bupa data set, the accuracy of the six improved algorithms is higher than that of the eight typical swarm intelligent optimization algorithms. The ARI values of the six improved algorithms are 0.0926, 0.0881, 0.1027, 0.0966, 0.0991 and 0.1086, compared with -0.0016, -0.0037, -0.0027, -0.0044, 0.0062, 0.012, 0.0244 and 0.0184 for the BAT, FPA, Harmony, WOA, PSO, WEGWO, CPSO and PSO_SA algorithms, respectively. The clustering effect has therefore been significantly improved. The F-measure also shows the effectiveness of the six improved bat algorithms in terms of results and stability.

Table 7, on the Seeds data set, shows that the PSO_SA algorithm is slightly superior to the improved algorithm based on the tangent function in accuracy, F-measure and ARI, but the other five improved algorithms are all superior to the other algorithms. However, the improved algorithms are slightly insufficient in stability. Although the Harmony and PSO algorithms are slightly inferior in the comparison of indicators, they are indeed more stable than the improved bat algorithm. This shows that the improved bat algorithm is not stable enough in the clustering process.
It can be seen from Table 8, on the Heartstatlog data set, that the improved bat algorithm based on the Gaussian function has the best F-measure and ARI, while the improved bat algorithm based on the exponential function has the highest clustering accuracy. This shows that high clustering accuracy does not necessarily imply a good clustering effect. Tables 9 and 10 show that the improved bat algorithm based on the exponential function has the best effect on both the WDBC and Cancer data sets, but its standard deviations of F-measure and ARI are not the smallest. Compared with the original bat algorithm, the performance of the improved bat algorithm has been greatly improved, but compared with the other swarm intelligent optimization algorithms, its stability is still slightly insufficient.

Fig. 4 and Fig. 5 show the convergence trends of the F-Measure and ARI indicators over 100 iterations of 11 algorithms on different data sets. Each curve is the average of ten runs. Fig. 5 is a comprehensive display of the accuracy of each algorithm on the different data sets. The simulation results show that the average F-Measure, ARI and Accuracy of the proposed algorithm are better than those of the other algorithms. This indicates that these clusters are well separated in space. The simulation results in the tables show that the hybrid evolution algorithm converges to the global optimum with a small standard deviation, so it can be concluded that the six improved bat algorithms are feasible and robust data clustering techniques.

V. CONCLUSION
Clustering is an important problem that has attracted the attention of many researchers. Meta-heuristic swarm intelligence algorithms are increasingly applied to clustering because their good optimization ability allows them to find the optimal clustering centers effectively. The bat algorithm has the disadvantages of being easily trapped in local minima and of low optimization precision. By improving the bat algorithm, this paper effectively improves its global and local optimization capabilities so that it can better solve clustering problems. The algorithm has been implemented and tested on several known real data sets, and the results obtained are encouraging. Many random search algorithms have the disadvantage of unstable searching. The improved algorithm in this paper also has this problem, and more work is needed in the future. In general, the algorithm proposed in this paper has high precision and low standard deviation, so the improved bat algorithm can be applied to cases where the number of clusters is known.