A Novel Dynamic Clustering Method by Integrating Marine Predators Algorithm and Particle Swarm Optimization Algorithm

Data clustering is the process of identifying natural groupings or clusters in multi-dimensional data based on a certain similarity measure. Aiming at the dynamic clustering problem in which the number of clusters cannot be determined in advance, a hybrid dynamic clustering method based on the marine predators algorithm (MPA) and the particle swarm optimization (PSO) algorithm is proposed. The position update strategy of the PSO algorithm is used to make up for the weakness of MPA in global search. A fixed-length, real-number coding strategy is used to handle the variable-length clustering optimization problem, and an infeasible-solution repair strategy and a penalty-function strategy are adopted to improve the performance of the algorithm and achieve simultaneous optimization of the number of clusters and the cluster centers. The proposed MPA-PSO algorithm is compared with the PSO algorithm, MPA, the differential evolution (DE) algorithm, the spotted hyena optimizer (SHO), the lightning search algorithm (LSA), and the equilibrium optimizer (EO) in clustering simulation experiments on four artificial data sets and six real data sets (Iris, Wine, Wisconsin breast cancer, Vowel, Seeds, and Wdbc) from the UCI database. Three performance indicators (the number of clusters, ARI, and accuracy) are used to evaluate the clustering results. The experimental results show that the proposed method not only successfully finds the correct number of clusters but also obtains stable results on most test problems.


I. INTRODUCTION
Data clustering is the process of identifying natural groupings or clusters in multi-dimensional data based on a certain similarity measure (such as Euclidean distance) [1], [2]. A cluster is usually identified by a cluster center or centroid [3]. In recent years, clustering has been widely used in engineering, computer science, biology and medicine, and social science and economics [4]. Since the data may have different shapes and sizes, it is usually not known in advance how many clusters should be formed [5]. To determine the number of clusters, cluster validity functions can be used to obtain an effective cluster count, and many effective cluster validity indexes have been proposed.

The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan.
Although the above cluster validity functions are widely used, much research has focused on developing clustering algorithms that do not require a predefined number of clusters. This research direction is called automatic clustering or dynamic clustering, and many such algorithms have been developed in recent years. The optimization ability of the genetic algorithm was used to automatically find the number of clusters and an appropriate clustering of any data set [18]. To encode a variable number of clusters, a string encoding consisting of real numbers and don't-care symbols was adopted, and a multi-objective clustering algorithm, MIE-MOCK, was proposed to improve clustering accuracy; instead of a single crossover operator and mutation operator, the algorithm uses randomly selected crossover and mutation operators to increase search diversity [19]. By using a so-called multi-information-exchange operator to increase search diversity, this algorithm can ultimately improve clustering quality. The imperialist competitive algorithm (ICA) was applied to automatic clustering problems, and a new method based on random and homogeneity-based merge-splitting operations was proposed to change the number of clusters [20]; the experimental results show that the algorithm has advantages in convergence speed and solution quality. A new symmetry-based genetic clustering algorithm was proposed to automatically evolve the number of clusters from the data set and partition the clusters appropriately, together with a new point-symmetry-based clustering validity index (Sym index) to measure the effectiveness of clustering [21]. Hong Yu et al. proposed an effective automatic clustering method and extended the decision-theoretic rough set model to clustering [22]. The automatic clustering problem was also studied as a multi-objective optimization (MOO) problem in which two clustering validity indexes are optimized at the same time [23]. Cong Liu et al.
designed three adaptive coding schemes in which fixed-length chromosomes were used to deal with variable-length optimization problems and automatically detect the number of clusters [24]. A multi-elite particle swarm optimization (MEPSO) method was proposed for clustering complex and linearly inseparable data sets; the kernel function maps linearly inseparable data from the original input space into homogeneous groups in a transformed high-dimensional feature space [25]. A new locally adjusted differential evolution algorithm (DELA) was proposed to find the optimal number of clusters [26]. An improved differential evolution (DE) algorithm was proposed to solve the automatic clustering problem, with a best-solution effect and an acceleration factor introduced to improve convergence [27]. Mahamed proposed a dynamic clustering method based on particle swarm optimization (DCPSO), which was applied to the segmentation of synthetic and natural images [28]. R. J. Kuo et al. proposed a dynamic clustering method (DCGA) based on particle swarm optimization (PSO) and the genetic algorithm (GA) [29]. Shih-Ming Pan proposed an automatic clustering algorithm framework (called ETSAs) that does not require the user to supply values for the required parameters, including the number of clusters [30]; the simulation results show that ETSA is superior in finding the correct number of clusters. This paper proposes a hybrid MPA-PSO algorithm that combines the marine predators algorithm (MPA) and the particle swarm optimization (PSO) algorithm to solve dynamic clustering problems. A fixed-length, real-number coding strategy is used to deal with the variable-length clustering optimization problem and realize automatic clustering, that is, to find the optimal number of clusters and the cluster centers simultaneously.
The effectiveness of the proposed method is verified by simulation experiments comparing the MPA-PSO algorithm with several other intelligent optimization algorithms on four artificial data sets and six UCI data sets.

II. HYBRID ALGORITHM BY INTEGRATING MARINE PREDATORS ALGORITHM AND PARTICLE SWARM OPTIMIZATION ALGORITHM (MPA-PSO)

A. MARINE PREDATORS ALGORITHM
The main inspiration of the marine predators algorithm (MPA) is the extensive foraging strategy of marine predators, namely Lévy flight and Brownian motion, and the optimal-encounter-rate policy in predator-prey interactions [31]. Marine predators use the Lévy strategy in environments with a low concentration of prey and Brownian motion in areas with abundant prey. The speed ratio v of prey to predator represents a trade-off between Lévy flight and Brownian motion.
1) At a low speed ratio (v = 0.1), the best strategy for the predator is Lévy flight, whether the prey moves in Brownian motion or Lévy flight.
2) At a unit speed ratio (v = 1), if the prey moves in Lévy flight, the best strategy for the predator is Brownian motion; otherwise, the best scenario depends on the size of the system.
3) At a high speed ratio (v ≥ 10), the best strategy for the predator is to remain completely immobile, whether the prey moves in Brownian motion or Lévy flight.
The mathematical model of the MPA algorithm is introduced in detail below. First, a set of solutions is initialized uniformly in the search space:

X_0 = X_min + rand ⊗ (X_max − X_min)    (1)

where X_min and X_max are the lower and upper bounds of the variables and rand is a uniform random vector in the range [0, 1]. According to the survival-of-the-fittest theory, the top predators in nature are the most talented foragers. Therefore, the best solution is designated as the top predator and used to construct a matrix called the Elite matrix, which guides the search for prey based on the prey's position information.
Elite = [X^I_1,1 … X^I_1,d; X^I_2,1 … X^I_2,d; … ; X^I_n,1 … X^I_n,d] (n × d)    (2)

where X^I represents the top predator vector, which is copied n times to construct the Elite matrix; n is the number of search agents and d is the search dimension.
Another matrix, with the same dimensions as the Elite matrix, is called the Prey matrix; the predators update their positions based on it. The initial Prey matrix is created at initialization, and the Elite matrix is built from the fittest agent (the top predator). The Prey matrix is defined as follows:

Prey = [X_1,1 … X_1,d; X_2,1 … X_2,d; … ; X_n,1 … X_n,d] (n × d)    (3)

where X_i,j represents the jth dimension of the ith prey.
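The initialization and the construction of the Elite matrix can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import random

def init_population(n, d, x_min, x_max):
    """Uniform random initialization in [x_min, x_max], one row per agent (Eq. (1))."""
    return [[x_min[j] + random.random() * (x_max[j] - x_min[j])
             for j in range(d)] for _ in range(n)]

def elite_matrix(population, fitness):
    """Copy the fittest agent (the top predator) n times to form the Elite matrix."""
    best = min(population, key=fitness)
    return [list(best) for _ in population]

# Prey matrix: the randomly initialized population itself.
prey = init_population(n=5, d=2, x_min=[0.0, 0.0], x_max=[1.0, 1.0])
elite = elite_matrix(prey, fitness=lambda x: sum(v * v for v in x))
```

Here a simple sphere function stands in for the clustering fitness; in the actual method the fitness of Section III-B would be used.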
In the main loop of the MPA algorithm, the optimization process is divided into three stages according to the speed ratio. The model of each stage is described as follows.

1) HIGH SPEED RATIO STAGE (v ≥ 10)
Exploration in this stage can be expressed as follows. While Iter < (1/3)·Max_Iter:

stepsize_i = R_B ⊗ (Elite_i − R_B ⊗ Prey_i),  i = 1, …, n    (4)
Prey_i = Prey_i + P·R ⊗ stepsize_i    (5)

where R_B is a vector of random numbers drawn from the normal distribution, representing Brownian motion; ⊗ denotes element-wise multiplication; P = 0.5 is a constant; and R is a vector of uniform random numbers in [0, 1]. This stage occurs in the first third of the iterations, when the step size (moving speed) and the exploration capability are high. Iter is the current iteration and Max_Iter is the maximum number of iterations.

2) UNIT SPEED RATIO STAGE (v = 1)
This stage occurs in the middle of the optimization, when exploration is gradually converted to exploitation. Both exploration and exploitation matter here, so half of the population is designated for exploration and the other half for exploitation: the prey are responsible for exploitation, and the predators for exploration. While (1/3)·Max_Iter < Iter < (2/3)·Max_Iter, for the first half of the population:

stepsize_i = R_L ⊗ (Elite_i − R_L ⊗ Prey_i),  i = 1, …, n/2    (6)
Prey_i = Prey_i + P·R ⊗ stepsize_i    (7)

and for the second half of the population:

stepsize_i = R_B ⊗ (R_B ⊗ Elite_i − Prey_i),  i = n/2 + 1, …, n    (8)
Prey_i = Elite_i + P·CF ⊗ stepsize_i    (9)

where R_L is a vector of random numbers drawn from the Lévy distribution, representing Lévy flight. In this stage, the first half of the prey moves with the Lévy strategy, while the other half moves with the Brownian strategy.
CF is an adaptive parameter used to control the step length of the predator's movement, with the expression shown in Eq. (10):

CF = (1 − Iter/Max_Iter)^(2·Iter/Max_Iter)    (10)
3) LOW SPEED RATIO STAGE (v = 0.1)
This stage occurs at the end of the optimization process, when the predator moves faster than the prey, and is mainly associated with high exploitation capability. At a low speed ratio, the best strategy for the predator is Lévy flight. While Iter > (2/3)·Max_Iter:

stepsize_i = R_L ⊗ (R_L ⊗ Elite_i − Prey_i),  i = 1, …, n    (11)
Prey_i = Elite_i + P·CF ⊗ stepsize_i    (12)
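The three MPA stages can be sketched per agent as follows. This is an illustrative implementation, not the authors' code: the Lévy step uses Mantegna's algorithm (a common construction), and the population-halving of the middle stage is reduced to a `first_half` flag for a single agent.

```python
import math
import random

def levy_step(beta=1.5):
    """Levy-distributed step via Mantegna's algorithm (a common choice)."""
    num = math.gamma(1 + beta) * math.sin(math.pi * beta / 2)
    den = math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2)
    sigma = (num / den) ** (1 / beta)
    return random.gauss(0, sigma) / abs(random.gauss(0, 1)) ** (1 / beta)

def mpa_update(prey, elite, it, max_it, first_half=True, P=0.5):
    """One MPA position update for a single d-dimensional agent,
    following the three-stage step-size rules described above."""
    cf = (1 - it / max_it) ** (2 * it / max_it)  # adaptive step control, Eq. (10)
    new = []
    for j in range(len(prey)):
        R = random.random()
        if it < max_it / 3:                       # stage 1: Brownian exploration
            rb = random.gauss(0, 1)
            new.append(prey[j] + P * R * rb * (elite[j] - rb * prey[j]))
        elif it < 2 * max_it / 3 and first_half:  # stage 2, prey half: Levy
            rl = levy_step()
            new.append(prey[j] + P * R * rl * (elite[j] - rl * prey[j]))
        elif it < 2 * max_it / 3:                 # stage 2, predator half: Brownian
            rb = random.gauss(0, 1)
            new.append(elite[j] + P * cf * rb * (rb * elite[j] - prey[j]))
        else:                                     # stage 3: Levy exploitation
            rl = levy_step()
            new.append(elite[j] + P * cf * rl * (rl * elite[j] - prey[j]))
    return new

new_pos = mpa_update([0.5, 0.5], [1.0, 1.0], it=10, max_it=100)
```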
Another cause of behavioral change in marine predators is environmental effects, such as eddy formation or the fish aggregating device (FADs) effect. Sharks spend more than 80% of their time near FADs; for the remaining 20% of the time, they may take longer jumps in different dimensions to find an environment with a different prey distribution. FADs are treated as local optima, modeling the trapping of agents at these points in the search space. The longer jumps taken during the simulation help to avoid stagnation in local optima. The FADs effect can be expressed as:

Prey_i = Prey_i + CF·[X_min + R ⊗ (X_max − X_min)] ⊗ U,  if r ≤ FADs    (13)
Prey_i = Prey_i + [FADs·(1 − r) + r]·(Prey_r1 − Prey_r2),  if r > FADs

where U is a binary vector of 0s and 1s, constructed by generating a random vector in [0, 1] and setting each element to 0 if it is less than 0.2 and to 1 if it is greater than 0.2; r is a uniform random number in [0, 1]; r1 and r2 are random indexes into the Prey matrix; and FADs = 0.2 is the probability that the FADs effect influences the optimization process.

B. PARTICLE SWARM OPTIMIZATION ALGORITHM
The particle swarm optimization (PSO) algorithm is a global stochastic search algorithm based on swarm intelligence, proposed by Kennedy and Eberhart and inspired by the migration and flocking behavior of birds during foraging [32]. The velocity and position update formulas of the PSO algorithm are:

v_id(t+1) = ω·v_id(t) + c1·φ1·(P_id − x_id(t)) + c2·φ2·(P_gd − x_id(t))    (14)
x_id(t+1) = x_id(t) + v_id(t+1)    (15)
where t is the iteration index; i is the particle index; d is the dimension index; c1 and c2 are learning factors; φ1 and φ2 are random numbers in [0, 1]; ω is the inertia weight; P_id is the best position found by the ith particle so far; and P_gd is the best position found so far by the whole swarm (or by a neighborhood).
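The PSO velocity and position update rules translate directly into code (a minimal sketch; the parameter values w, c1, c2 are illustrative defaults, not the paper's settings):

```python
import random

def pso_update(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0):
    """One PSO velocity/position update for a single particle."""
    new_v, new_x = [], []
    for d in range(len(x)):
        phi1, phi2 = random.random(), random.random()
        vd = w * v[d] + c1 * phi1 * (pbest[d] - x[d]) + c2 * phi2 * (gbest[d] - x[d])
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v

# When a particle already sits at both its personal and the global best,
# only the inertia term remains, regardless of the random factors.
x, v = [1.0, 2.0], [0.1, -0.2]
nx, nv = pso_update(x, v, pbest=x, gbest=x, w=0.5)
```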

C. HYBRID ALGORITHM BY INTEGRATING MARINE PREDATORS ALGORITHM AND PARTICLE SWARM OPTIMIZATION ALGORITHM (MPA-PSO)
As a global optimization algorithm, MPA divides the iterations into stages with different roles: global exploration in the first third, half-exploration and half-exploitation in the middle third, and local exploitation in the last third. Since exploration is the first step toward the optimum and should point the subsequent local exploitation in a promising direction, a weak initial global search is a handicap, and the global search of the original MPA in its first stage is not ideal. The PSO algorithm is an effective swarm intelligence optimizer whose global and local search abilities are well established. Therefore, this article combines MPA and the PSO algorithm to improve the optimization efficiency of MPA, replacing the first-stage strategy of MPA with the position update strategy of the PSO algorithm. The flowchart of the proposed MPA-PSO algorithm is shown in Fig. 1.

III. DYNAMIC CLUSTERING METHOD BASED ON MPA-PSO ALGORITHM

A. CODING METHOD
In the automatic clustering problem, there are two main subproblems: determining the number of clusters and determining the cluster centers. To address both, this paper adopts a fixed-length, real-number coding method [33]. Each particle in this coding strategy consists of two parts: one part represents the cluster centers, and the other determines the number of clusters. For n particles, each containing d-dimensional data and a specified maximum number of clusters K_max, each particle X_i(t) at iteration t is a matrix with K_max rows and d + 1 columns. The single-particle coding strategy is shown in Fig. 2. The entries m_i,j in the first d columns are the cluster centers, each containing d-dimensional data. The last column holds K_max floating-point numbers in [0, 1], named activation thresholds, each of which determines whether the corresponding cluster is activated (that is, whether it is actually used to classify the data). When Y_i,j > 0.5, the corresponding cluster center is activated. Fig. 3 gives an example of a single particle encoding, i.e., the position coordinates for a specific set of parameters: the cluster centers are 3-dimensional, K_max is 5, and thresholds 1, 2, and 5 are greater than 0.5, so the corresponding cluster centers are activated and can be used to partition the data set (marked in bold in Fig. 3). The quality of a particle's partition can then be judged by an appropriate clustering validity index.
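The decoding of this fixed-length encoding can be illustrated as follows (a sketch reproducing the Fig. 3 example; the data layout and names are ours):

```python
def active_centers(particle, threshold=0.5):
    """Return the cluster centers whose activation value exceeds the threshold.
    `particle` is a list of K_max (center, activation) rows."""
    return [center for center, act in particle if act > threshold]

# Example mirroring Fig. 3: K_max = 5, 3-dimensional centers; thresholds
# 1, 2 and 5 exceed 0.5, so exactly three clusters are active.
particle = [([1.0, 2.0, 3.0], 0.9),
            ([4.0, 5.0, 6.0], 0.7),
            ([0.1, 0.2, 0.3], 0.2),
            ([7.0, 8.0, 9.0], 0.4),
            ([2.0, 2.0, 2.0], 0.8)]
centers = active_centers(particle)
```

Because the particle length is always K_max rows, the optimizer can manipulate a fixed-size vector even though the effective number of clusters varies.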

B. FITNESS FUNCTION
In this study, the PBMF index, a product-form validity index proposed by Pakhira et al. [14], is used. It consists of three factors. The first factor, 1/K, decreases as the number of clusters increases. The second factor is the ratio E_1/E_K between the total scatter of the whole data set and the within-cluster scatter when the data are divided into K clusters; the numerator E_1 is fixed for a given data set, while E_K measures the compactness of the K clusters, so this factor strengthens the PBM index as K increases. The third factor, D_K, is the maximum distance between cluster centers and represents the separation between clusters. As K increases, the first factor decreases while the other two increase; the three factors therefore compete with and balance each other, promoting both compactness and separation of the clusters. Maximizing the PBMF index ensures that a small number of compact clusters are formed with a large gap between at least two of them. The PBM index is defined as:

PBM(K) = ((1/K) · (E_1/E_K) · D_K)^2

with E_K = Σ_{i=1..K} Σ_{j=1..n} u_ij · ||x_j − z_i|| and D_K = max_{i,j} ||z_i − z_j||, where z_i is the center of the ith cluster.
where u_ij is the membership function. Since this paper performs hard classification, u_ij equals 1 if x_j belongs to the ith cluster and 0 otherwise. This paper uses the reciprocal of the PBMF index as the fitness function, so the task becomes a minimization problem.

C. INFEASIBLE SOLUTION PROCESSING STRATEGY
Infeasible solutions are common in any heuristic method. In automatic clustering, an infeasible solution appears when fewer than 2 clusters are active, that is, when the number of clusters is less than 2. In this mechanism, if the number of active clusters of a particle is less than 2, a random integer a between 2 and K_max is drawn and a clusters are activated, which also increases the randomness of the algorithm.
The pseudocode of the infeasible-solution processing strategy is as follows.

Given particle X_i(t), let N be the number of activated clusters.
If N < 2:
    Generate a random integer a (2 ≤ a ≤ K_max)
    Sort the thresholds (together with their cluster centers) in descending order
    Set the first a thresholds of the sorted particle to 1, activating a cluster centers
In addition, all data points are assigned to the nearest activated cluster, and the size of each cluster is then checked: a cluster is valid only if it contains at least two data points. The penalty strategy adopted here adds a large real number α to the fitness value of a particle violating this condition, worsening its apparent clustering quality and effectively preventing the infeasible solution from being selected as the optimal solution.
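The repair and penalty mechanisms can be sketched together (variable names and the penalty constant are ours; `alpha` stands for the large real number α added by the paper's penalty strategy):

```python
import random

def repair(thresholds, k_max):
    """If fewer than two clusters are active, activate the a largest
    thresholds for a random a in [2, k_max]."""
    if sum(1 for t in thresholds if t > 0.5) >= 2:
        return list(thresholds)
    a = random.randint(2, k_max)
    order = sorted(range(len(thresholds)),
                   key=lambda i: thresholds[i], reverse=True)
    repaired = list(thresholds)
    for i in order[:a]:
        repaired[i] = 1.0   # threshold 1.0 > 0.5 activates the cluster
    return repaired

def penalized_fitness(raw_fitness, cluster_sizes, alpha=1e6):
    """Add a large penalty alpha when any cluster holds fewer than two points."""
    if any(size < 2 for size in cluster_sizes):
        return raw_fitness + alpha
    return raw_fitness

fixed = repair([0.1, 0.2, 0.3, 0.4, 0.1], k_max=5)
```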

D. ALGORITHM FLOWCHART
Step 1: Initialize each particle with K_max randomly selected cluster centers and K_max activation thresholds in [0, 1].
Step 2: For each particle:
1) Use the rules described in Section III-A to find the active cluster centers and compute the fitness value of the particle.
2) Compute and compare the distances between the activated cluster centers and the data samples to obtain the corresponding cluster partition.
3) When an infeasible solution is encountered, correct it through the repair and penalty strategies.
Step 3: Update the particle positions with the MPA-PSO update rules of Section II and repeat Step 2 until t = t_max.
Step 4: Take the cluster centers and the partition produced by the best particle (the particle with the best fitness value) at t = t_max as the final solution.

IV. SIMULATION EXPERIMENTS

A. PERFORMANCE INDICATORS
Three indicators are used to evaluate the clustering effect: the number of clusters, the adjusted Rand index (ARI), and the accuracy rate.

1) ADJUSTED RAND INDEX
The Rand index [34] requires the ground-truth class labels C. Let K be the clustering result, a the number of pairs of elements that belong to the same class in both C and K, and b the number of pairs of elements that belong to different classes in both C and K. The Rand index is then defined as:

RI = (a + b) / C(n_samples, 2)

where C(n_samples, 2) is the total number of element pairs that can be formed from the data set. The value range of RI is [0, 1]; the larger the RI value, the more consistent the clustering result is with the ground truth.
For random results, RI is not guaranteed to be close to zero. To ensure that the index is close to zero for random clustering results, the adjusted Rand index (ARI) [35] was proposed; it has a higher degree of discrimination:

ARI = (RI − E[RI]) / (max(RI) − E[RI])

The value range of ARI is [−1, 1]; the larger the value, the more consistent the clustering result is with the ground truth. In a broad sense, ARI measures the degree of similarity between two data partitions.
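The ARI can be computed from the pair-counting contingency table with the standard formula (a pure-Python sketch; not tied to any particular library):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index via pair counting over the contingency table."""
    cells = Counter(zip(labels_true, labels_pred))
    rows, cols = Counter(labels_true), Counter(labels_pred)
    sum_ij = sum(comb(n, 2) for n in cells.values())   # agreeing pairs
    sum_a = sum(comb(n, 2) for n in rows.values())
    sum_b = sum(comb(n, 2) for n in cols.values())
    n_pairs = comb(len(labels_true), 2)
    expected = sum_a * sum_b / n_pairs                  # E[RI] under randomness
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Cluster labels are arbitrary: a relabeled perfect partition still scores 1.0.
score = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # -> 1.0
```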

2) ACCURACY
Accuracy is defined, for a given test data set, as the ratio of the number of correctly classified samples to the total number of samples, i.e., the accuracy on the test set under the 0-1 loss function [36].
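For clustering, the predicted cluster labels are arbitrary, so computing accuracy needs a mapping from clusters to classes. A common convention, which we use here since the paper does not spell out its mapping, maps each cluster to its majority true label:

```python
from collections import Counter

def clustering_accuracy(labels_true, labels_pred):
    """Map each predicted cluster to its majority true label, then return
    the fraction of samples that land on their cluster's majority label."""
    correct = 0
    for cluster in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels_true)

acc = clustering_accuracy([0, 0, 0, 1], [7, 7, 9, 9])  # 3 of 4 samples match
```

Note that this mapping is not forced to be one-to-one; a stricter variant assigns clusters to classes with an optimal (e.g., Hungarian) matching.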

B. EXPERIMENTAL DATA SETS
This paper selects four artificial data sets and six real data sets from the UCI database for the clustering simulation experiments. The selected UCI data sets are Iris, Wine, Wisconsin breast cancer, Vowel, Seeds, and Wdbc. The four artificial data sets contain linearly inseparable clusters of different shapes. The scatter plots of the artificial data sets are shown in Fig. 4, and information such as the number of samples, the number of clusters, and the data dimensions is listed in Table 1.

C. SIMULATION EXPERIMENTS AND RESULT ANALYSIS
This paper compares the improved MPA-PSO algorithm with several swarm intelligence optimization algorithms in simulation experiments: the particle swarm optimization (PSO) algorithm [32], the marine predators algorithm (MPA) [31], the differential evolution (DE) algorithm [37], the spotted hyena optimizer (SHO) [38], the lightning search algorithm (LSA) [39], and the equilibrium optimizer (EO) [40]. The specific parameter settings of each algorithm are listed in Table 2, and the maximum number of clusters is set to K_max = 20. Six real data sets from the UCI database and four artificial data sets are used in the clustering experiments. Since the effectiveness of stochastic algorithms depends heavily on the choice of initial solutions, all algorithms in this paper use randomly generated initial solutions, and each algorithm is executed ten times on each data set. Three performance indicators (the number of clusters, ARI, and accuracy) are used to evaluate the clustering results; the running results are shown in Tables 3-12. Table 3 shows that although the MPA-PSO algorithm partitions the Iris data correctly, it is slightly inferior to the DE algorithm in accuracy and ARI; overall, the MPA-PSO algorithm still achieves a good clustering effect. As seen from Table 4, the MPA-PSO algorithm obtains good results on the Wine data set both in the number of clusters and in clustering quality. As seen from Table 5, all algorithms other than SHO cluster correctly in all ten runs, and notably the MPA-PSO algorithm achieves the best clustering effect. For the Vowel data in Table 6, the MPA-PSO algorithm partitions the data markedly well, and the obtained accuracy and ARI are also better than those of the other algorithms, although its stability is a weakness.
The correct partition of the Seeds and Wdbc data by the MPA-PSO algorithm also shows its superiority. For the artificial data sets, Table 9 shows that in the automatic clustering of the Data2_2 data set, the accuracy of all algorithms except the PSO algorithm and SHO reaches 100%, while for the Data3_2 data set the MPA-PSO algorithm shows high accuracy and stability in automatic clustering. In the clustering of the Data5_2 and Data4_3 data sets, the MPA-PSO algorithm not only finds the correct number of clusters but also achieves a better clustering effect than the other algorithms. Fig. 6 gives a comprehensive display of the accuracy of all the optimization algorithms on the different data sets. The simulation results show that the proposed MPA-PSO algorithm outperforms the other algorithms in the number of clusters, ARI, and accuracy, and the tabulated results show that the hybrid algorithm converges to the global optimum with a small standard deviation. Therefore, the proposed MPA-PSO algorithm is a feasible and robust data clustering technique.

V. CONCLUSION
This paper first combines the MPA and PSO algorithms to make up for the weakness of MPA in global search, and then adopts the MPA-PSO algorithm to avoid presetting the number of clusters, finding an appropriate number of clusters according to the data characteristics. Four artificial data sets and six real data sets from the UCI database are used to verify that the MPA-PSO algorithm can obtain the correct number of clusters and better clustering results. The MPA-PSO algorithm is compared with six other algorithms (PSO, MPA, DE, SHO, LSA, and EO), and the comparison confirms that it is a feasible and robust data clustering technique.