Automatic Clustering of DNA Sequences With Intelligent Techniques

With the discovery of new DNAs, a fundamental problem is how to categorize those DNA sequences into the correct species. Identifying all data groups correctly, and assigning a set of DNAs to k clusters where k must be predefined, are major drawbacks of clustering analysis, especially when the data have many dimensions and the number of clusters is large and hard to guess. Furthermore, finding a similarity measure that preserves the functionality and represents both the composition and distribution of the bases in a DNA sequence is one of the main challenges in computational biology. In this paper, a new soft computing metaheuristic framework is introduced for automatic clustering to generate the optimal cluster formation and to determine the best estimate for the number of clusters. A pulse coupled neural network (PCNN) is utilized to calculate DNA sequence similarity or dissimilarity. The bat algorithm is hybridized with the well-known genetic algorithm to solve the automatic data clustering problem. Extensive computational experiments are conducted on the expanded human oral microbiome database (eHOMD). A comparative study of the experimental results shows that the proposed hybrid algorithm achieved superior performance over the standard genetic algorithm and bat algorithm. Moreover, the performance of the hybrid was compared with competing algorithms from the literature to ascertain its superiority. The Mann-Whitney-Wilcoxon rank-sum test is conducted to statistically validate the obtained clusters.


I. INTRODUCTION
Clustering is an unsupervised problem that aims to group similar objects together and to discover similar structures in unlabeled data without any prior knowledge [1], [2]. Many clustering algorithms have been developed so that objects in the same cluster are similar to each other while objects in different clusters are dissimilar [3], [4]. Clustering algorithms have been widely applied in various fields such as data analysis [5], [6], data mining [7], [8], machine learning [9], and image retrieval [10]. DNA is the code of life; it is composed of a sequence of four nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T). To learn an unknown DNA sequence and reveal the evolutionary information of the same gene in several species, researchers seek the functions that describe it and compare it with known DNAs based on a proximity measure, either a similarity or a dissimilarity, following certain strategies. Clustering of DNA sequences is now considered an essential task in computational biology and bioinformatics. Although a massive amount of DNA/RNA is being sequenced, most of these sequences have unknown structures and functions [11].
Traditional methods such as k-means and its variants have been applied to this problem. Although k-means has solved many clustering problems, it converges to a local minimum under certain conditions [12], and it is very sensitive to noise. Moreover, the number of clusters must be predetermined, and it is hard to provide a number of clusters that yields the optimal cluster formation, especially when the user must guess it through tedious trial and error. Besides, finding an optimal solution for this problem is NP-hard when the number of clusters is greater than 3 [13]. To overcome these difficulties, many attempts were made to solve clustering problems using metaheuristic techniques. Researchers viewed the clustering problem as an optimization problem, which aims at optimizing an objective function. From a simple perspective, optimization algorithms are classified as deterministic or stochastic. An algorithm that works without any randomness is called deterministic. An algorithm with a random component is called stochastic; examples are genetic algorithms, particle swarm optimization (PSO) [14], the firefly algorithm [15], and the bat algorithm. Algorithms with a stochastic nature are often referred to as metaheuristics in the recent literature [16], [17].
In this study, a method based on the pulse coupled neural network, introduced by Xin Jin et al. [18], is applied to find the similarity or dissimilarity of DNA sequences. The DNA is transformed into a numerical sequence using a four-number mapping scheme that represents the DNA effectively without losing any genetic information. The method handles DNAs of different sizes and takes into consideration both local and global features; therefore, it is adopted. Also, a hybrid of two algorithms, the genetic algorithm and the bat algorithm, is introduced in this paper; it is referred to as GABAT. The proposed architecture is capable of finding a proper number of clusters and assigning the DNAs to these clusters at the same time.
The associate editor coordinating the review of this manuscript and approving it for publication was Qilian Liang.
The rest of this paper is organized as follows. Section II introduces the soft computing techniques utilized in this research. Section III reviews related work. Section IV illustrates the proposed framework. Section V describes the chosen data set and discusses the results of the conducted experiments. Finally, Section VI concludes the paper.

II. SOFT COMPUTING TECHNIQUES

A. PULSE COUPLED NEURAL NETWORK (PCNN)
The pulse coupled neural network (PCNN) was introduced in 1993 based on visual cortex theory. In this study, PCNN is utilized to compute the distance between DNAs as proposed in [18]. PCNN is commonly used in image processing, where the input image can be either color or grayscale. Applying n PCNN iterations to the original image yields a sequence of n binary (0, 1) images. The number of 1's in each binary image is counted, producing a vector of n numbers. This vector is called the signature of the image. More information about PCNN is found in [19].
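As a minimal sketch of the signature idea, the snippet below counts the 1's in each binary image of a sequence. The function name `image_signature` and the three toy 2x2 images are illustrative assumptions, not taken from the paper.

```python
def image_signature(binary_images):
    """Return one number per binary image: how many pixels equal 1."""
    return [sum(sum(row) for row in image) for image in binary_images]

# Three hypothetical 2x2 binary outputs, as if produced by three PCNN iterations.
outputs = [
    [[0, 1], [1, 1]],
    [[0, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
signature = image_signature(outputs)
```

The resulting vector has one entry per iteration, so its length equals the number of PCNN iterations.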

B. GENETIC ALGORITHM (GA)
The genetic algorithm (GA) is a randomized numerical search and optimization algorithm that was introduced by John Holland in the early 1970s. GA emulates principles of Darwin's classical theory of natural evolution such as inheritance, mutation, selection, and crossover [20]. It has been widely applied to many problems (e.g., clustering and classification) to find a near-optimal solution [21]. In general, the genetic strategy starts by generating a set of strings called the population. Each string is encoded randomly as a binary or floating-point string and is called an individual or a chromosome. Each iteration of a genetic algorithm is known as a generation, and the number of generations may be specified by the user. In each generation, a new population emerges from the previous one through the following steps:
(1) fitness evaluation: each individual in the population is evaluated using a fitness function that measures the goodness of a string, and individuals with better fitness are favored. Consequently, the fitter the individual, the more likely it is to survive by passing its traits to the next generation.
(2) selection: roulette-wheel selection is applied to choose the parents for the reproduction phase.
(3) mating: crossover between the selected individuals is applied to produce new offspring for the next generation.
(4) mutation: each individual in the offspring undergoes mutation with a predetermined probability.
These steps are repeated until a satisfactory solution is found or a fixed number of generations has elapsed. Many clustering techniques based on genetic algorithms have been studied on different data sets to find the optimal solution [22].
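The four steps above can be sketched as one generation of a toy binary GA. This is only an illustration under stated assumptions: the helper names, the toy fitness (number of ones, to be maximized), and the mutation rate are hypothetical, and the sketch is not the implementation used in the paper.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0, sum(fitness))
    running = 0.0
    for individual, score in zip(population, fitness):
        running += score
        if running >= pick:
            return individual
    return population[-1]

def single_point_crossover(a, b, rng):
    """Join the head of parent a with the tail of parent b."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def bit_flip_mutation(chromosome, rate, rng):
    """Flip each gene independently with the given probability."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def next_generation(population, fitness_fn, mutation_rate, rng):
    fitness = [fitness_fn(ind) for ind in population]      # (1) evaluation
    offspring = []
    for _ in range(len(population)):
        p1 = roulette_select(population, fitness, rng)     # (2) selection
        p2 = roulette_select(population, fitness, rng)
        child = single_point_crossover(p1, p2, rng)        # (3) mating
        offspring.append(bit_flip_mutation(child, mutation_rate, rng))  # (4)
    return offspring

rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(6)]
# Toy fitness (higher is better): number of ones, plus 1 so no weight is zero.
new_population = next_generation(population, lambda c: sum(c) + 1, 0.05, rng)
```

Roulette-wheel selection as written assumes a fitness to be maximized; for a minimized fitness, the weights would first need to be transformed, for example by inverting them.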

C. BAT ALGORITHM
The bat algorithm is a relatively recent nature-inspired metaheuristic algorithm originally developed by Xin-She Yang in 2010. Lately, the bat algorithm has caught scientists' interest because of its good performance in solving unimodal problems such as clustering. It was inspired by the behavior of microbats, which avoid obstacles and detect prey via echolocation. The idealized rules of the bat algorithm can be summarized as follows:
1. All bats use echolocation to avoid obstacles or detect prey, and they can tell the difference between the two.
2. Bats fly randomly with a velocity $V_i$ at position $X_i$ with a frequency varying between $f_{min}$ and $f_{max}$. They can automatically adjust the rate of pulse emission $r \in [0, 1]$ depending on the proximity of the target, and they use a varying loudness $A_0$ to locate the prey.
3. Loudness varies from $A_0$ to a minimum constant value $A_{min}$.
More information about the nature-inspired bat algorithm is found in [23].

III. RELATED WORK
During the last decades, research in cluster analysis has been devoted to assigning data elements to groups through evolutionary metaheuristic techniques and their applications. However, only a few attempts have been recorded to automatically determine the optimal number of clusters. Most studies propose clustering techniques that require the number of clusters to be predetermined. In 2001, Tseng and Yang [24] presented the earliest attempt at automatic clustering based on a genetic approach, namely, CLUSTERING. The algorithm consists of two stages. In the first stage, the authors compute the distance between the objects based on the average nearest-neighbor distance to find the connected components, so that objects close to each other are grouped. This stage aims to reduce the complexity of the second stage. The second stage is the genetic algorithm, where the small clusters obtained are used as seeds for generating larger ones. The user may specify the number of generations, which was equal to 100 in this experiment. The user may also specify a parameter that affects the compactness or the enlargement of the clusters. The authors considered two artificial data sets that consist of groups of points with different sizes and one real-life data set containing 20,000 spectral feature vectors derived from 40 speeches. The CLUSTERING algorithm was compared to k-means, single-link, and complete-link clustering. The run time of the proposed algorithm is $O(n^2 + GNm^2)$. The first stage takes $O(n^2)$, where n is the size of the data set, and each generation of the second stage takes $O(Nm^2)$, where N denotes the population size, m represents the chromosome length, and G specifies the number of generations the user wants to run the algorithm. A hierarchical genetic algorithm (HGA) was applied by Lai [25] for automatic clustering. In HGA, the chromosome is made up of two parts: the control genes and the parametric genes.
The control genes are coded as binary strings that activate the associated parametric genes when the value of a control gene is "1" and deactivate them when it is "0". The total number of 1's represents the number of clusters, and the parametric genes represent the real coordinates of cluster centers that were randomly picked from the data set. The author used the Davies-Bouldin index as a fitness function, which is the ratio of the sum of within-cluster scatter to between-cluster separation. The number of generations in this experiment was 500 and the population size was 20. The proposed approach was applied to five artificial and two real-life data sets, and it showed promising results.
Maulik and Bandyopadhyay presented a genetic algorithm for clustering in [26], namely, GA-clustering. The algorithm searches for appropriate cluster centers, where K must be determined beforehand. A floating-point chromosome representation was used, where the cluster centers were randomly picked from the data set. The algorithm aims to minimize the sum of the Euclidean distances between each point and its respective center. The proposed algorithm was compared with k-means on four artificial and three real-life data sets. Both algorithms terminated after a fixed number of iterations, which was equal to 100. The genetic algorithm did not show any undesirable output, while k-means got stuck at sub-optimal solutions. Although the algorithm shows good performance, the user faces an obvious obstacle in choosing a proper number of clusters for unlabeled data sets [27].
In 2008, Liu et al. introduced a genetic algorithm called Automatic Genetic Clustering for unknown K (AGCUK) [28]. The algorithm was able to detect the number of clusters without any prior knowledge about the correct number of clusters. Each chromosome was made up of a random number of clusters in the range $[2, \sqrt{N}]$, where N equals the number of objects to be clustered. Each cluster is represented by the coordinates of an object that was randomly picked from the data set. The Davies-Bouldin (DB) index was applied to evaluate the fitness of a chromosome and measure the validity of a clustering. In the selection phase, they adopted noise selection to ensure that the population is not dominated by the fittest individuals. Afterwards, they applied division-absorption mutation, which consists of two operations. In the division operation, the sparser a cluster is, the more likely it is to be partitioned by k-means. In the absorption operation, the closer two clusters are to each other, the more likely one is to be absorbed by the other. The algorithm terminated after a fixed number of generations, which was equal to 50. The time complexity of this algorithm is $O(GPKmN)$, where G equals the number of generations, P equals the population size, K denotes the number of clusters, and m represents the number of object attributes.
In 2012, Satish [29] proposed a genetic algorithm called OCGA that estimates the number of clusters. First, hierarchical clustering is applied to the data set to obtain a dendrogram, where each level represents a different number of clusters and a different cluster formation. Then, a genetic algorithm is applied to determine at which level K the dendrogram is cut to obtain K clusters. Each individual in the initial population is initialized to a random level in the dendrogram. The Dunn index was used to measure the fitness of a chromosome. The proposed algorithm was applied to two different artificial data sets, and it determined the number of clusters correctly. Also, the algorithm took less time to compute the optimal clusters than evaluating the clusters at each level in the dendrogram. The author suggests using other optimization methods and different cluster validation indices. Chehouri et al. [30] introduced a genetic algorithm, referred to as KGA, that uses cluster analysis to sort the population and select the parents for crossover. The performance of KGA was tested on a class of unconstrained optimization problems through two proposed versions, using either a fixed number of clusters or an optimal number of clusters. The proposed technique permits the evaluation of multimodal functions. However, further investigation is needed for solving constrained optimization problems and multi-objective formulations.
Later, Dina and her colleagues proposed metaheuristic techniques that determine the number of clusters using a variable-length chromosome [31]. A genetic algorithm, an artificial immune system, and a hybrid of both, namely, IGA, were applied to five real-world data sets. The fitness function of each chromosome aims to minimize the sum of distances between genes in the same cluster and to maximize the sum of distances between genes in different clusters. The authors adopted ERX as the crossover operator. They clarified the order of hypermutation and crossover in the IGA algorithm to prevent premature convergence of the population. The Mann-Whitney-Wilcoxon (MWW) rank-sum test was applied and showed significant results. More recently, a multi-objective genetic k-means clustering algorithm was presented by Hung Nguyen in [32]; the chromosome is divided into two parts. The first part consists of a sequence of binary numbers, either 0 or 1, indicating whether each cluster is active or inactive. The second part represents the centers of the clusters. The fitness of a chromosome is evaluated by three different measures: i) the within-cluster sum of squares, ii) the Davies-Bouldin index, and iii) the Silhouette index. Randomly selected parents that have the same number of clusters are used for crossover. Gaussian noise is added to the centers of the clusters for mutation. Also, the number of clusters can be changed by activating or disabling a center. A k-means operator is applied in each generation according to a user-defined probability. The proposed algorithm was applied to 16 disease data sets and 5 single-cell data sets.
Komarasamy and Wahi [33] introduced a metaheuristic method called KMBA, a hybrid of k-means and the bat algorithm. The algorithm can detect the number of clusters without any previous knowledge. The clusters were evaluated using the F-measure and were tested against traditional k-means on iris data sets. In 2013, Sood and Bansal [34] proposed a hybrid of k-medoid and the bat algorithm that improves the quality of the clusters and deduces the cluster centers correctly, since k-medoid is highly affected by the initially selected objects. Later, Kumar and Kaur [35] proposed three variants of the bat algorithm to solve the clustering problem, namely, BA-CN, BA-C and BA-CNE. The performance of these variants was tested on twelve benchmark data sets. BA-CNE reported the best clustering solutions compared to the other variants and other clustering algorithms.
A hybrid of atom search optimization (ASO) and the sine-cosine algorithm (SCA) [36] was introduced to determine the number of clusters and their centers. Results were validated on sixteen benchmark data sets, showing superior performance over competing algorithms. The proposed algorithm uses SCA as a local search method to improve the convergence of ASO, which leads to finding a globally optimum solution.
Absalom and others proposed a hybrid of two algorithms, firefly and particle swarm optimization [37]. The hybrid FAPSO showed significant results based on two validity indices, namely, the Davies-Bouldin (DB) and compact separated (CS) validity indices. Results were validated on fourteen data sets and compared to existing metaheuristic algorithms such as GCUK, DE, DCPSO and ACDE. These experiments indicate that the CS index is the best one in terms of the compactness of the clustering solutions. In addition, a comparative study shows that the FAPSO hybrid outperformed FAIWO, FAABC, and FATLBO. On the contrary, FATKBI seems to have relatively equivalent performance in terms of speed and clustering solutions [38].
In 2021, a new hybrid approach based on binary differential evolution and marine predators' algorithm was introduced to solve the automatic clustering problem referenced as DEMP [39]. The proposed method was validated on eight multi-omics data sets from the cancer genome atlas (TCGA). Also, the proposed algorithm outperformed the competing algorithms and achieved better execution time.
A recent attempt at automatic clustering was presented by Behanz and her colleague in [40]. The authors utilized a binary encoding scheme with a predetermined range $[k_{min}, k_{max}]$, where $k_{min} = 1$, $k_{max} = [\sqrt{m}]$, and m is the number of data points. The proposed methodology showed superior performance over the competing algorithms, namely, BPSO, BGA, and BDA, on all binary data sets from the UCI Machine Learning and KEEL repositories.
A water wave optimization (WWO) algorithm was proposed by Kaur and his colleague in [41] to address the data clustering problem. The authors stated that WWO achieved significant results for most unconstrained and constrained optimization problems. However, WWO obtained less optimal solutions on complex optimization problems. WWO was evaluated on thirteen clustering data sets using F-score and accuracy, showing better clustering results than other algorithms.
In summary, all the aforementioned methods showed significant performance but suffer from one or more drawbacks. Some algorithms could not determine the number of clusters optimally, while others showed high execution time. Also, it has become a fundamental problem to design a computer-aided technique that categorizes newly discovered DNAs into a proper cluster or category of similar functionality according to species. The running time of the designed technique should be lower than that of traditional techniques, which perform poorly due to the dramatic increase in the time needed to find the required number of clusters [42]. Moreover, in the past decade, notable similarity measures have been developed due to ever-growing demands; nevertheless, only a few preserve the functionality of DNA sequences.

IV. PROPOSED SYSTEM
The proposed methodology aims to design a soft computing technique that determines the best estimate for the number of clusters and identifies the optimum cluster formation for DNAs of variable length. To tackle this problem, PCNN is utilized to generate a signature vector for each DNA. The distances between DNA signature vectors are computed. A new design for the genetic algorithm is developed to optimize the PCNN parameters; it produces better signature vectors and clusters the DNAs. Also, the bat algorithm is implemented to cluster the DNA signature vectors. Finally, the design of the proposed hybrid GABAT clustering algorithm is described in detail.

A. PULSE COUPLED NEURAL NETWORK (PCNN)
Since PCNN cannot work with alphabetic DNA sequences directly, the DNA sequences are transformed into vectors of numbers. The bases A, G, C, T are encoded as 1, 2, 3, 4, respectively. Then, the resultant numerical DNA sequence is normalized to be compatible with PCNN, which works with grayscale images whose values range from 0 to 1. The rule applied to obtain the normalized sequence is denoted by (1), where C is the normalized value, i is the encoded value, and $I_{max}$ and $I_{min}$ are the maximum and minimum of all encoded values. Fig. 1 illustrates an example of DNA encoding and normalization. The PCNN model is summarized by equations (2)-(6), where the subscripts i, j refer to the neuron location in the PCNN and n is the iteration number. The receptive field of a PCNN neuron consists of two main components: one for linking, where $L_{ij}$ is the linking value associated with neuron (i, j), and one for feeding, where $F_{ij}$ is the feeding value associated with neuron (i, j). These are known as the L and F channels. Both channels communicate with the neighboring neurons through the synaptic weights W and M, respectively. W and M are equal and are declared as a one-dimensional array in which each element equals the reciprocal of the distance between the central neuron and the adjacent neuron. For the neuron (i, j) and the adjacent neuron (k, l), the element of the linking matrix is given by equation (7). The length of W is 8 since each codon consists of three nucleotides (bases) and interacts with the two adjacent codons.
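The encoding and normalization step can be sketched as follows. The base-to-number mapping is the one stated above; the normalization formula is an assumption, namely the usual min-max rule $C = (i - I_{min}) / (I_{max} - I_{min})$, which is consistent with the symbols defined for rule (1) but may differ from the exact form used in the paper. The function names are illustrative.

```python
BASE_CODE = {"A": 1, "G": 2, "C": 3, "T": 4}

def encode(dna):
    """Map each base to its integer code (A=1, G=2, C=3, T=4)."""
    return [BASE_CODE[base] for base in dna.upper()]

def normalize(encoded, i_min=1, i_max=4):
    """Assumed min-max rule: C = (i - I_min) / (I_max - I_min), giving values in [0, 1]."""
    return [(i - i_min) / (i_max - i_min) for i in encoded]

normalized = normalize(encode("AGCT"))
```

Under this assumed rule, A maps to 0 and T maps to 1, with G and C spaced evenly in between.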
The $Y_{kl}$ are the outputs of neurons from the previous iteration [n − 1], and S is the external stimulus that represents the input quaternary code. $V_l$ and $V_f$ are normalizing constants, and $\alpha_F$ and $\alpha_l$ are the decay exponents. The internal activity of a neuron, $U_{ij}$, is denoted by equation (4), which combines L, F and the linking strength β. The internal state of a neuron is compared to a threshold value $\theta_{ij}$, where $V_\theta$ is a constant value and $\alpha_\theta$ is the decay coefficient. The output of each neuron is either 1 or 0, indicating pulse or non-pulse.
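To make the roles of the parameters concrete, here is a hedged one-dimensional sketch of a PCNN iteration. It assumes the commonly used PCNN update rules (exponentially decaying feeding and linking channels, $U = F(1 + \beta L)$, and a decaying threshold that is raised by $V_\theta$ after a pulse); the paper's equations (2)-(6) may differ in detail. The parameter defaults, the zero initial state, the comparison against the previous iteration's threshold, and the neighborhood radius of 4 (giving 8 reciprocal-distance weights) are all illustrative assumptions.

```python
import math

def pcnn_outputs(stimulus, iterations, alpha_f=0.1, alpha_l=1.0, alpha_theta=0.5,
                 v_f=0.5, v_l=0.2, v_theta=20.0, beta=0.1, radius=4):
    """Run a 1-D PCNN and return the binary pulse vector of every iteration."""
    n = len(stimulus)
    offsets = [d for d in range(-radius, radius + 1) if d != 0]
    weights = {d: 1.0 / abs(d) for d in offsets}  # reciprocal of the distance
    feed = [0.0] * n
    link = [0.0] * n
    theta = [0.0] * n
    pulses = [0] * n
    outputs = []
    for _ in range(iterations):
        # Weighted sum of the neighbours' pulses from the previous iteration.
        coupling = [sum(weights[d] * pulses[i + d]
                        for d in offsets if 0 <= i + d < n) for i in range(n)]
        feed = [math.exp(-alpha_f) * feed[i] + v_f * coupling[i] + stimulus[i]
                for i in range(n)]
        link = [math.exp(-alpha_l) * link[i] + v_l * coupling[i] for i in range(n)]
        activity = [feed[i] * (1.0 + beta * link[i]) for i in range(n)]
        # A neuron pulses when its internal activity exceeds its threshold.
        pulses = [1 if activity[i] > theta[i] else 0 for i in range(n)]
        theta = [math.exp(-alpha_theta) * theta[i] + v_theta * pulses[i]
                 for i in range(n)]
        outputs.append(pulses)
    return outputs

outputs = pcnn_outputs([0.0, 1 / 3, 2 / 3, 1.0], iterations=5)
```

With these assumed settings, neurons that have just pulsed are silenced for a few iterations by the raised threshold, which is what makes the per-iteration binary patterns differ and gives each input its own signature.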

B. ENTROPY OF DNA SEQUENCES
As mentioned above, the output of the PCNN is a two-dimensional image that consists of a sequence of 0's and 1's. The resultant binary image captures the features of the input. Each DNA has its own signature according to the PCNN characteristics. The distance between two DNAs is calculated from the resultant signature vectors using one of two methods. The first method calculates the entropy of the binary sequence at each iteration; for example, if the number of iterations equals N, a one-dimensional array of size N is declared, whose elements are given by equation (8). Each element is the entropy at the nth iteration, where $p_1$ is the probability of 1 in the binary image and $p_0$ is the probability of 0 in the binary image [43]. Then, the Euclidean distance between the resultant vectors is calculated.
Equation (10) defines the Euclidean distance, where X and Y are DNA signature vectors and k is the signature length. The second method calculates the Euclidean distance between the sums of the entropies over all iterations for each DNA sequence, where ES(n) is the entropy of a DNA sequence given by equation (9), N is the total number of iterations, and n varies from 1 to N. The PCNN iterates a user-defined number of times. This value defines the signature length; it was set to 70, an average value within the range suggested and used by researchers in [19], [44].
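The two distance methods can be sketched as below. The sketch assumes the standard binary Shannon entropy $-p_1 \log_2 p_1 - p_0 \log_2 p_0$ (the logarithm base is an assumption) and treats the second method as the absolute difference between the summed entropies, which is the Euclidean distance between two scalars. The function names and the toy pulse vectors are illustrative.

```python
import math

def binary_entropy(bits):
    """Shannon entropy (base 2, assumed) of a binary sequence."""
    p1 = sum(bits) / len(bits)
    p0 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

def entropy_signature(outputs):
    """One entropy value per PCNN iteration."""
    return [binary_entropy(bits) for bits in outputs]

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical pulse vectors for two short sequences over three iterations.
sig_x = entropy_signature([[0, 1, 1, 1], [1, 0, 0, 0], [1, 1, 0, 0]])
sig_y = entropy_signature([[1, 1, 1, 1], [1, 0, 0, 0], [0, 0, 0, 0]])

d_vector = euclidean(sig_x, sig_y)      # method 1: distance between entropy vectors
d_sum = abs(sum(sig_x) - sum(sig_y))    # method 2: distance between summed entropies
```

The first method keeps the per-iteration profile, while the second collapses each DNA to a single number and is therefore cheaper but less discriminative.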

C. CLUSTERING WITH GENETIC ALGORITHM
We propose a new chromosome design that can identify the optimal number of clusters for variable-length chromosomes without any prior knowledge. Also, it is noticed that the PCNN-generated signature is highly affected by $\alpha_l$, $\alpha_\theta$, $\beta$, $V_l$, $V_\theta$, $\alpha_F$, $V_f$. Consequently, the proposed genetic algorithm is implemented so that it can adjust the PCNN parameters to their optimal values. Fig. 2 shows the genetic optimization mode. The chromosome is divided into three parts, shown in abstract form as follows:
1) The first seven genes represent the PCNN parameters to be optimized; they are $\alpha_l$, $\alpha_\theta$, $\beta$, $V_l$, $V_\theta$, $\alpha_F$, $V_f$, respectively.
2) A sequence of random 0's and 1's of predefined length K, where 1 indicates that the cluster is valid (active) and 0 indicates that the cluster is invalid (disabled). K equals the total number of DNAs the algorithm runs on and is the maximum number of clusters.
3) The centroids corresponding to each cluster, respectively, where a centroid acts as the center of a cluster.
An example that illustrates the chromosome structure of the proposed design is shown in Fig. 3.
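The three-part chromosome layout can be illustrated with a small decoder. This is a sketch under assumptions: the chromosome is stored as one flat list, the centroid dimension `dim` is passed in explicitly, and the numeric values are made up for illustration; the paper does not specify a storage format.

```python
def decode(chromosome, max_clusters, dim, n_params=7):
    """Split a flat chromosome into PCNN parameters and the active centroids."""
    params = chromosome[:n_params]
    flags = chromosome[n_params:n_params + max_clusters]
    flat = chromosome[n_params + max_clusters:]
    centroids = [flat[i * dim:(i + 1) * dim] for i in range(max_clusters)]
    # Only centroids whose activation flag is 1 define a cluster.
    active = [c for flag, c in zip(flags, centroids) if flag == 1]
    return params, active

chromosome = (
    [0.1, 0.5, 0.2, 1.0, 20.0, 0.1, 0.5]   # 7 PCNN parameters (illustrative values)
    + [1, 0, 1]                             # K = 3 activation flags
    + [0.2, 0.4, 0.9, 0.9, 0.6, 0.1]        # 3 centroids of dimension 2
)
params, active_centroids = decode(chromosome, max_clusters=3, dim=2)
```

The number of clusters encoded by a chromosome is simply the number of active centroids returned, two in this toy case.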
In the fitness function of equation (11), $i, j, k \in \mathbb{N} \setminus \{0\}$, $\|D_j - c_i\|^2$ is the squared Euclidean distance between a DNA $D_j$ and the center $c_i$, $F_t$ is the fitness value, and k is the number of clusters. The fitness function aims to minimize the average within-cluster sum of squared Euclidean distances. Hence, DNAs with similar traits will be clustered together. The best fitness is the minimum one over the population of chromosomes. Two types of mutation are applied, one for the binary segment and the other for the floating-point segment. The normal binary flip mutation is applied to the binary segment. Non-uniform mutation [45] is adopted to mutate the floating-point segments, where r is a random number in [0, 1], t denotes the current iteration number, T is the maximum number of iterations (number of generations), and b determines the degree of dependency on the iteration number. In this experiment, we used a dependency factor that ranges from [0. . GA uses roulette-wheel selection. A normal single-point crossover is applied. An elitist replacement strategy is adopted, which guarantees that the fittest individual will pass its traits to the next generation. Fig. 4 shows pseudo-code for the genetic algorithm implementation. The function run Evaluations is used to evolve the population, replacing the older one, until the maximum generation size is reached. After reaching the stopping criterion, the number of ones generated by the genetic algorithm indicates the optimal maximum number of clusters.
VOLUME 9, 2021
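The non-uniform mutation of the floating-point genes can be sketched as follows. The sketch assumes the widely used Michalewicz form, in which the step is $\Delta(t, y) = y \, (1 - r^{(1 - t/T)^b})$ and the gene moves towards its upper or lower bound with equal probability; this matches the symbols r, t, T and b described above, but the exact formula in the paper is not reproduced here. The bounds and the value of b are illustrative.

```python
import random

def delta(t, y, T, b, rng):
    """Step size that shrinks towards 0 as t approaches T."""
    r = rng.random()
    return y * (1.0 - r ** ((1.0 - t / T) ** b))

def non_uniform_mutate(x, lower, upper, t, T, b, rng):
    """Move x towards the upper or lower bound by a generation-dependent step."""
    if rng.random() < 0.5:
        return x + delta(t, upper - x, T, b, rng)
    return x - delta(t, x - lower, T, b, rng)

rng = random.Random(0)
early = non_uniform_mutate(0.3, 0.0, 1.0, t=0, T=100, b=2.0, rng=rng)
late = non_uniform_mutate(0.3, 0.0, 1.0, t=100, T=100, b=2.0, rng=rng)
```

Early generations can move a gene almost anywhere inside its bounds, while in the last generation the step vanishes, so the search shifts from exploration to fine tuning.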

D. CLUSTERING WITH BAT ALGORITHM
Let the given DNAs $D = \{d_1, d_2, \ldots, d_n\}$ be classified into non-overlapping clusters $C = \{c_1, c_2, \ldots, c_k\}$, where the dimension of each $d_i$ $(i = 1, 2, \ldots, n)$ is b. The centroids acting as the centers of the clusters are $P = \{p_1, p_2, \ldots, p_k\}$. For a b-dimensional DNA vector, the following rules are applied: In the proposed bat algorithm, the bats (solutions) are initialized through the objective function $f(X)$, where $X = (x_1, x_2, \ldots, x_n)$. As described above, each solution is a $k \times b$-dimensional vector for the data $D_{n \times b}$, where $X_i = ((x_{11}, x_{12}, \ldots, x_{1b}), (x_{21}, x_{22}, \ldots, x_{2b}), \ldots, (x_{k1}, x_{k2}, \ldots, x_{kb}))$. The goal of the clustering method is expressed by the fitness function defined in equation (11), where the smaller the value of $f_t$, the better the compactness. Velocities and frequencies are initialized randomly, and the pulse rate and loudness are defined. The user may specify the generation size. While the number of iterations is less than the specified generation size, the algorithm clusters the DNAs according to the following steps:
1. The Euclidean distance between the K initial cluster locations and the DNA signatures is calculated, and each DNA signature is assigned the index of the cluster with the nearest centroid.
2. The fitness of a solution is computed as the average distance between the DNA signatures and their nearest centroids.
3. The bats iteratively update their frequencies, velocities and locations based on equations (16)-(18), where $\beta \in [0, 1]$ is a random vector drawn from a uniform distribution and $x^*$ is the current best global solution.
4. The best solution is the one with the minimum fitness over the generations.
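The steps above can be sketched as follows. The update in `bat_step` assumes the standard bat algorithm rules from Yang's formulation, $f_i = f_{min} + (f_{max} - f_{min})\beta$, $v_i \leftarrow v_i + (x_i - x^*) f_i$, $x_i \leftarrow x_i + v_i$, which are taken here as what equations (16)-(18) denote. The fitness uses the plain (not squared) Euclidean distance as worded in step 2, and the loudness and pulse-rate local search is omitted for brevity. All names and toy data are illustrative.

```python
import math
import random

def distance(p, c):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, c)))

def assign(points, centroids):
    """Step 1: index of the nearest centroid for every point."""
    labels = []
    for p in points:
        dists = [distance(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

def fitness(points, centroids):
    """Step 2: average distance from each point to its nearest centroid."""
    return sum(min(distance(p, c) for c in centroids) for p in points) / len(points)

def bat_step(position, velocity, best, f_min, f_max, rng):
    """Step 3: frequency, velocity and position update for one bat (flat vectors)."""
    beta = rng.random()
    freq = f_min + (f_max - f_min) * beta
    new_velocity = [v + (x - xb) * freq for v, x, xb in zip(velocity, position, best)]
    new_position = [x + v for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
labels = assign(points, centroids)
score = fitness(points, centroids)
```

A bat's position is the flattened list of its k centroids, so `bat_step` moves all centroids of one candidate solution at once; a bat that already sits on the best solution with zero velocity stays where it is.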

E. HYBRID GENETIC BAT ALGORITHM (GABAT)
The hybridization technique discussed in this study exploits the merits of both algorithms in a single efficient framework, reducing the deficiencies that the individual algorithms might have. Moreover, hybridizing the bat algorithm with the well-known genetic algorithm should improve the balance between exploitation and exploration, which is a challenging task for most metaheuristic algorithms. The main aim of GABAT is to mitigate the premature convergence of the bat algorithm towards local minima caused by the random initialization of the centroids. GABAT starts by computing the signature vectors using PCNN, and it calculates the distances between DNAs. The population of chromosomes (solutions) is initialized randomly. Thereafter, the fitness value of each chromosome is computed using equation (11), after which the chromosomes are updated by the genetic algorithm operators. The same process is repeated iteratively until a satisfactory fitness value or the maximum number of generations is reached. Subsequently, the PCNN is utilized to compute the signature vectors using the optimized parameters produced by GA. The bat algorithm uses the best solution generated by the GA search as its initial search population. Iteratively, the locations and velocities of the new solutions generated by the bat algorithm are updated through equations (16)-(18). Finally, the best candidate solution is the one with the smallest fitness value. Fig. 6 illustrates the design of the proposed hybrid algorithm.

A. DATA SET DESCRIPTION
The series of experiments was conducted on the expanded Human Oral Microbiome Database (eHOMD) [46]. It provides information about bacterial species found in the human aerodigestive tract (ADT), including the nasal passages, sinuses, throat, esophagus, mouth, and lower respiratory tract. eHOMD includes a total of 775 microbial species and more than 1,000 microbial DNAs. The size of each DNA ranges from 1 to 10 MB. A subset of 100 DNAs is used to conduct the tests.

B. SYSTEM CONFIGURATION AND PARAMETER SETTING
Tables 1 and 2 list the parameter settings used in the conducted experiments.
The execution of the evolutionary algorithms was repeated ten times for the mentioned settings to ensure the accuracy of the results and to observe whether any of the algorithms showed unexpected behavior. The average best fitness was recorded. All tests were performed on an Intel Core i7-9750H (9th generation) CPU, under 64-bit Windows 10 with 16 GB RAM, using C# on Microsoft Visual Studio 2017 and Java on Eclipse 2018; the database was managed by MySQL. The statistical test for the validation of the obtained GABAT results was carried out via IBM SPSS Version 27.

C. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the experimental results are divided into two main subsections. The first subsection presents the DNA entropy values and the similarity distances between DNA samples. The second subsection presents the experimental results of the evolutionary algorithms.

1) DNA ENTROPIES USING PCNN
As previously mentioned, the proposed techniques are applied to the eHOMD data set. Fig. 7 shows the resultant entropies for a sample of 50 DNAs before and after the enhancement of the PCNN parameters with the help of the genetic algorithm. The distance between two DNAs is calculated using either the Euclidean distance between their signature vectors or their entropy. Both distances are computed before and after the optimization of the PCNN parameters. Similar strains evidently have closer entropy values and a high degree of similarity. Tables 3, 4, 5 and 6 show the similarity matrices between DNA samples. Samples of the same strain, such as 1054 and 1099, have the smallest Euclidean distance compared with the others.
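The two distance options mentioned above can be illustrated with a short sketch. It assumes that each DNA has already been reduced to a fixed-length, non-negative signature vector, and that the entropy of a signature is the Shannon entropy of its values normalized to sum to one. The text does not spell out the entropy formula, so this definition, the function names, and the toy vectors are assumptions made for illustration only.

```python
import math


def euclidean_distance(sig_a, sig_b):
    # Euclidean distance between two signature vectors of equal length.
    if len(sig_a) != len(sig_b):
        raise ValueError("signature vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(sig_a, sig_b)))


def shannon_entropy(sig):
    # Assumed definition: normalize the signature to a probability
    # distribution and compute its Shannon entropy in bits.
    if any(v < 0 for v in sig):
        raise ValueError("signature values must be non-negative")
    total = sum(sig)
    if total <= 0:
        raise ValueError("signature must contain a positive value")
    probs = [v / total for v in sig if v > 0]
    return -sum(p * math.log2(p) for p in probs)


def entropy_distance(sig_a, sig_b):
    # Absolute difference between the entropies of two signatures.
    return abs(shannon_entropy(sig_a) - shannon_entropy(sig_b))


if __name__ == "__main__":
    # Toy signature vectors (made up; not taken from eHOMD).
    sig_x = [4.0, 3.0, 2.0, 1.0]
    sig_y = [4.0, 3.0, 2.0, 1.5]
    sig_z = [1.0, 1.0, 6.0, 9.0]
    print("d(x, y):", euclidean_distance(sig_x, sig_y), entropy_distance(sig_x, sig_y))
    print("d(x, z):", euclidean_distance(sig_x, sig_z), entropy_distance(sig_x, sig_z))
```

Applying either function to every pair of samples yields a symmetric matrix with zeros on the diagonal, which is the general shape of a pairwise similarity matrix.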

2) METAHEURISTIC ALGORITHMS
The aforementioned evolutionary algorithms' settings were tested ten times on different PCs with similar specifications, and the outputs were averaged to acquire the best results from each technique. Figures 8 and 9 show the convergence performance of the algorithms with respect to the eHOMD data set.

The Mann-Whitney-Wilcoxon (MWW) rank-sum test is conducted to statistically validate the obtained clusters. In our study, MWW is preferred over a t-test because of the non-deterministic nature of the metaheuristic techniques, since non-parametric tests are believed to be more resilient [47]. Table 7 shows the results of the test. A p-value less than 0.05 is considered statistically significant. As shown in Table 7, the bat algorithm outperformed the GA, and GABAT outperformed the bat algorithm. This shows that GABAT performs relatively better than its rivals. Table 8 shows the performance of GABAT against the algorithms mentioned earlier in the literature review to ascertain the superiority of the new hybrid. These algorithms are the genetic algorithm, the bat algorithm, the firefly algorithm, particle swarm optimization, and the hybrid FAPSO [37]. It is evident that GABAT achieved the best mean and StDev.

As previously mentioned, the proposed metaheuristic algorithms are applied to a 100-dimension data set, and the tests are halted after a maximum of 10,000 evaluations. The stopping criterion is enforced automatically when the maximum number of fitness-function evaluations is reached or the optimal fitness is attained. All tests are repeated ten independent times to support our statistical analysis. The performance measure UE represents the number of fitness-function evaluations consumed to achieve fitness F; that is, the minimum number of evaluations the algorithm needed to achieve that fitness. T represents the elapsed time in seconds to reach fitness F. These measures are reported in Tables 9 and 10.
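As an illustration of the rank-sum comparison described above, the following sketch computes the Mann-Whitney U statistic and a two-sided p-value from the normal approximation with tie correction and without continuity correction. The fitness values are made-up placeholders, not the results reported in this study, and the function name is hypothetical. For ten runs per algorithm, an exact test as provided by statistical packages would be preferable to this approximation.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U, two-sided p-value) using the normal approximation."""
    n1, n2 = len(sample_a), len(sample_b)
    # Pool both samples, remembering which group each value came from.
    combined = sorted((v, g) for g, s in enumerate((sample_a, sample_b)) for v in s)
    ranks = [0.0] * len(combined)
    tie_term = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied values share the average 1-based rank
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    rank_sum_a = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = rank_sum_a - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    n = n1 + n2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return min(u1, u2), 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return min(u1, u2), p_value


if __name__ == "__main__":
    # Placeholder best-fitness values from ten runs of two algorithms.
    runs_a = [0.41, 0.39, 0.44, 0.40, 0.42, 0.38, 0.43, 0.40, 0.41, 0.39]
    runs_b = [0.52, 0.49, 0.55, 0.47, 0.51, 0.50, 0.53, 0.48, 0.54, 0.50]
    u, p = mann_whitney_u(runs_a, runs_b)
    print("U =", u, "p =", p, "significant at 0.05:", p < 0.05)
```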

VI. CONCLUSION AND FUTURE WORK
In this study, a new soft computing technique is proposed to solve the automatic clustering problem for DNA sequences of variable length. Although many clustering algorithms have been proposed, most of them require prior knowledge of the number of clusters. In this paper, pulse-coupled neural networks are utilized to measure the similarity between the DNAs of the eHOMD data set. A new design of the genetic algorithm is implemented for clustering and for optimizing the PCNN parameters. In addition, a hybrid of two algorithms, the genetic algorithm and the bat algorithm, referred to as GABAT, is implemented. The simulation results have shown that the hybrid GABAT outperformed the two state-of-the-art clustering algorithms and other metaheuristic algorithms, namely, firefly, particle swarm optimization, and FAPSO. The Mann-Whitney-Wilcoxon test is conducted to statistically validate the obtained clusters, and it yielded a significant p-value of less than 0.05. Finally, further research can extend this work by exploring other clustering techniques, such as fuzzy logic, neural networks, and other metaheuristic optimization techniques capable of detecting the number of clusters for a given data set. Moreover, combining different metaheuristic approaches can yield a hybrid system with efficient and enhanced capabilities.