Multi-Layer Perceptron Training Optimization Using Nature Inspired Computing

Although the multi-layer perceptron (MLP) neural networks provide a lot of flexibility and have proven useful and reliable in a wide range of classification and regression problems, they still have limitations. One of the most common is associated with the optimization algorithm used to train them. The most commonly used training method is stochastic gradient descent with backpropagation (or backpropagation for short) because it is mathematically tractable (given that the activation functions are differentiable). However, backpropagation is not guaranteed to find the globally optimal set of weights and biases. As a result, the MLP is often incapable of obtaining a desirable solution to the problem. Clonal selection algorithms (CSA) are optimization procedures that effectively explore a complex and large space to find values near the global optimum. Consequently, CSA can be used to solve the problem of training MLP networks. This paper presents a novel implementation of CSA for training MLP architectures to solve real-world problems such as breast cancer diagnosis, active sonar target classification, wheat classification, and flower classification. The CSA is used to find the optimal weights and biases that will significantly increase the classification accuracy of the MLP. The performance of our proposed approach is compared with other popular training methods: genetic algorithm (GA), ant colony optimization (ACO), particle swarm optimization (PSO), Harris hawks optimization (HHO), moth-flame optimization (MFO), flower pollination algorithm (FPA), and backpropagation (BP). The comparison is benchmarked using five classification datasets: Iris Flower, Sonar, Wheat Seeds, Breast Cancer Wisconsin, and Haberman’s Survival. 
Comparative study results illustrate the improvements in MLP performance gained by using CSA over other training methods, and hence it can be considered a competitive approach to training MLP networks when solving real-world applications in various disciplines.


I. INTRODUCTION
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Jiang.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The development of fast, reliable and inexpensive computing power has transformed the optimization field over the last decades [1]. Optimization refers to the process or methodology of making something (such as a system, design or decision) as ideal, functional, or efficient as possible. In mathematical terms, it refers to either maximization or minimization of an objective function relative to a given set, often representing a range of all possible choices in a particular situation [2]. The objective function allows a comparison of the different options to determine which might be the ''best'' [3]. Almost every problem in science and engineering can be formulated as an optimization problem. Some problems can be dealt with and solved by classical optimization techniques; however, most problems are either unmanageable or too costly in terms of time and other resources to be solved using traditional methods [4]. Fortunately, these hard problems can be addressed by drawing inspiration from nature. Nature has been a great source of inspiration for the development of new computing technologies that are flexible, robust, and scalable; these technologies arise from observing how naturally occurring phenomena solve complex problems in various environmental situations [5]. This produced groundbreaking research that led to the development of optimization techniques like CSA, GA, PSO, ACO, HHO, MFO, and FPA [6]-[11]. These are global optimization techniques that can generate near-optimum or even optimum solutions. They have numerous advantages which have made them enormously popular; these include:
• Are better at handling very large search spaces that involve a large number of parameters.
• Have strong parallel capabilities.
• Optimize both discrete and continuous functions, as well as multi-objective problems.
• Do not need derivative information, which might not be available for many real-world problems.
• Are often faster and more effective than classical approaches, and provide not just one good solution but many of them.
Training neural network architectures such as MLP can be formulated as an optimization problem; hence, it can be solved by optimization methods. Stochastic gradient descent (SGD) is a popular optimization algorithm that is often adopted to train neural network models since it is flexible and mathematically tractable. SGD uses backpropagation to calculate the gradient for each parameter (i.e., each weight and bias) so that new values for the parameters can be computed [12]. However, the SGD learning algorithm might converge to a set of sub-optimal weights and biases from which it cannot escape. As a result, the MLP model is often incapable of finding a desirable solution to the problem at hand. Additionally, the algorithm requires the activation and loss functions to be differentiable in order to calculate the gradients. Generally speaking, it is not a simple task to find an optimum solution or even a good sub-optimum one. Fortunately, it is possible to tackle these optimization problems by drawing inspiration from nature to produce near-optimal solutions [13].
Several nature-inspired algorithms have been proposed in the literature to train MLP systems, with promising results [14]-[22]. GA methods have been widely used to solve a variety of optimization problems, including the training of the MLP. An early and successful approach is described in [23]. The authors proposed using GA to find a near-optimal set of MLP weights in a relatively short time. They ran a number of experiments using data from a sonar image classification problem. Their experimental results showed that the GA outperformed SGD for training MLP models. Similar works that used GA to train MLP models for various classification and regression tasks are reported in [24]-[28]. The PSO algorithm has also been used successfully to train neural networks. One of the first applications of the PSO algorithm was to train MLPs, which involved finding appropriate weights for MLPs [29]. Other remarkable works that use the PSO algorithm to train neural networks are described in [30]-[32]. ACO has also shown promising results for training MLP models [33], [34]. Recently, artificial immune system (AIS) methods have been applied to train MLP models due to their ability to balance exploration and exploitation of the search space [35]-[39]. HHO [40], MFO [41], and FPA [42] have also been used to find the weights and biases of MLP models and have proven to be very competitive in terms of achieving high classification accuracy.
Differential Evolution [43], Artificial Bee Colony [44], Grey Wolf optimizer [45], Lightning Search Algorithm [46], Multi-Verse Optimizer [47], and Whale Optimization Algorithm [48] are also other popular nature-inspired algorithms that have been used to train MLPs and have shown good performance. Hybrid systems, which combine two algorithms, have also been proposed to train MLP models effectively. In [49], the authors presented a hybrid model that combines GA and gradient descent backpropagation, where GA is utilized to initialize and optimize the weight parameters of the MLP model to classify medical data. In [50], the authors introduced a hybrid model that combines PSO and Gravitational Search Algorithm to train MLPs (or FFNNs) to investigate the effectiveness of these algorithms in avoiding local minima. Other proposed hybrid models for training MLPs are described in [51]- [53].
This research focuses on utilizing the CSA [54] to train MLP models and improve their performance on classification tasks. CSAs are metaheuristic optimization algorithms inspired by theoretical and experimental immunology to solve computational problems from optimization, mathematics, and engineering. They belong to the broad field of AIS. AIS encompass any system or computational tool that extracts ideas and metaphors from the biological immune system to solve problems [55], [56]. As opposed to swarm intelligence and evolutionary algorithms, which emerged from one main central idea and developed into several branches and variations, AIS was proposed considering various features, processes, and immune system models [57]. The idea was first proposed by Farmer and team in 1986 [58], with some important work by Bersini and Varela on immune networks in 1990 [59]. Many variants, such as the CSA, negative selection algorithm, immune networks, and others, have been implemented over the last few decades. CSA has shown encouraging results and has demonstrated efficient performance in the optimization and pattern recognition domains [60]. CSA can estimate known solutions and simulate the adaptive immunity behavior in complex systems. It searches the solution space by creating a population of candidate solutions called antibodies. It then evaluates each antibody's affinity (fitness) and compares it with the affinity of the current potential solution, called the antigen. Those antibodies with higher affinity than the antigen are then selected, and a number of clones proportional to the respective affinity is created from each antibody. The clones then undergo a hypermutation process at a rate inversely proportional to their affinities. This hypermutation process is called affinity maturation.
Finally, depending on the affinities of the mutated clones, re-selection is performed to select the best mutated clone, which becomes the antigen for the next generation, and the same process is repeated until the desired solution is obtained.
This paper proposes an effective CSA-based method for training MLP architectures to solve difficult real-world problems in a variety of domains, including breast cancer detection, active sonar target classification, wheat classification, and flower classification. The following are the contributions of the study:
1) We develop an encoding scheme for MLP weights and biases that enables efficient implementation of various mutation types.
2) We evaluate our proposed methodology on five real-world benchmark classification datasets with varying levels of difficulty.
3) We compare our method with six well-known nature-inspired methods and find that it produces better results.
The paper is structured as follows. Section II delves into the details of the clonal selection theory, focusing particularly on those elements abstracted and implemented in the CSA method. Section III provides an overview of the CSA method, explaining in detail the main features and operators of the proposed algorithm with examples. Section IV discusses MLP models and how they learn using gradient descent backpropagation. The complete implementation of the proposed CSA-based method for training MLP systems is explained in Section V. The experimental design and a discussion of the results are presented in Sections VI and VII, respectively. Finally, conclusions and future work are presented in Sections VIII and IX, respectively.

II. CLONAL SELECTION THEORY
Key advances within AIS have centered on three main immunological theories: clonal selection, negative selection and immune networks. AIS researchers have focused mostly on the learning and memory processes of the biological immune system inherent in clonal selection and immune networks, and on the negative selection principle for generating detectors that can classify changes in self. This research's main focus is the CSA, an implementation of AIS inspired by the clonal selection theory. The clonal selection theory is an immunology theory that explains the basic features of the adaptive immune system, in particular the diversity of antibodies used to protect the organism against foreign attacks [61].
An antibody is a compound produced by B lymphoid cells that has the ability to neutralise a specific molecule. Each B lymphocyte produces unique antibodies of a specific type. When first proposed, the theory was controversial and competed with another paradigm named template theory. Currently, the clonal selection theory is widely accepted, given the vast amount of supporting experimental evidence [62]. The clonal selection theory specifies that an organism has a set of individually unique antibodies that can recognize all antigens with some degree of specificity. Whenever an antigen is recognised by an antibody, the corresponding cell is stimulated to bind to the antigen, proliferate, and produce further cells with the same antibody [63]. During the cell proliferation phase, genetic mutations take place in the cloned cells that enhance their match, or affinity, with the antigen. This allows the cells' ability to bind to the antigen to strengthen with continued antigen exposure. This antigen-driven selection of proliferating cells can be viewed as a Darwinian process in which the fittest cells (those that best match the antigen) are selected to survive, while gene mutation provides cell diversity [64]. Figure 1 gives an overall picture of the clonal selection process.
It depicts, at the top, B lymphocyte cells that specifically bind antigens. Following that, the B cell multiplies (by mitosis) and generates a large number of B lymphoblasts, which differentiate into either memory cells or plasma cells that produce immunoglobulin. In an effective immune response, plasma cells produce a large number of antibodies to a specific antigen, which results in the removal of the antigen. Memory cells are known to remain within the original host and support a quick secondary response if the specific antigen resurfaces. This is the phenomenon of acquired immunity. Another distinguishing feature of acquired immunity is how the system develops the ability to differentiate between self and non-self. This capacity is called tolerance and explains the failure of the system to launch an immune response against a specific antigen, such as self-antigens [66]. This ability develops prior to the organism's birth, while the immune system itself is developing. This ability to not produce antibodies against one's own cells can also be acquired for foreign antigens given to the organism before birth. This tolerance, however, cannot be maintained if the antigen is not permanently present in the organism, which means that the system may forget unused knowledge [67].

III. CLONAL SELECTION ALGORITHM

A. OVERVIEW
The CSA is a global optimization method inspired by the principles of the clonal selection theory of acquired immunity. CSA has outperformed traditional optimization techniques on diverse, challenging optimization problems [68], [69]. The algorithm takes inspiration from the following elements of the clonal selection theory:
• The immune system produces multiple antibodies to combat an invading antigen/pathogen.
• The antibodies produced are filtered based on affinity. Affinity is the degree of recognition between an antibody and an antigen. The higher the affinity, the better the recognition.
• Antibodies with higher affinity are cloned.
• The clone set of duplicate antibodies is then subjected to an affinity maturation (mutation) process to match the antigen better.
In the CSA, an antigen represents the problem to optimize, and the algorithm generates a set of candidate solutions (antibodies) to solve the problem [70]. The quality of the candidate solutions is evaluated based on their affinity (how well they match the antigen). The process of exploring feasible solutions corresponds to immune cells identifying antigens and launching immune responses in the immune system [71]. CSA is comparable to GA; it typically develops solutions by repeatedly applying bio-inspired operators such as cloning, mutation, and selection to a population of candidate solutions and retaining the best solutions. These operators help increase diversity and provide a means for potentially escaping local optima [72].

B. ALGORITHM
The following gives an overview of the steps of the CSA [64].
1) Create a random initial antigen, i.e., the initial potential solution.
2) In response to the initial antigen, create a random population of candidate solutions (antibodies) with the same format to fight the antigen.
3) Measure the affinity (fitness) score of each antibody.
4) Select the antibodies that have a higher affinity than the antigen.
5) Clone the selected antibodies to produce more like them. Depending on the designer's choice, the number of clones can be fixed for all antibodies or proportional to their affinity (rank-based).
6) Mutate each cloned antibody randomly to further enhance diversity and provide a means for potentially escaping local optima. In other words, this is an attempt to discover even more desirable candidates. Here, the mutation rate is inversely proportional to the affinity.
7) Measure the affinity of every mutated antibody; select the best one and discard the rest.
8) Replace the initial antigen with the best antibody and repeat the process until a stopping criterion is reached, such as an optimal solution having been found or the maximum number of iterations having been reached.
Figure 2 provides a complementary diagram of the workflow phases of the CSA method.

1) ANTIGEN INITIALIZATION
When solving an optimization problem, i.e., optimizing a function's variables, an antigen is a potential solution against which candidate solutions are evaluated. This potential solution helps select those antibodies in each generation whose affinity (fitness) is better than the antigen's affinity. First, a random antigen is created, representing the function's possible parameters over a specific range. There are as many antigen genes as there are function parameters whose optimal values need to be found to minimize or maximize the function. In other words, each gene represents a parameter value. For further explanation, let's assume that a function has six parameters, and the task is to find their optimal values, which will best minimize the objective function. In this case, the antigen can be represented as a decimal vector containing six parameter values. These values are first initialized randomly in a predefined range (say, [-3.5, 3.5]), as shown in Figure 3.

2) ANTIBODIES INITIALIZATION
Similarly to how the human body generates several antibodies to combat an antigen, antibodies are generated as candidate solutions to combat an antigen in the CSA algorithm [73]. Antibodies and antigens have the same format. Figure 4 depicts a randomly generated population of antibodies (candidate solutions) in response to the antigen.
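As a minimal sketch of the two initialization steps, the following Python snippet draws a six-gene antigen and a small antibody population uniformly from [-3.5, 3.5]. The function names, the population size of five, and the fixed seed are illustrative assumptions, not details taken from the paper's implementation.

```python
import random

def random_vector(n_genes, low, high, rng):
    """Return one real-valued vector with each gene drawn uniformly from [low, high]."""
    return [rng.uniform(low, high) for _ in range(n_genes)]

def init_population(n_antibodies, n_genes, low, high, rng):
    """Create antibodies that share the antigen's format (same length, same range)."""
    return [random_vector(n_genes, low, high, rng) for _ in range(n_antibodies)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable (assumption)
antigen = random_vector(6, -3.5, 3.5, rng)           # six parameters, as in the running example
antibodies = init_population(5, 6, -3.5, 3.5, rng)   # five candidates (hypothetical size)
```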

3) AFFINITY CALCULATION
Following the generation of the initial population of antibodies, the affinity (fitness value) of each antibody is calculated using a fitness function. The antibody evaluation is a crucial part of the CSA because antibodies are selected for cloning and mutation based on their fitness. The fitness function changes based on the problem being solved, and it must quantitatively measure how fit a particular solution is. For instance, the fitness function can be defined as the reciprocal of the error, where the error is the absolute difference between the predicted value and the true value. The higher the fitness value, the better the antibody. In CSA, affinity calculation is performed repeatedly, and thus it should be sufficiently fast to compute. A slow computation of the affinity can negatively affect a CSA and make it exceptionally slow. Table 1 below shows the fitness values of all candidate solutions (antibodies) and the potential solution (antigen). Antibody2 has the highest fitness value, whereas Antibody4 has the lowest one.
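The reciprocal-of-error fitness mentioned above can be sketched as follows. The small constant `eps` is an added assumption that guards against division by zero when the error vanishes, and the target and predicted values are made up for illustration.

```python
def affinity(predicted, target, eps=1e-12):
    """Reciprocal of the absolute error; eps (assumed) avoids division by zero."""
    return 1.0 / (abs(predicted - target) + eps)

target = 3.0                                          # hypothetical true value
predictions = {"Antibody1": 2.5, "Antibody2": 2.9}    # hypothetical predicted values
scores = {name: affinity(p, target) for name, p in predictions.items()}
```

The antibody whose prediction is closer to the target receives the higher score, and the function is cheap enough to call for every antibody in every generation.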

4) SELECTION
We select a number of antibodies n with a higher fitness value than the antigen based on the previously calculated fitness value. As shown in Table 1, antibodies 2, 3, and 5 are more fit than the antigen and will thus be used for cloning and mutation. Antibodies that are less fit than the antigen will be eliminated from the population.
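A possible sketch of this selection step is shown below. The affinity values are hypothetical placeholders chosen only to reproduce the ordering described in the text (Antibody2 highest, Antibody4 lowest, antibodies 2, 3, and 5 above the antigen); they are not the values of Table 1.

```python
def select(affinities, antigen_affinity):
    """Keep only the antibodies that are strictly fitter than the antigen."""
    return {name: f for name, f in affinities.items() if f > antigen_affinity}

# Hypothetical affinities; the real values appear in Table 1 of the paper.
affinities = {
    "Antibody1": 0.40,
    "Antibody2": 0.90,
    "Antibody3": 0.60,
    "Antibody4": 0.10,
    "Antibody5": 0.75,
}
selected = select(affinities, antigen_affinity=0.50)
```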

5) CLONING
Cloning is the process of producing multiple copies of the n antibodies chosen in the preceding step. Using a rank-based measure, the number of clones generated for each of the selected antibodies is proportional to their affinity. This is accomplished by first sorting the chosen antibodies in descending order of affinity, so that the highest-affinity antibody has rank i = 1. The ordered list is then iterated, and the number of clones generated for each antibody is calculated as shown below [54]:
$$N_c = \mathrm{round}\left(\frac{\beta N}{i}\right) \tag{1}$$
In Eq. (1), $N_c$ is the number of clones generated for a given antibody, $\beta$ is a clonal factor, $N$ is the population size, $i$ is the current rank of the antibody, where $i \in [1, n]$, and round(.) is an operator that rounds towards the nearest integer. Table 2 shows the number of clones generated from each of the selected antibodies, assuming a population size of N = 50 and β = 1. The highest-affinity Antibody2 (i = 1) will produce 50 clones, the second-highest-affinity Antibody5 (i = 2) will produce 25 clones, and finally Antibody3 (i = 3) will produce 17 clones.
In some CSA implementations, the affinity proportionate cloning is not implemented, meaning that the number of clones created from each of the selected antibodies is the same. This implies that each antibody will be viewed locally and have the same clone size regardless of its affinity (fitness).
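The rank-based clone counts in the example above (50, 25, and 17 clones for N = 50 and β = 1) are consistent with the rule N_c = round(βN / i). The sketch below assumes that rule and uses hypothetical affinity values that only fix the ranking. Rounding is done half-up with `int(x + 0.5)` because Python's built-in `round` rounds exact halves to the nearest even integer.

```python
def clone_counts(selected_affinities, beta, population_size):
    """Return {antibody: number of clones}, with rank 1 given to the fittest antibody."""
    ranked = sorted(selected_affinities.items(), key=lambda kv: kv[1], reverse=True)
    return {
        name: int(beta * population_size / rank + 0.5)  # round half up
        for rank, (name, _) in enumerate(ranked, start=1)
    }

# Hypothetical affinities that only encode the order Antibody2 > Antibody5 > Antibody3.
selected = {"Antibody2": 0.90, "Antibody3": 0.60, "Antibody5": 0.75}
counts = clone_counts(selected, beta=1.0, population_size=50)
```

Setting every count to the same constant instead gives the fixed-size cloning variant described in the preceding paragraph.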

6) MUTATION
By randomly altering genes in a given antibody, mutation introduces diversity into a population. Mutation is a key component of the CSA, allowing for global exploration while avoiding entrapment in local optima. In CSA, the mutation is inversely proportional to an antibody's affinity. To put it another way, the higher the affinity, the lower the mutation, and vice versa. This is referred to as affinity maturation. The following method can be used to implement the mutation rate [74]:
$$\alpha = \exp(-\rho f) \tag{2}$$
In Eq. (2), $\alpha$ is the rate of mutation, $\rho$ is a decay controller, and $f$ is the affinity of the antibody normalized in [0, 1]. Given $L$ as the length of the clone, the number of genes to be mutated, $N_M$, is then obtained from $\alpha$ and $L$ [75]. There are many mutation operators that can be used depending on how antibodies are encoded. Here we list widely used mutation operators for real-valued encoded antibodies (i.e., weights and/or biases). This is not an exhaustive listing, and the CSA designer may find a combination of these methods or a problem-specific mutation operator more useful.
• Uniform mutation: This mutation operator replaces the value of the selected gene from the given antibody with a uniform random value selected between the upper and lower bounds specified by the user for that gene.
• Boundary mutation: This operator replaces the value of the selected gene from the given antibody with either upper or lower bound randomly.
• Gaussian mutation: This mutation operator is more efficient in converging than the previously mentioned operators. It adds a random value from a Gaussian distribution to the selected gene of an antibody. If the new gene value falls outside of the user-specified range for that gene, then this new value is clipped.
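The following sketch combines an affinity-dependent mutation rate with Gaussian mutation and clipping. It assumes the exponential rate α = exp(−ρf) and, as a further assumption that the text does not spell out, mutates round(αL) randomly chosen genes. The names `gaussian_mutate`, `sigma`, and the bounds are illustrative.

```python
import math
import random

def mutation_rate(f, rho):
    """alpha = exp(-rho * f), with the affinity f normalized to [0, 1]."""
    return math.exp(-rho * f)

def gaussian_mutate(clone, f, rho, sigma, low, high, rng):
    """Add Gaussian noise to round(alpha * L) random genes (assumed rule), then clip."""
    n_mut = round(mutation_rate(f, rho) * len(clone))
    mutated = list(clone)
    for idx in rng.sample(range(len(clone)), n_mut):
        value = mutated[idx] + rng.gauss(0.0, sigma)
        mutated[idx] = min(high, max(low, value))  # clip to the allowed range
    return mutated

rng = random.Random(1)
clone = [0.0] * 6
weak = gaussian_mutate(clone, f=0.0, rho=5.0, sigma=1.0, low=-3.5, high=3.5, rng=rng)
strong = gaussian_mutate(clone, f=1.0, rho=5.0, sigma=1.0, low=-3.5, high=3.5, rng=rng)
```

With these settings a zero-affinity clone has all six genes perturbed, while a clone with affinity 1 is left untouched, which illustrates the inverse relationship between affinity and mutation.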

7) AFFINITY CALCULATION AND STOPPING CRITERION
Following mutation, each clone is evaluated based on its affinity, and the best one is chosen as the antigen for the next generation. The process is repeated several times until the algorithm reaches the maximum number of generations and/or the best fitness value in the population does not improve after a certain number of generations.
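Putting the steps together, a compact sketch of the whole loop might look as follows. It minimizes a simple sphere function, which is a stand-in objective chosen for illustration. Several choices are assumptions rather than details from the paper: the affinity 1/(1 + f(x)), mutating at least one gene per clone, skipping a generation when no antibody beats the antigen, and replacing the antigen only when the best mutated clone is fitter.

```python
import math
import random

def affinity(x):
    """Assumed affinity in (0, 1]: equals 1 at the minimum of the sphere function."""
    return 1.0 / (1.0 + sum(v * v for v in x))

def mutate(x, f, rho, sigma, low, high, rng):
    """Gaussian mutation of round(exp(-rho * f) * L) genes, at least one (assumption)."""
    n_mut = max(1, round(math.exp(-rho * f) * len(x)))
    y = list(x)
    for i in rng.sample(range(len(x)), n_mut):
        y[i] = min(high, max(low, y[i] + rng.gauss(0.0, sigma)))
    return y

def csa(n_genes=6, pop=20, beta=1.0, rho=2.0, sigma=0.3,
        low=-3.5, high=3.5, generations=100, seed=0):
    rng = random.Random(seed)

    def new():
        return [rng.uniform(low, high) for _ in range(n_genes)]

    antigen = new()                                    # step 1: initial antigen
    history = [affinity(antigen)]
    for _ in range(generations):
        antibodies = [new() for _ in range(pop)]       # step 2: random antibodies
        f_antigen = affinity(antigen)                  # step 3: affinities
        selected = sorted((ab for ab in antibodies if affinity(ab) > f_antigen),
                          key=affinity, reverse=True)  # step 4: selection
        clones = []
        for rank, ab in enumerate(selected, start=1):  # steps 5-6: clone and mutate
            n_clones = max(1, int(beta * pop / rank + 0.5))
            clones += [mutate(ab, affinity(ab), rho, sigma, low, high, rng)
                       for _ in range(n_clones)]
        if clones:                                     # steps 7-8: re-select, replace
            best = max(clones, key=affinity)
            if affinity(best) > f_antigen:             # elitist replacement (assumption)
                antigen = best
        history.append(affinity(antigen))
    return antigen, history

best, history = csa()
```

Because of the elitist replacement, the recorded antigen affinity never decreases from one generation to the next; stopping on a fixed generation count corresponds to the first criterion mentioned above.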

IV. MULTI-LAYER PERCEPTRON
An MLP is a type of feed-forward neural network (FFNN) containing one or more hidden layers, where each layer has one or more neurons [76]. It is an extension of the perceptron network and is perhaps the most widely used neural network model. An MLP with a single hidden layer is termed a shallow neural network; with a sufficient number of hidden neurons, a single-hidden-layer MLP can provide a universal approximation for almost any problem with tabular data. When an MLP contains more than one hidden layer, it is called a deep neural network. Adding more hidden layers may yield a small benefit; however, it is computationally more expensive as the number of trainable parameters increases, and the model becomes more prone to overfitting [77]. An MLP with one hidden layer is shown in Figure 5. It consists of three layers of nodes: an input layer, a hidden layer and an output layer.
The data are only transferred in a forward direction from the input nodes, through the hidden nodes, to the output nodes. Except for the input nodes, each node is a neuron that has an associated bias and performs some computations using a non-linear activation function [78].

A. INPUT LAYER
The input layer is responsible for getting the inputs from an external source such as a CSV file. It requires one input node per variable (or feature). For example, the famous Iris Flower classification dataset contains four features (sepal-length, sepal-width, petal-length, and petal-width); hence, there are four nodes in the input layer. If only two of the four features are selected, the input layer will include only two nodes.

B. HIDDEN LAYER
There is no precise general rule for selecting the number of neurons in a hidden layer; it is highly dependent on the problem. Therefore, programmers can use methodical experimentation to determine what works best for their particular dataset.

C. OUTPUT LAYER
The output layer is the final layer. It is responsible for producing the final output of the network. The number of neurons depends on the task format. For instance, in a binary classification problem (spam/ham), we use a single output neuron with sigmoid activation. However, in a multi-class classification problem, the number of neurons is usually equal to the number of classes/categories (e.g., three neurons for the three classes in the Iris Flower dataset). In this case, a softmax activation function is used to ensure the final probabilities sum to 1.
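To make the three layers concrete, here is a minimal forward-pass sketch for a network with four inputs, a sigmoid hidden layer, and a three-neuron softmax output, matching the Iris example. The hidden-layer size of three and all weight values are arbitrary assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(zs):
    """Convert raw output scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def dense(inputs, weights, biases):
    """weights[j][i] connects input i to neuron j; returns pre-activations."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(z) for z in dense(x, w_hidden, b_hidden)]
    return softmax(dense(hidden, w_out, b_out))

x = [5.1, 3.5, 1.4, 0.2]  # one sample with four features (illustrative values)
w_hidden = [[0.1, -0.2, 0.3, 0.4], [-0.3, 0.2, 0.1, -0.1], [0.05, 0.05, -0.4, 0.2]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[0.5, -0.5, 0.2], [-0.2, 0.3, 0.1], [0.1, 0.1, -0.3]]
b_out = [0.0, 0.0, 0.0]
probs = forward(x, w_hidden, b_hidden, w_out, b_out)
```

For a binary task, the softmax output layer would be replaced by a single sigmoid neuron, as described above.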

D. BACKPROPAGATION
Backpropagation is a widely used algorithm for training neural networks efficiently by applying the chain rule. Backpropagation was first introduced in 1969 by Bryson and Ho [79] but was neglected due to its demanding computations. In 1986, the backpropagation learning method was rediscovered by Rumelhart et al. [80]. They described various neural networks in which backpropagation works much faster than previous learning methods, making it feasible to apply neural networks to problems that had previously been unsolvable [81]. Currently, the backpropagation algorithm is the workhorse of learning in artificial neural networks (ANNs) and a major component of modern deep learning models [82]. As a general rule, training neural networks using backpropagation involves two passes: forward and backward [83]. Before training, we initialize all the parameters with small random values. Then, in the forward pass of each iteration, we feed the model with a training instance, and each neuron in the hidden and output layers determines its output in a way similar to Rosenblatt's perceptron, except that a differentiable nonlinear activation function such as the sigmoid is used instead of the step function. Finally, we check what the neural network predicts and compute the error (or loss) using a certain cost function. In the backward pass, backpropagation calculates the gradients of the cost function with respect to all model parameters. After finding the gradients, we use the stochastic gradient descent (SGD) optimizer to compute the new values of the parameters. The same forward and backward passes are then repeated with the updated parameter values for the second training instance, and so on. This training process ends when a predefined stop condition is met, e.g., the maximum number of iterations has been reached.
To derive the backpropagation algorithm, we will consider the MLP model shown in Figure 6, where the sigmoid function ($\sigma$) is used for both the hidden and output layers. For the sake of simplicity, we assume that each training instance contains only one feature ($x_1$). Accordingly, we consider only one neuron in the hidden layer and one neuron in the output layer. First, we start with the forward pass, which involves finding the predicted output ($\hat{y}$) of the network by propagating $x_1$ through the hidden neuron and then the output neuron, applying $\sigma$ at each. Next, we define the cost (or loss) function, which potentially could be any function that measures the error, such as the squared error. One of the most common loss functions used in classification problems is the Cross Entropy Loss. In a binary classification problem, where the number of classes is $\hat{C} = 2$, the Cross Entropy Loss can be defined as:
$$L = -\sum_{i=1}^{\hat{C}} y_i \log(\hat{y}_i) = -y_1 \log(\hat{y}_1) - (1 - y_1)\log(1 - \hat{y}_1)$$
where it is assumed that there are two classes, $\hat{C}_1$ and $\hat{C}_2$; $y_1$ and $\hat{y}_1$ are the true and the predicted score for $\hat{C}_1$, and $y_2 = 1 - y_1$ and $\hat{y}_2 = 1 - \hat{y}_1$ are the true and the predicted score for $\hat{C}_2$. Following that, we employ backpropagation to compute the derivative of the network's cost function ($L$) with respect to all network parameters. This is performed by using the chain rule recursively from the last to the first layer of the network.
Finally, we use SGD to update the network's parameters.
For the rest of this paper, the term backpropagation will be used loosely to refer to the entire learning algorithm for the MLP, including how the gradient is used by algorithms such as SGD to perform learning.
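The derivation for the one-input, one-hidden-neuron, one-output network can be sketched in code as follows. The parameter names `w1`, `b1`, `w2`, `b2`, the learning rate, and the sample values are assumptions introduced for illustration; the gradients follow from applying the chain rule to the binary cross-entropy loss with sigmoid activations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, x):
    """Forward pass: hidden activation h, then predicted output y_hat."""
    h = sigmoid(p["w1"] * x + p["b1"])
    y_hat = sigmoid(p["w2"] * h + p["b2"])
    return h, y_hat

def loss(p, x, y):
    """Binary cross-entropy for a single training instance."""
    _, y_hat = forward(p, x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def gradients(p, x, y):
    """Backward pass: chain rule from the output layer back to the hidden layer."""
    h, y_hat = forward(p, x)
    dz2 = y_hat - y                       # dL/dz2 for sigmoid + cross-entropy
    dz1 = dz2 * p["w2"] * h * (1 - h)     # propagate through w2 and the hidden sigmoid
    return {"w1": dz1 * x, "b1": dz1, "w2": dz2 * h, "b2": dz2}

def sgd_step(p, x, y, lr):
    """One SGD update: move each parameter against its gradient."""
    g = gradients(p, x, y)
    return {k: p[k] - lr * g[k] for k in p}

params = {"w1": 0.5, "b1": -0.2, "w2": -0.3, "b2": 0.1}  # hypothetical initial values
updated = sgd_step(params, x=1.5, y=1.0, lr=0.1)
```

A finite-difference comparison of `loss` against `gradients` is a quick way to check such a derivation before scaling it to larger networks.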

V. IMPLEMENTATION OF THE PROPOSED APPROACH
This section describes MLP training for data classification using our proposed CSA. Two prerequisites must be met in order to use the CSA: 1) a solution representation or encoding of the antigen; and 2) an affinity measure function to evaluate the solutions produced during the process. Once an encoding has been determined and an appropriate affinity measure function has been chosen, the CSA will perform selection, cloning, hypermutation, and re-selection based on the affinity until stopping criteria are met. The overall procedure of MLP training using our CSA-based method is depicted in Figure 7. The following subsections V-A through V-H describe the complete procedure.

A. ANTIGEN ENCODING
To use the CSA to find the optimal set of weights and biases for MLP networks, we must first represent the problem domain as an antigen. Here, we want to find an optimal set of weights and biases for the MLP model shown in Figure 8. Initial weights and biases in the MLP network are chosen randomly within some range, i.e., [-3, 3]. The weights and biases can be represented by a 1D vector in which a decimal number corresponds to a particular weight or bias. In total, there are 6 weights and 3 biases in Figure 8. Since an antigen is a collection of genes, a set of weights and biases can be represented by a 9-gene antigen, where each gene corresponds to a single weight or bias.
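One way to realize such an encoding is sketched below. It assumes that the network of Figure 8 has two inputs, two hidden neurons, and one output, which is consistent with 6 weights and 3 biases but is not confirmed by the text. The gene order (hidden weights, hidden biases, output weights, output bias) is likewise an assumption.

```python
def encode(w_hidden, b_hidden, w_out, b_out):
    """Flatten weights and biases into one gene vector (assumed gene order)."""
    genes = [w for row in w_hidden for w in row]
    genes += list(b_hidden)
    genes += [w for row in w_out for w in row]
    genes += list(b_out)
    return genes

def decode(genes, n_in=2, n_hidden=2, n_out=1):
    """Rebuild the layer parameters from a gene vector produced by encode()."""
    it = iter(genes)
    w_hidden = [[next(it) for _ in range(n_in)] for _ in range(n_hidden)]
    b_hidden = [next(it) for _ in range(n_hidden)]
    w_out = [[next(it) for _ in range(n_hidden)] for _ in range(n_out)]
    b_out = [next(it) for _ in range(n_out)]
    return w_hidden, b_hidden, w_out, b_out

# Hypothetical parameter values inside the range [-3, 3].
antigen = encode([[0.5, -1.2], [2.0, 0.3]], [0.1, -0.4], [[1.5, -2.5]], [0.7])
```

Keeping all parameters in a flat real-valued vector lets mutation operators act on individual genes without knowing the network topology.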

B. ANTIBODIES
Antibodies (Ab) are created in response to the initial antigen initialized above. In simple terms, a random population of N antibodies (N MLP networks with different weights and biases) is created, as illustrated in Figure 9.

C. AFFINITY EVALUATION
The affinity of each generated antibody is evaluated according to the predetermined affinity function. This means training the MLP model with the training samples using the set of weights and biases determined by the antibody genes and then seeing how well it performs at classifying the test dataset. Since we are solving classification tasks, test classification accuracy can be used as the affinity function, which needs to be maximized. Test classification accuracy is defined as the ratio between the correctly classified samples and the total number of samples in the test dataset. The higher the accuracy, the higher the affinity.
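Accuracy as an affinity function can be sketched as follows. Here `predict` is a hypothetical callable that maps an antibody's genes and one sample to a class label; the one-gene threshold classifier and the tiny dataset exist only to make the example runnable.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if not true_labels or len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

def affinity(genes, predict, samples, labels):
    """Affinity of an antibody = classification accuracy of the model it encodes."""
    return accuracy([predict(genes, x) for x in samples], labels)

def threshold_predict(genes, x):
    """Toy stand-in for an MLP: predicts class 1 when x exceeds genes[0]."""
    return 1 if x > genes[0] else 0

samples = [0.1, 0.4, 0.6, 0.9]   # hypothetical evaluation samples
labels = [0, 0, 1, 1]
score = affinity([0.5], threshold_predict, samples, labels)
```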

D. SELECTION
Antibodies with higher affinity than the antigen are chosen to proceed to the cloning and hypermutation stages. Antibodies with lower affinity than the antigen are eliminated from the population. Discarding them at this stage shortens the algorithm's execution time.

E. CLONING
In our proposed implementation, proportionate affinity cloning is not implemented. This means that, regardless of their affinity values, all selected antibodies will have the same clone size.

F. MUTATION
Gaussian mutation is applied to the cloned antibodies in this study. It works well for real-valued genes, which is the case here because each weight and bias is encoded as a real value. Gaussian mutation makes small random changes to the antibodies in the population by adding a random value drawn from a Gaussian distribution to the chosen genes. For the CSA, the amount of mutation is inversely proportional to the affinity of an antibody; that is, the higher the affinity, the fewer genes are mutated or altered. This is known as affinity maturation. The goal is to preserve high-affinity antibodies without disturbing them while improving the affinity of low-affinity antibodies. With Gaussian mutation, there is a chance that some gene values will fall outside the range we specified. Our implementation therefore clips all genes after mutation so that every gene value stays within the allowed range. Figure 10 depicts a mutation example.
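The mutation step described above might look like the sketch below. The text does not give the exact rule linking affinity to the number of mutated genes, so the linear rule used here (a fraction `1 - affinity` of the genes, with at least one gene mutated), the value of `sigma`, and the function name are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def gaussian_hypermutation(genes: Sequence[float], affinity: float,
                           sigma: float = 0.1,
                           low: float = -3.0, high: float = 3.0,
                           rng: Optional[random.Random] = None) -> List[float]:
    """Return a mutated copy of `genes`; higher affinity mutates fewer genes."""
    rng = rng or random.Random()
    affinity = min(1.0, max(0.0, affinity))  # affinity is an accuracy in [0, 1]
    # Assumed rule: mutate a (1 - affinity) fraction of the genes, at least one.
    n_mutate = max(1, round((1.0 - affinity) * len(genes)))
    mutated = list(genes)
    for i in rng.sample(range(len(genes)), n_mutate):
        mutated[i] += rng.gauss(0.0, sigma)  # add zero-mean Gaussian noise
    # Clip every gene back into the allowed range.
    return [min(high, max(low, g)) for g in mutated]
```

Under this rule, an antibody with affinity 1.0 has a single gene perturbed, while one with affinity 0.0 has every gene perturbed.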

G. AFFINITY EVALUATION OF THE MUTATED CLONES
The affinity of each mutated clone is calculated. The mutated clone with the highest affinity is selected, and the rest are eliminated.

H. STOPPING CRITERION
The selected clone replaces the original antigen and becomes the antigen for the next generation. The process is repeated from V-B until a criterion, such as a certain number of generations, is met.
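Putting subsections V-A through V-H together, the overall loop can be sketched as follows. This is a simplified reading of the procedure, not the authors' code: the fresh random population in every generation, the handling of generations in which no antibody beats the antigen, the mutation rule, all default parameter values, and the toy affinity function (used in place of an MLP) are assumptions.

```python
import random
from typing import Callable, List, Tuple


def clonal_selection(affinity_fn: Callable[[List[float]], float],
                     n_genes: int, pop_size: int = 20, n_clones: int = 5,
                     generations: int = 100, low: float = -3.0,
                     high: float = 3.0, sigma: float = 0.3,
                     seed: int = 0) -> Tuple[List[float], float]:
    """Run a simplified clonal selection loop and return (genes, affinity)."""
    rng = random.Random(seed)

    def random_genes() -> List[float]:
        return [rng.uniform(low, high) for _ in range(n_genes)]

    def mutate(genes: List[float], aff: float) -> List[float]:
        # Assumed rule: fewer genes are perturbed as affinity (in [0, 1]) rises.
        aff = min(1.0, max(0.0, aff))
        k = max(1, round((1.0 - aff) * n_genes))
        out = list(genes)
        for i in rng.sample(range(n_genes), k):
            out[i] = min(high, max(low, out[i] + rng.gauss(0.0, sigma)))
        return out

    antigen = random_genes()                                          # V-A
    antigen_aff = affinity_fn(antigen)
    for _ in range(generations):
        antibodies = [random_genes() for _ in range(pop_size)]        # V-B
        scored = [(affinity_fn(ab), ab) for ab in antibodies]         # V-C
        selected = [(a, ab) for a, ab in scored if a > antigen_aff]   # V-D
        if not selected:
            continue  # assumption: keep the current antigen if nothing beats it
        clones = [mutate(ab, a) for a, ab in selected
                  for _ in range(n_clones)]                           # V-E, V-F
        clone_scores = [(affinity_fn(c), c) for c in clones]          # V-G
        antigen_aff, antigen = max(clone_scores, key=lambda p: p[0])  # V-H
    return antigen, antigen_aff


def toy_affinity(genes: List[float]) -> float:
    """Stand-in affinity in (0, 1], highest when every gene is 0 (not an MLP)."""
    return 1.0 / (1.0 + sum(g * g for g in genes))


best_genes, best_affinity = clonal_selection(toy_affinity, n_genes=9)
```

Every selected antibody receives the same number of clones, matching the equal clone size described in V-E. Because the best mutated clone always replaces the antigen, a variant that keeps the old antigen whenever no clone improves on it would be a natural alternative to experiment with.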

VI. EXPERIMENT DESIGN

A. DATASETS

1) IRIS FLOWER DATASET
The Iris Flower dataset is a multi-class (3-class) classification problem introduced by Ronald Aylmer Fisher [84]. It is considered the ''hello world'' dataset in machine learning and statistics. The dataset consists of 50 samples for each of three flower species, viz., Iris Setosa, Iris Versicolor, and Iris Virginica. It has 4 features measured in cm, namely, sepal length, sepal width, petal length, and petal width. The problem is to identify the species of an iris flower from these four input features [85], [86].

2) SONAR DATASET
The Sonar dataset is a binary (2-class) classification problem developed by Terry Sejnowski in collaboration with R. Paul Gorman of the Allied-Signal Aerospace Technology Center [87]. The problem is to classify an object as a mine or a rock. It contains 111 examples for the mine class obtained by bouncing sonar signals off a metal cylinder at various angles, and 97 examples for the rock class obtained from rocks under similar conditions. The dataset has 60 input features, with each feature representing the energy within a specific frequency band, integrated over a certain period of time.

3) WHEAT SEEDS DATASET
The Wheat Seeds dataset is a multi-class (3-class) classification problem that involves the classification of species given measurements of seeds belonging to three different varieties of wheat, namely Kama, Rosa, and Canadian. The number of samples for each class is balanced (70 for each), making 210 in total. It has 7 input variables (or features) that were constructed using a soft X-ray technique and the GRAINS package [88].

4) BREAST CANCER WISCONSIN DATASET
The Breast Cancer Wisconsin dataset is a binary (2-class) classification problem in which we attempt to predict one of two possible outcomes (benign or malignant). The dataset contains various measurements of breast tissue samples for cancer diagnosis. It contains measurements like the thickness of the clump, the marginal adhesion, the uniformity of cell size and shape, etc. The dataset was originally provided by Wolberg and Mangasarian [89] from the University of Wisconsin Hospitals in Madison. There are 569 cases of data where 357 cases belong to class 0 (benign), and the remaining 212 cases belong to class 1 (malignant). Therefore, the dataset is imbalanced and more challenging. The number of input variables or features to be used for this dataset is 30.

5) HABERMAN'S SURVIVAL DATASET
The Haberman dataset is a binary (2-class) classification problem [90]. It includes cases from the University of Chicago's Billings Hospital's research on the survival status of patients who had undergone breast cancer surgery between 1958 and 1970. There are 306 data cases (or examples). 225 cases are classified as class 1 (the patient lived for 5 years or more), while the remaining 81 cases are classified as class 2. The dataset contains 3 input variables. The goal is to predict whether a patient will survive for 5 years or longer or die within 5 years after the surgery.

B. EXPERIMENTAL SETUP
The proposed CSA method for training MLP models and the other algorithms are evaluated using the five classification datasets introduced above. Table 3 presents the specifications for each dataset: the number of input features, data examples, and classes. The MLP architecture used for each dataset is defined prior to the training process, taking into account the dimensions of that dataset. Table 4 shows the MLP architectures for all datasets. The input layer for a dataset has one node per input feature. For the binary classification datasets, the output layer contains only one neuron with sigmoid as the activation function. The domain of the sigmoid function is the set of all real numbers, R, and it is defined as follows:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Here, σ is the sigmoid and z is its input. For multiclass classification datasets, the output layer contains one neuron per class with softmax as the activation function. In contrast to the sigmoid function, which takes a single input and maps it to a number between 0 and 1 (the probability of the positive class), the softmax function takes multiple inputs and assigns a probability to each one. The softmax activation function's equation is as follows:
$$S(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$$
Here, S is the softmax, z is the input vector, $e^{z_i}$ is the standard exponential function applied to the i-th element of the input vector, c is the number of classes in the dataset, and $e^{z_j}$ is the standard exponential function applied to the j-th element, summed over all c classes in the denominator.
With respect to the hidden layer, it has little or nothing to do with data dimensions. To the best of the author's knowledge, there is no standard or accepted method for calculating the number of neurons in a hidden layer. As a result, we used systematic experimentation to determine the number of hidden neurons that works best for each dataset. For all datasets, ReLU is used as the activation function in all hidden neurons. ReLU outputs the input directly if it is positive; otherwise, it outputs zero. The ReLU formula is deceptively simple, as defined below:
$$R(z) = \max(0, z)$$
Here, R is the ReLU and z is its input.
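The three activation functions described above can be sketched in plain Python as follows. The overflow-safe form of the sigmoid and the subtraction of the maximum inside the softmax are common numerical safeguards added here as assumptions; they do not change the mathematical result.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def softmax(z: Sequence[float]) -> List[float]:
    """Softmax over a vector; subtracting max(z) leaves the result unchanged."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def relu(z: float) -> float:
    """Return z if it is positive, otherwise zero."""
    return max(0.0, z)
```

The softmax outputs always sum to 1, which is what allows them to be read as per-class probabilities.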
It should be noted that there is no computation involved in the input layer; thus, we have used the term nodes rather than neurons to represent the input layer. The input layer is simply a layer that receives input features.
Regarding the development environment, we used Jupyter Notebook with Python 3.9 to implement the CSA-based trainer and the other trainer algorithms. The weights and biases of the MLP models are initialized in the first iteration to small random values in the range [-1, 1]. Additionally, the input features of each dataset were scaled to lie between zero and one, as given in Eq. (11):
$$z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{11}$$
where $x = (x_1, \ldots, x_n)$ and $z_i$ is the i-th normalized input feature. Furthermore, each algorithm includes a number of control parameters whose values must be carefully chosen. We experimented with various values for these parameters and chose the values that produced the best performance. The goal is to find the best parameters for each algorithm in order to conduct a fair performance comparison. Table 5 shows each algorithm's control parameters. We can see that all nature-inspired algorithms have the same population size and run for 100 iterations. Other parameters are specific to each algorithm.
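The min-max scaling of the input features mentioned above can be sketched as follows. Applying the scaling independently to each feature column, and mapping a constant feature to 0, are assumptions; the function names are illustrative.

```python
from typing import List, Sequence


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Scale one feature's values linearly so they lie in [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant feature maps to 0
    return [(v - lo) / (hi - lo) for v in values]


def scale_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply min-max scaling to each feature (column) of a dataset."""
    columns = [min_max_scale(col) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

For instance, the feature values 2, 4, and 6 are mapped to 0, 0.5, and 1.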

VII. RESULTS AND DISCUSSION
The algorithms' performance is evaluated using 5-fold cross-validation. As shown in Figure 11, each dataset is randomly split into 5 folds of approximately equal size. The first fold is used as a test set to compute a performance measure such as accuracy, and the remaining 4 folds are used as a training set to train the model. The procedure is then repeated so that each fold serves as the test set exactly once, and the results are summarized over the 5 runs. This approach generally results in a less biased estimate of model performance and can be very useful in problems with a small number of data examples.
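A 5-fold split of this kind can be sketched with index lists, as below. The shuffling seed and the way samples are dealt into folds are assumptions, and the function name is illustrative.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_splits(n_samples: int, k: int = 5,
                  seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random assignment to folds
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 150 samples (the size of the Iris dataset), each test fold holds 30 samples.
splits = list(k_fold_splits(n_samples=150))
```

Averaging the accuracy over the five test folds, and taking its standard deviation, gives AVG and STD figures of the kind reported in the result tables.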

1) RESULTS FOR THE IRIS FLOWER DATASET
The MLP architecture for solving this dataset is 4 − 11 − 3. Therefore, a fully-connected 4 − 11 − 3 MLP will have (4 × 11) + (11 × 3) + (11 + 3) = 91 weights and biases. The goal of all algorithms is to find the optimal values for these weights and biases so that the Iris flowers can be correctly classified. Table 6 gives the average accuracy (AVG) and standard deviation (STD) for each algorithm obtained using the 5-fold cross-validation approach. As given in the table, CSA outperforms all other algorithms. CSA is followed by HHO, MFO, FPA, GA, and BP. PSO and ACO show similar performance, but PSO has a lower standard deviation.
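The parameter count above generalizes to any fully-connected architecture: sum the products of consecutive layer sizes for the weights, then add one bias per non-input neuron. A short sketch, with an illustrative function name:

```python
from typing import Sequence


def count_parameters(layer_sizes: Sequence[int]) -> int:
    """Number of weights and biases in a fully-connected MLP."""
    weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    biases = sum(layer_sizes[1:])  # one bias per hidden and output neuron
    return weights + biases


iris_genes = count_parameters((4, 11, 3))  # (4*11) + (11*3) + (11+3) = 91
```

This count is also the number of genes in each antibody, so it sets the dimensionality of the search space that every trainer must explore.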

2) RESULTS FOR THE SONAR DATASET
The MLP architecture for solving this dataset is 60 − 30 − 1. Therefore, a fully-connected 60 − 30 − 1 MLP will have (60 × 30) + (30 × 1) + (30 + 1) = 1861 weights and biases. Table 7 gives the average accuracy (AVG) and standard deviation (STD) for each algorithm obtained using the 5-fold cross-validation approach. As given in the table, CSA again outperforms all other algorithms. CSA is followed by HHO, MFO, FPA, GA, PSO, ACO, and BP, respectively. Because the Sonar dataset is imbalanced and contains many more input features than the Iris dataset, all algorithms achieved lower accuracy than on the Iris dataset. Another reason is that the number of data examples in the Sonar dataset is very small relative to its high number of input features.

3) RESULTS FOR THE WHEAT SEEDS DATASET
The MLP architecture for solving this dataset is 7 − 25 − 3. Therefore, a fully-connected 7 − 25 − 3 MLP will have (7 × 25) + (25 × 3) + (25 + 3) = 278 weights and biases. Table 8 gives the average accuracy (AVG) and standard deviation (STD) for each algorithm obtained using the 5-fold cross-validation approach. As given in the table, CSA again outperforms all other algorithms. CSA is followed by HHO, MFO, FPA, GA, PSO, ACO, and BP, respectively.

4) RESULTS FOR THE BREAST CANCER WISCONSIN DATASET
The MLP architecture for solving this very challenging dataset is 30 − 25 − 1. Therefore, a fully-connected 30 − 25 − 1 MLP will have (30 × 25) + (25 × 1) + (25 + 1) = 801 weights and biases. Table 9 gives the average accuracy (AVG) and standard deviation (STD) for each algorithm obtained using the 5-fold cross-validation approach. As given in the table, CSA slightly outperforms all other algorithms. CSA is followed by HHO, MFO, FPA, GA, PSO, ACO, and BP, respectively.

5) RESULTS FOR THE HABERMAN'S SURVIVAL DATASET
The MLP architecture for solving this very challenging dataset is 3 − 9 − 1. Therefore, a fully-connected 3 − 9 − 1 MLP will have (3 × 9) + (9 × 1) + (9 + 1) = 46 weights and biases. Table 10 gives the average accuracy (AVG) and standard deviation (STD) for each algorithm obtained using the 5-fold cross-validation approach. As shown in the table, CSA outperforms all other algorithms. CSA is followed by HHO, MFO, FPA, GA, PSO, BP, and ACO, respectively. It is unlikely that any model developed using the Haberman's Survival dataset will generalize, both because the dataset is small and because the data is based on diagnoses and operations for breast cancer that occurred many decades ago.
According to the experimental results on the five datasets, CSA outperforms the other algorithms, particularly backpropagation. There are several reasons why CSA should be considered instead of backpropagation for training MLP neural networks. Backpropagation relies on computing gradients and is highly sensitive to the initial values of the weights and biases, the learning rate, and the momentum. In some cases, a small change in any of these values has a significant impact on the predictive performance of the MLP network being trained. In contrast, CSA is a gradient-free optimization algorithm that can jump out of local minima, and the weights and biases can be reinitialized to start searching a new area of the search space, allowing it to find a near-optimal solution if run long enough. The drawback, however, is that CSA and other nature-inspired algorithms often take longer than backpropagation to reach a solution. So, when the training time is acceptable, which can vary greatly depending on the problem, CSA is a better alternative to backpropagation for training MLP networks.

VIII. CONCLUSION
MLP networks are among the most well-known and widely used machine learning models. They have been applied in a variety of real-world applications, including medical diagnosis, electronic signal analysis, active and passive sonar target classification, seed and flower classification, and more. MLP network training is an optimization problem, and therefore, the optimization algorithm used is of primary importance. Backpropagation is the most widely used algorithm to train MLP networks. However, backpropagation is not ideal and is often unable to find the global minimum. Optimization algorithms inspired by nature can be used to effectively train MLP networks. This research presents an efficient MLP training technique that employs CSA to improve the predictive accuracy of MLP when solving real-world problems. CSA involves selecting candidate solutions called antibodies based on affinity by matching against the primary antigen. The selected antibodies are cloned and then undergo hypermutation inversely proportional to their affinity. Aside from the encoding paradigm used to represent the MLP's weights and biases as antibodies, our proposed methodology includes genetic operators for cloning and mutation to improve the MLP's ability to avoid getting stuck in local optima. The proposed CSA training method is tested on five real-world benchmark classification datasets of varying difficulty. Different MLP architectures are trained, taking into account the dimensions of each dataset. To validate the proposed CSA method's effectiveness in training MLPs, its performance is compared to that of BP and six other popular nature-inspired training algorithms: GA, PSO, ACO, HHO, MFO, and FPA. We conduct systematic experimentation with a robust test harness in order to determine the parameters that provide the best possible performance for each algorithm.
Experimental results show that MLP models provide better results on all five datasets when trained by CSA compared to other benchmark algorithms. This is due to the CSA's ability to avoid getting stuck in local optima. Therefore, it can be concluded that the proposed CSA method shows great promise for search and optimization problems, such as training neural networks.

IX. FUTURE WORK
Future work will focus on a thorough assessment of the CSA method on large benchmark datasets. Another potential direction is training two popular classes of deep learning neural networks: Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). Both CNNs and RNNs are changing the way we communicate with the world. They are at the core of the deep learning revolution, powering a wide range of real-world applications such as self-driving cars, unmanned aerial vehicles, speech recognition, etc. Aside from the network architecture, which has been the primary focus of researchers' optimization efforts, their performance is also heavily reliant on the training algorithm chosen to optimize the network weights and biases. Although several optimizers have been proposed in the literature to address the shortcomings of traditional gradient descent approaches (e.g., SGD), there is still the possibility of getting trapped in local optima. Therefore, the future direction in this domain would be to propose efficient CSA methods for training deep neural networks in order to reduce the risk of falling into local optima. This can be achieved more easily than ever before due to the recent significant increase in processing power. A hybrid model combining CSA with a gradient-based optimizer (such as Adam) could also be developed in the future. The CSA can be used to find values for the initial weights and biases for the gradient descent algorithm, allowing it to avoid local optima and thus improve the predictive accuracy of the deep neural network being trained.