Artificial Neural Networks Hidden Unit and Weight Connection Optimization by Quasi-Reflection-Based Learning Artificial Bee Colony Algorithm

Artificial neural networks are one of the most commonly used methods in machine learning. The performance of a network highly depends on the learning method. Traditional learning algorithms are prone to becoming trapped in local optima and have slow convergence. On the other hand, nature-inspired optimization algorithms have proven to be very efficient in solving complex optimization problems because they do not require derivative information. Addressing the issues of traditional learning algorithms, in this study an enhanced version of the artificial bee colony nature-inspired metaheuristic is proposed to optimize the connection weights and hidden units of artificial neural networks. The proposed improved method incorporates quasi-reflection-based learning and guided best solution bounded mechanisms into the original approach and manages to conquer its deficiencies. First, the method is tested on the recent challenging CEC 2017 benchmark function set, then applied for training an artificial neural network on five well-known medical benchmark datasets. Further, the devised algorithm is compared to other metaheuristics-based methods. The efficiency is measured by five metrics: accuracy, specificity, sensitivity, geometric mean, and area under the curve. Simulation results prove that the proposed algorithm outperforms other metaheuristics in terms of accuracy and convergence speed. The improvement in accuracy over the other methods on different datasets is between 0.03% and 12.94%. The quasi-reflection-based learning mechanism significantly improves the convergence speed of the original artificial bee colony algorithm, and together with the guided best solution bounded mechanism, the exploitation capability is enhanced, which results in significantly better accuracy.

A prominent area of machine learning focuses on algorithms inspired by the functionality and structure of the human brain, known as artificial neural networks (ANNs). The ANNs are formed by a number of interconnected processing nodes (neurons) that transform a collection of inputs into a collection of outputs. The transformation is defined by the characteristics of the nodes, together with the weights of the connections between the nodes. It is possible to adapt the network by modifying the connections between neurons. During the learning process in ANNs, the weights and the biases are adjusted, and their values strengthen the connections between neurons in various layers. The values of the weights and biases are updated during the learning process in a way that reduces the classification error rate, which is measured by a loss function. The connection weight and bias value adjustment is also called neural network training. Neural network training belongs to the type of supervised learning, which learns from labeled data by comparing the actual class to the predicted class.
Neural networks face two great challenges. One is the network training, and the other is finding the appropriate network structure. Standard optimizers are typically used to train neural networks. A survey of the available literature shows that metaheuristics can be successfully utilized instead of these optimizers. The second challenge is to find the appropriate network structure for the given task, which is also known as the process of hyperparameter optimization. Both challenges are considered to be NP-hard problems by nature; in other words, they cannot be solved by traditional approaches in an acceptable amount of time. Instead, they require the application of stochastic approximation approaches, such as nature-inspired metaheuristics.
Deterministic and stochastic approaches are used for training ANNs. Gradient-based training and back-propagation [1] are used most commonly for neural network optimization; these are deterministic approaches, and they suffer from local optima stagnation, vanishing gradients, and slow convergence. In the back-propagation (BP) methods, additional learning parameters should be determined, such as the learning rate and momentum.
These issues motivated researchers to find algorithms and approaches which avoid getting trapped in local minima and speed up the convergence. Addressing this issue, different derivative-free algorithms, such as metaheuristic algorithms, have been applied for neural network training. First, in 1989, a genetic algorithm (GA) was proposed by Montana and Davis [2] to train ANNs. In that paper, the results show that GA outperforms BP on sonar image classification problems. Later, other metaheuristics were applied successfully for weight and bias optimization [3], [4], [5], [6]. The utilization of metaheuristic-based algorithms improves the search for such values of weights and biases that reduce the classification error rate more than gradient-based approaches.
In this study, an enhanced version of the well-known and widely applied artificial bee colony (ABC) swarm intelligence metaheuristic, which overcomes the observed shortcomings of the basic approach, is devised. The proposed improved method incorporates the quasi-reflection-based learning and guided best solution bounded mechanisms into the basic artificial bee colony and, according to the experimental findings, manages to significantly improve the convergence speed and results' quality of the original algorithm. The devised method was first tested on the recent challenging CEC 2017 benchmark function set, then adapted and applied for training artificial neural networks and evaluated on five well-known medical benchmark datasets. This paper is motivated by the following research questions: How can a neural network training method be developed to achieve higher accuracy and speed up the training process? How can an efficient metaheuristic-based algorithm for neural network training be developed? The objective of this work addresses these research questions as follows:
• develop an improved ABC metaheuristic which outperforms the basic ABC and its variants in terms of convergence speed and quality of solutions;
• adopt the newly developed ABC method to optimize the connection weights and biases in the neural network, which results in better accuracy and faster execution than other existing methods; and
• extend the experiment by hidden unit optimization, keeping high accuracy while reducing the computation time.
The rest of this paper is organized as follows: Section II presents the background and related work, Section III provides an overview of artificial neural networks and their optimization, Section IV describes the original ABC algorithm, its deficiencies, and the proposed method, Section V presents the CEC 2017 simulations, followed by the experiments on artificial neural network training optimization and hidden unit optimization, and Section VI concludes the paper.

II. BACKGROUND AND RELATED WORK
There are numerous implementations of swarm intelligence metaheuristics, either in original or in enhanced/hybridized forms, that were tested against standard unconstrained and constrained benchmark function sets. Additionally, a large number of algorithms were validated on practical NP-hard challenges in various domains. One of the first algorithms from this group was particle swarm optimization (PSO), described in [7]. PSO mimics the behavior exhibited by flocks of birds or fish, and it was used to solve different practical problems, such as the task scheduling problem in cloud computing [8], [9]. Another famous swarm algorithm is the ant colony optimization (ACO) algorithm [10], which was inspired by the social behavior of a colony of ants. Artificial bee colony (ABC) is another well-known representative of swarm intelligence, which is considered to be a very efficient optimizer [11]. ABC has been tested against benchmark functions [12] and applied in solving various practical problems from different domains [13], [14]. Other well-known algorithms that belong to the group of swarm metaheuristics include the firefly algorithm (FA) [15], [16], the bat algorithm (BA) [17], [18], the whale optimization algorithm (WOA) [19], [20], and the elephant herding optimization (EHO) [21], [22]. Some of the newer swarm approaches include the moth search algorithm (MS), proposed by Wang in 2016 [23], which is considered to be one of the most efficient algorithms according to the test results against standard benchmark problems [24], and it has shown very promising results when applied in real-world NP-hard scenarios, such as the drone placement problem [25] and lifetime optimization in wireless sensor networks [16]. There is a great number of domains and practical problems where swarm intelligence algorithms can be successfully applied.
In some cases, swarm intelligence algorithms were able to achieve state-of-the-art results, including: path planning [26], node localization problem and energy efficiency in wireless sensor networks [27], [28], cloud computing and task scheduling [29], [18], [30], COVID-19 cases prediction [31], [32], feature selection problem [33], [34], ANN and CNN training optimization [35], [36], [37], [38], [39], text document clustering [40], as well as many others [41], [42].
The ANNs have a large domain of applications, ranging from image classification [35], [36], [37], [38] and time series prediction [43], [44], [31] to wind speed forecasting [45], [46], [47]. The learning process of an ANN is considered to be one of the most difficult challenges in machine learning and has attracted many researchers recently. Metaheuristic approaches have been widely used in the process of training artificial neural networks, as can be seen from the recent literature. The grasshopper optimization algorithm (GOA) was proposed as a hybrid training algorithm for multilayered perceptron neural networks (MLP), and as the authors stated in their paper [48], it obtained promising results. The whale optimization algorithm (WOA) was used in [49] to train an ANN for an intrusion detection model, which is able to classify binary-class, triple-class, and multi-class cyber-attacks and power-system incidents. WOA was also used to train an ANN by optimizing the connection weights in [50]. Another recent research effort, published in [51], utilized a hybrid wolf-bat algorithm for training MLP networks. Swarm intelligence metaheuristics and EA approaches were also used in the domain of CNN hyperparameter optimization, according to the recent literature survey. The goal of hyperparameter optimization is to create an automated framework that is able to generate either an optimal or a near-optimal CNN structure for the specific task that needs to be solved. As this task is extremely complex, many researchers tried to optimize only a few CNN hyperparameters, while keeping all other parameters fixed. Two PSO-based approaches for CNN design were proposed in [52], [53]. The paper [52] presents an improved PSO approach, enhanced with gradient penalties for generating optimal CNN structures. The authors validated the proposed approach on the classification of three emotional states of subjects, obtained by using EEG signals, and achieved respectable results.
On the other hand, in [53], the authors used an orthogonal learning particle swarm optimization (OLPSO) approach to optimize the hyperparameters' values for VGG16 and VGG19 CNNs, and later applied the generated CNNs to diagnose the plant disease. The proposed OLPSO approach was validated against other state-ofthe-art algorithms on the same dataset and obtained better classification accuracy.
The problem of over-fitting was addressed in [54], where the authors implemented and utilized four well-known swarm intelligence algorithms, namely FA, BA, CS, and PSO, to establish an adequate selection of the dropout regularization parameter. All four algorithms were validated on the well-known MNIST image classification dataset and achieved satisfying accuracy.
PSO approach was also used in [55], where authors utilized the canonical PSO for CNN (cPSO-CNN) and managed to adapt to the CNN hyperparameters' variable ranges through the improvement of the canonical PSO exploration capabilities and redefinition of the PSO scalar acceleration coefficients to the vector form. The proposed method was then validated against seven state-of-the-art methods on the same image classification task and proved to be superior both in terms of the classification accuracy and processing costs. Another PSO-based approach was used in [56], where authors managed to create CNNs with a better configuration for a set of five given image classification tasks than AlexNet.
Evolutionary algorithms have also been applied together with CNNs. In [57], the authors proposed an approach which combines CNNs and GA in the case of non-invasive glioma classification by utilizing magnetic resonance imaging (MRI). The proposed approach was based on an automatic framework for neuroevolution that utilizes the GA for deep network evolving. Another research effort, published in [58], focused on generating a DPPN, a differentiable version of the compositional pattern producing network (CPPN). The DPPN was created by utilizing the microbial GA for CNN structure replication. Recently, a new project called DEvol was established by [59]. The goal of the project is to automate deep neural network architecture design. DEvol supports a variable number of deep and convolutional layers. The available documentation suggests that the proposed framework achieved a test error rate of 0.6% on the MNIST dataset, which is considered to be a state-of-the-art result.
Despite the fact that numerous algorithms have been proposed for the learning process in ANNs, new algorithms should be developed to avoid local minima stagnation, slow convergence, and improper exploration-exploitation balance.

III. ANN TRAINING AND HYPERPARAMETER OPTIMIZATION
Neural network training is a very important process and plays a crucial role in building a model that performs well. During the weight learning process, the loss function needs to be optimized. Numerous optimizers have been suggested to address this task, as can be seen from the recent literature overview. Some of the proposed optimization algorithms include stochastic gradient descent, Adam, Adadelta, Adagrad, momentum, and many others [60], [61], [62].
The common problem with the neural network training process that occurs when there is a big difference between the training and test accuracy is called over-fitting. This problem indicates that the network has learned very specific data and is not able to predict correctly when new data is fed to the inputs. To address the problem of over-fitting, various regularization approaches can be applied, including L1 and L2 regularization [63], dropout [64], drop connect [65], batch normalization [66], early stopping, and data augmentation.
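As a small illustration of how a regularization term modifies the training objective, the following sketch (toy data and a made-up penalty strength, not taken from the paper) adds an L2 penalty to the MSE loss:

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # Mean-squared error between actual and predicted outputs
    return np.mean((y_true - y_pred) ** 2)

def l2_penalized_loss(y_true, y_pred, weights, lam=0.01):
    # L2 regularization adds lam * sum(w^2) to the loss,
    # discouraging large weights and thereby reducing over-fitting
    return mse_loss(y_true, y_pred) + lam * np.sum(weights ** 2)

# Toy example values (hypothetical, for illustration only)
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
w = np.array([0.5, -1.2, 0.3])

plain = mse_loss(y_true, y_pred)
penalized = l2_penalized_loss(y_true, y_pred, w, lam=0.01)
```

The penalized loss is always at least as large as the plain loss; the optimizer therefore trades prediction error against weight magnitude.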
Artificial neural networks can be applied to solving difficult problems from various domains. ANNs can obtain respectable results when handling supervised or unsupervised machine learning tasks [67]. These tasks include machine perception problems, where it is not possible to individually interpret the set of available primary features, as stated in [68]. Consequently, ANNs have been intensively utilized for pattern recognition, classification, clustering, and prediction problems. For example, in the medical domain, different forms of ANNs were utilized for diagnostics [68] and the classification of heart diseases or diabetes. The application of ANNs enabled shortening of the time required for diagnostics by processing large volumes of data during the ANN training.
The most abstract type of ANN is called a single-layer perceptron (SLP). SLPs have only two layers, one input and one output layer, as described in [69]. Unfortunately, this type of ANN is not capable of efficiently processing nonlinearly separable patterns, as discussed in [5]. Later, models with multiple layers were proposed, and they are known as multilayer perceptrons (MLP). This kind of neural network overcomes the deficiencies of the SLP model by utilizing one or more hidden layers. MLPs are arguably the most popular form of ANNs today, with advantages that include learning capacity, parallel processing, and robustness. The most important feature of MLPs is the capacity to generalize [70]. In the research proposed within this paper, MLPs with a single hidden layer (SHL) are observed, with the goal to optimize the number of hidden units in the hidden layer as well as the connection weights and biases.
The capabilities of any ANN can be drastically enhanced depending on the learning strategy chosen for the network training. Among supervised training techniques, two main approaches exist, namely gradient-based and stochastic methods [5]. Back-propagation is the most widely utilized gradient-descent approach today. It can be applied as an algorithm for local search, due to its exploitation tendency. However, as the goal is to find the global optimum, the chosen optimizer should be balanced between exploration and exploitation. The exploration phase is necessary to search through the unknown regions of the search space, while the exploitation phase is responsible for focusing on the already explored areas. The drawbacks of gradient-based approaches include getting trapped in local optima and slow convergence, to name a few. Therefore, stochastic optimizers can be utilized for MLP training, including metaheuristic trainers, which are able to escape from local optima.
In the case where it is needed to optimize both the structure of the network and the weights, it is required that the MLP trainer address a large-scale task [5]. As it was discussed in the Section II, both evolutionary and swarm intelligence metaheuristics have been utilized to optimize the connection weights of the network, as well as the MLP's structure and parameters. In the research presented within this paper, we consider both optimizing weights and biases in the SHL networks, and additionally, to optimize the number of hidden units within the hidden layer.
The MLP networks are a sub-type of feedforward neural networks (FFNN). The FFNNs are formed from a collection of neurons, which perform the role of the processing elements. These neurons are grouped in a series of fully connected layers. An MLP consists of three types of parallel layers, namely the input, hidden, and output layers. Figure 1 shows the MLP architecture with a single hidden layer. Neurons in an MLP are set up in a one-directional regime. The layers are connected by the connection weights, and each neuron j first computes the weighted sum of its inputs, as given by Eq. (1):

s_j = \sum_{i=1}^{n} \omega_{ij} \cdot I_i + \beta_j, (1)

here, n stands for the number of input values, I_i represents the input value i, \omega_{ij} is the connection weight, and finally, \beta_j denotes the bias term. The activation function is executed over the output of Eq. (1). There are several possible types of activation function; for instance, it is possible to utilize the S-shaped sigmoid function, which is given by Eq. (2):

f(s_j) = \frac{1}{1 + e^{-s_j}} (2)

The performance of the network is measured by a loss function. A common choice of the loss function is the MSE, the mean-squared error loss, which calculates the squared distances between the actual class y_k and the predicted class \hat{y}_k over the m training samples, as follows:

MSE = \frac{1}{m} \sum_{k=1}^{m} (y_k - \hat{y}_k)^2 (3)

For example, if the data has two features, which (together with a bias node) correspond to three neurons in the input layer, and if the hidden layer has three hidden units, the neural network's parameters can be represented as matrices of connection weights and vectors of bias terms between the consecutive layers.
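As an illustration of the forward pass described by Eqs. (1) and (2), the following NumPy sketch (not the authors' code; the layer sizes and random weights are made up for the example) computes the output of a single-hidden-layer network:

```python
import numpy as np

def sigmoid(s):
    # S-shaped activation of Eq. (2): f(s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + np.exp(-s))

def forward(inputs, W, beta):
    # Eq. (1): weighted sum s_j = sum_i(w_ij * I_i) + beta_j,
    # followed by the sigmoid activation of Eq. (2)
    return sigmoid(inputs @ W + beta)

# Toy SHL network: 2 inputs -> 3 hidden units -> 1 output
rng = np.random.default_rng(0)
I = np.array([0.5, -0.2])        # one input sample
W1 = rng.normal(size=(2, 3))     # input-to-hidden connection weights
b1 = rng.normal(size=3)          # hidden-layer biases
W2 = rng.normal(size=(3, 1))     # hidden-to-output connection weights
b2 = rng.normal(size=1)          # output bias

out = forward(forward(I, W1, b1), W2, b2)   # network output in (0, 1)
```

A metaheuristic trainer would treat the flattened contents of W1, b1, W2, and b2 as one candidate solution and use the loss of Eq. (3) as the fitness function.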

IV. PROPOSED METHOD
The ABC algorithm, originally proposed by Karaboga [11], was mainly inspired by the foraging behavior of the swarms of honey bees. ABC was used to solve many challenges, including global [71] and constrained [72] optimization problems, as well as the practical industrial challenges [73]. This section first gives an overview of the basic ABC algorithm, followed by the observed deficiencies and proposed improvements of the basic algorithm. Finally, this section presents the proposed improved ABC approach.

A. BASIC ABC
The original ABC algorithm considers three types of bees: employed, onlookers, and scouts. These bees guide the processes of exploration and exploitation. The artificial bee colony is separated into two parts: one half of the bees are employed, while the other half are onlookers. Employed bees exploit the sources of food represented by the candidate solutions. On the other side, onlooker bees determine which sources of food to exploit according to the feedback received from the employed bees. If a certain source of food cannot be improved in a previously determined number of iterations, the employed bee which was exploiting that source becomes a scout and starts with the exploration process. In the beginning, the ABC algorithm creates an initial population consisting of randomly distributed solutions [72], by applying Eq. (4):

x_{i,j} = lb_j + rand(0, 1) \cdot (ub_j - lb_j), (4)

where x_{i,j} denotes the j-th parameter of the i-th solution, rand(0, 1) is a uniformly distributed random number, while ub_j and lb_j define the upper and lower borders of the j-th parameter, respectively. In each round of the execution of the algorithm, every employed bee from the population discovers a source of food within its neighborhood, as given by Eq. (5):

v_{i,j} = \begin{cases} x_{i,j} + \varphi \cdot (x_{i,j} - x_{k,j}), & \text{if } R_j < MR \\ x_{i,j}, & \text{otherwise} \end{cases} (5)

here, x_{i,j} denotes the j-th parameter of the old solution i, x_{k,j} denotes the j-th parameter of a neighbor solution k, \varphi denotes a random value in the interval (0, 1), R_j is a uniformly distributed random number, and MR represents the modification rate, a control parameter that prevents convergence to the suboptimal regions of the search space. After finding a neighborhood solution, its fitness value is compared to the old one, and in case it is better, the new solution is kept in the population.
When the intensification is finished, employed bees give feedback to the onlooker bees about the quality of the food sources. Onlooker bees choose a food source i with a probability proportional to its fitness value, which can be mathematically modeled by Eq. (6):

p_i = \frac{fit_i}{\sum_{j=1}^{m} fit_j}, (6)

where p_i denotes the probability that the food source i will be chosen, m stands for the total number of food sources, while fit_i represents the fitness value of food source i. Eq. (6) states that a greater number of onlooker bees will be attracted by the good food sources. After onlooker bees determine which food source will be exploited, they start searching around its neighborhood in the same manner as employed bees, which is described by Eq. (5). If an employed bee is not able to improve a certain source, that source will be abandoned, while the bee becomes a scout bee. The deserted food source is replaced by a new, random one. The control parameter limit is used to determine which food source (solution) will be deserted. The original ABC can be summarized with the pseudo-code given as Algorithm 1.
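A compact sketch of these three building blocks (random initialization, neighbor search with the modification rate, and fitness-proportional selection) might look as follows; this is an illustrative reading of Eqs. (4) to (6), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_solution(lb, ub):
    # Eq. (4): random solution uniformly distributed within [lb, ub]
    return lb + rng.random(lb.shape) * (ub - lb)

def neighbor(x_i, x_k, mr=0.8):
    # Eq. (5): each parameter is perturbed toward/away from a neighbor
    # solution with probability MR (the modification rate)
    phi = rng.random(x_i.shape)            # phi in (0, 1), as in the paper
    mask = rng.random(x_i.shape) < mr
    return np.where(mask, x_i + phi * (x_i - x_k), x_i)

def selection_probs(fitness):
    # Eq. (6): onlooker selection probability proportional to fitness
    return fitness / fitness.sum()

# Hypothetical 4-dimensional problem with bounds [-5, 5]
lb, ub = np.full(4, -5.0), np.full(4, 5.0)
x = init_solution(lb, ub)
v = neighbor(x, init_solution(lb, ub))
p = selection_probs(np.array([1.0, 3.0, 6.0]))
```

The greedy replacement step (keep `v` only if it is fitter than `x`) and the `limit` counter for scout behavior would wrap around these primitives.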

Algorithm 1 Original ABC algorithm
Initialization Phase
repeat
    Employed Bees Phase
    Onlooker Bees Phase
    Scout Bees Phase
    Memorize the best solution obtained so far
until iteration = maximum iteration number

B. OBSERVED DEFICIENCIES AND PROPOSED IMPROVEMENTS
According to the extensive empirical simulations on standard bound-constrained and unconstrained benchmarks conducted for the purpose of this research, as well as the results of previous studies [13], [74], the original ABC algorithm suffers from a few drawbacks. The algorithm is good at exploration due to the scout bee mechanism; however, the exploitation procedure is not sufficient, and therefore the basic algorithm does not exhibit good convergence speed. Also, the best solution is not used to guide the algorithm to explore the search space around the current best solution. To overcome these deficiencies of the original ABC, two additional mechanisms are introduced into the basic version.
First, to enhance the exploitation and utilize the information of the current best solution, after 20% of the iterations have elapsed, in each subsequent step, 25% of the worst solutions are replaced by new random solutions created within the boundaries of the minimum and maximum values of the best solution's components according to Eq. (7). The proposed modification is named the guided best bounded (gBestB) mechanism. The values of 20% and 25% were determined empirically by conducting extensive simulations on the benchmark functions, as well as on practical optimization challenges. This behavior could be further adjusted if additional control parameters were employed; however, novel parameters would just make it harder for the user to adjust the algorithm's behavior, therefore they are not included.
x_{new} = min(x_{best}) + r \cdot (max(x_{best}) - min(x_{best})), (7)

where x_{new} denotes a newly generated solution component, the best solution is denoted by x_{best}, min(x_{best}) denotes the minimum value of the best solution's elements, max(x_{best}) denotes the maximum value of the best solution's elements, and r is a random number drawn from the uniform distribution.
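The replacement rule of Eq. (7) can be sketched as follows (an illustrative reading only; the bookkeeping that identifies the 25% worst solutions is omitted, and the example best solution is made up):

```python
import numpy as np

rng = np.random.default_rng(7)

def gbestb(x_best, n_new):
    # Eq. (7): new solutions are drawn uniformly within the interval
    # spanned by the minimum and maximum components of the best solution
    lo, hi = x_best.min(), x_best.max()
    return lo + rng.random((n_new, x_best.size)) * (hi - lo)

x_best = np.array([0.3, -1.5, 0.9, 2.0])    # hypothetical current best
replacements = gbestb(x_best, n_new=3)      # would replace the worst 25%
```

Because every new component lies between min(x_best) and max(x_best), the mechanism concentrates the search around the region occupied by the current best solution.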
Second, to further improve the exploitation capability, the quasi-reflection-based learning (QRL) mechanism [75] is incorporated; it also improves the exploration and, besides that, significantly improves the convergence speed.
Opposite numbers are generated from the solutions as follows:

x_j^{o} = lb_j + ub_j - x_j, (8)

where x_j^{o} denotes the opposite number of the solution component x_j, and lb_j and ub_j denote the lower and upper bounds of the j-th element. The quasi-opposite number is calculated as:

x_j^{qo} = rnd\left(\frac{lb_j + ub_j}{2}, x_j^{o}\right), (9)

where (lb_j + ub_j)/2 is the mean of the lower and upper bounds, and rnd(a, b) generates a random number from the uniform distribution in the range (a, b). The quasi-reflected component x_j^{qr} is defined as the reflection of x_j^{qo} and calculated as:

x_j^{qr} = rnd\left(\frac{lb_j + ub_j}{2}, x_j\right). (10)

In this way, quasi-reflexive-opposite individuals are generated, and in case the original solution is located far away from the optimal value, a fair chance exists that the opposite solution is located within the area where the optimum resides. The QRL mechanism is employed in each iteration in the following way: first, for each solution, its quasi-reflexive-opposite is generated according to Eq. (10), creating the quasi-reflexive population P^{qr}. Afterwards, all solutions from P ∪ P^{qr} are sorted based on fitness and the best NP individuals are propagated to the subsequent iteration, where P and NP denote the original population and the number of individuals in the population.
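The quasi-reflection step of Eq. (10) can be sketched as follows (illustrative only, with made-up bounds; `rnd(a, b)` is realized by sampling uniformly between the domain midpoint and the solution component):

```python
import numpy as np

rng = np.random.default_rng(3)

def quasi_reflected(x, lb, ub):
    # Eq. (10): x_qr is drawn uniformly between the domain midpoint
    # (lb + ub) / 2 and the solution component x itself
    mid = (lb + ub) / 2.0
    lo = np.minimum(mid, x)    # handle components on either side of mid
    hi = np.maximum(mid, x)
    return lo + rng.random(x.shape) * (hi - lo)

# Hypothetical 3-dimensional solution on the domain [-5, 5]
x = np.array([4.0, -3.0, 0.5])
lb = np.full(3, -5.0)
ub = np.full(3, 5.0)
x_qr = quasi_reflected(x, lb, ub)
```

In the full QRL step, one such quasi-reflected individual is created per solution, the two populations are merged, and the best NP individuals survive.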
The proposed approach is named ABCQRBEST, and its pseudo-code, with the maximum number of iterations T as the termination condition, is presented in Algorithm 2.

Algorithm 2 Pseudo-code of the proposed ABC-QRBEST algorithm
Randomly create the initial population P of NP random solutions
Evaluate the fitness of each solution
while t < T do
    for i = 1 to NP do
        Employed bee phase
        Onlooker bee phase
        Scout bee phase
        if t >= T × 0.2 then
            Generate random solutions by utilizing Eq. (7)
        end if
    end for
    for i = 1 to NP do
        Generate the quasi-reflexive-opposite solution x_qr_i by Eq. (10)
        Add solution x_qr_i to P_qr
    end for
    Merge P and P_qr (P ∪ P_qr)
    Evaluate the fitness of each solution and sort the population
    Choose the best NP solutions for the next iteration
end while
Return the best solution
Post-process and visualize the results

The flowchart of the proposed method is presented in Fig. 2.

C. COMPLEXITY AND LIMITATIONS OF PROPOSED APPROACH
The most expensive operation in metaheuristics execution is the fitness function evaluation (FFE). Therefore, based on the most relevant and recent computer science literature, the algorithms' complexity is measured in terms of employed FFEs [15]. The basic ABC algorithm evaluates the fitness function in the initialization phase and in the solutions' update phase (employed and onlooker mechanisms). However, due to the implementation of the QRL mechanism, the proposed ABCQRBEST metaheuristic performs additional NP evaluations in each iteration, and due to the gBestB mechanism it also utilizes additional NP · 0.25 evaluations in T · 0.8 iterations. In practice, the scout bee phase is rarely executed and it can be omitted from the complexity calculation.
Therefore, by taking all of the above into account, the complexity of the proposed ABCQRBEST metaheuristic in terms of FFEs is given as:

FFE = NP + T \cdot (2 \cdot NP + NP) + 0.8 \cdot T \cdot 0.25 \cdot NP = NP + 3.2 \cdot NP \cdot T

Compared with the original ABC, when taking T as the termination condition, the proposed ABCQRBEST employs a higher number of FFEs in each iteration, which is one disadvantage of the proposed method compared to the original one. However, in practice the number of FFEs is taken as the termination condition, and this drawback does not influence the fair comparison between methods.
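The FFE accounting described above can be checked numerically; the sketch below simply counts evaluations under the stated assumptions (scout phase omitted), using the NP = 30 and T = 500 values adopted later in the experiments:

```python
NP, T = 30, 500  # population size and iteration count used in the experiments

# Basic ABC: NP initial evaluations plus 2*NP per iteration
# (employed + onlooker phases; the rare scout phase is omitted)
ffe_abc = NP + 2 * NP * T

# ABCQRBEST additionally spends NP QRL evaluations per iteration
# and 0.25*NP gBestB evaluations in the last 80% of the T iterations
ffe_qrbest = NP + (2 * NP + NP) * T + round(0.25 * NP * 0.8 * T)
```

This makes the per-iteration overhead of ABCQRBEST explicit, which is why the comparisons in Section V are budgeted by a fixed number of FFEs rather than by iterations.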
Also, another limitation of the proposed method stems from the fact that universal control parameter adjustments, which would obtain the best performance metrics for all problems, do not exist (according to the no free lunch theorem, NFL). In this context, the percentage of worst solutions that should be replaced by the gBestB mechanism and the execution point when gBestB is triggered need to be determined empirically. In this study, suitable values for this behavior are determined, and they apply well to a vast number of optimization problems.

A. CEC 2017 BENCHMARK FUNCTIONS SIMULATIONS
The bound-constrained validating process of the suggested ABCQRBEST algorithm was executed on a novel and highly challenging CEC 2017 benchmark function set [76]. The CEC 2017 function set consists of 30 benchmark functions classified into 4 distinctive groups: F 1-F 3 represent uni-modal, F 4-F 10 represent multi-modal, F 11-F 20 represent hybrid functions, and finally benchmarks F 21-F 30 represent very complex composite functions. The fourth group of functions has the properties of all uni-modal, multi-modal, and hybrid functions at the same time; additionally, these functions are shifted and rotated.
Test function F2 has been removed from the test set as a result of its unstable behavior [77]. Fundamental details of the CEC 2017 benchmark functions are provided in Table 1. The simulations were performed with 30-dimensional function variants (D = 30), while the results for the mean (average) and standard deviation (std) over 50 independent runs were disclosed. The suggested ABCQRBEST algorithm has been validated against famous metaheuristic approaches, including the basic FA with dynamic α, the cutting-edge enhanced Harris hawks optimization (IHHO) presented in [78], the basic Harris hawks optimization (HHO) [79], differential evolution (DE) [80], the grasshopper optimization algorithm (GOA) [81], the gray wolf optimizer (GWO) [82], moth-flame optimization (MFO) [83], the multi-verse optimizer (MVO) [84], particle swarm optimization (PSO) [85], the whale optimization algorithm (WOA) [19], the sine cosine algorithm (SCA) [86], and the basic version of the ABC [72]. All algorithms included in the comparative analysis were implemented in this study and tested with the parameters suggested in the original publications.
This paper utilizes the same simulation configuration as presented in [78]. The referenced paper [78] published results obtained by utilizing NP = 30 and T = 500. Since the ABCQRBEST method uses more FFEs in every run, the maximum number of FFEs (maxFFE) has been set as the termination condition in this study. As every other method in this comparison employs only one FFE for each individual in both the initialization and update phases, and with a goal to allow valid grounds for fair comparison, maxFFE was set to 15,030 (NP + NP · T).
The proposed ABCQRBEST utilizes the same value of 0.8 for the MR parameter, as suggested for the original ABC [11]; however, since maxFFE has been taken into account, the value of the limit parameter was empirically determined and set to maxFFE/(2 · NP) (in this case, 250). The configuration of the control parameters for the opposing methods can be found in [78].
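As a quick arithmetic check of the evaluation budget and the limit setting derived from the values stated above:

```python
NP, T = 30, 500                 # configuration taken from [78]

max_ffe = NP + NP * T           # termination budget: 30 + 30*500 = 15,030
limit = max_ffe // (2 * NP)     # empirically chosen scout trigger, here 250
```

Expressing the limit in terms of maxFFE (rather than T) keeps the scout behavior consistent when the budget, not the iteration count, terminates the run.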
The overall results obtained over the CEC 2017 benchmark functions set are shown in Table 2, where the best results for mean and std indicators for each function are bolded. The results presented in the Table 2 indicate that the ABCQRBEST metaheuristics achieved the best results on 21 benchmark function instances, namely F1, F3, F5, F6, F7, F8, F11, F12, F13, F15, F17, F19, F20, F21, F22, F23, F25, F26, F28, F29, and F30. For some instances, ABCQRBEST achieved the best result, but was tied with results of another method. In such situations, both results were marked in bold. Generally speaking, the proposed ABCQRBEST method outperformed all other metaheuristics included in the experiments, including the IHHO.
In order to better visualize the stability of the algorithms over 50 independent runs, box-and-whiskers (box plot) diagrams have also been generated for randomly chosen functions from the CEC 2017 set, as shown in Fig. 3. The proposed ABCQRBEST was tied for first place with the IHHO on functions F3, F6, F19, F21, and F29. For some benchmarks, several methods were tied for first place, and all were marked in bold. For benchmark F9, the best results for the mean metric were obtained by the MVO and PSO algorithms. On this function, both the basic ABC and the proposed ABCQRBEST obtained good scores with minimal differences between them, except that the proposed ABCQRBEST is more stable, as can be seen from its smaller std. For the same function, WOA and DE performed poorly, while PSO obtained the best results, with best, worst, and mean values that were almost identical. For benchmark F11, ABCQRBEST was tied for first place with PSO, while for benchmarks F13 and F15, ABCQRBEST shared first place with DE; additionally, for F15, ABCQRBEST has better stability than DE. It can also be noted that ABCQRBEST was outperformed by the IHHO algorithm on benchmarks F4 and F14. The PSO metaheuristics obtained the best results on benchmarks F10 and F16, while the DE method outperformed all other approaches on benchmarks F18, F24, and F27. On F18, the proposed ABCQRBEST showed better stability than DE, while the WOA showed the best stability. For benchmark F10, although PSO obtained the best result, the proposed ABCQRBEST showed the best stability, as its std value was the smallest. For benchmark F23, ABCQRBEST obtained the best results, although not the best stability (dispersion). Taking everything into account, the suggested ABCQRBEST approach was clearly superior to all competitor methods included in the experiments, justifying the implemented modifications.
For F29, ABCQRBEST obtained the best results for both mean and std, as can also be seen from the box plot diagrams presented in Fig. 3.
The proposed ABCQRBEST algorithm obtained better stability than the basic ABC, as can clearly be seen from the presented box plots, where the basic ABC shows greater dispersion between the best and worst runs. However, it should also be noted that, on some benchmark functions, the basic ABC obtained better results than some advanced algorithms, such as the IHHO.
To demonstrate the statistical significance of the differences between the suggested ABCQRBEST method and all other observed approaches, the Friedman test [87], [88] and the two-way analysis of variance by ranks were executed. The Friedman test ranks and the aligned Friedman test ranks of the observed algorithms on the CEC 2017 function suite are shown in Tables 3 and 4. The findings presented in Table 3 indicate that the suggested ABCQRBEST approach obtained better performance than the other popular metaheuristic approaches included in the analysis. The proposed ABCQRBEST achieved an average ranking of 1.551, therefore outperforming the previously best approach, the IHHO (average ranking of 3.138), by a significant margin.
To make sure that the presented findings were statistically sound, the Iman and Davenport test [89] was also conducted, as this statistical test can give more precise statistical conclusions than the χ2 test, as proven and published in [90]. The summarized results of the Iman and Davenport test are given in Table 5.
After performing the necessary calculations, the final result of the Iman and Davenport test was 36.95, which was compared against the F-distribution critical value (F(9, 9 × 10) = 1.820); the Iman and Davenport test therefore returned a statistically significantly higher result, and it also rejected H0. Additionally, the Friedman statistic (χ2r = 181.50) is greater than the χ2 critical value with ten degrees of freedom (18.31) at the significance level α = 0.05.
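For reference, the Iman and Davenport correction of the Friedman statistic can be sketched as follows. This is a generic illustration of the standard formula; the function name is ours and the sample values are synthetic, not taken from the experiments above:

```python
def iman_davenport(chi2_f: float, n_datasets: int, k_algorithms: int) -> float:
    """Iman-Davenport correction of the Friedman chi-square statistic.

    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F), which is compared
    against the F-distribution with (k - 1, (k - 1) * (N - 1)) degrees
    of freedom.
    """
    n, k = n_datasets, k_algorithms
    return (n - 1) * chi2_f / (n * (k - 1) - chi2_f)

# Small synthetic example (not the values from the paper):
print(iman_davenport(10.0, n_datasets=10, k_algorithms=5))  # 3.0
```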
Finally, this allows the null hypothesis (H0) to be rejected. It can therefore be concluded that the proposed ABCQRBEST method performed significantly better than the other algorithms included in the tests. As the null hypothesis was rejected by both executed statistical methods, the non-parametric Holm step-down procedure was executed as well, and the findings are given in Table 6. This approach sorts the algorithms by their p values and compares each against α/(k − i), where k and i denote the degrees of freedom and the algorithm number, respectively. In this research, the value of α was set to 0.05 and 0.1. The findings shown in Table 6 indicate that the suggested ABCQRBEST method significantly outperformed every other opposing algorithm, except the state-of-the-art IHHO, at both significance levels.

B. ANN TRAINING AND HYPERPARAMETER OPTIMIZATION EXPERIMENT
This subsection first describes the datasets used in this experiment, followed by a description of the metrics used for the evaluation of the proposed ABCQRBEST algorithm. Next, the adaptations of the method for ANN training are elaborated, along with the setup of the neural network weight optimization experiment, the comparative analysis, and the interpretation of the obtained results. Finally, at the end of this section, the setup and results of the hidden unit and weight optimization experiment are described.
The experimental design, in terms of the employed datasets, pre-processing, and metrics utilized for comparison, is the same as in the study presented in [48].

1) Datasets description
The proposed optimization algorithm is tested on the following five medical datasets for binary classification:
• Breast cancer dataset;
• Parkinson dataset;
• Diabetes dataset;
• SAheart dataset; and
• Vertebral dataset.
All datasets are freely available and downloadable from the globally recognized UCI machine learning repository [91]. All of them have two classes, while the number of features and instances varies from one to the other. Details of the employed datasets are summarized in Table 7. The breast cancer dataset [92], [93] was created by Dr. William H. Wolberg from the University of Wisconsin Hospitals, Madison. The dataset has two classes, one indicating a benign and the other a malignant cancer diagnosis. The total number of instances is 699, and each instance is represented by 9 numerical features (Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses).
The Parkinson dataset [94] was created at Oxford University by Max Little. Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from the individuals. The dataset has two classes, one indicating a healthy patient and the other a patient diagnosed with Parkinson's disease. Each instance in the dataset is represented by 22 numerical features.
The diabetes dataset was created by the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset has two classes, indicating whether or not the patient is diagnosed with diabetes. The total number of instances is 768, and each instance is represented by 8 numerical features: the number of pregnancies, glucose, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age.
The SAheart (South African Heart Disease) dataset [95] contains data from a heart-disease high-risk region of the Western Cape, South Africa. The observations cover males only. The dataset has two classes, indicating whether or not the person has coronary heart disease (CHD). The total number of instances is 462, and each instance is represented by 9 numerical features, such as systolic blood pressure, cumulative tobacco, low-density lipoprotein cholesterol, adiposity, obesity, current alcohol consumption, family history of heart disease, type-A behavior, and age.
The Vertebral dataset was created by Dr. Henrique da Mota. The dataset has two classes, classifying orthopedic patients as normal or abnormal (patients having disk hernia or spondylolisthesis). The total number of instances is 310, and each sample is represented by 6 biomechanical features.
The feature distributions of five datasets are presented in Fig. 4.

2) Classification metrics
The performance of the proposed approach is evaluated using several standard metrics: accuracy, specificity, sensitivity, geometric mean (g-mean), and area under the curve (AUC). The metrics are calculated using the following expressions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Specificity = TN / (TN + FP)
Sensitivity = TP / (TP + FN)
g-mean = √(Specificity × Sensitivity)

where TP denotes the true positive, TN the true negative, FP the false positive, and FN the false negative values from the confusion matrix. The confusion matrix is presented in Fig. 5. The accuracy measures the proportion of correct predictions over all samples, while the specificity measures the correct prediction of negative samples out of all actual negative values (TN and FP).
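The metrics above can be computed directly from the confusion-matrix counts. The following sketch is illustrative (the function name and the sample counts are ours, not from the study):

```python
import math

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    specificity = tn / (tn + fp)   # correct negatives out of actual negatives
    sensitivity = tp / (tp + fn)   # correct positives out of actual positives
    g_mean = math.sqrt(specificity * sensitivity)
    return {"accuracy": accuracy, "specificity": specificity,
            "sensitivity": sensitivity, "g_mean": g_mean}

# Hypothetical confusion matrix with 100 samples:
print(classification_metrics(tp=40, tn=50, fp=10, fn=0))
```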

3) Proposed ABCQRBEST adaptations for ANN training
In the neural network training, the ABCQRBEST optimizes the values of the weights in the hidden units.
In the algorithm, one solution encodes the weights and biases of the neural network. Thus, the solution vector consists of the connection weights and biases between the input layer and the hidden layer, as well as the connection weights and biases between the hidden layer and the output layer. The procedure of neural network training by ABCQRBEST can be summarized as follows: while the termination condition is not met, execute the ABCQRBEST search phases and save the current best solution with the minimum error rate; once the loop terminates, return the best solution and evaluate the best network on the test dataset.
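The procedure above can be sketched as a minimal training loop. This is deliberately simplified: the actual employed/onlooker/scout phases of ABCQRBEST are abstracted into a single placeholder perturbation step, and all names are illustrative, not from the study's implementation:

```python
import random

def train_ann(fitness, dim, np_, max_ffe):
    """Minimal FFE-bounded training loop following the procedure above.

    `fitness` maps a weight/bias vector to the training error (e.g. MSE);
    the ABCQRBEST update is replaced here by a simple Gaussian perturbation
    of the best solution, purely as a placeholder.
    """
    # Initialization: NP random solutions within [-1, 1] per component.
    pop = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(np_)]
    fits = [fitness(s) for s in pop]
    ffe = np_                                  # one FFE per initial individual
    best_i = min(range(np_), key=fits.__getitem__)
    best, best_fit = pop[best_i][:], fits[best_i]
    while ffe < max_ffe:                       # maxFFE as termination condition
        cand = [w + random.gauss(0.0, 0.1) for w in best]  # placeholder update
        cand_fit = fitness(cand)
        ffe += 1
        if cand_fit < best_fit:                # save current best (min error)
            best, best_fit = cand, cand_fit
    return best, best_fit                      # best network -> test dataset
```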
The flowchart of the proposed method for neural network training is presented in Fig. 6. In the neural network weight optimization experiment, the proposed ABCQRBEST approach is used to optimize the values of the weights and biases in the hidden units. The dimension of a solution is calculated as follows:

W = (I · H + H) + (H · O + O)

where W denotes the dimension of the one-dimensional vector of weights and biases, I represents the number of input features in the input layer, H indicates the number of hidden units in the hidden layer, and O denotes the number of nodes in the output layer, which is two in all conducted experiments since all datasets fall into the group of binary classification challenges.
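Under the encoding described above, the solution dimension can be computed in one line. In the sketch below, the 9 input features and the 2-node output layer match the breast cancer setup described in the text, while the 5 hidden units are a hypothetical value:

```python
def solution_dim(i_features: int, h_units: int, o_units: int = 2) -> int:
    """Length W of a solution vector: input->hidden weights plus hidden
    biases, and hidden->output weights plus output biases."""
    return (i_features * h_units + h_units) + (h_units * o_units + o_units)

# e.g. 9 input features, a hypothetical 5 hidden units, 2 output nodes:
print(solution_dim(9, 5))  # 62
```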
For the purpose of the conducted study, the five medical datasets are split into training and testing data. The training data consist of 2/3, while the testing data encompass 1/3 of all observations. In the data pre-processing phase, with the goal of adjusting the influence of each feature on the classification performance, data normalization is applied to all features in each dataset by rescaling the data into the range between 0 and 1, according to the following formula:

X_norm = (X_i − X_min) / (X_max − X_min)

where X_i is the i-th feature, and X_min and X_max are the minimum and maximum values of the feature, respectively. The normalized feature data are denoted by X_norm. In this work, the neural network model has only one hidden layer, and the number of hidden units H in the layer is calculated as a function of I, the number of input features in a given dataset.
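The min-max normalization step can be sketched as follows (an illustrative helper, not the study's code):

```python
def min_max_normalize(values):
    """Rescale one feature column into [0, 1] by the formula above:
    X_norm = (X_i - X_min) / (X_max - X_min)."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```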
The fitness function of the algorithm is the mean-squared error (MSE) loss:

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²

where n denotes the total number of instances, yᵢ represents the actual value, and ŷᵢ the predicted value.
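The MSE fitness can likewise be sketched in a few lines (illustrative helper):

```python
def mse(actual, predicted):
    """Mean-squared-error loss over n instances: (1/n) * sum((y - y_hat)^2)."""
    n = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / n

print(mse([1.0, 0.0], [0.5, 0.5]))  # 0.25
```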
In the initialization phase, a population of NP individuals with W components is generated randomly within the boundaries [−1, 1] for each parameter (weight or bias) of every solution. The experimental setup is similar to the one in [48], where a population of 50 individuals (NP = 50) is iterated over the course of 250 iterations (T = 250). However, to enable an objective comparative analysis, as in the case of the CEC 2017 benchmark simulations, maxFFE was set as the termination condition instead of T. The maxFFE is calculated using the expression NP + NP · T, which in this case yields a total of 12,550 FFE in one run.
The obtained results of the ABCQRBEST metaheuristics are compared to other algorithms evaluated on the same datasets in [48], namely the GOA, the basic genetic algorithm (GA), PSO, ABC, the flower pollination algorithm (FPA) [96], the bat algorithm (BAT/BA) [17], the firefly algorithm (FF/FA) [15], monarch butterfly optimization (MBO) [97], and biogeography-based optimization (BBO) [98]. All these methods were also implemented and tested in this study, and results similar to those in [48] were obtained, thereby confirming the validity of the study from [48]. All obtained results are generated over 30 independent runs. However, on top of the above-mentioned approaches employed in the comparative analysis, and with the goal of conducting a wider and more rigorous evaluation of the proposed ABCQRBEST, other well-known metaheuristics, namely the gray wolf optimizer (GWO) [82], the fruit fly optimization algorithm (FOA) [99], the whale optimization algorithm (WOA) [19], the salp swarm algorithm (SSA) [100], and the brain storm optimization algorithm (BSOA) [101], were also implemented for the same problem, and their results are included in the comparison tables.
All methods implemented for the purpose of the comparative analysis were tested with the same control parameter setup as suggested in the original studies, and the devised ABCQRBEST method was tested with the same parameters as in the CEC 2017 experiments (Subsection V-A).
The experimental results on the five datasets are presented in Tables 8-12, while the results for each method are summarized in Table 13. The statistical results in the tables include the average, standard deviation, specificity, sensitivity, g-mean, and AUC. Results highlighted in bold indicate the best results.
To provide better insight into the performance of the methods, mean classification error convergence speed graphs for some of the better-performing algorithms are depicted in Figure 7. The counts of best statistical results achieved by each method over all five datasets are as follows:

Algorithm | Average | Std | Best | Worst | Total
ABCQRBEST | 14 | 2 | 19 | 11 | 46
ABC | 0 | 1 | 5 | 0 | 6
GOA | 3 | 0 | 3 | 2 | 8
GA | 0 | 0 | 2 | 1 | 3
PSO | 2 | 0 | 3 | 0 | 5
FPA | 2 | 0 | 1 | 1 | 4
BAT | 0 | 1 | 1 | 1 | 3
FF | 2 | 17 | 1 | 5 | 25
MBO | 0 | 0 | 3 | 0 | 3
BBO | 0 | 3 | 1 | 3 | 7
GWO | 0 | 1 | 1 | 0 | 2
FOA | 0 | 0 | 1 | 0 | 1
WOA | 0 | 0 | 1 | 0 | 1
SSA | 0 | 0 | 2 | 0 | 2
BSOA | 0 | 0 | 1 | 0 | 1

The proposed method achieved the best results on average accuracy, specificity, g-mean, and AUC in the breast cancer dataset test, showing high performance on the accuracy, specificity, AUC, and g-mean metrics. In the case of the standard deviation, the FF produced the best values on most metrics. In the test on the Parkinson dataset, ABCQRBEST produced the best values on the accuracy, g-mean, and AUC, while FF and BBO had the best performance in specificity and sensitivity, respectively. On the diabetes dataset, ABCQRBEST shows the best performance on the accuracy and g-mean; the second-best performing algorithm is FF and the third is GOA. In the SAheart test results, ABCQRBEST shows the best average, best, and worst statistical results on the accuracy and AUC. On the fifth dataset, the Vertebral dataset, the proposed method has the highest accuracy, sensitivity, g-mean, and AUC.
Summarizing the obtained experimental results, ABCQRBEST achieved 46 best values in total, while the second-ranked method, FF, achieved 25 best statistical results. In the case of ABCQRBEST, the best values were achieved on the average and best statistics, while in the case of FF, the best results were achieved on the standard deviation.
From the mean classification error rate convergence speed graphs shown in Figure 7, some important conclusions regarding the time complexity in terms of FFE can be derived. It is observed that, on all datasets, the proposed ABCQRBEST algorithm reaches its best reported accuracy (classification error) after between 50% and 90% of maxFFE, while all other approaches keep converging throughout the whole run. This means that ABCQRBEST can reach its best reported results within a smaller number of FFE; the computation time compared to other state-of-the-art methods is therefore reduced. To further test this claim, all ANN training experiments were executed again with only 10,000 FFE, and the same results as shown in Tables 8-12 were obtained.

5) Hidden units and weight optimization in neural networks
The hidden unit and weight optimization in neural networks is an extension of the previous neural network weight optimization experiment. In this experiment, besides the weights, the number of hidden units is also optimized. The solution vector is extended with the maximum number of hidden units (Eq. 18). A binary encoding strategy is used: if the encoded value is less than 0, it is mapped to 0, otherwise to 1. The value 1 indicates an active unit; if the value is 0, the unit is deactivated, and consequently its weights and biases also become deactivated. Table 14 presents the results averaged over 30 independent runs. In the table, the three best solutions are presented for the breast cancer, diabetes, SAheart, and vertebral datasets, while for the Parkinson dataset the five best solutions are presented. Based on the obtained results, it can be concluded that with fewer hidden units it is possible to achieve the same or even better accuracy, while on the other hand the computation time is reduced significantly.
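The binary activation rule described above can be sketched as follows (the helper name is ours; the rule itself, a gene below 0 deactivating the unit, follows the text):

```python
def hidden_unit_mask(unit_genes):
    """Map the continuous hidden-unit genes to a binary activation mask:
    gene < 0 -> 0 (unit deactivated, its weights/biases dropped),
    otherwise -> 1 (unit active)."""
    return [0 if g < 0 else 1 for g in unit_genes]

print(hidden_unit_mask([-0.2, 0.3, 1.1, -1.0]))  # [0, 1, 1, 0]
```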

VI. CONCLUSION
In this work, an improved ABC algorithm is proposed for connection weight and hidden unit number optimization in neural networks. The proposed method improves the exploitation capability of the basic ABC algorithm by incorporating quasi-reflection-based learning and guided best solution mechanisms. The algorithm is tested on the recent, challenging CEC 2017 benchmark function suite to examine its exploration and exploitation capabilities and solution quality. The obtained results are compared to the original ABC and other recent metaheuristic approaches, and the statistical results show the robustness of ABCQRBEST, whose results are significantly better than those of the other approaches.
The proposed ABCQRBEST is then employed for connection weight and hidden unit number optimization in neural networks. To evaluate its performance, simulations were conducted on five well-known medical datasets, and the obtained statistical results were compared to similar metaheuristic-based approaches. The proposed method outperformed the other, current metaheuristic-based methods.
The limitations of the proposed algorithm are as follows. First, additional effort is necessary to set up the algorithm, in terms of control parameters, for the particular problem being solved, and this is done empirically, by trial and error. The second drawback is that the algorithm requires more calculations in each iteration, due to the generation of the quasi-reflection-based learning population, which amounts to NP additional fitness function evaluations per iteration. However, as shown in the experiments, when maxFFE is taken as the termination condition, this drawback is only conditional.
Based on the findings, it can be concluded that ABCQRBEST is very promising and competitive with current approaches in neural network weight optimization and hidden unit optimization. In future research, the plan is to adapt and test the devised method on other datasets, as well as to combine it with other machine learning algorithms.
IVANA STRUMBERGER started her university career in 2013 as a teaching assistant at the Faculty of Computer Science in Belgrade. She received her Ph.D. degree from Singidunum University in 2020 in the domain of computer science (average grade: 9.93). She currently works as an assistant professor at the Faculty of Informatics and Computing, Singidunum University, Belgrade, Serbia. She conducts research in the domain of computer science, and her specialties include swarm intelligence, machine learning, optimization and modeling, cloud computing, computer networks, and distributed computing. She has published more than 70 scientific papers in high-quality journals and international conferences indexed in Clarivate Analytics JCR, Scopus, WoS, and IEEE Xplore. She has also published 15 book chapters in the Springer Lecture Notes in Computer Science series and 2 books in the domain of cloud computing. She is a regular reviewer for many international state-of-the-art journals with high Clarivate Analytics and WoS impact factors.

ABEER B AHMED
is a senior academic and business professional, and a Ph.D. holder with entrepreneurial leadership experience in the stock market and computer science domains. He is passionate about understanding and positioning technology and products, educating customers, and working in consultancy-driven business environments. He is a research professional and technical analyst whose prime specialization is the use of intelligent methods to discover hidden patterns within stock data, with project management skills for the research and software development of stock-market-related product suites. He is the founder and managing director of Middle East Sentiment Consultant Inc (MESC). The company's clientele includes major investment corporations in Egypt, Saudi Arabia, and the United Arab Emirates.