Kernel Parameter Optimization for Support Vector Machine Based on Sliding Mode Control

Support Vector Machine (SVM) is a supervised machine learning algorithm used for robust and accurate classification. Despite its advantages, its classification speed deteriorates on large-scale problems because of its large number of support vectors, and its performance depends on its kernel parameter. This paper presents a kernel parameter optimization algorithm for SVM based on Sliding Mode Control (SMC) that operates in a closed-loop manner. The proposed method defines an error equation and a sliding surface, then iteratively updates the Radial Basis Function (RBF) kernel parameter or the 2-degree polynomial kernel parameters, forcing the SVM training error to converge below a threshold value. Owing to the closed-loop nature of the proposed algorithm, key features such as robustness to uncertainty and fast convergence are obtained. To assess the performance of the proposed technique, ten standard benchmark databases covering a range of applications were used. The proposed method and state-of-the-art techniques were then used to classify the data. Experimental results show that the proposed method is significantly faster and more accurate than the anchor SVM technique and some of the most recent methods. These achievements are due to the closed-loop nature of the proposed algorithm, which significantly reduces the data dependency of the proposed method.


I. INTRODUCTION
Support Vector Machine (SVM) is one of the most widely used machine learning classification algorithms, alongside classifiers such as nearest neighbor [1], boosted decision trees [2], regularized logistic regression [3], neural networks [4], and random forests [5]. SVM can achieve robust and accurate classification results, even for non-linearly separable input data, by mapping the data into a higher-dimensional space using kernels [6], [7]. SVM training is a Quadratic Programming (QP) problem that aims to find a separating hyperplane with maximum margin between the classes of data [8], [9]. It was first proposed for binary classification by Vapnik in the early 1990s; however, its extensions can be used for multi-category problems [10]. Since SVM achieves a unique solution and can learn independently of the dimensionality of the feature space, it is robust against overfitting and is superior to other classifiers [6], [10]. SVM has been used in many applications, including text categorization [11] and face detection [12], where it delivers robust and accurate results. SVM has also been used in some branches of control, e.g., nonlinear control [13] and optimal control [14], because of the unique and optimal answer that it generates.
The associate editor coordinating the review of this manuscript and approving it for publication was Zheng Chen.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Despite its advantages and wide range of applications, SVM suffers from some limitations: low classification speed, especially on large-scale problems, due to the large number of support vectors that it uses for classification [15], [16], and the dependency of its performance on the kernel selection, the kernel parameter, and the regularization parameter. The time complexity of SVM's test phase is $O(1) + 4O(n) + 2O(n^3)$, where n is the number of support vectors [10]. This indicates that the SVM classification cost increases as the number of support vectors increases. Various methods have been proposed to find the optimal kernel for SVM and to reduce its number of support vectors, since the performance and speed of the algorithm depend on the kernel function and its parameters. These techniques can be classified into two main groups, closed-loop and open-loop methods; they either try to find the optimal kernel function and its parameters or address some of SVM's problems by modifying the training set or its set of support vectors. Closed-loop systems/algorithms have feedback in their structure: when a control input changes the output of the system/algorithm, the resulting output is used to correct the control input so as to arrive at the desired output. They operate in a self-adjusting mode, whereas open-loop systems/algorithms need a person to manually review and make the adjustments. Therefore, a closed-loop system/algorithm converges faster than an open-loop one and is more robust to uncertainties and disturbances [17]. The closed-loop methods for finding the optimal kernel function and its parameters mainly use two approaches. Group 1 methods first introduce an objective function that depends on the SVM and kernel parameters, then use different gradient descent methods to find optimal parameters for the kernel functions [18]-[23]. Group 2 methods try to find the global optimal solution for the kernel and regularization parameters [24]-[31]. Since the goal is a global solution, they use various optimization algorithms, including genetic, dragonfly, and evolutionary algorithms, with different fitness functions.
Genetic Algorithm (GA), Ant Colony Optimization (ACO), and Particle Swarm Optimization (PSO) are all Swarm Intelligence (SI) based methods. One of their main properties is that they act in a self-organized mode and can evolve their components into a good form without any external help. GA is a population-based strategy that mainly includes five components: a random number generator, a fitness evaluation unit, a reproduction process, a crossover process, and a mutation operation. It first creates an initial population randomly or heuristically, then determines the fitness of each individual and ranks the individuals using a fitness function. Once all individuals are ranked, the low-ranked individuals are omitted from the population, and the rest are used in the reproduction process. GA uses confounded parameter settings, which is one of its main positive points; however, the random procedure in the crossover and mutation processes reduces GA's convergence speed towards the optimal values, which is considered its biggest drawback.
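The GA steps described above can be sketched as follows. This is a minimal toy illustration, not one of the cited methods: the real-valued encoding, the blend crossover, the Gaussian mutation, and the names `genetic_minimize`, `pop_size`, and `generations` are all assumptions made for the example.

```python
import random

def genetic_minimize(fitness, bounds, pop_size=20, generations=30, seed=0):
    """Toy real-valued GA: rank, drop the low-ranked half, refill by crossover and mutation."""
    rng = random.Random(seed)                      # random number generator
    lo, hi = bounds
    population = [rng.uniform(lo, hi) for _ in range(pop_size)]  # random initial population
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness)               # fitness evaluation and ranking (lower is better)
        best_history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]    # low-ranked individuals are omitted
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)        # reproduction: choose two parents
            w = rng.random()
            child = w * a + (1.0 - w) * b          # crossover: random blend of the parents
            child += rng.gauss(0.0, 0.1)           # mutation: small random perturbation
            children.append(min(hi, max(lo, child)))
        population = survivors + children
    population.sort(key=fitness)
    best_history.append(fitness(population[0]))
    return population[0], best_history

# Hypothetical fitness: squared distance from 3 on the interval [-10, 10].
best, history = genetic_minimize(lambda x: (x - 3.0) ** 2, bounds=(-10.0, 10.0))
```

Because the best-ranked individual always survives in this sketch, the best fitness never gets worse from one generation to the next, but the random crossover and mutation give no guarantee on how quickly it improves.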
ACO is a metaheuristic approach with four main components: ant, pheromone, daemon action, and decentralized control. ACO tries to find the shortest path to the optimal solution in a weighted graph. In the first step of each iteration, every ant constructs its own solution (path) stochastically; the paths built by the different ants are then compared, and in the last step the pheromone level of each edge is updated. ACO can be used in dynamic applications because it adapts well to changes such as new distances, and its positive feedback leads to rapid discovery of good solutions. However, it converges more slowly than other heuristic-based methods, its theoretical analysis is difficult (research on it is experimental rather than theoretical), and it lacks a centralized processor to guide the algorithm towards good solutions. PSO is an optimization technique inspired by the swarm behavior of flocking birds and schooling fish, and it searches for global optimal solutions. The PSO algorithm first initializes the population, then calculates the fitness value of every particle. After finding all fitness values, it updates the population and the particles' speeds and positions. All steps except the first are repeated until the termination condition is satisfied. PSO has no mutation calculation, and its search speed is very fast, but it cannot address scattering and non-coordinate system problems, and it is less exact in regulating its speed and direction [32], [33]. Although GA, ACO, and PSO act in a self-organized mode, they are not truly closed-loop methods. For clarification, block diagrams of a closed-loop and an open-loop system are shown in Fig. 1. As can be seen from Fig. 1b, in an open-loop system a collection of inputs (a population) is fed to the system, and the input(s) that create the best output are used for controlling the system or generating the new set of inputs. This is exactly the procedure in SI-based algorithms. In a closed-loop system (Fig. 1a), by contrast, after an initial value is chosen as the input, the closed-loop structure updates the input value based on the resulting output. In a closed-loop method, the best input value is not selected by comparing different potential inputs.
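The distinction drawn above can be illustrated with a small sketch. The static system `plant`, the candidate list, and the feedback gain are hypothetical choices made only for this example; they do not come from the paper.

```python
def plant(u):
    """Hypothetical static system: output as a function of the input."""
    return 2.0 * u + 1.0

def open_loop_select(candidates, target):
    """Population-style search: evaluate a fixed set of inputs and keep the best one."""
    return min(candidates, key=lambda u: abs(plant(u) - target))

def closed_loop_update(u0, target, gain=0.2, steps=50):
    """Feedback search: start from one input and correct it using the output error."""
    u = u0
    for _ in range(steps):
        error = target - plant(u)   # the measured output is fed back
        u += gain * error           # the input is corrected in proportion to the error
    return u

target = 11.0                                        # desired output; the exact input is u = 5
u_open = open_loop_select([0.0, 2.0, 4.0, 7.0], target)
u_closed = closed_loop_update(u0=0.0, target=target)
```

In this toy setting the open-loop search can only return one of the candidates it was given, while the closed-loop update converges to the input that produces the desired output from a single starting value.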
Sliding Mode Control (SMC) is a closed-loop method and benefits from this property: acting in a closed-loop manner brings more robustness against disturbances, uncertainties, and unmodeled dynamics, which gives it an advantage over SI-based algorithms in this respect. Unlike GA and ACO, SMC has a clear, simple, and well-defined theory and mathematics behind it, which makes it possible to analyze it theoretically. Unlike GA, SMC has no randomness in its structure, and based on the results it takes only around 8 steps on average to arrive at its best result for different datasets, which is significantly faster than other methods. Defining the SVM algorithm as a closed-loop control system provides the capability to control and monitor its transient and steady-state behavior in detail. In the proposed method, SMC, which is a powerful tool for robust control of nonlinear systems, is used. Since there are uncertainties in the modeling of real-world systems, it is hard to control plants with uncertain models and arrive at the desired performance. SMC is often used to deliver good tracking in systems with uncertain models [34], [35]. To achieve this goal, an error equation and a sliding surface are first defined; SMC then uses the control input to drive the state trajectory of the system onto the sliding surface and to keep the trajectory on this surface for all subsequent time. The state trajectory is driven to the sliding surface using only an estimate of which side of the surface it is on. There is no strict region, and it does not matter how far the state is from the sliding surface. Due to this feature, SMC is more robust than other approaches in the control field. Hence, the uncertainties of the system model do not affect its performance, and its convergence is guaranteed [36].
Hence, this research investigates the application of SMC to improving the performance of the SVM algorithm and reducing its limitations when dealing with large data that come from different fields and whose dynamics are unknown.
In this paper, an SVM kernel parameter optimization method based on sliding mode control is presented. The proposed method does not need a system model to find the optimum value of the RBF kernel parameter or of the 2-degree polynomial kernel parameters, and it is used to speed up the test phase of SVM and improve its prediction accuracy. The proposed method first defines an error equation and a sliding surface, and then tries to achieve good tracking and a low training error by updating the parameter(s) of those kernels. This procedure is repeated until the validation accuracy continues its decreasing trend for a specific number of iterations. The effort of the proposed method to achieve high training accuracy results in better prediction accuracy and a smaller set of support vectors, thereby increasing the speed of classification. Experimental results show that the anchor SVM does not necessarily generate the optimal number of support vectors and that its kernel parameter selection can affect both its accuracy and its resulting number of support vectors, where an optimal and smaller set of support vectors increases both the speed and the accuracy of the classification. Hence, the proposed method is significantly faster and more accurate than the anchor SVM technique. Furthermore, the proposed method generated more accurate results compared with some of the latest techniques. These results, together with its high robustness against the uncertainties that exist in data coming from different sources, are due to the closed-loop nature of the SMC algorithm used in conjunction with the SVM method. The main contributions of this paper can be summarized as follows: (i) looking at SVM's problems and concepts from a control point of view and making a connection between these two fields; (ii) using a non-model-based, closed-loop method, SMC, for finding the optimal value of the RBF kernel parameter. The rest of this paper is organized as follows. In Sections II and III, brief overviews of the SVM and SMC methods are presented, respectively. In Section IV, the proposed method for finding the optimum kernel parameters is explained. Experimental results are presented in Section V, and Section VI concludes the paper.

II. SUPPORT VECTOR MACHINE
Many methods can classify two-class linearly separable data, but the boundary they find is not unique: infinitely many hyperplanes separate the classes. SVM selects the best one using the idea that the best decision boundary is the one with the maximum distance (margin) from both classes of the data. For this reason, SVM is also called the maximum margin classifier. SVM has been shown to produce accurate results that can be explained easily, unlike other methods, e.g., neural networks. If the data are known to be linearly separable, hard margin SVM is usually used. Assume that there are n data points in the dataset whose labels are either −1 or 1. The first step is to find the margin and then maximize it. If the equation of the hyperplane is $w^T x + b = 0$, where w is a vector orthogonal to the hyperplane and b is the bias, then the distance of a point to the hyperplane can be formulated as:
$$d_i(x) = \frac{w^T x_i + b}{\|w\|} \tag{1}$$
where $x_i$ is the ith data point and $d_i(x)$ is its signed distance. That is, if the data point is on one side of the hyperplane its sign is positive; otherwise its sign is negative. By multiplying the distance of each point by its label, an unsigned distance, $y_i d_i(x)$, is calculated, where $y_i$ is the label of the data point. To find the margin, $\min_i \, y_i (w^T x_i + b) / \|w\|$ is determined. w and b can be rescaled so that $y_i (w^T x_i + b)$ is at least one for all points, so the margin is derived as follows:
$$\text{margin} = \frac{1}{\|w\|} \tag{2}$$
SVM searches for the maximum margin. So, based on Eq. (2), the problem can be formulated as the following quadratic problem:
$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \quad i = 1, \ldots, n \tag{3}$$
This Quadratic Problem (QP) is convex, so its solution is a global optimum. By solving this QP, both w and the hyperplane are calculated. Classifying nonlinear data with a linear algorithm like SVM can be done by reshaping the data and increasing its dimension, resulting in a linearly separable dataset.
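The signed distance and the margin described above can be computed directly for a given hyperplane. The following is a minimal sketch; the hyperplane `w`, `b` and the three sample points are hypothetical values chosen so the arithmetic is easy to follow, and no SVM training is performed.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def margin(w, b, points, labels):
    """Smallest label-weighted distance y_i * d_i(x) over the dataset."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 and three labelled points.
w, b = [3.0, 4.0], -5.0
points = [[3.0, 4.0], [0.0, 0.0], [-1.0, 0.0]]
labels = [1, -1, -1]

distances = [signed_distance(w, b, x) for x in points]   # signs show the side of the hyperplane
dataset_margin = margin(w, b, points, labels)            # positive only if every point is on its own side
```

A positive value of `dataset_margin` means the hyperplane separates the two classes; hard margin SVM looks for the hyperplane that makes this value as large as possible.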
However, as the dimensionality of the data increases, the curse of dimensionality appears. For nonlinear data, SVM uses the kernel concept to benefit from the dimension increase while avoiding its curse [37]. When SVM is used for classifying nonlinear data, it is called soft margin SVM. In this case, the decision boundary is nonlinear because the data are not linearly separable. This means that some points cross the margin or lie on the other side of the hyperplane and cause misclassification, as shown in Fig. 2. So, the constraint of hard margin SVM is no longer valid, because some points have $y_i (w^T x_i + b) \le 1$. The constraint is changed to include these points, too. The nonlinear case is formulated as follows [38], [39]:
$$\min_{w,\,b,\,\xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{4}$$
In (4), $\xi_i$ is added to the constraint for the points that violate it. However, with the constraint changed in this way, all points could violate it. So, the number of points that can violate the margin is restricted by adding a penalty or regularization parameter, C. One can solve the dual form of Eq. (4) as:
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{5}$$
where $\alpha_i$ is the dual variable obtained via the QP and $K$ is the kernel function. The points whose $\alpha_i$ is greater than zero are support vectors, and the points whose $\alpha_i$ is equal to C are the ones that violate the constraint of hard margin SVM. Despite the advantages of SVM, many of its aspects are ignored because a control perspective on the SVM problem is lacking. From a control point of view, the kernel function and its parameters, along with the data, are the inputs of the SVM algorithm, and the support vectors that the algorithm finds in its training mode are its output. So, both the kernel and its parameters are vitally important in SVM. Their unwise selection results in a poor set of support vectors, which increases the test error and time.
It can be concluded that, by using control methods, the inputs of the SVM algorithm can be chosen in a way that increases both the performance and the accuracy of the SVM algorithm. As both the model and the dynamics of the datasets are unknown, model-based methods of control theory are not applicable. Therefore, Sliding Mode Control (SMC), which is not a model-based algorithm, is highly robust to the dynamics of the data, and is a closed-loop procedure, appears to be one of the solutions for speeding up the algorithm when dealing with large nonlinear data. Moreover, both the soft margin and hard margin problems have counterparts in control theory because they both deal with the training error in different ways. In hard margin SVM, a zero training error is desired, while in soft margin SVM, a limited non-zero error is acceptable; these two behaviors are achieved by defining constraints in the SVM. In control theory, there are many procedures for managing the error, e.g., using the integral of the absolute/squared error or paying attention to the transient behavior of the error in addition to its steady-state value, while in SVM mainly the steady-state error is considered. In addition, there is a wide variety of control methods for dealing with steady-state errors, such as methods in classical control, robust control, adaptive control, optimal control, nonlinear control, and intelligent control. In the next sections, after a brief introduction, SMC is used as a suitable robust control strategy to develop the desired kernel functions. Other control algorithms can be applied in the same way.

III. SLIDING MODE CONTROL
Sliding Mode Control (SMC) is a powerful tool for robust control of nonlinear systems [40]. It is based on the idea that controlling 1st-order systems is much easier than controlling nth-order systems, so by defining a notation, an nth-order system is reformulated as a 1st-order model [34]. This allows a sliding surface to be constructed in the state space and the states of the system to be driven onto it. Once the sliding surface is reached, SMC keeps the states of the system in the close neighborhood of the sliding surface [40]. SMC consists of two parts: the sliding surface and the off-surface dynamics. The first step in deriving this controller is to examine the expression of the error [40]. Consider the single-input dynamic system of the form y = f(x) + b(x)u, where y is the output, u is the input signal, and f(x) and b(x) are the system model, which are not exactly specified and have uncertainties. The goal is for the output, y, to track the desired signal, $y_d$. So, the error expression can be written as follows:
$$e = y - y_d \tag{6}$$
where y and $y_d$ are the output and desired output, respectively. A time-varying surface S(y; t) in the state space R can then be defined by the scalar equation S(y; t) = 0, where:
$$S(y; t) = \left( \frac{d}{dt} + \lambda \right)^{n-1} e \tag{7}$$
where λ is a strictly positive constant. For n = 2, Eq. (7) can be written as:
$$S = \dot{e} + \lambda e \tag{8}$$
The problem of tracking $y \equiv y_d$ is equivalent to that of remaining on the surface S(t) for all t > 0; indeed, $S(y; t) \equiv 0$ represents a linear differential equation whose unique solution is $e \equiv 0$, given its initial condition. Thus, the problem of tracking the n-dimensional vector $y_d$ can be reduced to that of keeping the scalar quantity S at zero.
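As a short worked illustration of why remaining on the surface forces the error to zero, consider the case n = 2. Setting S = 0 in S = ė + λe gives a first-order linear equation whose solution is a decaying exponential. The symbol $t_0$ is introduced here only for illustration and denotes the instant at which the trajectory reaches the surface.

```latex
\begin{aligned}
S = \dot{e} + \lambda e = 0
  \;&\Longrightarrow\; \dot{e} = -\lambda e \\
  \;&\Longrightarrow\; e(t) = e(t_0)\, e^{-\lambda (t - t_0)}, \qquad t \ge t_0 .
\end{aligned}
```

Since λ is strictly positive, the error decays towards zero with time constant 1/λ for any value of $e(t_0)$, and it is identically zero if $e(t_0) = 0$.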
When the surface S is driven to zero, the error is driven to zero too as t → ∞ [40]. To show that, we work backward by postulating that the off-surface dynamics must be of the form:
$$\dot{S} = -f(S) \tag{9}$$
where f(S) can be any non-decreasing odd function. This means that the rate of change of S, which measures the 'distance' of the current state from the sliding surface, always has the sign opposite to that of S. The control input should force the states to approach the surface. So, $\dot{S}$ must be a function of the control input, u. For $\dot{S}$ to depend on u, it must also be a function of the second derivative of the error, $\ddot{e}$; this implies that S should only be a function of the error, e, and its first derivative, $\dot{e}$. The simplest form of such a function that guarantees e → 0 as t → ∞ is given in Eq. (8) [40]. Consequently, driving S to zero drives the tracking error, e, to zero, too. For Eq. (8), the sliding surface is a line with a slope of −λ in the phase plane. Starting from any initial condition, the state trajectory is driven to the sliding surface and then slides along the surface exponentially towards the desired value, $y_d$, with a time constant of 1/λ [34]. This procedure is shown in Fig. 3. SVM has been widely used to classify non-linearly separable data, where there is always some uncertainty in the selection of its parameters, such as the regularization and kernel parameters. This has inspired the authors to use the concept of sliding mode control to improve the performance of the SVM algorithm.
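The reaching and sliding behavior described above can be reproduced with a small numerical sketch. Everything specific in it is an assumption made for illustration: the double-integrator plant y'' = d(t) + u, the disturbance d(t), the error convention e = y − y_d, the choice f(S) = k·tanh(S/φ) as the non-decreasing odd function, and all numerical values.

```python
import math

def simulate_smc(lam=2.0, k=2.0, phi=0.05, dt=0.001, t_end=10.0):
    """Euler simulation of y'' = d(t) + u tracking a constant y_d with a sliding mode law."""
    y, y_dot, y_d = 0.0, 0.0, 1.0
    t = 0.0
    trace = []
    for _ in range(int(round(t_end / dt))):
        e, e_dot = y - y_d, y_dot                    # tracking error and its derivative (y_d is constant)
        s = e_dot + lam * e                          # sliding variable S = e_dot + lambda * e
        u = -lam * e_dot - k * math.tanh(s / phi)    # yields S_dot = d(t) - k * tanh(S / phi)
        d = 0.5 * math.sin(t)                        # bounded disturbance, unknown to the controller
        y_ddot = d + u
        y += dt * y_dot                              # explicit Euler integration step
        y_dot += dt * y_ddot
        t += dt
        trace.append((t, e, s))
    return y - y_d, trace

final_error, trace = simulate_smc()
final_s = trace[-1][2]
```

With the gain k larger than the disturbance bound, S is first driven towards zero (reaching phase) and the error then decays with a time constant of roughly 1/λ (sliding phase), even though the controller never uses a model of d(t).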

IV. PROPOSED ALGORITHM
SVM uses the kernel function to increase the dimension of the data and make the data linearly separable in the resulting high-dimensional space. However, the desired kernel function and its parameters are not specified in advance; as a result, various methods have been introduced to find the best kernel function and its parameters to increase the performance of SVM. There is a variety of kernel functions. Some well-known ones are RBF kernels and polynomial kernels, and different combinations of these functions, e.g., linear and nonlinear, are used to extend SVM's capability to deal with non-linear data. All these kernels have parameters that need to be chosen appropriately to solve the problems mentioned above. In this paper, Sliding Mode Control (SMC) is used to find the optimum parameters of the kernel functions. To demonstrate the effectiveness and performance of the proposed method, without losing its generality, the γ parameter of the RBF kernel (as an advanced form) and the parameters of a 2-degree polynomial kernel (as a basic form) are calculated using the proposed method. In sub-section A, the application of SMC to determining the optimum γ parameter of the RBF kernel is presented. In sub-section B, SMC is used to compute the parameters of a polynomial kernel. The calculated MC and MC-lbs parameters (the misclassified training data and their labels) are used to update the RBF kernel parameter, TE (the training error) is used to decide when to perturb the initial value of the RBF kernel parameter, and VE (the validation error) is used to terminate the algorithm. To perturb the value of γ, the algorithm checks the value of TE. If it is zero, $\gamma_{old}$ is perturbed as follows until a non-zero training error is achieved: the algorithm checks the value of the RBF kernel parameter; if its value is smaller than a threshold, it perturbs the kernel parameter by a small value; otherwise, it perturbs it by a larger value.
In this research, the process is started with a small initial RBF kernel parameter value, and the training procedure then updates γ as follows. It first initializes three counters named r1, r2, and r3 with values of 1, thr1 with the Number of MisClassified training data (NMC), thr2 with the Number of Training Data (NTD), and the Maximum Number of acceptable iterations to improve the Validation Error (MNVE) with a constant value. The algorithm then goes through each element of the misclassified training data using its label, MC-lbs[r1], and calculates its p and Q. If MC-lbs[r1] = −1, q is calculated using $q = -\frac{1}{2} Q^{\dagger} p^T$. After that, the algorithm goes through the elements of q using counter r2, and for each positive element of q, $\gamma_2^{r2}$ is calculated. When all elements $\gamma_2^{r2}$ have been calculated, it computes $\gamma_1 = \frac{1}{l} \sum_{i=1}^{l} \gamma_2^{i}$; but if MC-lbs[r1] is not equal to −1, it assigns $\gamma_{new}$ to $\gamma_1$. The algorithm then assigns $\gamma_1$ and 0 to γ and $\gamma_1$, respectively, and increments r1 to point to the next misclassified training data point. This procedure is repeated for all misclassified training data. When γ has been calculated for all misclassified training data, the algorithm checks r3 to see whether it has reached the maximum number of iterations that are acceptable for improving the validation error (the MNVE threshold value). If not, a new value $\gamma_{new}$ is calculated from the m per-point values of γ (their average; see Eq. (36)), the algorithm goes back to the 'Train SVM' block, and the procedure is repeated until MNVE reaches its predefined threshold value. Otherwise, the training is completed, and $\gamma_{new}$ is taken as γ and used to calculate the SVs. The resulting SVs are then used to classify the test subset.
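The outer loop of the procedure above can be sketched as follows. This is only a structural illustration under several assumptions: SVM training and the per-point SMC update are passed in as hypothetical callables (`train_and_evaluate`, `gamma_for_point`), the perturbation of γ at zero training error is replaced by a simple multiplicative factor because the actual rule is not reproduced in the text, and returning the γ with the lowest validation error is a design choice of this sketch.

```python
def optimize_gamma(train_and_evaluate, gamma_for_point, gamma0,
                   max_no_improve=3, max_iters=50, perturb=2.0):
    """
    train_and_evaluate(gamma) -> (misclassified_points, validation_error)
    gamma_for_point(point, gamma) -> per-point estimate (gamma_1 in the text)
    max_no_improve plays the role of MNVE.
    """
    gamma = gamma0
    best_gamma, best_ve = gamma0, float("inf")
    stalled = 0
    for _ in range(max_iters):
        misclassified, val_error = train_and_evaluate(gamma)
        if val_error < best_ve:
            best_gamma, best_ve, stalled = gamma, val_error, 0
        else:
            stalled += 1                       # validation error did not improve
            if stalled >= max_no_improve:
                break
        if not misclassified:                  # zero training error: perturb gamma (placeholder rule)
            gamma *= perturb
            continue
        estimates = [gamma_for_point(p, gamma) for p in misclassified]
        gamma = sum(estimates) / len(estimates)   # average of the per-point values
    return best_gamma, best_ve

# Toy stand-ins (hypothetical): the "ideal" gamma is 1.0.
def toy_train_and_evaluate(gamma):
    misclassified = [1.0] if abs(gamma - 1.0) > 1e-3 else []
    return misclassified, (gamma - 1.0) ** 2

def toy_gamma_for_point(point, gamma):
    return 0.5 * (gamma + point)               # move halfway towards the point's preferred value

best_gamma, best_ve = optimize_gamma(toy_train_and_evaluate, toy_gamma_for_point, gamma0=0.1)
```

In a real setting, `train_and_evaluate` would train an SVM with the current γ and return its misclassified training points and validation error, and `gamma_for_point` would implement the SMC-based update derived in this section.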
The main aim of the proposed Sliding Mode Control based Support Vector Machine Radial Basis Function kernel parameter optimization (SMC-SVM-RBF) is to use sliding mode control to find an optimum value for the γ parameter of the RBF kernel in order to improve SVM's performance in terms of classification accuracy and speed. The mathematical proof of the proposed SMC-SVM-RBF method is detailed as follows. To make a relationship between SVM and SMC in this article, the error expression, considering Eq. (6), can be assumed as: where $e_j$ is the classification error, and $y_j^d$ and $y_j$ are the desired and predicted output values for each training data point related to the j-th misclassified training data point, respectively. Based on the SVM algorithm, $y_j$ can be formulated as follows: where $\alpha_i^j$ is a dual variable, $y_i^j$ represents the output of the training data point $x_i$, x is a misclassified training data point whose true class label is to be found, $y^j(x)$ is the predicted class label for the misclassified training data point x, n is the number of training data points, and $\beta^j$ is a bias related to the j-th misclassified training data point. After the expression for the error is defined, the sliding surface is defined by Eq. (8). In Eq. (8), γ is considered as the input, and the aim is to find an optimum value of γ that minimizes the training error. By calculating $\dot{S}$ from Eq. (8) and replacing $\dot{S}$ with its value in Eq. (9), Eq. (9) can be rewritten as:
$$\ddot{e} + \lambda \dot{e} = -f(S) \tag{12}$$
where $\ddot{e}$ and $\dot{e}$ are the second and first derivatives of the error, e, respectively, f(S) is a function of the time-varying surface S in the state space R, and λ is a strictly positive constant. From Eq. (12), it can be seen that the first and second derivatives of y are needed, as e is a function of y and y is a function of γ.
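To make the dependence of the predicted label on γ concrete, the sketch below evaluates a kernel SVM decision for one query point at two values of γ. It assumes the common RBF parameterization K(x_i, x) = exp(−γ‖x_i − x‖²), since the exact kernel form is not reproduced in the text above; the training points, dual variables, and bias are hypothetical values rather than the result of solving the QP.

```python
import math

def rbf_kernel(x_i, x, gamma):
    """RBF kernel exp(-gamma * ||x_i - x||^2) (assumed parameterization)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x))
    return math.exp(-gamma * sq_dist)

def decision_value(x, train_x, train_y, alphas, bias, gamma):
    """Weighted kernel sum plus bias; its sign gives the predicted label."""
    return sum(a * y * rbf_kernel(x_i, x, gamma)
               for a, y, x_i in zip(alphas, train_y, train_x)) + bias

def predict(x, train_x, train_y, alphas, bias, gamma):
    return 1 if decision_value(x, train_x, train_y, alphas, bias, gamma) >= 0 else -1

# Hypothetical two-point model.
train_x = [[0.0, 0.0], [2.0, 0.0]]
train_y = [1, -1]
alphas = [1.0, 1.0]
bias = -0.2
query = [0.5, 0.0]

label_large_gamma = predict(query, train_x, train_y, alphas, bias, gamma=1.0)
label_small_gamma = predict(query, train_x, train_y, alphas, bias, gamma=0.01)
```

With the other quantities held fixed, changing γ alone flips the predicted label of the query point in this toy example, which is why γ can be treated as the input that the controller adjusts.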
Since the sign function does not have a derivative, it is replaced with a sigmoid function in different ways to formulate and solve this problem. As the algorithm uses misclassified training points to update the γ parameter, two types of misclassified data may occur: a) the mispredicted label of the training point is −1 (y = −1), whereas its correct label is 1; b) the mispredicted label of the training point is 1 (y = 1), whereas its correct label is −1.
In this paper, y = sigmoid(x) is considered as a function defining the degree to which a data point belongs to the class. The sign and sigmoid functions are illustrated in Figure 5. Figure 5 shows that larger positive x values represent data with labels of 1, and when x → +∞, the data point is classified into the y = 1 class. However, if x → −∞, y becomes zero, which implies that this data point does not belong to the y = 1 class. This misclassification is due to using unoptimized values of the RBF parameter, γ, and of $\alpha_i$. SVM uses the sign function to determine the class of each data point within the dataset. However, the proposed method uses different functions to find the accurate class of the identified misclassified data points. For simplicity, in this article the functions $\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$ and $-\mathrm{sigmoid}(-x) = \frac{-1}{1+e^{x}}$, which are invertible functions with known derivatives, are used to deal with misclassified data points in class −1 and 1, respectively. These two functions help to tackle the irreversibility problem of the sign function. Using the first assumption, Eq. (11) can be rewritten as: and the derivative of Eq. (13) can be written as: Thus, by substituting Eq. (13) into Eq. (10) and removing j, which represents the j-th data point in Eq. (10) and Eq. (11), $e$, $\dot{e}$, and $\ddot{e}$ can be rewritten as: where x ∈ X is a misclassified training data point within the set of misclassified training data points, X, the $\alpha_i$ are dual variables, y is the predicted class label of the misclassified data point x, $y_d$ is the desired class label of the misclassified data point x, $y_i$ is the true label of the training data point $x_i$, and n is the total number of training data points.
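The two smooth replacements for the sign function can be written out and checked numerically. This is a minimal sketch; the function names are chosen for this example, and the derivative formulas are the standard ones for the logistic sigmoid rather than expressions taken from the paper's equations.

```python
import math

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^{-x}), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neg_sigmoid_neg(x):
    """-sigmoid(-x) = -1 / (1 + e^{x}), with values in (-1, 0)."""
    return -1.0 / (1.0 + math.exp(x))

def sigmoid_prime(x):
    """Derivative of sigmoid(x): sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def neg_sigmoid_neg_prime(x):
    """Derivative of -sigmoid(-x); equal to sigmoid'(x) since -sigmoid(-x) = sigmoid(x) - 1."""
    return sigmoid_prime(x)

# Central-difference check of both derivatives at a sample point.
x0, h = 0.7, 1e-6
numeric_1 = (sigmoid(x0 + h) - sigmoid(x0 - h)) / (2 * h)
numeric_2 = (neg_sigmoid_neg(x0 + h) - neg_sigmoid_neg(x0 - h)) / (2 * h)
```

Both functions are smooth and strictly increasing, so they can be differentiated and inverted, which the sign function does not allow. For very large negative or positive arguments `math.exp` can overflow, so a production version would need a numerically stable formulation.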

For simplification, $m_i$, $n_i$, and $q_i$ are defined as: Now, by replacing $e$ and the resulting expression for $\dot{e}$ in Eq. (8), S can be rewritten as: And by substituting (17), (18), and (19) into (15) and (16), and then substituting (15) and (16) into (12), $\dot{S}$ can be derived as: As these equations are derived for misclassified training data points with y = −1, replacing y and $y_d$ in Eq. (21) results in: To solve Eq. (22), this equation is written in matrix form as follows: where $Q \in \mathbb{R}^{n \times n}$, $p \in \mathbb{R}^{1 \times n}$, and f(S) are a matrix, a vector, and a constant, respectively, and the value of f(S) is calculated using the previous value of γ. The optimum value of q can be determined by calculating the derivative of Eq. (24) with respect to q: As Eq. (25) is an underdetermined problem, Q is not a full-rank matrix, and the problem may have many solutions. In this paper, the pseudo-inverse method is used to find an estimate of the vector q. Since $p^T$ is not in the column space of Q in general, the calculated q vector is an estimate of q, where the column space of Q, denoted C(Q), can be written as: + λI ], and R is an n-by-n matrix. Consequently, q can be calculated using the pseudo-inverse of Q as:
$$q = -\frac{1}{2} Q^{\dagger} p^T \tag{26}$$
where $Q^{\dagger}$ represents the pseudo-inverse of Q, and the q vector can be determined by solving Eq. (26). However, only the positive elements of q, which satisfy Eq. (19), are acceptable. Using Eq. (19), the vector of corresponding γ values can be calculated and written as follows:
$$\left[ \gamma_2^{1}, \gamma_2^{2}, \ldots, \gamma_2^{i}, \ldots, \gamma_2^{l} \right], \quad \forall i = 1, \cdots, l \tag{27}$$
where l is the number of positive elements of the q vector and $\gamma_2^{i}$ is the γ value corresponding to the i-th positive element of q. By calculating the average of all elements of this vector, $\gamma_1$ is derived as:
$$\gamma_1 = \frac{1}{l} \sum_{i=1}^{l} \gamma_2^{i} \tag{28}$$
In the second stage, for the mispredicted data with y = 1, the function y = −sigmoid(−x), which is illustrated in Fig. 6, is used to determine the degree to which each of these data points belongs to its current class. From Fig. 6, it can be seen that data points with large negative x values belong to the y = −1 class, and data points with large positive values (x → +∞), which have y = 0, belong to the other class. By considering the −sigmoid(−x) function for these misclassified data points, Eq. (11) can be rewritten as: and the derivative of Eq. (29) can be written as: Thus, by substituting Eq. (29) into Eq. (10) and removing j, which represents the j-th data point in Eq. (10) and Eq. (11), $e$, $\dot{e}$, and $\ddot{e}$ can be rewritten as: And by substituting (17), (18), and (19) into (31) and (32), then substituting (31) and (32) into (12), and replacing y and $y_d$ with 1 and −1, respectively, $\dot{S}$ can be derived as: To solve Eq. (33), assuming $Q = 16 M M^T$ and $p = -4[N^T + \lambda M^T]$, Eq. (33) can be written in matrix form as follows: where M, N, and q were introduced in Eq. (23). The resulting Eq. (34), along with the procedures explained in Eqs. (25) to (28), is then used to calculate the $\gamma_1$ parameter for this misclassified data point with y = 1.
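For the second case, where Q = 16MMᵀ and p = −4[Nᵀ + λMᵀ], the pseudo-inverse step q = −(1/2)Q†pᵀ can be evaluated in closed form, because Q is then a rank-one matrix built from the column vector M. The sketch below uses the standard identity pinv(c·MMᵀ) = MMᵀ/(c‖M‖⁴) for a nonzero vector M; the numerical values of M, N, and λ are hypothetical, and the mapping from q to γ (Eq. (19)) is not reproduced here.

```python
def solve_q_rank_one(M, N, lam):
    """q = -(1/2) * pinv(Q) * p^T for Q = 16 M M^T and p = -4 (N^T + lam * M^T)."""
    norm_sq = sum(m * m for m in M)
    if norm_sq == 0.0:
        raise ValueError("M must be a nonzero vector")
    p = [-4.0 * (n + lam * m) for m, n in zip(M, N)]
    m_dot_p = sum(m * p_i for m, p_i in zip(M, p))
    # pinv(16 M M^T) = M M^T / (16 * ||M||^4), so pinv(Q) p^T = M * (M . p) / (16 * ||M||^4)
    scale = -0.5 * m_dot_p / (16.0 * norm_sq ** 2)
    return [scale * m for m in M], p

# Hypothetical values for illustration only.
M = [1.0, 2.0]
N = [0.5, -1.0]
lam = 1.0

q, p = solve_q_rank_one(M, N, lam)
positive_q = [q_i for q_i in q if q_i > 0]   # RBF case: only positive elements are kept

# Residual of 2 Q q + p^T; it is orthogonal to M but not zero, because p^T is not in C(Q).
m_dot_q = sum(m * q_i for m, q_i in zip(M, q))
residual = [32.0 * m * m_dot_q + p_i for m, p_i in zip(M, p)]
```

The nonzero residual illustrates the remark above that the calculated q is an estimate: the pseudo-inverse returns the least-squares solution when pᵀ does not lie in the column space of Q.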
For each misclassified data point, one of the two methods described above is used, depending on its predicted $y$, to calculate its $\gamma_1$ value. The resulting $\gamma_1$ values for all misclassified data points are then put together to form the $\gamma$ vector, as follows: where $m$ is the total number of misclassified data points and $\gamma_1^i$ represents $\gamma_1$ for misclassified data point $i$. Finally, a value for the RBF kernel parameter, $\gamma$, is determined by calculating the average of the $\gamma$ vector components using eq. 36: where $m$ is the total number of misclassified training data points. The resulting $\gamma$ is used as the RBF kernel parameter in the next iteration. The $\gamma$ optimization procedure continues until the total number of iterations is reached or the validation error does not change for a pre-defined number of iterations.
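The averaging step and the stopping rule can be sketched as follows. This is a hedged illustration only: the helper names `next_gamma` and `should_stop`, and the parameters `max_iters` and `patience`, are assumptions and do not appear in the paper. "Unchanged for a pre-defined number of iterations" is interpreted here as the last `patience + 1` validation errors being equal.

```python
import numpy as np

def next_gamma(per_point_gammas):
    """Average the gamma_1 values of all misclassified points (the role of eq. 36)."""
    g = np.asarray(per_point_gammas, dtype=float)
    if g.size == 0:
        raise ValueError("no misclassified points: nothing to average")
    return float(g.mean())

def should_stop(val_errors, max_iters, patience):
    """Stop when the iteration budget is spent or the validation error has
    stayed unchanged over the last `patience` iterations."""
    if len(val_errors) >= max_iters:
        return True
    if len(val_errors) <= patience:
        return False
    recent = val_errors[-(patience + 1):]
    return all(np.isclose(e, recent[0]) for e in recent)
```

In a full loop, `val_errors` would be extended with one validation error per iteration, and `next_gamma` would supply the kernel parameter for the following SVM training run.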

B. POLYNOMIAL OPTIMAL KERNEL PARAMETER ESTIMATION USING SVM BASED ON SMC
Without loss of generality, the general form of a 2nd-order polynomial kernel is considered as $(a x^T x_j + b)^2$, where $a$ and $b$ are the polynomial kernel parameters, $x$ is a misclassified training data point, $x_j$ is the $j$-th training data point for $j = 1, \ldots, n$, and $n$ is the total number of training data points. The aim of this algorithm is to find the optimum polynomial parameters. In this paper, a 2nd-order polynomial is considered for simplicity; however, a higher-order polynomial can be used in a similar way. The procedure of the proposed SVM based on SMC algorithm for finding the optimum polynomial kernel parameters is the same as the one explained for the RBF method, illustrated in Fig. 4, with some differences. These differences are detailed as follows:
1. In this algorithm, the polynomial kernel parameters, $a$ and $b$, are first initialized to one and then updated in each iteration. If these initial values result in zero training error, they are perturbed in the same way as explained in Section IV.A for the RBF parameter. These initial values were used because Zhang et al. [23] and Liu and Xu [30] also used them in their techniques, and the performance of the proposed method is compared with their techniques.
2. Both positive and negative values of the $q$ vector are acceptable in polynomial kernel parameter optimization, whereas only positive values of the $q$ vector were acceptable when updating the RBF parameter, as explained in Section IV.A.
To find optimum values for the polynomial kernel parameters, $a$ and $b$, the procedure starts from eq. 8 and 12, where the $\mathrm{sigmoid}(x)$ and $-\mathrm{sigmoid}(-x)$ functions are used for the two types of misclassified data points, $y = -1$ and $y = 1$, respectively. Hence, eq. 8 and 12 are derived using eq. 10 and 11 for each type of misclassified data point, as follows:
1) PROCEDURE FOR FINDING OPTIMUM VALUE FOR a WHEN y = −1
By replacing the 2nd-order polynomial kernel in eq. 13, $y(x)$ can be written as: $\dot{e}$ and $\ddot{e}$ can then be derived by substituting eq. 37 into eq. 10 and calculating the first and second derivatives with respect to $a$.
Now, by replacing $e$ and the resulting expression for $\dot{e}$ in eq. 8, $S$ can be rewritten as: And by substituting eq. 38 and 39 into eq. 12, $\dot{S}$ can be derived as: Now, by considering $m_i = 2\alpha_i y_i\, x^T x_i$, $n_i = 2\alpha_i y_i\, x^T x_i\, x^T x_i$ and $q_i = (a x^T x_i + b)$, and replacing $y = -1$ and $y_d = 1$ in eq. 41, eq. 41 can be rewritten in matrix form as: where $Q = -16MM^T$, $p = -4\lambda M^T$, $N = -4[n_1, n_2, \ldots, n_n]^T$, $M = [m_1, m_2, \ldots, m_n]^T$, $r = [1, 1, \ldots, 1]^T$ and $n$ is the total number of training data points. The optimum value of $q$ can be determined by calculating the derivative of eq. 42 with respect to $q$: Now the $q$ vector can be written as: where $Q^{\dagger}$ represents the pseudo-inverse of $Q$ and the $q$ vector can be determined by solving eq. 43. By using $q_i = (a_2^i x^T x_i + b)$, the corresponding vector is obtained as: where $l$ is the total number of $q$ elements, $a_2^i$ is the $i$-th element of this vector and $a_2^i = (q_i - b)/(x^T x_i)$. Finally, $a_1$ is computed as the average of all elements:
By replacing $y(x)$ from eq. 37 into eq. 10 and calculating the first and second derivatives of the resulting $e$ with respect to $b$, $\dot{e}$ and $\ddot{e}$ can be determined. Then, by replacing the resulting $e$, $\dot{e}$ and $\ddot{e}$ into eq. 8 and 9, $S$ and $\dot{S}$ can be derived, as follows: Assuming $m_i = 2\alpha_i y_i$ and $q_i = (a x^T x_i + b)$, and replacing $y = -1$ and $y_d = 1$ in eq. 47, eq. 47 can be written in matrix form as follows: where $Q = -16MM^T$, $p = -4\lambda M^T$, $M = [m_1, m_2, \ldots, m_n]^T$, $r = [1, 1, \ldots, 1]^T$ and $n$ is the total number of training data points. The optimum value of $q$ can be determined by calculating the derivative of eq. 48 with respect to $q$: Now the $q$ vector can be written as: where $Q^{\dagger}$ represents the pseudo-inverse of $Q$ and the $q$ vector can be determined by solving eq. 49. By using $q_i = (x^T x_i + b_2^i)$, the corresponding vector is obtained as: where $l$ is the total number of $q$ elements, $b_2^i$ is the $i$-th element of this vector and $b_2^i = q_i - x^T x_i$.
Finally, the average of all elements is defined as $b_1$, as follows:
In the second stage, to find optimum values for the polynomial kernel parameters, $a$ and $b$, the procedure starts from eq. 8 and 12, where the $-\mathrm{sigmoid}(-x)$ function is used for the misclassified data points with $y = 1$. Hence, eq. 8 and 12 are derived using eq. 10 and 11, as follows: By replacing the 2nd-order polynomial kernel in eq. 13, $y(x)$ can be rewritten as: By replacing $y(x)$ from eq. 52 into eq. 10 and calculating the first and second derivatives of the resulting $e$ with respect to $a$, $\dot{e}$ and $\ddot{e}$ can be determined. Then, by replacing the resulting $e$, $\dot{e}$ and $\ddot{e}$ into eq. 8 and 9, $S$ and $\dot{S}$ can be derived, as follows: Now, by considering $m_i = 2\alpha_i y_i\, x^T x_i$, $n_i = 2\alpha_i y_i\, x^T x_i\, x^T x_i$ and $q_i = (a x^T x_i + b)$, and replacing $y = 1$ and $y_d = -1$ in eq. 54, eq. 54 can be rewritten in matrix form as: Now the $q$ vector can be written as: where $Q^{\dagger}$ represents the pseudo-inverse of $Q$ and the $q$ vector can be determined by solving eq. 56. By using $q_i = (a_2^i x^T x_i + b)$, the corresponding vector is obtained as: where $l$ is the total number of $q$ elements, $a_2^i$ is the $i$-th element of this vector and $a_2^i = (q_i - b)/(x^T x_i)$. Finally, $a_1$ is computed as the average of all elements:
By replacing $y(x)$ from eq. 52 into eq. 10 and calculating the first and second derivatives of the resulting $e$ with respect to $b$, $\dot{e}$ and $\ddot{e}$ can be determined. Then, by replacing the resulting $e$, $\dot{e}$ and $\ddot{e}$ into eq. 8 and 9, $S$ and $\dot{S}$ can be derived, as follows: Assuming $m_i = 2\alpha_i y_i$ and $q_i = (a x^T x_i + b)$, and replacing $y = 1$ and $y_d = -1$ in eq. 60, eq. 60 can be rewritten in matrix form as follows: $\ldots, m_n]^T$, $r = [1, 1, \ldots, 1]^T$ and $n$ is the total number of training data points. The optimum value of $q$ can be determined by calculating the derivative of eq. 61 with respect to $q$: Now the $q$ vector can be written as: where $Q^{\dagger}$ represents the pseudo-inverse of $Q$ and the $q$ vector can be determined by solving eq. 62.
By using $q_i = (x^T x_i + b_2^i)$, the corresponding vector is obtained as: where $l$ is the total number of $q$ elements, $b_2^i$ is the $i$-th element of this vector and $b_2^i = q_i - x^T x_i$. Finally, the average of all elements is defined as $b_1$, as follows: Procedures 1 to 4 are used to determine $a_1$ and $b_1$ for all misclassified training data. Finally, $a$ and $b$ are determined by calculating the average of all resulting $a_1$ and $b_1$ values, respectively, as follows: where $m$ is the total number of misclassified training data points. The resulting $a$ and $b$ are used as the 2nd-degree polynomial kernel parameters for the next iteration.
The above procedure will be continued until the total number of iterations is reached or the validation error does not change for a pre-defined number of iterations.
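The recovery of $a_1$ and $b_1$ from a solved $q$ vector can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `poly_kernel`, `a_from_q` and `b_from_q` are hypothetical, the rows of `X` are taken to be the training points $x_i$, and every inner product $x^T x_i$ is assumed to be non-zero so that the division in $a_2^i = (q_i - b)/(x^T x_i)$ is defined. The expression for $b_2^i$ follows the relation exactly as it is written in the text.

```python
import numpy as np

def poly_kernel(x, X, a=1.0, b=1.0):
    """Second-order polynomial kernel (a * x^T x_j + b)^2 for every row x_j of X."""
    return (a * (X @ x) + b) ** 2

def a_from_q(q, x, X, b):
    """a_1: mean of a_2^i = (q_i - b) / (x^T x_i); assumes no zero inner products."""
    dots = X @ x
    return float(np.mean((q - b) / dots))

def b_from_q(q, x, X):
    """b_1: mean of b_2^i = q_i - x^T x_i, as the relation is stated in the text."""
    return float(np.mean(q - X @ x))

# Toy data: three training points and one misclassified point x.
X = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
x = np.array([1.0, 1.0])
dots = X @ x                       # inner products [1, 2, 2]
a_hat = a_from_q(3.0 * dots + 1.0, x, X, b=1.0)
b_hat = b_from_q(dots + 0.5, x, X)
```

The per-point values returned by `a_from_q` and `b_from_q` would then be averaged over all misclassified training points to obtain the $a$ and $b$ used in the next iteration.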

V. SIMULATION AND EXPERIMENTAL RESULTS
To evaluate the performance of the proposed method, experimental results were generated using ten datasets from the UCI machine learning repository [41], including: Letter Recognition (LR) (letters 'A' and 'N' are used for this experiment), Wisconsin Breast Cancer (WBC), Liver Disorder (LD), Haberman, Diabetes, Heart disease, Ionosphere, Parkinson and Sonar. The Letter Recognition dataset consists of 20000 instances with 17 attributes for each data point (one label and 16 numerical features); the labels are the 26 capital letters of the English alphabet. The Wisconsin Breast Cancer database consists of 699 instances with 11 attributes for each data point, where benign and malignant labels are 2 and 4, respectively. The Liver Disorder dataset consists of 345 instances with 7 attributes for each data point. The Haberman dataset was generated during a study on the survival of patients who had undergone breast cancer surgery; this database consists of 304 instances with 3 attributes for each instance. The Parkinson dataset is composed of a range of biomedical voice measurements from 31 people, 23 with Parkinson's Disease (PD). Each column in the table is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals (''name'' column). This database was created mainly to discriminate healthy people from those with PD. The Diabetes database has two classes of data and consists of 804 instances with 8 attributes for each data point. The Heart disease database consists of 303 instances with 75 attributes for each data point. The Ionosphere database, which is used for binary classification, consists of radar data with 351 instances and 34 attributes for each data point. The Sonar database contains 208 instances with 60 attributes for each data point.
To generate experimental results, all the databases were normalized and then each dataset was randomly divided into three subsets, train, test, and validation, of size 70, 20 and 10 percent, respectively. Training subsets were used for updating the kernel parameter, validation subsets were used for terminating the optimization algorithm [20], as mentioned in Section IV, and test subsets were used for evaluating and comparing the performance of the proposed algorithm. The following settings were used to generate results: f (S) = 50 * arctan(S/10), λ = 0.3 and regularization parameter, C = 100.1. The resulting number of Support Vectors (SVs) and the accuracy achieved on the train and test data by the proposed technique using its RBF kernel parameter optimization algorithm were calculated, compared to those of the anchor SVM, and tabulated in Table 2. From Table 2, the proposed technique uses significantly fewer SVs than the anchor SVM (up to 93.51% reduction), while giving higher test accuracy. This implies that the proposed method is faster than its anchor SVM in its test phase.
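The experimental setup above can be illustrated with a short sketch of the random 70/20/10 split and the switching function $f(S) = 50 \arctan(S/10)$. The helper names `split_indices` and `f_S`, the fixed `seed`, and the rounding used to turn percentages into subset sizes are assumptions made for this example; the paper does not specify them.

```python
import numpy as np

def split_indices(n, seed=0, fractions=(0.7, 0.2, 0.1)):
    """Randomly split range(n) into train / test / validation index arrays."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(fractions[0] * n))
    n_test = int(round(fractions[1] * n))
    train = perm[:n_train]
    test = perm[n_train:n_train + n_test]
    val = perm[n_train + n_test:]      # remainder, about 10 percent
    return train, test, val

def f_S(S):
    """Smooth switching function f(S) = 50 * arctan(S / 10) from the settings."""
    return 50.0 * np.arctan(S / 10.0)

train_idx, test_idx, val_idx = split_indices(200)
```

Because `arctan` is bounded, `f_S` saturates at about ±78.5 (that is, ±50·π/2) for large |S| and is approximately 5·S near the origin, which is a common way to smooth the discontinuous sign function used in classical sliding mode control.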
To give the reader a sense of the number of iterations that the proposed algorithm needs to determine its optimal kernel parameter, the initial value of γ, the calculated optimal value of γ, and the number of iterations that the algorithm used to determine the optimal value of γ for ten different databases are tabulated in Table 3. This table shows that the proposed method arrives at the optimum value of γ in a small number of iterations.
The performance of the proposed method using its RBF kernel parameter optimization algorithm was compared to that of Zhang et al.'s [23] and Liu and Xu's [30] methods on five databases (Parkinson, Ionosphere, Sonar, Haberman and Iris), and the results are presented in Table 4. From Table 4, it can be seen that the proposed method gives results that are either superior to or very competitive with those of Zhang et al.'s and Liu and Xu's methods. The average γ values that were used to generate experimental results for the three techniques are also given in Table 4.
The performance of the proposed method using its 2nd-degree polynomial kernel optimization algorithm was also compared to that of Zhang et al.'s [23] and Liu and Xu's [30] techniques on three databases (Iris, Ionosphere, and Haberman), and the results are presented in Table 5. (In [30], Liu and Xu presented experimental results for a 2nd-order polynomial kernel, $(a x^T x_j + b)^2$, for SVM classification, where $a$ and $b$ were set to one, on the Iris, Ionosphere and Haberman databases. Therefore, these three databases were used to generate experimental results for the proposed method using its 2nd-degree polynomial kernel optimization algorithm.) From Table 5, it can be seen that the proposed method outperforms both Zhang et al.'s and Liu and Xu's techniques in terms of accuracy. The average values of $a$ and $b$ that were used to generate experimental results for the proposed technique are presented in Table 5.

VI. CONCLUSION
In this paper, a kernel parameter optimization algorithm for support vector machine based on the sliding mode control algorithm in a closed-loop manner was presented. The proposed algorithm introduced an error equation and a sliding surface and then iteratively updated the kernel parameter until the maximum number of iterations was reached or the training error stayed unchanged for a predefined number of iterations. Two types of kernels, an RBF kernel and a 2nd-degree polynomial kernel, were considered in this paper. Ten publicly available databases were used to assess the performance of the proposed method and compare it with existing methods. Experimental results show the merit of the proposed method in terms of accuracy, training and testing speed, total number of support vectors and robustness of the algorithm.