A Novel Hybrid Classification Method Based on the Opposition-Based Seagull Optimization Algorithm

In practice, classification problems arise in many scientific fields, including finance, medicine and industry, so it is critically important to develop effective and accurate classification models. Although numerous useful classifiers have been proposed, many are unstable, sensitive to noise and computationally slow. To overcome these drawbacks, combining feature selection techniques with traditional machine learning models is of great help. In this paper, a novel feature selection method called the opposition-based seagull optimization algorithm (OSOA) is proposed and studied. The OSOA is constructed from the seagull optimization algorithm (SOA), whose population is initialized by the opposition-based learning (OBL) algorithm. To evaluate its overall classification performance, several measures are adopted, including classification accuracy, number of selected features, the receiver operating characteristic (ROC) curve, and computation time. The empirical results indicate that the suggested method exhibits higher or similar accuracy and computational efficiency in comparison with genetic algorithm (GA)-, simulated annealing (SA)-, and Fisher score (FS)-based classification models. The experimental results show that the OSOA is a computationally efficient feature selection technique with the ability to select relevant variables. Furthermore, it performs well on high-dimensional data whose number of variables exceeds the number of samples. Thus, the OSOA is an effective approach for enhancing classification performance.


I. INTRODUCTION
With the development of computer and information techniques, large amounts of data are being generated from numerous sources, including economic activities, public administration and other scientific research fields [1]. To make sense of the data, machine learning techniques for the extraction of important patterns and trends from data and the prediction of data properties are employed. Machine learning techniques have been applied to a wide range of fields, including agriculture [2], finance [3], [4], and medicine [5]. Basically, the related techniques can be categorized as supervised or unsupervised learning methods. In supervised learning, the goal is to predict the value of predefined target variables based on independent variables, whereas in unsupervised learning, there are no predefined target variables, and the goal is to describe the relationship and patterns among a set of independent variables [6], [7]. Classification is one of the typical and fundamental tasks of supervised learning. Feature selection is effective for handling high-dimensional data to enhance the overall performance of classification, which has been proven in both theory and practice [8]-[10]. The main goal of classification is to assign the instances in the test datasets to a predefined category based on the information classifiers acquired from the training datasets.

A. LITERATURE REVIEW
Many classification algorithms have been developed thus far. For instance, logistic regression (LR) is a simple and effective classifier. It has wide applications in fields that require interpreting the relationship between independent variables (features) and dependent variables (classes), or the roles independent variables play in models, such as business [4] and industry [11]. Support vector machines (SVMs) handle classification tasks by constructing a hyperplane in the sample space or in a feature space mapped by a kernel function. The use of a kernel function makes SVM a powerful method [12]. In some specific problems, e.g., prediction of chemical activity [13] and credit risk evaluation [14], researchers have designed new kernel functions to improve the performance of SVM. The least squares support vector machine (LSSVM) is a least squares version of SVM whose constraints are a set of linear equations, whereas classical SVMs solve a quadratic programming problem. Thus, LSSVM is more computationally efficient and can be applied to large-scale problems [15]. An artificial neural network (ANN) is a system of numerous connected neurons that simulates biological neural networks. The topology of an ANN comprises three parts: the input layer, hidden layers and output layer. The training procedure of an ANN adjusts the connection weights between neurons. ANNs are valuable and attractive classification techniques because they are nonlinear, data-driven, self-adaptive classifiers and universal function approximators, which can handle noisy data and do not need many a priori assumptions [16]. These distinguishing features have allowed ANNs to enjoy fruitful applications in many fields [17]. A multi-layer perceptron neural network (MLPNN) is constructed of simple perceptrons and trained by a back-propagation algorithm [18], and it has received wide application in time-series problems [19], [20].
The back propagation neural network (BPNN), a typical and classic ANN, can find highly complex and nonlinear solutions to classification problems, which makes BPNN a very popular algorithm for complex nonlinear systems. However, it has problems with local optima and poor convergence, especially when it has a large set of neurons [21]. A radial basis function neural network (RBFNN) is a type of feedforward network based on computational intelligence with a simple structure and high efficiency. Moreover, RBFNN has the ability to perform nonlinear mapping and global optimal approximation [22]. Though classification approaches have achieved great success in various fields, they encounter a serious problem with high-dimensional data, known as the ''curse of dimensionality'' [23]. In high-dimensional data, a large number of features increases the size of the feature space, and many of them are irrelevant or redundant, which makes it difficult to recognize patterns for forecasting or classification. In addition, computational complexity is another challenge in processing high-dimensional data. Consequently, it is necessary to reduce the dimensionality of the data. Feature selection identifies relevant features from an original feature set by removing irrelevant and redundant features. It contributes to reduced training time, better interpretability of the classification results and improved classification performance, especially in high-dimensional cases. From the perspective of searching strategies, feature selection methods can be classified into filter or wrapper approaches [24], [25]. Wrapper-type methods select a subset of features by a search algorithm bound to a given classifier. Many intelligent optimization algorithms have been adopted to build wrapper feature selection methods. A genetic algorithm (GA) is a stochastic, global optimization algorithm that can be used to perform feature selection naturally.
However, GA converges slowly due to its unguided mutation operator [26]. Simulated annealing (SA) is another optimization algorithm used in feature selection tasks. However, SA cannot handle problems with large solution spaces well. Particle swarm optimization (PSO), a swarm intelligence optimization method, has the ability to retain and share good solutions among all particles. Moreover, PSO is easy to implement and computationally efficient due to its algorithmic simplicity. However, PSO is not stable in high-dimensional search spaces and suffers from early convergence [27].
When features are evaluated by some criteria without classifiers, these approaches are called filter-type methods [28]. Fisher score, a filter method, computes a score for each attribute; the most discriminative features are those with higher scores. Then, a proper number of features can be picked according to their scores. The minimum redundancy maximum relevance (mRMR) filter method is based on mutual information and mainly contains two stages. First, the best individual features correlated with the target variables are selected by the maximal relevance criterion. Then, the redundant features among those obtained in the first step are removed by the minimal redundancy criterion [29]. A risk that mRMR suffers from is that some uninformative features, called irrelevant redundant features, may be retained. In addition, mRMR is not suitable for high-dimensional data [30]. ReliefF, derived from the original Relief algorithm, evaluates the usefulness of features according to each feature's weight, computed by searching the nearest neighbors from the same and different classes of randomly selected instances. ReliefF is capable of handling incomplete and noisy data but is still unable to delete redundant features [31].
One can also find a subset of features by minimizing a goodness-of-fit score of the model, such as the AIC [32], BIC [33], or Mallows's Cp [34]. However, these approaches are infeasible for a large number of features. To overcome this shortcoming, some regularization methods have been applied as feasible approaches to high-dimensional problems.
Ridge regression [35] with the l2 penalty and LASSO [36] with the l1 penalty are two typical regularization methods. Ridge regression is an effective technique for multicollinearity problems [37], and it can shrink coefficients, but never exactly to zero. Namely, ridge regression is unable to complete feature selection tasks. In contrast, LASSO has the desirable quality of shrinking some coefficients of uninformative features to zero.
That is, LASSO achieves the goal of feature selection by compressing some coefficients to zero using the l1 penalty term. It is well accepted that LASSO has the ability to select the most relevant features from a broad set of candidate variables and enhance predictive performance. In addition, LASSO is statistically consistent as the number of samples increases, and strict assumptions are not required [38]. Importantly, LASSO can be employed for the problem of multicollinearity, which is a very common phenomenon in high-dimensional problems, and it is an effective feature selection technique for high-dimensional data [39], [40]. It has been proven that hybrid models have the ability to overcome the drawbacks of using a single classifier [6].

B. CONTRIBUTION
In this research, we propose a novel hybrid classification method based on an OSOA. We borrow strengths from both the SOA and OBL, the latter being embedded to determine the population of the SOA. Computationally, we have derived an efficient algorithm to obtain a global minimizer of the method and better classification performance. We have shown the advantages of the proposed method on different datasets, including high-dimensional datasets, via comparison with some state-of-the-art feature selection methods such as Fisher score, simulated annealing and the genetic algorithm. In addition, the well-known LASSO method is also compared. To evaluate the proposed method, accuracy, ROC, AUC and computational efficiency are adopted, and comprehensive comparisons are made between the proposed method and other popular methods. The rest of this paper is organized as follows. Section I gives the introduction, and the theoretical background is presented in Section II. Section III exhibits the proposed OSOA method, and Section IV shows the experimental results. Finally, a conclusion is drawn in Section V.

II. THEORETICAL BACKGROUND

A. SEAGULL OPTIMIZATION ALGORITHM
The seagull optimization algorithm (SOA) [41] is a recently proposed metaheuristic optimization technique inspired by the natural behaviors of seagulls. Seagulls, scientifically known as Laridae, are intelligent birds. They can attract fish and earthworms by using breadcrumbs or making a rain-like sound with their feet. Generally, seagulls live in colonies. To find abundant food, they often migrate from one place to another. After arriving at a new place, seagulls attack their prey. The most important behaviors of seagulls are thus migrating and attacking. Accordingly, the SOA focuses on these two natural behaviors, and the mathematical models are presented below.
First, seagulls perform migration behavior. During migration, the members of a seagull swarm should avoid colliding with each other. To achieve this purpose, an additional variable A is employed:

C(t) = A × P(t)

where P(t) represents the current position of a seagull in the t-th iteration, C(t) is its collision-avoided position, and A depicts the movement behavior of the seagull:

A = a − t × (a / Max_iteration), t = 0, 1, . . . , Max_iteration

where a is a constant responsible for controlling the frequency of employing variable A, which linearly decreases from a to 0. To find the richest food resources, seagulls move toward the best search agent:

M(t) = B × (P_bs(t) − P(t))
where M(t) represents the movement of a seagull toward the best search agent (the fittest seagull) P_bs(t). The coefficient B is a random value responsible for making a trade-off between exploitation and exploration, and is defined as:

B = 2 × A² × rd

where rd is a random number that lies in the interval [0, 1]. As seagulls move toward the fittest search agent, they might remain close to each other. Thus, seagulls can update their positions according to the following rule:

D(t) = |C(t) + M(t)|  (5)

where D(t) represents the distance between a seagull and the best search agent. Second, seagulls attack prey in a spiral shape after arriving at a new place. Their attacking behavior can be formulated as:

P(t + 1) = D(t) × x × y × z + P_bs(t)  (6)

where P_bs(t) retains the best solution and x, y and z depict the traits of spiral motion:

x = r cos(k), y = r sin(k), z = r × k, r = u × e^{kv}
where u and v are constants, e is the base of the natural logarithm, and k is a random number between 0 and 2π .
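The migration and attack rules above can be combined into a single position update. The following Python sketch is illustrative only: the function and parameter names (`soa_update`, `a`, `u`, `v`) and their default values are our assumptions, and the paper's experiments were run in R, not Python.

```python
import numpy as np

def soa_update(P, P_bs, t, max_iter, a=2.0, u=1.0, v=1.0, rng=None):
    """One SOA update (migration + attack) for a single seagull.

    P: current position (1-D array); P_bs: best search agent found so far.
    a, u, v are the control constants of the algorithm (values assumed here).
    """
    rng = np.random.default_rng() if rng is None else rng
    A = a - t * (a / max_iter)         # linearly decreases from a to 0
    C = A * P                          # collision-avoided position
    rd = rng.random()
    B = 2.0 * A**2 * rd                # exploration/exploitation trade-off
    M = B * (P_bs - P)                 # movement toward the best search agent
    D = np.abs(C + M)                  # distance to the best search agent
    k = rng.uniform(0.0, 2.0 * np.pi)  # random angle of the spiral
    r = u * np.exp(k * v)              # spiral radius
    x, y, z = r * np.cos(k), r * np.sin(k), r * k
    return D * x * y * z + P_bs        # attack: spiral toward the best agent
```

Note that at the final iteration A reaches 0, so C, M and D all vanish and the update collapses onto the best seagull; this is the exploitation endpoint of the algorithm.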

1) OPPOSITION-BASED LEARNING
Opposition-based learning (OBL) [42] was first proposed in 2005. Since then, OBL has been widely applied to improve the performance of metaheuristic algorithms, reinforcement learning and other machine intelligence techniques. In this work, we focus on employing OBL to help a metaheuristic optimization algorithm search for the global optimum. In general, a metaheuristic starts with a randomly generated population and iteratively updates the current solutions. By applying OBL, the opposite solution of the current solution is produced. Then, OBL compares the fitness of the current solution with the corresponding opposite solution and keeps the better one. Therefore, OBL has the potential to accelerate the convergence of the metaheuristic algorithm and obtain optima more easily. Here, we introduce some key concepts related to our work.
Assuming that x is a real number that lies in the interval [l, u], the opposite number x̄ of x is defined as:

x̄ = u + l − x

where u and l are the upper and lower bounds of the problem, respectively. For higher-dimensional problems, let x = (x_1, x_2, . . . , x_d) with x_i ∈ [l_i, u_i]. The opposite vector x̄ = (x̄_1, x̄_2, . . . , x̄_d) can be defined component-wise as x̄_i = u_i + l_i − x_i. The details of how to integrate OBL into the metaheuristic (the SOA, in this work) will be discussed in the next section.
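As a quick concrete illustration (our own sketch, not the authors' code), the opposite of a point, scalar or vector, is a one-liner:

```python
import numpy as np

def opposite(x, lower, upper):
    """Opposite point of x within per-dimension bounds [lower, upper]."""
    return np.asarray(upper) + np.asarray(lower) - np.asarray(x)
```

For example, the opposite of 0.2 in [0, 1] is 0.8: a candidate far from the optimum on one side of the search space yields a probe on the other side, which is why evaluating both and keeping the better one can accelerate convergence.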

2) FISHER SCORE
Fisher score (FS) is a type of filter method, based on the Fisher criterion, which has the ability to select the most relevant features; features with higher Fisher scores should be selected. Given a dataset {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d denotes that there are d features and y_i ∈ {1, . . . , k} denotes that the dataset has k classes, the Fisher score of the i-th feature f_i is calculated by the following expression:

F(f_i) = Σ_{j=1}^{k} n_j (μ_{ij} − μ_i)² / Σ_{j=1}^{k} n_j σ_{ij}²

where n_j indicates the number of samples in class j, μ_i denotes the mean value of f_i over all samples, and μ_{ij} and σ_{ij}² denote the mean value and variance of f_i within the j-th class, respectively. In a nutshell, the importance of every feature is measured by FS, and the top features with high scores are selected after ranking.
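The score above can be computed in a few lines. The following numpy sketch is illustrative only (our naming; the paper's experiments used the R PredPsych package):

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher score of each column of X: between-class scatter of the class
    means divided by the (class-size weighted) within-class variance."""
    mu = X.mean(axis=0)                       # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        n_j = Xc.shape[0]
        num += n_j * (Xc.mean(axis=0) - mu) ** 2   # between-class term
        den += n_j * Xc.var(axis=0)                # within-class term
    return num / den
```

A feature whose class-conditional means are far apart relative to its within-class spread gets a high score and is ranked near the top.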

3) LASSO
Least absolute shrinkage and selection operator (LASSO) was first introduced by Tibshirani [36] as a constrained version of ordinary least squares [43]. Given a dataset with response vector y and design matrix X, LASSO solves

min_β ‖y − Xβ‖₂²  subject to  ‖β‖₁ ≤ t

where β is the vector of regression coefficients and t ≥ 0 is the constraint term; ‖·‖₁ is the l1-norm and ‖·‖₂ is the l2-norm. Writing the above optimization problem in Lagrangian form gives

min_β ‖y − Xβ‖₂² + λ‖β‖₁

Here, λ is a tuning parameter that controls the strength of shrinkage. By applying the l1-norm, coefficients can be shrunk to exactly zero if λ is large enough, and more coefficients will be shrunk to zero as λ increases. Thus, LASSO can be seen as a continuous and stable feature selection method. Moreover, it produces a sparse solution and makes the model easier to interpret by adjusting the parameter λ.
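To see the shrinkage mechanism concretely: in the special case of an orthonormal design, the LASSO solution has a closed form, namely soft-thresholding of the ordinary-least-squares coefficients. A minimal numpy sketch of that special case (illustrative, not the authors' code):

```python
import numpy as np

def soft_threshold(b_ols, lam):
    """LASSO coefficients under an orthonormal design: shrink each OLS
    coefficient toward zero by lam, setting it to exactly zero if it crosses."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)
```

With OLS coefficients [3.0, −2.0, 0.4, −0.1] and λ = 0.5, the small coefficients are zeroed out while the large ones are merely shrunk, which is exactly the feature-selection behavior described above.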

4) SIMULATED ANNEALING
Simulated annealing (SA) is a global optimization technique that simulates the annealing phenomenon of metallurgy. Usually, SA starts with a randomly generated solution at a fairly high temperature. To find the global optimal solution, the initial temperature should be as large as possible. Next, the initial solution is updated in a certain way as the temperature decreases until the termination condition is reached. The most widely used method of temperature decrease is T_{k+1} = λT_k, where T_k is the current temperature, T_{k+1} is the updated temperature, and λ is a constant less than, but close to, 1. Theoretically, the temperature should decrease to 0 or SA will not converge, which is considerably difficult to realize in practice. Some alternative methods, e.g., setting a minimal temperature value or setting a maximal number of iterations directly, are often adopted. The flowchart of SA (for the minimization case) is presented in Fig.1. At a certain temperature, a new solution, say S_new, is generated by a neighbor function, whose specific form depends on the problem domain. Then, the new solution is compared with the current solution, say S_old, by an evaluation function. If the new solution is superior to the current solution, then the transposition is S_old ← S_new. If not, SA accepts the inferior one with a certain probability p(·) generated by an acceptance function. This is the key step that allows SA to jump out of local optima. The probability is associated with the temperature: it is higher at the beginning and tends to decrease as the temperature decreases. The Metropolis algorithm is frequently adopted as the acceptance function. These procedures are repeated until the optimal solution is found or some stopping criteria are met.
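The loop just described can be sketched compactly. This is an illustrative Python skeleton under our own naming (geometric cooling, Metropolis acceptance), not the caret implementation used later in the paper:

```python
import math
import random

def simulated_annealing(evaluate, neighbor, s0, T0=1.0, T_min=1e-4,
                        lam=0.95, steps=50, rng=None):
    """Minimize evaluate(s). Cooling: T <- lam * T; acceptance: Metropolis."""
    rng = rng or random.Random(0)
    s, e = s0, evaluate(s0)
    best, best_e = s, e
    T = T0
    while T > T_min:
        for _ in range(steps):
            s_new = neighbor(s, rng)
            e_new = evaluate(s_new)
            # accept improvements always; accept worse moves with prob exp(-dE/T)
            if e_new < e or rng.random() < math.exp(-(e_new - e) / T):
                s, e = s_new, e_new
                if e < best_e:
                    best, best_e = s, e
        T *= lam  # cool down
    return best, best_e
```

Minimizing a simple quadratic with a Gaussian neighbor function shows the skeleton converging near the optimum despite the random, occasionally uphill moves.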

5) GENETIC ALGORITHM
A genetic algorithm (GA), a type of evolutionary algorithm, is inspired by the process of natural selection. The flowchart of a standard GA is presented in Fig.2. It produces a set of solutions, which are completely independent of each other, at the same time. Every solution is encoded as a sequence of bits, numbers, etc. The sequence is referred to as a chromosome or individual. Populations consist of chromosomes (individuals), while genes are the elements of an encoded solution, which compose chromosomes. In selection, the chromosomes with higher fitness values are more likely to be selected and used for recombination. Roulette wheel and tournament selection are two commonly used selection operators. Crossover, the pivotal process in GA, refers to two chromosomes exchanging some of their genes with each other according to the crossover probability; the result is that two new chromosomes are generated. Mutation indicates that the genes in a chromosome are altered with a certain probability. By applying the three genetic operators, the convergence of GA is guaranteed [44]. Moreover, GA has the ability to process large search spaces [45], [46].
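The three operators can be put together in a minimal binary GA for feature selection. The sketch below is illustrative (our own naming; tournament selection, one-point crossover, bit-flip mutation, and no elitism, unlike the caret configuration used in Section IV):

```python
import random

def ga_feature_select(fitness, n_features, pop_size=20, generations=40,
                      p_cross=0.8, p_mut=0.1, rng=None):
    """Minimal binary GA. fitness(mask) returns a value to MAXIMIZE."""
    rng = rng or random.Random(0)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        new_pop = []
        while len(new_pop) < pop_size:
            # tournament selection (size 3) of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            c1, c2 = p1[:], p2[:]
            if rng.random() < p_cross:        # one-point crossover
                cut = rng.randrange(1, n_features)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for c in (c1, c2):                # bit-flip mutation
                for i in range(n_features):
                    if rng.random() < p_mut:
                        c[i] = 1 - c[i]
            new_pop += [c1, c2]
        pop = new_pop[:pop_size]
    return max(pop, key=fitness)
```

Run on the trivial "count the ones" objective, the population quickly converges toward the all-ones chromosome, which is the usual smoke test for a GA implementation.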

B. CLASSIFICATION MODELS

1) SUPPORT VECTOR MACHINE
Support Vector Machine (SVM) was initially introduced for linearly separable classification problems [47]. However, there are numerous datasets in real life that are nonlinearly separable [48]. To deal with these cases, kernel tricks are adopted. Considering a dataset {(x_i, y_i)}_{i=1}^n with x_i ∈ R^d and y_i ∈ {−1, +1}, the soft-margin SVM solves

min_{w,b,ξ} (1/2)‖w‖² + C Σ_{i=1}^{n} ξ_i
s.t. y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, . . . , n

where w is the weight vector, b is the bias, ξ_i are slack variables, φ(·) is the feature map induced by the kernel function k(·, ·) that maps the input space into a (higher-dimensional) feature space, and C is a real constant determined by users that balances margin maximization and training error. According to duality theory, the Lagrangian dual form of SVM is

max_α Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j k(x_i, x_j)
s.t. Σ_{i=1}^{n} α_i y_i = 0, 0 ≤ α_i ≤ C

where α_i are the Lagrangian multipliers. There are many algorithms that can be applied to solve the above optimization problem [49]-[51]. After solving this optimization problem, the decision function is given by

f(x) = sign( Σ_{i=1}^{n} α_i y_i k(x_i, x) + b )

A commonly used kernel function is the radial basis kernel k(x, x′) = exp(−‖x − x′‖² / (2σ²)), where σ is a constant.
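As a small concrete piece of the above, the radial basis kernel can be evaluated between two sets of points as follows (illustrative numpy sketch, our own naming):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """Radial basis kernel matrix: K[i, j] = exp(-||x1_i - x2_j||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))
```

The resulting kernel matrix is symmetric, has ones on its diagonal (each point is identical to itself), and all entries lie in (0, 1], decaying with distance.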

2) LEAST-SQUARE SUPPORT VECTOR MACHINE
Least-square support vector machine (LSSVM) is a least-square version of SVM, which is used for classification and regression analysis. LSSVM uses equality constraints rather than the inequality constraints used in SVM, and it converts the quadratic programming problem into a linear equation problem by utilizing a sum-of-squared-errors cost function instead of the nonnegative-errors cost function in the SVM model. Consequently, LSSVM consumes less computational resources [52]-[54]. Given a dataset {(x_i, y_i)}_{i=1}^n, x_i ∈ R^d indicates that there are d attributes, and y_i ∈ {−1, +1} represents that the output is binary. Assuming that the dataset is linearly inseparable in its attribute space, the input space x_i ∈ R^d will be mapped into a higher-dimensional space (feature space) by a nonlinear mapping function φ(·), which is illustrated in Fig.3. Therefore, the optimal decision function can be constructed in the feature space as

y(x) = ω^T φ(x) + b

where ω, the weight vector, and b, the bias, are the two parameters to be estimated, and φ(x) is the nonlinear mapping function. To solve the above equation, a constrained optimization problem is constructed according to the structural risk minimization principle [55]:

min_{ω,b,ζ} (1/2)‖ω‖² + (C/2) Σ_{i=1}^{n} ζ_i²
s.t. y_i(ω^T φ(x_i) + b) = 1 − ζ_i, i = 1, . . . , n

where C is the penalty factor that controls the trade-off between the complexity and the approximation precision of LSSVM, and ζ_i is the error between the prediction for sample i and its true output value. Due to the difficulty of solving the above optimization problem directly, the Lagrange multiplier method is applied here. The Lagrange multiplier theorem states that, at any local maximum (or minimum) of the function evaluated under the equality constraints, if the constraint qualification applies, then the gradient of the function can be expressed as a linear combination of the gradients of the constraints (at that point), with the Lagrange multipliers acting as coefficients.
Thus, its corresponding Lagrangian function is built as follows:

L(ω, b, ζ, α) = (1/2)‖ω‖² + (C/2) Σ_{i=1}^{n} ζ_i² − Σ_{i=1}^{n} α_i [ y_i(ω^T φ(x_i) + b) − 1 + ζ_i ]

where α_i are the Lagrangian multipliers. It is worth noting that the Lagrangian multipliers in LSSVM can be positive or negative, whereas they must be positive in SVM [56]. Allowing inequality constraints, the KKT (Karush-Kuhn-Tucker) approach to nonlinear programming generalizes the method of Lagrange multipliers, which allows only equality constraints. Similar to the Lagrange approach, the constrained maximization (minimization) problem is rewritten as a Lagrange function whose optimal point is a saddle point. According to the KKT conditions, we can get

∂L/∂ω = 0 ⇒ ω = Σ_i α_i y_i φ(x_i)
∂L/∂b = 0 ⇒ Σ_i α_i y_i = 0
∂L/∂ζ_i = 0 ⇒ α_i = C ζ_i
∂L/∂α_i = 0 ⇒ y_i(ω^T φ(x_i) + b) − 1 + ζ_i = 0

Next, applying Mercer's theorem,

K(x_i, x_j) = φ(x_i)^T φ(x_j)

where K(x_i, x_j) is the kernel function. Eliminating ω and ζ_i, a linear equation set is obtained:

[ 0    y^T           ] [ b ]   [ 0 ]
[ y    Ω + C^{−1}I_n ] [ α ] = [ E ]

where E is an n-dimensional unit vector of ones, α = (α_1, α_2, . . . , α_n)^T is the parameter vector of LSSVM, and I_n is an n × n identity matrix.
Here Ω is an n × n kernel matrix whose elements are defined as Ω_{ij} = y_i y_j K(x_i, x_j). After solving the above linear equation set, the parameters b and α are obtained [57]. Then, the final model of LSSVM is

y(x) = sign( Σ_{i=1}^{n} α_i y_i K(x, x_i) + b )
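Because training reduces to the linear system above, an LSSVM classifier can be fit with a single call to a linear solver instead of a QP routine. The following numpy sketch (our own naming, RBF kernel assumed) is illustrative only, not the kernlab implementation used in the experiments:

```python
import numpy as np

def lssvm_train(X, y, C=10.0, sigma=1.0):
    """Train an RBF-kernel LSSVM classifier by solving its linear KKT system."""
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))          # kernel matrix
    Omega = np.outer(y, y) * K                    # Omega_ij = y_i y_j K(x_i, x_j)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = y                                  # first row:    [0, y^T]
    A[1:, 0] = y                                  # first column: [0; y]
    A[1:, 1:] = Omega + np.eye(n) / C             # Omega + C^{-1} I_n
    rhs = np.concatenate([[0.0], np.ones(n)])     # [0; E]
    sol = np.linalg.solve(A, rhs)                 # one linear solve, no QP
    b, alpha = sol[0], sol[1:]

    def predict(Xq):
        d2q = ((Xq[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        Kq = np.exp(-d2q / (2.0 * sigma ** 2))
        return np.sign(Kq @ (alpha * y) + b)

    return predict
```

On a small, well-separated two-class problem the resulting decision function recovers the training labels, illustrating that the single linear solve replaces the iterative QP training of a standard SVM.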

III. PROPOSED METHOD
In this section, the proposed hybrid classification method is explained. In our work, there are two main stages. First, an opposition-based seagull optimization algorithm (OSOA) is employed to conduct feature selection on the original dataset. Second, classification is performed on the reduced data obtained from the first stage. Details of the proposed hybrid method are presented below. To evaluate the performance of the selected feature subset, a fitness function is needed. In our work, the fitness function is defined as:

Fitness(S) = β × E(S) + (1 − β) × |S| / |D|  (31)

where E(S) is the error of a classifier on the feature subset S.
|D| is the dimension of the original dataset, |S| is the size of the feature subset S, and β is a constant used to balance the feature subset size and the classifier's accuracy. In this work, KNN is employed as the evaluator in the feature selection stage.
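The fitness just described is a weighted sum of classifier error and relative subset size, a common form in wrapper feature selection. The sketch below assumes exactly that form; the names and the default weight β = 0.99 are our assumptions, not values stated by the authors:

```python
def fitness(error, n_selected, n_total, beta=0.99):
    """Wrapper fitness: weighted sum of classifier error E(S) and the
    relative subset size |S|/|D|. beta close to 1 prioritizes accuracy."""
    return beta * error + (1.0 - beta) * n_selected / n_total
```

For equal classifier error, a smaller feature subset receives a (slightly) better fitness, which is the pressure that drives the search toward compact subsets.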

A. FEATURE SELECTION STAGE

1) INITIAL POPULATION
In this step, OBL is applied to initialize the population of the SOA. Generally, the SOA starts with a randomly generated population. By applying OBL, the diversity of the SOA's population is enhanced; the diversified population improves the convergence and search abilities of the SOA. To initialize the population, the OSOA begins with a predefined population size N. Then, the OSOA randomly generates an individual x and the corresponding opposite individual x̄. Next, both x and x̄ are evaluated by the fitness function, and the better one is kept. This process is repeated until the predefined population size is reached. For clarity, the initialization procedure is presented in Algorithm 1. After initializing the population, the OSOA is applied to update the seagulls' positions. As aforementioned, every seagull is coded as a binary vector, whereas the original SOA was proposed for processing continuous problems. Thus, a binary version of the SOA is needed. To achieve this goal, a transfer function T(·), which maps a real-valued position component to a probability in [0, 1], is applied. To obtain binary values, every seagull is transformed by this function according to the following formula:

P_d(t + 1) = C(P_d), if rd < T(P_d(t)); P_d, otherwise  (33)

where P_d(t) is the d-th dimension of P(t) obtained from Eq (6), P(t + 1) is the updated position, P_d is the value of the d-th dimension of P(t), C(P_d) is the complement of P_d, and rd is a random number between 0 and 1.
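The initialization procedure of Algorithm 1 can be sketched for binary individuals as follows (illustrative Python, our own naming; for a binary vector the opposite individual is simply the bitwise complement):

```python
import random

def obl_init(pop_size, dim, fitness, rng=None):
    """OBL initialization for a binary population: generate a random individual
    and its opposite (bitwise complement), keep whichever has LOWER fitness."""
    rng = rng or random.Random(0)
    pop = []
    while len(pop) < pop_size:
        x = [rng.randint(0, 1) for _ in range(dim)]
        x_opp = [1 - bit for bit in x]      # opposite of a binary vector
        pop.append(min(x, x_opp, key=fitness))
    return pop
```

With a fitness to be minimized, every kept individual is at least as good as its opposite, so the starting population is no worse, and typically better, than a purely random one.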
After updating positions, every seagull is evaluated by the fitness function. These updating steps are repeated until the maximum number of iterations is reached. Then, the OSOA returns the fittest seagull (the best solution). The fittest seagull represents the final selected feature subset. The procedures of the feature selection stage are presented in Algorithm 2.

Algorithm 2
Opposition-Based Seagull Optimization Algorithm for Feature Selection
1: Input the training dataset and initialize the parameters of the SOA
2: Initialize the population P_i = (p_1, . . . , p_j, . . . , p_d), i = 1, 2, . . . , n by applying Algorithm 1
3: while t < max_iteration do
4:   for i = 1 to n do
5:     evaluate the fitness of P_i using Eq (31)
6:     set P_bs as the fittest seagull
7:     perform migration using Eq (5)
8:     perform attacking using Eq (6)
9:     perform the binary change on each element of P_i using Eq (33)
10:  end for
11:  evaluate the fitness of each seagull
12:  update the fittest seagull
13:  t = t + 1
14: end while
15: return P_bs

B. CLASSIFICATION STAGE
In this stage, SVM and LSSVM are applied to perform the classification task, whereas in the feature selection stage, KNN is employed to evaluate the quality of the selected feature subset. The main reasons for choosing different classifiers in the feature selection stage and the classification stage are twofold. First, KNN is a computationally efficient model; wrapper feature selection models are often argued to be computationally expensive, and by applying KNN, the OSOA can select the optimal feature subset faster. Second, KNN is a simple model, so the OSOA can avoid overfitting to some extent. The flowchart of the proposed hybrid classification method is given in Figure 4.

IV. EXPERIMENTS

A. EXPERIMENTS DESCRIPTION
To validate the performance of the hybrid methods, some experiments were performed. In particular, the proposed OSOA feature selection method was compared with four other state-of-the-art feature selection methods: GA, SA, FS, and Lasso. The selected features were tested on two classification models, LSSVM and SVM. The hybrid models are established by combining these feature selection models individually with the classification models. First, the feature selection methods are applied to the 7 datasets. Second, the classification models are applied to each dataset with the features selected in the first step. All of the experiments are implemented in R 3.6.0. For the feature selection methods, FS and Lasso can be implemented using the PredPsych package [58] and the glmnet package [59], and GA and SA are available in the caret package [60]. For the classification models, SVM and LSSVM are available in the kernlab package [61]. There are 7 datasets applied in the experiments; the first five, Hill-Valley, Ionosphere, Heart, Twonorm, and Ringnorm, are taken from the UCI repository [62], and Colon and Prostate, which are high-dimensional datasets, are taken from the R package datamicroarray [63]. Table 1 shows the details of these datasets. The second column indicates the names of the datasets, and the third column indicates the number of features in each dataset. The training and test samples were divided randomly and are presented in the fourth and fifth columns, respectively. All 7 datasets are binary classification problems. The next two columns indicate the labels and the number of instances associated with each label. The last column shows the reference.

B. PARAMETER SETTING
For GA, the population size is 20, and the number of elites is 1 for each generation. The crossover and mutation probabilities are 0.8 and 0.1, respectively, which are the default values. For SA, all of the parameters are defaults. Both GA and SA run for 100 iterations. For FS, the threshold was set empirically. For Lasso, all of the parameters are set to defaults.
For the classification models, including SVM and LSSVM, we mainly tried different kernel functions and parameters and selected the optimum. There are no parameters in the spline and linear kernel functions. For the other six kernels, all of the scale parameters (sigma in the ANOVA RBF, polynomial, Bessel, radial basis and Laplacian kernels; scale in the polynomial and hyperbolic tangent kernels) belong to [0.0001, 1000], and we start from 0.0001 and multiply by 10 for each experiment. The parameter degree (polynomial, Bessel and ANOVA RBF kernels) can only be a positive constant; we tuned it from 1 to 20 because the models (SVM, LSSVM) are very sensitive to this parameter. For the parameter offset (polynomial and hyperbolic tangent kernels), we only tried 1 and 10, since this parameter has little effect on the results.

C. RESULTS
The number of features selected and the processing time are presented in Table 1. The second column shows the original numbers of features. It is easy to see that Colon and Prostate are high-dimensional data, whose numbers of attributes are 2,000 and 12,600, respectively. The numbers of features selected by GA, SA, FS, and Lasso and their computation times are also presented in this table. It is observed that the proposed OSOA-LSSVM and OSOA-SVM delivered better performance than the alternatives. For instance, for the Twonorm dataset, OSOA-SVM achieved a perfect classification outcome with 100% accuracy, and OSOA-LSSVM achieved an accuracy of 99.32%. For the Hill-Valley dataset, both OSOA-SVM and OSOA-LSSVM achieved 98% accuracy, which is far higher than that of the other methods. Figure 4 and Figure 5 show the ROC plots of all of the hybrid classification methods applied to all of the datasets. It is easy to see that OSOA-LSSVM and OSOA-SVM have larger areas, which indicates the advantage of the OSOA in terms of feature selection. In terms of the number of selected features, the OSOA is comparable to Lasso on the Heart and Ionosphere datasets. The OSOA selected fewer features than Lasso did on the Twonorm and Ringnorm datasets. On the Colon and Prostate datasets, the OSOA selected more features. However, it is believed that the selected features are important, since the classification accuracy is boosted to a large extent. Compared with Lasso-SVM, which achieved 75% accuracy using 16 features, OSOA-SVM obtained 93.75% accuracy using 1,500 features. Furthermore, we find the processing times of the OSOA, FS and Lasso to be comparable, while SA and GA take more computational time. Notably, the computational cost of GA is extremely expensive (14.68 hours) on the Ringnorm dataset. Consequently, the OSOA-based hybrid classification methods are the best models in terms of both classification accuracy and computational efficiency.

V. CONCLUSION
In this study, a novel hybrid classification approach is suggested by combining feature selection and machine learning methods. Specifically, the proposed approach is based on an OSOA, which performs feature selection. The OSOA is an effective and computationally efficient feature selection technique. Moreover, the OSOA has the ability to process high-dimensional data as well. In the proposed model, there are two phases: (1) feature selection is done by the OSOA, and (2) the data with the selected features are classified. The developed method was tested on seven datasets. Among these datasets, Colon and Prostate are high-dimensional data. Comparisons were made between the proposed method and other popular methods. The experimental results indicate that the overall performance of the proposed method is superior to that of other well-known feature selection approaches.

HE JIANG received the bachelor's degree in applied mathematics and the first master's degree in probability and mathematical statistics from Lanzhou University, in 2009 and 2012, respectively, the second master's degree in statistics from Florida State University, in 2014, and the Ph.D. degree from Florida State University, USA. He is currently an Associate Professor of statistics with the Jiangxi University of Finance and Economics. His main research interests are focused on machine learning, variable selection, and data mining techniques related to big data.