Novel Combined Variable Selection Approach Using Memetic Algorithm With Complex Harmonic Regularization

Variable selection has been highly successful in big data analysis, and regularization approaches are commonly used methods that automatically select important variables while constructing machine learning models. Because real datasets have complex relationships between groups of relevant variables, many group-sparsity regularization approaches have been proposed recently. However, these approaches are usually non-convex and sensitive to hyper-parameters, so optimizing the regularizations is a challenging task. In this paper, we present a novel memetic algorithm with complex harmonic regularization (MA-CHR), which combines an evolutionary computation (EC) algorithm for the hyper-parameters (global search) with complex harmonic regularization and a path seeking strategy for group-sparsity variable selection (local search). We further introduce a novel genetic individual representation (intron + exon) to efficiently obtain the global optimal solution of this group-sparsity regularization. Simulation and five real-data experiments demonstrate that the proposed MA-CHR method outperforms state-of-the-art regularization methods in selecting groups of relevant variables and in classification.


I. INTRODUCTION
Variable selection is one of the important issues in high-dimensional and massive data analysis, and it has attracted increasing attention in machine learning. Regularizations are commonly used methods for variable selection, which can automatically select important variables while constructing machine learning models, and their applications have become increasingly popular. Lasso [1] is one of the most popular regularization methods. In order to obtain sparser solutions, a growing number of sparse regularization methods have been proposed, such as the smoothly clipped absolute deviation penalty (SCAD) [2], adaptive Lasso [3], the minimax concave penalty (MCP) [4], the Lp penalty [5], smoothed Lp regularization [6], the L1/2 penalty [7], [8] and harmonic regularization [9].
However, beyond sparsity, real datasets generally have complex relationships between groups of relevant variables, widely referred to as the grouping effect [10]-[18]. That is, datasets typically exhibit high correlations or grouping effects between variables, especially in biological data, gene regulatory networks and protein-protein interactions. Using prior knowledge, regularization methods can rapidly generalize to new tasks containing correlation information in supervised learning. For instance, group Lasso [10] can be used for multivariate analysis based on the variance model, where each variable has multiple levels and can be represented by a set of dummy variables. In addition, some regularization methods can produce group-sparsity learning models without prior knowledge. For example, Elastic Net (EN) regularization [11] linearly combines the L1 penalty with the L2 penalty. Similarly, Zou and Zhang [14] studied an adaptive elastic net that adds the L2 penalty to the adaptively weighted Lasso. By incorporating the L2 penalty, the SCAD-L2 penalty [15] inherits the good performance of SCAD and can be extended to select grouped variables. Huang et al. [16] studied a new regularization method, termed HLR, which linearly combines the L1/2 penalty and the L2 penalty in a logistic regression model for gene selection. However, one limitation of the aforementioned methods (L2 and Lp (0 < p < 1) with a fixed value of p) is that they often require the data to satisfy a certain probability distribution [17]. Very recently, Liu et al. [18] proposed complex harmonic regularization (CHR), which produces solutions that closely approximate the combination of Lp (1/2 ≤ p ≤ 1) and Lq (1 < q < 2) regularizations, with adjustable p and q to match the sparsity and grouping effect of the data.

(The associate editor coordinating the review of this manuscript and approving it for publication was Hao Ji.)
However, most existing group-sparsity regularization methods inevitably involve hyper-parameters, such as the control parameters that balance the loss and penalty functions in machine learning models, and the mixing parameters that balance grouping effect and sparsity in the data. Grid search and k-fold cross-validation (CV) are typically used to adjust these hyper-parameters. Generally, the performance of regularization methods is extremely sensitive to the hyper-parameters, which raises the difficulty of applying them and reduces the stability of their performance in practical problems.
Among combination methods, many evolutionary computation approaches to variable selection combine filter and wrapper methods, providing population-based optimization with local search. For instance, Zhu et al. [40] proposed a filter-based variable ranking method within a memetic algorithm (MA), which balances local and global search to improve optimization quality and efficiency. However, these two-stage memetic methods have a potential limitation: the filter phase may eliminate potentially useful variables regardless of their performance during the wrapper phase. On the other hand, wrapper methods usually involve a series of evaluations, each of which can be very time-consuming, especially when the dimension of the variables is large.
To alleviate the limitations of the existing hybrid evolutionary computation methods for variable selection, and to help globally optimize the hyper-parameters of group-sparsity regularization methods, we present a novel memetic algorithm with complex harmonic regularization (MA-CHR) for variable selection in this paper. The proposed method combines the evolutionary optimization of the hyper-parameters (global search) with the complex harmonic regularization for group-sparsity variable selection (local search). The memetic optimization procedure is represented by genetic chromosomes consisting of introns (hyper-parameters) and exons (variable coefficients of the learning model). Simulation and real-data experiments substantiate the capability of the proposed method for prediction and variable selection.
The remaining sections are organized as follows. Section II reviews the related work of regularization methods and evolutionary computation algorithms for variable selection. Section III introduces the proposed method as well as the detailed algorithm. Section IV presents the results of simulation and real data experiments. Finally, the conclusion is given in Section V.

II. RELATED WORK
A. REGULARIZATION METHODS
Regularizations are important embedded methods for variable selection. Suppose the dataset is D = {(X_1, y_1), (X_2, y_2), . . ., (X_n, y_n)}, where n is the number of samples, X_i = (x_i1, x_i2, . . ., x_it) is the i-th sample with t dimensions, and y_i is the corresponding dependent variable. Linear regression with a regularization term can be expressed as:

min_β (1/2n) Σ_{i=1}^{n} (y_i − X_i β)^2 + λ P(β),

where P(β) is the penalty function and λ is the hyper-parameter controlling the complexity of the model. As the most commonly used regularization method, Lasso takes P(β) = ||β||_1, which constructs the model while selecting the variables. In order to obtain sparser solutions, a growing number of sparse regularization methods based on the L1 norm have been proposed, such as SCAD [2], adaptive Lasso [3] and MCP [4]. The L0 regularization, given by P(β) = ||β||_0, yields the sparsest solutions, but it leads to an NP-hard combinatorial optimization problem. To improve the prediction accuracy of the machine learning model, researchers have studied Lp (0 < p < 1) regularization with a fixed p, especially p = 1/10, 1/2, 2/3, 9/10 [21]. In [7], Xu et al. proposed the L1/2 regularization, which can serve as a representative of the Lp (0 < p < 1) penalties. They analyzed its thresholding representation [8] and demonstrated that solving the L1/2 regularization is easier than solving the L1 regularization. Furthermore, they also proved that the L1/2 regularization possesses unbiasedness and oracle properties [8], which make it an effective tool for solving high-dimensional variable selection problems [47].
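The variable-selection behavior of an L1-type penalty can be seen in a minimal sketch. This is an illustrative toy implementation of coordinate-descent Lasso, not the paper's code; the data, the value of `lam`, and the function name `lasso_cd` are all assumptions for demonstration.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate-descent Lasso: min_b (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, t = X.shape
    beta = np.zeros(t)
    col_sq = (X ** 2).sum(axis=0) / n             # per-column scaling x_j'x_j / n
    for _ in range(n_sweeps):
        for j in range(t):
            r = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding j
            rho = X[:, j] @ r / n
            # soft-thresholding: small coefficients are set exactly to zero
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n, t = 100, 50
X = rng.normal(size=(n, t))
beta_true = np.zeros(t)
beta_true[:5] = 2.0                               # only 5 relevant variables
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = lasso_cd(X, y, lam=0.1)
selected = np.flatnonzero(beta_hat)               # nonzero coefficient = selected
print("selected variables:", selected)
```

The soft-thresholding step is what makes the fitted model sparse: coefficients whose partial correlation with the residual falls below λ are zeroed out, so model fitting and variable selection happen simultaneously.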
The aforementioned sparse regularization methods, however, do not consider grouping effects. Faced with a set of highly correlated variables, a sparse regularization method tends to select only one variable to represent the corresponding group. To solve this problem, several group-sparsity regularization methods have been proposed, including Elastic Net [11], SCAD-L2 [15] and L1/2 + L2 [16]. However, these methods (L2 and Lp (0 < p < 1) with a fixed value of p) often require the data to satisfy a certain probability distribution [17]. Very recently, Liu et al. [18] proposed complex harmonic regularization (CHR), which produces solutions that closely approximate the combination of Lp (1/2 ≤ p ≤ 1) and Lq (1 < q < 2) regularizations, with adjustable p and q to match the sparsity and grouping effect of the data. The CHR penalty consists of two parts, m(β) and n(β), whose explicit expressions are given in [18]; a and b are the hyper-parameters of the CHR penalty. The CHR penalty function is plotted in Figure 1. When the hyper-parameter a is close to 0, m(β) ≈ |β| (Figure 1(c)), so the first part m(β) of the CHR penalty approximates the L1 regularization. When a is close to 1, m(β) ≈ |β|^{1/2} (Figure 1(b)), so m(β) approximates the L1/2 regularization. When b is close to 0, n(β) ≈ |β|^2 (Figure 1(e)), so the second part n(β) approximates the L2 regularization. When b is close to 1, n(β) ≈ |β| (Figure 1(f)), so n(β) approximates the L1 regularization. Moreover, the first derivative of the CHR penalty is finite at the origin, which implies that the corresponding regularization problem can be solved efficiently via path seeking techniques.
In theory, a sufficient condition for the grouping effect of variables is strict convexity of the penalty function, and the Lq (1 < q < 2) penalty is strictly convex [11]. Therefore, the second part of the CHR penalty, n(β), approximates the Lq (1 < q < 2) penalty and thereby induces a grouping effect. The experimental results in [18] show that the CHR method is very promising on both artificial and real gene expression data.
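The Lp family that CHR interpolates between can be pictured numerically. This sketch is not the CHR formula itself (that is given in [18]); it only shows the generic Lp penalty to illustrate why p ≤ 1 promotes sparsity (small coefficients are penalized relatively heavily) while 1 < q < 2 gives the strict convexity behind the grouping effect.

```python
import numpy as np

def lp_penalty(beta, p):
    """Generic Lp penalty: sum of |beta_j|^p over the coefficient vector."""
    return float(np.sum(np.abs(beta) ** p))

beta = np.array([0.0, 0.1, 1.0])
# As p decreases below 1, the small coefficient 0.1 contributes relatively
# more to the penalty, so shrinking it to exactly zero pays off (sparsity).
# For q in (1, 2) the penalty is strictly convex, which induces grouping.
for p in (0.5, 1.0, 1.5, 2.0):
    print(f"p = {p}: penalty = {lp_penalty(beta, p):.4f}")
```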

B. EVOLUTIONARY COMPUTATIONS FOR VARIABLE SELECTION
Evolutionary computation (EC) randomly generates an initial population of candidate solutions in the search space, and then iteratively updates it through crossover, mutation and selection operators to obtain the optimal solution. In recent years, more and more researchers have applied local search (LS) techniques within the random search process of EC to improve optimization efficiency [22]; these hybrid algorithms are often referred to as memetic algorithms (MA). The MA approach combines wrapper and filter variable evaluation criteria for variable selection, providing a local search opportunity for population-based optimization. For example, Zhu et al. [40] proposed a filter-based variable ranking method within MA, which balances local and global search to improve optimization quality and efficiency. Zhu et al. [36] then integrated the Markov blanket method into MA to simultaneously identify all relevant or partially relevant variables. The work in [44] proposed another two-stage variable selection algorithm, in which the variables are first ranked using the Relief-F algorithm and the top-ranked variables are then used as the input to a memetic wrapper variable selection algorithm. Other researchers have introduced heuristic hybrid mechanisms that combine filter ranking scores to guide GA and PSO searches for wrapper variable selection [26]-[29]. In addition, MA for variable selection has been used to solve practical problems such as optimal controller design [48] and motif discovery in DNA, microRNA and protein sequences [49], [50].
As shown above, in most hybrid evolutionary computation methods for variable selection, the EC stage is used for wrapper feature selection, while the filter-based LS algorithm helps to reach a local optimal solution. However, these ''wrapper + filter'' two-stage memetic methods cannot guarantee that the variables selected in the filter phase are the best candidates for the EC phase. Therefore, the filter phase of the MA algorithm may eliminate potentially useful variables, regardless of their performance during the wrapper process.

III. MEMETIC ALGORITHM WITH COMPLEX HARMONIC REGULARIZATION
To alleviate the limitations of the existing hybrid evolutionary computation methods for variable selection, and to help globally optimize the hyper-parameters of group-sparsity regularization methods, we present a novel memetic algorithm with complex harmonic regularization (MA-CHR) for variable selection. The proposed method combines the evolutionary optimization of the hyper-parameters (global search) with the complex harmonic regularization for group-sparsity variable selection (local search). The memetic optimization procedure is represented by genetic chromosomes consisting of introns (hyper-parameters) and exons (variable coefficients of the learning model). In the first step of the MA-CHR method, the evolutionary population is randomly initialized, with each chromosome encoded by an intron part and an exon part. Subsequently, CHR regularization (local search) is performed on the exon parts with the intron parts held fixed, to reach a local optimal solution or improve the fitness of individuals in the population. Genetic operators such as mutation and crossover are performed on the intron parts of the chromosomes. After that, the selection operator produces the next population. This process is repeated until the stopping condition is satisfied. Each component is explained below.

A. CHROMOSOME REPRESENTATION: INTRON AND EXON
In the proposed MA-CHR method, the four hyper-parameters λ1, λ2, a, b and the coefficients β1, β2, . . ., βt of the candidate variable subset are encoded as a chromosome: intron + exon = (λ1, λ2, a, b, β1, β2, . . ., βt). The length of the chromosome is t + 4, where t is the total number of variables. The intron part of the chromosome is globally optimized by the EC operators. Although the search space of the intron part is non-convex and multi-modal, EC can find a global optimal solution because the dimension of the intron is very low. On the other hand, the exon part is optimized by the regularization method, which selects variables and builds the learning model simultaneously. In the exon part, a nonzero value of βi means that the model has selected the corresponding variable; conversely, βi = 0 means that the candidate variable has been rejected. T denotes the maximum number of nonzero βi exons allowed in each chromosome. When prior knowledge about the optimal number of variables is available, we can limit T to a predetermined value; otherwise T = t.
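The intron + exon layout described above can be sketched as a flat numeric vector. This is a hypothetical encoding for illustration: the function name, the hyper-parameter bounds, and the zero-initialization of the exon are assumptions, not the authors' implementation.

```python
import numpy as np

def make_chromosome(t, rng):
    """Encode one MA-CHR individual: intron (4 hyper-parameters) + exon (t betas)."""
    intron = np.array([
        rng.uniform(0.01, 1.0),   # lambda1: loss/penalty trade-off (assumed range)
        rng.uniform(0.01, 1.0),   # lambda2
        rng.uniform(0.0, 1.0),    # a: sparsity mixing (L1 <-> L1/2)
        rng.uniform(0.0, 1.0),    # b: grouping mixing (L2 <-> L1)
    ])
    exon = np.zeros(t)            # beta coefficients, later set by local search
    return np.concatenate([intron, exon])   # total length t + 4

chrom = make_chromosome(2000, np.random.default_rng(0))
selected = np.flatnonzero(chrom[4:])        # nonzero beta = selected variable
print(len(chrom), len(selected))            # 2004 0
```

Before local search runs, every exon coefficient is zero, so no variable is selected; the CHR local search fills in the exon, and the EC operators only ever touch the first four positions.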

B. OBJECTIVE FUNCTION
The objective function is defined by:

Fitness(chromosome) = Accuracy of the classification model with (λ1, λ2, a, b, β1, β2, . . ., βt),

where a nonzero βi denotes a selected variable in the variable subset encoded in the exon part of the chromosome. The objective function evaluates the significance of the given variable subset. In this paper, the fitness is specified as the classification accuracy of the logistic regression model with the chromosome (λ1, λ2, a, b, β1, β2, . . ., βt), fitted using the complex harmonic regularization method. Note that when two chromosomes have similar fitness, i.e., the difference between their fitness values is smaller than a small value e, the chromosome with the smaller number of selected variables is more likely to be selected for the next generation.
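The tie-breaking rule above can be sketched as a simple comparison function. This is an illustrative sketch, not the authors' code; the tolerance value and the function name `better` are assumptions.

```python
import numpy as np

def better(chrom_a, fit_a, chrom_b, fit_b, e=1e-3):
    """Prefer A over B: higher fitness wins, but ties go to the sparser exon."""
    if abs(fit_a - fit_b) < e:                       # fitness values are "similar"
        na = np.count_nonzero(chrom_a[4:])           # selected variables in A's exon
        nb = np.count_nonzero(chrom_b[4:])
        return na <= nb                              # fewer selected variables wins
    return fit_a > fit_b                             # otherwise higher accuracy wins

a = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 0.0, 2.0])   # exon selects 2 variables
b = np.array([0.1, 0.1, 0.5, 0.5, 1.0, 3.0, 2.0])   # exon selects 3 variables
print(better(a, 0.900, b, 0.9005))                   # True: tie, sparser A preferred
```

This encodes the stated preference directly: near-equal accuracy is resolved in favor of the chromosome that uses fewer variables, which biases the search toward parsimonious models.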

C. LOCAL SEARCH IMPROVEMENT PROCEDURE (LS)-PATH SEEKING APPROACH FOR THE SPARSE LOGISTIC REGRESSION WITH THE CHR PENALTY
In this section, we consider the use of complex harmonic regularization with a path seeking algorithm as the meme, or local search method, of our proposed MA-CHR method. Usually, the computation time of a path seeking algorithm [51] increases linearly with the dimension of the variable selection problem, so such methods are an effective way to solve regularization problems. When applying Lp (0 < p < 1) regularization, however, standard path seeking algorithms cannot be used directly, so we also need to solve the optimization problem of the Lp (0 < p < 1) penalty. Fortunately, we can use a direct path seeking method to handle the CHR penalty. Popular path seeking methods based on squared error include partial least squares regression (PLS) [52], forward stepwise regression [53], least angle regression [12], piecewise linear paths [54] and gradient boosting [55]. Friedman [56] proposed a generalized path seeking algorithm that produces solutions very close to the optimum for any convex loss function and non-convex constraint. The advantage of the path seeking method is that it provides a new way of solving non-convex penalized regularization problems. Let ν measure the length along the path, and let Δν > 0 be a small increment.
Here φj(ν) and ϕj(ν) are the j-th gradient components of the logistic loss function and the penalty function, respectively, and τj(ν) = −φj(ν)/ϕj(ν). Then, the path seeking algorithm for the CHR regularization is given as follows.

Algorithm 1
The Path Seeking Algorithm for the CHR Penalty

After initializing the path and computing the vector τ(ν), we identify the set S of nonzero coefficients β̂j(ν) whose signs are opposite to those of the corresponding τj(ν). If the set S is empty, the coefficient corresponding to the maximum element of |τ(ν)| is selected at Step 6; if S contains one or more elements, the coefficient with the maximum |τj(ν)| within S is selected. The selected coefficient β̂j*(ν) is then incremented slightly in the direction of the sign of the corresponding τj*(ν), keeping all other coefficients unchanged, which produces the solution at the next path point ν + Δν. The iterations continue until all components of τ(ν) become 0, and the algorithm then returns the regularized solution of the CHR penalty. Therefore, MA-CHR can construct the learning model and select relevant variables with a grouping effect efficiently and simultaneously.
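The loop in Algorithm 1 can be sketched as follows. This is an illustrative toy implementation, not the authors' code: it uses squared-error loss and an L1 penalty (constant gradient) as a stand-in, because the explicit CHR gradient components φj and ϕj are given in [18]; the names `path_seek`, `d_nu` and `n_steps` are assumptions.

```python
import numpy as np

def path_seek(X, y, d_nu=0.01, n_steps=500):
    """Generalized path seeking sketch: tau_j = -phi_j / varphi_j."""
    n, t = X.shape
    beta = np.zeros(t)
    for _ in range(n_steps):
        phi = -X.T @ (y - X @ beta) / n          # loss gradient components
        varphi = np.ones(t)                      # penalty gradient (L1 stand-in)
        tau = -phi / varphi
        # S: nonzero coefficients whose sign opposes the corresponding tau_j
        S = [j for j in range(t)
             if beta[j] != 0 and np.sign(beta[j]) != np.sign(tau[j])]
        cand = S if S else range(t)
        j_star = max(cand, key=lambda j: abs(tau[j]))
        if abs(tau[j_star]) < 1e-8:              # all tau components ~ 0: stop
            break
        # move only the selected coefficient by a small step d_nu
        beta[j_star] += d_nu * np.sign(tau[j_star])
    return beta
```

Because only one coefficient moves per increment of ν, the cost per step is linear in the number of variables, which is the property that makes path seeking attractive as a local search inside the memetic loop.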

D. EVOLUTIONARY OPERATORS
In MA-CHR, the evolution process uses standard genetic operators, including fitness-proportionate selection, one-point crossover and uniform mutation. In addition, if prior knowledge about the optimal number of variables is available, the number of nonzero β coefficients in the exon of each chromosome can be constrained to at most T during the evolution process.
1) Selection operator: Standard roulette-wheel selection is applied to generate the next generation from the parent and child populations. The probability of a chromosome being selected is directly proportional to its fitness. In the genetic selection phase, candidate chromosomes with lower MSEs (i.e., higher fitness) are unlikely to be eliminated, but may still be.
2) Crossover operator: First, two parents (p_e, p_f) are randomly selected from the current population to produce children. A single crossover point is generated on the introns of both p_e and p_f, among the hyper-parameters λ1, λ2, a and b, and the hyper-parameters on either side of this point are exchanged between the parents' introns to produce the introns of the offspring c_e and c_f. Finally, the exon parts β of the two offspring are evaluated by the local optimization strategy.
3) Mutation operator: The mutation operator increases the diversity of the population and of the search. In this stage, one of the hyper-parameters λ1, λ2, a and b is selected with mutation probability p_m (p_m = 0.1 in our experiments) and its value is mutated randomly. Then, the local optimization strategy is used to evaluate the fitness and the exon part β of the new chromosome produced by the mutation.
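The three intron-level operators above can be sketched as follows. This is an illustrative sketch under stated assumptions: the bounds, RNG handling, and function names are hypothetical, and only the 4-element intron is manipulated, matching the description that crossover and mutation act on the hyper-parameters.

```python
import numpy as np

def roulette_select(population, fitness, rng):
    """Roulette wheel: selection probability proportional to fitness."""
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    return population[rng.choice(len(population), p=p)]

def one_point_crossover(intron_e, intron_f, rng):
    """Swap hyper-parameters on either side of one cut point in (l1, l2, a, b)."""
    point = rng.integers(1, 4)                       # cut inside the 4-gene intron
    c_e = np.concatenate([intron_e[:point], intron_f[point:]])
    c_f = np.concatenate([intron_f[:point], intron_e[point:]])
    return c_e, c_f

def mutate(intron, rng, p_m=0.1):
    """With probability p_m, redraw one randomly chosen hyper-parameter."""
    out = intron.copy()
    if rng.random() < p_m:
        out[rng.integers(0, 4)] = rng.uniform(0.0, 1.0)
    return out

rng = np.random.default_rng(0)
pe = np.array([0.1, 0.2, 0.3, 0.4])
pf = np.array([0.5, 0.6, 0.7, 0.8])
ce, cf = one_point_crossover(pe, pf, rng)
print(ce, cf)
```

After crossover or mutation produces new introns, the CHR local search (Section III-C) would refit the exon β under the new hyper-parameters before the fitness comparison.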

IV. RESULTS AND DISCUSSION
To evaluate the performance of the proposed method, we conduct experiments on simulated datasets with three scenarios and on five publicly available microarray gene expression datasets. For comparison with the proposed method, we apply GA, GP, MA, Elastic Net (EN), SCAD-L2, the hybrid L1/2 + L2 regularization (HLR) and complex harmonic regularization (CHR).

A. ANALYSIS OF SIMULATED DATA
The simulated data are generated from a logistic regression model of the following form, where X ∼ N(0, 1), ε is an independent random error, and σ is a parameter controlling the signal-to-noise ratio. Three scenarios are simulated; in each, the dimension of the variables is 2000. The notation n1/n2 denotes the numbers of observations in the training and test sets, respectively (e.g., 100/100). The details of the three scenarios are as follows.
(a) In Scenario 1, we simulated a grouped-variable situation with variable indices 502, 503, . . ., 600; 1002, 1003, . . ., 1100; and 1502, 1503, . . ., 1600, where ρ is the correlation coefficient of the grouped variables. In this example, there are four groups of correlated variables. An ideal sparse regression method would select only the 400 true variables and set the coefficients of the 1600 irrelevant variables to zero.
(b) Scenario 2 is defined similarly to Scenario 1, except that additional independent variables also contribute to the decision variable y.
(c) In Scenario 3, the number of true variables is increased to 1000 in total, σ = 0.1, and the dataset consists of 500/100 observations. In this case, the relevant variables consist of three groups, with 400 correlated variables and 300 independent variables. An ideal sparse regression method would select only the 1000 true variables and set the coefficients of the 1000 irrelevant variables to 0.
In the simulated experiments, the correlation coefficient ρ of the variables was set to 0.1, 0.4 and 0.7 in turn. GA, MA, Elastic Net (EN), SCAD-L2, HLR, CHR and MA-CHR adopt the logistic classification model, while GP adopts a multi-tree classifier. In each iteration of GA and MA, the number of variables selected by the information-gain filter is set to 1000. Table 1 lists the configuration parameters used by the EC algorithms in these eight methods. The model parameters of EN, SCAD-L2 and HLR are tuned by 10-fold cross-validation. It is worth noting that EN and HLR are tuned by 10-fold cross-validation on two-dimensional parameter surfaces, while the model parameters of SCAD-L2 and CHR are tuned by 10-fold cross-validation on three-dimensional and four-dimensional parameter surfaces, respectively.

We repeated the simulations of all competing methods 100 times and calculated the average classification accuracy on the test set. To evaluate the variable selection performance of all competing methods, we adopt the sensitivity and specificity, defined as follows [21]:

sensitivity = |β̂ .* β|_0 / |β|_0, specificity = |¬β̂ .* ¬β|_0 / |¬β|_0,

where .* is the element-wise product, |·|_0 counts the number of non-zero elements of a vector, and ¬β and ¬β̂ denote the logical negations of the true coefficient vector β and the estimate β̂, respectively.

Table 2 shows the variable selection and classification performance of all competing methods in all scenarios. It can be observed that the smaller the correlation coefficient ρ, the better the performance of the methods. In Table 2, the proposed method always selects the most correct relevant variables in the different scenarios. MA-CHR obtains the highest sensitivities and specificities of variable selection, which means that MA-CHR selects the most relevant variables while discarding the most irrelevant ones. The classification accuracy of MA-CHR is also superior to that of the other competing methods.
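The two selection metrics above can be computed directly from coefficient vectors. This is a minimal sketch consistent with the definitions in the text (element-wise product, zero-counting, logical negation); the function name and example vectors are illustrative.

```python
import numpy as np

def selection_metrics(beta_true, beta_hat):
    """Sensitivity: fraction of relevant variables selected.
    Specificity: fraction of irrelevant variables discarded."""
    relevant = beta_true != 0
    selected = beta_hat != 0
    sensitivity = np.count_nonzero(relevant & selected) / np.count_nonzero(relevant)
    specificity = np.count_nonzero(~relevant & ~selected) / np.count_nonzero(~relevant)
    return sensitivity, specificity

beta_true = np.array([1.0, 2.0, 0.0, 0.0, 0.0])   # 2 relevant, 3 irrelevant
beta_hat  = np.array([0.9, 0.0, 0.0, 0.3, 0.0])   # 1 hit, 1 miss, 1 false pick
print(selection_metrics(beta_true, beta_hat))      # (sensitivity, specificity) = (0.5, 2/3)
```

An ideal method scores 1.0 on both metrics: it selects every truly relevant variable and zeroes out every irrelevant one.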

B. ANALYSIS OF REAL DATA
In this section, we choose five publicly available gene expression microarray datasets to further evaluate the performance of the proposed method: AML, DLBCL, Prostate, Lymphoma and Lung cancer. The AML dataset was first reported by Bullinger et al. [57] and includes 116 patients with 6283 genes. The DLBCL dataset was first described by Rosenwald et al. [58]; it has almost 240 samples, each containing 8810 gene expression values. The Prostate dataset includes 12,600-gene expression profiles [61]. The Lung cancer dataset is public and can be downloaded at www.ncbi.nlm.nih.gov/geo through accession number GSE40419. A summary of these datasets is shown in Table 3.
In order to effectively evaluate the performance of all competing methods, each real dataset is randomly divided into two parts: 2/3 of the samples form the training set for model estimation, and the remaining 1/3 are used to test the estimation performance. The model parameters are tuned by 10-fold cross-validation (CV). Table 4 reports the average training and test accuracy of all competing methods over 100 repetitions. It can easily be seen that the MA-CHR method performs better than the other seven competing methods. Moreover, the numbers of genes selected by all competing methods are also shown in Table 4; the proposed method selects the fewest genes. Among the group-sparsity regularization methods (Elastic Net, SCAD-L2, HLR and CHR), CHR performs best in gene selection. On the other hand, among the EC algorithms (GA, GP and MA), GP performs best in gene selection and classification. This shows that the proposed method achieves better gene selection and classification performance than the seven competing methods.

Table 5 lists the results of the biological analysis, showing the top 10 genes obtained by the different methods on the AML dataset. Compared to existing variable selection methods, the MA-CHR model is able to select unique genes such as SFRP1 and SFRP2, which are members of the Sfrp family of signal transduction proteins. The Sfrp family of proteins transmits TGF-β signals from cell surface receptors to the nucleus. This gene is very important in the mutation or deletion events of AML and has been shown to cause pancreatic cancer [61]; we consider that it may be closely related to AML. Among the other genes selected by the MA-CHR method, the gene FLT3 has been found to be up-regulated in several different types of AML [62].
This gene can stimulate the motility of AML cells. It has been reported that the protein encoded by the NPM1 gene is very similar to a tumour suppressor of Drosophila, and NPM1 is highly related to AML [63]. In addition, MA-CHR can also find related genes chosen by the other regularization models (Elastic Net, SCAD-L2, HLR and CHR), such as SFRP5 and GSTM1; they are significantly associated with AML and have been discussed in [57]. To explore the interaction network among the genes selected by the eight methods, we built gene interaction networks using cBioPortal [64], integrating biological interactions from the public acute myeloid leukemia (TCGA, NEJM 2013) database. Figure 2 shows the maximum interaction network view of the top 10 genes selected by the eight different methods. For Elastic Net, only two genes in the largest network view, PTPN6 and CDKN2A, are linked to other frequently altered genes. GP selected SFRP1, CDKN2B, DAPK1 and PTPN6, which are linked to other frequently altered genes in the acute myeloid leukemia (TCGA, NEJM 2013) database. For GA, MA and SCAD-L2, 6 genes identified by each of these three methods are related to other frequently altered genes. The proposed MA-CHR method identified 8 genes in the same interaction network. This result indicates that MA-CHR performs better than the other methods in selecting related genes.

C. DISCUSSION
Similar results were obtained from the analysis of the four other real gene expression datasets. The biological analysis shows that the MA-CHR model can not only find the related genes selected by other variable selection methods but also find some unique genes, which are not selected by other models yet are significantly related to the disease. Therefore, the MA-CHR method can identify related genes accurately and effectively.

VOLUME 8, 2020

V. CONCLUSIONS
The primary purpose of this paper is to propose a new memetic algorithm, MA-CHR, which combines an evolutionary optimization method with a complex harmonic regularization method to deal with high-dimensional, small-sample gene expression data. The method first uses evolutionary computation to globally optimize the hyper-parameters of the non-convex machine learning model, while robust selection of related variable sets is achieved by the complex harmonic regularization method, and a classification model is then built. An efficient path seeking algorithm is used to optimize the complex harmonic regularization in order to obtain a globally optimal regularized solution with a convex loss function and a non-convex penalty. The experimental results show that the proposed MA-CHR model can use more suitable sparsity and grouping regularization to learn machine learning models more accurately, and that its reliability and stability are relatively high. We believe this method can be an effective tool for analysing gene expression data by means of adaptive sparse and group regularization.