Influential Gene Selection from High-Dimensional Genomic Data using a Bio-inspired Algorithm wrapped Broad Learning System

The classification of high-dimensional gene expression (microarray) data plays an important role in disease diagnosis and drug discovery. To avoid the curse of dimensionality, selecting the most influential genes remains a challenging task for researchers in the machine learning field. As the extraction of influential features is a non-deterministic polynomial-time hard (NP-hard) task, there is always scope for applying new bio-inspired algorithms. In this work, a recently developed bio-inspired algorithm, Monarch Butterfly Optimization (MBO), is wrapped with the Broad Learning System (BLS), forming MBO-BLS, to choose the most influential features and classify the microarray data simultaneously. In the first stage, a pre-selection method (Relief) is used to select a feature subset; this selected subset then undergoes further execution with the MBO-BLS model. To estimate the efficacy of the presented model, six cancerous microarray datasets are taken. Sensitivity, specificity, precision, F-score, Kappa, and MCC measures are used for an impartial comparison. Further, to demonstrate the superiority of the presented method, the basic BLS, Genetic Algorithm wrapped BLS (GA-BLS), Particle Swarm Optimization wrapped BLS (PSO-BLS), and ten existing models are taken for comparison. Moreover, to examine the designed model statistically, an analysis of variance (ANOVA) test is also performed. From the above qualitative and quantitative analysis, it is concluded that the proposed MBO-BLS model outclasses the other considered models.


I. INTRODUCTION
Gene expression technology allows monitoring the expression values of thousands of genes in a single experiment. This advanced technology is highly beneficial for analysing disease mechanisms and thus improving the quality of health science. The classification of high-dimensional gene expression data plays an important role in disease diagnosis and drug discovery [1]. However, a drawback of gene expression data is its high dimensionality: it comprises a small number of samples and a large number of features. Thus, selecting the most relevant features is essential to reduce the classification complexity and computational cost of this high-dimensional data [2]. Researchers have proposed mainly two techniques to select relevant genes: feature extraction and feature selection. In the feature extraction technique, the original high-dimensional feature space is transformed into a lower-dimensional subset of reduced features using linear or nonlinear techniques [3]. In the feature selection technique, by contrast, a small subgroup of essential features is chosen from the original massive set of attributes. Feature extraction techniques may lose some information due to the transformation of the original data, so this proposed work emphasizes feature selection to remove irrelevant and redundant features and to increase the learning performance. Considering the evaluation aspect [4], feature selection methods are classified into filter, wrapper, and hybrid methods. In the filter approach, the significant feature subset is selected by evaluating each feature through some independent test before any classification algorithm is applied. In the wrapper approach, candidate feature subsets are evaluated using a classifier, where classification accuracy is the key criterion for identifying the best feature space.
The wrapper method is generally more effective than the filter method because the filter method's evaluation does not involve any learning algorithm; the advantage of the filter approach, however, is its lower computational cost. In the wrapper approach, a metaheuristic algorithm is wrapped around a machine learning algorithm to search for the optimal feature subset. Several wrapper methods are discussed here: a genetic algorithm (GA) is hybridized with a support vector machine (SVM) to select the best features of gene expression data [5].
GA is embedded with an Extreme Learning Machine (ELM) for the classification of biomedical cancer data [6]. Particle swarm optimization (PSO) uses K-Nearest Neighbor (KNN) as the learning algorithm for selecting significant genes from cancer data [7]. Artificial Bee Colony (ABC) and Genetic Bee Colony (GBC) use SVM as the learning algorithm for gene selection on microarray cancer data [8,9]. GA is applied for gene selection with Naïve Bayes as the classifier for Indian diabetes data [10]. The Best First Search technique is wrapped with an Artificial Neural Network for feature selection and classification of colon cancer data [11]. The Bat algorithm (BA) is wrapped with the optimum path forest (OPF) learning algorithm for feature selection on biomedical data [12]. Cat swarm optimization (CSO) and the kernel extreme learning machine (KELM) are used for biomedical gene selection [13]. Mutual information is combined with an adaptive genetic algorithm for the selection of informative genes [14]. Although the wrapper approach usually yields better classification accuracy than the filter approach, it carries a higher risk of overfitting and is computationally more complex. Considering the advantages of both approaches, the hybrid feature selection method combines filter and wrapper methods: the significant genes are first selected with the filter approach, and the wrapper method is then applied to sort out the optimum subset of those genes. Over the past decade, various traditional and metaheuristic machine learning techniques have been used to classify complex high-dimensional gene expression data, and the classifiers discussed above are popular for their classification performance. Nowadays, deep learning is also very popular for its effectiveness in classification tasks.
Deep learning effectively promotes classification performance by deepening the layers of a neural network, but a deep network is very time-consuming because of its complicated structure. To avoid these issues, a single-layer feed-forward neural network (SLFNN) is often preferred for classification and regression problems [25][26][27][28][29][30]. The conventional gradient descent method is used to train an SLFNN [31,32], but it suffers from issues such as slow convergence, trapping in local minima, and overfitting [33]. Therefore, a non-iterative learning method called the random vector functional link neural network (RVFLNN) was presented, which improves generalization performance [29], [34] and eliminates the drawback of a long training process, making it far less time-consuming. However, RVFLNN does not work well on large, high-volume data [35]. Thus, a new method called the broad learning system (BLS) was proposed [36] based on the concept of the single-layer feed-forward RVFLNN. In BLS, the input features are mapped to expanded feature nodes, forming a broad network that enhances classification accuracy. In BLS, however, the input weights are chosen randomly, which may affect the classification performance. To address this issue, researchers have proposed optimization algorithms that optimize the input weights and enhance the algorithm's performance. Many metaheuristic algorithms, such as PSO [37,38], ant colony optimization [39,40], cuckoo search [41], Moth-Flame Optimization (MFO) [42,43], and the Genetic Algorithm (GA) [44][45][46], have been proposed to optimize weights and other parameters and thereby improve performance.
Here, the Monarch butterfly optimization-based BLS (MBO-BLS) model [47] is used to select notable genes and to obtain enhanced classification accuracy with optimum weight and bias. The supremacy of the proposed model is estimated using validation measures such as specificity, F-score, sensitivity, and the Matthews Correlation Coefficient (MCC). The performance of the suggested MBO-BLS model is compared with conventional techniques such as GA-BLS and PSO-BLS and with the basic BLS classifier. The main goals of this paper are:
• To present a robust classification model, i.e., MBO-optimized BLS, to classify high-dimensional cancerous data.
• To achieve high classification accuracy with a minimum number of features.
• To show the supremacy of the presented method by comparing it with other benchmark approaches such as GA-BLS, PSO-BLS, and the basic BLS.
• Eventually, to perform a statistical analysis, i.e., the ANOVA test, to establish the superiority of the presented model with respect to other standard models.
The rest of the work is organized as follows: Section 2 performs an analysis of the presented model. Section 3 explains all the supporting methods, and Section 4 covers the proposed method. Sections 5 and 6 explain the experimental setup and the result validation, respectively. Eventually, Section 7 presents the conclusion.

II. MODEL ANALYSIS
The basic layout of the presented model (MBO-BLS) is described in Fig. 1. Before execution, the missing values of the dataset are replaced with the most frequently occurring value of the corresponding attribute, and then each value of the dataset is mapped into [0, 1] by min-max normalization. First, the 10-fold cross-validation technique is used to divide the whole sample into training and testing samples. Then, the Relief feature selection method is used to preselect the relevant genes. After the most significant features are obtained, the MBO-BLS wrapper method is employed to obtain the optimum subset of genes through classification. In the presented model, the MBO algorithm optimizes the weight and bias of the BLS classifier while selecting the most significant feature subset. Finally, the MBO-optimized BLS (MBO-BLS) model undergoes the testing phase with the testing data restricted to the most influential feature subset to measure the classification accuracy.
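The imputation and normalization steps described above can be sketched as follows; treating `np.nan` as the missing-value marker is an assumption made for illustration:

```python
import numpy as np

def preprocess(X):
    """Illustrative preprocessing sketch, assuming X is a samples-by-genes
    matrix with np.nan marking missing values."""
    X = X.copy().astype(float)
    # Replace missing values in each attribute with that attribute's
    # most frequently occurring (mode) value.
    for j in range(X.shape[1]):
        col = X[:, j]
        observed = col[~np.isnan(col)]
        values, counts = np.unique(observed, return_counts=True)
        col[np.isnan(col)] = values[np.argmax(counts)]
    # Min-max normalization: map every attribute into [0, 1].
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant columns
    return (X - mins) / span
```

A constant column is mapped to zeros rather than dividing by zero, a small design choice not spelled out in the paper.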

III. SUPPORTED METHODS
In this section, all the supported methods are discussed.

A. BROAD LEARNING SYSTEM (BLS)
In BLS, the input data are mapped to generate feature nodes then the mapped feature nodes are expanded to form the enhancement nodes where the weights are randomly generated [36]. Fig. 2 describes the basic structure of BLS.
The original feature set X ∈ R^{N×M} is randomly mapped to the feature nodes of the network layer:

Z_i = \phi_i(X W_{e_i} + \beta_{e_i}), \quad i = 1, 2, \ldots, n    (1)

In (1), \phi_i represents the i-th mapping function, and W_{e_i} and \beta_{e_i} represent the random weight and bias respectively. The feature nodes Z^n = [Z_1, Z_2, \ldots, Z_n] are then enhanced to form a new set of enhancement nodes:

H_j = \xi_j(Z^n W_{h_j} + \beta_{h_j}), \quad j = 1, 2, \ldots, m    (2)

In (2), H_j is the j-th enhancement node, where W_{h_j} and \beta_{h_j} denote the random weight and bias respectively. The newly generated enhancement nodes are collected as H^m = [H_1, H_2, \ldots, H_m]. The final output is computed as per (3):

Y = [Z^n \mid H^m] W^m = A W^m    (3)

Here, the output-layer weight W^m is represented in (4):

W^m = [Z^n \mid H^m]^{+} Y = A^{+} Y    (4)

It is obtained by solving the ridge-regularized problem in (5), where λ is taken as the constraint constant:

\arg\min_{W} \; \|A W - Y\|_2^2 + \lambda \|W\|_2^2    (5)

Then the solution is obtained by solving the above optimization problem as per (6):

W^m = (\lambda I + A^{T} A)^{-1} A^{T} Y    (6)

Then, A^{+} is taken as the pseudo-inverse of A and is represented by (7):

A^{+} = \lim_{\lambda \to 0} (\lambda I + A^{T} A)^{-1} A^{T}    (7)
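The BLS construction in Eqs. (1)-(6) can be sketched numerically as follows; the node counts, activation functions, and λ are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def bls_train(X, Y, n_map_groups=3, nodes_per_group=5, n_enhance=20,
              lam=1e-3, seed=0):
    """Minimal BLS sketch: random feature mapping, enhancement nodes,
    and a ridge-regularized output layer.  Returns training predictions."""
    rng = np.random.default_rng(seed)
    Z_parts = []
    for _ in range(n_map_groups):
        We = rng.standard_normal((X.shape[1], nodes_per_group))
        be = rng.standard_normal(nodes_per_group)
        Z_parts.append(X @ We + be)            # mapped feature nodes Z_i, Eq. (1)
    Z = np.hstack(Z_parts)                     # Z^n = [Z_1, ..., Z_n]
    Wh = rng.standard_normal((Z.shape[1], n_enhance))
    bh = rng.standard_normal(n_enhance)
    H = np.tanh(Z @ Wh + bh)                   # enhancement nodes H^m, Eq. (2)
    A = np.hstack([Z, H])                      # A = [Z^n | H^m], Eq. (3)
    # Ridge solution W = (lam*I + A^T A)^{-1} A^T Y of Eq. (6)
    W = np.linalg.solve(lam * np.eye(A.shape[1]) + A.T @ A, A.T @ Y)
    return A @ W                               # training output Y_hat = A W
```

Because the output weights come from a closed-form linear solve rather than gradient descent, training is a single pass, which is the property that makes BLS fast.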

B. MONARCH BUTTERFLY OPTIMIZATION ALGORITHM
Monarch butterfly optimization (MBO) is a recently designed swarm intelligence optimization technique based on the migration behaviour of the monarch butterfly species found in North America [47,48]. Fig. 3 presents a pictorial representation of the MBO algorithm. The algorithm is simple and easy to implement and is based on two operators: the migration operator and the butterfly adjusting operator. According to this algorithm, the whole population is divided into two subpopulations: the half with the best fitness values resides in subpopulation 1, and the rest in subpopulation 2. After each iteration, the best individual is kept as the global optimum, and the newly updated subpopulations are merged back into a single population. The newly generated population is then divided again into two subpopulations based on the new fitness values. This process is repeated until the stopping criterion is met.

1) MIGRATION OPERATOR
The main objective of the migration operator is to exchange information between the two subpopulations while updating subpopulation 1. The update of each butterfly in subpopulation 1 depends on the position of another butterfly in one of the two subpopulations and on the migration ratio p. The l-th butterfly in subpopulation 1 is updated according to (8):

x_{l,n}^{t+1} = x_{r_1,n}^{t} if r ≤ p, otherwise x_{l,n}^{t+1} = x_{r_2,n}^{t}    (8)

Here, x_{l,n}^{t+1} represents the position of butterfly l in the n-th dimension at generation t+1, and r_1 and r_2 are integer indices randomly selected from subpopulation 1 and subpopulation 2 respectively. The parameter r is taken as r = rand × peri, where rand denotes a random real number in [0,1] and peri is the migration period.
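The migration operator of Eq. (8) can be sketched as follows; peri = 1.2 and p = 5/12 are the values commonly reported for MBO and are taken here as assumptions:

```python
import numpy as np

def migration(sub1, sub2, peri=1.2, p=5/12, rng=np.random.default_rng(0)):
    """Migration operator sketch (Eq. (8)): each element of a butterfly in
    subpopulation 1 is inherited from a random butterfly in subpopulation 1
    when r <= p, otherwise from a random butterfly in subpopulation 2."""
    new_sub1 = np.empty_like(sub1)
    for l in range(sub1.shape[0]):
        for n in range(sub1.shape[1]):
            r = rng.random() * peri              # r = rand * peri
            if r <= p:
                r1 = rng.integers(sub1.shape[0])
                new_sub1[l, n] = sub1[r1, n]     # inherit from land 1
            else:
                r2 = rng.integers(sub2.shape[0])
                new_sub1[l, n] = sub2[r2, n]     # inherit from land 2
    return new_sub1
```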

2) ADJUSTING OPERATOR
The movement of each butterfly l in subpopulation 2 is calculated based on the adjusting ratio p and the butterfly adjusting rate (BAR), as per (9):

x_{l,n}^{t+1} = x_{best,n}^{t} if rand ≤ p, otherwise x_{l,n}^{t+1} = x_{r_3,n}^{t}    (9)

In (9), x_{best,n}^{t} is the n-th element of the global best butterfly at the current generation t, and x_{r_3,n}^{t} is the n-th element of a butterfly randomly selected from subpopulation 2. Further, when rand > BAR, the position is perturbed as x_{l,n}^{t+1} = x_{l,n}^{t+1} + α (dx_n − 0.5), where α is a weighting factor formulated as (10):

α = S_max / t^2    (10)

Here, S_max is the maximum walk step of an individual butterfly in one move, t is the current generation, and dx represents the walk step of butterfly l, formulated using the Lévy flight method as (11):

dx = Levy(x_l^t)    (11)

The factor α has a great impact on the search: a larger α accelerates exploration of the search space, while a smaller α accelerates exploitation.

The overall procedure of MBO is summarized as follows:
Step 1: Initialize the population and evaluate the fitness of each butterfly.
Step 2: While the stopping criterion is not met:
 - Design subpopulation 1 (NP1) by taking the best half of the population and the rest as subpopulation 2 (NP2).
 - Update NP1 with the migration operator and NP2 with the butterfly adjusting operator.
 - Remerge the newly updated subpopulations as a whole new population.
 - Get the best one and keep it as the global optimum.
End While
Step 3: Get the global best one.
End
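The adjusting operator of Eqs. (9)-(11) can be sketched as follows; a standard Cauchy draw stands in for the Lévy-flight step, and p = BAR = 5/12, S_max = 1.0 are illustrative assumptions:

```python
import numpy as np

def adjusting(sub2, best, t, p=5/12, BAR=5/12, S_max=1.0,
              rng=np.random.default_rng(0)):
    """Butterfly adjusting operator sketch.  With probability p an element
    is copied from the global best; otherwise it is copied from a random
    member of subpopulation 2, and when rand > BAR a scaled walk step is
    added.  A Cauchy draw stands in for the Levy step (assumption)."""
    alpha = S_max / t**2                        # weighting factor, Eq. (10)
    new = np.empty_like(sub2)
    for l in range(sub2.shape[0]):
        for n in range(sub2.shape[1]):
            if rng.random() <= p:
                new[l, n] = best[n]             # move toward the global best
            else:
                r3 = rng.integers(sub2.shape[0])
                new[l, n] = sub2[r3, n]         # copy from a random butterfly
                if rng.random() > BAR:
                    dx = rng.standard_cauchy()  # stand-in for Levy step, Eq. (11)
                    new[l, n] += alpha * (dx - 0.5)
    return new
```

Because α shrinks as 1/t², the Lévy perturbation is large early (exploration) and small late (exploitation), matching the behaviour described above.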

C. RELIEF APPROACH
The original Relief algorithm was formulated by Kira and Rendell [49] and is based on feature weighting. According to this instance-based learning algorithm, features are ranked based on a weight assigned to each feature. It is an iterative method in which the weight of each individual feature is evaluated to estimate its "relevance". At each iteration, a sample K is randomly selected from the input dataset, and then its nearest neighbour from the same class (the nearest hit H) and its nearest neighbour from the different class (the nearest miss M) are picked out. The weight of each feature i is updated as (12):

W(i) = W(i) − diff(i, K, H)/m + diff(i, K, M)/m    (12)

Here, the heuristic estimation W(i) is the relevance of feature i and is initialized to zero, and diff(i, K, K′) denotes the difference of feature i between samples K and K′. Equation (12) is repeated m times, where m is the sample size of the training data, to obtain the final relevance of the features.
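The Relief update in Eq. (12) can be sketched for a binary-class dataset with features already scaled to [0, 1]; using the absolute difference as diff and an L1 neighbour distance are common choices assumed here:

```python
import numpy as np

def relief(X, y, m=None, rng=np.random.default_rng(0)):
    """Relief sketch for binary classes with features scaled to [0, 1].
    Weights grow for features that differ at the nearest miss (separate the
    classes) and shrink for features that differ at the nearest hit."""
    n, d = X.shape
    m = m or n
    W = np.zeros(d)
    for _ in range(m):
        k = rng.integers(n)                       # random sample K
        dists = np.abs(X - X[k]).sum(axis=1)      # L1 distance to K
        dists[k] = np.inf                         # exclude K itself
        same = np.where(y == y[k])[0]
        diff = np.where(y != y[k])[0]
        hit = same[np.argmin(dists[same])]        # nearest hit H
        miss = diff[np.argmin(dists[diff])]       # nearest miss M
        # W(i) = W(i) - diff(i,K,H)/m + diff(i,K,M)/m, Eq. (12)
        W += (np.abs(X[k] - X[miss]) - np.abs(X[k] - X[hit])) / m
    return W  # higher weight = more relevant feature
```

The sketch assumes every class has at least two samples so a nearest hit always exists.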

IV. PROPOSED METHOD

A. SELECTION OF GENES BY RELIEF APPROACH
In the proposed work, a filter approach is used for preselection of the most relevant genes, followed by a wrapper method to find the optimal gene subset. Among several filter approaches, the Relief feature selection method [49] is used in this model for gene preselection. According to the Relief approach, each feature is evaluated and ranked. The top-ranked genes (up to 500, as per [50]) are selected for further execution. This top-ranked subset of genes is then used in the MBO-BLS model to find the optimum gene subset and enhance the classification accuracy, with the MBO algorithm simultaneously optimizing the weight and bias of the BLS classifier.
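The two-stage flow just described (filter pre-selection followed by a wrapper search) can be sketched as follows; both callables, and the variance filter in the usage note, are hypothetical placeholders rather than the paper's actual components:

```python
import numpy as np

def hybrid_select(X, y, filter_rank, wrapper_search, n_top=500):
    """Two-stage hybrid feature selection sketch.
    filter_rank(X, y)      -> one relevance score per gene (filter stage)
    wrapper_search(X', y)  -> 0/1 mask over the pre-selected genes (wrapper stage)
    Both callables are placeholders for e.g. Relief and a metaheuristic-
    wrapped classifier."""
    scores = np.asarray(filter_rank(X, y))
    top = np.argsort(scores)[::-1][:n_top]       # stage 1: keep n_top best genes
    mask = np.asarray(wrapper_search(X[:, top], y)).astype(bool)
    return top[mask]                             # stage 2: final gene indices
```

For instance, passing a variance-based ranking as `filter_rank` and a wrapper that keeps every pre-selected gene returns the indices of the `n_top` highest-variance genes.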

B. PROPOSED MBO-BLS ALGORITHM
In the proposed work, the MBO algorithm is used for gene selection while simultaneously optimizing the weight and bias of the BLS. A logistic conversion function is used to convert the value of each feature into binary form; this function is expressed in (13) and (14):

S(v) = 1 / (1 + e^{−v})    (13)

x = 1 if S(v) ≥ 0.5, otherwise x = 0    (14)

In (13), v is the continuous position value of a candidate solution. In the proposed algorithm, the fitness value is calculated as per (15).
The 10-fold cross-validation method is used to compute the average testing accuracy of the proposed MBO-BLS classifier, and the final outcome is the classification accuracy (%) together with the length of the selected feature subset. The main steps are:
1: for each individual candidate solution do
2: Apply the transformation function and convert the values of the candidate solution into a (1,0) pattern (the first two bits encode w and b, and the remaining bits represent the gene subset, where 1 and 0 specify the selection and rejection of a gene respectively).

3: Determine the fitness employing w, b, and the selected gene subset.
4: end for
5: Sort the achieved fitness values in descending order and record the best and worst ones.
6: Arrange the population (NP) as per the index of the sorted fitness values.
7: Keep the best fitness together with the location of the corresponding candidate solution.
8: Find the mean value of the achieved fitness.
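The logistic conversion and candidate-solution decoding of steps 1-2 can be sketched as follows; the fixed 0.5 threshold and the exact layout of the two continuous entries for w and b are assumptions following the description above:

```python
import numpy as np

def binarize(position, threshold=0.5):
    """Eqs. (13)-(14) sketch: squash each continuous position value through
    the sigmoid S(v) = 1/(1+exp(-v)) and threshold it.  The fixed 0.5
    threshold is an assumption."""
    s = 1.0 / (1.0 + np.exp(-np.asarray(position, dtype=float)))
    return (s >= threshold).astype(int)   # 1 = gene selected, 0 = rejected

def decode(candidate):
    """Split a candidate solution as in step 2: the first two entries carry
    the (continuous) weight w and bias b for the BLS, the rest becomes the
    binary gene-selection mask."""
    w, b = candidate[0], candidate[1]
    return w, b, binarize(candidate[2:])
```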

V. EXPERIMENTAL SETUP

A. SIMULATION ENVIRONMENT
The simulation environment of the entire work is as follows. Processing unit: Intel(R) Core(TM) i5-7200U with 2.5 GHz processing speed; Operating System: Windows 10; RAM capacity: 8 GB; Programming platform: MATLAB R2015b.

B. DATASET DESCRIPTION
The detailed description of the microarray data which are taken for the implementation is given in Table 1.

C. PARAMETER INITIALIZATION
Here, to prove the supremacy of the proposed work (MBO-BLS), some well-known models such as PSO-BLS and GA-BLS are taken for comparison. To avoid ambiguity, equal values of the iteration number and the population size are used for all algorithms. The remaining parameters of each algorithm are initialized with the values that yield the best performance, as shown in TABLE 2.

D. MODEL ESTIMATION MEASURES
Various estimation measures are taken to evaluate the performance of all the models considered in this paper, such as classification accuracy (in percentage), sensitivity, specificity, F-score, precision, and Matthews correlation coefficient (MCC). These measures are computed from the confusion matrix, which is explained in TABLE 3.

VI. RESULT VALIDATION

A. OUTCOME OF GENE PRE-FILTRATION
Here, the most significant genes are initially selected using a filter approach (the Relief algorithm): the top N prominent genes, with N in the range [1, 500], are selected from the whole dataset. This reduced dataset then undergoes classification with the BLS classifier. TABLE 4 and TABLE 5 demonstrate the classification accuracy (%) on the six benchmark binary and multiclass gene expression datasets respectively. From these tables, it is concluded that the classification accuracy increases up to a specific number of selected genes (N), after which it remains unchanged or decreases. Once the best gene subset is found for each microarray dataset, it is forwarded for classification with the BLS classification model.

B. OUTCOME OF THE PROPOSED MBO-BLS MODEL
In the preselection process, the most significant genes selected through the Relief approach, i.e., the top 100 genes of Ovarian cancer, the top 200 genes of Leukemia, Colon tumor, Lymphoma, and SRBCT, and the top 500 genes of ALL-MLL3, are individually forwarded to the MBO-BLS model for further execution. In the MBO-BLS model, each dataset is executed 10 times using 10-fold cross-validation. TABLE 6 shows all the performance measures of the six microarray datasets, such as accuracy, specificity, sensitivity, precision, F-score, MCC, and Kappa, with 10-fold cross-validation. From TABLE 6, it is observed that the performance on the binary-class Ovarian cancer data outperforms the other datasets. Moreover, TABLE 6 shows 100% sensitivity and specificity on the Leukemia, SRBCT, and ALL-MLL3 datasets. The convergence graphs of the GA-BLS, PSO-BLS, and MBO-BLS models on all six microarray datasets are shown in Fig. 5(a)-5(f). From these graphs, it is observed that the accuracy on all six datasets increases gradually up to a maximum of 100 iterations. On the Leukemia data, the accuracy converges after the 54th, 69th, 74th, and 81st iterations in the MBO-BLS, GA-BLS, PSO-BLS, and BLS models respectively. For the Colon dataset, the accuracy of the MBO-BLS, GA-BLS, PSO-BLS, and BLS models converges after the 43rd, 63rd, 72nd, and 84th iterations respectively. On the Ovarian data, the accuracy converges after the 44th, 74th, 75th, and 81st iterations in the MBO-BLS, GA-BLS, PSO-BLS, and BLS models respectively. On Lymphoma, the accuracy of MBO-BLS, GA-BLS, PSO-BLS, and BLS converges after the 43rd, 63rd, 72nd, and 84th iterations respectively. Similarly, on SRBCT the accuracy converges in the MBO-BLS, GA-BLS, PSO-BLS, and BLS models after the 42nd, 69th, 79th, and 88th iterations respectively. For the ALL-MLL3 data, the accuracy of the MBO-BLS, GA-BLS, PSO-BLS, and BLS models converges after the 30th, 53rd, 64th, and 78th iterations respectively.
From these convergence graphs, it is observed that the MBO-BLS model converges earlier than the other discussed models.

C. SELECTED SIGNIFICANT BIO-MARKERS BY MBO-BLS
The highly influential genes selected by the proposed MBO-BLS model, which yield high classification accuracy, are listed in

D. EXECUTION TIME OF PRESENTED MODEL
Here, the presented approach has two parts, i.e., a pre-selection part (by Relief) and a wrapper part (by MBO-BLS). Hence, the complete execution time depends on the time consumed by the two parts. TABLE 9 shows the execution time of both parts for the six microarray datasets.

E. COMPARISON WITH EXISTING MODELS
For the Colon tumor, BDE-XRankf selects 4 genes whereas the proposed MBO-BLS selects 5 genes but with higher classification accuracy, and for Ovarian cancer both algorithms select the same number of genes, while the proposed algorithm again classifies the data with better accuracy. From the above analysis, it is derived that the proposed algorithm outperforms the others on all six microarray datasets.

F. STATISTICAL ANALYSIS BY ANOVA
To examine whether the mean values of the defined groups are similar or not, a popular statistical approach, analysis of variance (ANOVA), has been applied in this work. This statistical approach helps to evaluate the model statistically. Generally, a null and an alternative hypothesis are formulated in ANOVA. In this test, the F-value is calculated first, and then the p-value is determined from the F-value. The obtained p-value decides whether to keep or discard the null hypothesis: the null hypothesis is rejected if the p-value ≤ 0.05 (taking 5% as the significance level), in which case the accuracy percentages of the models are concluded to be significantly different. The statistical analysis of the ANOVA test is shown in TABLE 11 and TABLE 12. In this work, the p-value is calculated as 0.047, which is smaller than the chosen significance level (0.05). Hence, the null hypothesis is rejected, and it can be concluded that the presented algorithm is statistically better than the other models.

VII. CONCLUSION
In this work, the bio-inspired MBO algorithm wrapped with the Broad Learning System (MBO-BLS) is presented for influential gene selection and classification of high-dimensional microarray data, and several experiments are carried out to establish its efficacy. Six microarray datasets are taken for evaluation, of which three are binary (Leukemia, Colon tumor, Ovarian cancer) and the other three are multiclass (Lymphoma, SRBCT, and ALL-MLL-3). In the first stage, a pre-selection method (Relief) is used to select a feature subset, and this selected subset then undergoes further execution with the MBO-BLS model. Various performance measures (sensitivity, specificity, precision, F-score, MCC, Kappa) are applied for an impartial comparison. Further, to present the supremacy of the suggested method, the benchmark models GA-BLS, PSO-BLS, and BLS, along with ten existing standard models, are taken for comparison. Moreover, to examine whether the mean values of specific groups are similar or not, the ANOVA test is applied.
From the above qualitative and quantitative analysis, it is concluded that the proposed MBO-BLS model can be a dependable framework for the diagnosis of various diseases.
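The ANOVA decision rule applied in the statistical analysis can be reproduced with a one-way F-test; the accuracy vectors below are made-up placeholders, not the paper's results:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracy samples for four models; NOT the paper's numbers.
acc_mbo_bls = [96.1, 95.8, 96.5, 96.0, 95.9]
acc_ga_bls  = [93.2, 94.0, 93.5, 93.8, 93.1]
acc_pso_bls = [94.1, 93.7, 94.4, 93.9, 94.2]
acc_bls     = [91.5, 92.0, 91.8, 91.2, 91.9]

# One-way ANOVA: the F-value is computed first, then the p-value follows from it.
f_value, p_value = stats.f_oneway(acc_mbo_bls, acc_ga_bls, acc_pso_bls, acc_bls)

# Reject the null hypothesis (equal group means) at the 5% significance level.
if p_value <= 0.05:
    print("null hypothesis rejected: the models' accuracies differ significantly")
else:
    print("no significant difference detected")
```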