Multi-Label Bioinformatics Data Classification With Ensemble Embedded Feature Selection

In bioinformatics, vast numbers of multi-label datasets, including clinical text, gene, and protein data, need to be categorized. Redundant or irrelevant features in bioinformatics data limit the performance of multi-label classifiers, so selecting effective features from the feature space is necessary. However, most methods proposed in the past few years for the multi-label feature selection problem adopt a simple and direct strategy that transforms the problem into several single-label ones and ignores the correlations among different labels. In this paper, a novel algorithm named ensemble embedded feature selection (EEFS) is proposed to handle the multi-label bioinformatics data learning problem more effectively and efficiently. EEFS not only explicitly finds the correlations among labels, but also adequately exploits them through multi-label classifiers and evaluation measures. Furthermore, it reduces the accumulated errors of the data itself by employing an ensemble method. Experimental results on five multi-label bioinformatics datasets show that our algorithm achieves significant superiority over other state-of-the-art algorithms.


I. INTRODUCTION
Multi-label bioinformatics data are common in clinical text data [1], gene data [2], protein data [3], [4] and so on. For example, a patient suffering from cough and fever should be associated with both disease labels in the clinical records. Formally, let $\mathcal{X} = \mathbb{R}^d$ denote the $d$-dimensional feature space and $\mathcal{Y} = \{0, 1\}^q$ denote the $q$-dimensional label space, where each example in multi-label bioinformatics data can be denoted as $(x_i, Y_i)$ ($x_i \in \mathcal{X}$, $Y_i \subseteq \mathcal{Y}$). Owing to the rapidly expanding quantity of multi-label bioinformatics data resources, techniques based on multi-label learning have demonstrated their superiority in mining useful information from the vast amount of this type of data [5], [6]. These techniques aim to build classification models for instances that are assigned multiple labels simultaneously. However, the feature space of multi-label data inevitably contains redundant and irrelevant features, which can limit the performance of multi-label classifiers. Encouragingly, to improve classification performance, many multi-label feature learning methods that reduce the dimension of the feature space have been proposed to acquire important and effective information from the original feature space [7], [8]. Specifically, there are two kinds of dimension reduction methods for multi-label data: feature extraction and feature selection. For feature extraction, unsupervised approaches, such as principal component analysis (PCA) [9], latent semantic indexing (LSI) [10], [11], multi-output regularized feature projection [12] and shared subspace extraction [13], have been proposed to find a compact feature space that represents the original datasets, while supervised approaches, such as linear discriminant analysis (LDA) [14]-[16] and multi-label dimensionality reduction via dependence maximization (MDDM) [17], which is based on the Hilbert-Schmidt independence criterion, achieve better performance. These approaches are effective in improving classification performance. However, the extracted features fuse the information of the original features and lose their distinct physical meanings. Hence, the features extracted from the original feature space can hardly be explained or easily comprehended.
The associate editor coordinating the review of this manuscript and approving it for publication was Navanietha Krishnaraj Rathinam.
These characteristics limit the use of feature extraction methods in particular research areas, such as multi-label bioinformatics data analysis. For example, many clinical decision-making tasks following the philosophy of Evidence Based Medicine (EBM) rely on the ability to find relevant health records and gather sufficient clinical evidence [18]. Therefore, reliable feature subsets with a physically interpretable meaning for clinical record classification can help doctors or researchers in disease diagnosis, prevention and treatment, or simply in categorizing and retrieving medical text resources. Encouragingly, unlike feature extraction, feature selection approaches retain the physical meaning of features while reducing the feature dimension. Hence, feature selection is more suitable than feature extraction for this type of multi-label data analysis task.
To deal with the multi-label feature selection task, there are three main types of methods: filter, wrapper and embedded [19]-[21]. We describe them in detail in the next section. In this paper, a new embedded multi-label feature selection algorithm named EEFS, i.e. Ensemble Embedded Feature Selection, is proposed for multi-label bioinformatics data. It trains classification models on randomly selected subsets of the training examples, which makes it an ensemble method, and then iteratively tests the trained models, using an evaluation measure and training examples in which each column (feature) in turn is replaced by its average, to acquire the final feature importance ranking. Experimental results demonstrate that our algorithm is superior to other multi-label feature selection (feature importance ranking) algorithms. This paper extends our preliminary work [22].
The rest of this paper is organized as follows. Section II reviews the existing multi-label feature selection methods. Section III presents the proposed EEFS algorithm. Section IV presents the design of the experiments. Section V reports and analyzes the comparative experimental results. Finally, Section VI summarizes several issues and suggests some future directions.

II. RELATED WORKS
Recently, multi-label feature selection methods for bioinformatics data have received increasing attention from the research community, owing to the rapidly expanding quantity of multi-label bioinformatics data resources, and there is a rich body of work on them. As mentioned above, the existing multi-label feature selection methods can generally be categorized into three classes, namely filter, wrapper and embedded methods. In this section, we review the main algorithms of these three types in detail.

• Filter methods
The main idea of filter feature selection methods [23], [24] for multi-label bioinformatics classification is to extend single-label methods to the multi-label setting. For example, Yang and Pedersen [23] proposed a filter framework that evaluates features for each label separately under some statistical evaluation measure and combines the results by taking the maximum or the average. This framework is an extension of single-label filter feature selection methods. It deals with the labels separately, which ignores the correlations among labels. Let $X = \{f_1, f_2, \ldots, f_d\}$ denote the feature space with $d$ features and $Y = \{l_1, l_2, \ldots, l_q\}$ denote the label space with $q$ class labels. Then the maximal and average types of filter multi-label feature selection methods, $FS_{max}$ and $FS_{avg}$, are defined as follows:

$$FS_{max}(f_i) = \max_{1 \le j \le q} EM(f_i, l_j) \quad (1)$$

$$FS_{avg}(f_i) = \frac{1}{q} \sum_{j=1}^{q} EM(f_i, l_j) \quad (2)$$

where $EM$ is the evaluation measure that evaluates the correlation between a feature and a label for single-label feature selection. The evaluation measures utilized by single-label feature selection methods can be $\chi^2$, Relief [25], COR [26] and mRMR [27]. The importance of the ith feature in multi-label bioinformatics data, represented by the value $FS(f_i)$, is determined by the filter rule in Eq. (1) or Eq. (2). The results of filter multi-label feature selection give the feature importance ranking. These methods have linear computation cost, but their selection results are often coarse. They consider the relevance between labels and each individual feature, while ignoring the power of features combined together. Moreover, filter methods provide one and the same feature ranking for every kind of classifier, so the selected feature subset is usually not the most suitable subset for a particular classifier.
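As a rough illustration of the maximal and average filter rules, the following minimal sketch scores binary features against binary labels with a 2x2 chi-square statistic and then combines the per-label scores. The function names `chi2_score` and `filter_scores`, the restriction to binary features, and the toy data are assumptions made for this example and are not taken from the paper.

```python
def chi2_score(feature, label):
    """Chi-square statistic of the 2x2 table for a binary feature and a binary label."""
    n = len(feature)
    a = sum(1 for f, l in zip(feature, label) if f and l)        # feature on, label on
    b = sum(1 for f, l in zip(feature, label) if f and not l)    # feature on, label off
    c = sum(1 for f, l in zip(feature, label) if not f and l)    # feature off, label on
    d = n - a - b - c                                            # feature off, label off
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def filter_scores(X, Y, em=chi2_score):
    """Return (FS_max, FS_avg): per-feature max and mean of em(feature, label) over labels."""
    num_features, num_labels = len(X[0]), len(Y[0])
    fs_max, fs_avg = [], []
    for i in range(num_features):
        column = [row[i] for row in X]
        scores = [em(column, [y[j] for y in Y]) for j in range(num_labels)]
        fs_max.append(max(scores))
        fs_avg.append(sum(scores) / num_labels)
    return fs_max, fs_avg


# Toy data (hypothetical): feature 0 copies label 0, feature 1 is unrelated to both labels.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
Y = [[1, 0], [1, 0], [0, 1], [0, 1]]
fs_max, fs_avg = filter_scores(X, Y)
# Features sorted by descending importance under the maximal rule.
ranking = sorted(range(len(fs_max)), key=lambda i: -fs_max[i])
print(fs_max, fs_avg, ranking)
```

Because each feature is scored against each label in isolation, the sketch also makes visible why such rules cannot capture correlations among labels.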
• Wrapper methods
The main idea of wrapper feature selection methods [28]-[30] for multi-label learning is to rely on the learning machine of interest, using it as a black box to score feature subsets according to their predictive power. They are widely used in scientific data analysis, because the selection result is based on the learning algorithm itself, so the selected feature subset is optimal for that specific learning machine. For example, Shao et al. [30] proposed a hybrid optimization multi-label feature selection method called HOML. In their work, simulated annealing, genetic algorithm and hill climbing strategies are combined to generate many feature subsets, and multi-label classifiers are then used to select the best one. The process of generating and selecting the optimal feature subset is comparatively time-consuming. These methods are classifier-specific feature selection methods. Specifically, they select a wide variety of feature subsets from the training data according to some principle, train the corresponding classifiers, and then evaluate the selected feature subsets directly on test data with the trained classifiers. Wrapper methods can improve the performance of classifiers to a large extent. However, their computational complexity is usually too high.
• Embedded methods
The embedded feature selection methods [31]-[33] for multi-label learning are a trade-off that overcomes the weaknesses of filter and wrapper methods. In embedded methods, the multi-label classifiers are embedded in the feature selection process to compute the relationships between features and classifiers. Embedded methods also avoid the time-consumption problem of wrapper methods. For example, You et al. [31] proposed an algorithm named multi-label embedded feature selection (MEFS), which utilizes the prediction risk and a classifier to evaluate the importance of the features in a feature subset, and a backward search strategy to select the best feature from the feature subset step by step. In this way, the selected features contribute more directly to improving classification performance. However, when the training data change, the trained model generates different feature rankings, so it is difficult to obtain a relatively stable feature ranking. This is not conducive to further analysis of the data or to the generalization of the algorithm. Furthermore, MEFS cannot reduce the accumulated errors of the data itself. In this paper, because the proposed EEFS algorithm outputs a feature importance ranking, as filter-based algorithms do, whereas wrapper-based algorithms only output the best feature subset among the candidate subsets, we compare our algorithm only with filter-based and embedded algorithms and do not include wrapper-based algorithms in the experimental part.

III. THE PROPOSED ALGORITHM
In this section, our proposed algorithm EEFS, i.e. Ensemble Embedded Feature Selection, is presented for multi-label bioinformatics data feature selection. EEFS can provide a relatively stable feature ranking and reduce the negative effect of changes in the training data. Furthermore, it can be adjusted to the structural characteristics of the multi-label bioinformatics data to boost the performance of multi-label classifiers. Specifically, aiming to generate a feature subset that improves the classifier's performance, EEFS employs prediction risk and a forward search strategy to evaluate the importance of features. The feature selection process of EEFS cooperates with the multi-label classifier and the prediction risk. In detail, the feature selection capacity of EEFS relies mainly on the learning ability of the classifiers and on the evaluation measure used for computing the prediction risk.
Algorithm 1 The EEFS Algorithm
Inputs:
  ..., $l_q$})
  L: the loss function in Eq. (4)
  µ: the percentage parameter (0 < µ ≤ 100%)
  λ: the iteration times (any integer ≥ 1)
Outputs:
  r: the feature ranking list
...
11: end for // evaluate each feature's importance according to the prediction risk criterion
12: end for
13: compute $preRISK^k_j$ for each feature according to Eq. (5)
14: r ← rank[$preRISK^k_j$] // update the feature ranking list
15: output the final feature ranking list r

TABLE 1. Comparison of the optimal results of four algorithms with BR classifier on five datasets.

Prediction risk evaluates the classification performance of models. During the learning process, prediction risk is applied to estimate the prediction accuracy of the models and then to select suitable models. The principle of prediction risk minimization is often used for selecting the optimal feature subset in single-label problems. The prediction risk criterion scores each feature by the difference between the results obtained when the trained model is tested on the original training data and on the updated training data. In the updated training data, the value of a given feature is replaced, for every training example, by the mean value of that feature over all training examples. The prediction risk ($preRISK$) of the ith feature is defined as follows:

$$preRISK^i = preERR(\bar{x}^i) - preERR \quad (3)$$

where $preERR$ stands for the prediction error of the trained model on the original training data and $preERR(\bar{x}^i)$ stands for the prediction error of the trained model on the updated training data corresponding to the ith feature. In the experimental part, we employ average precision, which is a multi-label ranking type of evaluation measure, to compute $preERR$, and it is defined as follows:

$$avgprec = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|Y_i|} \sum_{l \in Y_i} \frac{|\{l' \in Y_i : rank(x_i, l') \le rank(x_i, l)\}|}{rank(x_i, l)} \quad (4)$$

where $rank(x, l)$ returns the rank of $l \in Y$ based on the descending order induced by the model and $n$ is the number of examples. Average precision evaluates the average fraction of relevant labels ranked higher than a particular label $l_k \in Y_i$.
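The average precision measure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `average_precision` is hypothetical, each example is given as a list of real-valued label scores (higher means more confident), relevant labels are given as sets of label indices, ties are broken by label index, and examples without any relevant label are skipped.

```python
def average_precision(scores, Y):
    """scores[i][l]: model confidence of label l for example i; Y[i]: set of relevant labels."""
    total, counted = 0.0, 0
    for s, relevant in zip(scores, Y):
        if not relevant:
            continue  # assumption: examples with no relevant label are ignored
        # rank 1 is the label with the highest score (descending order)
        order = sorted(range(len(s)), key=lambda l: -s[l])
        rank = {l: r + 1 for r, l in enumerate(order)}
        acc = 0.0
        for l in relevant:
            # relevant labels ranked at or above label l, divided by the rank of l
            at_or_above = sum(1 for l2 in relevant if rank[l2] <= rank[l])
            acc += at_or_above / rank[l]
        total += acc / len(relevant)
        counted += 1
    return total / counted if counted else 0.0
```

A value of 1 means every relevant label is ranked above every irrelevant one; the value drops as irrelevant labels are ranked above relevant ones.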
The $preRISK$ of the ith feature captures the difference in $preERR$ when the trained model is tested on the updated training data, in which the value of the ith feature is replaced for each example by its mean value over all examples, and on the original training data. We employ the $preRISK$ value as the basis for ranking feature importance. When we utilize the prediction risk to reduce the dimension of the feature space in multi-label bioinformatics data, we employ the evaluation measures of multi-label learning as the loss function for prediction risk. Let $x^i_j \in \mathbb{R}$ be the value of the ith feature for the jth example. The output of a classifier $C(x)$ ($x = [x^1, \ldots, x^d]$) is the predicted label set $\hat{Y}$. Let $L(\hat{Y}, Y)$ denote a multi-label loss function, where $Y$ is the true label set associated with instance $x$. Then $preERR(\bar{x}^i)$ is defined as follows:

$$preERR(\bar{x}^i) = L(C([x^1, \ldots, \bar{x}^i, \ldots, x^d]), Y) \quad (5)$$

where $\bar{x}^i$ is the mean value of the ith feature over all examples and $C([x^1, \ldots, \bar{x}^i, \ldots, x^d])$ is the prediction for all examples with the ith feature replaced by its mean value.
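The mean-substitution idea can be illustrated with a short sketch. It assumes that the prediction risk of a feature is the error after substitution minus the error on the original data, and that the loss is one where lower is better; `prediction_risk`, `toy_predict` and `zero_one_loss` are hypothetical names introduced for this example, and the toy classifier merely stands in for a trained multi-label model.

```python
def prediction_risk(predict, loss, X, Y):
    """For every feature i: loss with feature i replaced by its mean, minus the original loss."""
    n, d = len(X), len(X[0])
    base_error = loss([predict(x) for x in X], Y)
    means = [sum(row[i] for row in X) / n for i in range(d)]
    risks = []
    for i in range(d):
        # updated data: column i holds the column mean for every example
        X_updated = [row[:i] + [means[i]] + row[i + 1:] for row in X]
        risks.append(loss([predict(x) for x in X_updated], Y) - base_error)
    return risks


def toy_predict(x):
    """Hypothetical trained model: uses only feature 0 to choose between two labels."""
    return {0} if x[0] > 0.5 else {1}


def zero_one_loss(Y_pred, Y_true):
    """Fraction of examples whose predicted label set differs from the true one."""
    return sum(1 for p, t in zip(Y_pred, Y_true) if p != t) / len(Y_true)


X = [[1.0, 5.0], [0.0, 5.0], [1.0, 3.0], [0.0, 3.0]]
Y = [{0}, {1}, {0}, {1}]
risks = prediction_risk(toy_predict, zero_one_loss, X, Y)
print(risks)
```

In this toy case, flattening feature 0 destroys the model's predictions while flattening feature 1 changes nothing, so feature 0 receives the larger risk and would be ranked first.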
To further improve the performance of EEFS, we randomly select a percentage µ (0 < µ ≤ 100%) of instances from the original training data to train models, and then utilize Eq. (4) to compute the prediction risk for each feature. We repeat this process λ times (in theory, any integer ≥ 1; in practice, 1 ≤ λ ≤ 20 on the basis of our experimental experience). In the jth (1 ≤ j ≤ λ) iteration, we compute the prediction risk $preRISK^i_j$ of the ith feature according to Eq. (3). The average prediction risk $preRISK^i$ of the ith feature over all λ iterations is computed as follows:

$$preRISK^i = \frac{1}{\lambda} \sum_{j=1}^{\lambda} preRISK^i_j$$

To select appropriate parameters µ and λ in the multi-label bioinformatics feature selection process, we found a useful heuristic based on our experimental experience. Firstly, we employ another state-of-the-art multi-label feature selection algorithm (e.g. one based on filter methods) as a benchmark to compute the importance of each feature. Secondly, we use EEFS with the setting µ = 100% and λ = 1 to estimate the importance of each feature. The features are then ranked in descending order of importance. Thirdly, we select the same top percentage of features from the two feature rankings as feature subsets, denoted A and B respectively, and utilize multi-label classifiers to test their performance. We then set the parameters by comparing the performance of the two feature subsets. If the performance of A is better than that of B, we set µ close to 100% and λ small. Conversely, if the performance of B is better than that of A, we set µ away from 100% and λ large. Because this method essentially reduces the accumulated errors of the data itself by employing an ensemble method, it can achieve satisfactory experimental results. The pseudo code of EEFS is given in Algorithm 1.
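The subsample-and-average loop can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: `eefs_ranking`, `train` and `risk_fn` are hypothetical names, subsets are drawn without replacement, the prediction risk is evaluated on the sampled subset, and a larger averaged risk is taken to mean a more important feature.

```python
import random


def eefs_ranking(train, risk_fn, X, Y, mu=0.8, lam=5, seed=0):
    """Average per-feature prediction risk over lam models trained on random subsets.

    train(X_sub, Y_sub) -> model                    (user-supplied multi-label learner)
    risk_fn(model, X_sub, Y_sub) -> list of risks   (one value per feature)
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    subset_size = max(1, round(mu * n))
    avg_risk = [0.0] * d
    for _ in range(lam):
        idx = rng.sample(range(n), subset_size)   # random subset, no replacement
        X_sub = [X[i] for i in idx]
        Y_sub = [Y[i] for i in idx]
        model = train(X_sub, Y_sub)
        risks = risk_fn(model, X_sub, Y_sub)
        for i in range(d):
            avg_risk[i] += risks[i] / lam         # running mean over the lam iterations
    # feature indices in descending order of averaged prediction risk
    ranking = sorted(range(d), key=lambda i: -avg_risk[i])
    return ranking, avg_risk
```

With `mu=1.0` and `lam=1` the loop reduces to a single model trained on all examples, which corresponds to the setting used in the parameter heuristic described above.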

IV. EXPERIMENTS

A. EXPERIMENTAL DATASETS
Five bioinformatics datasets are employed to compare our proposed algorithm with the other state-of-the-art algorithms. For each dataset $S = \{(x_i, Y_i) \mid 1 \le i \le p\}$, we use $|S|$, $dim(S)$, $L(S)$ and $F(S)$ to denote the number of examples, number of features, number of class labels, and feature type of $S$, respectively. The clinical dataset [34], [35] comprises a total of 1566 free-text clinical records labeled by disease codes. The content of the records consists mostly of patients' impressions reported by radiologists in free-text form. We extract bag-of-words features from the raw text and further transform word counts into TF-IDF features. Only the word frequencies of the top 232 words are kept after stop-word filtering and word stemming [36]. The disease labels are expressed by a group of ICD-9-CM codes [37]. ICD-9-CM contains a list of carefully categorized disease entries, coded by distinct numbers, which can be used to classify the clinical records into their relevant diseases.
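The text does not state which TF-IDF weighting was used, so the following is only a sketch of one common variant (raw term count multiplied by the logarithm of the inverse document frequency). The function name `tfidf`, the tokenized toy records and the tiny vocabulary are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf(docs, vocab):
    """Turn tokenized documents into TF-IDF rows over a fixed vocabulary."""
    n = len(docs)
    counts = [Counter(doc) for doc in docs]
    # document frequency: number of documents that contain each word
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocab}
    rows = []
    for c in counts:
        rows.append([c[w] * math.log(n / df[w]) if df[w] else 0.0 for w in vocab])
    return rows


# Hypothetical, already stemmed and stop-word-filtered records.
docs = [["cough", "fever", "cough"], ["fever"]]
vocab = ["cough", "fever"]
rows = tfidf(docs, vocab)
print(rows)
```

A word that occurs in every record, such as "fever" here, receives weight zero under this variant, which is the intended down-weighting of uninformative terms.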
In our experiments, we restrict the label set to the top 10 labels for analysis and algorithm comparisons. The medical dataset [34] also consists of clinical text and is processed in the same way as the clinical dataset. It comprises 978 text medical records, the 217 most frequent words, and 20 labels. Two protein datasets, plant [38], [39] and virus [40], [41], with experimentally determined subcellular locations are obtained from Cell-Ploc 2.0 [42]. In these two datasets, all protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. We use the GO protein representation method, which is widely used in many existing protein subcellular localization systems, to generate the features of protein examples [43]-[45]. These two protein datasets are described as follows: 1) 969 different proteins with 224 GO features distributed among 12 subcellular locations for plant cells; 2) 206 different proteins with 185 GO features distributed among 6 subcellular locations for virus. The yeast dataset [2] is formed by micro-array expression data and phylogenetic profiles of 2417 genes. The input dimension is 103. Each gene is associated with a set of functional labels drawn from 14 possible classes. Table 4 briefly summarizes the characteristics of the experimental datasets.

B. MULTI-LABEL FEATURE SELECTION ALGORITHMS
We compare our proposed algorithm with the following three algorithms: multi-label embedded feature selection (MEFS) [31], and the $\chi^2$ based maximal and average types of filter multi-label feature selection (max and avg) [23].
• MEFS: The basic idea of this algorithm is to utilize the prediction risk and a classifier to evaluate the importance of the features in a feature subset, and a backward search strategy to select the best feature from the feature subset step by step to form the feature ranking.
• max: The basic idea of this algorithm is to calculate the dependency score between a feature and each label separately with a $\chi^2$ based evaluation statistic. The maximal dependency score of a feature across all labels is taken as the final importance score of this feature. According to the importance score of each feature, we obtain the feature ranking in descending order.
• avg: The basic idea of avg is similar to that of max. The dependency scores of a feature on all labels are averaged to form the final importance score of this feature.

C. MULTI-LABEL CLASSIFIERS
In order to eliminate the bias of classifiers, three multi-label classifiers, namely Binary Relevance (BR) [46], Classifier Chain (CC) [47] and Multi-Label k-Nearest Neighbor (MLkNN) [48], are employed in the experiments. BR and CC are problem transformation methods, which transform the multi-label classification problem into one or more single-label classification problems. MLkNN is an algorithm adaptation method, which extends a specific single-label learning algorithm to handle multi-label data directly.
• Binary Relevance (BR): The basic idea of this algorithm is to decompose the multi-label learning problem into q independent binary classification problems, where each binary classification problem corresponds to a possible label in the label space. In brief, it trains and tests a model for each label.
• Classifier Chain (CC): The basic idea of classifier chain is to transform the multi-label learning problem into a chain of binary classification problems, where subsequent binary classifiers in the chain are built upon the predictions of preceding ones. In brief, it treats each label as a new feature added to the original feature space to predict the next label in a chained way.
• Multi-Label k-Nearest Neighbor (MLkNN): The basic idea of this algorithm is to adapt k-nearest neighbor techniques to deal with multi-label data, where the maximum a posteriori (MAP) rule is utilized to make predictions by reasoning with the labeling information embodied in the neighbors. In brief, it designs a new algorithm based on k-nearest neighbor techniques for multi-label data.
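The difference between the two problem transformation methods is easiest to see in the binary tasks they build. The sketch below only constructs the training tasks and leaves the base learner open; `binary_relevance_tasks` and `classifier_chain_tasks` are hypothetical names, and using the true preceding labels as extra features at training time is an assumption (a common choice), with predictions taking their place at test time.

```python
def binary_relevance_tasks(X, Y, q):
    """BR: one independent binary task per label, all sharing the original features."""
    return [(X, [1 if j in y else 0 for y in Y]) for j in range(q)]


def classifier_chain_tasks(X, Y, q):
    """CC: task j sees the original features plus labels 0..j-1 as extra features."""
    tasks = []
    for j in range(q):
        X_aug = [list(x) + [1 if k in y else 0 for k in range(j)]
                 for x, y in zip(X, Y)]
        tasks.append((X_aug, [1 if j in y else 0 for y in Y]))
    return tasks


# Two toy examples with one feature and two labels (hypothetical data).
X = [[0.5], [1.5]]
Y = [{0, 1}, {1}]
br = binary_relevance_tasks(X, Y, q=2)
cc = classifier_chain_tasks(X, Y, q=2)
print(br[1], cc[1])
```

In the chain, the second task receives label 0 as an additional input column, which is how CC passes label information along the chain, whereas BR never lets one label's task see another label.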

D. EVALUATION MEASURES
In the multi-label learning community, it is well known that the performance evaluation of multi-label learning differs from that of classical single-label learning because each example can have multiple labels simultaneously. Therefore, five standard evaluation measures, namely multi-label accuracy (mlACC), precision (mlPRE), recall (mlREC), F1 (mlF1) and subset accuracy (ACC), are introduced to evaluate the performance of our proposed method more precisely from multiple aspects [49], [50]. The five evaluation measures are defined as follows:

$$mlACC = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i \cup \hat{Y}_i|}$$

$$mlPRE = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \cap \hat{Y}_i|}{|\hat{Y}_i|}$$

$$mlREC = \frac{1}{m} \sum_{i=1}^{m} \frac{|Y_i \cap \hat{Y}_i|}{|Y_i|}$$

$$mlF1 = \frac{2 \cdot mlPRE \cdot mlREC}{mlPRE + mlREC}$$

$$ACC = \frac{1}{m} \sum_{i=1}^{m} I(Y_i = \hat{Y}_i)$$

where $m$ is the number of test examples, $Y_i$ and $\hat{Y}_i$ are the set of true labels and the set of predicted labels of each instance, respectively, and $I(\cdot)$ equals 1 if its argument holds and 0 otherwise. mlF1 is the harmonic mean of mlREC and mlPRE. For all five evaluation measures, the larger the value, the better the performance.
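A minimal sketch of the five measures is given below, using their standard example-based forms, with mlF1 computed as the harmonic mean of mlPRE and mlREC as stated in the text. The function name `multilabel_measures` and the conventions for empty label sets (a score of 1 for accuracy when both sets are empty, 0 for precision or recall with an empty denominator) are assumptions made for this example.

```python
def multilabel_measures(Y_true, Y_pred):
    """Example-based measures over lists of true and predicted label sets."""
    m = len(Y_true)
    acc = pre = rec = subset = 0.0
    for t, p in zip(Y_true, Y_pred):
        inter, union = len(t & p), len(t | p)
        acc += inter / union if union else 1.0   # assumption: both sets empty counts as correct
        pre += inter / len(p) if p else 0.0      # assumption: empty prediction scores 0
        rec += inter / len(t) if t else 0.0      # assumption: empty truth scores 0
        subset += 1.0 if t == p else 0.0         # exact match of the whole label set
    ml_pre, ml_rec = pre / m, rec / m
    ml_f1 = 2 * ml_pre * ml_rec / (ml_pre + ml_rec) if ml_pre + ml_rec else 0.0
    return {"mlACC": acc / m, "mlPRE": ml_pre, "mlREC": ml_rec,
            "mlF1": ml_f1, "ACC": subset / m}


scores = multilabel_measures([{0, 1}, {2}], [{0}, {2}])
print(scores)
```

In the toy call, the first example is only partially correct, so subset accuracy counts it as wrong while the other measures give it partial credit.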

E. EXPERIMENT CONFIGURATION
In the experiments, the multi-label ranking type of evaluation measure average precision [51], [52] is employed as preERR to compute the prediction risk. EEFS is compared with three other state-of-the-art feature selection methods: MEFS, max and avg. The three classifiers and five evaluation measures described previously are all used in the experiments for an exhaustive assessment. We select the top 25%, 50%, 75% and 100% of the features, ranked in descending order of importance, to report the results, and we employ 10-fold cross validation. For the parameters µ and λ of EEFS, we set µ = 80% and λ = 5 for BR and CC, and µ = 50% and λ = 15 for MLkNN, following the parameter selection heuristic described in Section III.
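Cutting a feature ranking at a given top percentage can be sketched as follows. The function name `top_fraction` and the choice to round the subset size up are assumptions; the text does not say how fractional subset sizes were handled.

```python
import math


def top_fraction(ranking, fraction):
    """Keep the first ceil(fraction * d) feature indices of a descending importance ranking."""
    k = max(1, math.ceil(fraction * len(ranking)))
    return ranking[:k]


ranking = [3, 0, 2, 1]   # hypothetical ranking of four features, most important first
subsets = {f: top_fraction(ranking, f) for f in (0.25, 0.5, 0.75, 1.0)}
print(subsets)
```

Each subset would then be passed to the classifier inside the cross-validation loop, so that all algorithms are compared on subsets of the same size.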

V. RESULTS ANALYSIS
In this section, we analyze the experimental results in detail. All results with 3 classifiers, 4 algorithms and 5 evaluation measures on five multi-label bioinformatics datasets are demonstrated in Table 1-3 and Figure 1-4. The optimal results of EEFS and MEFS shown in ... Furthermore, in terms of computational complexity and mechanism, EEFS is computationally more efficient than MEFS. For example, for training data with $n$ instances, $d$ features and $q$ labels, to get the feature ranking, EEFS needs to train $\lambda$ models with $\mu n$ instances each and, for each model, test $d$ times to get each feature's importance. MEFS needs to train $(d - 1)$ models to get a feature ranking. In detail, the ith model of MEFS, which is trained on $(d - i + 1)$ features with $n$ instances, needs to be tested $(d - i + 1)$ times to get each feature's importance. In experiments, MEFS is more time-consuming than EEFS in obtaining the final feature ranking, and the two are the same in other parts.
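The counts in the complexity comparison can be made concrete with a small calculation. The function names `eefs_cost` and `mefs_cost` are hypothetical, and the sketch counts only the number of trained models and test passes, ignoring that each EEFS model is trained on fewer instances and each later MEFS model on fewer features.

```python
def eefs_cost(d, lam):
    """EEFS: lam models, each tested once per feature."""
    return {"models_trained": lam, "test_passes": lam * d}


def mefs_cost(d):
    """MEFS: d - 1 models; the i-th model is tested once per remaining feature (d - i + 1)."""
    return {"models_trained": d - 1,
            "test_passes": sum(d - i + 1 for i in range(1, d))}


# Example: d = 103 features (the yeast input dimension) and lam = 5 (the BR/CC setting).
eefs = eefs_cost(d=103, lam=5)
mefs = mefs_cost(d=103)
print(eefs, mefs)
```

For this setting the sketch gives 5 models and 515 test passes for EEFS against 102 models and 5355 test passes for MEFS, which illustrates why the ranking step of MEFS grows quadratically with the number of features while that of EEFS grows linearly.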

B. PERFORMANCE COMPARISON AMONG EEFS, MAX AND AVG
As shown in Table 1-3, compared with max and avg across all evaluation measures and classifiers, EEFS ranks 1st in 73.3% cases (BR: 72.0%, CC: 60.0%, MLkNN: 80.0%). As shown in Figure 2-4, when the feature subset is small, i.e. the top 25% of features, EEFS ranks 1st in 60.0% cases (BR: 64.0%, CC: 48.0%, MLkNN: 68.0%). These observations for the top 25% of features indicate that: 1) for BR and MLkNN, EEFS evaluates feature importance better than max and avg; 2) for CC, EEFS is not as good as max and avg. A more detailed analysis shows that the performance of EEFS on plant and virus drives this result. For plant and virus on the top 25% of features, EEFS ranks 1st in 70% cases with BR and 100% cases with MLkNN, but it ranks 1st in only 10% cases with CC. This is due to the characteristics of the CC classifier and the GO features of the protein datasets. CC uses each label as a new feature, together with the original feature space, to predict the next label, but the structure of the labels does not match the structure of the GO features. As a result, CC performs poorly.
All results indicate that, during the process of multi-label feature selection, EEFS can utilize: 1) the correlations between multiple labels and features; 2) the correlations among labels. In contrast, the other two algorithms, max and avg, are implemented by extending single-label methods to multi-label methods according to Eq. (1) and Eq. (2), respectively. Therefore, they can only utilize the relationship between a single label and a single feature, which leads to worse performance than EEFS.

VI. CONCLUSION
In this paper, we propose a novel algorithm named EEFS, i.e. Ensemble Embedded Feature Selection, which can deal with multi-label feature selection problems in bioinformatics data. EEFS can provide a relatively stable ranking of feature importance and reduce the negative effect of changes in the training data by randomly selecting subsets of the training examples and computing the prediction risk over several iterations. As illustrated by the experimental results, in most cases EEFS performs better than MEFS because it reduces the accumulated errors of the data itself by employing an ensemble method. It also performs better than the two filter multi-label feature selection algorithms, max and avg, because it can utilize: 1) the correlations between multiple labels and features; 2) the correlations among labels.
Inspired by this work, we will further explore the mechanism of embedded multi-label feature selection methods for bioinformatics data and propose a more efficient algorithm in the future.

FIGURE 1. Best performance of EEFS with BR, CC and MLkNN classifiers on five datasets compared with other three algorithms. (Each dataset connects all the evaluation measures with different color curves simultaneously and the number of color curves on the left half circle denotes the rank performance of EEFS corresponding to each evaluation measure).

FIGURE 2. Performance of the four algorithms with BR classifier on five datasets (TPoF: Top percentage of feature ranking in descending order according to their importance).

FIGURE 3. Performance of the four algorithms with CC classifier on five datasets (TPoF: Top percentage of feature ranking in descending order according to their importance).

FIGURE 4. Performance of the four algorithms with MLkNN classifier on five datasets (TPoF: Top percentage of feature ranking in descending order according to their importance).

In Table 1-3, the optimal results are the best performance of each algorithm among the four top percentages (25%, 50%, 70%, 100%) of features with the corresponding multi-label classifiers and evaluation measures. The bold-faced values represent the best performance among all the algorithms in Table 1-3. We use benchmark to denote the cross-validation classification results of the corresponding multi-label classifiers and evaluation measures on the related datasets with all the features (no feature selection). As shown in Table 1-3, compared with benchmark, EEFS ranks 1st in 90.7% cases (BR: 76.0%, CC: 100%, MLkNN: 96.0%) and equally 1st in 9.3% cases (BR: 24.0%, CC: 0%, MLkNN: 4.0%), which demonstrates the multi-label feature selection effectiveness of EEFS. As shown in Table 1-3 and Figure 1, compared with the other three algorithms, EEFS ranks 1st in 72.0% cases (BR: 72.0%, CC: 60.0%, MLkNN: 84.0%) and equally 1st in 2.7% cases (BR: 4.0%, CC: 0%, MLkNN: 4.0%). In Figure 1, each dataset under the BR, CC and MLkNN classifiers connects all the evaluation measures with different color curves simultaneously. Different color curves represent different rank status, and the number of color curves on the left half circle denotes the best performance of EEFS corresponding to each evaluation measure compared with the other three algorithms. For example, under BR, EEFS achieves 5 blue curves (ranking 1st) on clinical for the five evaluation measures, 1 blue curve (ranking 1st) on medical for mlREC, and 4 yellow curves (ranking 3rd) on medical for the other four evaluation measures. In Figure 2-4, the x-axes represent the top percentage of the feature importance ranking

TABLE 2. Comparison of the optimal results of four algorithms with CC classifier on five datasets.

TABLE 3. Comparison of the optimal results of four algorithms with MLkNN classifier on five datasets.

TABLE 4. Characteristics of the experimental datasets.