Feature Redundancy Based on Interaction Information for Multi-Label Feature Selection

In recent years, multi-label feature selection has gradually attracted significant attention from the machine learning, statistical computing and related communities, and it has been widely applied to diverse problems ranging from music recognition to text mining and image annotation. However, traditional multi-label feature selection methods employ a cumulative summation strategy, which suffers from the problem of overestimating the redundancy of some candidate features. Additionally, the cumulative summation strategy leads to a high value of the goal function when a candidate feature is completely correlated with one or a few already-selected features but is almost independent of the majority of already-selected features. To address these issues, we propose a new multi-label feature selection method named Feature Redundancy Maximization (FRM), which combines the cumulative summation of conditional mutual information with the 'maximum of the minimum' criterion. Additionally, FRM can be rewritten in another form that employs interaction information as a measure of feature redundancy and obtains an accurate score of feature redundancy as the number of already-selected features increases. Finally, extensive experiments are conducted on fourteen benchmark multi-label data sets in comparison with six state-of-the-art methods. The experimental results demonstrate the superiority of the proposed method.


I. INTRODUCTION
Recent years have witnessed an increasing number of applications involving multi-label data sets, in which each instance is associated with multiple labels simultaneously. The high dimensionality of the data is a stumbling block for multi-label learning. Multi-label feature selection is a key technique in many applications, since it can speed up the learning process, improve classification accuracy, and alleviate the effect of the curse of dimensionality. A large number of developments in multi-label feature selection methods have been made.
As with traditional single-label feature selection, multi-label feature selection methods are divided into three models: (1) filter models, where the selection is independent of any classifier; (2) wrapper models, where the prediction method is used as a black box to assess the importance of candidate feature subsets; and (3) embedded models, where feature selection is embedded in the learning process [1]-[4].
We focus on filter-based multi-label feature selection methods because of their higher computational efficiency. Additionally, it is natural to consider information theory, which has been widely used in numerous feature selection methods [5]-[7]. Information theory provides a powerful measure that can capture both linear and nonlinear dependencies between variables.
Traditional multi-label feature selection methods employ a cumulative summation strategy, which suffers from the problem of overestimating the redundancy of some candidate features. Specifically, the effect of the feature relevance term is gradually reduced as more features are selected, which indicates that the cumulative summation strategy ignores the importance of feature relevance in later iterations. Before designing our method, we reviewed many information theoretic multi-label feature selection methods, such as PMU [5], D2F [7] and SCLS [8], and found that all of them suffer from this overestimation problem; for example, the effect of the feature relevance term in D2F is gradually reduced as more features are selected. Furthermore, these methods adopt the cumulative summation strategy, which leads to a high value of the goal function when a candidate feature is completely correlated with one or a few already-selected features but is almost independent of the majority of already-selected features. Both issues can be solved by the 'maximum of the minimum' criterion. To the best of our knowledge, this is the first attempt to apply the 'maximum of the minimum' criterion to multi-label feature selection. Importantly, the proposed method achieves excellent classification performance in comparison with six methods on fourteen data sets in terms of multiple evaluation criteria.
To solve these issues, the proposed method combines the cumulative summation of conditional mutual information with the 'maximum of the minimum' criterion. Additionally, our method can be rewritten in another form that employs interaction information as a measure of feature redundancy and obtains an accurate score of feature redundancy as the number of already-selected features increases.
The main contributions of this paper are as follows:
1) To address the problem of overestimating the redundancy of some candidate features, the proposed method chooses interaction information as the feature redundancy term, which obtains an accurate score of feature redundancy as the number of already-selected features increases;
2) Our method employs the 'maximum of the minimum' criterion, which avoids the problem of overestimating the value of the goal function and is different from existing multi-label feature selection methods;
3) An evaluation criterion named Feature Redundancy Maximization (FRM) is proposed to assess each feature;
4) Extensive experiments are conducted to verify the effectiveness of the proposed method in comparison with six state-of-the-art methods on fourteen benchmark data sets.

The rest of the paper is organized as follows. Section II presents the concepts of information theory and the evaluation metrics used in our experiments. Section III reviews related work. Section IV proposes the Feature Redundancy Maximization (FRM) multi-label feature selection method. Section V presents extensive experimental results and the time complexity of the feature selection methods. Finally, we give conclusions and directions for future research in Section VI.

II. PRELIMINARIES
A. INFORMATION THEORY
Information entropy is used to describe the uncertainty of random variables, and it is fundamental to information theory. Let X = {x_1, x_2, ..., x_l} be a random variable. Information entropy is defined as follows:

H(X) = -\sum_{i=1}^{l} p(x_i) \log p(x_i) \quad (1)

where p(x_i) is the probability of x_i and the base of the logarithm is 2. If Y = {y_1, y_2, ..., y_m} is a random variable and the joint probability of X and Y is p(x_i, y_j), then joint entropy is measured as:

H(X, Y) = -\sum_{i=1}^{l} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j) \quad (2)

Additionally, conditional entropy reflects the uncertainty of a variable X given Y, which is expressed as follows:

H(X|Y) = -\sum_{i=1}^{l} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i|y_j) \quad (3)

In addition to information entropy, mutual information measures the amount of information that two variables share; it is defined as follows:

I(X; Y) = \sum_{i=1}^{l} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)} \quad (4)

Alternatively, conditional mutual information between X and Y given the random variable Z = {z_1, z_2, ..., z_n} is defined as follows:

I(X; Y|Z) = H(X|Z) + H(Y|Z) - H(X, Y|Z) \quad (5)

where H(X, Y|Z) is the joint entropy that reflects the uncertainty of (X, Y) given the variable Z. Similarly, joint mutual information measures the mutual information between (X, Y) and Z, which is defined as follows:

I(X, Y; Z) = I(X; Z) + I(Y; Z|X) \quad (6)

Finally, we give the definition of interaction information [9], [10]:

I(X; Y; Z) = I(X; Y) - I(X; Y|Z) \quad (7)
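To make these definitions concrete, the following is a minimal Python sketch that estimates the above quantities from empirical frequencies of discrete variables. The function names and the use of NumPy are our own illustrative choices, not code from the paper.

```python
import numpy as np
from collections import Counter

def entropy(x):
    """H(X), Formula (1), estimated from a sequence of discrete samples."""
    counts = np.array(list(Counter(x).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def joint_entropy(x, y):
    """H(X, Y), Formula (2), from the empirical joint distribution."""
    return entropy(list(zip(x, y)))

def mutual_information(x, y):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), equivalent to Formula (4)."""
    return entropy(x) + entropy(y) - joint_entropy(x, y)

def conditional_mutual_information(x, y, z):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z), cf. Formula (5)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))

def interaction_information(x, y, z):
    """I(X; Y; Z) = I(X; Y) - I(X; Y | Z), Formula (7)."""
    return mutual_information(x, y) - conditional_mutual_information(x, y, z)
```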

B. MULTI-LABEL EVALUATION METRICS
In traditional single-label feature selection, the average classification accuracy is usually used to evaluate the already-selected feature subset [2], [4], [10]-[13]. However, the evaluation metrics for multi-label feature selection are much more complicated than those for single-label feature selection.
In this section, we introduce three kinds of evaluation metrics. Suppose X ∈ R^d represents the d-dimensional instance space, and L = {y_1, y_2, ..., y_q} is a set of labels with q possible class labels. Given a test set {(x_i, Y_i) | 1 ≤ i ≤ n}, where x_i = {x_{i1}, x_{i2}, ..., x_{id}}^T is an instance and Y_i ⊆ L is the subset of labels associated with x_i, let \hat{Y}_i be the predicted label set corresponding to the instance x_i.
Hamming Loss (HL) calculates the average fraction of misclassified labels on the test data. HL is defined as:

HL = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{q} \left| Y_i \oplus \hat{Y}_i \right| \quad (8)

where \oplus denotes the symmetric difference between the label sets Y_i and \hat{Y}_i. The smaller the value of HL, the better the classification performance. Macro-average (Macro-F1) and micro-average (Micro-F1) based on the F-measure are two widely used evaluation criteria. Macro-F1 is an average of the F-measure over all labels, and Micro-F1 is a weighted average of the F-measure over all labels:

Macro\text{-}F1 = \frac{1}{q} \sum_{i=1}^{q} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i} \quad (9)

Micro\text{-}F1 = \frac{\sum_{i=1}^{q} 2\,TP_i}{\sum_{i=1}^{q} \left( 2\,TP_i + FP_i + FN_i \right)} \quad (10)
where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for the i-th class label, respectively. The higher the values of these two metrics, the better the classification performance. Details of the three metrics mentioned above, as well as other metrics for evaluating multi-label learning algorithms, can be found in the literature [14].
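The three metrics can be sketched as follows for 0/1 label-indicator matrices Y_true and Y_pred of shape (n, q). This is a minimal illustration under our own naming, with a small guard against empty labels that the formulas above do not need.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    # Formula (8): average fraction of misclassified labels.
    return float(np.mean(Y_true != Y_pred))

def macro_f1(Y_true, Y_pred):
    # Formula (9): unweighted mean of per-label F-measures.
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    return float(np.mean(2 * tp / np.maximum(2 * tp + fp + fn, 1)))

def micro_f1(Y_true, Y_pred):
    # Formula (10): F-measure on counts pooled over all labels.
    tp = int(np.sum((Y_true == 1) & (Y_pred == 1)))
    fp = int(np.sum((Y_true == 0) & (Y_pred == 1)))
    fn = int(np.sum((Y_true == 1) & (Y_pred == 0)))
    return 2 * tp / max(2 * tp + fp + fn, 1)
```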

III. RELATED WORK
In multi-label learning, one instance covers a group of labels. Therefore, traditional single-label methods cannot address multi-label data directly. Conventionally, multi-label methods fall into two categories: problem transformation methods and algorithm adaptation methods [6], [15]-[17]. Problem transformation methods transform the multi-label learning task into single-label (binary or multi-class) learning tasks [18], [19]. Binary Relevance (BR) [20] trains several binary single-label classifiers and predicts each label separately, ignoring the relationships between labels. Label Powerset (LP) [21] regards each unique combination of labels in a multi-label training set as a new class in a single-label task. However, LP easily leads to higher time consumption and lower accuracy, because the number of new classes increases rapidly with the number of labels; additionally, it easily generates unbalanced data. Pruned Problem Transformation (PPT) [22] abandons label patterns with an extremely small number of samples, although abandoning patterns results in a loss of multi-label information. In the feature selection process, the χ² statistic is used to select effective features (PPT+CHI) [22]. In addition, Doquire and Verleysen [15] propose a multi-label feature selection method based on mutual information using PPT (PPT+MI). Algorithm adaptation methods tackle the multi-label learning problem directly. Classical multi-label learning methods include Multi-Label K-Nearest Neighbor (ML-KNN) [23], Multi-Label Decision Tree (ML-DT) [24], Ranking Support Vector Machine (Rank-SVM) [25], etc. ML-KNN adapts the k-nearest neighbor classifier to multi-label data; its basic idea is to use the Maximum A Posteriori (MAP) rule to predict unseen labels. ML-DT adapts decision trees to multi-label data, where the information gain technique is used to build the decision tree recursively. Rank-SVM adapts the maximum margin strategy to multi-label data, where a group of linear classifiers is optimized to minimize the empirical ranking loss.
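As a concrete illustration of the two classical transformation strategies just described, the following Python sketch builds BR and LP tasks from a 0/1 label matrix Y of shape (n, q); the function and variable names are hypothetical, and the snippet only sketches the idea.

```python
import numpy as np

def binary_relevance_tasks(X, Y):
    """BR: one independent binary task per label (label correlations ignored)."""
    return [(X, Y[:, j]) for j in range(Y.shape[1])]

def label_powerset_task(X, Y):
    """LP: each distinct label combination becomes a new single-label class."""
    combos = [tuple(row) for row in Y]
    class_of = {c: i for i, c in enumerate(sorted(set(combos)))}
    # The number of classes can grow rapidly with the number of labels,
    # and rare combinations yield unbalanced data, as noted above.
    return X, np.array([class_of[c] for c in combos])
```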
Similar to traditional single-label learning, the curse of dimensionality also exists in multi-label learning. Therefore, feature selection also plays an important role in multi-label learning. Problem transformation methods first transform the multi-label data into single-label data and then evaluate each feature using some measure, such as mutual information (MI). However, problem transformation feature selection methods assess each feature independently, and the transformed single-label data include too many classes, resulting in a decline in the classification performance of the learning algorithm. As a result, many feature selection methods tackle multi-label data directly.
Jian et al. [26] propose a multi-label feature selection method without data transformation named Multi-label Informed Feature Selection (MIFS), which adopts the latent semantics of the multi-label data to exploit label correlations for feature selection. Lee and Kim [7] propose the D2F multi-label feature selection method, which is based on information theory. D2F adopts interaction information and mutual information to evaluate the importance of each feature and is able to measure feature dependency among multiple features. The criterion of D2F is as follows:

J_{D2F}(X_k) = \sum_{y_i \in L} I(X_k; y_i) - \sum_{X_j \in S} \sum_{y_i \in L} I(X_k; X_j; y_i) \quad (11)

where X_k is a candidate feature, y_i ∈ L is a class label, X_j ∈ S is an already-selected feature, and S is the already-selected feature subset. I(X_k; y_i) calculates the feature relevance between a candidate feature and each class label, and I(X_k; X_j; y_i) measures the feature redundancy among the three variables.
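A direct, unoptimized rendering of the D2F score in Formula (11), reusing the information-theoretic estimators sketched in Section II; `candidate` and the elements of `selected` denote discretized feature columns and `labels` the label columns, all illustrative names.

```python
def d2f_score(candidate, selected, labels):
    # Formula (11): summed relevance minus the cumulative summation of
    # interaction information over all (selected feature, label) pairs.
    relevance = sum(mutual_information(candidate, y) for y in labels)
    redundancy = sum(interaction_information(candidate, s, y)
                     for s in selected for y in labels)
    return relevance - redundancy
```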
Scalable Criterion for Large Label Set (SCLS) [8] employs a scalable relevance evaluation to measure conditional relevance. As a result, SCLS effectively removes irrelevant features from the final feature subset. SCLS is expressed as follows:

J_{SCLS}(X_k) = \sum_{y_i \in L} I(X_k; y_i) - \sum_{X_j \in S} \frac{I(X_k; X_j)}{H(X_k)} \sum_{y_i \in L} I(X_k; y_i) \quad (12)

It is worth noting that the first feature is selected by maximizing \sum_{y_i \in L} I(X_k; y_i).
Pairwise Multi-label Utility (PMU) [5] is proposed to tackle the curse of dimensionality by considering two types of redundancy: I(X_k; X_j; y_i) and I(X_k; y_i; y_j). PMU is designed as:

J_{PMU}(X_k) = \sum_{y_i \in L} I(X_k; y_i) - \sum_{X_j \in S} \sum_{y_i \in L} I(X_k; X_j; y_i) - \sum_{y_i \in L} \sum_{y_j \in L,\, j > i} I(X_k; y_i; y_j) \quad (13)

where y_i ∈ L and y_j ∈ L are two different class labels.
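For comparison, below are hedged sketches of the SCLS and PMU scores as reconstructed in Formulas (12) and (13); they again reuse the estimators from Section II, and the small epsilon guard is our own addition to avoid division by zero for a constant feature.

```python
from itertools import combinations

def scls_score(candidate, selected, labels, eps=1e-12):
    # Formula (12): relevance scaled down by the accumulated redundancy ratio.
    relevance = sum(mutual_information(candidate, y) for y in labels)
    ratio = sum(mutual_information(candidate, s) for s in selected)
    return relevance - (ratio / max(entropy(candidate), eps)) * relevance

def pmu_score(candidate, selected, labels):
    # Formula (13): relevance minus feature-based and label-pair redundancy.
    relevance = sum(mutual_information(candidate, y) for y in labels)
    feat_red = sum(interaction_information(candidate, s, y)
                   for s in selected for y in labels)
    label_red = sum(interaction_information(candidate, yi, yj)
                    for yi, yj in combinations(labels, 2))
    return relevance - feat_red - label_red
```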
Recently, many multi-label feature selection methods are proposed to address multi-label data. González-López et al. [27] propose a distributed model to obtain a score that evaluates the quality of each feature regarding multiple labels. Sun et al. [28] propose a multi-label feature selection method based on constrained convex optimization, with the propose method obtaining less time to address multi-label data. Additionally, some sparse learning-based multi-label feature selection methods have attracted increased attention. Hu et al. [29] propose a sparse learning-based method incorporating dual-graph regularization, i.e., feature graph regularization and label graph regularization. Furthermore, a novel feature selection method named Feature Selection considering Shared Common Mode between features and labels (SCMFS) is proposed, which extracts the shared common mode between the feature matrix and the label matrix [30]. González-López et al. [31] employ mutual information to design two distributed multi-label feature selection methods on continuous features. One evaluates features based on the differential mutual information between features and labels, the other selects the feature subsets based on maximizing mutual information between features and labels while minimizing the redundancy between already-selected features. Additionally, some single-label feature selection methods adopt the interaction term or its variants to design evaluation criteria, such as the methods [32]- [34].

IV. PROPOSED METHOD
As we can see, most algorithm adaptation methods suffer from two issues. First, most methods apply cumulative summation approximations, which suffer from the problem of overestimating the redundancy of some candidate features. For example, the first term of D2F is the feature relevance term, which is the sum of |L| mutual information terms, while the feature redundancy term is the sum of |L| * |S| interaction information terms. As a result, as more features are selected, the effect of the feature relevance term in D2F is gradually reduced. The same situation occurs in most multi-label feature selection methods [5], [7], [8]. Second, the cumulative summation strategy leads to a high value of the goal function when a candidate feature is completely correlated with one or a few already-selected features but is almost independent of the majority of already-selected features. To address the issues mentioned above, we design a new multi-label feature selection method that combines the cumulative summation of conditional mutual information with the 'maximum of the minimum' criterion. The proposed method selects the candidate feature that maximizes the following score:

J(X_k) = \min_{X_j \in S} \sum_{y_i \in L} I(X_k; y_i | X_j) \quad (14)

By the definition of interaction information in Formula (7), I(X_k; y_i | X_j) = I(X_k; y_i) - I(X_k; y_i; X_j), so Formula (14) can be rewritten as

J(X_k) = \sum_{y_i \in L} I(X_k; y_i) - \max_{X_j \in S} \sum_{y_i \in L} I(X_k; y_i; X_j)

In this form, \sum_{y_i \in L} I(X_k; y_i) calculates the mutual information between a candidate feature and each class label and represents the feature relevance term, while the interaction information I(X_k; y_i; X_j) represents the feature redundancy term. The term \max_{X_j \in S} \sum_{y_i \in L} I(X_k; y_i; X_j) takes the largest cumulative summation of interaction information over the already-selected features, which yields an accurate score of feature redundancy as the number of already-selected features increases. Moreover, the feature relevance term is the sum of |L| mutual information terms and the feature redundancy term is the sum of |L| interaction information terms, which balances the magnitudes of the feature relevance and feature redundancy terms.

Algorithm 1 FRM
Input: the full feature set X, the label set L, and the number of features to be selected N.
Output: the already-selected feature subset S.
1: Initialize S ← ∅;
2: repeat
3: Find the X_k ∈ X that maximizes Formula (14);
4: Set S ← S ∪ {X_k}, and X ← X − {X_k};
5: until |S| = N;
6: Output the set S containing the already-selected features.
N is the number of features to be selected, S is the already-selected feature subset, and X is the full feature set. First, we initialize S as an empty set. Second, we evaluate Formula (14) to find the best candidate feature and update the sets S and X. Finally, Algorithm 1 repeats the second step until |S| = N.
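A minimal Python sketch of Algorithm 1, reusing the estimators from Section II. Selecting the first feature by maximum summed relevance follows the convention noted for SCLS; all names are illustrative, and the inner loop recomputes conditional mutual information from scratch rather than caching terms for previously selected features.

```python
def frm_select(features, labels, n_select):
    """Greedy FRM selection; `features` and `labels` are lists of columns."""
    remaining = list(range(len(features)))
    selected = []
    while len(selected) < n_select and remaining:
        def score(k):
            if not selected:
                # First feature: maximize the summed relevance (cf. SCLS).
                return sum(mutual_information(features[k], y) for y in labels)
            # Formula (14): minimum over already-selected features of the
            # cumulative conditional mutual information.
            return min(
                sum(conditional_mutual_information(features[k], y, features[j])
                    for y in labels)
                for j in selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```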

V. EXPERIMENTAL RESULTS
To demonstrate the classification superiority of the proposed method, our method is compared to two problem transformation-based feature selection methods (PPT+MI and PPT+CHI) and four state-of-the-art multi-label feature selection methods (MIFS, D2F, PMU and SCLS). First, we introduce the experimental data sets and the experimental settings. Second, extensive experimental results demonstrate the superiority of our method. Finally, we present the running time and the time complexity of the proposed and compared methods.

A. EXPERIMENTAL DATA SETS AND SETTING
We execute our experiments on fourteen real-world data sets collected from the Mulan Library [35]. They come from four different application areas. The Flags and Scene data sets are relevant to images. Emotions is a multi-label data set relevant to music [21]. Genbase and Yeast are relevant to biology. The remaining data sets are widely applied to text categorization [36]. We discretize continuous features into 3 bins using an equal-width strategy, as recommended in [7], [18]. Table 1 displays the details of the data sets; the last column indicates the maximal number of already-selected features in the feature selection process.
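For reference, the equal-width discretization described above can be sketched as follows; this is our own minimal NumPy version, not the authors' preprocessing code.

```python
import numpy as np

def equal_width_discretize(column, n_bins=3):
    """Map a continuous column to n_bins equal-width bins (indices 0..n_bins-1)."""
    lo, hi = float(np.min(column)), float(np.max(column))
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]  # interior cut points
    return np.digitize(column, edges)
```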
In this paper, the ML-KNN [23] classifier, where k is set to 10, and the Support Vector Machine (SVM) classifier are selected to conduct the experiments for the HL, Macro-F1 and Micro-F1 evaluation metrics. The number of already-selected features K varies from 1 to M with a step size of 1, where M is 20% of the total number of features (17% for the Medical data set). In addition, we set K to the total number of features for the Flags data set. All experiments are executed on an Intel Core i7 with a 3.40 GHz processor and 8 GB of main memory. The feature subsets selected on the training set are applied to the test set directly, because the original data sets in the Mulan Library have already been separated into training and test sets. The same cross-validation settings are employed as in the literature [1], [37]. The experimental framework is presented in Fig. 1.

B. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we show the experimental results in the form of figures and tables. The average values of Hamming Loss, Macro-F1 and Micro-F1 across the K groups of feature subsets selected according to each selection criterion are recorded in Tables 2-7, together with standard deviations. In Tables 2-7, the row 'Average' indicates the average value of the corresponding feature selection method, and the row 'Improved rate' presents the improvement of the proposed method over the compared methods. Bold font indicates the best classification performance among the seven feature selection methods. Furthermore, we conduct a paired two-tailed t-test between the proposed method and each compared method; the improvement does not occur by chance when the p-value is less than 5%. '+', '−' and '=' indicate that FRM performs better than, worse than, or equal to the corresponding compared method, respectively.
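The significance test described above can be sketched with SciPy's paired two-tailed t-test as follows; the function and its arguments are illustrative, and it assumes higher-is-better scores (for Hamming Loss, where lower is better, the comparison direction is reversed).

```python
from scipy.stats import ttest_rel

def significance_mark(frm_scores, other_scores, alpha=0.05):
    """Return '+', '-' or '=' for FRM versus one compared method."""
    _, p_value = ttest_rel(frm_scores, other_scores)  # paired two-tailed test
    if p_value >= alpha:
        return "="  # the difference may be due to chance
    return "+" if sum(frm_scores) > sum(other_scores) else "-"
```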
As shown in Tables 2-7, our method outperforms the compared methods in terms of Hamming Loss, Macro-F1 and Micro-F1 on most data sets. In Table 2, our method achieves the best classification performance on ten data sets in terms of the Hamming Loss metric. Specifically, the proposed FRM obtains better Hamming Loss performance than D2F and PMU on all data sets. In Table 3, FRM outperforms the six compared methods on twelve data sets in terms of the Macro-F1 metric. As shown in Tables 4 and 5, FRM is superior to the six compared methods on thirteen data sets in terms of Micro-F1 and Hamming Loss, the exception being the Emotions data set with the SVM classifier. In Table 5, the proposed method obtains the best classification performance on thirteen data sets, again except the Emotions data set. Additionally, as shown in Tables 6 and 7, our method achieves the best classification performance in terms of Macro-F1 and Micro-F1 with the ML-KNN classifier.
To present the improvement of our method, the last row 'Improved rate' in Tables 2-7 reports the improved rates of the average results in terms of Macro-F1, Micro-F1 and Hamming Loss. From Tables 2-7, we can find that the improved rates of our method in comparison to PPT+MI, PPT+CHI, MIFS, D2F, PMU and SCLS are 17.9%, 18.7%, ... in terms of Micro-F1. The proposed method outperforms the compared methods significantly in terms of Macro-F1 and Micro-F1 on the two classifiers. Furthermore, the improved rates of our method in comparison to PPT+MI, PPT+CHI, MIFS, D2F, PMU and SCLS are 2.9%, 2.5%, 7.9%, 2.9%, 5.1% and 6.5% in terms of Hamming Loss with the ML-KNN classifier. With the SVM classifier, the improved rates of FRM over PPT+MI, PPT+CHI, MIFS, D2F, PMU and SCLS are 3.6%, 4.4%, 8.7%, 3.8%, 6.2% and 7.3% in terms of Hamming Loss. The performance gap between our method and the compared methods is less noticeable in terms of Hamming Loss on the two classifiers because the range of Hamming Loss values is narrow, leaving little room for improvement. In conclusion, the proposed method achieves the best classification performance.
Next, Figs. 2-5 show the classification performance under different evaluation metrics on four data sets (Arts, Education, Entertainment and Scene). In Figs. 2-5, the X-axis represents the number of selected features, varied as {1%, 2%, ..., 20%} of the total number of features. The Y-axis represents the classification performance under the different evaluation metrics after feature selection. Different colors and shapes represent different multi-label feature selection methods.
As shown in Figs. 2-5, different multi-label feature selection methods produce different curves on different data sets for a given metric. For the Arts data set, FRM obtains good classification performance as the number of already-selected features increases, especially on the Macro-F1 and Micro-F1 metrics. Similar trends occur on the Scene data set in Fig. 5. In addition, as shown in Fig. 4, FRM obtains significantly better performance on the Entertainment data set. In general, FRM outperforms the compared methods.

C. EXPERIMENTS ON AN ARTIFICIAL EXAMPLE
In this section, we conduct experiments on an artificial example to verify the effectiveness of the proposed method. The experimental results present the feature ranking and the classification performance in terms of Hamming Loss, Macro-F1 and Micro-F1 with the ML-KNN and SVM classifiers. Table 8 and Table 9 present the training data set and the test data set, respectively. F = {f_1, ..., f_8} is a feature set and L = {l_1, ..., l_4} is a label set. The experimental results are presented in Table 10. We choose D2F as the compared method because both D2F and the proposed FRM adopt interaction information as the feature redundancy term; the only difference is that D2F employs the cumulative summation strategy while FRM employs the 'maximum of the minimum' criterion. As we can see in Table 10, the feature rankings of FRM and D2F differ. Because of the different selection strategies, the feature f_5 is selected as the third feature by D2F, while f_5 is selected as the last feature by FRM. As mentioned above, the importance of the feature f_5 is overestimated by the D2F method. Observing Table 10, the proposed FRM, adopting the 'maximum of the minimum' criterion, achieves better classification performance than D2F in terms of Hamming Loss, Micro-F1 and Macro-F1.

D. TIME COMPLEXITY ANALYSIS AND RUNNING TIME
In this section, we analyze the time complexity of the four information theory-based multi-label feature selection methods: FRM, D2F, SCLS and PMU. Let N denote the number of features, M the number of instances, Y the number of labels, and K the number of features to be selected. As for FRM, suppose that k features have been selected. In each iteration, we need to compute the conditional mutual information between the N − k candidate features and the Y labels given one already-selected feature before selecting the next feature, so the time complexity of each iteration is O(MNY). The number of iterations is K; therefore, the total time complexity of FRM is O(KMNY). As for the D2F, SCLS and PMU methods, the feature relevance term involves Y mutual information terms for N features over M instances; therefore, the time complexity of the relevance term in these methods is O(MNY). Specifically, D2F considers interaction information to measure feature redundancy, hence its total time complexity is also O(KMNY), and PMU additionally considers the label-pair redundancy term I(X_k; y_i; y_j), which further increases its cost. In contrast, the redundancy term of SCLS only involves mutual information between pairs of features, so the total time complexity of SCLS is O(MNY + KMN), which is lower than that of the FRM, D2F and PMU methods.
Additionally, Table 11 presents the running time of the proposed method and the mutual information-based methods on four representative data sets. These data sets come from three different application areas; the number of features varies from 19 to 1185, the number of instances from 194 to 5000, and the number of labels from 6 to 33. The four mutual information-based methods belong to the class of algorithm adaptation methods. SCLS is designed to reduce running time, and the proposed FRM obtains the second-best performance in terms of running time among the four methods. More importantly, the proposed method achieves the best classification performance in terms of multiple evaluation criteria on the two classifiers.

VI. CONCLUSION AND FUTURE WORK
Multi-label data, in which an instance is associated with multiple labels simultaneously, has emerged as a challenging problem in numerous areas such as text classification and music emotion recognition. Multi-label data suffers from the curse of dimensionality, and it contains irrelevant and redundant features. Multi-label feature selection methods are intended to alleviate the negative effects of irrelevant and redundant features.
Traditional multi-label feature selection methods employ the cumulative summation strategy to design feature selection criteria, which suffers from the problem of overestimating the redundancy of some candidate features. Additionally, the cumulative summation strategy leads to a high value of the goal function when a candidate feature is completely correlated with one or a few already-selected features but is almost independent of the majority of already-selected features. To address these two limitations, a novel multi-label feature selection method named FRM is proposed. FRM combines the cumulative summation of conditional mutual information with the 'maximum of the minimum' criterion. Additionally, FRM can be rewritten in another form that employs interaction information as the measure of feature redundancy and obtains an accurate score of feature redundancy as the number of already-selected features increases. As a result, FRM overcomes the limitations mentioned above.
To verify the classification superiority of our method, FRM is compared to two problem transformation-based feature selection methods (PPT+MI and PPT+CHI) and four state-of-the-art multi-label feature selection methods (MIFS, D2F, PMU and SCLS) on fourteen benchmark data sets using three multi-label evaluation metrics and two classifiers. The experimental results demonstrate that FRM outperforms the compared methods in terms of Hamming Loss, Macro-F1 and Micro-F1. Since the classification performance of different multi-label methods depends on the data set, we plan to design an effective multi-label feature selection method that overcomes this limitation in future work.