Multi-Label Feature Selection Based on Min-Relevance Label

Multi-label feature selection has been widely adopted to address multi-label data with high-dimensional features. Calculating label correlations is critical for multi-label feature selection. Existing methods adopt different schemes to calculate label correlations and thus obtain different estimates of label importance. However, these schemes suffer from two issues: first, they cannot predict the whole label set well because they focus only on the most important labels; second, most of the important labels share similar classification information, which corresponds to redundant features. To this end, we use the mutual information metric to obtain different core labels of the label set rather than calculating the importance of labels. Afterwards, we capture features with respect to each core label, finally obtaining an optimal feature subset. To verify its effectiveness, our method is compared to state-of-the-art multi-label methods on 16 real-world data sets with several evaluation metrics. The experimental results prove that the proposed method achieves the best classification performance among all compared multi-label feature selection methods.


I. INTRODUCTION
Multi-label learning has extensive applications in many research fields [1], [2], [3], such as document classification [4], image recognition [5], [6] and gene prediction [7], [8]; therefore, it attracts significant attention. In multi-label data sets, each instance is usually associated with multiple labels and high-dimensional features simultaneously. High-dimensional multi-label data contain numerous irrelevant and redundant features, which not only degrade classification performance but also increase the computational cost. To address the curse of dimensionality, feature selection techniques have become a focus of research [9], [10], [11], [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Chao Tong .
Feature selection methods can be divided into three categories [13], [14], [15], [16]: filter models, wrapper models and embedded models. Filter models obtain a feature subset from the original feature set independently of the subsequent classifier [17]. Wrapper models use the result of the classifier to determine the optimal feature subset [18]. Embedded models choose the optimal feature subset during the training process [19]. Wrapper models and embedded models both involve training models; compared to them, filter models are more flexible and faster [20], so this paper focuses on this type of model.
Different from single-label learning, where one instance is related to only one label, in multi-label learning one instance is related to multiple labels. Therefore, it is critical to consider the label correlations. However, two issues exist in previous methods. First, previous methods cannot predict the whole label set well because they focus only on the most important labels, which are defined by the cumulative summation approximation scheme. Second, most of the important labels have similar classification information, which corresponds to redundant features. For example, the label ''sports'' has a high probability of appearing along with the label ''baseball''; features are then selected around similar classification information, so numerous redundant features are selected. In fact, there exist high-order label correlations in multi-label data. Previous methods employ mutual information based on the cumulative summation approximation scheme to calculate the relevance between features and labels [21], [22], [23], [24], [25], [26]. The common limitation of these methods is that the cumulative summation approximation scheme may overestimate the importance of some candidate features related to groups with more labels while ignoring the importance of groups with fewer labels. In other words, the features selected by these methods may only have a good classification effect for those groups that contain more labels. We believe that a good model should have considerable classification performance for each label in the data set. Specifically, the highlights of this paper are as follows:
1) A new label importance evaluation scheme is designed, which considers predicting the whole label set precisely.
2) Redundant features are eliminated by selecting features around the different classification information of labels.
3) The new feature selection method avoids overestimating the significance of some features.
The rest of this paper is arranged as follows. Section II introduces some basic concepts of information theory and the evaluation criteria of multi-label classification. Section III reviews related work. In Section IV, we present the proposed multi-label feature selection method. Section V presents experimental results that verify the effectiveness of the proposed method. In Section VI, we conclude and discuss future research directions.

II. PRELIMINARIES
A. THE BASIC CONCEPTS OF INFORMATION THEORY
In this subsection, we introduce some basic concepts of information theory, which are used to measure the correlations among random variables [27], [28]. Let X = {x_1, x_2, . . . , x_n} and Y = {y_1, y_2, . . . , y_m} be two discrete random variables. Information entropy H(X) is used to measure the uncertainty of X. It is defined as follows:

H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)

where p(x_i) is the probability of x_i and the base of log is 2. Then we introduce joint entropy and conditional entropy. The joint entropy of X and Y is expressed as H(X, Y). H(X|Y) is the conditional entropy of X given Y, which measures the remaining uncertainty of X under the condition of Y. H(X, Y) and H(X|Y) are defined as follows:

H(X, Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i, y_j)

H(X|Y) = -\sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log p(x_i|y_j)

where p(x_i, y_j) is the joint probability of (x_i, y_j) and p(x_i|y_j) is the conditional probability of x_i given y_j. With the concept of entropy, mutual information is used to measure the amount of information shared by two variables. The larger the mutual information is, the more relevant the variables are. It is defined as:

I(X; Y) = \sum_{i=1}^{n} \sum_{j=1}^{m} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i) p(y_j)} = H(X) - H(X|Y)

Then we introduce another variable Z. Conditional mutual information measures the mutual information between two variables given another variable; the mutual information of X and Y given Z is expressed as:

I(X; Y|Z) = H(X|Z) - H(X|Y, Z)

Considering X, Y as a whole, the joint mutual information is defined as:

I(X, Y; Z) = I(X; Z) + I(Y; Z|X)

Interaction information measures the amount of information shared by three variables, which is defined as:

I(X; Y; Z) = I(X; Y) - I(X; Y|Z)
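The quantities above can be estimated empirically from discrete samples. The following minimal Python sketch illustrates the definitions; the function names are our own and it is an illustration rather than part of the proposed method:

```python
import math
from collections import Counter

def entropy(xs):
    """Empirical entropy H(X) in bits, estimated from a list of samples."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def joint_entropy(xs, ys):
    """Empirical joint entropy H(X, Y)."""
    return entropy(list(zip(xs, ys)))

def conditional_entropy(xs, ys):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return joint_entropy(xs, ys) - entropy(ys)

def mutual_information(xs, ys):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(xs) - conditional_entropy(xs, ys)

def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z)."""
    return (joint_entropy(xs, zs) + joint_entropy(ys, zs)
            - entropy(zs) - entropy(list(zip(xs, ys, zs))))

def interaction_information(xs, ys, zs):
    """I(X; Y; Z) = I(X; Y) - I(X; Y | Z)."""
    return mutual_information(xs, ys) - conditional_mutual_information(xs, ys, zs)
```

For instance, two identical binary columns share one bit of mutual information, while an XOR relation yields a negative interaction information.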
B. EVALUATION CRITERIA FOR MULTI-LABEL CLASSIFICATION
Let T = {(x_1, L_1), (x_2, L_2), . . . , (x_N, L_N)} be the multi-label test set, where each pair is composed of an instance and its real labels, L = {l_1, l_2, . . . , l_q} is the label set, N is the number of instances and L_i ⊆ L is the label set corresponding to the instance x_i. Let L_i' be the predicted label set corresponding to the instance x_i. Hamming Loss (HL) is the average fraction of misclassified labels in the test data. HL is defined as:

HL = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{q} |L_i \oplus L_i'|

where \oplus denotes the symmetric difference between the label sets L_i and L_i'. The smaller the value of HL, the better the classification performance. Macro-F1 and Micro-F1 are two widely used evaluation criteria for multi-label learning based on the F1 score. Macro-F1 is the arithmetic average of the F1 scores of all labels. Macro-F1 is defined as follows:

Macro-F1 = \frac{1}{q} \sum_{i=1}^{q} \frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}

where TP_i, FP_i and FN_i denote the numbers of true positives, false positives and false negatives for label l_i, respectively. Micro-F1 is the weighted average of the F1 over all labels:

Micro-F1 = \frac{2 \sum_{i=1}^{q} TP_i}{\sum_{i=1}^{q} (2\,TP_i + FP_i + FN_i)}

These three multi-label evaluation criteria can be used to measure the classification performance: a lower value of HL indicates better classification performance, while the higher Macro-F1 and Micro-F1 are, the better the classification performance is.
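The three criteria can be computed directly from binary indicator matrices of true and predicted labels. A minimal NumPy sketch follows; the function names are illustrative, not from the original paper:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average fraction of wrongly predicted labels over all instances and labels."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    return float(np.mean(Y_true != Y_pred))

def macro_f1(Y_true, Y_pred):
    """Arithmetic mean of the per-label F1 scores."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=0)
    fp = np.sum((Y_true == 0) & (Y_pred == 1), axis=0)
    fn = np.sum((Y_true == 1) & (Y_pred == 0), axis=0)
    denom = 2 * tp + fp + fn
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return float(np.mean(f1))

def micro_f1(Y_true, Y_pred):
    """F1 computed from counts pooled over all labels."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0
```

Note how a label that is rare but perfectly predicted pulls Macro-F1 up, while Micro-F1 is dominated by the frequent labels.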

III. RELATED WORK
Traditional multi-label feature selection methods can be divided into two groups: problem transformation and algorithm adaptation [32], [33]. Problem transformation methods involve two steps: (1) transforming the multi-label data set into multiple groups of single-label data sets; (2) selecting the optimal features from the transformed groups of single-label data sets. Binary Relevance (BR) [6], Label Powerset (LP) [34] and Pruned Problem Transformation (PPT) [35] are classic problem transformation methods. BR transforms the multi-label data set into several groups of independent binary classification data sets. LP maps each instance's label set to a single new class. Using ReliefF (RF) [37] and Information Gain (IG) [38] as the feature evaluation criteria to measure the transformed data, N. Spolaôr et al. [36] propose four multi-label feature selection methods (RF-BR, RF-LP, IG-BR and IG-LP) based on BR and LP. However, BR ignores the correlations between labels, and LP may cause over-fitting and imbalance problems because it creates too many classes. PPT removes the instances related to rarely occurring labels according to a predefined minimal number of occurrences of the label set, to improve the effectiveness of LP. Based on the PPT method and mutual information, Doquire and Verleysen [39] propose a multi-label feature selection method (PPT + MI). Additionally, the CHI square statistic is used to select effective features (PPT + CHI) [35]. However, the problem with this type of method still remains: they usually ignore the correlations between labels.
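The two classic transformations can be sketched in a few lines. The following is a minimal illustration assuming binary label indicator matrices; the function names are our own:

```python
import numpy as np

def binary_relevance(Y):
    """BR: one independent binary target per label (ignores label correlations)."""
    Y = np.asarray(Y)
    return [Y[:, j] for j in range(Y.shape[1])]

def label_powerset(Y):
    """LP: map each instance's full label set to a single new class id."""
    Y = np.asarray(Y)
    classes = {}
    lp = np.empty(Y.shape[0], dtype=int)
    for i, row in enumerate(map(tuple, Y)):
        lp[i] = classes.setdefault(row, len(classes))
    return lp
```

The sketch makes the trade-off above concrete: BR discards the joint structure of the label columns, while LP can create up to 2^q classes, many of which occur only once.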
In recent years, many multi-label feature selection methods based on algorithm adaptation have been proposed, which directly select features from multi-label data sets. S. Kashef and H. Nezamabadi-pour [40] propose a multi-label feature selection algorithm based on the Pareto dominance concept, casting label-specific feature selection as a multi-objective optimization problem. Sun et al. [25] propose a feature selection method based on mutual information, which obtains discriminative features considering label correlations through constrained convex optimization (MICO).
Multi-label informed feature selection (MIFS) [41] is an integrated feature selection method. It uses latent semantic indexing (LSI) to decompose the multi-label information into a low-dimensional label space, and then uses the reduced label space to guide the feature selection process through a regression model. Gao et al. [53] propose a feature selection method based on a unified framework including three low-order information-theoretic terms for multi-label learning, named Selected Terms of Feature Selection (STFS), regarding high-order variable correlations. Zhang et al. [54] propose a feature selection method named Weighted Feature Relevancy Feature Selection (WFRFS), assuming that the remaining uncertainty of candidate features changes dynamically according to the selected features; thus a Relevancy Ratio is designed to clarify the dynamic change in the amount of information, and Weighted Feature Relevancy is defined to evaluate the candidate features. Fan et al. [55] propose manifold learning with a structured subspace for multi-label feature selection, which takes the potential structural information of the data into consideration. Wang et al. [56] propose a feature selection method based on a neighborhood discrimination index, which characterizes the distinguishing information of a neighborhood relation. Fan et al. [57] propose a feature selection method with a local discriminant model and label correlations. It uses the local discriminant model to obtain a cluster assignment matrix and then explores the label correlation matrix based on clusters by taking advantage of a coefficient matrix between the feature space and the label space. Lee and Kim [21] propose a multi-label feature selection method based on information theory, named Pairwise Multi-label Utility (PMU). Its evaluation function is defined as follows:

J(f_k) = \sum_{l_i \in L} I(f_k; l_i) - \sum_{f_j \in S} \sum_{l_i \in L} I(f_k; f_j; l_i) - \sum_{l_i \in L} \sum_{l_j \in L} I(f_k; l_i; l_j)

where L is the label set and S is the feature subset that has been selected.
Here f_k is a candidate feature, f_j is a feature in S, and l_i and l_j are two labels. The larger the value of J(f_k) is, the more important the candidate feature f_k is. A multi-label feature selection method using interaction information (D2F) [22] is also proposed to efficiently measure the correlations among features and labels in multi-label data; its criterion combines the mutual information between the candidate feature and each label with interaction-information terms over the already-selected features. The Scalable Criterion for a Large Label Set (SCLS) [23] designs a multi-label feature selection criterion that uses a scale from 0% to 100% to assess the correlation. Lin et al. [42] propose a multi-label feature selection method based on Max-Dependency and Min-Redundancy (MDMR). The general idea of MDMR is to maximize the dependency between candidate features and each label using mutual information, and to minimize the redundancy between the candidate feature and all the already-selected features, where |S| is the number of features that have already been selected. Multi-label Feature Selection based on Label Redundancy (LRFS) [2] is also proposed; LRFS adopts the conditional mutual information between candidate features and each label given the other labels to assess feature relevance. Gonzalez-Lopez et al. [43] propose Geometric Mean Maximization (GMM), which selects the optimal features by calculating the geometric mean of the instances' mutual information. From the above introduction, we can see that most previous multi-label feature selection methods based on information theory use cumulative summation approximation to consider label correlations. The problem is that in real-world multi-label data sets there is often high-order label correlation: labels tend to cluster into several groups with high correlation.
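The criteria surveyed above differ in their redundancy terms, but they share the cumulative-summation shape: a relevance term summed over all labels, minus a redundancy term over the already-selected features. The following is a hypothetical simplification of that shared pattern for illustration only, not any single published formula:

```python
def cumulative_summation_score(f, labels, selected, mi):
    """Generic shape of a cumulative-summation criterion (illustrative only).

    f        -- a candidate feature identifier
    labels   -- the full label set; relevance is summed over ALL labels,
                which is what can over-weight large label groups
    selected -- the already-selected features
    mi       -- a callable mi(a, b) returning the mutual information of a and b
    """
    relevance = sum(mi(f, l) for l in labels)
    if not selected:
        return relevance
    redundancy = sum(mi(f, g) for g in selected) / len(selected)
    return relevance - redundancy
```

Because the relevance term sums over every label, a feature tied to a large group of correlated labels accumulates many similar terms, which is exactly the overestimation discussed below.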
The common limitation of these methods is that cumulative summation may overestimate the importance of some candidate features related to groups with more labels while ignoring the importance of groups with fewer labels. In other words, the features selected by these methods may only have a good classification effect for those groups that contain more labels. We believe that a good model should have considerable classification performance for each label in the data set; therefore, we propose a method named Multi-label Feature Selection based on Min-relevance Label (MRLFS), which first finds the label subset with minimal relevance and then selects features based on that label subset.

IV. MRLFS: MIN-RELEVANCE LABEL FEATURE SELECTION
A. THE PROPOSED METHOD
Observing Formulas (11)-(15), which are evaluation functions of multi-label feature selection methods using cumulative summation approximation, it is hard for them to deal with high-order correlations. Labels in real-world multi-label data sets are likely to cluster into several groups that share similar topics, so the number of labels in each group influences the classification performance of feature selection models. On the one hand, they tend to select more redundant features that are all related to the same group with more labels; on the other hand, they neglect some important features that are related to the groups with fewer labels. For example, if there is a label set with 100 labels that can be divided into 5 groups (g_1, . . . , g_5), where 80 labels belong to g_1 and only 20 labels belong to the other four groups, then these models may select extra features related to g_1 and ignore some features related to the other four groups. Supposing the five groups are equally important, these methods may not be satisfactory. An effective and comprehensive feature subset should contain features that are related to different semantic groups, which has been proved effective [44].
To solve these issues, we propose the Min-Relevance Label Feature Selection method (MRLFS). It mainly consists of two stages:
1) The most irrelevant labels are selected through mutual information to form a new label set, which avoids the influence of the number of labels in each group of the original label set.
2) The maximum value of the mutual information between features and labels is used in the evaluation function, which also avoids the bias in experimental results caused by the large difference in the number of labels in different groups.
Let the training sample D have a full feature set F = {f_1, f_2, . . . , f_n} and a label set L = {l_1, l_2, . . . , l_q}. First, we calculate the mutual information of all pairs (f_i, l_j) and take the pair with the largest value; we call them the initial feature f_0 and the initial label l_0, and we add f_0 to the feature subset S and l_0 to the label subset L'. Then, for each label not in L', we calculate its mutual information with all the labels in L' and select the label with the minimal value into L'. We call this evaluation function the label relevance (LR):

LR(l_i) = \sum_{l_j \in L'} I(l_i; l_j)

Afterwards, the minimal label relevance (MLR) is denoted as:

MLR(L') = \arg\min_{l_i \in L \setminus L'} LR(l_i)

This function returns the label with the minimal relevance to the labels in L'. We choose one label l* at a time and then add into S the feature maximizing the following function:

J(f_k) = I(f_k; l^*) - \frac{1}{|S|} \sum_{f_j \in S} I(f_k; f_j)

where the latter term is used to minimize the redundancy with the already-selected feature subset, and |S| is the number of already-selected features. We repeat this operation until the number of selected features reaches the termination condition or all labels have been added into L'. If all the labels have been added and the number of selected features is still insufficient, we use the following function until the termination condition is met:

J(f_k) = \max_{l_i \in L'} I(f_k; l_i) - \frac{1}{|S|} \sum_{f_j \in S} I(f_k; f_j)

Output: the already-selected feature subset S.
1: S ← ∅; L' ← ∅
2: for i = 1 to d do
3:   for j = 1 to q do
4:     calculate I(f_i; l_j);
5: take the pair (f_0, l_0) with the largest value; S ← S ∪ {f_0}; L' ← L' ∪ {l_0};
6: while the termination condition is not met do
7:   if L \ L' ≠ ∅ then
8:     l* ← MLR(L'); L' ← L' ∪ {l*};
9:     add to S the feature f_k maximizing I(f_k; l*) − (1/|S|) Σ_{f_j∈S} I(f_k; f_j);
10:  else
11:    add to S the feature f_k maximizing max_{l_i∈L'} I(f_k; l_i) − (1/|S|) Σ_{f_j∈S} I(f_k; f_j);
12: return S.

The time complexity of the proposed method is compared with that of the other information-theoretic methods (D2F, PMU and SCLS). For ease of readability and comparison, we provide the results in tabular form in Table 1. Suppose the number of instances is n, the number of features is d and the number of labels is q. The time complexity of mutual information, conditional mutual information and interaction information is O(n), as all the instances need to be traversed for probability estimation. Suppose the number of selected features is t; the time complexity of MRLFS is O(ndq + tn(d + q)). The time complexity of SCLS is O(ndq + tnd) and that of D2F is O(ndq + tndq). PMU and LRFS design their evaluation criteria to consider second-order label correlations; the time complexity of PMU is O(ndq + tndq + ndq^2).
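The greedy procedure described above can be sketched as follows. This is a minimal illustration for discrete features and labels with simplified tie-breaking, not the authors' reference implementation:

```python
import math
from collections import Counter
from itertools import product

def _mi(a, b):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(a)
    def H(s):
        return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())
    return H(a) + H(b) - H(list(zip(a, b)))

def mrlfs(X_cols, Y_cols, k):
    """Greedy MRLFS sketch: X_cols/Y_cols are lists of discrete columns.
    Returns the indices of the k selected features."""
    F, L = range(len(X_cols)), range(len(Y_cols))
    # Seed with the (feature, label) pair of largest mutual information.
    f0, l0 = max(product(F, L), key=lambda p: _mi(X_cols[p[0]], Y_cols[p[1]]))
    S, Lp = [f0], [l0]
    def score(f, l):
        # Relevance to one label minus average redundancy with S.
        red = sum(_mi(X_cols[f], X_cols[g]) for g in S) / len(S)
        return _mi(X_cols[f], Y_cols[l]) - red
    while len(S) < k:
        cands = [f for f in F if f not in S]
        if not cands:
            break
        rest = [l for l in L if l not in Lp]
        if rest:
            # Stage 1: add the label least relevant to the current label subset.
            l_star = min(rest, key=lambda l: sum(_mi(Y_cols[l], Y_cols[m]) for m in Lp))
            Lp.append(l_star)
            S.append(max(cands, key=lambda f: score(f, l_star)))
        else:
            # All labels used: fall back to the max-MI label per candidate.
            S.append(max(cands, key=lambda f: max(score(f, l) for l in Lp)))
    return S
```

On a toy data set where feature 0 copies label 0 and feature 1 copies label 1, the sketch first seeds with feature 0, adds the least-relevant label 1, and then picks feature 1 rather than the uninformative feature 2.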

V. EXPERIMENTAL RESULTS AND ANALYSIS
To evaluate the classification performance of the proposed method, our method is compared to four state-of-the-art multi-label feature selection methods (MIFS, D2F, PMU and SCLS). Three evaluation metrics (Hamming Loss, Micro-F1 and Macro-F1) are used, and the classifiers used in the experiments are the ML-KNN classifier and the SVM classifier. In this section, we first introduce the experimental data sets and the parameter settings of the experiments. Then, extensive experimental results are used to analyze the performance of our method. The experimental framework is presented in Figure 1.

A. DATASET AND EXPERIMENTAL SETTING
Sixteen real-world data sets are used in our experiments, which are collected from the Mulan Library [45]. They come from five different application areas. The Birds data set is relevant to audio. The Scene data set is relevant to images. Emotions is a multi-label data set relevant to music [34]. The Genbase data set is relevant to biology. The remaining data sets are widely applied to text categorization [46]. Table 2 displays the details of the data sets. The training and test sets are already separated in the Mulan Library, and the numbers of training and test instances are shown in the last two columns of Table 2.
Experimental settings are introduced as follows: first, continuous features are discretized into three bins using an equal-width strategy, as recommended in the literature [14], [47]. Second, the number of already-selected features b varies from 1 to M with a step size of 1, where M is 20% of the total number of original features, as widely adopted in the literature [48], [49], [50], [51], [52]. We use the scikit-learn package in Python 2.7 to implement the classifiers.

Tables 3-5 show the classification performance of each feature selection method in terms of Hamming Loss, Macro-F1 and Micro-F1. These tables present the details of the experimental results of each compared method, providing the average classification results and standard deviations of the M groups of feature subsets selected on each data set. The best classification performance for each data set is shown in bold and the second best is underlined. In addition, the last rows record the average results of each feature selection method over all benchmark data sets. Tables 3 and 4 record the evaluation results in terms of Macro-F1 and Micro-F1. From Table 3, we can see that MRLFS achieves the best classification performance on all data sets except Enron and Scene, and it obtains sub-optimal results on Scene. Similarly, the proposed method obtains the best or sub-optimal results on all the data sets in terms of Micro-F1 in Table 4. Both Macro-F1 and Micro-F1 are label-based metrics; the larger the values are, the better the classification performance the method achieves. Furthermore, Hamming Loss, an example-based metric, is adopted for evaluating MRLFS and the compared methods. The results in Table 5 show that the proposed method obtains the best or second best performance on all data sets. Finally, we calculate the average results of each method on all data sets in terms of the three evaluation metrics.
The proposed method achieves the best average classification performance in terms of all three metrics. For example, in Table 3, MRLFS obtains 0.1821 compared to MIFS with 0.1218, D2F with 0.1510, PMU with 0.1303 and SCLS with 0.0935. These results again confirm the conclusion of the previous paragraph: our model is effective in classifying all types of labels rather than only the categories containing the majority of labels.
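The equal-width discretization used in the experimental settings above can be sketched as follows; the function name is our own and this is an illustration rather than the exact preprocessing code:

```python
import numpy as np

def equal_width_discretize(x, n_bins=3):
    """Discretize a continuous feature into n_bins equal-width intervals."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:                      # constant feature -> a single bin
        return np.zeros(x.shape, dtype=int)
    edges = np.linspace(lo, hi, n_bins + 1)
    # Cutting at the interior edges yields bin ids 0 .. n_bins-1.
    return np.digitize(x, edges[1:-1], right=False)
```

Equal-width binning keeps the bin boundaries independent of the label distribution, which is why it is a common preprocessing step before estimating mutual information.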

B. EXPERIMENTAL RESULT AND ANALYSIS
To further support our conclusions and clearly display the classification performance, Figures 2-5 show the classification results of MRLFS and the other feature selection methods on four multi-label data sets (Entertain, Health, Scene and Science). The X-axis of each figure indicates the percentage of selected features, while the Y-axis represents the classification performance in terms of the different evaluation metrics. Curves of different colors represent different feature selection methods.
As shown in Figures 2-5, the proposed method significantly outperforms the compared methods. On the Entertain data set, as well as on the other three data sets (Health, Scene and Science), MRLFS outperforms all the other methods. Overall, our method is better than MIFS, D2F, PMU and SCLS on these four data sets in terms of the three metrics.

VI. CONCLUSION
In this paper, a novel multi-label feature selection method is proposed, named Multi-label Feature Selection based on Min-relevance Label (MRLFS). Our general idea is to find a subset of labels that are as different from one another as possible, so that they represent the whole label set, and then to select features based on it. In this process, mutual information is used to find the labels with minimal relevance and the optimal features. The combination of the mutual information between labels and features and the feature redundancy term constitutes the function used to select features. To demonstrate the effectiveness of our method, MRLFS is compared to four state-of-the-art multi-label feature selection methods (MIFS, D2F, SCLS and PMU) using ML-KNN on 16 real-world multi-label data sets with the Hamming Loss metric. Additionally, the SVM classifier is used to evaluate the classification performance of the five feature selection methods in terms of Micro-F1 and Macro-F1. The experimental results prove that MRLFS is effective and reaches our target. In future work, we have two main directions: first, further improving the performance of the model and addressing the remaining problems with high-order label correlation; second, finding a term to evaluate the diversity of data sets and adjusting the structure of the data sets accordingly.