Feature Selection for Multi-Label Learning Based on F-Neighborhood Rough Sets

Multi-label learning is often applied to complex decision tasks, and feature selection is an essential part of it. The relation among labels is often ignored or insufficiently considered in both multi-label learning and its feature selection. To deal with this problem, F-neighborhood rough sets are employed. Different from other methods, the original approximation space is not changed, yet the relation among labels is sufficiently considered. Specifically, a multi-label decision system is first decomposed into a family of single-label decision tables according to the label set (first-order strategy). Second, attribute significance is calculated in the family of single-label decision tables. Third, an attribute significance matrix and improved attribute significance matrices are constructed to evaluate the quality of the features, and a parallel reduct is then obtained with information fusion. These steps constitute the F-neighborhood parallel reduction algorithm for a multi-label decision system (FNPRMS). Experimental results on 9 publicly available data sets show that FNPRMS is effective and efficient compared with state-of-the-art methods.


I. INTRODUCTION
Multi-label learning [8], [14], [19], [23], [24], [39], [40], [44]-[47] has been prevalent in recent years. Unlike single-label learning, each instance can be associated with multiple labels in multi-label learning. For example, in image classification, an image may carry the labels cat, dog and human. It is well known that, in the era of big data, more and more features are collected in practice. Some of these features may be redundant or unrelated to the classification task, and they need to be removed before any further data processing.
Multi-label dimension reduction is mainly divided into multi-label feature extraction and multi-label feature selection [4], [22], [32]. Representative algorithms for multi-label feature extraction include MDDM [52] and PCA. They reduce the dimension of the feature space through spatial mapping or transformation, but methods of this kind may destroy the structural information of the original data. Feature selection removes redundant features from the original features without any conversion. The selected subset of features preserves the original structure and physical meaning of the data, so the result is more readable and interpretable. Examples of multi-label feature selection algorithms include RF-ML [31], MDMR [22] and MLFRS [23].
Rough sets [28], [29] have also been applied to multi-label feature selection recently. Lin et al. [20], [23], [24] proposed multi-label feature selection methods based on fuzzy rough sets [26] and neighborhood rough sets [12], [25]. These methods consider the correlations among labels to some extent, but they need to redefine the approximation space of features, which makes them hard to interpret. Moreover, for different types of data, the positive region may be too large or too small, so that the feature selection is not valid.
Almost all existing multi-label dimension reduction approaches either ignore the relation among labels or consider it insufficiently.
To deal with the above problems, a new method based on F-neighborhood rough sets [5] is proposed for feature selection in a multi-label decision system. Different from other methods, the original approximation space need not be changed, and the new method has several merits over state-of-the-art methods, including interpretability, effectiveness, efficiency, and sufficient consideration of the relation among labels. First, a multi-label decision system is decomposed into a family of single-label decision tables (first-order strategy [46]). Second, an attribute significance based on F-neighborhood rough sets is proposed and employed to evaluate features in the family of single-label decision tables. Finally, the attribute significance matrix [2] and its improved matrix are defined, and a forward heuristic feature selection algorithm (FNPRMS) is proposed to select the optimal feature subset with information fusion. Experimental results and theoretical analysis indicate that FNPRMS has a competitive advantage over state-of-the-art methods. The overall process of this article is shown in Fig. 1. The rest of this article is organized as follows. Section 2 briefly reviews multi-label feature selection and F-rough sets. Section 3 briefly reviews the basic concepts of F-neighborhood rough sets. Section 4 proposes an approach to divide the multi-label decision table into a family of single-label decision tables (first-order strategy), introduces a new multi-label feature selection model based on F-neighborhood rough sets, and discusses its properties. Finally, a forward heuristic feature selection algorithm (FNPRMS) is designed. Section 5 presents the experiments and their analysis. Section 6 summarizes the article and indicates future work.

II. RELATED WORK
In this section, related work on multi-label feature selection and F-rough sets is reviewed briefly.

A. MULTI-LABEL FEATURE SELECTION METHODS
The methods of multi-label feature selection are mainly divided into two categories: filter and wrapper. Filter methods treat multi-label feature selection as a pre-processing step of multi-label learning. Wrapper methods select the optimal subset of features while the classifiers perform classification. Filter methods are generally more efficient than wrapper methods. Therefore, most feature selection algorithms are filter methods.
Many techniques have been applied to multi-label feature selection, including information entropy [30], [33], function optimization [33] and granular computing [17]. Seo et al. [30] proposed an improved k-cardinality entropy approximation-based criterion for multi-label feature selection and investigated the parameter k. Sun et al. [33] presented a mutual information based method which obtains the optimal subset of features via constrained convex optimization. Zhang et al. [42] presented an algorithm called MDFS, which exploits label correlations via manifold regularization and then selects optimal features with $l_{1,2}$-norm regularization and convexity. Li et al. [17] put forward a multi-label feature selection method based on granular computing, which first granulates the label space and then selects the optimal feature subset with a multi-label maximal correlation and minimal redundancy criterion. Li et al. [24] presented a new rough set model and applied it to multi-label feature selection.
Some researchers have investigated the problem of missing labels. Ma and Chow [27] employed semantic relational graphs to recover the label matrix. Wang et al. [35] defined three indexes via information entropy and employed feature interaction to select more valuable features in an incomplete label space.
However, label correlations are still not considered sufficiently in these methods, so the problem of the relation among labels remains open.

B. F-ROUGH SETS
F-rough sets [2] are an extended version of rough sets. In addition to the merits of rough sets, such as objectivity, understandability and no need for prior knowledge, F-rough sets are dynamic, feasible and adaptable, and have been applied to concept drift detection and heterogeneous data processing [3]. F-neighborhood rough sets [5] are an extension of F-rough sets which can deal with both numerical data and discrete data.

III. PRELIMINARY KNOWLEDGE
In this section, the basic concepts of F-neighborhood rough sets [5] are reviewed.
Definition 1: Let $FIS = \{IS_i \mid i = 1, 2, \ldots, n\}$ be a family of information systems with $IS_i = (U_i, A)$, where $U_i$ is a set of instances and $A$ is a set of attributes. $X(IS_i) \subseteq U_i$ (denoted by $X$ when no confusion arises) is a concept variable, which has different meanings in different information systems. The F-neighborhood lower and upper approximations of the concept $X$ can be defined as in (1). If there is a decision attribute $d$ in $FIS$, the family of information systems becomes a family of decision systems. Here $|\cdot|$ denotes the cardinality of a set.
Once the attribute dependence degree of $d$ on $B \subseteq A$ is defined, the attribute significance of any $a \in B$ can be defined from it. When $a \in A - B$, the attribute significance of $a$ to $B$ takes an alternative expression.
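To make the dependence degree and the two kinds of attribute significance concrete for a single decision table, the following minimal Python sketch may help. It assumes the standard neighborhood rough set forms: a sample is in the positive region when every sample within radius `delta` shares its decision value, the dependence degree is the fraction of such samples, and significance is the change in dependence degree when an attribute is removed from or added to B. The function names, the Euclidean distance and the toy data are assumptions for illustration and are not taken from [5].

```python
import numpy as np

def positive_region(X, y, features, delta):
    """Indices of samples whose delta-neighborhood on `features` is decision-pure."""
    n = len(y)
    if len(features) == 0:
        # Assumption: with no attributes all samples are indistinguishable.
        dist = np.zeros((n, n))
    else:
        sub = X[:, list(features)]
        # Pairwise Euclidean distances restricted to the chosen attributes.
        dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    return [i for i in range(n) if np.all(y[dist[i] <= delta] == y[i])]

def dependency(X, y, features, delta):
    """Fraction of samples that fall in the positive region."""
    return len(positive_region(X, y, features, delta)) / len(y)

def sig_inner(X, y, B, a, delta):
    """a in B: dependence lost when a is removed from B."""
    rest = [f for f in B if f != a]
    return dependency(X, y, B, delta) - dependency(X, y, rest, delta)

def sig_outer(X, y, B, a, delta):
    """a not in B: dependence gained when a is added to B."""
    return dependency(X, y, list(B) + [a], delta) - dependency(X, y, B, delta)

# Hypothetical toy table: attribute 0 separates the classes, attribute 1 does not.
X = np.array([[0.0, 0.0], [0.1, 1.0], [1.0, 0.0], [1.1, 1.0]])
y = np.array([0, 0, 1, 1])
print(dependency(X, y, [0], 0.25), dependency(X, y, [1], 0.25))
print(sig_inner(X, y, [0, 1], 0, 0.25), sig_outer(X, y, [], 0, 0.25))
```

On this toy table, attribute 0 alone already gives full dependence, so removing it from {0, 1} costs everything while removing attribute 1 costs nothing.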

IV. MULTI-LABEL FEATURE SELECTION BASED ON F-NEIGHBORHOOD ROUGH SETS
In this section, multi-label feature selection with the F-neighborhood rough set model is investigated. First, an approach is proposed to divide a multi-label decision system into a family of single-label decision tables (first-order strategy). Second, and most crucially, a parallel reduct is obtained from the family of single-label decision tables with the F-neighborhood rough set model. Finally, related properties are discussed and proved.
Table 1 is a multi-label decision system $MDT = (U, A, D)$ with the universe $U = \{x_1, x_2, x_3\}$, the feature space $A$ and the label set $D = \{d_1, d_2, d_3\}$. The multi-label decision table $MDT$ is divided into three single-label decision systems $DT_1$, $DT_2$ and $DT_3$ (Tables 2, 3 and 4, respectively) under the view of F-neighborhood rough sets. These three single-label decision systems have the same sample set $U$ and feature space $A$ as $MDT$, but their labels are $d_1$, $d_2$ and $d_3$, respectively. In this way, a multi-label decision system $MDT$ can be turned into a family of single-label decision systems $F = \{DT_1, DT_2, \cdots, DT_n\}$, where $n$ is the number of labels in $MDT$. The F-neighborhood rough sets for a multi-label decision system $MDT$ have the following definitions.
Definition 2: Let $MDT = (U, A, D)$ be a multi-label decision system and let $F = \{DT_1, DT_2, \cdots, DT_n\}$ be the family of single-label decision systems corresponding to $MDT$. The upper and lower approximations of $F$ (corresponding to $MDT = (U, A, D)$) are defined over this family.
Let $MDT = (U, A, D)$ be a multi-label decision system and let $F = \{DT_1, DT_2, \cdots, DT_n\}$ be the family of single-label decision systems corresponding to $MDT$. The attribute dependence degree of $D$ on $B \subseteq A$ is denoted by $\gamma(F, B, D)$. The multi-label information in $MDT$ is fused into $\gamma(F, B, D)$, and the relation among labels is embodied in $\gamma(F, B, D)$.
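The first-order decomposition and the fusion of label information into one dependence degree can be sketched in Python as follows. This is a sketch under stated assumptions: the fused dependence degree is taken to be the total size of the per-table positive regions divided by the total number of instances over all tables, which is an assumed form rather than the paper's exact definition. The helper names and the three-instance data (loosely shaped like Table 1, but hypothetical) are also assumptions.

```python
import numpy as np

def decompose(X, Y):
    """First-order strategy: one single-label table (X, Y[:, i]) per label column."""
    return [(X, Y[:, i]) for i in range(Y.shape[1])]

def positive_region_size(X, y, features, delta):
    """Number of samples whose delta-neighborhood on `features` is label-pure."""
    sub = X[:, list(features)]
    dist = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=2)
    return sum(bool(np.all(y[dist[i] <= delta] == y[i])) for i in range(len(y)))

def fused_dependency(tables, features, delta):
    """Assumed fusion: summed positive-region sizes over summed universe sizes."""
    pos = sum(positive_region_size(X, y, features, delta) for X, y in tables)
    total = sum(len(y) for _, y in tables)
    return pos / total

# Hypothetical data: three instances, one feature, three labels d1, d2, d3.
X = np.array([[0.0], [0.1], [1.0]])
Y = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1]])
tables = decompose(X, Y)          # DT_1, DT_2, DT_3 share X but differ in label
gamma = fused_dependency(tables, [0], delta=0.25)
print(len(tables), gamma)
```

Every table shares the same feature matrix, so the original approximation space is left untouched; only the decision column changes from table to table.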
Theorem 5: In a family of single-label decision systems $F$ corresponding to $MDT$, the F-neighborhood attribute dependence degree satisfies $\gamma(F, B, D) \le \gamma(F, A, D)$ for any $B \subseteq A$.
Proof 1: The conclusion follows from Theorem 3 of [13].
According to Law 1 in [30], $\gamma(F, B, D)$ can be used as an attribute reduction criterion, which leads to the following theorem.
(2) Second, the second condition is proved. Assume that there exists $S \subset B$ such that $\gamma(F, S, D) = \gamma(F, A, D)$. According to (1), it is easy to obtain $POS(F, S, D) = POS(F, A, D)$; that is, $S$ is a parallel reduct of $F$, which contradicts the assumption that $B \subseteq A$ is a parallel reduct of $F$.
Definition 7: In a family of decision systems $F$ corresponding to $MDT$, let $DT_i = (U, A, d_i) \in F$ $(i = 1, 2, \cdots, n)$ and $B \subseteq A$. The attribute significance of $a$ to $B$ is defined separately for $a \in B$ and for $a \in A - B$.
(2) We only prove the second condition. Suppose, to the contrary, that the attribute significance matrix $H[B, F]$ has an all-zero column, say the $j$th column without loss of generality, while $B \subseteq A$ is a parallel reduct of $F$.
In order to obtain attribute reducts, it is necessary to define an improved version of the matrix $H$. The improved matrix is defined as follows.
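Definition 10: Let $F$ be a family of decision systems corresponding to the multi-label decision system $MDT$. The improved attribute significance matrix of $B$ is defined as the $n \times m$ matrix
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nm} \end{bmatrix} \tag{12}$$
which is an improved matrix of $H$. If $a_j \in B$, then $\sigma_{ij} = 0$. This means that the more attributes $B$ contains, the sparser the improved matrix becomes, until $POS(F, B, D) = POS(F, A, D)$ and the improved matrix becomes a zero matrix. This is an incremental reduction process.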
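The two matrices can be sketched in Python once a per-table dependence function is available. The sketch below assumes a hypothetical callable `gamma(i, B)` that returns the dependence degree of label $d_i$ on the attribute set B in $DT_i$, and it assumes that an entry of $H[B, F]$ is the dependence lost when an attribute is removed from B, while an entry of the improved matrix is the dependence gained when an attribute outside B is added (zero for attributes already in B, as stated in Definition 10). `toy_gamma` is invented purely for illustration.

```python
import numpy as np

def significance_matrix(gamma, n_tables, B):
    """H[B, F]: entry (i, j) = gamma_i(B) - gamma_i(B without its j-th attribute)."""
    B = list(B)
    H = np.zeros((n_tables, len(B)))
    for i in range(n_tables):
        full = gamma(i, frozenset(B))
        for j, a in enumerate(B):
            H[i, j] = full - gamma(i, frozenset(B) - {a})
    return H

def improved_matrix(gamma, n_tables, n_attrs, B):
    """Improved matrix: 0 for a_j in B, else gamma_i(B + a_j) - gamma_i(B)."""
    B = frozenset(B)
    H = np.zeros((n_tables, n_attrs))
    for i in range(n_tables):
        base = gamma(i, B)
        for a in range(n_attrs):
            if a not in B:
                H[i, a] = gamma(i, B | {a}) - base
    return H

def toy_gamma(i, B):
    """Hypothetical dependence: table 0 needs attribute 0; table 1 needs 1 or 2."""
    if i == 0:
        return 1.0 if 0 in B else 0.0
    return 1.0 if (1 in B or 2 in B) else 0.0

print(significance_matrix(toy_gamma, 2, [0, 1, 2]))
print(improved_matrix(toy_gamma, 2, 3, [0]))
```

In the toy case only the column of attribute 0 is non-zero in $H[A, F]$, because attributes 1 and 2 can replace each other in the second table; the improved matrix for B = {0} then shows that either of them still adds information.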
When Theorem 9 is applied to obtain a parallel reduct from a multi-label decision system, FNPRMS is expressed as Algorithm 1.
In FNPRMS, Step 2 is to get the attribute core of F, and Step 4 is to obtain a parallel reduct with incremental methods. The parallel reduct B starts from the empty set.
First, the attribute significance matrix $H$ is created to calculate the core of parallel reducts, which contains all the attributes corresponding to the non-zero columns. Then the attributes whose significance is not zero in some subsystem (that is, the columns of $H$ that contain non-zero elements) are added to $B$. Finally, the improved attribute significance matrix is calculated repeatedly, and each time the attribute corresponding to the column with the maximal number of non-zero elements is added to $B$, until the improved matrix is a zero matrix. The algorithm guarantees that attributes which affect the positive region will not be deleted.
The time complexity of FNPRMS is mainly composed of constructing the attribute significance matrix and the improved significance matrices. The attribute significance degree is calculated in the same way as in [5], and its time complexity is $O(m|U|\log|U|)$. Let $|U|$ denote the number of instances in the decision table, $m$ the number of condition attributes, and $n$ the number of labels (also the number of decision tables); then the time complexity of creating an attribute significance matrix is $O(nm^2|U|\log|U|)$. In the worst case, the number of improved matrices is $m$, so the time complexity of FNPRMS is $O(nm^3|U|\log|U|)$.
The attribute significance in Section 3.1 can also be employed to obtain a parallel reduct, but the result is similar to that of FNPRMS, so it is omitted for simplicity. However, there is a difference between the attribute significance and the attribute significance matrices when they are employed to obtain a parallel reduct. The former index may be affected by the number of instances, but the latter is not. That is, all decision tables are treated equally in the matrices.
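The forward heuristic described above can be sketched as a short loop. This is a minimal sketch, not Algorithm 1 itself: it assumes that the matrix $H[A, F]$ is given as a NumPy array and that a hypothetical callable `improved_matrix_of(B)` returns the improved matrix for the current subset B. The tolerance `tol` and the tie-breaking rule (first column with the maximal count) are also assumptions.

```python
import numpy as np

def fnprms_sketch(core_matrix, improved_matrix_of, tol=1e-12):
    """Start from the core, then greedily add the attribute whose column
    of the improved matrix has the most non-zero entries."""
    m = core_matrix.shape[1]
    # Core: attributes whose column in H[A, F] is not all zero.
    B = [j for j in range(m) if np.any(np.abs(core_matrix[:, j]) > tol)]
    for _ in range(m):                      # at most m improved matrices
        H = improved_matrix_of(B)
        counts = np.count_nonzero(np.abs(H) > tol, axis=0)
        if counts.max() == 0:               # zero matrix: nothing left to gain
            break
        B.append(int(np.argmax(counts)))    # ties go to the first such column
    return B

def toy_improved(B):
    """Hypothetical improved matrix for 2 tables and 3 attributes:
    table 0 depends on attribute 0, table 1 on attribute 1 or 2."""
    H = np.zeros((2, 3))
    if 0 not in B:
        H[0, 0] = 1.0
    if 1 not in B and 2 not in B:
        H[1, 1] = H[1, 2] = 1.0
    return H

toy_core_matrix = np.array([[1.0, 0.0, 0.0],
                            [0.0, 0.0, 0.0]])
reduct = fnprms_sketch(toy_core_matrix, toy_improved)
print(reduct)
```

Counting non-zero entries per column, rather than summing them, reflects the remark that every decision table carries equal weight in the matrices. The loop bound of m iterations mirrors the worst case used in the complexity analysis.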

V. EXPERIMENTS
A. DATA SETS
In this section, experiments are designed to compare the performance of our method FNPRMS with state-of-the-art methods. Nine multi-label data sets are selected from the Mulan Library [1]. The descriptions of the nine data sets are given in Table 5.

B. EXPERIMENT SETTINGS
In order to verify the validity and feasibility of FNPRMS, experiments are designed to compare it with six other multi-label feature selection algorithms: MLNB [45], MDDMspc [44], MDDMproj [44], PMU [16], MLFRS [23] and MDFS [42]. The parameter smooth in MLNB is set to 1, as recommended in [45]. The parameter η in MDDMspc and MDDMproj is set to 0.5, as recommended in [44]. The number of labels in MLFRS is set to 3, and its experimental results are taken from [23]. For FNPRMS, the neighborhood radius is set to 0.25. In order to make a fair comparison with MLFRS, we use the same training set and test set as [23]. Meanwhile, ML-kNN [44] is used to evaluate the classification performance of all algorithms, and the parameter k is set to 10, the same as in [23]. With ML-kNN, six metrics are used to measure the performance of feature selection: Average Precision, Hamming Loss, Coverage, One-error, Ranking Loss and Micro-F1. Hamming Loss and Micro-F1 are label set prediction metrics, while Average Precision, Coverage, One-error and Ranking Loss are label ranking metrics. These algorithms are executed in Matlab R2018a.
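To illustrate the difference between a label set prediction metric and a label ranking metric, the following Python sketch computes Hamming Loss and One-error under their commonly used definitions: Hamming Loss is the fraction of instance-label pairs that are predicted wrongly, and One-error is the fraction of instances whose top-ranked label is not relevant. The function names and the toy arrays are assumptions for illustration; the experiments themselves were run in Matlab.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of instance-label pairs on which prediction and truth differ."""
    return float(np.mean(Y_true != Y_pred))

def one_error(Y_true, scores):
    """Fraction of instances whose highest-scored label is not a true label."""
    top = np.argmax(scores, axis=1)
    return float(np.mean(Y_true[np.arange(len(top)), top] == 0))

# Hypothetical ground truth, predicted label sets and label scores.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
Y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])
scores = np.array([[0.9, 0.2, 0.4],
                   [0.6, 0.5, 0.1]])
print(hamming_loss(Y_true, Y_pred), one_error(Y_true, scores))
```

Hamming Loss only needs the predicted label sets, whereas One-error needs the real-valued scores that rank the labels; for both, smaller values are better.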

C. EXPERIMENTAL RESULT
In the experiment, FNPRMS, MLNB and MLFRS directly output feature subsets, so these three algorithms can be compared fairly under the same criteria. However, the feature selection results of MDDMspc, MDDMproj, PMU and MDFS are ranked feature lists. For comparability with FNPRMS, the top-ranked features, as many as FNPRMS selects, are taken as their final feature subsets. Tables 6-11 show the predictive performance of the algorithms in Average Precision, Hamming Loss, Coverage, One-error, Ranking Loss and Micro-F1, respectively. In these tables, (↑) and (↓) mean 'the bigger, the better' and 'the smaller, the better', respectively, and the bold numbers indicate the best predictive classification performance on each data set.
From Tables 6-11, the following conclusions can be drawn: (1) For Average Precision, FNPRMS has optimal or suboptimal performance on most data sets, and it is very close to the optimal performance on Computers and Education. (2) For Hamming Loss, FNPRMS has optimal performance on three data sets, and it is only slightly worse than MDFS or MLNB on the other data sets. (3) For Coverage, One-error and Ranking Loss, FNPRMS is superior to the other feature selection algorithms on at least four data sets, and it also achieves good performance on the remaining data sets. (4) For Micro-F1, FNPRMS, MLNB, MLFRS and PMU each achieve the best result on two data sets, and MDFS has optimal performance on the remaining data sets. Moreover, FNPRMS achieves suboptimal results on four of the data sets where it is not optimal. The experimental results in Tables 6-11 show that FNPRMS has excellent performance.
In order to compare these algorithms fully, several further experiments are conducted on MLNB, MDDMspc, MDDMproj, PMU, MDFS and FNPRMS. For FNPRMS and MLNB, feature subsets of different sizes are obtained by modifying parameters. For MDDMspc, MDDMproj and PMU, different numbers of features are kept. Figures 2-5 show the performance of the six algorithms under the different evaluation metrics on Arts, Business, Computers and Scene. The horizontal axis represents the number of retained features, and the vertical axis represents the performance under each metric. As shown in Figures 2-5: (1) All six algorithms can select features effectively, and the performance under each metric gradually improves as the number of retained features increases. (2) Compared with MLNB, MDDMspc, MDDMproj and PMU, FNPRMS has obvious advantages. FNPRMS is slightly worse than MDFS on smaller feature subsets, but it catches up with or even outperforms MDFS on bigger feature subsets. In particular, FNPRMS has an evident advantage over MDFS on Scene. (3) FNPRMS has good robustness and can be applied to both text data and image data.
In short, FNPRMS has competitive advantages over state-of-the-art methods.

VI. CONCLUSION
Feature selection is an effective way to remove redundant features, and removing redundant attributes can not only improve classification performance but also reduce classification costs. In this paper, a new multi-label rough set model and a new multi-label feature selection algorithm (FNPRMS) are introduced, which can effectively and efficiently reduce the redundant features in multi-label data. FNPRMS combines the advantages of F-rough sets and neighborhood rough sets, and addresses the classical problem of the relation among labels in multi-label learning. Experimental results show that FNPRMS has competitive advantages over state-of-the-art methods on nine Mulan multi-label data sets [1].
Moreover, the results of FNPRMS are more objective and understandable.
In future work, we will study more effective and efficient algorithms for multi-label learning.
YIRAN HE received the bachelor's degree from Fujian Normal University, in 2018. She is currently pursuing the master's degree with Zhejiang Normal University. Her research interest covers generative adversarial networks.
DAWEI ZHANG received the bachelor's degree from the Huaiyin Institute of Technology, in 2017. He is currently pursuing the Ph.D. degree with Zhejiang Normal University. His research interest covers computer vision and deep learning.