Lattice and Imbalance Informed Multi-Label Learning

In a multi-label dataset, an instance is given a single representation across all possible labels. Although the labels share the same instances, the membership of the instances varies from label to label. This diversifies the intrinsic class geometries of the labels. Multi-label datasets are often found to be class-imbalanced as well. The varying membership of the instances, coupled with the imbalance phenomenon, gives rise to varying imbalance ratios across the labels. We address these two key aspects in this work, Lattice and Imbalance Informed Multi-label Learning (LIIML), in a two-step procedure. First, we obtain the imbalance ratios and the intrinsic positive and negative class lattices of each label. We capitalize on these two pieces of information to obtain a dedicated feature set for each label. In the second step, to handle the class imbalance further, we employ a scheme of imbalance-adaptive misclassification costs across the labels. We have evaluated the proposed method in a generic as well as a class-imbalanced framework. The empirical study establishes the competence of the proposed method in both contexts.


I. INTRODUCTION
Contemporary datasets differ from traditional datasets in a number of aspects, the multi-label nature of the data being one of them. In multi-label datasets, a single instance in a given input space can belong to one or more of the possible class labels. The need for efficient processing of multi-label data is backed by the availability of datasets with multi-label characteristics from several real-world applications. Beginning with text categorization by [1] and [2], data with multi-label characteristics have emerged from different domains, namely images [3], [4], music [5], bioinformatics [6], chemical data analysis [7], tag recommendation systems [8] and video [9]. Consequently, multi-label classification and learning have drawn the attention of the data science community. Let a multi-label dataset be denoted by $D = \{(x_i, Y_i),\ i = 1, 2, \ldots, n\}$ and let the label set cardinality be $L$, so that $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{iL}\}$. Let us assume that each label has exactly two classes, positive (1) and negative (0); that is, $y_{ij}$ can be either 1 or 0, $j = 1, 2, \ldots, L$. An instance $x_i$ has to be rightfully classified into either the positive (1) or the negative (0) class for each of the $L$ labels.
The associate editor coordinating the review of this manuscript and approving it for publication was Derek Abbott.
In a multi-label dataset, a single set of instances, though sharing the same representation across the labels, can belong to different classes for different labels. This leads to a variable class partition of the same instance set across different labels. To tackle this aspect of multi-label datasets, selecting dedicated and discriminating features for different labels has been a popular choice. A number of works follow this paradigm; their details can be found in [10].
Class imbalance is the quantitative disproportion between the numbers of instances belonging to the possible classes of a dataset. For binary classification problems, the classes with the higher and the lower share of instances are called the majority class and the minority class respectively. The imbalance ratio is the ratio of the number of instances in the majority class to the number of instances in the minority class. Let us assume that we have exactly two classes, positive (1) and negative (0), for each label. In multi-label datasets, the positive class is underrepresented for most of the labels in a dataset. This issue is further compounded by the varying class membership of the instances across different labels. A natural outcome is a differential degree of class imbalance for different labels. For example, in the yeast dataset [11] (with 14 labels), the minimum and maximum imbalance ratios are 1.32 (for label 12) and 50.74 (for label 14) respectively. A single framework with a single set of parameters may not work well across two such diverse labels. Figure 1 illustrates this phenomenon on a toy multi-label dataset.
FIGURE 1. This illustration depicts the phenomenon of variable class geometries and varying imbalance ratios in a toy multi-label dataset. The dataset comprises 23 two-dimensional points. Fig. A shows the set of feature points. We assume this dataset to have 3 binary labels: 1, 2 and 3. Each instance can belong to the positive or the negative class with respect to labels 1, 2 and 3 individually. We have used pink and green to represent positive and negative class memberships of a point. Figures 1, 2 and 3 show the classification of the given instances with respect to labels 1, 2 and 3 in order. These figures show that the membership of these instances varies from label to label. For example, the instance marked x is positive for labels 1 and 2 while negative for label 3. On the other hand, instance y is positive with respect to label 1 only. The consequence of this varying membership is two-fold. First, as Figures 1, 2 and 3 indicate, we get varying positive and negative class geometries for different labels. Second, out of the 23 given points, labels 1, 2 and 3 have 5, 12 and 10 positive points respectively. Accordingly, their imbalance ratios are 3.6, 0.92 and 1.3 in order. So, we get a set of varying imbalance ratios across different labels by virtue of the differential class membership of the instances. Labels 2 and 3 have somewhat similar imbalance ratios (0.92 and 1.3 respectively). Despite that, the class geometries of these two labels are fairly diverse. These observations serve as the motivation of this work. We address the issue of differential class geometry as well as varying imbalance ratio to have a fruitful learning of multi-label datasets.
In this work, we employ feature extraction followed by an imbalance-adaptive cost-sensitive classification to learn multi-label datasets. We propose that, while handling the class imbalance of a dataset, we should not overlook or distort its original class geometry. A standard solution to the class-imbalance problem is undersampling the majority class or oversampling the minority class. These two techniques modify the set of representative points of a dataset and can distort its original class geometries. We propose a two-step procedure for obtaining a geometry-preserving and imbalance-aware multi-label learner. In the first part of our work, we extract a dedicated feature set from the intrinsic class lattice of the labels. We obtain the overall structure of a dataset from its Relative Neighborhood Graph (RNG). Next, we detect the regions of homogeneous class membership in the RNG and select the label-specific lattice points from those. To preserve the original class geometry, we select differing numbers of positive and negative lattice points for the labels. When the cardinalities of the positive and negative classes differ significantly, selecting an equal number of lattice points from both may not preserve the class geometries. It can lead to a distorted representation, which in turn may affect the learning. For a label which is well balanced across the positive and negative classes, selecting an equal number of lattice points for both classes can work well. But for an imbalanced label (with more negative points), we have to select more negative lattice points than positive ones for a proper representation. The ratio of the number of negative lattice points to the number of positive lattice points for a label depends on its imbalance ratio. This helps us preserve the original class geometries. Next, as in LIFT [12], we compute a distance-based feature extraction for the points.
The extracted features are used to model the set of classifiers (one for each label) and predict the test data.
Class imbalance is a fundamental feature and issue of multi-label datasets. Cost-sensitive learning [13] was one of the earliest techniques for tackling class imbalance and consequently detecting the 'hard-to-learn' positive or minority instances. To nullify the natural bias of the classifier towards the quantitatively abundant majority class, a higher misclassification penalty is set for the quantitatively scarce minority class. The main goal is to bias the classifier towards identifying the minority samples.
In the second part of our work, to address differential class imbalance further, we adopt a cost-sensitive learning scheme where the misclassification cost is adaptive to the imbalance ratio of the labels. As said earlier, a multi-label dataset has differing imbalance ratios across the labels. In such a situation, selecting a single misclassification cost for the minority class across all labels will not yield proper learning. Instead, we select a set of misclassification cost values for the minority (positive) class, one for each label. Between two labels with differing degrees of imbalance, we set a higher misclassification penalty for the one with the higher imbalance.
We summarize the contributions of this paper as follows.
• We propose a scheme which works on two perspectives of multi-label learning: (i) dedicated feature extraction and (ii) handling differential class imbalance.
• For feature extraction, intrinsic class geometries of the labels are explored. The concept of Relative Neighborhood Graph is used for capturing the class geometries.
• To tackle the differential class-imbalance ratios of the labels, we adopt a simple yet effective imbalance-adaptive misclassification penalty across the labels.
• The efficacy of the feature extraction scheme of LIIML is demonstrated empirically on 11 real-world multi-label datasets against generic multi-label learners. The results indicate the competitiveness of the proposed scheme.
VOLUME 8, 2020
• From the perspective of class imbalance, the proposed imbalance-adaptive misclassification cost scheme has given remarkable improvement in multi-label performance. In the empirical study, we have compared it with the class-imbalance dedicated multi-label learners COCOA and RNNOML, three more generic multi-label learners, and the class-imbalance learners SMOTE, USAM and RML.
• The effectiveness of the imbalance-adaptive misclassification cost is also demonstrated on two extant first-order works, namely Binary Relevance [14] and LIFT [12].
In the next section, we present the literature review. In Sections III and IV, we present the approach and the algorithm of our work respectively. Sections V and VI present the experimental study and the experimental results respectively. The article is wrapped up by the Conclusions in Section VII.

II. RELATED WORK
Multi-label learning methods are broadly classified into two types: the Problem Transformation (PT) approach and the Algorithm Adaptation (AA) approach [15]. On the other hand, studies such as the one in [16] divide multi-label learners into three groups, namely Problem Transformation, Algorithm Adaptation and Ensembles of multi-label classifiers.
Problem transformation approaches modify or decompose a multi-label dataset to fit it in a framework of regular decomposition. Depending on the number of decompositions and the number of labels involved in a classifier, this class of learners is further sub-divided into first-order, second-order and higher-order paradigms [17]. In first-order PT, only one label is involved in a classifier, while in second-order and higher-order approaches, two labels and more than two labels are involved in a classifier respectively. Notable problem transformation approaches are Binary Relevance [14], power set of labels [3], pruned problem transformation [18] and calibrated label ranking [19]. Binary Relevance (BR) is the most primitive form of the PT approach, where a series of binary classifiers is generated, one for each label. Though computationally sound (linear in the label cardinality), BR is criticized for its inability to capture label correlations [20]. The solution proposed in [3] accommodated label correlations by employing Label Powerset. It generated $2^8 = 256$ classifiers for learning 8 labels. Despite involving label correlations, the scheme lacked computational feasibility as it generated an exponential number of classifiers. A more feasible approach was given in RAKEL [21], which considered random subsets of labels. The Calibrated Label Ranking [19] scheme provided multi-label outputs on the basis of pair-wise classification, considering a synthetic label to distinguish the relevant and irrelevant groups of labels. Ensembles of classifiers like RAKEL [21], ensembles of classifier chains [22]-[24] and ensembles of pruned sets are also popular and effective in learning multi-label datasets. In addition to these, a number of feature selection and extraction methods transform the features in the context of each label and follow a first-order approach to complete the learning. They are discussed in detail in the next paragraph.
In the Algorithm Adaptation approach, an existing classifier is adapted to the multi-label scenario. Quite a number of classifier paradigms, like the k-nearest neighbor classifier [25], the naive Bayes algorithm [26] and back-propagation neural networks [27], have been adapted to facilitate multi-label learning. In [25], the k-nearest neighbors of a test point are identified. Following that, their label configurations and the maximum a posteriori principle are used to determine the label predictions of the test instance. In [27], the usual back-propagation algorithm is used with small modifications to accommodate the multi-label characteristics. The error function of the back-propagation algorithm is replaced with a ranking-loss minimization function, which operates on the fact that a relevant label of an instance is ranked higher than a label to which the instance does not belong. Another scheme [28] uses the cross-entropy error function in a back-propagation neural network to facilitate multi-label learning. Table 1 outlines the basic principles of a number of state-of-the-art multi-label methods.
Apart from the above, multi-label datasets are analyzed from newer perspectives like feature or dataset preprocessing and the class distribution of the labels. Label-specific feature extraction was proposed in LIFT [12]. In LIFT, following the clustering of the positive and negative classes of each label, the authors extract a label-specific feature set. Feature selection is also done by a number of works on the basis of the class characteristics of the labels. Works dealing with feature extraction and selection for multi-label datasets include [29] and [30]. A detailed account and comparative analysis of the extant works on multi-label feature extraction and selection in the first-order framework can be found in [10]. Joint feature selection and classification (JFSC) [31] and [32] perform label-correlated feature selection for multi-label datasets. The class distribution of the various labels of a multi-label dataset is a potential source of pertinent information. As said earlier, multi-label datasets are class-imbalanced, and the degree of imbalance differs across labels. COCOA [33] has addressed the imbalance issue by considering an ensemble of classifiers using pair-wise label correlation. A few more works, [34] and [35], have used a cost-adaptive paradigm to address multi-label problems. In [36], the authors have integrated the data gravitational model with a multi-label lazy learner to improve the minority class performance on imbalanced multi-label datasets. In our very recent work [37], we have used reverse-nearest-neighborhood oversampling to curb the problem of differential imbalance in multi-label datasets. Several other techniques like convex relaxation [38], ensembles of random graphs [39] and graph classification [40] are also employed to address multi-label classification.
The proposed method LIIML is a hybrid method which (i) involves a dedicated feature extraction for the labels (like LIFT [12] and JFSC [31]) and (ii) adapts cost-sensitive learning to the multi-label context. From the first perspective, LIIML is a first-order PT method, and from the latter it is an algorithm-adaptation method.

III. APPROACH

A. MULTI-LABEL NATURE OF DATA, ITS CONSEQUENCES AND OUR THOUGHTS
A multi-label dataset is characterized by the membership of a set of feature points to more than one label. A class-imbalanced dataset is typified by the quantitative disproportion in the number of instances representing its classes. Multi-label datasets are often found to be class-imbalanced. In this work, we deal with binary multi-label datasets where each label can take exactly one of two classes (1 for the positive class and 0 for the negative class). Typically, class 1 and class 0 are the minority and majority classes respectively. The class membership of each instance varies from label to label. An instance which is positive for some label A can be negative for some other label B. A similar phenomenon for all the instances leads to different combinations of positive and negative sets for each label (even though the union of the positive and negative sets of instances is the same for all labels). From a spatial perspective, this leads to variable positive and negative class geometries and variable class boundaries for the labels. Figure 1 illustrates this phenomenon. The quantitative consequence of this phenomenon is the variable degree of class imbalance across the labels. The key idea of this work is to design an imbalance-informed scheme which also takes into account the differential class geometries of different labels. For each label, we will extract an imbalance-informed feature set from the positive and negative class geometries of that label. The learning is wrapped up with a set of cost-sensitive, first-order learners, one for each label.

B. EXTRACTING THE CLASS GEOMETRIES OF LABELS
Our first goal is to extract the positive and negative class geometry of each label. To perceive the geometry, we generate a Relative Neighborhood Graph (RNG) of the entire training set, where the edge weights are the distances between the points. The RNG shows the connectivity of a data point (vertex) to its adjacent neighborhood, and the interconnectivity of the points gives the overall anatomy of the feature points. An RNG of G is an undirected graph defined from G where $x_i$ and $x_j$ are connected whenever there is no third point $x_k$ such that $\max(d(x_i, x_k), d(x_j, x_k)) < d(x_i, x_j)$, where $d$ denotes the distance between two points. For a given set of points, its MST is a subgraph of its RNG. We may note that this Tree will be the same for all the labels. But the membership of the vertices (the data points) varies from label to label and leads to differential positive and negative class structures for the labels. Let $X = \{x_1, \ldots, x_n\}$ be the training data and $Y_i = \{y_{i1}, \ldots, y_{iL}\}$ be the class label vector associated with instance $x_i$. We have assumed that there are $L$ labels in the dataset.
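The RNG construction can be sketched with a brute-force check of the standard RNG condition. This is only an illustrative sketch, assuming Euclidean distance and small inputs; the function name `relative_neighborhood_graph` and the toy points are hypothetical and are not taken from the authors' implementation.

```python
from itertools import combinations
from math import dist


def relative_neighborhood_graph(points):
    """Return the RNG edges as a dict {(i, j): weight} with i < j.

    Brute-force O(n^3) sketch: i and j are joined unless some third
    point k is strictly closer to both of them than they are to each other.
    """
    n = len(points)
    edges = {}
    for i, j in combinations(range(n), 2):
        d_ij = dist(points[i], points[j])
        blocked = any(
            max(dist(points[i], points[k]), dist(points[j], points[k])) < d_ij
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges[(i, j)] = d_ij  # edge weight = distance between the points
    return edges


# Toy example (hypothetical data): three collinear points and one far point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (1.0, 5.0)]
rng_edges = relative_neighborhood_graph(points)
print(sorted(rng_edges))  # [(0, 1), (1, 2), (1, 3)]
```

The graph depends only on the feature points, so it would be built once and reused for every label; only the class memberships attached to its vertices change.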
We will extract the positive and negative class geometries (with respect to each label) from the RNG. To do so, we need to look at the membership of the vertices to each label. For a label, the membership of a vertex can be either positive (1), if it belongs to that label, or negative (0), if it does not. The class memberships of the data points will likely vary across the labels.
Let us consider an edge $e_{ij}$ between two vertices, $v_i$ and $v_j$. If the class memberships of $x_i$ and $x_j$ with respect to label $k$ are the same (both 0 or both 1), we term edge $e_{ij}$ a homogeneous edge. If $y_{ik}$ and $y_{jk}$ (the memberships of $x_i$ and $x_j$) are both 1, we term $e_{ij}$ a positive homogeneous edge. If the class memberships ($y_{ik}$ and $y_{jk}$) are both 0, we call it a negative homogeneous edge. So, for each label, we will have a set of homogeneous edges which is a subset of the RNG edges. We can further partition this homogeneous edge set into two mutually exclusive sets of positive homogeneous edges and negative homogeneous edges. We will extract the positive and negative class lattices of the label from its respective sets of homogeneous edges. Homogeneous edges lie in regions of the same class membership. A homogeneous edge (belonging to a certain class) with a smaller weight will likely be a better representative of that class than another with a higher weight. This is because, with increasing edge weight, the vertices associated with the edge become sparser in the feature space and eventually overlap with the vertices associated with a different class. In contrast, a vertex associated with a shorter edge has another vertex in its vicinity which affirms its class membership. Hence, to get the positive class lattice for a label, we arrange the positive homogeneous edges in increasing order of their weights. For a certain label $k$, we select $N_{Pk}$ positive homogeneous edges in increasing order of their weights and compute their midpoints. The set of $N_{Pk}$ midpoints represents the positive lattice of label $k$. Similarly, we compute $N_{Nk}$ lattice points to represent the negative lattice of label $k$.
In case of extreme imbalance when we do not get any positive homogeneous edge, we select the positive points themselves to represent the positive lattices.
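The selection of lattice points from homogeneous edges can be sketched as follows. This is a minimal illustration under stated assumptions: the edge dictionary is assumed to come from an RNG built beforehand, the helper name `lattice_points` is hypothetical, and applying the "use the points themselves" fallback to either class (and truncating it to `count` points) is our assumption, since the text describes the fallback for the positive class only.

```python
def lattice_points(points, labels, edges, target, count):
    """Midpoints of the `count` shortest edges whose endpoints both have class `target`.

    points : list of coordinate tuples
    labels : list of 0/1 memberships for ONE label
    edges  : dict {(i, j): weight}, e.g. the RNG edges
    """
    homogeneous = [
        (w, i, j)
        for (i, j), w in edges.items()
        if labels[i] == target and labels[j] == target
    ]
    homogeneous.sort()  # increasing edge weight: shorter edges come first
    if not homogeneous:
        # Fallback for extreme imbalance: use the class points themselves.
        return [points[i] for i, y in enumerate(labels) if y == target][:count]
    return [
        tuple((a + b) / 2 for a, b in zip(points[i], points[j]))
        for _, i, j in homogeneous[:count]
    ]


# Toy example (hypothetical data): edge (1, 2) joins different classes
# and is therefore ignored for both lattices.
points = [(0, 0), (1, 0), (3, 0), (3, 4)]
labels = [1, 1, 0, 0]
edges = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 4.0}
positive_lattice = lattice_points(points, labels, edges, target=1, count=5)
negative_lattice = lattice_points(points, labels, edges, target=0, count=5)
print(positive_lattice, negative_lattice)  # [(0.5, 0.0)] [(3.0, 2.0)]
```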
We determine the values of $N_{Pk}$ and $N_{Nk}$ in light of the degree of class imbalance of label $k$. Let the degree of imbalance of label $k$ be $\mathrm{imb}_k$. For the negative class (generally the majority class) of label $k$, we set $N_{Nk} = N_{Pk} \times (\log_2(\mathrm{imb}_k) + 1)$. The logarithm lets the number of negative lattice points grow with the imbalance in a controlled manner. Let us consider two scenarios to analyse this aspect. If there is no imbalance between the two classes of label $k$, that is $\mathrm{imb}_k = 1$, the value of $\log_2(\mathrm{imb}_k) + 1$ is 1 and we select $N_{Pk}$ points for the negative class. On the contrary, if $\mathrm{imb}_k$ is 16 (the dataset is highly imbalanced with respect to label $k$), $\log_2(\mathrm{imb}_k) + 1$ is 5 and we select $5 \times N_{Pk}$ points to represent the negative class. We can also select different values of $N_{Pk}$ and $N_{Nk}$. We present a discussion in Remark 1 at the end of this section.
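The two scenarios above can be reproduced with a few lines of code. This is only a sketch: the function name is hypothetical, and rounding to the nearest integer is our assumption, since the text does not say how non-integer values are handled.

```python
from math import log2


def negative_lattice_count(n_pos, imb):
    """Number of negative lattice points, N_Nk = N_Pk * (log2(imb_k) + 1).

    Rounding to an integer is an assumption made for this sketch.
    """
    return round(n_pos * (log2(imb) + 1))


print(negative_lattice_count(100, 1))   # 100: balanced label, equal lattice sizes
print(negative_lattice_count(100, 16))  # 500: five times as many negative points
```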

C. EXTRACTING THE FEATURES
Now, we extract the features for each label. For that, we obtain the distances of a data point from the sets of positive and negative lattice points of a label. In order to make the positive information stand out in a pool of negative data, we multiply the distances from the positive lattice points by the class-imbalance ratio of that label. These distances give the imbalance-informed mapping of a data point for that label. The set of $N_{Pk}$ positive distances and $N_{Nk}$ negative distances gives the transformed mapping of $x$ with respect to label $k$.
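A minimal sketch of this mapping for numeric data is given below. It assumes Euclidean distance and precomputed lattice points; the function name `label_specific_features` and the toy values are hypothetical.

```python
from math import dist


def label_specific_features(x, pos_lattice, neg_lattice, imb):
    """Transformed mapping of x for one label.

    Distances to positive lattice points are scaled by the label's
    imbalance ratio; distances to negative lattice points are kept as is.
    """
    pos_part = [imb * dist(x, m) for m in pos_lattice]
    neg_part = [dist(x, m) for m in neg_lattice]
    return pos_part + neg_part


# Toy example (hypothetical data): one positive and two negative lattice points.
z = label_specific_features(
    x=(0.0, 0.0),
    pos_lattice=[(3.0, 4.0)],
    neg_lattice=[(0.0, 1.0), (6.0, 8.0)],
    imb=2.0,
)
print(z)  # approximately [10.0, 1.0, 10.0]
```

The resulting vector has one component per lattice point, so its length differs from label to label whenever the lattice sizes differ.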

D. HANDLING IMBALANCE FURTHER -COST SENSITIVE CLASSIFICATION
Cost-sensitive learning is one of the ways of handling imbalanced data. As stated earlier, in multi-label datasets, the degree of imbalance varies across labels. After generating the imbalance-informed representations for each data point, we proceed with a cost-sensitive linear SVM based classification. Let the misclassification cost of a minority instance to the majority class for label $k$ be denoted by $\mathrm{Cost}_k$. To improve the detection of the minority class (generally the positive class), for label $k$, the value of $\mathrm{Cost}_k$ is fixed to $cf \times (\log_2(\mathrm{imb}_k) + 1)$. $cf$ is a cost factor whose increasing value gives an increasing misclassification cost for the minority class. Remark 4 discusses the choice of $cf$. The misclassification cost of a majority instance to the minority class is set to 1 for all labels. $\mathrm{Cost}_k$ increases with the imbalance value of a label and is adaptive to the diverse ranges of imbalance values in a single dataset. For a label which has no imbalance, that is, whose imbalance ratio is 1, the misclassification cost of both classes (no class is minority or majority, to be precise) is 1. For an imbalanced label with $\mathrm{Cost}_k > 1$, the misclassification cost of the minority instances to the majority class is greater than 1 and increases with the imbalance value. Hence, we have an imbalance-informed misclassification cost for each label. The $\log_2$ function allows us to restrict the misclassification costs within an admissible yet varying limit depending on the imbalance ratios.
Remarks:
1) Values of $N_{Pk}$ and $N_{Nk}$: The numbers of lattice points for the positive class and the negative class are given by $N_{Pk}$ and $N_{Nk}$ respectively. The cardinality of the extracted feature set increases with the number of lattice points. Increasing the number of lattice points gives the learner better discriminative and classification capability, but this is accompanied by an increase in computational complexity. While setting the values of $N_{Pk}$ and $N_{Nk}$, we have to make a trade-off between complexity and performance. Experiment 4 in the empirical study explores this aspect.
2) Distance function used: We have used the Euclidean distance and the Jaccard distance for numeric and nominal datasets respectively.
3) Learning with cost: The scheme that we have presented here can be carried out in an equal misclassification cost framework as well as in an enhanced cost (for misclassifying minority instances) framework, the latter presented in Section III-D. The equal misclassification cost variant shows the intrinsic capability of the scheme. The enhanced cost variant helps us handle the class imbalance of the labels better. We report the outcomes of both variants to investigate the efficacy of our method in both contexts. The utility of the enhanced cost can also be investigated on existing multi-label learners like LIFT and BR. We have presented a study on this aspect.
4) Choice of $cf$: In Experiment 4, we have explored the choice of the $cf$ value. An increase in $cf$ gives an increased misclassification cost for the minority class.
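The imbalance-adaptive cost can be illustrated with a short sketch. The function name `minority_cost` is hypothetical; the lower bound of 1 follows the form of the cost given in the Algorithm section, and the per-label weight dictionary at the end is only one assumed way of handing such costs to a cost-sensitive classifier.

```python
from math import log2


def minority_cost(imb, cf=1.0):
    """Misclassification cost of a minority instance for one label.

    cf * (log2(imb) + 1), bounded below by cf so that a label with
    imb < 1 is never penalized less than a balanced label.
    """
    return cf * max(log2(imb) + 1, 1.0)


print(minority_cost(1))     # 1.0: balanced label, equal costs
print(minority_cost(16))    # 5.0: highly imbalanced label
print(minority_cost(0.92))  # 1.0: clamped, as log2(0.92) + 1 < 1

# Hypothetical per-label class weights (1 = positive/minority, 0 = negative/majority).
imbalance = {"label_a": 1.0, "label_b": 16.0}
class_weights = {k: {1: minority_cost(v), 0: 1.0} for k, v in imbalance.items()}
```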

IV. ALGORITHM
Let the multi-label dataset be denoted by $D = \{(x_i, Y_i),\ i = 1, 2, \ldots, n\}$ and the number of class labels for $D$ be $L$, with $Y_i = \{y_{i1}, y_{i2}, \ldots, y_{iL}\}$. $y_{ij}$ is 1 when label $j$ is positive for instance $x_i$; otherwise the value of $y_{ij}$ is 0. Let each $x_i \in \mathbb{R}^p$. We randomly equi-partition $D$ into a training set, $D_{tr}$, and a test set, $D_{te}$. Let $X$ be the set of training instances (without the label information).
We calculate the class-imbalance ratio of each label $j$, $j = 1, 2, \ldots, L$, denoted by $\mathrm{imb}_j$:
$$\mathrm{imb}_j = \frac{\text{Number of negative training instances for label } j}{\text{Number of positive training instances for label } j}$$
In multi-label datasets, we have a single set of observations covering all labels. We construct a Relative Neighborhood Graph (RNG) whose vertices are represented by the members of $X$.
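The per-label imbalance ratio can be computed in a single pass over the label matrix. The sketch below uses a hypothetical label matrix that mimics the toy dataset of Figure 1 (23 instances with 5, 12 and 10 positives for labels 1, 2 and 3); returning infinity for a label without positives is our assumption.

```python
def imbalance_ratios(Y):
    """Negative-to-positive ratio for every label.

    Y is a list of 0/1 label vectors, one per training instance.
    """
    n_labels = len(Y[0])
    ratios = []
    for j in range(n_labels):
        pos = sum(row[j] for row in Y)
        neg = len(Y) - pos
        # Assumption: a label with no positive instance gets an infinite ratio.
        ratios.append(neg / pos if pos else float("inf"))
    return ratios


# 23 instances; the first 5, 12 and 10 rows are positive for labels 1, 2 and 3.
Y = [[int(i < 5), int(i < 12), int(i < 10)] for i in range(23)]
ratios = imbalance_ratios(Y)
print([round(r, 2) for r in ratios])  # [3.6, 0.92, 1.3]
```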
To extract more refined information about the class structures, we have to extract a label-specific lattice from this graph. First, we extract the homogeneous edges of the graph. As explained earlier, a homogeneous edge is one whose two vertices belong to the same class; there are two classes of homogeneous edges, positive and negative. Let $x_i$ denote the $i$-th vertex of the graph and $c_j(x_i)$ denote the class membership of $x_i$ with respect to label $j$.
Let an edge of the Tree between two vertices $x_i$ and $x_k$ be represented by $e_{ik}$, and let $w_{ik}$ denote the edge weight of $e_{ik}$.
$S_{pj}$ and $S_{nj}$ are the sets of positive and negative homogeneous edges of label $j$ respectively.
For each label $j$, $j = 1, 2, \ldots, L$,
$$S_{pj} = \{e_{ik} : c_j(x_i) = c_j(x_k) = 1\}$$
Similarly,
$$S_{nj} = \{e_{ik} : c_j(x_i) = c_j(x_k) = 0\}$$
We arrange the elements of $S_{pj}$ and $S_{nj}$ in increasing order of their edge weights to get the ranks of their respective elements. For an edge $e_{ik}$, let its rank in the set to which it belongs be denoted by $R(e_{ik})$. We obtain the ranks of the edges because we will select the lattice points from the shorter homogeneous edges. Shorter homogeneous edges have lower ranks than longer edges.
The midpoints of the edges in $S_{pj}$ are stored in $M_{pj}$. $M_{nj}$ stores the midpoints of the edges in $S_{nj}$. Let $N_{Pj}$ and $N_{Nj}$ denote the numbers of positive and negative lattice points of label $j$ respectively.
Let the representations of $M_{pj}$ and $M_{nj}$ be as follows:
$$M_{pj} = \{m^{p}_{1j}, m^{p}_{2j}, \ldots, m^{p}_{k_p j}\}$$
Similarly, $m^{n}_{1j}, m^{n}_{2j}, \ldots, m^{n}_{k_n j}$ represent the elements of $M_{nj}$. It is easy to note that the numbers of elements of $M_{pj}$ and $M_{nj}$ depend on the data distribution and are likely unequal. We have represented their cardinalities by $k_p$ and $k_n$ respectively.
The transformed mapping of instance $x_i$ with respect to label $j$, denoted by $z_{ij}$, is as follows:
$$z_{ij} = \big[\mathrm{imb}_j \cdot d(x_i, m^{p}_{1j}), \ldots, \mathrm{imb}_j \cdot d(x_i, m^{p}_{k_p j}),\ d(x_i, m^{n}_{1j}), \ldots, d(x_i, m^{n}_{k_n j})\big]$$
$z_{ij}$ is a $(k_p + k_n)$-dimensional feature vector. Its first $k_p$ components are generated by taking the distances from the midpoints of the positive homogeneous edges and multiplying them by the imbalance ratio of label $j$. The remaining $k_n$ components are generated by taking the distances from the midpoints of the negative homogeneous edges. Let $Z_j = \{z_{ij},\ i = 1, 2, \ldots, n\}$. $Z_j$ represents the transformed feature mapping of the training instances in $D_{tr}$ for label $j$.
Let Min and Maj be the minority and majority classes of a label respectively. Let $\mathrm{Cost}_j(\mathrm{Min}, \mathrm{Maj})$ and $\mathrm{Cost}_j(\mathrm{Maj}, \mathrm{Min})$ denote the misclassification cost of a minority instance to the majority class for label $j$ and vice versa. For each label, $\mathrm{Cost}_j(\mathrm{Min}, \mathrm{Maj})$ is equal to the product of a cost factor ($cf$) and a logarithmic function of its imbalance ratio. In this work, we have fixed the value of $cf$ to 1.
For each label $j$,
$$\mathrm{Cost}_j(\mathrm{Min}, \mathrm{Maj}) = cf \times \max\big(\log_2(\mathrm{imb}_j) + 1,\ 1\big) \tag{13}$$
$$\mathrm{Cost}_j(\mathrm{Maj}, \mathrm{Min}) = 1, \quad j = 1, 2, \ldots, L$$
For each label $j$, we train a learner $W_j$ using $Z_j$ and the cost function defined above for label $j$. To classify a test instance $t$ with respect to label $j$, we first obtain its transformed mapping for label $j$ and invoke $W_j$ to predict its class. We have used the linear SVM classifier implementation of LIBSVM [41] for modeling and classification.
[Table caption fragment] (Experiments 2, 3, 4). $L_I$ and $F_I$ give the number of labels and features of these datasets in Experiments 2, 3 and 4 (experiments on imbalance). min IR, max IR and avg IR give the minimum, maximum and average imbalance ratios associated with the labels of the datasets with respect to the label information of $L_I$.

A. COMPLEXITY ANALYSIS
We analyse the complexity of the proposed scheme of feature extraction and class-imbalance handling separately below.
• Class imbalance handling: In our method, we add a dedicated misclassification cost for each label. We set the misclassification cost according to the imbalance ratios of the labels. For $L$ labels and $N$ points, we calculate the imbalance ratios by going through the class labels of the $N$ points just once. Hence, for $N$ points and $L$ labels, the complexity of calculating the class-imbalance ratios and misclassification costs is $O(N \cdot L)$.

V. EXPERIMENTAL SETUP
In this section, we present a detailed empirical study in which four sets of experiments are carried out. The motivation of each experiment and its experimental layout are presented in the next three subsections.

A. FIRST EXPERIMENT: FEATURE EXTRACTION
In the first experiment, we demonstrate the relative competencies of the proposed and compared methods in a generalized multi-label framework. Eleven regular multi-label datasets are used. The detailed statistics of the datasets are given in Table 2. These datasets are obtained from the MULAN [42] and MEKA [43] repositories. For the comparative analysis, we have considered five multi-label learners from different genres.
• Binary Relevance (BR) [14]: It is a first-order approach which considers one classifier for each label. Basically, we transform the multi-label classification problem into $L$ binary classification problems for $L$ labels.
• Calibrated Label Ranking (CLR) [19]: It is a second-order approach which considers pairwise correlation of labels. It also considers a synthetic label to distinguish the sets of relevant and irrelevant labels of the instances.
• Random k-Labelsets (RAKEL) [21]: A higher-order approach which considers a number of subsets of labels and learns the full correlations within the subsets. We have considered the overlapped version of RAKEL as it considers more subsets and captures a greater degree of correlation among labels. We have used the settings recommended in the paper, k = 3 and number of subsets m = 2q.
• Ensembles of Classifier Chains [24]: It is a higher-order approach which uses binary classifiers for each label. Label correlation is facilitated by the inclusion of the predictions of preceding labels into the succeeding ones. Ensembles with randomized label order are considered to distribute the learning of correlations. We have considered an ensemble size of 100.
• Multi-label learning with label specific features (LIFT) [12]: This work is based on a feature extraction scheme where a dedicated set of features is learned for each label. The label-specific features are used to invoke $L$ binary classifiers, one for each label. As recommended in the paper, the value of r is set to 0.1.
In this experiment, we have used equal misclassification costs for the minority and majority classes in LIIML to test the inherent efficacy of the proposed method. We have taken the number of positive lattice points to be 100 and varied the cardinality of the negative lattice set according to the imbalance of the labels.

Evaluating metrics: Six metrics, namely Hamming Loss, Coverage, One Error, Ranking Loss, Average Precision and Macro-averaging AUC, are employed to evaluate the relative efficacies of the comparing and proposed methods. Let $x_i$, i = 1, 2, ..., N, be the N test instances and $Y_i$ be the L-dimensional label vector of $x_i$. Let $\bar{Y}_i$ be the complement label set of $x_i$. Let $\alpha_i$ be the label prediction vector for $x_i$. We denote the label-specific predicted score of $x_i$ for label j by $f_j(x_i)$.
• Hamming Loss: It measures the fraction of incorrect predictions over all instances and the entire label set. The lower the value achieved by a learner, the better its performance.
• Average Precision: For each relevant label of an instance, it calculates the fraction of labels ranked at or above that label that are also relevant. $r_i(j)$ denotes the rank of label j predicted by a learner for instance $x_i$. The higher the value of Average Precision, the better the performance of the learner.
• One-Error: One-error counts the instances for which the top-ranked predicted label is not present in the actual label set. The lower the value of one-error, the better the performance of the learner.
• Coverage: Consider an ordered list of predicted labels for each instance, in which the top-ranked label is numbered one. Coverage evaluates the number of steps we need to move down the list to cover all true labels of the instance. Intuitively, the lower the value of coverage, the better the performance. Let $l_j$ be the j-th label and $rank(x_i, l_j)$ be the rank of the j-th label w.r.t. instance $x_i$.
• Ranking Loss: Ranking Loss calculates the average fraction of mis-ordered label pairs, that is, pairs in which an irrelevant label is ranked above a relevant one. A lower value of ranking loss is desirable for a classifier.
• Macro-averaging AUC: Let $AUC_j$ be the AUC score for label j. Macro-averaging AUC is the average of the AUC scores over all labels. The higher the value of Macro-averaging AUC, the better the performance of the learner.
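To make the six metrics listed above concrete, the following Python sketch computes them from a binary label matrix `Y`, a thresholded prediction matrix `P` and a score matrix `F` (all N × L lists of lists). It follows the commonly used definitions of these metrics, which may differ in detail from the equations of this paper. The function names, the tie-breaking by label index, the "minus one" convention in coverage, and the requirement that every instance has at least one relevant and one irrelevant label (and every label at least one positive and one negative instance) are assumptions of this example.

```python
def ranks(scores):
    """Rank labels by descending score; the top-scored label gets rank 1."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    r = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        r[j] = position
    return r


def hamming_loss(Y, P):
    """Fraction of (instance, label) cells predicted incorrectly."""
    wrong = sum(y != p for yi, pi in zip(Y, P) for y, p in zip(yi, pi))
    return wrong / (len(Y) * len(Y[0]))


def one_error(Y, F):
    """Fraction of instances whose top-ranked label is not relevant."""
    errors = 0
    for y, f in zip(Y, F):
        top = max(range(len(f)), key=lambda j: f[j])
        errors += (y[top] == 0)
    return errors / len(Y)


def coverage(Y, F):
    """Average number of steps down the ranked list to cover all true labels."""
    total = 0
    for y, f in zip(Y, F):
        r = ranks(f)
        total += max(r[j] for j in range(len(y)) if y[j] == 1) - 1
    return total / len(Y)


def ranking_loss(Y, F):
    """Average fraction of (relevant, irrelevant) pairs that are mis-ordered."""
    total = 0.0
    for y, f in zip(Y, F):
        pos = [j for j in range(len(y)) if y[j] == 1]
        neg = [j for j in range(len(y)) if y[j] == 0]
        bad = sum(f[j] <= f[k] for j in pos for k in neg)
        total += bad / (len(pos) * len(neg))
    return total / len(Y)


def average_precision(Y, F):
    """For each relevant label, share of relevant labels ranked at or above it."""
    total = 0.0
    for y, f in zip(Y, F):
        r = ranks(f)
        pos = [j for j in range(len(y)) if y[j] == 1]
        total += sum(sum(r[k] <= r[j] for k in pos) / r[j] for j in pos) / len(pos)
    return total / len(Y)


def macro_auc(Y, F):
    """Mean over labels of the pairwise AUC (ties count as one half)."""
    n_labels = len(Y[0])
    total = 0.0
    for j in range(n_labels):
        pos = [f[j] for y, f in zip(Y, F) if y[j] == 1]
        neg = [f[j] for y, f in zip(Y, F) if y[j] == 0]
        wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        total += wins / (len(pos) * len(neg))
    return total / n_labels
```

Lower values are better for the first four functions, and higher values are better for `average_precision` and `macro_auc`, in line with the descriptions above.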

B. SECOND EXPERIMENT: CLASS-IMBALANCE
We present the empirical study on the class-imbalance aspect of multi-label datasets in this subsection. The same eleven multi-label datasets used in the first experiment are used here, but with some preprocessing. In all of these datasets, we have removed the labels whose imbalance ratio (number of negative instances / number of positive instances) is more than 50 or whose number of positive instances is less than 20. A similar protocol has been suggested in [33]. For the nominal datasets, we have reduced the feature set according to the same recommendation. The attribute information of the datasets for this experiment is presented in Table 2. Since this work deals with differential class imbalance ratios of multi-label datasets, we also show the minimum (min IR), maximum (max IR) and average (avg IR) imbalance ratio of each dataset in Table 2. These datasets are obtained from the MULAN [42] and MEKA [43] repositories.

For comparative analysis, we consider RAKEL [21], LIFT [12] and CLR [19] from the multi-label learners used in the first experiment. In addition, we have included COCOA [33] and RML [44]; COCOA specifically addresses the class-imbalance problem in multi-label datasets. We have also included Reverse-nearest neighborhood based oversampling for multi-label datasets (RNNOML) [37] in this empirical study. MLKNN invokes a set of k-nearest neighbor based classifiers for multi-label datasets. Besides these, we have included two methods dedicated to the general class-imbalance problem, namely SMOTE [45] and Random Undersampling (USAM), and used them in the multi-label setting. The proposed method is run in a cost-sensitive learning framework, where an imbalance-adaptive misclassification cost is assigned to each label.
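The label filtering rule of this experiment (drop a label if its imbalance ratio exceeds 50 or if it has fewer than 20 positive instances) and the min IR, max IR and avg IR statistics can be sketched in Python as follows. The function names `keep_labels` and `imbalance_summary` and the list-of-lists label matrix are hypothetical and chosen only for this illustration.

```python
def positive_count(label_matrix, j):
    """Number of positive instances of label j."""
    return sum(row[j] for row in label_matrix)


def keep_labels(label_matrix, max_ir=50.0, min_pos=20):
    """Indices of labels that survive the preprocessing rule."""
    n = len(label_matrix)
    kept = []
    for j in range(len(label_matrix[0])):
        pos = positive_count(label_matrix, j)
        if pos < min_pos:
            continue  # too few positive instances
        if (n - pos) / pos > max_ir:
            continue  # imbalance ratio (negatives / positives) too high
        kept.append(j)
    return kept


def imbalance_summary(label_matrix, kept):
    """Return (min IR, max IR, avg IR) over the retained labels."""
    n = len(label_matrix)
    ratios = []
    for j in kept:
        pos = positive_count(label_matrix, j)
        ratios.append((n - pos) / pos)
    return min(ratios), max(ratios), sum(ratios) / len(ratios)
```

With 1100 instances, for example, a label with 20 positives (ratio 54) and a label with 19 positives would both be removed, whereas labels with 100 positives (ratio 10) and 22 positives (ratio 49) would be retained.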
To evaluate their performance, we have employed Macro-averaging $F_1$ and Macro-averaging AUC. They are described below.
• Macro-averaging $F_1$: It calculates the average of the $F_1$ values across all labels. Let $tp_j$, $tn_j$, $fp_j$ and $fn_j$ denote the number of true positive, true negative, false positive and false negative predictions for label j, respectively. We calculate $F_1$ for label j as

$$F_{1j} = \frac{2\, tp_j}{2\, tp_j + fp_j + fn_j}$$

$$\text{Macro-averaging } F_1 = \frac{1}{L} \sum_{j=1}^{L} F_{1j} \qquad (22)$$

• Macro-averaging AUC: Let $AUC_j$ be the AUC score for label j. We calculate the average AUC score of all labels in Macro-averaging AUC.

C. THIRD EXPERIMENT
We analyze the utility of the proposed scheme of imbalance-adaptive misclassification cost in this study. We consider two first-order methods, LIFT and BR, in their default settings, where the misclassification costs of the classes are equal. We compare their performances with an enhanced-cost version of each of them, LIFT-cost and BR-cost respectively, where the cost of misclassifying the minority instances is set according to the proposed scheme. We evaluate the difference in performance using the Macro-averaging $F_1$ metric.
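The Macro-averaging $F_1$ score used in the second and third experiments can be computed as in the Python sketch below. The function name `macro_f1` is hypothetical, and assigning a score of 0 to a label whose denominator is zero (no positive instance and no positive prediction) is an assumption of this example; the paper does not state how that case is handled.

```python
def macro_f1(Y, P):
    """Average of per-label F1 scores.

    Y, P: N x L binary matrices (lists of lists) of true and predicted labels.
    """
    n_labels = len(Y[0])
    total = 0.0
    for j in range(n_labels):
        tp = sum(1 for y, p in zip(Y, P) if y[j] == 1 and p[j] == 1)
        fp = sum(1 for y, p in zip(Y, P) if y[j] == 0 and p[j] == 1)
        fn = sum(1 for y, p in zip(Y, P) if y[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        # Assumption: an undefined F1 (denom == 0) is scored as 0.
        total += (2 * tp / denom) if denom else 0.0
    return total / n_labels
```

Because every label contributes equally to the average, a learner that ignores the minority class of a highly imbalanced label is penalised as much as one that fails on a balanced label, which is why this metric suits the comparison of the default and enhanced-cost variants.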

D. FOURTH EXPERIMENT: PARAMETER OPTIMIZATION
In this experiment, we study the effect of varying the cost factor and the number of lattice points on class-imbalance focused multi-label learning. The cost factor takes the values 0.5, 1, 2 and 4. Variation in the number of positive and negative lattice points is also explored.

E. STATISTICAL SIGNIFICANCE TEST
We have conducted the Wilcoxon signed-rank test to measure the statistical significance of the difference in performance between the proposed method, LIIML, and each competing method. This work comprises a number of experiments, each evaluated with more than one metric. Experiments 1 and 2 are the key ones, and we have conducted the statistical tests for these two. We report the p value at which the performances of the two methods differ. The lower the p value, the more significant the difference, that is, the more confident we are in rejecting the null hypothesis, which assumes that the performances of the two methods are the same. A p value of 0.05 (5% significance level) is the standard threshold for rejecting a null hypothesis, and we have used it as the threshold for statistical significance.
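The test described above can be illustrated with a small, self-contained Python sketch of an exact two-sided Wilcoxon signed-rank test on paired per-dataset scores. This is an illustration and not the procedure used to produce Tables 6 and 7: the function name `wilcoxon_signed_rank`, the dropping of zero differences, the averaging of tied ranks, and the exact enumeration of sign assignments (feasible for eleven datasets, i.e. at most 2048 assignments) are all assumptions of this example.

```python
from itertools import product


def wilcoxon_signed_rank(a, b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W = min(W+, W-). Zero differences are dropped and
    tied absolute differences receive the average of their ranks.
    """
    d = [x - y for x, y in zip(a, b) if x != y]
    n = len(d)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(d[i]))
    rank = [0.0] * n
    i = 0
    while i < n:
        k = i
        while k + 1 < n and abs(d[order[k + 1]]) == abs(d[order[i]]):
            k += 1
        average = (i + k) / 2 + 1  # mean of 1-based positions i+1 .. k+1
        for m in range(i, k + 1):
            rank[order[m]] = average
        i = k + 1

    w_plus = sum(r for r, di in zip(rank, d) if di > 0)
    w_minus = sum(r for r, di in zip(rank, d) if di < 0)
    w = min(w_plus, w_minus)

    # Null distribution: every sign assignment is equally likely.
    total = sum(rank)
    hits = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(rank, signs) if s)
        if min(wp, total - wp) <= w:
            hits += 1
    return w, hits / 2 ** n
```

In use, `a` and `b` would hold the scores of LIIML and of one competing method on the same datasets for a single metric, and the difference would be declared significant when the returned p value falls below 0.05.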

VI. RESULTS AND ANALYSIS
In this section, we compare the results of LIIML with those of the state-of-the-art multi-label learners. Table 3 records the results of Experiment 1, which evaluates the efficacy of the methods in a regular setting. In Table 4, we present the outcomes of Experiment 2, where we have evaluated the enhanced-cost version of LIIML in a class-imbalanced setting. The outcomes of Experiment 3 are presented in Table 5. The outcomes of Experiment 4 are portrayed graphically in Figures 2-5.

TABLE 6. ... (Table 3). It reports the p value at which LIIML's performance is statistically superior to that of a comparing method for a given metric. Each row corresponds to a metric and each column to a method. The lower the p value, the more significant the superiority. We have selected p = 0.05 as the threshold for statistical significance. Outcomes at which p < 0.05 are indicated in boldface. LIIML achieves statistical superiority with respect to BR, CLR, RAKEL, ECC and LIFT on 5, 2, 5, 4 and 4 cases, respectively.
Experiment 1: Table 3 shows the performances of LIIML and the competing methods on the 11 multi-label datasets. On Hamming Loss, the proposed scheme has achieved the lowest error value on 10 out of 11 (90.91%) datasets. On One Error, Coverage and Ranking Loss, LIIML has achieved the best scores on 9 (81.82%), 4 (36.36%) and 5 (45.45%) datasets, respectively. On Average Precision, LIIML has achieved the best scores on 8 (72.73%) datasets. On the Macro-averaging AUC metric, LIIML's performance is superior to that of the other methods on 6 (54.55%) datasets. In Table 3, each method has 66 observations (11 datasets × 6 metrics). We summarize the cumulative observations of Table 3 as follows.
• LIIML's performance is better than that of CLR on 46 out of 66 (69.70%) pairwise observations. We note that LIIML could not outperform CLR on the Ranking Loss and Coverage metrics. The working principle of CLR is based on the ranking of labels, and this aspect has likely contributed to its efficiency on these metrics.
• LIIML has performed better than LIFT on 50 (75.76%) occasions. For 2 pairwise observations, the scores of LIFT and LIIML are tied. On the remaining 14 cases, LIFT has outperformed LIIML.
Experiment 2: Table 4 shows the performance of the proposed and comparing methods on Macro-averaging $F_1$ and Macro-averaging AUC. On Macro-averaging $F_1$, LIIML achieves the best score on 8 out of 11 datasets (72.73%). On one of the remaining datasets, Corel5k, RML has obtained the best Macro-averaging $F_1$ result, and on the other 2 datasets RNNOML has performed best. On Macro-averaging AUC, LIIML performs better than all other methods on 7 out of 11 datasets (63.64%). The remaining 4 best scores of Macro-averaging AUC are shared by COCOA (2), CLR (1) and RNNOML (1).

FIGURE 2. Macro-averaging AUC results of four datasets subject to varying and increasing misclassification costs for the minority class. We have varied the cost factor over the values 0.5, 1, 2 and 4. It can be observed that increasing the cost factor value up to 2 improves the learning of the minority classes of each label. The graphs indicate a loss of performance for cost factors beyond 2 on all four datasets. On using a value beyond 2, the classifier becomes over-biased towards the minority class. The optimal cost factor value is 2 for three datasets and 1 for one dataset.

Experiment 3: Table 5 records the Macro-averaging $F_1$ scores of LIFT and BR in the regular cost framework and in the enhanced misclassification cost framework. The results indicate that the enhanced cost scheme is effective in handling class-imbalance and in recognizing the positive (minority) class of the multi-label datasets. The Macro-averaging $F_1$ performance of LIFT has improved by over 20% on 8 out of 11 datasets using the misclassification cost-enhancement learning. For BR, the improvement is also pronounced, with results improving by over 20% on 8 out of 11 datasets as well. For two datasets, Corel5k and CAL500, the improvement is more than 100% (w.r.t. both BR and LIFT).

FIGURE 3. Macro-averaging $F_1$ results of four datasets subject to varying and increasing misclassification costs for the minority class. We have varied the cost factor over the values 0.5, 1, 2 and 4. The observations from this figure are in congruence with our findings from Figure 2. In all four cases, the Macro-averaging $F_1$ value increases on increasing the cost factor value up to 2, which marks the optimal value for learning the given datasets. An increase beyond cost factor value 2 is observed to cause a loss of minority performance.

Experiment 4: Figures 2 and 3 show the variation of the Macro-averaging AUC and Macro-averaging $F_1$ scores over the range of cost factors. Increasing the value of the cost factor promotes the recognition of minority class instances at the cost of majority class performance. The Macro-averaging AUC and Macro-averaging $F_1$ scores also indicate the same. Increasing the cost factor beyond 2 results in a sharp fall of the Macro-averaging $F_1$ and Macro-averaging AUC scores for all four datasets. The number of lattice points is important for perceiving a functional geometry of the data points. Considering a lower number of lattice points gives a distorted geometry, which in turn causes a fall in performance. On the other hand, increasing the cardinality of the lattice point set is computationally more intensive. The findings demonstrated in Figures 4 and 5 are somewhat in agreement with the above. The only exception is the Medical dataset, whose performance falls with an increasing number of lattice points.

Statistical Tests: Tables 6 and 7 show the results of the statistical significance tests on the outcomes of Experiment 1 and Experiment 2, respectively. For analyzing the results of Experiment 1 (Table 6), we have performed 30 (5 × 6) tests for 5 comparing methods and 6 metrics. LIIML has achieved statistically superior performance on 22 (73.33%) occasions. For analyzing the results of Experiment 2 (Table 7), we have conducted 15 tests for 8 comparing methods (RML did not output Macro-averaging AUC scores) and 2 metrics. On 11 (73.33%) cases, LIIML has achieved statistically superior performance.

TABLE 7. ... (Table 4). It reports the p value at which LIIML's performance is statistically superior to that of a comparing method for a given metric. Each row corresponds to a method and each column to a metric. A lower p value indicates a more significant difference in performance. We have selected p = 0.05 as the threshold for statistical significance. Outcomes at which p < 0.05 are indicated in boldface. On Macro-averaging $F_1$, LIIML achieves statistically superior performance over 7 out of 8 methods. On the Macro-averaging AUC metric, LIIML's performance is statistically superior to 4 out of 7 methods.

VII. CONCLUSION
In this paper, we have proposed a novel multi-label learner which takes into account the class imbalance and class geometries of multi-label datasets. The simple framework of imbalance-adaptive misclassification cost offers a new direction in the field of class-imbalanced multi-label learning. The detailed experimental evaluation indicates that the performance of the proposed method is comparable or superior to that of the competing methods across different evaluating metrics. The novelty of LIIML is twofold. The first contribution is the feature extraction method, which takes into account both the class imbalance and the class geometries of the labels. The other key contribution is the imbalance-adaptive cost-sensitive learning, an effective yet simple tool for handling the diversified imbalance ratios of the labels. Moreover, it has a general framework which can be used with any first-order multi-label approach. Our future studies are directed at finding the optimal number of lattice points across different labels to represent the class geometries of a multi-label dataset.