HML-RF: Hybrid Multi-Label Random Forest

Multi-label classification is the supervised learning problem in which an instance is associated with a set of labels. In this setting, labels are correlated, and hence label dependency information plays a vital role. Deciding the order in which labels are processed to exploit their inter-dependency has long been an open research question. To this end, many approaches have been proposed that, in general, can be categorized as problem transformation and algorithm adaptation techniques. Problem transformation reconstructs the multi-label problem as multiple single-label problems. Algorithm adaptation modifies existing well-known machine learning approaches to solve the multi-label classification problem directly. However, both techniques have their pros and cons. In this paper, we propose a novel approach that combines the merits of both techniques, hence named Hybrid Multi-Label Random Forest (HML-RF). Multi-label decision trees are used as base classifiers in the proposed approach to construct the HML-RF model. Each base classifier is constructed over a randomly selected subset of labels to exploit the label dependency. We also formulate a way to compute the tree strength of a multi-label decision tree, which is used to construct the HML-RF with strength (HML-RFws). The efficacy of the proposed approach is tested over ten well-known and publicly available datasets. Experimental results show that HML-RF performs better on at least six datasets, and HML-RFws performs better on at least nine datasets, in comparison to state-of-the-art approaches in terms of accuracy, hamming loss, and zero-one loss. Finally, a statistical test validates all the experimental results.


I. INTRODUCTION
The traditional Random Forest (RF) [1] is mostly used for the multi-class classification problem, i.e., where an instance is associated with a single label. For example, object classification [2], face recognition, and brain tumor segmentation [3] are a few such problems in which an instance is mapped to a single label [4]. However, there are many applications in the real world where an instance is associated with multiple categories simultaneously. For example, in the text classification problem [5], a text snippet may belong to more than one class label, like sad, angry, and politics; an image containing one of the seven wonders of the world, the Taj Mahal, may be tagged as ancient architecture, cultural heritage, and India. Similarly, in many application domains, such as computer vision, image annotation, and bioinformatics, an image is associated with multiple labels [6]-[8]. In the past, a lot of work has been done in the direction of multi-label classification [9]-[12]. In multi-label classification, each sample $x_i$ is associated with a label vector $y_i = \{y_{i1}, y_{i2}, \ldots, y_{iL}\}$, where each element takes the value 0 or 1, signifying the absence or presence of the corresponding label [13].
In the past, several attempts have been made to find the solution for the multi-label classification problem [14]. All the existing approaches can be categorized into two kinds of techniques: problem transformation and algorithm adaptation [13], [15]. The problem transformation approach is an intuitive technique where a multi-label problem is decomposed into multiple independent binary classification problems. Further, any well-known machine learning model can be used to solve each binary classification problem, such as support vector machine (SVM), logistic regression, decision tree, etc. [8]. The Binary Relevance (BR) [16] is one of the well-known problem transformation approaches.
However, this method considers the labels independent of each other, leading to poor performance, since labels are correlated in multi-label classification problems. To address this, several approaches have been developed to exploit the label correlation under problem transformation. Classifier Chain (CC) [17], Ensemble of Classifier Chains (ECC) [18], Label Power set (LP), and RAndom k-labELsets (RAkEL) [19] are a few problem transformation approaches that consider the label correlation for classification. In the algorithm adaptation technique, an existing multi-class algorithm is transformed to solve the multi-label classification problem. This technique considers the label correlation as an intrinsic property. Multi-Label C4.5 Decision Tree (ML-DT) [20], AdaBoost.MH [21], and Multi-Label k Nearest Neighbour (ML-kNN) [22] are a few of the most used approaches in this category.
Apart from that, inspired by the great success of deep convolutional neural networks in single-label classification [23], [24], several deep learning approaches have been proposed to solve multi-label classification problems [26]-[28]. Most of these approaches are based on graphical models [29] to capture label co-occurrence dependencies and further use Markov random fields [30] to infer the final joint label information. Several other methods have recently been explored in the direction of multi-label classification, such as extreme learning machines using stacked auto-encoders [31] and ML-Forest [32]. However, most of these methods cannot model higher-order correlations and are computationally expensive. Similarly, inspired by the fact that not all features are essential for predicting every class label, work has been done to model the mapping between the feature space and the label space [33], [34]. Liu et al. [35] proposed an approach to exploit the relationships among instances rather than the relationships between the variables and the class labels. Ma et al. [34] proposed an approach that considers local feature selection and local label correlation, under the assumption that instances can be clustered into different groups, and that feature selection weights and label correlations can be shared by instances in the same group.
Two main challenges are associated with the multi-label classification problem: how to exploit label correlation effectively, and how to handle high-dimensional data [15]. Several approaches, such as CC, ECC, RAkEL, ML-DT, and ML-kNN, have been proposed in the past to address these challenges; still, they remain open challenges. Hence, to this end, in this paper we propose a simple yet effective multi-label learning algorithm, Hybrid Multi-Label Random Forest (HML-RF), which can effectively handle both challenges. (The proposed HML-RF approach aims to combine the advantages of both problem transformation and algorithm adaptation; hence it is compared against approaches from both categories.) The proposed approach is an ensemble of multi-label decision trees. It uses the Joint Information Gain (JIG) [37] as a splitting criterion for the construction of trees. The JIG is the sum of all Information Gains (IG) computed by considering one label at a time. Therefore, the JIG considers the correlation present among all the labels of a multi-label classification problem. As the RF has been used in the past for dimensionality reduction and feature selection [39], the JIG is computed on a randomly selected set of candidate features to decide the best splitting feature value. The presented approach combines both problem transformation and algorithm adaptation techniques and exploits the advantages of both. Also, to further improve the performance of the proposed HML-RF approach, a tree strength measure is presented to select the best trees for forest construction. Therefore, the forest formed with eligible trees based on tree strength improves the proposed approach's performance. Experiments have been conducted over ten well-known publicly available datasets, and the results are compared with other state-of-the-art algorithms.
In a summarized way, the overall contributions of the paper are highlighted as follows: 1) A new learning model, named Hybrid Multi-Label Random Forest (HML-RF), is proposed to appraise the advantages of both the problem transformation and algorithm adaptation approaches.
2) The Joint Information Gain (JIG) is used as a splitting criterion to exploit the label correlation information for the classification.
3) The tree strength concept is proposed for the multi-label decision tree. Further, a threshold value in terms of tree strength is set empirically for a multi-label decision tree to be part of the forest. The forest so formed is named HML-RF with strength (HML-RFws). 4) The performance of both the proposed HML-RF and HML-RFws algorithms is tested over ten well-known datasets. Also, a statistical test is applied to the obtained results to strengthen our empirical analysis.
The rest of the paper is organized as follows. In Section II, we briefly describe several multi-label classification methods. Section III describes the problem definition. In Section IV, we present the proposed hybrid multi-label random forest along with an example for better understanding; we also formulate an approach to compute the strength of the multi-label decision tree. In Section V, we describe the datasets used for the experiments, implementation details, and result analysis. The experiments are divided into two parts: in the first part (Section V), the HML-RF is formed without considering the tree strength; in the second part (Section VI), experiments are conducted by considering the tree strength for the formation of HML-RFws. In Section VII, we draw the conclusion.

II. RELATED WORK
One of the trivial approaches to solving the multi-label classification task is to transform it into n independent binary classification tasks, known as the Binary Relevance (BR) approach [13]. Note that the BR approach considers each label as an independent entity; hence it leads to poor performance. To overcome this problem, the CC method was proposed [17]. It also learns n binary classifiers like BR; however, in the CC approach, each binary classifier is trained over a new feature space formed by appending the 0/1 output relevance of all previous classifiers as features, thus forming a classifier chain. Therefore, during the construction of the n-th CC learner, the training set has all the original features and n - 1 additional features. This approach considers the label correlation in sequential order for multi-label classification. However, the CC method's performance is constrained by the label order, and finding the best label order for chain formation is itself a challenging task. Moreover, the CC method cannot be implemented in parallel because of the dependency along the chain. The sketch below illustrates this chaining of feature spaces.
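As an illustration of the chaining, the following is a minimal Python sketch, assuming scikit-learn-style decision trees as the binary base learners; the function names are ours, not from the CC reference implementation [17].

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_classifier_chain(X, Y):
    # One binary classifier per label; classifier j is trained on the
    # original features plus the 0/1 values of labels 0..j-1.
    chain, X_aug = [], X.copy()
    for j in range(Y.shape[1]):
        clf = DecisionTreeClassifier().fit(X_aug, Y[:, j])
        chain.append(clf)
        X_aug = np.hstack([X_aug, Y[:, j:j + 1]])  # append label as a feature
    return chain

def predict_classifier_chain(chain, X):
    X_aug, preds = X.copy(), []
    for clf in chain:
        y_hat = clf.predict(X_aug).reshape(-1, 1)
        preds.append(y_hat)
        X_aug = np.hstack([X_aug, y_hat])  # a wrong prediction propagates onward
    return np.hstack(preds)

The last comment makes the error-propagation problem discussed next concrete: each predicted column becomes an input feature for every later link in the chain.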
One more disadvantage associated with the CC approach is the error propagation problem: if any CC classifier predicts the wrong outcome at any stage, the error propagates into the remaining CC classifiers. Hence, the ECC method was proposed, in which several CC classifiers are generated with random orders over the label space [18]. Several methods have been introduced to improve CC's effectiveness, such as replacing binary values with probabilistic outputs [40], finding a proper chain sequence by the Monte Carlo method [41], and using a recurrent neural network focusing only on positive labels as an extension of the probabilistic CC approach [42]. Another popular approach that considers label correlations is the Label Power set (LP), in which every distinct combination of labels is treated as a class of a single-class problem [43]. The drawbacks of the LP method are its high training complexity, due to the exponential increase in the number of label combinations for datasets with many labels, and its inability to predict label combinations that do not appear in the training set [8]. Other multi-label classification approaches consider subsets of labels randomly, such as RAndom k-labELsets (RAkEL) [19]. It divides the label space into equal partitions of size k, trains an LP classifier per partition, and predicts by summing all trained classifiers' results. It reduces the computational complexity, but the appropriate value of k is itself a research question.
Similarly, several algorithm adaptation techniques have been proposed, such as multi-label k nearest neighbour [22] and decision trees for multi-label classification [20], [44]. The main reason for the popularity of the kNN method is that it can be easily extended to multi-label classification. However, kNN is sensitive to noisy data, and its performance varies with k, whose optimal value relies on the data at hand and is hard to determine [45]. Hence, several variations of the kNN method have been studied, such as BRkNN and LPkNN [19]. The decision tree [20] is constructed in a top-down manner, with the root containing all the samples. At every node, features are examined one by one to find the best splitting point to split the data samples.
Recently, a lot of work has been done in the direction of utilizing label information efficiently. For example, Han et al. [9] performed multi-label classification using label-specific features and correlations among instances: a probabilistic neighborhood graph model computes the instance correlation, and the label correlation is computed using cosine similarity. Khandagale et al. [12] proposed the concept of the Bonsai tree, which is a diverse and shallow tree. The authors proposed utilizing the label information in three ways: input, output, and a combination of input and output space representation. Also, the shallow tree is constructed to avoid error propagation.
The approaches explored so far are based on either the problem transformation or the algorithm adaptation approach. However, each has its own limitations. For example, problem transformation increases the model's overall complexity, because a model is trained for each transformed single-label problem, and the problem becomes more complex as the number of labels increases. Algorithm adaptation extends traditional learning algorithms to solve the multi-label classification problem; hence its performance is deeply characterized by the underlying traditional learning approach. Also, deciding the sequence of labels to exploit their dependency remains a challenging task.

III. PROBLEM DEFINITION
Let $D = \{(x_i, y_i)\}_{i=1}^{M}$ be a multi-label dataset, where each instance $x_i \in \mathcal{X}$ is associated with a label vector $y_i \in \{0, 1\}^L$, where L is the number of class labels. A multi-label sample can belong to more than one class label at a time, up to a maximum of L class labels. Each element of the vector $y_i$ has a binary value, indicating whether the corresponding class label is relevant to the sample or not. Several class labels can be active at once, unlike in multi-class classification, where only one class label is present. A distinct combination of labels is known as a label-set.
The task of multi-label learning is to learn a function $h : \mathcal{X} \rightarrow \{0, 1\}^L$. Thus, for any test instance $x_i \in \mathcal{X}$, the multi-label classifier $h(\cdot)$ predicts $h(x_i) \subseteq y_i$ as the subset of the labels. In the context of RF, the classifier $h(\cdot)$ is comprised of $t_n$ multi-label decision trees as classifiers $t_1, t_2, \ldots, t_n$, where each decision tree $t_j$ learns from training data to predict the relevance $\hat{y}_i \in \{0, 1\}^L$ for the test instance $x_i$. Thus, the aim is to design a classifier $h(\cdot)$ that can correctly predict the class-label vector for the test instances.

IV. PROPOSED METHOD: HML-RF
In the proposed HML-RF approach, several multi-label decision trees are constructed using the Joint Information Gain (JIG) as a splitting criterion. The approach also considers label dependency by selecting labels randomly to construct each multi-label decision tree. More specifically, it chooses the labels completely at random to analyze their dependency. Hence, in the proposed approach, multi-label decision trees are constructed over randomly selected subsets of labels, which leads to computing the first-order, second-order, or higher-order inter-dependency of the labels. In constructing a multi-label decision tree, first-order inter-dependency means considering only a single label, second-order means considering a pair of labels, and higher-order means considering more than two labels. The choice of these labels is entirely random.
In the HML-RF algorithm, we initially divide the dataset D into a training set $D_1$ and a testing set $D_2$, such that $D = D_1 \cup D_2$ and $D_1 \cap D_2 = \emptyset$. The underlying assumption is that the training and testing sets have the same data distribution; therefore, the model trained over the training set can be directly applied to the testing set. All the multi-label decision trees are constructed over the training set using bootstrap sampling. However, the dimensionality of the output space keeps varying to examine the label correlation: every time, a subset of labels is selected randomly for constructing the multi-label decision tree. As subsets of a set may contain one or more elements, the selected subset may have a single label or multiple labels. A multi-label decision tree constructed over a dataset whose label subset has a single element behaves as a binary classification tree; therefore, all decision trees with a single-label subset inherit the property of the problem transformation technique. Similarly, the decision trees constructed using subsets with more than one label inherit the merits of the algorithm adaptation technique.
One of the trivial approaches to exploit the label correlation is to select all the possible combinations of labels and consider label dependency to construct multi-label decision trees. That means, for L labels, a total of $\binom{L}{1} + \binom{L}{2} + \ldots + \binom{L}{L} = 2^L - 1$ combinations are possible, which includes first-order, second-order, and higher-order correlations. However, this increases the computational cost and the over-fitting as the number of labels increases. Therefore, instead of considering all the combinations of labels, considering selected subsets of labels reduces the computational cost, while still bringing first-order and higher-order label correlations into the construction of the multi-label decision trees. Fig. 1 demonstrates the HML-RF algorithm as a block diagram. We consider a multi-label dataset $D = \{X, Y\}$, such that $X \in \mathbb{R}^{100 \times 10}$ and $Y \in \mathbb{R}^{100 \times 6}$, i.e., the dataset contains a total of 100 instances and 10 attributes. Each instance is associated with 6 output labels, i.e., the label set is $\{1, 2, 3, 4, 5, 6\}$. Initially, the dataset is divided into a training set of 70 instances and a testing set of 30 instances. Let the HML-RF have $t_n = 5$ multi-label decision trees. Therefore, a total of 5 subsets are selected as output labels, one for each multi-label decision tree, from the label set. Further, data samples are selected five times with replacement from the training set to construct the five multi-label decision trees. Thus, we have datasets $d_j = \{X_j, Y_j\}$, $j = 1, \ldots, 5$, one per multi-label decision tree, where each $X_j \in \mathbb{R}^{70 \times 10}$; for example, $Y_4 \in \mathbb{R}^{70 \times 1}$ (a single-label subset) and $Y_5 \in \mathbb{R}^{70 \times 2}$ (a two-label subset), with the dimensions of the remaining $Y_j$ determined by their randomly drawn subsets. We use the JIG as the splitting criterion for the construction of all multi-label trees. At the end, the ensemble of all such multi-label decision trees constitutes the HML-RF, and finally majority voting is performed over all the labels predicted by the multi-label decision trees.
Note that the selected subset may contain only a single label, as in datasets $d_2$ and $d_4$, or more than one label, as in datasets $d_1$, $d_3$, and $d_5$. Therefore, the multi-label decision trees constructed using $d_2$ and $d_4$ resemble the problem transformation approach, while those constructed using $d_1$, $d_3$, and $d_5$ resemble the algorithm adaptation approach. However, it is important to note that each decision tree predicts all the output labels associated with the test instance, regardless of whether it has been constructed in the manner of problem transformation or algorithm adaptation. The JIG considers the label dependency for the construction of the multi-label decision trees. Algorithm 1 describes the pseudo-code of the HML-RF approach.

A. JOINT INFORMATION GAIN (JIG)
The Information Gain (IG) [46] is a measure of the importance of an attribute of the feature vector. It is used to decide the attribute for splitting the data at a node of a decision tree. The IG is derived from the concept of entropy, and entropy is a measure of the impurity or uncertainty present in the dataset. Hence, the objective is to reduce the overall entropy of the dataset.
The entropy of the dataset D can be given as:

$$E(D) = -\sum_{x} p(x) \log_2 p(x)$$

Here, p(x) is the fraction of examples in a class. The IG for splitting on an attribute value $x_{ij} = \alpha$, which partitions D into two subsets $D_1$ and $D_2$, can be given as:

$$IG(D, x_{ij} = \alpha) = E(D) - \sum_{v=1}^{2} \frac{|D_v|}{|D|} E(D_v)$$

In a multi-label classification problem, more than one class label is present. Hence, instead of computing the IG with respect to one class label, we compute the JIG, which is the sum of the IGs computed with respect to each label [37]. Therefore, the JIG can be given as:

$$JIG(D, x_{ij} = \alpha) = \sum_{l=1}^{L} IG_l(D, x_{ij} = \alpha) \qquad (1)$$

The JIG computes the information gain with respect to each label $1, 2, \ldots, L$ over the same attribute value $x_{ij} = \alpha$. Therefore, it considers every attribute value to decide the maximum information gain. Attributes having a high correlation with the output labels will have a high impact on classification capability; hence, at these attribute values, the JIG will be maximum, and they will be chosen as splitting features. In other words, the JIG searches for an attribute that leads to better discrimination of the labels by reducing the entropy of the dataset with respect to each label. This signifies the consideration of label dependencies in deciding the best splitting attribute while constructing the multi-label decision tree.
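To make the criterion concrete, here is a small Python sketch of the JIG for one candidate split, written directly from the formulas above; the helper names are illustrative and not taken from the authors' code.

import numpy as np

def entropy(labels):
    # Binary entropy of one 0/1 label column.
    p = labels.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def information_gain(col, y, alpha):
    # IG of splitting feature column `col` at threshold alpha, w.r.t. one label y.
    mask = col <= alpha
    n, child = len(y), 0.0
    for part in (y[mask], y[~mask]):
        if len(part) > 0:
            child += (len(part) / n) * entropy(part)
    return entropy(y) - child

def joint_information_gain(col, Y, alpha):
    # JIG (Eq. 1): sum the per-label gains for the same split point.
    return sum(information_gain(col, Y[:, j], alpha) for j in range(Y.shape[1]))

A split point that discriminates several labels at once accumulates gain from every label column, which is exactly how the JIG rewards label-dependency-aware splits.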

Algorithm 1 Creation of HML-RF
Step 1: Divide the dataset D into a training set $D_1$ and a testing set $D_2$.
Step 2: Set the number of trees $t_n$.
Step 3: Construct the multi-label decision trees:
Step 3.1: Initialize an empty forest.
Step 3.2: Draw a bootstrap sample from the training set $D_1$ and randomly select a subset of labels as the output space.
Step 3.3: Construct a multi-label decision tree over the sample using the JIG as the splitting criterion.
Step 3.4: Add the tree to the forest; if the forest contains $t_n$ trees, go to step 4.
Step 3.5: go to step 3.2
end
Step 4: Classify test instances by majority voting over the label predictions of all trees.

B. TIME COMPLEXITY COMPARISON
Let X be a dataset with $M \times N$ dimensions, and Y be the corresponding class labels with $M \times L$ dimensions. Let $x_i$ be an input instance, $i \in \{1, 2, \ldots, M\}$, and $y_i$ the associated output vector such that $y_i = \{y_{i1}, y_{i2}, \ldots, y_{iL}\}$. For the implementation of BR, CC, RAkEL, and ECC, the extremely randomized decision tree is used as the base classifier [38]. In the construction of a decision tree, $k = \sqrt{N}$ features are randomly selected at each node. Therefore, the time complexity to construct one decision tree is $O(k \cdot M)$ [25]. The time complexity of one BR classifier [16] is thus $O(k \cdot M)$; since a total of L labels are present, L decision trees are constructed, and the effective time complexity of BR is $O(L \cdot k \cdot M)$. In the CC approach, a total of L decision trees are also constructed, hence its time complexity is $O(L \cdot k \cdot M)$. In the case of ECC, a total of p CCs are constructed, hence its total complexity is $O(p \cdot L \cdot k \cdot M)$. Let there be $t_n$ trees in the forest; then the complexity of RAkEL is $O(t_n \cdot k \cdot M)$. The complexity of the proposed HML-RF is also $O(t_n \cdot k \cdot M)$.
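As a concrete summary of Algorithm 1 and its $O(t_n \cdot k \cdot M)$ construction loop, here is a minimal Python sketch; tree_cls stands in for a JIG-based multi-label decision tree class with assumed fit and predict_full methods, so the interface is our assumption rather than the authors' implementation.

import numpy as np

def build_hml_rf(X, Y, tree_cls, t_n=50, seed=0):
    # Steps 2-3: grow t_n trees, each on a bootstrap sample (rows) and a
    # randomly selected subset of labels (columns of Y).
    rng = np.random.default_rng(seed)
    M, L = Y.shape
    forest = []
    for _ in range(t_n):
        rows = rng.integers(0, M, size=M)               # bootstrap sample
        k = rng.integers(1, L + 1)                      # subset size 1..L
        labels = rng.choice(L, size=k, replace=False)   # random label subset
        tree = tree_cls().fit(X[rows], Y[np.ix_(rows, labels)], labels)
        forest.append(tree)
    return forest

def predict_hml_rf(forest, X, L):
    # Step 4: each tree votes a full {0,1}^L vector (per Section IV, every
    # tree predicts all output labels); majority vote per label.
    votes = np.mean([t.predict_full(X, L) for t in forest], axis=0)
    return (votes >= 0.5).astype(int)

A subset size of 1 makes that tree a problem-transformation-style binary learner, while larger subsets give algorithm-adaptation-style trees, which is the hybrid behaviour the approach is named for.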

C. STRENGTH OF MULTI-LABEL DECISION TREE
The decision tree is a weak classifier; therefore, every decision tree may not always contribute substantially to the overall performance of the forest. Hence, we formulate a way to compute a multi-label decision tree's strength, inspired by the strength formulated for the multi-class decision tree [47]. Every tree in the forest assigns class labels to test instances based on the class probabilities computed at the tree's leaf nodes. The expected value of the margin function computed over all instances is called the tree strength.
Let $P_t(x_i, y_{ij})$ be the class probability computed by decision tree t at the leaf node to assign the class label $y_{ij}$ to the input instance $x_i$. Let $\bar{y}_{ij}$ be the class label other than $y_{ij}$, $j \in \{1, 2, \ldots, L\}$, computed for the input instance. Then the margin function for the multi-label decision tree t can be defined as:

$$mr(x_i, y_i) = \frac{1}{L} \sum_{j=1}^{L} \left( P_t(x_i, y_{ij}) - P_t(x_i, \bar{y}_{ij}) \right) \qquad (2)$$

The margin function's value signifies the tree's confidence in classifying the input data: the higher the margin value, the more confident the classification. The expected value of the margin function over the input dataset is called the strength of the multi-label decision tree. Thus, the strength of the decision tree is:

$$s_t = E_{X,Y}(mr(X, Y)) \qquad (3)$$

where $E(\cdot)$ indicates the expected value over the input data X, and $s_t$ is the strength of the t-th multi-label decision tree. Therefore, it can be defined as:

$$s_t = \frac{1}{M} \sum_{i=1}^{M} mr(x_i, y_i) \qquad (4)$$

The strength of the forest is the average strength of the trees in the forest. Hence, let $t_n$ be the number of trees in the forest; then the strength of the forest, s, can be given as:

$$s = \frac{1}{t_n} \sum_{t=1}^{t_n} s_t \qquad (5)$$

Note that the ideal value of the strength of a multi-label decision tree is one. Consider a sample $x_i$ that has reached a leaf node where all samples belong to the same class labels. In this case, the class probability is one for all the labels; therefore, the computed margin value is one, i.e., $mr(x_i, y_i) = 1$. If this is true for all samples $x_i$, $i \in \{1, 2, \ldots, M\}$, then from Eq. 2 and Eq. 4, the strength of the tree is one. On the other side, the worst-case value of tree strength is zero: in this case, all the samples present at the leaf node have an equal class distribution with respect to all the labels, and hence the effective margin value is $mr(x_i, y_i) = 0$. Thus, the strength of the multi-label decision tree lies in the range [0, 1]. The forest strength is the average strength of its decision trees; therefore, in the ideal case, when the strength of all multi-label decision trees is one, the forest so formed has strength one. Similarly, the worst-case strength of the multi-label decision trees decides the worst-case strength of the forest, i.e., zero.
Note that the single-label classification task is a particular case of multi-label classification: a multi-label decision tree behaves as a single-label decision tree with L = 1. In that case, Eq. 2 and Eq. 4, indicating the margin function and the tree's strength, turn into Eq. 6 and Eq. 7, respectively. Similarly, Eq. 5 turns into Eq. 8, showing the computation of the strength of the forest:

$$mr(x_i, y_i) = P_t(x_i, y_i) - P_t(x_i, \bar{y}_i) \qquad (6)$$

$$s_t = \frac{1}{M} \sum_{i=1}^{M} mr(x_i, y_i) \qquad (7)$$

$$s = \frac{1}{t_n} \sum_{t=1}^{t_n} s_t \qquad (8)$$
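Written out, Eqs. (2)-(5) reduce to simple averages. Below is a minimal sketch, assuming binary labels, so the probability of the complementary label value is one minus that of the true value; the matrix P_true of leaf probabilities for the true label values is an assumed input, not part of the authors' code.

import numpy as np

def margin(P_true):
    # Eq. (2): per-instance margin; P_true[i, j] is the leaf probability the
    # tree assigns to the true value of label j for instance i.
    return (P_true - (1.0 - P_true)).mean(axis=1)

def tree_strength(P_true):
    # Eqs. (3)-(4): expected margin over all M instances.
    return float(margin(P_true).mean())

def forest_strength(per_tree_probs):
    # Eq. (5): average strength over the t_n trees.
    return float(np.mean([tree_strength(P) for P in per_tree_probs]))

# Ideal case: every leaf is pure, so all probabilities are one and strength is 1.
assert tree_strength(np.ones((5, 3))) == 1.0
# Worst case: a uniform class distribution at the leaves gives strength 0.
assert tree_strength(np.full((5, 3), 0.5)) == 0.0

The two assertions mirror the [0, 1] range argument above: pure leaves give the ideal strength of one, and uniform leaf distributions give the worst-case strength of zero.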

V. EXPERIMENTS USING HML-RF
This section describes the experiments conducted over all the datasets to test the performance of the proposed approach. Initially, we carry out experiments to compare the HML-RF with other state-of-the-art methods. We also compare the ranking of the proposed approach against other competitive approaches. Following that, we experiment to empirically decide the threshold value of tree strength. Furthermore, the HML-RFws is constructed using qualified multi-label decision trees selected based on tree strength. Experiments are then conducted to show the improvements in evaluation metric values using the HML-RFws.

A. DATASETS
All the experiments are carried out on ten multi-label benchmark datasets of different types and sizes, which are summarized in Table 1. For example, flag, scene, and Corel5k are image datasets; emotions is a music dataset; yeast and genbase are from the biology domain; birds is an audio dataset; and enron, slashdot, and delicious are from the text domain [48]. The datasets also vary in terms of their cardinality and density values. The cardinality of a dataset indicates the average number of labels per instance, while density is defined as cardinality divided by the number of labels, indicating the average fraction of active labels per instance. A small sketch computing both statistics follows.
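For a 0/1 label matrix Y, the two statistics can be computed as follows; the helper names are illustrative.

import numpy as np

def label_cardinality(Y):
    # Average number of active labels per instance.
    return float(Y.sum(axis=1).mean())

def label_density(Y):
    # Cardinality normalized by the number of labels L.
    return label_cardinality(Y) / Y.shape[1]

Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1]])
print(label_cardinality(Y))  # 2.0 -> six active labels over three instances
print(label_density(Y))      # 0.666... -> 2.0 divided by L = 3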

B. EVALUATION METRICS
A test instance in multi-label learning involves multiple labels simultaneously, and a prediction can be entirely correct, partially correct, or entirely wrong. This makes the traditional single-label classification evaluation metrics, such as recall, precision, and F-measure, unsuitable for evaluating the performance of multi-label algorithms [8]. Therefore, a variety of evaluation metrics for multi-label learning have been proposed. We use three widely used evaluation metrics, the Jaccard score, hamming loss, and zero-one loss, to verify the performance of the proposed approach and the state-of-the-art methods in our experiments.

The Jaccard score computes the percentage of correctly predicted labels among all predicted and actual labels. It is also referred to as a multi-label accuracy measure. It is a label-set-based evaluation, in which each complete label set is used for the measure. Let $y_i$ and $\hat{y}_i$ be the actual and predicted label-sets, respectively, for a test instance $x_i$, and let there be n test instances. Then accuracy can be computed as:

$$Accuracy = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \cap \hat{y}_i|}{|y_i \cup \hat{y}_i|}$$

Hamming loss evaluates how many times an example-label pair is misclassified, i.e., a label not belonging to the example is predicted, or a label belonging to the example is not predicted. A smaller value of hamming loss is desirable. It is a label-based evaluation, computed as:

$$Hamming\ loss = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i \oplus \hat{y}_i|}{L}$$

where $\oplus$ denotes the symmetric difference (XOR) of the two label sets. Zero-one loss is a label-set-based evaluation, also known as the exact match measure (or zero-one loss as a loss measure). A smaller value of the zero-one loss is desirable. It can be computed as:

$$Zero\text{-}one\ loss = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[y_i \neq \hat{y}_i]$$

The 0/1 loss is a rigorous measure, since it requires the predicted set of labels to match the actual set of labels exactly; it penalizes an almost-correct prediction as heavily as a completely wrong one. In contrast, hamming loss tends to be very lenient, due to the sparsity of multi-labeling, and ignores the multi-label problem as a whole. That is why multiple measures are used for the performance analysis of multi-label classification.
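The three measures above translate directly into code. The following is a small sketch for 0/1 label matrices, written from the formulas in this subsection rather than from any particular library's API.

import numpy as np

def jaccard_accuracy(Y, Y_hat):
    # |intersection| / |union| per instance, averaged (higher is better).
    inter = np.logical_and(Y, Y_hat).sum(axis=1)
    union = np.logical_or(Y, Y_hat).sum(axis=1)
    per_instance = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(per_instance.mean())

def hamming_loss(Y, Y_hat):
    # Fraction of misclassified instance-label pairs (lower is better).
    return float(np.mean(Y != Y_hat))

def zero_one_loss(Y, Y_hat):
    # Fraction of instances whose label set is not an exact match (lower is better).
    return float(np.mean(np.any(Y != Y_hat, axis=1)))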

C. IMPLEMENTATION DETAILS
The implementation uses a decision tree as the base learner for all the state-of-the-art methods, i.e., BR [16], CC [17], RAkEL [19], ML-DT [20], ECC [18], and the proposed HML-RF. The IG is used as the splitting criterion. The minimum number of samples at a leaf node, used as the stopping criterion, is set to five. The train-test ratio is kept at 0.8, and the maximum depth is kept at fifteen. The ensemble of classifier chains is constructed using ten classifier chains. RAkEL is implemented using a decision tree as the base learner, and the size of the label sets, k, for each dataset is set to half the number of labels. ML-DT is implemented using the joint splitting criterion. The proposed HML-RF is constructed using the JIG as the splitting criterion for the multi-label decision trees. A total of fifty multi-label decision trees are used to construct the HML-RF. All other parameters are kept the same for a fair comparison with the other state-of-the-art methods. All the experiments are repeated ten times, and the average of the obtained results over all the evaluation parameters is quoted. Experiments have been run using Python 3.6 on Ubuntu 18.04 with an Intel Core i7 processor and 4 GB RAM.

D. RESULTS AND ANALYSIS
This section presents the experimental results for the three evaluation measures, i.e., accuracy, hamming loss, and zero-one loss. Tables 2 to 4 show the results for accuracy, hamming loss, and zero-one loss computed over all the datasets. Analyzing the values of all the performance measures in these tables for all the methods, one can observe that the proposed method shows improvement for most of the datasets. To be more specific, it can be observed from Table 2 that the BR method [16] is the least efficient, since it considers all the labels as independent. The CC method [17] considers the label information as features; hence it performs better than the BR approach. RAkEL [19] also performs moderately, whereas the ML-DT algorithm is the second least efficient: although it uses the JIG to take label correlation into account, it does not consider permutations and combinations of labels to compute the label dependency.
One can also see that the ensemble-based methods, ECC and HML-RF, outperform in the overall comparison; ensemble learning improves the prediction capability of the model. Compared to the ECC approach, the proposed HML-RF shows significant improvement over eight out of ten datasets in terms of accuracy (refer to Table 2). We also assign each model a rank (from 1 to 6) based on the metric values computed over the respective datasets; the rank is written in small braces next to each method for each dataset. Finally, to compare their effectiveness more intuitively, we compute the average rank assigned to each method and decide the overall ranking of each method. The HML-RF is assigned rank one, as it classifies most of the datasets efficiently compared to the other algorithms. Table 3 shows the results for hamming loss, for which a lower value is desirable. The proposed HML-RF shows improvement over six out of ten datasets. As expected, the ensemble-based methods outperform. In terms of ranking, the proposed approach is assigned rank one, and the ECC approach is assigned rank two. The other measure is the zero-one loss. It is also known as a binary measure, since it matches the whole instance label set. One can observe from Table 4 that the HML-RF method performs best over six datasets, while the ECC performs best over four out of ten datasets.

TABLE 2. Comparative results for accuracy between the proposed approach and state-of-the-art methods, along with their ranking (written in small braces; the lower the value, the higher the rank) over the respective datasets (the higher the value, the better the results).

TABLE 3. Comparative results for hamming loss between the proposed approach and state-of-the-art methods, along with their ranking (written in small braces; the lower the value, the higher the rank) over the respective datasets (the lower the value, the better the results).

TABLE 4. Comparative results for zero-one loss between the proposed approach and state-of-the-art methods, along with their ranking (written in small braces; the lower the value, the higher the rank) over the respective datasets (the lower the value, the better the results).
However, in terms of the overall rank comparison, the ECC ranking is higher than that of the HML-RF. This is due to the inclusion of lower-strength decision trees; hence, to analyze and overcome the effect of such decision trees, the HML-RFws is proposed. As a graphical representation, Fig. 2 to Fig. 4 show the comparison of the HML-RF and HML-RFws with the other state-of-the-art methods for the accuracy, hamming loss, and zero-one loss measures. These bar graphs show the normalized scores (ranging from 0 to 1) of all three measures computed over the datasets. Apart from this, all the state-of-the-art approaches are compared in terms of their ranking for all the measures, as shown in Fig. 5. We can see that the HML-RF average ranking score is higher than the state-of-the-art approaches for accuracy and hamming loss; for the zero-one loss measure, the average rank score of the ECC is slightly higher than that of the HML-RF. To further investigate the effectiveness of the proposed approach, we compute the overall average rank of the HML-RF approach and all the state-of-the-art methods and plot them (refer to Fig. 6). One can observe that the HML-RF has been assigned a higher rank than the other state-of-the-art methods.

VI. EXPERIMENTS USING HML-RFWS
The efficiency of an ensemble method depends on the performance of its base classifiers [4]. In the case of HML-RF, multi-label decision trees act as base classifiers. However, it is essential to note that decision trees are weak classifiers: their performance is slightly better than a random classifier, and an individual tree does not necessarily contribute enough during testing in an ensemble approach. Therefore, we proposed the concept of tree strength for the multi-label decision tree, as described in Section IV-C. We compute the strengths of all the constructed multi-label decision trees and choose only those trees whose strength exceeds the threshold value to be part of the HML-RFws. The threshold value is decided empirically. Thus, all the qualified multi-label decision trees together construct the Hybrid Multi-Label Random Forest with tree strength (HML-RFws).

A. IMPLEMENTATION DETAILS
All the parameters required for the implementation of the HML-RFws are kept the same as for the HML-RF, as described in Section V-C. One additional parameter is required: the tree strength threshold value. To decide the optimum threshold value, we conduct experiments over threshold values ranging from 0.1 to 0.9 with a step size of 0.1. The HML-RFws also consists of 50 multi-label decision trees. However, these decision trees are selected under the condition that the decision tree strength exceeds the threshold; otherwise, a tree is discarded and another multi-label decision tree is constructed. The search for multi-label decision trees continues until the top fifty qualified trees are obtained or the tree construction process has been repeated 1000 times; this upper limit is set to reduce the complexity. A sketch of this qualification loop is given at the end of this subsection. Table 5 shows the accuracy of the HML-RFws forest computed over various threshold values. If the number of qualified trees for a specific threshold value is less than 50, we mention the actual number of qualified trees found during the construction of the multi-label trees (the actual tree count is written beside the accuracy value). The hamming loss and the zero-one loss values computed over the varying range of thresholds are shown in Table 6 and Table 7, respectively. One can conclude that at the 0.7 threshold value, the HML-RFws shows improved performance for all three measures with respect to all the datasets. We compute the accuracy, hamming loss, and zero-one loss using the HML-RFws and compare the results to the HML-RF, i.e., the forest constructed without considering the tree strength. Table 8 shows the comparison of the HML-RFws with the HML-RF results.

TABLE 8. Comparative results for accuracy, hamming loss, and zero-one loss between HML-RF and HML-RF with strength (HML-RFws), computed with a tree strength threshold value of 0.7 for all the datasets.

It can be observed that the overall results for accuracy, hamming loss, and zero-one loss are improved by considering the tree strength. Further, we also compare the accuracy, hamming loss, and zero-one loss of the proposed HML-RFws with all the other approaches. A visual comparison of all three measures is shown in Fig. 2 to Fig. 4 for all the state-of-the-art methods along with HML-RF and HML-RFws; the results are normalized for better visual appearance. One can observe the further improvement in the ranking of the HML-RFws as compared to the HML-RF approach. In particular, the rank of HML-RFws improved for the genbase and slashdot datasets in terms of accuracy, from (3) to (1) and from (2) to (1), respectively.
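The qualification loop described above can be sketched as follows; grow_tree and strength are stand-ins for the tree-construction routine and the Eq. (4) computation, respectively, not names from the authors' code.

def build_hml_rfws(grow_tree, strength, threshold=0.7,
                   n_trees=50, max_attempts=1000):
    # Keep growing candidate trees until n_trees pass the strength threshold
    # or the attempt budget is exhausted (the forest may then hold fewer trees).
    qualified = []
    for _ in range(max_attempts):
        tree = grow_tree()               # bootstrap sample + random label subset
        if strength(tree) > threshold:   # only sufficiently strong trees qualify
            qualified.append(tree)
        if len(qualified) == n_trees:
            break
    return qualified

The 1000-attempt cap bounds the construction cost, which is why some cells of Table 5 report fewer than 50 qualified trees at high thresholds.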

B. RESULTS AND ANALYSIS
Similarly, the HML-RFws rank improved for the emotions, scene, genbase, and corel5k datasets in terms of hamming loss. The rank of HML-RFws also improved for the zero-one measure over the emotions, yeast, enron, and slashdot datasets. This also results in an improvement of the overall average rank score of the HML-RFws. Refer to Fig. 5 for the comparison of rankings across all the measures. The overall average rank score comparison of all the algorithms is shown in Fig. 6.

C. STATISTICAL TEST ANALYSIS
To strengthen the claim of the obtained results, we apply a statistical paired-sample t-test [49] over the accuracy measure, with the significance level set to 0.05. A total of 10 iterations are performed for each of the experiments. The null hypothesis is defined as: there is no significant difference between the proposed method and each of the other methods in terms of accuracy. Let A and B be two methods, and let $a^{(i)}$ and $b^{(i)}$ be their accuracies in the i-th iteration. Define the paired differences $d^{(i)} = a^{(i)} - b^{(i)}$, with mean $\bar{d}$ and standard deviation $s_d$; the test statistic is then $t = \bar{d} / (s_d / \sqrt{n})$. Under the null hypothesis, for the n = 10 iterations, the degrees of freedom are n - 1 = 9; therefore, the null hypothesis can be rejected if $|t| > t_{9, 0.975} = 2.262$ (two-sided). All the obtained statistical test values are shown in Table 9. It can be concluded that there is a significant difference between HML-RFws and the other state-of-the-art methods in terms of accuracy. The only close competitor of the HML-RFws algorithm is the ECC algorithm, which shows the most consistent results; this is because the ECC method is an ensemble approach consisting of several classifier chains, and it considers the label correlation through the chains. Nevertheless, in comparison to the ECC method, the HML-RFws method shows significant improvement for six out of ten datasets, whereas in comparison to the other four state-of-the-art approaches, HML-RFws shows significant improvement for most of the datasets.
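The same paired test can be reproduced with SciPy, as in the example below; the accuracy vectors are placeholders for illustration, not the paper's reported numbers.

from scipy.stats import ttest_rel

# Accuracies of two methods over the same 10 iterations (illustrative values).
acc_hml_rfws = [0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.62, 0.63, 0.60, 0.62]
acc_ecc      = [0.58, 0.59, 0.57, 0.60, 0.61, 0.58, 0.59, 0.60, 0.57, 0.59]

t_stat, p_value = ttest_rel(acc_hml_rfws, acc_ecc)
# With n = 10 iterations (9 degrees of freedom), the null hypothesis is
# rejected at the 0.05 level when |t| > 2.262, i.e., when p_value < 0.05.
print(t_stat, p_value)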

VII. CONCLUSION
This paper proposed a simple yet effective Hybrid Multi-Label Random Forest (HML-RF) approach that combines the advantages of both the problem transformation and algorithm adaptation approaches. It considers subsets of the label-set as the output labels to construct the multi-label decision trees, and label dependency is taken into account using the joint splitting criterion. To further improve the efficacy of the proposed approach, the tree strength concept was proposed to compute the strength of a multi-label decision tree, and a threshold value in terms of tree strength was decided empirically for including trees in the HML-RFws. The HML-RF and HML-RFws were tested over ten well-known publicly available benchmark datasets. Experimental results, along with statistical validation, show the efficacy of the HML-RF and HML-RFws compared with other state-of-the-art methods. Further, all the algorithms were ranked based on their accuracy, hamming loss, and zero-one loss values; the HML-RFws method showed significant improvement and was assigned rank one. Overall, it has been observed that label dependency plays an essential role in multi-label classification, and the inclusion of tree strength can improve the performance of the forest.