A Novel Hybrid Feature Selection Algorithm for Hierarchical Classification

Feature selection is a widespread preprocessing step in the data mining field. One of its purposes is to reduce the number of original dataset features to improve a predictive model's performance. Despite the benefits of feature selection for the classification task, to the best of our knowledge, few studies in the literature address feature selection for the hierarchical classification context. This paper proposes a novel feature selection method based on the general variable neighborhood search metaheuristic, combining a filter and a wrapper step, wherein a global model hierarchical classifier evaluates feature subsets. We used twelve datasets from the protein and image domains to perform computational experiments to validate the effect of the proposed algorithm on classification performance when using two global hierarchical classifiers proposed in the literature. Statistical tests showed that using our method for feature selection led to predictive performance that was consistently better than or equivalent to that obtained by using all features, with the benefit of reducing the number of features needed, which confirms its usefulness for the hierarchical classification scenario.


I. INTRODUCTION
Data mining applications have become essential in recent years due to the massive increase in the amount of data generated and stored. The manipulation of data to transform it into understandable and advantageous information creates new research challenges.
Feature selection aims to identify as many relevant features as possible and decrease the costs for processing data. Typically, data mining tasks use feature selection as a preprocessing step. In this paper, we will focus on feature selection approaches for the classification task. Therefore, we considered only datasets with labeled instances. Improving classifiers' predictive accuracy and reducing the execution time of classification are some of the benefits of feature selection [1].
Among data mining tasks, classification has received considerable attention from the scientific community [1].
Classification predicts the class label(s) of examples based on the problem domain represented by its features. There are different complexity levels of classification problems in the literature. In traditional (flat) classification problems, one or more class labels are assigned to each dataset instance, and the classes are independent of each other. However, many real applications involve more complex classification problems in which the classes that label instances are organized into a hierarchical structure [2] represented by a tree or a directed acyclic graph (DAG); these are so-called hierarchical classification problems.
Studies have proposed different methods to solve hierarchical classification problems. These methods are categorized as local or global approaches according to how the method handles the class hierarchy. In the local approach, classification is conducted using a set of flat classifiers. In contrast, the global approach uses a single classifier that considers the class hierarchy as a whole. Hierarchical classification methods may also be able to predict different numbers of paths of labels. A method can be restricted to predicting only a single path of labels (single-label problem) or multiple paths of labels (multilabel problem).
Despite the benefits of using feature selection methods as a preprocessing step for the classification task, many of the existing feature selection techniques in the literature cannot be directly applied to a hierarchical classification scenario. The initial efforts to solve feature selection for the hierarchical classification problem applied conventional feature selection techniques and constructed classifiers by breaking down the hierarchical classification problem into several flat classification problems. This type of approach allowed researchers to use feature selection techniques and classification algorithms traditionally adopted in flat classification [3]–[5].
A few recent approaches that also use a set of flat classifiers have proposed techniques based on recursive regularization that consider the hierarchical information of classes (e.g., parent-child, sibling, and graph relations) [6], [7]. In addition to structure information, another approach used a semantic description of class labels to select different feature subsets for each subclassifier [8]. It is worth mentioning that none of them conducted experiments using global hierarchical classifiers. Other ranking-based methods have proposed readjusting some existing popular filter feature selection algorithms to consider the hierarchical structure of classes [9], [10].
Unlike the previously mentioned studies, we propose a feature selection approach designed specifically for global model hierarchical classifiers that directly address class hierarchy relations. In the literature, several works propose modifications to existing flat classifiers to address the entire class hierarchy in a single step [11]–[18]. Given the relevance of global classifiers to the hierarchical classification scenario, one can see the importance of developing preprocessing techniques capable of handling the class hierarchy as a whole.
This paper presents a hybrid supervised feature selection method, combining filter techniques to form the ranking of features and metaheuristic techniques to search and evaluate feature subsets to construct solutions capable of improving the predictive performance of global hierarchical classifiers. This paper is an extension of a previous work [19] in the following aspects:
• We propose an algorithm that uses a variation of the variable neighborhood search (VNS) [20] metaheuristic, called general variable neighborhood search (GVNS) [21], that applies the basic variable neighborhood descent (B-VND) [22] procedure as a local search method.
• We characterize and compare the running time behavior of the GVNS algorithm to its previous version.
• We add experiments with a wrapper-based feature selection method to compare the effectiveness of the proposed algorithm.
• We include experiments with an additional hierarchical classifier that uses induction of clustering trees for hierarchical multi-label classification (CLUS-HMC).
• Finally, we conduct experiments considering a more extensive dataset collection that covers different domains.
To summarize, our major contributions in this work are as follows:
• We propose a method that jointly exploits a filter-based approach, adapted to consider the hierarchical structure of classes, and a search-based metaheuristic technique to find the best subset of features.
• We propose an efficient feature selection algorithm for the supervised hierarchical single-label classification task.
• We conduct extensive experiments on twelve real-world hierarchical datasets from protein and image domains to evaluate our approach's efficacy.
• The proposed method is consistently better than or equivalent to our previous algorithm [19].
• When we consider the running time behavior, the proposed method performs better than our previous method [19] since it achieves improvements in solution quality earlier.
The remainder of this work is organized as follows. Section II presents an overview of hierarchical classification and feature selection. In Section III we present the related work, and in Section IV we describe the problem addressed in this work. The proposed algorithm is detailed in Section V. Section VI presents the computational experiments and reports the results of the comparative experiments. Finally, conclusions and directions for future work are described in Section VII.

II. BACKGROUND
Throughout Sections II-A and II-B, we present an overview of hierarchical classification and feature selection methods, respectively.

A. HIERARCHICAL CLASSIFICATION
Most classification studies in the data mining field are related to flat classification problems, in which the classes are independent of each other. However, in many real applications, the classes that label instances are organized into a hierarchical structure.
Different aspects can characterize hierarchical classification methods [2]. The first aspect is related to the type of hierarchical structure that the method can process (tree or DAG). Fig. 1 presents examples of a tree and a DAG, where the nodes represent the classes and the edges indicate the relationships between them. Basically, in a tree structure (Fig. 1a), each node (class) can possess only one parent node, while in a DAG (Fig. 1b), a child node (class) can have multiple parent nodes. The second aspect is related to how deep in the class hierarchy the method makes its predictions. A method can perform either mandatory leaf node prediction (MLNP) or nonmandatory leaf node prediction (NMLNP). In MLNP, the most specific class assigned to an instance must be one of the classes at a leaf node of the class hierarchy. In contrast, in NMLNP, any class node in the hierarchy (internal or leaf) can be assigned to an instance.
The third aspect refers to the number of different paths of labels in the class hierarchy in which the method can associate an instance. The methods may predict just a single path of labels in the class hierarchy (single-label problem) or be less restricted, predicting multiple paths of labels (multilabel problem), for each instance.
Finally, the fourth aspect concerns how the classification method handles the class hierarchy. Classification methods can perform either flat or hierarchical classification (using a local or global model approach). In flat classification, the methods ignore the class hierarchy and make predictions considering only the classes associated with leaf nodes. In the local model approach, the class hierarchy is explored through a local perspective using a combination of classifiers that consider, in an isolated manner, different parts of the hierarchy. According to Silla Junior and Freitas [2], we can categorize local model approaches according to how they use the local information of the hierarchical structure and how they build their classifiers around it. There are three standard ways of using local information: a local classifier per node, a local classifier per parent node, and a local classifier per hierarchical level. The global model approach uses only one classifier, i.e., it builds a single model considering the class hierarchy as a whole.
In the literature, several works proposing modifications to existing flat classifiers to address the entire class hierarchy in a single step are available. Some examples of modifications of traditional flat classification algorithms are the following: HC4.5 [11] and HLC [12], modified versions of C4.5; global model naive bayes (GMNB) [13], a modified version of the naive bayes; CLUS-HMC [14], a method based on predictive clustering trees; hant-miner [15] and hmant-miner [16], both adaptations of the ant-miner algorithm; HMC-LMLP [17], a neural network method based on multilayer perceptron; and, more recently, the CSHCIC method [18], which integrates hierarchical classification and cost-sensitive learning to reweight training data for the imbalanced class problem.

B. FEATURE SELECTION IN CLASSIFICATION
Feature selection has received increasing attention from researchers in recent years due to the continued rapid growth in the volume of data. Powerful as a preprocessing step, it selects a subset of predictive features to improve the performance of learning models. Data containing irrelevant or redundant features can reduce the predictive capability and increase the classification processing time of classifiers [23]. Several research works have already shown that in specific datasets some of the features can be removed from the feature set without jeopardizing the predictive accuracy of a classifier [24]. In practice, the use of feature selection in the classification task can result in the following benefits [25]: (i) Improvement of the predictive capability of classifiers.
(ii) Reduction of the running time spent in the classification learning process. (iii) Development of simplified classification models, which allow for easier interpretation.
We can categorize feature selection methods according to different aspects. The first aspect is related to the use of labels (class values). Feature selection methods can process datasets whose instances are fully labeled, partially labeled, or unlabeled, leading to the development of supervised, semisupervised, and unsupervised algorithms, respectively. A supervised feature selection algorithm determines the relevance of features by evaluating their correlation with the class feature. In this paper, we considered datasets with labeled instances. Therefore, we will focus on studies that proposed feature selection approaches for the supervised learning context, specifically feature selection approaches for the classification task.
Another aspect is related to how the methods evaluate the quality of the predictive features. In this sense, we can consider different approaches that generally can be categorized into embedded, filter, wrapper, or hybrid (involving possible combinations among embedded, filter, and wrapper) methods [26].
A method is categorized as a filter when it uses only intrinsic properties of the data. However, when a method uses a classifier to assess the quality of a given feature subset, it is categorized as a wrapper. Filter methods have the advantage of being independent of a classifier and are generally faster than wrapper techniques. Nevertheless, the wrapper approach usually has the advantage of achieving higher predictive performance than filters.
When we use an embedded feature selection approach, the classification model performs feature selection simultaneously with its creation. Typical examples of these techniques are decision tree algorithms because they select features placed into the nodes of the generated trees [12], [14], [27].
As outlined above, filter approaches are independent of the classification algorithm that will be applied. They use the features' intrinsic properties (i.e., the ''relevance'' of the features) to evaluate the quality of features or subsets of features. Typically, one can divide techniques based on filter approaches into two groups: feature ranking-based approaches and search-based approaches.
Feature ranking-based approaches apply statistical metrics to evaluate each feature individually, rank features according to their relevance, and select the top k features from the ranked list (where k is a predefined number). This approach's drawback is that it considers only one feature per evaluation (univariate method), ignoring the correlations between features. One feature that is irrelevant by itself can be significantly informative when considered together with other features [25]. Examples of ranking-based methods are information gain (IG) feature ranking [28], symmetric uncertainty [25], gain ratio [25], and chi-squared [28].
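For concreteness, the following C++ sketch (our own illustration, with identifiers of our choosing; the hierarchical variant SU_H [10] additionally accounts for the class hierarchy) computes the flat symmetrical uncertainty of a discrete feature from empirical counts:

#include <cmath>
#include <map>
#include <utility>
#include <vector>

// Entropy (in bits) of a discrete empirical distribution given counts.
static double entropyFromCounts(const std::map<int, int>& counts, int n) {
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log2(p);
    }
    return h;
}

// SU(f, y) = 2 * IG(f; y) / (H(f) + H(y)), with
// IG(f; y) = H(f) + H(y) - H(f, y). Values lie in [0, 1].
double symmetricalUncertainty(const std::vector<int>& f,
                              const std::vector<int>& y) {
    const int n = static_cast<int>(f.size());
    std::map<int, int> cf, cy;
    std::map<std::pair<int, int>, int> cfy;
    for (int i = 0; i < n; ++i) {
        ++cf[f[i]];
        ++cy[y[i]];
        ++cfy[{f[i], y[i]}];
    }
    const double hf = entropyFromCounts(cf, n);
    const double hy = entropyFromCounts(cy, n);
    double hfy = 0.0;
    for (const auto& kv : cfy) {
        const double p = static_cast<double>(kv.second) / n;
        hfy -= p * std::log2(p);
    }
    const double ig = hf + hy - hfy;
    return (hf + hy > 0.0) ? 2.0 * ig / (hf + hy) : 0.0;
}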
Search-based approaches consider the relationships between features in a feature subset (as a multivariate method) and search the space of possible feature subsets. Each feature subset considered by the search method represents a candidate solution, whose quality is measured by an evaluation function. Assuming that the evaluation function penalizes redundant feature subsets, this approach has the advantage of eliminating feature redundancy. However, these approaches take more time to generate and measure each feature subset's quality, making them slower than univariate approaches. Recall that if there are n features initially, then there are 2^n possible subsets, so evaluating every candidate subset is prohibitive for all but very small n.
In this sense, one can apply various heuristic search strategies such as hill climbing and best first [25] to search the feature subset space in a reasonable time. Metaheuristic algorithms such as simulated annealing (SA) [29], genetic algorithm (GA) [30], and particle swarm optimization (PSO) [31] have also been applied efficiently as search-based feature selection approaches. Recently, researchers have explored strategies that design parallel algorithms to improve the running time of their feature selection approach, as proposed by Huang et al. [32] for internet text classification. Examples of search-based methods are correlation-based feature selection (CFS) [23], [33], and consistency-based feature selection [34].
In wrapper approaches, the same classifier used in the classification step evaluates the quality of the feature subsets. Therefore, the ''usefulness'' of a given subset of features is measured by evaluating the trained classifier using only the features included in that subset. Like search-based filter approaches, wrapper approaches must search among possible subsets of features. Each feature subset is then used to train a classification model evaluated according to some performance measure [35]. The search process proceeds until it finds the subset with the highest evaluation in terms of the classifier's predictive performance.
Methods that follow a wrapper approach generally produce better predictive performance results than those based on a filter approach since the classification algorithm itself drives feature selection. However, in wrapper-based methods, the classifier must be trained and evaluated multiple times during the search process, which could cause very high computational costs, making the method impractical for high-dimensional datasets [36]. Therefore, in the last few years, hybrid filter-wrapper techniques have become the focus of many studies, as in this way, they aggregate the advantages of filter and wrapper approaches. Examples of hybrid filter-wrapper algorithms designed for flat classification problems are HFS-C-P, a framework that integrates a correlation-guided clustering technique and PSO [37]; BDE-X Rank, an approach that combines a wrapper method based on a binary differential evolution (BDE) algorithm with a ranking-based filter method [38]; MIMAGA, an algorithm that combines the mutual information maximization (MIM) and the adaptive genetic algorithm (AGA) [39], and HI-BQPSO, a method that combines a filter technique with an improved quantum-behavior PSO algorithm [40].
In this paper, we design a hybrid feature selection method for the hierarchical classification context based on the GVNS [21] metaheuristic. It combines a filter step, wherein a feature ranking is constructed based on the hierarchical symmetrical uncertainty (SU_H) measure [10], with a wrapper step, wherein a global model classifier evaluates feature subsets. We used two classifiers of this type, the GMNB [13] and the CLUS-HMC [14].

III. RELATED WORK
Few studies in the literature discuss feature selection techniques for the hierarchical classification scenario as previously defined.
In the work of Koller and Sahami [3], document classification (whose classes represent a hierarchy of topics) was addressed through the local model classification approach combined with probabilistic methods for feature selection and classification. They construct a binary classifier for each node of the class hierarchy. A feature selection method is then applied to identify the most relevant features for constructing each local classifier. The feature selection method uses a measure from information theory previously proposed by Koller and Sahami [41]. As a result of this application, besides improving the predictive accuracy, reducing the number of features allowed more robust and simpler classifiers.
Secker et al. [4] solved the problem of predicting protein functions by performing feature selection in conjunction with a local hierarchical classification approach. They used a top-down hierarchical classification strategy to select both classifiers and features for each dataset and each node of the hierarchy. Thus, in each node where a classifier has been constructed, a feature selection step is performed to reduce the dataset dimensionality of that particular node. The proposed feature selection method uses the CFS and the best first algorithm, both available in the WEKA data mining toolkit [42], [43]. They conducted experiments to determine whether feature selection could improve computational efficiency without jeopardizing accuracy in predicting protein functions. Their experiments showed that this top-down system proposal significantly reduced the time required to train and test the classification model while maintaining the predictive accuracy.
Paes et al. [5] explored the use of feature selection techniques to improve the predictive performance of two different hierarchical classification approaches, the local per parent node and local per level approaches. They proposed a method that produces a ranking of the features using the IG measure [44]. After forming the ranking, the p best features are selected, where p is an input parameter of the method. They used datasets from the bioinformatics area to conduct their experiments and concluded that the classifiers' best results occurred when some feature selection strategy was adopted.
In all of the works mentioned above, the feature selection techniques and classifier construction were performed by decomposing the hierarchical classification problem into several flat ones, which allowed the researchers to use feature selection techniques and classification algorithms traditionally adopted in flat classification. Some recent approaches that use local model classifiers have proposed techniques based on recursive regularization that consider the hierarchical structure of classes to select different feature subsets for each subclassifier [6]–[8].
Zhao et al. [6] first proposed a hierarchical feature selection technique based on recursive regularization, using parent-child and sibling relations in a tree for hierarchical regularization. Experimental results showed that their algorithm efficiently selects different feature subsets for each node in a hierarchical tree structure. They achieved competitive results in both classification accuracy and computational efficiency compared with flat feature selection approaches.
Similarly, Tuo et al. [7] proposed a hierarchical feature selection method with graph regularization. They sequentially used each internal node as the root node and the corresponding child nodes as leaf nodes, forming different subtrees. Then, they constructed parent-child relations as regularization of any two subtrees in the hierarchical tree structure. Their algorithm can also use the DAG label structure. They compared their method with different feature selection methods on six image datasets. The experimental results validate the efficiency and effectiveness of the proposed algorithm.
Huang and Liu [8] proposed the most recent study that uses recursive regularization. This is the first attempt to explore a method to take advantage of the semantic description and the hierarchical structure of class labels in supervised feature selection. First, they represent the label descriptions as semantic regularization via a vector of real numbers using sentence embedding techniques. Then, they propose a similarity score based on the attention mechanism to calculate the relevance between pairwise label vectors. Consequently, they explore the semantic similarities of labels and use them to guide feature selection. They also used parent-child and sibling relations as structural regularization. Finally, they built a supervised learning model and imposed semantic and structural regularization terms on each subclassifier. Their proposed framework outperformed the state-of-the-art feature selection methods in the hierarchical classification domain.
Unlike those studies, our approach does not train one classifier per tree node but works in association with a global hierarchical classifier, directly addressing the hierarchical structure of classes as a whole.
Other ranking-based methods propose adapting some existing popular filter feature selection algorithms to handle the hierarchical structure of classes [9], [10]. The work of Slavkov et al. [9] proposes a feature selection technique capable of handling the class hierarchy as a whole, without decomposing the hierarchical problem into several flat classification problems, for hierarchical multilabel classification problems. They developed an adaptation of the ReliefF [45] algorithm to the hierarchical multilabel context, called HMC-ReliefF. They employed forward feature addition (FFA) curves, a stepwise filter-like procedure that constructs classifiers for different numbers of top-k ranked features, to evaluate their method. By comparing the HMC-ReliefF curve to an expected FFA curve obtained from a set of random rankings of features, their experiments showed that the HMC-ReliefF algorithm performed well on various datasets. Our approach is different because we address the hierarchical single-label classification scenario.
Concerning hierarchical single-label classification, Dias and Merschmann [10] proposed an adaptation of the symmetrical uncertainty (SU) filter measure to consider the hierarchical structure of classes. They performed a comparative analysis between the ranking generated from SU_H and a randomly generated ranking, in which the most relevant features were dispersed throughout the ranking positions. In this comparative evaluation, as expected, the SU_H ranking resulted in higher predictive performance of the GMNB classifier than the random rankings. We use the SU_H filter measure to construct rankings and combine it with a wrapper step in our approach.
This work is an extension of a previous study [19] in which we proposed a hybrid algorithm based on the VNS metaheuristic, named VNS-FSHC, using the SU H measure in a filter step and the GMNB as the classifier of a wrapper step. The present work builds on this preliminary effort by providing a more efficient framework based on the VNS metaheuristic. Furthermore, we include experiments using two hierarchical classifiers (GMNB and CLUS-HMC) and consider a more extensive dataset collection covering different domains. We also add a wrapper hierarchical feature selection method to compare the effectiveness of our approach.
It is worth mentioning that Cerri et al. [46] proposed using the CLUS-HMC decision tree induction classifier as a feature selector and checked whether the features selected to construct its tree were sufficiently good to be used as input for two hierarchical multilabel classifiers based on neural networks and genetic algorithms. Their experimental results show that using CLUS-HMC as a feature selector led to better results than using conventional flat multilabel methods, showing the need to develop feature selection methods that specifically consider hierarchical class relationships.

IV. PROBLEM DEFINITION
The time needed to train and execute a classifier, its complexity, the probability of overfitting, and the dataset dimensionality increase as the number of features increases. Thus, removing irrelevant and redundant features from datasets can improve the accuracy of the predictive classifier, simplify the generated classification model, and reduce the time spent training a classifier. For this reason, feature selection is one of the most popular data preprocessing tasks in the data mining literature.

V. PROPOSAL
Next, we discuss our proposed hybrid feature selection method, which uses the GVNS metaheuristic to solve the FSHC problem. The representation of a solution and its evaluation are presented in Section V-A. Sections V-B and V-C describe how to build an initial solution and how to apply the neighborhood structures to explore the solution space of the problem, respectively. Finally, Section V-D provides a detailed description of the proposed algorithm, general variable neighborhood search for feature selection in hierarchical classification (GVNS-FSHC).

A. SOLUTION REPRESENTATION AND EVALUATION
Our method's first step is to generate an initial solution X ⊆ A, where A is the set of predictive features, and then to explore the problem's solution space from this starting point.
To evaluate each generated solution X = {x_1, x_2, ..., x_m}, with m ≤ n, we used the 5-fold cross-validation strategy and the hierarchical F-measure (hF) [47] to measure the performance of each global hierarchical classifier adopted.
The hF measure is an adaptation of the traditional F-measure, widely used in flat classification problems, that takes the class hierarchy into account.
The quality of the solution X is calculated according to (1):

hF(X) = (2 × hP(X) × hR(X)) / (hP(X) + hR(X))    (1)

where hP(X) and hR(X) represent the hierarchical precision and the hierarchical recall, respectively. Considering P_j as the set consisting of the most specific class predicted for the test instance j and all its ancestor classes, and T_j as the set consisting of the true most specific class of this same test instance and all its ancestor classes, hP(X) and hR(X) of solution X can be defined according to (2) and (3):

hP(X) = Σ_j |P_j ∩ T_j| / Σ_j |P_j|    (2)

hR(X) = Σ_j |P_j ∩ T_j| / Σ_j |T_j|    (3)
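To make (1)-(3) concrete, the following C++ sketch (an illustration of ours, not code from the evaluated systems) computes hF from the ancestor-extended sets P_j and T_j:

#include <set>
#include <string>
#include <vector>

// Each test instance j contributes its predicted set P_j and its true set
// T_j, both already extended with all ancestor classes in the hierarchy.
struct Prediction {
    std::set<std::string> P;   // predicted class plus ancestors
    std::set<std::string> T;   // true class plus ancestors
};

// Hierarchical F-measure as in (1)-(3): sums of intersection sizes divided
// by the sums of |P_j| (precision) and of |T_j| (recall).
double hierarchicalF(const std::vector<Prediction>& test) {
    double inter = 0.0, pred = 0.0, truth = 0.0;
    for (const auto& j : test) {
        for (const auto& c : j.P)
            if (j.T.count(c)) inter += 1.0;
        pred += j.P.size();
        truth += j.T.size();
    }
    const double hP = pred > 0.0 ? inter / pred : 0.0;
    const double hR = truth > 0.0 ? inter / truth : 0.0;
    return (hP + hR > 0.0) ? 2.0 * hP * hR / (hP + hR) : 0.0;
}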

B. BUILDING AN INITIAL SOLUTION
As in Costa et al. [19], we used the incremental wrapper subset selection (IWSS) approach [48] to generate the initial solution. IWSS has two steps: (i) Filter: a filter-based measure evaluates each predictive feature independently with respect to the dataset classes to create a ranking R. We used the SU_H measure [10] to consider the hierarchical context. The ranking R of all features is then constructed using the roulette wheel method, as in the survival selection phase of GAs [49]. Thus, a feature's selection probability is proportional to its SU_H value relative to the values of all other predictive features. That is, the features best evaluated according to the SU_H metric are more likely to be selected in the first rounds of the roulette wheel method, occupying the initial ranking positions. (ii) Wrapper: the initial solution X starts with the best-ranked feature in R. Then, we iteratively try to insert the next feature A_i ∈ R into X by evaluating the performance of the expanded subset X' = X ∪ {A_i}. We evaluate the quality of each candidate subset X' in a wrapper way (using a global model classifier). If X' increases the classifier's predictive performance, A_i is added to X; otherwise, it is discarded.
We used the same 5-fold cross-validation method in all wrapper evaluations to ensure fair comparisons. Additionally, we complement the IWSS method by adding a step that verifies feature redundancy. When analyzing the inclusion of a feature A_i in the initial solution, if its insertion into X does not improve the classifier's performance, we try to swap it with each feature already inserted in X. Then, if one of these temporary subsets improves the classifier's performance relative to X, the best-evaluated subset is maintained for the next iteration.
For instance, let X = {A_1, A_2} and hF(X) = 0.70. We insert A_3 into X, but it does not improve the classifier's performance. Therefore, we generate the temporary subsets Y = {A_3, A_2} and Z = {A_1, A_3}, where hF(Y) = 0.75 and hF(Z) = 0.60. As hF(Y) is greater than hF(X), Y is maintained for the next iteration, which would seek to include A_4 in Y. This procedure aims to revoke some previous decisions by identifying selected features that may become ineffective after the insertion of another feature. Thus, this step follows the well-known proximate optimality principle (POP) [50].
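The following C++ sketch (with illustrative names; `evaluate` stands for the 5-fold cross-validation of the global hierarchical classifier) summarizes this wrapper step, including the swap-based refinement:

#include <functional>
#include <vector>

// Sketch of the IWSS wrapper step with the swap-based refinement described
// above. `ranking` is the SU_H-biased feature ranking from the filter step;
// `evaluate` is assumed to return the average hF of a feature subset.
std::vector<int> iwssInitialSolution(
        const std::vector<int>& ranking,
        const std::function<double(const std::vector<int>&)>& evaluate) {
    std::vector<int> X = {ranking[0]};      // best-ranked feature first
    double best = evaluate(X);
    for (size_t i = 1; i < ranking.size(); ++i) {
        const int a = ranking[i];
        std::vector<int> expanded = X;
        expanded.push_back(a);
        const double hf = evaluate(expanded);
        if (hf > best) {                    // plain insertion succeeded
            X = expanded;
            best = hf;
            continue;
        }
        // Insertion failed: try swapping a with each feature already in X
        // and keep the best swap if it beats the current solution (POP step).
        std::vector<int> bestSwap;
        double bestSwapHF = best;
        for (size_t j = 0; j < X.size(); ++j) {
            std::vector<int> Y = X;
            Y[j] = a;
            const double h = evaluate(Y);
            if (h > bestSwapHF) { bestSwapHF = h; bestSwap = Y; }
        }
        if (!bestSwap.empty()) { X = bestSwap; best = bestSwapHF; }
    }
    return X;
}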

C. NEIGHBORHOOD STRUCTURES
We considered three types of neighborhoods for a solution X to search the problem solution space: (i) Neighborhood structure N_1: it consists of removing a feature X_j ∈ X from X, that is, X' = X \ {X_j}. (ii) Neighborhood structure N_2: it consists of inserting a feature A_i ∈ (A \ X) into X, that is, X' = X ∪ {A_i}. (iii) Neighborhood structure N_3: it consists of swapping a feature X_j ∈ X with a feature A_i ∈ (A \ X), that is, X' = (X \ {X_j}) ∪ {A_i}.
For the example described in Section IV, given the solution X = {word count, verb count, noun count}, a swap movement consists of swapping a feature in X with another that is not already in X. Thus, X' = {word count, character count, noun count} is a neighbor of X under the swap movement. Likewise, X' = {word count, verb count, character count, noun count} is a neighbor under the insertion movement, and X' = {verb count, noun count} is a neighbor of X produced by the removal movement.
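A minimal C++ sketch of the three moves (our illustration; uniform sampling stands in here for the SU_H-biased roulette wheel described in Section V-D):

#include <iterator>
#include <random>
#include <set>

// Sketch of the three moves over a solution X (selected feature indices)
// drawn from the full feature set A.
enum class Move { Remove, Insert, Swap };   // N1, N2, N3

std::set<int> neighbor(std::set<int> X, const std::set<int>& A,
                       Move m, std::mt19937& rng) {
    auto pick = [&rng](const std::set<int>& s) {
        std::uniform_int_distribution<size_t> d(0, s.size() - 1);
        auto it = s.begin();
        std::advance(it, d(rng));
        return *it;
    };
    std::set<int> outside;                   // A \ X
    for (int a : A)
        if (!X.count(a)) outside.insert(a);
    if (m == Move::Remove && X.size() > 1) {            // N1: drop a feature
        X.erase(pick(X));
    } else if (m == Move::Insert && !outside.empty()) { // N2: add a feature
        X.insert(pick(outside));
    } else if (m == Move::Swap && !X.empty() && !outside.empty()) { // N3
        X.erase(pick(X));
        X.insert(pick(outside));
    }
    return X;
}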

D. GVNS APPROACH TO SOLVE FSHC
This section presents the GVNS-FSHC algorithm, an adaptation of the GVNS metaheuristic [21] to solve the FSHC problem.
GVNS is a variation of the VNS metaheuristic, a framework for building heuristics based on neighborhoods' systematic changes. It is applied to find a local minimum in a descent step and escape from the corresponding valley in a perturbation step [21]. GVNS differs from VNS in the local search method. While the local search is conventional in VNS, in GVNS, the local search is performed by the variable neighborhood descent (VND) [20] method.
In our GVNS-FSHC algorithm, we apply the basic sequential VND, named B-VND in Hansen et al. [21]. Algorithm 1 presents the pseudocode of the proposed GVNS-FSHC.

Algorithm 1 GVNS-FSHC
1: Input: D, C, M, N_1, N_2, N_3, attempt_max, RDrate, w, t_max
2: Output: best feature subset X found
3: X ← IWSS(D, C, M);
4: k ← 1;
5: attempt ← 0;
6: repeat
7:   attempt ← attempt + 1;
8:   Randomly choose a neighborhood N_l, l ∈ {1, 2, 3};
9:   X' ← Shake(X, N_l, k);
10:  X'' ← B-VND(X', D, C, M, w, N_1, N_2, N_3, RDrate);
11:  if Relevance(X'', X, w) then
12:    X ← X'';
13:    k ← 1;
14:    attempt ← 0;
15:  else
16:    if attempt > attempt_max then
17:      k ← k + 1;
18:      attempt ← 0;
19:    end if
20:  end if
21: until t > t_max
22: return X;

In Algorithm 1, D, C, and M are the training set, the hierarchical classifier, and the SU_H filter measure, respectively. Furthermore, N_1, N_2, and N_3 are the neighborhoods defined in Section V-C, and t denotes the elapsed running time. The attempt_max, RDrate, and w inputs are predefined parameters and will be explained below.
The attempt_max parameter defines the maximum number of attempts without improvement using the same level of perturbations k in the Shake function. In a classical GVNS algorithm, the level of perturbations is increased whenever there is no improvement in the solution. Instead, in our algorithm, we only increase the level of perturbations after performing some local search attempts without improving the current solution. This strategy follows the ideas introduced by Reinsma et al. [51] and used successfully in Santos et al. [52].
RDrate is a percentage rate used in the B-VND improvement procedure, described in Section V-D2. Finally, w is the number of folds in which the tested solution's evaluation (hF) must be greater than or equal to the current solution's evaluation. The Relevance function is described in Section V-D1.
The algorithm generates an initial solution X (line 3) by applying the IWSS approach described in Section V-B. In line 4, the variable k, which defines the number of random moves applied to a given solution X to generate a perturbed solution in the current neighborhood, is initialized. In line 5, the variable attempt, used to control the number of iterations using the same level of perturbations k without improvement in the current solution X, is initialized.
A neighborhood structure is chosen randomly (line 8), and then the perturbed solution X' is generated by the shaking procedure (line 9), which performs k moves on the solution X in the neighborhood structure N_l(·). The solution X' is subjected to the B-VND local search procedure, generating the solution X''. Next, the Relevance function verifies whether X'' is better than the current solution X. If an improvement is detected, X'' becomes the best solution found so far, and k is set to one. In lines 16 to 19, when no improvement is detected and attempt_max iterations have already occurred, the variable k is increased by 1 and attempt is restarted. GVNS-FSHC ends when the given total running time t_max expires.
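The following C++ skeleton (our illustration; the callback signatures are assumptions, not the actual implementation) captures the control flow of Algorithm 1:

#include <chrono>
#include <functional>
#include <random>
#include <set>

using Solution = std::set<int>;

// Skeleton of the GVNS-FSHC outer loop (Algorithm 1). The shake, bVND and
// relevance callbacks are assumed to implement Sections V-C, V-D2 and V-D1;
// only the control flow is shown here.
Solution gvnsFSHC(Solution X, int attemptMax, double tMaxSec,
                  const std::function<Solution(const Solution&, int, int)>& shake,
                  const std::function<Solution(const Solution&)>& bVND,
                  const std::function<bool(const Solution&, const Solution&)>& relevance,
                  std::mt19937& rng) {
    const auto start = std::chrono::steady_clock::now();
    auto elapsed = [&] {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - start).count();
    };
    std::uniform_int_distribution<int> pickL(1, 3);
    int k = 1, attempt = 0;
    while (elapsed() <= tMaxSec) {
        int l = pickL(rng);                 // random neighborhood N_l (line 8)
        Solution Xp = shake(X, l, k);       // k random moves in N_l (line 9)
        Solution Xpp = bVND(Xp);            // local search (line 10)
        if (relevance(Xpp, X)) {            // w-fold improvement test
            X = Xpp; k = 1; attempt = 0;
        } else if (++attempt > attemptMax) {
            ++k;                            // stronger perturbation level
            attempt = 0;
        }
    }
    return X;
}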

1) RELEVANCE FUNCTION
Algorithm 2 outlines the pseudocode of the Relevance function. It starts by measuring the average hF performance achieved using the 5-fold cross-validation procedure for each solution (lines 3 and 4). The solution X'' is considered better than X if the average hF(X'') is larger than hF(X) (line 6) and if at least w of the five fold-wise hF measures of X'' are greater than or equal to the corresponding measures of X (line 12). Thus, if both conditions are true, X'' is considered better than X.

Algorithm 2 Relevance(X'', X, w)
1: Input: solutions X'' and X, parameter w
2: Output: TRUE if X'' is better than X, FALSE otherwise
3: hF(X'') ← average of the fold-wise measures hF_i(X''), i = 1, ..., 5;
4: hF(X) ← average of the fold-wise measures hF_i(X), i = 1, ..., 5;
5: counter ← 0;
6: if hF(X'') > hF(X) then
7:   for i ← 1 to 5 do
8:     if hF_i(X'') ≥ hF_i(X) then
9:       counter ← counter + 1;
10:    end if
11:  end for
12:  if counter ≥ w then
13:    return TRUE;
14:  end if
15: end if
16: return FALSE;
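A compact C++ rendering of this test (an illustration of ours, assuming the five fold-wise hF values have already been computed):

#include <array>

// Sketch of the Relevance test (Algorithm 2) given the five fold-wise hF
// values of both solutions; the array representation is our illustration.
bool relevance(const std::array<double, 5>& hfNew,   // folds of X''
               const std::array<double, 5>& hfCur,   // folds of X
               int w) {
    double avgNew = 0.0, avgCur = 0.0;
    for (int i = 0; i < 5; ++i) {
        avgNew += hfNew[i];
        avgCur += hfCur[i];
    }
    if (avgNew <= avgCur) return false;      // average must strictly improve
    int counter = 0;
    for (int i = 0; i < 5; ++i)
        if (hfNew[i] >= hfCur[i]) ++counter; // fold-wise non-degradation
    return counter >= w;
}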

2) B-VND
The B-VND [21] is the local search used in the GVNS-FSHC algorithm (line 10 of Algorithm 1). Our B-VND approach uses the following sequence of neighborhoods, in this order: N_1, N_2, and N_3. We ordered these neighborhoods by their size, which is a common strategy in VNS-based algorithms, according to Hansen et al. [21]. Algorithm 3 presents the pseudocode of the B-VND procedure.

In Algorithm 3, X is the current solution subjected to the B-VND local search procedure, and RDrate is a percentage used to calculate the maximum number of iterations without improvement of the random descent improvement step. Furthermore, the inputs D, C, M, w, N_1, N_2, and N_3 are the same as defined in Algorithm 1.

Algorithm 3 B-VND
1: Input: X, D, C, M, w, N_1, N_2, N_3, RDrate
2: Output: locally optimal solution X
3: l ← 1;
4: while l ≤ 3 do
5:   X' ← X;
6:   RDmax ← RDrate × |X'| × |A \ X'|;
7:   iterRD ← 1;
8:   while iterRD ≤ RDmax do
9:     Randomly choose X'' ∈ N_l(X');
10:    if Relevance(X'', X', w) then
11:      iterRD ← 1;
12:      X' ← X'';
13:    end if
14:    iterRD ← iterRD + 1;
15:  end while
16:  if Relevance(X', X, w) then
17:    X ← X';
18:    l ← 1;
19:  else
20:    l ← l + 1;
21:  end if
22: end while
23: return X;

In line 3, l represents the current neighborhood structure used by the B-VND procedure. Initially, the maximum number of iterations without improvement (RDmax in line 6), used by the random descent improvement step (lines 8 to 15), is defined. Considering X the current solution and A the set of predictive features (Section IV), and denoting by |X| the number of elements of X, RDmax = RDrate × |X| × |A \ X|.
Our B-VND procedure has a random descent step (lines 8 to 15) in the same neighborhood and a step to change neighborhoods (lines 16 to 21). At the beginning of the B-VND procedure, the algorithm makes a copy X' of the current solution X (line 5). The random descent strategy starts by analyzing a neighbor X'' that belongs to the current neighborhood N_l(X') (line 9) and accepts it as the new current solution if it is strictly better than X' (line 10). Otherwise, X' remains unchanged, and the algorithm generates and analyzes another neighbor. The algorithm repeats this random procedure until there are RDmax iterations without improvement in the same neighborhood (line 8). Then, if the improved solution X' is better than X, X' becomes the new current solution, and the random descent search returns to the first neighborhood (lines 17 and 18); otherwise, the search continues in the next neighborhood (line 20). The B-VND ends when there is no improvement in any of the three neighborhoods.
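The following C++ sketch (our illustration; the callbacks are assumed to implement the pieces described above) mirrors the control flow of Algorithm 3:

#include <functional>
#include <set>

// Sketch of the B-VND local search (Algorithm 3) with illustrative
// signatures: `neighbor` draws one random neighbor in neighborhood l,
// `relevance` is the w-fold improvement test of Algorithm 2, and
// rdMax(X) computes RDrate * |X| * |A \ X| for the current solution.
std::set<int> bVND(std::set<int> X,
                   const std::function<std::set<int>(const std::set<int>&, int)>& neighbor,
                   const std::function<bool(const std::set<int>&, const std::set<int>&)>& relevance,
                   const std::function<int(const std::set<int>&)>& rdMax) {
    int l = 1;
    while (l <= 3) {
        std::set<int> Xp = X;                 // working copy (line 5)
        const int limit = rdMax(Xp);          // RDmax (line 6)
        int iterRD = 1;
        while (iterRD <= limit) {             // random descent (lines 8-15)
            std::set<int> Xpp = neighbor(Xp, l);
            if (relevance(Xpp, Xp)) {         // strictly better neighbor
                iterRD = 1;                   // reset the no-improvement counter
                Xp = Xpp;
            }
            ++iterRD;
        }
        if (relevance(Xp, X)) {               // neighborhood change step
            X = Xp;
            l = 1;                            // restart from N1
        } else {
            ++l;                              // try the next neighborhood
        }
    }
    return X;
}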
It is worth mentioning that the SU_H filter measure generates a feature ranking, which is used to direct the selection of features to swap, insert into, or remove from a candidate solution.
To do this, we apply the roulette wheel method, as in the survival selection phase of GAs. Therefore, the probability of inserting a feature into a candidate solution is higher if it has a higher ranking value. Similarly, features in a candidate solution set with low ranking values have a higher probability of being removed from the solution set.
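A minimal sketch of this biased choice (the inverse weighting used for removals is one plausible realization of ours, not necessarily the implemented one):

#include <algorithm>
#include <random>
#include <vector>

// Sketch of the ranking-biased roulette wheel. `su` holds the SU_H value
// of each candidate feature. Insertion favors high SU_H; for removal we
// invert the weights so that low SU_H features are removed more often.
int rouletteInsert(const std::vector<double>& su, std::mt19937& rng) {
    std::discrete_distribution<int> d(su.begin(), su.end());
    return d(rng);                     // index of the feature to insert
}

int rouletteRemove(const std::vector<double>& su, std::mt19937& rng) {
    const double maxSU = *std::max_element(su.begin(), su.end());
    std::vector<double> w(su.size());
    for (size_t i = 0; i < su.size(); ++i)
        w[i] = maxSU - su[i] + 1e-9;   // low SU_H -> high removal weight
    std::discrete_distribution<int> d(w.begin(), w.end());
    return d(rng);                     // index of the feature to remove
}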

VI. EXPERIMENTAL RESULTS
The GVNS-FSHC algorithm presented in Section V-D was implemented in C++ and compiled with g++ version 4.8.5. The experiments were performed on a computer with an Intel Xeon(R) CPU E5620 @ 2.40 GHz × 16, 48 GB of RAM, and the CentOS Linux 7 operating system. Although this computer's processor has more than one core, the algorithm was not optimized for multicore processing.
GVNS-FSHC is a preprocessing step designed specifically for global hierarchical classifiers. In this sense, computational experiments evaluate the efficacy of the proposed algorithm for feature selection in the hierarchical single-label classification context. We used the GMNB and CLUS-HMC hierarchical classifiers to evaluate the quality of the selected features. It is worth mentioning that the CLUS-HMC handles hierarchical multilabel problems, but it can also be used in the hierarchical single-label context. In the latter case, one needs only to consider single-label datasets as a particular case of multilabel classification in which the number of labels is equal to one.
Based on the hierarchical precision and hierarchical recall evaluation metrics described in Section V-A, we compared the proposed GVNS-FSHC algorithm to the following feature selection strategies: (i) ALL: we measured the performance of the classifier without any feature selection preprocessing step, i.e., using all features from the dataset. (ii) VNS-FSHC: the previous version of this approach, called variable neighborhood search for feature selection in hierarchical classification (VNS-FSHC) [19]. (iii) BF: we implemented a bottom-up wrapper-based version of the best first algorithm, a well-known heuristic search method [25]. We first ranked all the features in descending order according to the classifier's performance evaluation. Then, starting with a subset containing only the first feature of the ranking, the algorithm measures the quality of each candidate subset based on the classifier's performance and returns the best feature subset found by the heuristic search. As the stopping criterion, instead of evaluating all the feature subsets generated in the OPEN list, we used a predefined number of backtracking steps to candidate solutions in the OPEN list without improvement.
Section VI-A presents the dataset description and preprocessing steps. Section VI-B presents the parameter configuration. Section VI-C details the computational results of the proposed method using the GMNB and CLUS-HMC classifiers.

A. DATASET DESCRIPTION
The experiments use twelve public benchmark datasets with classes hierarchically organized in a tree structure, covering two domains, proteins and images. The protein domain is represented by bioinformatics datasets referring to the yeast genome [11].
The image datasets were selected from the ImageCLEF 2007 competition for annotating medical X-ray images. ImageCLEF aims to provide an evaluation forum for the cross-language annotation of medical radiological images [53].
These datasets were initially available as multilabel data. Since our method focuses on the single-label scenario, we performed a preprocessing step to convert the datasets into single-label data. Table 1 shows the general characteristics of the datasets: for each dataset, the second column corresponds to the dataset domain, the third column represents the total number of features, the fourth column represents the number of instances, and the fifth column represents the number of classes at each level of the tree hierarchy. Data preprocessing was conducted in four steps. In the first step, for each instance, we selected the most frequent class among the leaf nodes in the original dataset. In the second step, each missing value was replaced using the hierarchical supervised imputation method (HSIM) [54]. In the third step, every class with fewer than ten instances was merged with its parent class until all classes possessed at least ten instances. Finally, in the fourth step, we applied the unsupervised equal frequency binning [55] discretization method with 20 partitions to convert all continuous features into discrete values.
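As an illustration of the fourth step, a minimal C++ sketch (our own, with illustrative names) of equal-frequency binning:

#include <algorithm>
#include <vector>

// Sketch of equal-frequency binning with b partitions (the paper uses
// b = 20): cut points are placed so that each bin receives roughly the
// same number of training values.
std::vector<double> equalFrequencyCuts(std::vector<double> values, int b) {
    std::sort(values.begin(), values.end());
    std::vector<double> cuts;
    const size_t n = values.size();
    for (int i = 1; i < b; ++i)
        cuts.push_back(values[i * n / b]);   // boundary between bins i-1 and i
    return cuts;
}

// A continuous value v is then mapped to the discrete bin given by the
// number of cut points that do not exceed it.
int binOf(double v, const std::vector<double>& cuts) {
    return static_cast<int>(
        std::upper_bound(cuts.begin(), cuts.end(), v) - cuts.begin());
}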

B. PARAMETER SETTINGS
Our experiments and comparisons use the same 5-fold cross-validation setup for each dataset. The best feature subset for each algorithm was selected using the 5-fold cross-validation procedure within the training set.
The parameters of the VNS-FSHC are those used by Costa et al. [19], which were fixed at the following values: VNSmax = 0.1 × (number of features included in the initial solution) × (number of features excluded from the same solution), and RDmax = 0.1 × (number of features included in the current solution passed to the RandomDescent method) × (number of features excluded from the same solution).
Regarding the BF algorithm, we performed preliminary experiments varying the stopping criterion from {5, 10, 15} in all the datasets. Since we did not significantly improve the classifier's performance using the value 15 compared to 10, we fixed the stopping criterion as 10 in all datasets.
The parameter tuning of the GVNS-FSHC used the Irace package [56], an automatic algorithm configuration method. Table 2 shows the tuning setup, and we applied the 5-fold cross-validation procedure to the training set of the SPO dataset. Irace generated three configurations, presented in Table 3. Configuration 1 (w = 2, attempt_max = 4, and RDrate = 0.02) was chosen because it requires the lowest computational cost.

C. COMPUTATIONAL RESULTS
Subsection VI-C1 presents the computational results using the GMNB classifier, and Subsection VI-C2 shows the CLUS-HMC results.

1) GVNS-FSHC RESULTS WITH THE GMNB CLASSIFIER
Considering the stochastic nature of the VNS-based algorithms, each algorithm was applied 30 times to each dataset. To compare the GMNB performance using both the VNS-FSHC and the GVNS-FSHC algorithms, we first recorded the running time spent by each execution of the VNS-FSHC. Then, we executed the GVNS-FSHC with the same running time for a fairer comparison, considering the same dataset partition and seed for generating random numbers of those metaheuristic algorithms. As the BF heuristic is deterministic, it required only one execution for each dataset.
The results obtained in each dataset were compared by using two one-way hypothesis tests with a significance level of 0.05. To choose the most appropriate statistical test for each dataset result, we first verified whether the results were well modeled by a normal distribution by applying the Shapiro-Wilk test [57]. If the samples came from populations with normal distributions, we applied the ANOVA test [58], a parametric hypothesis test for comparing independent samples; otherwise, we applied the Kruskal-Wallis test [59], a nonparametric analysis of variance that can compare several independent samples.

Table 4 shows the hF results obtained by using all features, the GVNS-FSHC algorithm, and the two comparison algorithms. The second to fifth columns present the hF values achieved by the GMNB classifier using all the dataset features (second column) and the feature selection methods (remaining columns). In these columns, ''avg'' indicates the average result, and the standard deviation (sd) is in parentheses. Bold results show the best absolute value, and a result preceded by • indicates no statistically significant difference between that result and the GVNS-FSHC result.
The experiments showed that the GVNS-FSHC algorithm obtained the best absolute average for five datasets (CellCycle, Church, Phenotype, Expression, and ImageCLEF07D). Moreover, the GVNS-FSHC is better than at least one comparison strategy with statistical significance for four of these five datasets. For the remaining datasets (SPO, Gasch2, Eisen, Derisi, Gasch1, Sequence, and ImageCLEF07A), its performance was equivalent to the best result found, i.e., the difference was not statistically significant.
It is also important to mention that for the GMNB classifier, using a feature selection method improved the model's performance for most of the datasets (Church, SPO, Phenotype, Derisi, Gasch1, Expression, and ImageCLEF07D).

Table 5 shows the comparison results between the algorithms concerning the number of features used by the GMNB classifier. The second column presents the number of features used without feature selection, and the remaining columns present both the averages (avg) and standard deviations (sd) of the number of features used by the algorithms.
When we compare the results of Table 5 and Table 4, we observe that when the GVNS-FSHC does not achieve the best absolute hF performance (SPO, Gasch2, Eisen, Derisi, Gasch1, Sequence, and ImageCLEF07A), it selects fewer features than the strategy with the best absolute performance for four of these datasets. The exceptions are the Derisi, Gasch1, and ImageCLEF07A datasets, on which BF is the best strategy regarding both absolute performance and the number of selected features.
Ultimately, these results show that the GVNS-FSHC algorithm with the GMNB classifier is consistently better than or equivalent to the other comparison strategies regarding the hF measure.
Aiming to characterize and compare the running time behavior of the GVNS-FSHC algorithm to its previous version (the VNS-FSHC algorithm), we used the multiple time-to-target plot (mttt-plot) tool [60]. The mttt-plot is an extension of the time-to-target plot [61] to sets of multiple instances.
Runtime distributions (ttt-plots) display the probability that an algorithm will find a solution at least as good as a given target value for a given problem instance, on the ordinate axis, within a given running time, shown on the abscissa axis [60]. To build a ttt-plot, the algorithm A is run q times on the fixed instance I and stops as soon as it finds a solution whose objective function is at least as good as the given target value look4. After concluding the q independent runs, a cumulative distribution function (CDF) represents the solution times.
To build an mttt-plot, instead of one single instance and target value, p instances I_j and their corresponding targets look4_j are used, for j = 1, 2, ..., p. Let each S_j ≥ 0 be a continuous random variable representing the time taken by algorithm A to find a solution as good as the target value look4_j for instance I_j, and let F_{S_j}(s) = P(S_j ≤ s) be the cumulative distribution function of S_j. The mttt-plot is defined by a set of z points (α_k, F̂_{S_1+...+S_p}(α_k)), for k = 1, 2, ..., z with z ≫ q, where each α_k is a sample of S_1 + ... + S_p, and F̂_{S_1+...+S_p} is an estimator of F_{S_1+...+S_p}. To generate these z points, we sample z occurrences of the sum of independent variables S_1 + ... + S_p using the algorithm proposed by Reyes and Ribeiro [60].
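The following C++ sketch (our simplified reading of the sampling step; names are illustrative) generates the z sums from the recorded times-to-target:

#include <algorithm>
#include <random>
#include <vector>

// Sketch of how the z samples of S_1 + ... + S_p can be generated from the
// recorded times-to-target: draw one observed time per instance-target pair
// and sum them. This is our simplified reading of the procedure of Reyes
// and Ribeiro [60]; sorting the sums and plotting (alpha_k, k/z) yields the
// empirical CDF estimate used in the mttt-plot.
std::vector<double> sampleSums(
        const std::vector<std::vector<double>>& runTimes, // q times per pair
        int z, std::mt19937& rng) {
    std::vector<double> sums(z, 0.0);
    for (const auto& times : runTimes) {
        std::uniform_int_distribution<size_t> d(0, times.size() - 1);
        for (int k = 0; k < z; ++k)
            sums[k] += times[d(rng)];
    }
    std::sort(sums.begin(), sums.end());
    return sums;
}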
We considered one partition of each of five of the datasets as instances. For each instance, two target values were considered (a = the mean of 30 runs on the dataset, and b = a − 0.01 × a), making a total of p = 10 instance-target pairs. Each algorithm was run q = 20 times for each instance-target pair, until a solution at least as good as the corresponding target was found for each instance. Fig. 2 shows the mttt-plot resulting from the 10 individual ttt-plots using z = 2 × 10^4, for each algorithm. We observed that the GVNS-FSHC performs better on this set of 10 instance-target pairs. The GVNS-FSHC finds a target solution within 10^7.4 milliseconds approximately 70% of the time. In contrast, the VNS-FSHC needs more time (within 10^7.6 milliseconds) to do so in the same 70% of its runs. Furthermore, for a fixed processing time, the GVNS-FSHC is more likely to reach the target value than the VNS-FSHC. For example, at 10^7.4 milliseconds, the VNS-FSHC reaches the target value in only approximately 15% of the executions, while the proposed algorithm reaches the target value in approximately 75% of the executions.

Fig. 3 shows, for each dataset, the improvement trajectory over time of the run that generated the best result for the GVNS-FSHC. The figure shows that, on most datasets, the GVNS-FSHC algorithm achieves improvements before the VNS-FSHC algorithm.

2) GVNS-FSHC RESULTS WITH THE CLUS-HMC CLASSIFIER
To see whether our approach improved the performance of a classifier widely used in the literature, in this section we compare the CLUS-HMC [14] performance with and without the feature selection generated by the GVNS-FSHC and the BF algorithms. It is worth emphasizing that CLUS-HMC is a classifier based on decision trees; specifically, it embeds feature selection to optimize the objective function or performance of the learning model.

Table 6 shows the results of the GVNS-FSHC using the CLUS-HMC classifier, following the same notation as Table 4. The results show that the feature selection step using the GVNS-FSHC algorithm did not statistically significantly improve the performance of the CLUS-HMC classifier (except for the Church dataset), confirming the power of decision trees as natural feature selectors. However, the GVNS-FSHC algorithm also did not statistically significantly jeopardize the performance of the CLUS-HMC classifier.
Considering the number of features used by the CLUS-HMC classifier with and without the feature selection step, Table 7 shows a significant reduction in the number of features. Thus, this feature selection can still be compelling since it can improve the model interpretability without losing accuracy.

VII. CONCLUSION AND FUTURE WORK
In this paper, we presented a novel feature selection method tailored for global model hierarchical classifiers. We developed a hybrid filter-wrapper approach based on the VNS metaheuristic, the so-called GVNS-FSHC, which uses the SU_H measure in a filter step and the GMNB or the CLUS-HMC as the classifier of a wrapper step. We compared the GVNS-FSHC method with different feature selection strategies on twelve datasets from the protein and image domains.
The experimental results showed that the method using the GVNS-FSHC algorithm with the GMNB classifier achieved predictive performance that was consistently better than or equivalent to the other comparison strategies. Furthermore, the GVNS-FSHC reduced the number of features in all datasets without negatively impacting the classification accuracy.
We also observed that the predictive performance of the GVNS-FSHC is better than or equivalent to that of the VNS-FSHC algorithm. Moreover, when we considered the running time behavior, the GVNS-FSHC performed better than the VNS-FSHC since it achieved improvements earlier.
Concerning the CLUS-HMC classifier, the GVNS-FSHC feature selection method did not improve the classification performance, showing the power of decision trees as natural feature selectors. However, the GVNS-FSHC was able to select fewer features with no statistically significant difference in the performance results.
In future work, we intend to investigate and develop subset-based filter measures adapted to handle the class hierarchy. The goal is to incorporate such measures into a hybrid approach that runs the classifier less often in the wrapper phase of the feature selection, thereby reducing its computational costs.