Automatic Recommendation Method for Classifier Ensemble Structure Using Meta-Learning

Machine Learning (ML) is a field that aims to develop efficient techniques to provide intelligent decision-making solutions to complex real problems. Among the different ML structures, classifier ensembles have been successfully applied to several classification domains. A classifier ensemble is composed of a set of classifiers (specialists) organized in parallel, and it produces a combined decision for an input pattern (instance). Although classifier ensembles have proved to be robust in several applications, an important issue that is always brought to attention is the ensemble's structure. In other words, the correct definition of its structure, such as the number and type of classifiers and the aggregation method, plays an important role in its performance. Usually, an exhaustive testing and evaluation process is required to define the ideal structure for an ensemble. This paper proposes two new approaches for automatic recommendation of classifier ensemble structure, using meta-learning to recommend three of these important parameters: type of classifier, number of base classifiers, and aggregation method. The main aim is to provide a robust structure in a simple and fast way. In this analysis, five well-known classification algorithms are used as base classifiers of the ensemble: kNN (Nearest Neighbors), DT (Decision Tree), RF (Random Forest), NB (Naive Bayes) and LR (Logistic Regression). Additionally, the classifier ensembles are evaluated using seven different strategies as aggregation functions: HV (Hard Voting), SV (Soft Voting), LR (Logistic Regression), SVM (Support Vector Machine), NB (Naive Bayes), MLP (Multilayer Perceptron) and DT (Decision Tree). The empirical analysis shows that our approach can lead to robust classifier ensembles for the majority of the analysed cases.


I. INTRODUCTION
Classifier ensembles, also called ensembles, are a more elaborate Machine Learning structure that has been successfully applied in Pattern Recognition applications [1]-[3]. These systems are broadly used to solve a wide range of problems, such as face recognition [4], music classification [5], credit scoring [6], recommendation systems [7]-[9], software bug prediction [10], intruder detection [11], machine learning, pattern recognition, knowledge discovery [12], and many other problems found in the real world.
The associate editor coordinating the review of this manuscript and approving it for publication was Jad Nasreddine.
However, defining the best components of an ensemble's structure is a complex task, since its performance is directly related to the characteristics of a particular problem. Moreover, components such as neural networks may have different results depending on their structure or their initialization parameters. Thus, the best configuration of an ensemble varies according to its application [13].
One of the alternatives for automating this definition process is through meta-learning. In this context, the main goal of meta-learning is to understand the interaction between the learning mechanism and the concrete contexts in which it is applicable [14]. This can be achieved by applying Machine Learning techniques to build models that explain the relationship between learning strategies and problems from a particular perspective. Meta-learning explores accumulated knowledge about various tasks and their possible applications in finding solutions to problems that are similar to those that originated such knowledge.
In the literature, several studies that use meta-learning to recommend algorithms can be found, such as in [13], [15]- [17] and [18]. However, although there are several relevant studies, it is clear that there is a lack of investigation towards the possibility of using recommendation methods to assist in effectively defining the main parameters of an ensemble, notably the classifier type, the number of classifiers and the combination methods.
The main aim of this work is to propose an efficient system for automatic recommendation of the structure of classifier ensembles. In order to achieve this goal, meta-learning is used to define the best configuration of parameters for this structure. The concept of meta-learning will be applied in recommending three ensemble-specific project parameters: 1) Base classifier: the best set of base classifiers for an ensemble; 2) Ensemble size: the most appropriate number of base classifiers that will form the ensemble; and 3) Combination method: the best model for combining the results of the various base classifiers. To the best of our knowledge, there is no work that uses meta-learning to recommend a set of ensemble parameters. Usually, existing studies recommend only one ensemble parameter. In addition, very little has been done to recommend the aggregation function of a classifier ensemble.
In order to assess the feasibility of the proposed automatic recommendation system, an empirical analysis will be conducted, assessing the performance of the proposed system using 100 datasets. Additionally, a comparative analysis will also be performed, comparing the obtained results of the proposed system with some well-known ensemble-based systems.
This work is organized as follows: Section II presents the basic concepts of this paper, while Section III presents the state-of-the-art technologies and the main advances on the subject of ensemble recommendation systems. In Section IV, we propose a method that uses meta-learning to recommend the size, the type of classifiers and the aggregation model of an ensemble of classifiers. Section V describes the experimental methodology used in the empirical analysis, while the results are presented and analyzed in Section VI. Finally, the conclusion and perspectives for future research are presented in Section VII.

II. THEORETICAL REFERENCE
This section presents the main theoretical foundations used in the conception of this paper. The next two subsections explain classifier ensembles and meta-learning, respectively.

A. CLASSIFIER ENSEMBLE
It is well known that no single classifier can be considered optimal for all problem domains. Therefore, it is difficult to find a good single classifier which provides the best performance in practical pattern classification tasks [3], [12]. Figure 1 presents a general structure of an ensemble, which consists of a set of c individual classifiers (ICs) and an aggregation module (Comb). An input pattern $\{x_i \in \mathbb{R}^d \mid i = 1, 2, \ldots, n\}$ is presented to all individual classifiers, and an aggregation method combines their outputs to produce the overall output of the system, $O = \mathrm{Comb}(y_j)$, with $y_j = (y_{j1}, \ldots, y_{jk})$, $j = 1, \ldots, c$ and $k = 1, \ldots, l$, where c is the number of individual classifiers and l is the number of labels in a dataset. One important issue regarding the design of classifier ensembles involves the appropriate selection of the number and type of individual classifiers, and also of the combination function (aggregation method). The Machine Learning literature has established that diversity plays an important role in the design of ensembles, contributing to their accuracy and generalization [19]. The ideal situation would be a set of classifiers whose errors are uncorrelated, a property known as diversity. In this paper, the composition of an ensemble is defined by a meta-learning recommendation system, which is described in the next subsection.
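As an illustration of the aggregation module Comb, the following minimal sketch contrasts two simple combination rules, hard voting and soft voting, on made-up classifier outputs. The function names and the example values are assumptions introduced here for illustration only; they are not taken from the authors' implementation.

```python
from collections import Counter


def hard_voting(predicted_labels):
    """Combine c class labels (one per individual classifier) by majority vote."""
    counts = Counter(predicted_labels)
    # most_common(1) returns the label with the highest count; equal counts
    # are resolved by first occurrence, an arbitrary choice in this sketch.
    return counts.most_common(1)[0][0]


def soft_voting(probabilities):
    """Combine c probability vectors y_j = (y_j1, ..., y_jl) by averaging.

    Returns the index of the label with the highest mean probability.
    """
    c = len(probabilities)
    n_labels = len(probabilities[0])
    mean = [sum(y_j[k] for y_j in probabilities) / c for k in range(n_labels)]
    return max(range(n_labels), key=lambda k: mean[k])


# Hypothetical outputs of three individual classifiers (c = 3) for two labels (l = 2).
outputs = [
    [0.90, 0.10],
    [0.40, 0.60],
    [0.45, 0.55],
]
# Each classifier's crisp decision is its most probable label.
labels = [max(range(2), key=lambda k: y[k]) for y in outputs]

hv = hard_voting(labels)   # majority of crisp decisions
sv = soft_voting(outputs)  # label with the highest average probability
```

In this made-up case the two rules disagree: two of the three classifiers vote for the second label, but the first classifier is confident enough that the averaged probabilities favour the first label. This is one reason why the choice of aggregation method can matter.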

B. META-LEARNING RECOMMENDATION SYSTEM
Recently, meta-learning techniques have emerged as an efficient alternative for ensemble parameters recommendation [20], [21]. The idea of meta-learning as a recommendation system can be applied to individual classifiers or to ensembles. In the case of ensembles, the performance of an ensemble is related to a set of characteristics (meta-features) of the corresponding problem. Hence, it acquires knowledge based on the parameters of each configuration of the ensemble system. Then, this acquired knowledge is used in the design of an ensemble, when a new task is presented. According to [16], in general, the design of an ensemble recommendation system is composed of four main steps, which are:
1) Dataset characterization: in this step, the main meta-features need to be discovered, so that they can be applied to the meta-learner. One of the first studies to extract meta-features from a specific dataset was presented by the Statlog project [22].
2) Definition of evaluation metrics: here, the process of selecting the best algorithm is performed. In this case, it is necessary to apply evaluation measures in order to select the best model to solve a specific problem, taking into account the more satisfactory performance for the analyzed problem. In this step, several evaluation metrics can be employed to measure the effectiveness of the used algorithms. In this paper, we will use accuracy as the main evaluation metric.
3) Definition of the recommendation output: the third step is related to the final result that will be presented by the recommendation system. The authors in [16] suggest three techniques: (i) definition of the best algorithm; (ii) definition of a group of best algorithms; or (iii) a ranking of the best algorithms. In this paper, we select and recommend a group of best algorithms.
4) Development of the recommendation model: here, the goal is to learn an implicit mapping between meta-features and classes in the meta-label.
The set of available meta-instances is called metadata. In order to induce the mapping between input meta-features and meta-class, a machine learning algorithm, which is called meta-learner, is applied. Through it, it is possible to generate the recommendation of the ensemble structure. Initially, an ensemble size recommendation will be made. After that, the base classifier recommendation will be made, and finally the aggregation method will be defined.
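The mapping from meta-features to a recommended structure can be sketched as follows. This is a minimal illustration under stated assumptions: the meta-base contents are made up, and a 1-nearest-neighbour rule stands in for the meta-learner purely for brevity (it is not the meta-learner adopted later in the paper). The names `meta_base` and `recommend` are hypothetical.

```python
import math

# Each meta-instance pairs a dataset's meta-features with its meta-label
# (the best ensemble structure found for that dataset). Values are made up.
meta_base = [
    ((150.0, 4.0, 3.0), (10, "kNN", "SV")),
    ((1000.0, 20.0, 2.0), (30, "RF", "LR")),
    ((690.0, 14.0, 2.0), (20, "DT", "HV")),
]


def recommend(meta_features, meta_base):
    """Return the meta-label of the closest stored meta-instance (1-NN)."""

    def distance(a, b):
        # Euclidean distance between two meta-feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = min(meta_base, key=lambda item: distance(item[0], meta_features))
    return nearest[1]


# Meta-features of a new, unseen dataset (hypothetical values).
size, classifier, aggregator = recommend((700.0, 15.0, 2.0), meta_base)
```

In practice the meta-features would normally be scaled before computing distances, since raw counts such as the number of instances would otherwise dominate.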

III. RELATED WORK
Classifier ensembles have proved to be an efficient pattern recognition structure that has been applied to different applications [23], [24]. Despite the large body of research on classifier ensembles, finding an optimal parameter set that maximizes the classification accuracy of an ensemble is still an open problem. The search space for all parameters of an ensemble system (type of classifier, size, classifier parameters, combination method and feature selection) is very large, and the definition of the optimal parameter set is a hard research challenge.
A considerable amount of meta-learning research has been devoted to the area of algorithm recommendation. In this special case of meta-learning, the aspect of interest is the relationship between data characteristics and algorithm performance, with the final goal of predicting an algorithm or a set of algorithms suitable for a specific problem under study. This application of meta-learning can be both useful for providing a recommendation to an end-user and for automatically selecting or weighting algorithms that are most promising to a specific problem.
The first work to present an abstract model for meta-learning is [33], whereas the authors in [34] developed rules based on simple meta-features that only determine whether a certain algorithm should be used for a problem instance or not. This approach was later extended by using more features and a Decision Tree learner based on the StatLog project [35].
The authors in [36] presented a new strategy to recommend algorithms by using the K-NN algorithm to identify the most similar historical datasets. In [37], the authors tried to find functions that map datasets to algorithm performance. In [38], problem complexity measures were used to characterize the datasets, and the relation between these measures and the performance of classification algorithms was analyzed. In [39], the authors proposed a rule-based classifier selection approach based on the technique proposed in [40], [41]. In that work, not only the algorithms themselves were recommended, but also different parameter settings that naturally lead to performance variation of the same algorithm on different datasets.
Among more recent studies, we can cite [42], in which a symbolic recommendation model of the best Feature Selection algorithm was constructed. In addition, a lazy method was presented in [43] for the recommendation of Feature Selection algorithms, while a meta-learning framework was developed in [44] to learn which Feature Selection algorithms are more suitable for a given dataset. The authors in [45] used five different categories of state-of-the-art meta-features to characterize datasets, and built a different regression model to connect datasets to each candidate algorithm. In [46], a new approach for meta-feature engineering was introduced. Finally, in [47], the authors developed a method that, based on a ranking list, determines which aggregation algorithms are best for that list.
In summary, the majority of the studies use meta-learning to recommend the best algorithm and/or parameter for a single classifier. To the best of our knowledge, there is no work that recommends the whole ensemble structure (classifier type, size and aggregation function) using meta-learning.

IV. ENSEMBLE RECOMMENDATION SYSTEMS
In this section, we present the two approaches for the Ensemble Recommendation Method (ERM) proposed in this paper. The general architecture and operations are presented, showing the main steps involved in recommending the best topology of an ensemble using meta-learning.
The two proposed approaches for ERM are named as follows: 1) ERM-ML - Ensemble Recommendation Method - Using Meta-learning; and 2) ERM-3ML - Ensemble Recommendation Method - Using 3 steps Meta-learning.

A. RECOMMENDATION OF ENSEMBLE PARAMETERS
It is well known that different ensemble configurations lead to different performance results, and the use of meta-learning may be an excellent option to help in selecting the best ensemble parameter set for a specific problem. Since the definition of the ensemble structure is a crucial step in its project, the proposal of this work is based on the development of two models that recommend this structure using meta-learning techniques. In general terms, the flowchart of the proposal of this work is shown in Figure 2. In this flowchart two very distinct phases are highlighted:
1) Training: This step is responsible for building the Meta-base. The more datasets are evaluated, the greater the expectation of performance and generalization of the recommendation system. This Meta-base will have the meta-features of all used datasets as well as the recommendation of the best structure for each instance (meta-features of a problem);
2) Generalization: This step is responsible for creating, training and applying the meta-learner in real-world problems, providing the best ensemble structure produced by the proposed models. This step will extract the meta-features from new datasets (instances), and use a simple classification model (meta-learner) to recommend the best ensemble structure.
In the training and generalization steps, the built meta-feature database will have a structure similar to the one shown in Figure 3. Once the Meta-base is assembled, the original datasets will no longer be required, since all processing will be done only on the Meta-base, considerably reducing the computational effort of the proposed model.
In the generalization step, the meta-learner will be used to provide the recommendation of the best structure (number of classifiers, type of classifier and aggregation model) for the proposed ensemble.

B. ERM-ML - ENSEMBLE RECOMMENDATION METHOD - USING META-LEARNING
The ERM-ML method was first proposed in [48]. As can be seen in Figure 4, the basic idea of this method is to initially train several algorithms using different ensemble sizes and aggregation methods. Based on this result, it is possible to obtain the best ensemble structure for a given classification dataset. Subsequently, the obtained meta-features are used to train a classifier (meta-learner) that will be tested on unseen classification problem data.
In order to better understand this proposed method, suppose that DS is a dataset consisting of N instances and $A = \{att_1, att_2, \ldots, att_d\}$ attributes, where d is the total number of attributes of DS. The last attribute of this dataset is the class label of the instance. The instances will be divided into two sets: training $TR = \{tr_1, tr_2, \ldots, tr_{nt}\}$ and validation $V = \{v_1, v_2, \ldots, v_{nv}\}$, where nt and nv represent the sizes of the training and validation sets, respectively. Algorithm 1 presents the main steps employed by the proposed method to create the Meta-base.
Algorithm 1 Algorithm to Create ERM-ML Meta-Base
Input: dataset DS.
1: Open Meta-base file: MB
2: poolSize ← Vector with ensemble sizes
3: poolClassifier ← Vector with classification models
4: poolAggregator ← Vector with aggregation functions
5: DCT ← dataset_characterization_tool(DS)
6: for each size in poolSize do
7: for each model in poolClassifier do
8: for each aggregator in poolAggregator do
9: ...

The steps of the proposed method in Algorithm 1 must be executed for each dataset, and they can be described as follows:
1) Lines 1 to 5 set a dataset DS to be trained and define the possible recommendations for the main parameters of the ensemble structure: poolSize (number of classifiers); poolClassifier (classification models); and poolAggregator (aggregation functions). Additionally, by using the DCT tool, the desired characteristics of the dataset DS are extracted.
2) In the nested loop, from lines 6 to 16, a classifier ensemble is applied, varying its size, classifier types and aggregation function. Then, the selected model is trained with the defined training dataset, and validated with the validation dataset. This process is applied for all possible combinations of size, classifier type and aggregation function. Each accuracy result, for each combination, is stored in the Result vector. The current size, classifier and aggregation function are also stored in vectors S, C and A, respectively;
3) After the evaluation loop (from line 17 onward), the best topology is selected using the SelectBest procedure. In this procedure, one parameter is fixed and we calculate the average accuracy for each value of this parameter using all possibilities of the remaining parameters. Then, the value with the highest average accuracy is selected.
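The SelectBest procedure described in step 3 can be sketched as a marginal average over the Result vector. The sketch below is only an interpretation of that description: the dictionary layout, the function name `select_best` and the accuracy values are assumptions made for illustration.

```python
from collections import defaultdict


def select_best(results, position):
    """Pick the value of one parameter with the highest average accuracy.

    results maps (size, classifier, aggregator) -> validation accuracy;
    position is 0 for size, 1 for classifier and 2 for aggregator.
    """
    grouped = defaultdict(list)
    for config, accuracy in results.items():
        # Fix one parameter and pool the accuracies over the other two.
        grouped[config[position]].append(accuracy)
    averages = {value: sum(accs) / len(accs) for value, accs in grouped.items()}
    return max(averages, key=averages.get)


# Made-up validation accuracies for a small 2 x 2 x 2 grid.
results = {
    (5, "DT", "HV"): 0.80, (5, "DT", "SV"): 0.82,
    (5, "NB", "HV"): 0.70, (5, "NB", "SV"): 0.74,
    (10, "DT", "HV"): 0.85, (10, "DT", "SV"): 0.83,
    (10, "NB", "HV"): 0.72, (10, "NB", "SV"): 0.78,
}

# Each parameter is selected independently of the other two.
best = tuple(select_best(results, position) for position in range(3))
```

Note that, because each parameter is chosen from its own average, the selected triple need not coincide with the single configuration that has the highest accuracy. In the made-up grid above, the highest individual accuracy belongs to (10, "DT", "HV"), while the averages favour the SV aggregator.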
For instance, in order to define the best classification model, we calculate the average accuracy for each classification model over all possibilities of ensemble sizes and aggregation functions. We apply the same idea to select the best ensemble size and aggregation function, independently of the already selected parameters.

The need to analyze the behavior of error propagation during the recommendation of each of the three parameters inspired the development of the ERM-3ML method. Figure 5 shows the block diagram of the recommendation proposal of the ERM-3ML (Ensemble Recommendation Method - Using 3 steps Meta-learning). The basic idea of this approach is to train meta-classifiers sequentially: first for the ensemble size, then for the classification algorithm and, finally, for the aggregation function. Then, from these training steps, it is possible to obtain the best ensemble structure for a given dataset. Therefore, the main difference between ERM-3ML and ERM-ML is that ERM-ML selects each ensemble parameter in an independent and parallel way, while ERM-3ML performs a serial recommendation. In order to do this, ERM-3ML creates three datasets, in a serial way, in which the output of one meta-classifier is used as an input attribute for the following meta-classifier. In order to better understand this proposed version, suppose DS is a dataset composed of N instances and $A = \{att_1, att_2, \ldots, att_d\}$ attributes, in which d is the total number of attributes of DS. The instances will be divided into two sets: training $TR = \{tr_1, tr_2, \ldots, tr_{nt}\}$ and validation $V = \{v_1, v_2, \ldots, v_{nv}\}$, where nt and nv represent the sizes of the training and validation sets, respectively. Algorithm 2 presents the main steps used by the proposed method, in the Training and Validation phase.
Algorithm 2 (excerpt)
9: for each model in poolClassifier do
10: for each aggregator in poolAggregator do
11: ...

The steps of the proposed method in Algorithm 2 must be executed for each new dataset, and can be described as:
1) Lines 1 to 7 set a dataset DS to be trained, and they define the possible recommendations for the main parameters of the ensemble structure: poolSize (number of classifiers); poolClassifier (classification models); and poolAggregator (aggregation functions). In addition, by using the DCT tool, we extract the desired characteristics from the dataset DS.
2) In the first nested loop, from lines 8 to 16, a classifier ensemble is evaluated, sequentially, varying size, classifier and aggregation functions. This process is [...] classifier, using the best number and types of classifiers are applied, varying the aggregation function. Then, in line 33, we select the third parameter, applying a function to select the aggregation function with the best accuracy value present in the vector Result, and store it in the aggregator Meta-base file MBA.
Meta-bases MBA, MBC and MBS will be available for use with new test instances in the Application phase. The main difference between both algorithms is that the former creates only one meta-base with the recommendation of all three parameters, while the latter creates three meta-bases, in an incremental way, one for each parameter.
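The serial recommendation performed by ERM-3ML in the Application phase can be sketched as a chain of three meta-classifiers, where each prediction is appended to the input of the next one. The three rule-based functions below are hypothetical stand-ins for trained meta-classifiers, and all thresholds and values are made up; only the chaining pattern reflects the description above.

```python
def recommend_serial(meta_features, size_model, classifier_model, aggregator_model):
    """Chain three meta-classifiers; each prediction becomes an extra input."""
    x = list(meta_features)
    size = size_model(x)
    classifier = classifier_model(x + [size])
    aggregator = aggregator_model(x + [size, classifier])
    return size, classifier, aggregator


# Hypothetical stand-ins for the three trained meta-classifiers.
def size_model(x):
    # Uses only the dataset meta-features.
    return 10 if x[0] < 500 else 30


def classifier_model(x):
    # The last input attribute is the size recommended in the first stage.
    return "kNN" if x[-1] <= 10 else "RF"


def aggregator_model(x):
    # The last input attribute is the classifier recommended in the second stage.
    return "SV" if x[-1] == "kNN" else "LR"


small = recommend_serial([150, 4, 3], size_model, classifier_model, aggregator_model)
large = recommend_serial([1000, 20, 2], size_model, classifier_model, aggregator_model)
```

The sketch also makes the error-propagation concern visible: if the first stage returns an unsuitable size, the second and third stages receive that value as an input and may be misled by it.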

V. EXPERIMENTAL PROTOCOL
In order to evaluate the design of our recommendation framework, we present an empirical comparison using 8 state-of-the-art techniques and our two recommendation methods under the same experimental protocol. All algorithms used in this work were implemented in the Python programming language (sklearn package). In this section, we provide a complete description of the experimental analysis of this paper.

A. DATASETS
This experimental analysis is performed using a test bed composed of 100 datasets, of which 50 datasets were taken from the UCI machine learning repository [49], from the Statlog project [35] and from the Knowledge Extraction based on Evolutionary Learning (KEEL) repository [50]. The remaining 50 datasets were artificially generated from other datasets, using the SMOTE function (SMOTE -C 0 -K 5 -P 100 -S 1), available in the WEKA software. The key characteristics of such datasets can be found in Table 1.
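For readers unfamiliar with SMOTE, the following minimal sketch shows the core idea of generating synthetic samples by interpolating between a sample and one of its nearest neighbours of the same class. It is not WEKA's implementation. The default arguments mirror the options quoted above under the assumption that -K is the number of nearest neighbours, -P the percentage of synthetic samples to create and -S the random seed; the function name `smote` and the example data are hypothetical.

```python
import random


def smote(minority, k=5, percentage=100, seed=1):
    """Generate synthetic minority samples by interpolating between neighbours.

    percentage=100 creates one synthetic sample per original minority sample.
    Assumes at least two minority samples and purely numeric attributes.
    """
    rng = random.Random(seed)
    per_sample = percentage // 100
    synthetic = []
    for i, x in enumerate(minority):
        # Indices of the other minority samples, sorted by squared distance to x.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(x, minority[j])))
        neighbours = others[:k]
        for _ in range(per_sample):
            z = minority[rng.choice(neighbours)]
            gap = rng.random()  # position along the segment from x to z
            synthetic.append([a + gap * (b - a) for a, b in zip(x, z)])
    return synthetic


# Four made-up minority samples at the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, k=2)
```

Because every synthetic point lies on a segment between two existing samples, the generated data stay within the region already covered by the original class, which is consistent with describing such datasets as manipulations of the originals.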

B. COMPARISON METHODS
Given that the proposed methods are characterized as ensemble systems, in order to assess their effectiveness, they will be compared with the most relevant Dynamic Selection algorithms presented by [51], which are listed in Table 2. Among the selected methods, three different approaches are used: Dynamic Classifier Selection (DCS), Dynamic Ensemble Selection (DES) and Topology Recommendation System (TRS).
The obtained results of all analysed methods will be evaluated using the Friedman statistical test [58]. In cases where a statistically significant difference is detected, the Nemenyi post-hoc test is applied [58]. In order to present the obtained results by the post-hoc test, the critical difference diagram is used. This diagram was selected in order to have a visual illustration of the statistical test, making it easier to interpret the obtained results.
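The quantities behind this comparison can be sketched as follows: average ranks per method, the Friedman chi-square statistic computed from those ranks, and the Nemenyi critical difference used in the diagram. This is a minimal sketch using the standard textbook formulas; the function names are hypothetical, the accuracy table is made up, and the critical value 3.164 is an assumption (a commonly tabulated value for ten methods at a 0.05 significance level, which is not stated in this section).

```python
import math


def average_ranks(accuracies):
    """accuracies[i][j] is the accuracy of method j on dataset i.

    Rank 1 is the best method; tied methods share the mean of their ranks.
    """
    n, k = len(accuracies), len(accuracies[0])
    totals = [0.0] * k
    for row in accuracies:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(ranks, n):
    """Friedman chi-square statistic from k average ranks over n datasets."""
    k = len(ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n, q_alpha):
    """Critical difference between average ranks for the Nemenyi test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Made-up accuracies: 4 datasets (rows) x 3 methods (columns).
accuracies = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.75],
    [0.70, 0.90, 0.60],
    [0.80, 0.80, 0.50],
]
ranks = average_ranks(accuracies)
chi2 = friedman_statistic(ranks, n=len(accuracies))
cd = nemenyi_cd(k=10, n=100, q_alpha=3.164)  # q_alpha is an assumed value
```

Under that assumed critical value, ten methods compared over 100 datasets give a critical difference of roughly 1.35 in average rank: two methods whose average ranks differ by less than this would be connected in the critical difference diagram.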

C. EXPERIMENTAL METHODOLOGY
The experimental methodology used in this empirical analysis (Figure 6) is divided into four phases: (1) data characterization phase; (2) meta-base label definition phase; (3) meta-learner training phase; and (4) meta-learner evaluation phase. The first two phases are related to the creation of the meta-base, whereas the remaining two phases are related to the training and evaluation of the meta-learner.

1) META-BASE DATA CHARACTERIZATION PHASE
In order to create a meta-base, its attributes have to be defined, and this is done in the data characterization phase. Before this phase, the original datasets go through a pre-processing phase, in which missing values are filled and all numeric attributes are normalized.
According to [13], the measures that characterize the databases must contain relevant information to determine the relative performance among classification algorithms, and present low computational cost. Currently, dataset characterization research focuses on three main aspects [59]: direct characterization, characterization based on landmarking, and characterization via models. Here we decided to adopt direct characterization, based on the Statlog project [22]. More recently, the METAL project has been proposed, aiming at developing tools to assist the user in selecting an appropriate combination of pre-processing, classification and regression techniques. Table 3 displays all 25 meta-features considered in this paper, extracted by direct characterization of the datasets, using the DCT (Data Characterization Tool), proposed by the METAL project.
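To give a flavour of direct characterization, the sketch below computes a handful of simple, cheap descriptors directly from a dataset. These four measures are chosen here only as typical examples of this family; they are not claimed to be the 25 meta-features of Table 3, and the function is not the DCT tool. The name `simple_meta_features` and the toy data are hypothetical.

```python
import math
from collections import Counter


def simple_meta_features(rows, labels):
    """Compute a few illustrative direct-characterization measures."""
    n_instances = len(rows)
    n_attributes = len(rows[0])
    counts = Counter(labels)
    # Shannon entropy of the class distribution, in bits.
    class_entropy = -sum(
        (c / n_instances) * math.log2(c / n_instances) for c in counts.values()
    )
    return {
        "n_instances": n_instances,
        "n_attributes": n_attributes,
        "n_classes": len(counts),
        "class_entropy": class_entropy,
    }


# A made-up dataset with four instances, two attributes and two balanced classes.
rows = [[0.1, 1.0], [0.2, 0.9], [0.8, 0.1], [0.9, 0.2]]
labels = ["a", "a", "b", "b"]
mf = simple_meta_features(rows, labels)
```

Measures of this kind need a single pass over the data, which is in line with the low computational cost requirement mentioned above.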

2) META-BASE LABEL DEFINITION PHASE
Once the attributes of the meta-base are defined, the next step is the definition of the labels for the instances of this meta-base. In order to do that, a brute force approach is performed, in which each instance (classification problem) will be submitted to a set of possible combinations of type of classifiers, number of classifiers and type of aggregation functions. After this, the best ensemble configuration is selected and put as label of the corresponding instance.
In relation to the number of classifiers in an ensemble, the values vary as follows: PoolSize = [2, 5, 8, 10, 12, 15, 18, 20, 22, 25, 28, 30, 32, 35, 38, 40, 42, 45, 48, 50]. Moreover, the PoolClassifier vector will contain five well-known classification algorithms, which are: k-NN (Nearest Neighbors), DT (Decision Tree), RF (Random Forest), NB (Naive Bayes) and LR (Logistic Regression). These classification algorithms were selected due to the different learning criteria that they provide. In addition, they have been widely used in many application domains. These algorithms are trained in a Bagging-based procedure in a 10-fold cross-validation process. It is important to emphasize that the training process is performed using the original classification dataset. Additionally, the classifier ensembles use up to seven different strategies as aggregation functions: HV (Hard Voting), SV (Soft Voting), LR (Logistic Regression), SVM (Support Vector Machine), NB (Naive Bayes), MLP (Multilayer Perceptron) and DT (Decision Tree).
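The size of the brute-force search space follows directly from these three vectors. The short sketch below enumerates it; the variable names are hypothetical and the count assumes that all seven aggregation functions are applicable to every configuration (the text says "up to seven", so this is an upper bound).

```python
from itertools import product

pool_size = [2, 5, 8, 10, 12, 15, 18, 20, 22, 25,
             28, 30, 32, 35, 38, 40, 42, 45, 48, 50]
pool_classifier = ["kNN", "DT", "RF", "NB", "LR"]
pool_aggregator = ["HV", "SV", "LR", "SVM", "NB", "MLP", "DT"]

# Every (size, classifier, aggregator) combination evaluated per dataset.
grid = list(product(pool_size, pool_classifier, pool_aggregator))
n_configurations = len(grid)  # 20 sizes x 5 classifiers x 7 aggregators
```

This gives at most 700 ensemble configurations per dataset, which illustrates why the labelling of the meta-base is the expensive part of the process and why it is done only once.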
As five classification models are used in the definition of the meta-base label, the base classifier is selected using the following procedure.
1) For each classification model, several configurations (hyper-parameters) are assessed and the average accuracy is calculated.
2) The classification model with the highest average accuracy is selected.
The recommended ensemble has a homogeneous structure and the configurations with the highest accuracies are selected to be part of the ensemble.
As all the aforementioned classifiers have hyper-parameters, for each base classifier, a hyper-parameter value is randomly selected from a pre-defined interval. Additionally, we perform 10 executions for each ensemble configuration (size, classifier and aggregation function). For instance, in an ensemble composed of 10 k-NN classifiers, the first classifier is selected by choosing k from the interval [2, 20]. The same procedure is performed to select the k-NN hyper-parameter of the following 9 base classifiers. Then, this ensemble is tested using all seven aggregation functions cited above. In case of a tie, the selection of the best model is based on the first best overall accuracy, which means the smallest possible ensemble.
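The k-NN example and the tie-breaking rule can be sketched as below. This is a minimal illustration under assumptions: the interval [2, 20] is treated as closed and integer-valued, the function names are hypothetical, and the accuracy values are made up.

```python
import random


def sample_knn_ensemble(size, rng, low=2, high=20):
    """Draw one value of k per base classifier from the closed interval [low, high]."""
    return [rng.randint(low, high) for _ in range(size)]


def pick_best(results):
    """results is a list of (ensemble_size, accuracy) pairs.

    The highest accuracy wins; among equal accuracies, the smallest
    ensemble is kept.
    """
    return min(results, key=lambda r: (-r[1], r[0]))


rng = random.Random(0)
ks = sample_knn_ensemble(10, rng)  # hyper-parameters of 10 k-NN base classifiers

# Two configurations tie at 0.91; the smaller ensemble (size 10) is preferred.
best = pick_best([(20, 0.91), (5, 0.88), (10, 0.91)])
```

Drawing a different k for each member is also a simple way of injecting diversity into an otherwise homogeneous ensemble.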
ERM-ML recommends the best size/classifier/aggregation parameter set in an independent way. Each parameter value is defined by the average of the best results obtained over all possibilities of the other parameters. For each instance (classification dataset), the best accuracy records will have the ensemble size, classification model and type of aggregation function stored together with the characterization of the dataset in the Meta-base. At the end of the training, we will have as many instances as the number of datasets. Each instance will have x attributes, the first ones referring to the characteristics of the database, and the last representing the class with the best structure (size, classifier and aggregation). Unlike ERM-ML, ERM-3ML will recommend all three parameters separately: size, classifier and aggregation function. In its first stage, ERM-3ML recommends the best size for the given data profile. In the second stage, it uses the previous size information to recommend the best classifier for that particular size. Finally, with the previous size and classifier, it recommends the best aggregation function. In other words, the output of one meta-classifier is used as an input attribute for the following meta-classifier. In this method, we will have three datasets, one for each parameter, and this phase defines the labels of all three datasets.
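The difference between the single ERM-ML meta-instance and the three ERM-3ML meta-instances can be sketched as follows. The row layout is an assumption made for illustration: in particular, encoding the ERM-ML class as one combined string and naming the three meta-bases MBS, MBC and MBA (size, classifier, aggregator) are choices of this sketch, and all values are made up.

```python
def erm_ml_row(meta_features, best):
    """One meta-instance whose class encodes the whole structure."""
    size, classifier, aggregator = best
    return {"x": list(meta_features), "y": f"{size}|{classifier}|{aggregator}"}


def erm_3ml_rows(meta_features, best):
    """Three meta-instances; earlier labels become later input attributes."""
    size, classifier, aggregator = best
    x = list(meta_features)
    return {
        "MBS": {"x": x, "y": size},
        "MBC": {"x": x + [size], "y": classifier},
        "MBA": {"x": x + [size, classifier], "y": aggregator},
    }


# Hypothetical meta-features and best structure for one dataset.
features = [150, 4, 3]
best_structure = (10, "kNN", "SV")

single = erm_ml_row(features, best_structure)
serial = erm_3ml_rows(features, best_structure)
```

One practical consequence of the combined encoding is that the ERM-ML meta-learner faces a class space that grows with the number of size, classifier and aggregator combinations, whereas each ERM-3ML meta-learner predicts a single parameter.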

3) META-LEARNER TRAINING PHASE
Once the Meta-base is built, the next step is the creation of the meta-learner, which needs to be defined, trained and tested. For the selection of the meta-learner model, an initial investigation was performed, in which the performance of four well-known classification algorithms was assessed: k-NN, SVM, MLP and C4.5. In this evaluation, SVM provided the best overall performance, and it was selected as the meta-learner for both approaches.
As already mentioned, the meta-learner has been implemented using the Python programming language (sklearn package). The hyper-parameters were defined for each approach and are the following ones. For training the meta-learner, a 10-fold cross-validation method is applied.

4) META-LEARNER EVALUATION PHASE
The last phase is the evaluation of the meta-learner. As already mentioned, in the case of ERM-ML, only one classifier will be used, which will be responsible for recommending, in a single step, the best size, classifier and aggregation function. On the other hand, in ERM-3ML, three classifiers are used, and each classifier will be responsible for recommending one of the three parameters of the topology at a time: size, classifier and aggregation function.
In order to evaluate the meta-learners, the methods trained in the previous phase are assessed. They can be assessed in two different ways: the accuracy of the meta-learner or the efficiency of the recommended ensemble. In this paper, we evaluate the meta-learner in the second way, by the efficiency of the recommended ensemble. In this sense, once the meta-learner defines the output of a testing instance, the recommended ensemble is created, trained and assessed (using the original dataset) in order to analyse its performance in the corresponding classification problem. The results of this analysis are presented in the next section.

VI. RESULTS AND DISCUSSION
In this section, the obtained results of the reference methods and the two proposed ones will be presented and analyzed, aiming to bring an interesting discussion on the overall results.
As mentioned previously, the meta-learner recommends the optimal ensemble structure (type and number of base classifiers and aggregation method) for a classification problem (testing instance). Then, the recommended ensemble is created, trained and assessed. Values presented in this section represent the accuracy levels provided by the recommended ensembles. For the reference methods, as we are dealing with a different classification problem, the whole process has to be done according to the algorithms defined by these methods.

A. PERFORMANCE OF ALL ANALYZED MODELS
The accuracy results of all ten analyzed methods are summarized in Tables 4 and 5. The results in these tables represent the accuracy level of the recommended ensemble structure for the corresponding dataset. For the ERM-ML and ERM-3ML models, a meta-classifier recommends the best ensemble configuration for each test instance (classification problem). Then, the recommended ensemble is trained on the original dataset using 10-fold cross-validation, and the values reported in Tables 4 and 5 are the accuracy of the recommended ensemble over the original dataset. For the ensemble-based methods, the ensemble systems delivered by these methods are also assessed in a 10-fold cross-validation procedure, and the presented results are likewise the obtained accuracy levels. In these tables, each column represents one analyzed method, and the best accuracy result (highest value) for each dataset is highlighted in bold. In order to provide a concise analysis, the final row summarizes the number of times each method delivered the best accuracy result.
As can be seen in the two previous tables, the results, which are summarized in Table 6, are very promising. It is worth noting that the results of our proposed methods are superior to the others for the majority of datasets. The best overall result is obtained by the ERM-3ML model, which achieved the best accuracy in 64 databases (35 bases in …). The best result of an existing method was obtained by META in 15 databases (3 + 12), which is still far below the results obtained by our proposed methods.
[Table caption, partially recovered: … [54], Overall Local Accuracy (OLA) [53], DES Performance (DESP) [56], K-Nearest Oracles Union (KNU) [55], Local class accuracy (LCA) [53], Classifier Rank (rank) [52], META-DES (META) [57], and our two methods: Ensemble Recommendation Method using Meta-Learning (ERMML) and Ensemble Recommendation Method using 3 steps Meta-Learning (ERM3ML).]
When analysing the best performance delivered by both proposed methods, we can observe that they provided better performance with the real datasets (Table 4), while the existing ensemble methods delivered better performance with the artificial datasets (Table 5). This is a promising result for the proposed methods, since the real datasets properly represent the real information of a classification problem, while the artificial datasets represent artificial manipulations of the original datasets.
In summary, we can observe that, in general, the performance of our proposed methods was superior to that of the existing methods (i.e. those well known in the literature), showing that the use of meta-learning for the recommendation of the best ensemble structure can lead to robust classifier ensembles. Of the proposed methods, the sequential definition proposed in the ERM-3ML model seems to lead to more robust classifier ensembles than the ERM-ML model. We believe that this is due to the error propagation that occurs when we recommend all parameters of the ensemble structure in a single step (ERM-ML). In other words, the sequential recommendation is more appropriate to define the optimal ensemble structure.

B. ANALYSIS OF ALL INVESTIGATED MODELS
Tables 4 and 5 present the accuracy results for each dataset individually. Based solely on these tables, it is not possible to observe the general performance of the analysed methods. Therefore, Figure 7 illustrates the boxplot of the accuracy results obtained by all analyzed methods. Based on this boxplot, which was built from the data presented in Tables 4 and 5, it can be observed that our two proposed methods (the two rightmost boxes) present the best performance among all methods evaluated.
The proposed methods present the highest median values, and their lower quartiles are also superior to those of the other methods.
On top of that, their inter-quartile intervals are larger than the others, meaning a better performance of the proposed methods. It is also important to highlight that there is a slightly higher presence of outliers in the ERM-ML model, which may characterize a certain instability of this model in relation to the others. The results shown in Figure 7 corroborate the results shown in Tables 4 and 5.
As mentioned previously, the Friedman statistical test and the Nemenyi post-hoc test are also applied in order to analyze the obtained results from a statistical point of view. The Friedman test was applied to the performance of all ten methods and resulted in: Friedman test = 245.08, df = 9, p-value $< 2.2 \times 10^{-16}$. It is important to emphasize that the Friedman test is applied directly to the accuracy values of all analyzed methods. In analyzing the Friedman test, we observed that the difference in performance among the methods was statistically significant. This difference was detected by the Friedman test since p-value < 0.01. The post-hoc test was then applied. Figure 8 presents the post-hoc test results through the critical difference (CD) diagram.
As can be observed in Figure 8, both proposed methods outperformed all eight existing methods, being first and second in the ranking. The leftmost method, ERM-3ML, was statistically better than all existing ensemble methods. However, the CD diagram shows that there was no statistical difference between our two proposed methods. The ERM-ML method, in turn, was statistically superior to seven existing ensemble methods, but there was no statistical difference between this method and META. Additionally, the best ranked existing method was META, which outperformed all seven remaining methods. However, META was only statistically superior to RANK, MCB, OLA and LCA.
In summary, we can state that the ERM-3ML method was statistically better than all existing methods, and that the ERM-ML method was similar to META and statistically better than the seven remaining existing methods. The results shown in Figure 8 corroborate the idea that meta-learning can be used as a recommendation tool to define the best ensemble structure.
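For readers who wish to reproduce this kind of analysis, the sketch below computes the average ranks, the Friedman chi-squared statistic (without tie correction) and the Nemenyi critical difference in plain Python. The accuracy values are toy numbers, the function names are hypothetical, and the critical value q = 3.164 (the tabulated Nemenyi value for 10 methods at a significance level of 0.05) is an assumption, since the significance level used for the CD diagram is not stated in this section.

```python
import math


def average_ranks(rows):
    """Mean rank of each method; rank 1 is the highest accuracy, ties share ranks."""
    k = len(rows[0])
    totals = [0.0] * k
    for row in rows:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend the block while the next value ties with the current one.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the positions i+1 .. j+1
            for t in range(i, j + 1):
                totals[order[t]] += shared
            i = j + 1
    return [t / len(rows) for t in totals]


def friedman_statistic(rows):
    """Friedman chi-squared with k - 1 degrees of freedom (no tie correction)."""
    n, k = len(rows), len(rows[0])
    ranks = average_ranks(rows)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(q_alpha, k, n):
    """Critical difference between average ranks for k methods over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Toy accuracies: 4 datasets (rows) x 3 methods (columns); not the paper's data.
toy = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.95, 0.70, 0.75],
    [0.90, 0.85, 0.80],
]
ranks = average_ranks(toy)
statistic = friedman_statistic(toy)
# Setting of this paper: 10 methods, 100 datasets; q_alpha is an assumed value.
cd = nemenyi_cd(q_alpha=3.164, k=10, n=100)
print(ranks, statistic, round(cd, 3))
```

Two methods whose average ranks differ by less than the critical difference are joined by a bar in a CD diagram, which is how the absence of a statistical difference between ERM-ML and META is read from Figure 8.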

C. ATTRIBUTE DEPENDENCY ANALYSIS
Once we analyzed the performance of the ensemble systems in all 100 datasets, we also carried out an analysis to evaluate the dependence which might exist between each attribute and the performance of each analyzed method. The main motivation for performing this correlation analysis is that the meta-base has background information, and some information about a database may be more relevant than other information in a meta-learner's decision making process. This analysis aims to evaluate whether all attributes have similar influence or whether some of them are more influential than others in the ensembles' accuracy. In order to do this, the Pearson correlation [60] has been used. The Pearson correlation coefficient measures the degree of linear correlation between two quantitative variables. It is a dimensionless index, usually represented by the letter r, with values between −1 and 1, inclusive, which reflects the intensity of a linear relationship between two sets of data: r = 1 means a perfect positive correlation between the two variables, r = −1 means a perfect negative correlation between the two variables, and r = 0 means that the two variables do not depend linearly on each other. However, there may be another dependency that is nonlinear, which requires the data to be investigated by other means; this is better explained in [61].
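The three reference values of r, and the caveat about nonlinear dependence, can be checked with a few lines of plain Python. The function name `pearson_r` and the toy data are illustrative only.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


x = [-2, -1, 0, 1, 2]
increasing = [2 * v + 1 for v in x]  # perfect positive linear relation
decreasing = [-v for v in x]         # perfect negative linear relation
squared = [v * v for v in x]         # fully determined by x, but not linearly

r_pos = pearson_r(x, increasing)
r_neg = pearson_r(x, decreasing)
r_sq = pearson_r(x, squared)
print(r_pos, r_neg, r_sq)
```

The last case shows why a coefficient close to 0 does not rule out a dependency: the squared values are a function of x, yet their linear correlation with x vanishes.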
The correlations between the meta-attributes and the accuracy results of the obtained ensembles are presented in Table 7. The first column of Table 7 lists the main meta-attributes, and the remaining columns correspond to the analyzed methods. In addition, the highest correlation value for each meta-attribute is highlighted in bold.
From this table, it can be observed that the correlation between the majority of meta-attributes and the ensemble accuracy is weak (values close to 0). However, there is a high correlation between some meta-attributes and the ensemble accuracies. We highlight the most correlated meta-attributes as follows: Nr_attributes: Number of attributes; Nr_num_attributes: Number of numerical attributes; and SDRatio: An M-Statistic transformation that evaluates information in the covariance structure of classes.
Among these attributes, the highest correlation was detected for the Nr_attributes meta-attribute. In order to analyze whether this correlation is really apparent, a more detailed analysis is needed. To do this, we divided all datasets into three groups, based on the Nr_attributes meta-attribute: lower (up to 5 attributes), central (between 6 and 47 attributes) and upper (48 attributes or more). With this division, the lower group is composed of 10 datasets, the upper group of 10 datasets and the central group of 80 datasets.
For each attribute size group, we calculate the proportion of wins for each analyzed method. This analysis is summarized in Table 8. Note that one proposed method, ERM-ML, delivered the best result in 40% of the datasets in the lower group, 41% in the central group and 100% in the upper group. In addition, ERM-3ML delivered the best result in 90% of the datasets in the lower group, 56% in the central group and 100% in the upper group.
Thus, in all three attribute size groups, there is a predominance of the proposed methods in terms of good performance. However, the predominance of the proposed methods is stronger for the lower and upper groups. From this observation, we can state that the number of attributes is an important aspect that has a strong effect on the performance of all analyzed methods. This correlation is particularly clear for the ERM-ML method, for which the proportion of wins increases as the number of attributes increases.
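The grouping and the proportion of wins described above can be sketched as follows. The group boundaries come from the text; the function names and the accuracy values are hypothetical. Ties are credited to every tied method, which is an assumption, although it is consistent with both proposed methods reaching 100% in the upper group.

```python
from collections import defaultdict


def attribute_group(nr_attributes):
    """Group boundaries as stated in the text: <= 5, 6 to 47, >= 48."""
    if nr_attributes <= 5:
        return "lower"
    if nr_attributes <= 47:
        return "central"
    return "upper"


def win_proportions(records):
    """records: iterable of (nr_attributes, {method: accuracy}) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(lambda: defaultdict(int))
    for nr_attributes, accuracies in records:
        group = attribute_group(nr_attributes)
        totals[group] += 1
        best = max(accuracies.values())
        for method, accuracy in accuracies.items():
            # Assumption: every method that reaches the best accuracy gets a win.
            wins[group][method] += int(accuracy == best)
    return {g: {m: w / totals[g] for m, w in wins[g].items()} for g in totals}


# Hypothetical records, not taken from Table 8.
records = [
    (4, {"META": 0.80, "ERM-ML": 0.78, "ERM-3ML": 0.85}),
    (5, {"META": 0.90, "ERM-ML": 0.90, "ERM-3ML": 0.88}),
    (20, {"META": 0.82, "ERM-ML": 0.86, "ERM-3ML": 0.84}),
    (60, {"META": 0.70, "ERM-ML": 0.95, "ERM-3ML": 0.95}),
]
proportions = win_proportions(records)
for group, by_method in proportions.items():
    print(group, by_method)
```

Because ties count for every tied method, the proportions within a group may add up to more than 100%, as they do for the lower and upper groups reported in Table 8.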

VII. CONCLUSION
As an attempt to solve the problem of defining the optimal ensemble structure, this paper proposed the use of meta-learning as a recommendation tool for different ensemble parameters, such as pool size, classifier types and aggregation function. The main aim is to propose a recommendation system that provides accurate classifier ensembles. In the proposed approaches, the recommendation task is divided into two phases, training and evaluation. During the training phase, we extracted the characterization of a dataset, evaluated the best ensemble topology for this dataset and stored this information in a meta-database. In the evaluation phase, we applied a classifier to model the meta-base dataset (meta-learner). Then, we recommended the best pool size, classifier, and aggregation function for an unseen instance (classification dataset).
The proposed approach can be used to recommend the optimal ensemble structure, and it can be applied to any classification problem. Nonetheless, it can be applied only to classifier ensembles. In addition, its main drawback is the creation of the meta-base, which can be time consuming; this, however, is a limitation of any meta-learning recommendation system.
In order to assess the feasibility of the proposed approaches, an empirical analysis was conducted. In this experimental analysis, the performance of the proposed approaches was compared to that of eight well-known ensemble methods, applied to 100 well-known classification problems (datasets). The obtained results indicate that the proposed methods can indeed be used as a recommendation tool for an ensemble topology, providing the most accurate classifier ensembles for the majority of datasets. These results are promising, showing that the use of meta-learning to recommend the ensemble structure is a robust way to achieve accurate classifier ensembles.