A Survey of Ensemble Learning: Concepts, Algorithms, Applications, and Prospects

Ensemble learning techniques have achieved state-of-the-art performance in diverse machine learning applications by combining the predictions from two or more base models. This paper presents a concise overview of ensemble learning, covering the three main ensemble methods (bagging, boosting, and stacking), from their early development to the recent state-of-the-art algorithms. The study focuses on the widely used ensemble algorithms, including random forest, adaptive boosting (AdaBoost), gradient boosting, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), and categorical boosting (CatBoost). An attempt is made to concisely cover their mathematical and algorithmic representations, which is lacking in the existing literature and would be beneficial to machine learning researchers and practitioners.

are not prone to fatigue or prejudice. Meanwhile, learning can sometimes be challenging, especially when learning from high-dimensional and imbalanced datasets [7], [8], [9]. Research has shown that conventional ML algorithms tend to underperform when trained with imbalanced datasets [10]. Therefore, researchers have frequently resorted to new and improved learning approaches, such as ensemble and deep learning.
Ensemble learning and deep learning are two approaches that have dominated the machine learning domain [11], [12], [13]. Ensemble learning methods train multiple base learners and combine their predictions to obtain improved performance and better generalization ability than the individual base learners [14]. The fundamental idea behind ensemble learning is the recognition that machine learning models have limitations and can make errors. Hence, ensemble learning aims to improve classification performance by harnessing the strengths of multiple base models. Some limitations of single ML algorithms are that they can result in models with high variance, high bias, and low accuracy [15], [16]. However, several studies have shown that ensemble models often achieve higher accuracy than single ML models [17]. Ensemble methods can limit the variance and bias errors associated with single ML models; for example, bagging reduces variance without increasing the bias, while boosting reduces bias [18], [19], [20]. Overall, ensemble classifiers are more robust and perform better than the individual ensemble learners.

Ensemble learning methods are broadly categorized into boosting, bagging, and stacking [21]. Gradient boosting, XGBoost, and AdaBoost are examples of boosting algorithms.

Section 5 presents a discussion and future research directions, while Section 6 concludes the paper.

This section presents an overview of ensemble learning, detailing the building blocks of most ensemble methods, the techniques used for combining ensemble base learners, and ensemble selection methods. Most research works refer to the 1979 article by Dasarathy and Sheela [41] as one of the foundational works of ensemble learning. The authors presented a method to partition the feature space using multiple component classifiers. Then, in 1990, Hansen and Salamon [42] demonstrated that an ensemble of similar artificial neural network classifiers achieved superior prediction performance compared to a single classifier. In the same year, Schapire [43] proposed the boosting technique, a method developed to convert a weak classifier into a strong one, and this technique was the foundation that gave rise to present-day robust algorithms such as AdaBoost, gradient boosting, and XGBoost [33].

Ensemble learning is a technique used to combine two or more ML algorithms to obtain superior performance compared to when the constituent algorithms are used individually. Instead of relying on a single model, the predictions from the individual learners are combined using a combination rule to obtain a single prediction that is more accurate. Generally, ensemble methods can be classified into parallel and sequential ensembles. The parallel methods train different base classifiers independently and combine their predictions using a combiner. A popular parallel ensemble method is bagging and its extension, the random forest algorithm [44]. Parallel ensemble algorithms use the parallel generation of base learners to encourage diversity among the ensemble members.

Meanwhile, sequential ensembles do not fit the base models independently. They are trained iteratively so that the models at every iteration learn to correct the errors made by the previous model.
A popular type of sequential ensemble is the boosting algorithm [45]. Figures 1 and 2 show block diagrams depicting parallel and sequential ensemble learning. Furthermore, parallel ensembles can be classified into homogeneous or heterogeneous, depending on the base learners' homogeneity. Homogeneous ensembles consist of models built using the same ML algorithm, while heterogeneous ensembles comprise models from different algorithms [46], [47], [48].

The success of ensemble learning techniques mainly relies on the accuracy and diversity of the base learners [49]. A machine learning model is considered accurate if it has good generalization ability on unseen instances. In contrast, ML models are diverse if their errors on unseen instances are not the same [47]. Therefore, diversity is seen as the difference between base learners in an ensemble [50]. Unlike accuracy, there is no general rule of thumb for measuring diversity, and it is challenging to achieve diversity among the base learners.

Meanwhile, the individual base models in an ensemble often do not have equal performance. Hence, considering them equally in the summation might be inappropriate [55]. A more suitable solution is to weigh the performance of the individual models using the weighted majority voting technique.

2) WEIGHTED MAJORITY VOTING
Weighted majority voting assumes that some classifiers in the ensemble are more skilful than others, and their predictions are given more priority when computing the final ensemble prediction. Conventional majority voting assumes that all the base models are equally skilled and treats their predictions equally when calculating the final ensemble prediction. In contrast, weighted majority voting assigns a specific weight to each base classifier, which is then multiplied by the model's output when computing the final ensemble prediction [56]. Assuming the generalization ability of each base model is known, a weight W_t can be assigned to classifier h_t according to its estimated generalization ability. Therefore, using weighted majority voting, the ensemble classifier selects class c* if

$$\sum_{t=1}^{T} W_t\,\mathbf{1}\bigl[h_t(x) = c^*\bigr] \;\geq\; \sum_{t=1}^{T} W_t\,\mathbf{1}\bigl[h_t(x) = c_j\bigr] \quad \text{for every class } c_j.$$
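To make the voting rules above concrete, the following minimal Python sketch combines the class predictions of three classifiers using both plain and weighted majority voting. This is an illustrative example only (the predictions and weights are made up, not taken from any surveyed study).

```python
import numpy as np

def majority_vote(predictions):
    """Plain majority voting: every classifier contributes one equal vote."""
    values, counts = np.unique(predictions, return_counts=True)
    return values[np.argmax(counts)]

def weighted_majority_vote(predictions, weights):
    """Weighted majority voting: classifier t's vote counts W_t times."""
    scores = {}
    for pred, w in zip(predictions, weights):
        scores[pred] = scores.get(pred, 0.0) + w
    return max(scores, key=scores.get)

# Hypothetical predictions of T = 3 base classifiers for one test instance
preds = np.array([1, 0, 0])
# Hypothetical weights, e.g., derived from each model's validation accuracy
weights = np.array([0.75, 0.40, 0.30])

print(majority_vote(preds))                     # -> 0 (two of three classifiers vote 0)
print(weighted_majority_vote(preds, weights))   # -> 1 (the strongest classifier outweighs the others)
```

The example shows how the two schemes can disagree: the weighted rule follows the single most reliable classifier, while the unweighted rule follows the count of votes.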

Ensemble selection methods are broadly grouped into static and dynamic approaches. Static methods treat the selection process as an optimization problem that can be solved using mathematical programming or heuristic optimization; however, static methods limit the flexibility of the ensemble selection process.
The dynamic methods select a subset of models for making a prediction based on specific features of the unseen instances. Every region of the input feature space is assigned a subset of models that perform best in that region. A foremost example of the dynamic selection strategy is the k-Nearest Neighbor Oracle (KNORA) [63]. Recently, Nguyen et al. [59] proposed a selection strategy that combines aspects of both static and dynamic ensemble selection, and other dynamic ensemble selection methods have been proposed in the recent literature.

Boosting is a machine learning technique capable of converting weak learners into a strong classifier. It is a type of ensemble meta-algorithm used for reducing bias and variance. A weak learner is a classifier that performs only slightly better than random guessing, while strong learners are those that attain good accuracy [16]; weak learners are the building blocks on which boosting ensemble algorithms are built. The boosting algorithm was first discussed in a 1990 paper by Schapire [43].

The main idea behind boosting involves iteratively applying the base learning algorithm to adjusted versions of the input data [14]. In particular, boosting techniques use the input data to train a weak learner, compute the predictions from the weak learner, select the misclassified training samples, and train the subsequent weak learner with an adjusted training set comprising the misclassified instances from the previous training round [71]. The iterative learning process continues until a predefined number of base learners is obtained, and the base learners are weighted together [27]. Boosting focuses more on reducing bias than variance [28]. Therefore, it enhances base learners with high bias and low variance, such as decision stumps (decision trees with a single internal node). Misclassified samples get more weight, causing the base learner to focus on such samples. Hence, if the base classifier is biased against specific samples, those samples are given more weight, and the algorithm corrects the bias. However, this iterative learning approach makes boosting unsuitable for learning from noisy data because the weight given to noisy samples becomes much greater than the weights given to the other samples, forcing the algorithm to focus excessively on the noisy samples and resulting in overfitting. Despite that, boosting-based ensemble methods remain widely used.

In AdaBoost, a weight distribution D_t is maintained over the training samples and updated at every boosting round using

$$D_{t+1}(i) = \frac{D_t(i)\exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr)}{Z_t},$$

where h_t is the weak classifier trained in round t. Furthermore, Z_t is selected such that D_{t+1} will be a distribution. Z_t and α_t are obtained using

$$Z_t = \sum_{i=1}^{N} D_t(i)\exp\bigl(-\alpha_t\, y_i\, h_t(x_i)\bigr), \qquad \alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right),$$

where ε_t represents the error rate of the classifier, and it is obtained using

$$\varepsilon_t = \sum_{i=1}^{N} D_t(i)\,\mathbf{1}\bigl[h_t(x_i) \neq y_i\bigr].$$

When the given number of iterations has been completed, the final strong classifier is computed using

$$H(x) = \operatorname{sign}\!\left(\sum_{t=1}^{T} \alpha_t\, h_t(x)\right).$$

The AdaBoost algorithm is summarized in Algorithm 1. AdaBoost is easy to implement, with little need to tune its hyperparameters [77]. Furthermore, AdaBoost is flexible and can use a variety of algorithms as the base learner; hence, an algorithm suitable for a specific application can be used as the base learner, and AdaBoost can enhance its performance. However, a limitation of AdaBoost is that it is sensitive to noisy data and outliers because of its iterative learning approach, which can cause overfitting.

Algorithm 1 AdaBoost
Input: training data S = {(x_1, y_1), ..., (x_N, y_N)}; the number of iterations T.
Procedure: Initialize the sample weight distribution D_1(i) = 1/N.
for t = 1, ..., T:
1) Train a base classifier h_t on S using the distribution D_t.
2) Calculate the weight α_t of the classifier.
3) Update the sample weights to obtain D_{t+1}.
end for
Output: Combine the predictions of the base classifiers to obtain the final strong classifier H(x).
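As an illustration of these update rules, the following is a minimal from-scratch sketch of AdaBoost with decision stumps. It is an illustrative implementation of the standard algorithm rather than code from the surveyed papers, and it assumes scikit-learn is available and that binary labels are encoded as -1/+1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Train T decision stumps; y must contain labels in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                      # initial sample weight distribution D_1
    stumps, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)
        pred = stump.predict(X)
        eps = np.clip(np.sum(D[pred != y]), 1e-10, 1 - 1e-10)   # weighted error rate
        alpha = 0.5 * np.log((1 - eps) / eps)    # classifier weight alpha_t
        D = D * np.exp(-alpha * y * pred)        # re-weight the samples
        D = D / D.sum()                          # normalize (divide by Z_t)
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Weighted vote of the stumps: H(x) = sign(sum_t alpha_t * h_t(x))."""
    scores = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(scores)

# Small synthetic example
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y - 1                                    # map {0, 1} -> {-1, +1}
stumps, alphas = adaboost_fit(X, y, T=30)
print("training accuracy:", np.mean(adaboost_predict(X, stumps, alphas) == y))
```

In practice, scikit-learn's AdaBoostClassifier provides an equivalent, more complete implementation of the same procedure.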

Gradient boosting is a machine learning algorithm that uses the boosting technique to create strong ensembles. It mainly uses decision trees as the base learner to produce a robust ensemble classifier, and it is also called the gradient boosted decision tree (GBDT).

Given a training set S = {x_i, y_i}_1^N, the gradient boosting technique aims to find an approximation, F̂(x), of a function F*(x) that maps the predictor variables x to their response variables y, by minimizing a specified loss function L(y, F(x)). The GBDT creates an additive approximation of F*(x) through a weighted sum of functions:

$$F_m(x) = F_{m-1}(x) + \rho_m h_m(x),$$

where ρ_m represents the weight of the m-th function, h_m(x). These functions are the decision tree models in the ensemble. The algorithm performs the approximation iteratively. A constant approximation of F*(x) is first obtained using

$$F_0(x) = \operatorname*{arg\,min}_{\alpha} \sum_{i=1}^{N} L(y_i, \alpha).$$
Successive base learners aim to minimize the loss of the current ensemble. Rather than performing this optimization task directly, each new base learner is fitted to the negative gradient of the loss function (the pseudo-residuals) evaluated at the previous model's predictions, and it is computed as

$$r_{im} = -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F(x) = F_{m-1}(x)}, \qquad i = 1, \ldots, N.$$

Algorithm 2 Gradient Boosting
Input: training data S = {(x_1, y_1), ..., (x_N, y_N)}; a differentiable loss function L(y, F(x)); the number of iterations T.
Procedure:
1) Initialize the model with a constant value: F_0(x) = arg min_α Σ_i L(y_i, α).
2) for m = 1, ..., T:
(i) Compute the pseudo-residuals r_im for i = 1, ..., N.
(ii) Train a base learner h_m using the training set {(x_i, r_im)}.
(iii) Obtain ρ_m by performing the line search optimization ρ_m = arg min_ρ Σ_i L(y_i, F_{m-1}(x_i) + ρ h_m(x_i)).
(iv) Update the model: F_m(x) = F_{m-1}(x) + ρ_m h_m(x).
end for
Output: the final model F_T(x).

XGBoost builds on the GBDT framework by minimizing an objective function that comprises a loss term and a regularization term that prevents overfitting [48]:

$$\mathcal{L} = \sum_{i=1}^{N} L\bigl(y_i, F(x_i)\bigr) + \sum_{m=1}^{M} \Omega(h_m),$$

where F(x_i) represents the prediction on the i-th instance at the M-th iteration, and L(·) represents a loss function that computes the difference between the predicted class and the actual class of the target variable. Meanwhile, Ω(h_m) denotes the regularization term, and it is formulated as

$$\Omega(h) = \Upsilon T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2,$$

where ϒ represents the complexity parameter, which controls the minimum loss reduction gain required for splitting an internal node. Assigning a high value to ϒ leads to simpler trees. Meanwhile, T represents the number of leaves in the tree, λ is a penalty parameter, and ω denotes the output of the leaf nodes. Unlike the first-order derivative used in GBDT, a second-order Taylor approximation of the objective function is employed in XGBoost. Therefore, the objective function is transformed thus:

$$\mathcal{L}^{(m)} \approx \sum_{i=1}^{N} \left[g_i\, h_m(x_i) + \frac{1}{2}\, h_i\, h_m^2(x_i)\right] + \Omega(h_m),$$

where g_i and h_i denote the first and second derivatives of the loss function. Assuming I_j represents the samples in leaf node j, the final loss value is calculated by summing the loss values of the various leaf nodes. Hence, the objective function is represented as

$$\mathcal{L}^{(m)} = \sum_{j=1}^{T}\left[\left(\sum_{i \in I_j} g_i\right)\omega_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right)\omega_j^2\right] + \Upsilon T.$$

XGBoost frequently outperforms other machine learning algorithms, which is why it has won several Kaggle competitions.
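The additive GBDT approximation above can be illustrated with a minimal from-scratch sketch for regression under the squared-error loss, where the pseudo-residuals reduce to ordinary residuals. This is an illustrative example under that assumption, not the implementation used by any library.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_rounds=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for regression with squared-error loss."""
    F0 = y.mean()                                # constant initial model F_0(x)
    F = np.full_like(y, F0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                        # pseudo-residuals (negative gradient of squared loss)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                   # train base learner h_m on the residuals
        F += learning_rate * tree.predict(X)     # F_m = F_{m-1} + shrinkage * h_m
        trees.append(tree)
    return F0, trees

def gbdt_predict(X, F0, trees, learning_rate=0.1):
    return F0 + learning_rate * sum(t.predict(X) for t in trees)

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=1)
F0, trees = gbdt_fit(X, y)
pred = gbdt_predict(X, F0, trees)
print("training RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

Here a fixed learning rate (shrinkage) stands in for the per-iteration line search over ρ_m, which is the common practical simplification.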

However, it has a few limitations, including its high number of hyperparameters, which makes it difficult to tune [86], [87].

Compared to XGBoost, the LightGBM algorithm achieves a shorter training time [93]. In terms of disadvantages, the LightGBM can easily overfit small training datasets, as it performs better with large datasets. Also, splitting the tree leaf-wise could result in overfitting because more complex trees are produced.
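As a usage-level illustration, the regularization quantities discussed above map onto library hyperparameters roughly as follows: the minimum-split-loss parameter corresponds to gamma in XGBoost and min_split_gain in LightGBM, the leaf penalty λ to reg_lambda, and leaf-wise tree complexity is controlled through num_leaves in LightGBM. The sketch below assumes the xgboost and lightgbm Python packages are installed; the parameter values are arbitrary examples, not recommended settings.

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# XGBoost: gamma = minimum loss reduction to split a node, reg_lambda = L2 penalty on leaf outputs
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=4,
                    gamma=1.0, reg_lambda=1.0)
xgb.fit(X_tr, y_tr)

# LightGBM: leaf-wise growth bounded by num_leaves; min_split_gain plays a role similar to gamma
lgbm = LGBMClassifier(n_estimators=200, learning_rate=0.1, num_leaves=31,
                      min_split_gain=0.0, reg_lambda=1.0)
lgbm.fit(X_tr, y_tr)

print("XGBoost accuracy:", xgb.score(X_te, y_te))
print("LightGBM accuracy:", lgbm.score(X_te, y_te))
```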

Meanwhile, the LightGBM has been applied to different classification problems, achieving excellent results [94], [95], [96], [97], and its procedure is presented in Algorithm 3. A detailed explanation of the LightGBM technique can be found in [88]. Also, a comprehensive mathematical overview of the LightGBM algorithm is presented in [98].

Algorithm 3 LightGBM
Input: training data S; the loss function L(y, F(x)); the number of iterations T; the sampling ratio of large-gradient data a; and the sampling ratio of small-gradient data b.
Procedure:
for t = 1, ..., T:
(i) Compute the absolute values of the gradients of the loss function.

The CatBoost algorithm is an implementation of gradient boosting proposed in 2017 by Prokhorenkova et al. [99]. The algorithm effectively handles categorical features during the training phase. A notable improvement in CatBoost is its ability to perform unbiased gradient estimation, which reduces overfitting. Therefore, in order to estimate the gradient of each example at every boosting iteration, the CatBoost algorithm omits that example from being used to train the current model [100].

CatBoost also replaces categorical feature values with ordered target statistics computed from a random permutation of the training examples. If σ = (σ_1, ..., σ_n) is the permutation, then x_{σ_p,k} is replaced with

$$\frac{\sum_{j=1}^{p-1} \mathbf{1}\bigl[x_{\sigma_j,k} = x_{\sigma_p,k}\bigr]\, y_{\sigma_j} + a\,P}{\sum_{j=1}^{p-1} \mathbf{1}\bigl[x_{\sigma_j,k} = x_{\sigma_p,k}\bigr] + a},$$

where P is a prior value, and a is the weight of the prior value.

In bagging, perturbing the learning set could lead to significant modifications in the obtained predictor; hence, bagging can improve accuracy [105]. Meanwhile, diversity is obtained in bagging by creating bootstrapped replicas of the input data, where several subsets of the input data are picked randomly with replacement from the original training set. Therefore, the various training sets are seen as diverse and are used to train multiple base learners of the same ML algorithm. Basically, the bagging method involves splitting the training data for each base learner using random sampling to generate b different subsets, which are used to train b base learners. The b base learners are then combined using majority voting to obtain a strong classifier [27]. The bagging procedure is shown in Algorithm 4. Random forest is a popular implementation of the bagging technique.

Bagging enhances the performance of base learners more if the algorithm used in learning the model is unstable. An unstable algorithm significantly changes its generalization ability when slight modifications are made to its input. Bagging focuses more on reducing the variance of the ensemble members than the bias. Therefore, bagging performs optimally when the ensemble members have high variance and low bias. An example of an unstable algorithm is the decision tree; hence, bagged decision trees usually perform better than a single decision tree. Meanwhile, k-nearest neighbor (KNN) and naïve Bayes are examples of stable algorithms, and bagging does not perform well with these algorithms as base learners [28].
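To illustrate the bagging procedure described above, here is a minimal sketch using scikit-learn (an illustrative example, not code from the surveyed works) that trains bagged decision trees on bootstrap samples and compares them with a single, high-variance tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# A single (unstable, high-variance) decision tree
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Bagging: b = 100 base learners, each trained on a bootstrap replica of the training set
# and combined by voting; the default base estimator is a decision tree.
bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=42).fit(X_tr, y_tr)

print("single tree accuracy:", tree.score(X_te, y_te))
print("bagged trees accuracy:", bag.score(X_te, y_te))
```

Random forest extends this idea by also sampling a random subset of features at each split, which further decorrelates the trees (available in scikit-learn as RandomForestClassifier).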

A major advantage of bagging is that it efficiently decreases the variance without increasing the bias. Other advantages of bagging include its ability to introduce diversity into the input data because of the bootstrapping approach. For large datasets, bagging has less computational time than most ML algorithms since it trains each model with a small sample size [20]. Meanwhile, a limitation of bagging is that it enhances the model's accuracy without regard for interpretability. For example, if only one tree were applied as the base learner, a suitable and easy-to-interpret tree diagram would have been obtained; this interpretability is lost since bagging uses many decision trees. Also, the features selected during training are not interpretable in bagging, so there could be situations where certain vital features are never used.

In the random forest algorithm, each tree is grown on a bootstrap sample of the training data, and the procedure concludes as follows:
3) Repeat steps 1 and 2 T times to build a forest of T trees.
end for
Output: Combine the outputs of the various trees. For a given test sample x, the final predicted class label from the T trees is

$$H(x) = \operatorname*{arg\,max}_{c}\; \sum_{t=1}^{T} \mathbf{1}\bigl[h_t(x) = c\bigr].$$

Stacked generalization (stacking) is an ensemble learning framework that trains a separate ML algorithm to combine the predictions from two or more ensemble members. It was introduced in 1992 by Wolpert [117] to reduce the generalization error in machine learning problems. Stacking is useful in situations where several ML models are uniquely skilful on a particular task; the stacking approach then uses a separate ML model to learn when to use the predictions from the various models [118]. The stacking framework is depicted in Figure 3.

Specifically, stacking involves building models using multiple base algorithms, called level-0 models, and a meta-learning algorithm that trains another model to combine the predictions from the base models. The meta-model is termed a level-1 model [33]. The core idea in stacking is that the level-0 base learners are trained using the training dataset and are then provided with out-of-sample or unseen data; their predicted target labels on the unseen data, together with the actual labels, form the input and output pairs of a new dataset used to train the meta-learner [119].

Input: training data S; T base learning algorithms.
Procedure:
Step 1: Train the base learning models: for t = 1, ..., T: fit a base learner h_t using S. end for
Step 2: Obtain a new dataset from S: for t = 1, ..., T: collect the predictions h_t(x_i) together with the actual labels y_i. end for
Step 3: Train the meta-learner ĥ using the new dataset.
return H(x) = ĥ(h_1(x), h_2(x), ..., h_T(x))
Output: A stacked ensemble classifier H.
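The following minimal sketch uses scikit-learn's StackingClassifier to stack three heterogeneous level-0 learners, with logistic regression as the level-1 meta-learner; cross-validated (out-of-fold) predictions of the base learners form the meta-learner's training data. It is an illustrative example, not the exact setup of any surveyed study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Level-0 (base) models: deliberately heterogeneous algorithms
level0 = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=7)),
]

# Level-1 meta-learner combines the base models' cross-validated predictions
stack = StackingClassifier(estimators=level0,
                           final_estimator=LogisticRegression(),
                           cv=5)
stack.fit(X_tr, y_tr)
print("stacked ensemble accuracy:", stack.score(X_te, y_te))
```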

Ensemble learning methods have obtained excellent performance in numerous applications, attracting much attention in many research fields. These methods are capable of enhancing the generalization ability of single classifiers. This section highlights some applications of ensemble methods that have dominated the field of applied machine learning in recent years: medical diagnosis, fraud detection, and sentiment analysis [127], [128], [129], [130]. Each of these application areas has problems that make it difficult for traditional ML algorithms to perform well. These problems include the class imbalance in medical and fraud detection datasets, the limited sample size of medical datasets, the many redundant attributes in medical and fraud detection data, and the challenge of decoding the ambiguity of human language in sentiment analysis.

Recently, there have been numerous advancements in the application of machine learning for diagnosing diseases such as heart disease, hypertension, cancer, and diabetes. The early detection of diseases is crucial for effectively managing their progression. Electrocardiograms, computerized tomography (CT) scans, and other medical tests can detect diverse diseases. Still, the high cost of using such machines has made it difficult for people to access them, especially in developing countries [131]. Meanwhile, machine learning-based methods have been developed to overcome the challenges associated with the traditional methods [132]. However, there are specific challenges faced when learning from medical datasets, including the imbalanced class problem and outliers.

Deep learning (DL) algorithms have recently received much attention due to their robust learning and generalization abilities. An et al. [137] proposed an ensemble approach coupled with deep learning techniques to predict Alzheimer's disease, a brain disorder associated with memory loss. Firstly, two sparse autoencoders (SAE) were employed for feature learning. Secondly, different ML algorithms were employed to develop several models from the learned data. A deep belief network (DBN) was then utilized to train a meta-model that combines the predictions of the base models. The final deep ensemble was applied to classify Alzheimer's disease using clinical data. The results show that the proposed method achieved an accuracy 4% higher than six widely used ensemble algorithms.

There have been significant advancements in machine learning for medical diagnosis. However, many machine learning models only output the disease prediction or classification without explaining the underlying decision-making process. Gu et al. [138] proposed using ensemble learning combined with case-based reasoning (CBR) for explainable breast cancer detection. The model performed a case-based explanation of the output predictions, thereby aiding clinicians in making better decisions. The XGBoost algorithm was used to build the predictive ensemble model, and the CBR provided an interpretation of the predictions. The proposed method is an improvement on recent ML research, which has focused more on improving accuracy than on providing relevant explanations for the predictions.

Aljame et al. [139] applied ensemble learning to detect the novel coronavirus disease (COVID-19). The study aimed to detect the disease early using the XGBoost algorithm. The dataset was first preprocessed to make it suitable for building the ML models: the missing values and outliers in the dataset were fixed using the k-nearest neighbor (KNN) imputation technique and isolation forest, respectively. The dataset contained 5644 instances with 559 positive cases, indicating an imbalanced dataset. Therefore, the SMOTE technique was used to create a new dataset with an even class distribution, and the XGBoost classifier was then trained on it.

Another study detected heart disease by combining SVM, KNN, and random forest (RF). Several ensemble combination schemes were employed, including majority voting, weighted average voting (WAV), and average voting. The study employed the SMOTE technique to resample the data and obtain a balanced dataset for training the ML model. Also, the Boruta feature selection technique was used to select the optimal feature set. The experimental results indicated that the ensemble classifier obtained using the weighted average voting method achieved the highest accuracy of 100%. Furthermore, the statistical results showed the superior performance of the weighted average voting technique, which efficiently distinguished heart disease cases from healthy ones. Chicco and Jurman [148] applied ensemble learning to detect hepatitis. The study employed the random forest algorithm for feature selection and to build the prediction model, which was trained using electronic patient records containing 615 instances. The study identified the alanine aminotransferase and aspartate aminotransferase enzymes as the most informative features in the dataset. The ensemble achieved a classification accuracy of 95.4%.
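The preprocessing steps mentioned above (KNN imputation, isolation-forest outlier removal, and SMOTE oversampling) can be sketched as follows. This is a hedged illustration of the general techniques, assuming scikit-learn and the imbalanced-learn (imblearn) package are installed; it is not the actual pipeline of Aljame et al.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.impute import KNNImputer

# Synthetic imbalanced data with some missing values, standing in for a medical dataset
X, y = make_classification(n_samples=2000, n_features=15, weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan          # inject 5% missing values

# 1) Impute missing values with KNN imputation
X = KNNImputer(n_neighbors=5).fit_transform(X)

# 2) Remove outliers flagged by an isolation forest
mask = IsolationForest(random_state=0).fit_predict(X) == 1
X, y = X[mask], y[mask]

# 3) Balance the classes with SMOTE before training the classifier
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("class counts after SMOTE:", np.bincount(y_bal))
```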

In another study, Ghiasi and Zendehboudi [149] used two decision tree-based ensemble algorithms to build breast cancer prediction models. The algorithms were the extra trees classifier (ETC) and the random forest algorithm, and the Wisconsin breast cancer dataset was used to train and test the models. The models achieved accuracies of 100%, outperforming the methods in the previous literature, and the study concluded that both algorithms were effective at accurately diagnosing breast cancer. Similarly, Nanglia et al. [150] proposed a breast cancer detection model using an ensemble of multiple classifiers, including the decision tree, SVM, and KNN. The study employed a meta-classifier to combine the predictions of the ensemble members. The proposed ensemble achieved a classification accuracy of 78%, which was superior to the individual base models. Furthermore, the proposed ensemble model's performance was better than models developed using the naïve Bayes, ANN, and logistic regression algorithms.

From the papers surveyed in this section, it is observed that ensemble methods have been widely utilized for diverse disease predictions and medical diagnoses, achieving excellent performance and outperforming traditional machine learning algorithms. Table 1 presents a summary of the articles discussed in this section.

In one study, a feature extraction step was first implemented using the N-gram stacked autoencoder neural network, and the output of this step was then used to build ML models using the SVM, decision tree, KNN, and random forest algorithms. Secondly, the ML models were combined using mean- and mode-based voting methods. The proposed deep learning-based feature extraction step, integrated with the ensemble learning classification step, was compared with other methods in the recent literature, and the proposed method obtained superior performance with an accuracy of 87.8%.

Lin et al. [167] proposed an ensemble learning method to identify fake and harmful information in news reports. Usually, a news report is expected to be neutral, stating the facts, and should not contain too much of the writer's emotions and personal opinions. However, this is not always the case, as some news reports are written maliciously to achieve personal gains; hence, the study aimed to detect such news and classify it as fake. The study employed the bidirectional encoder representations from transformers (BERT) model together with ensemble learning techniques for text sentiment classification. Specifically, the research employed the bagging and stacking ensemble techniques to enhance the model's performance. The experimental results showed that the ensemble techniques enhanced the prediction of fake news, with an accuracy of 99%.

AlGhamdi et al. [168] proposed an ensemble learning-based approach to perform aspect-oriented sentiment identification using datasets from multiple domains. The study employed the prior-knowledge topic model algorithm and the stacking-based ensemble technique. From the experimental results, the proposed approach efficiently classified the labels of the three review domains, i.e., Saudi airlines, restaurants, and movies, with classification accuracies of 84.4%, 83.2%, and 84%, respectively. Meanwhile, the proposed approach achieved better performance compared to some baseline classifiers. Another aspect-based sentiment analysis model was developed by Zhang et al. [169] to predict customers' true consumption emotions and readiness to consume again using restaurant review data. Firstly, four conventional ML algorithms were used to develop single models: SVM, XGBoost, gradient boosting decision tree, and logistic regression. However, since a model developed using a single ML algorithm usually has inadequate diversity, the study employed the four algorithms as base learners.

Other applications include the approaches reported in [172], and an opinion target extraction method for restaurant review sentences that uses majority voting [173]. Table 3 presents a summary of the articles discussed in this section.

In medical diagnosis, it is observed that random forest and XGBoost have been mostly used in the literature. Meanwhile, in fraud detection, AdaBoost is the most popular. The AdaBoost algorithm is fast, simple, and easy to model, with less need for hyperparameter tuning. A variety of algorithms can be used to train the base learner in AdaBoost. Therefore, the user can select an algorithm suitable for a specific task as the base estimator. For example, some research works used LSTM, a robust algorithm for time-series modelling, as the base estimator for credit card fraud detection, a task needing time-series modelling.

From the previous section, we can infer that stacking is mainly preferred for sentiment analysis. Several researchers have applied and proposed diverse stacking-based methods for different sentiment analysis problems. Firstly, the main challenge in sentiment analysis tasks is obtaining models that efficiently learn the data to achieve high performance. Stacking methods are mainly used because, unlike bagging and boosting, stacking employs different ML algorithms, which learn from the data differently. These algorithms approach the learning problem differently, leading to enhanced performance due to the diversity in the ensemble model. Secondly, a meta-classifier learns how best to combine the base models, ensuring optimal performance.

As is evident in this paper and other research works, ensemble learning has evolved over the years and can be considered a mature subfield of machine learning compared to deep learning, transfer learning, and reinforcement learning. Furthermore, ensemble learning methods have been combined with deep learning for different applications, including human activity recognition [174], time series forecasting [175], disease prediction [137], [176], [177], wind speed prediction [178], and outlier detection [179]. Also, ensemble learning, deep learning, and transfer learning were combined in [180] to estimate the capacity of lithium-ion batteries. The literature shows that the performance of deep learning techniques can be further improved when fused with some ensemble mechanism. For example, based on the notion that diverse convolutional neural network (CNN) methods learn different aspects of an image representation, an ensemble of CNNs was developed in [181] to achieve better feature extraction from medical images. The experimental results showed that the proposed approach obtained better feature extraction and accuracy than when a single CNN architecture was used. Even though much research has been carried out in ensemble deep learning and other areas, as seen in the previous section, many problems where ensemble methods can perform well are still unexplored. These unexplored areas are potential future research directions, and they are outlined below.

This survey covered the widely used ensemble algorithms, including random forest, AdaBoost, gradient boosting, XGBoost, LightGBM, and CatBoost. This paper will be crucial for machine learning researchers and practitioners who wish to understand ensemble learning and the popular ensemble algorithms.