A Sampling-Based Stack Framework for Imbalanced Learning in Churn Prediction

Churn prediction is gaining popularity in the research community as a powerful paradigm that supports data-driven operational decisions. Datasets related to churn prediction are often skewed, with an imbalanced class distribution. Data-level solutions, like over-sampling and under-sampling, have been commonly used by researchers to address this problem. There are a limited number of case studies that attempt to evolve these data-level solutions by integrating them with computationally advanced frameworks, like ensembles. Ensembles primarily employ algorithmic diversity using a fixed set of training instances to achieve superior performance. This study aims to introduce algorithmic diversity in ensembles by modifying the fixed set of training instances using diverse sampling strategies to increase predictive performance in imbalanced learning. Data is acquired from the world's largest open hotel commerce platform company. A four-part series of experiments is conducted to analyze the effectiveness of sampling techniques and ensemble solutions on model performance. A new sampling-based stack framework called "Stacking of Samplers for Imbalanced Learning" is proposed. The framework combines the prediction capabilities of sampling solutions to stimulate the information gain of the meta features in the ensemble. It is observed that the proposed framework leads to an improvement in model performance, with an AUC of 86.4% and a top-decile lift of 4.7 for customers of the hotel technology provider. Additionally, results show that the framework records a higher information gain for the meta features used in a stack, compared to commonly used stack frameworks.


I. INTRODUCTION
Churn prediction aims to identify the consumers who are likely to terminate their service contract with a service provider. The intention behind a customer's churn decision may be involuntary or voluntary in nature. A consumer in financial distress may not have a reputable payment history, forcing the service provider to cancel the service contract. This is an example of involuntary churn. When a consumer actively chooses to leave a brand because of a price advantage from a competitor, or poor customer service standards, this is called voluntary churn.
A typical churn prediction workflow that uses machine learning techniques follows the process of data collection and cleansing, feature selection, model application, and finally, churn prediction. Literature on churn prediction focuses mainly on one or more of the above-mentioned processes.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Liu.
Churn datasets mostly use demographic and product-usage related features. However, recent studies have incorporated a new set of attributes that are derived from social network of a consumer. Predicting churn, like any other discipline, is associated with a few challenges. Class imbalance is one such challenge that has not received much attention to clear ground for academic innovation. An imbalanced dataset, unfortunately, is not ideal for many learning systems. Application of data-level solutions, like sampling strategies, provides no conclusive evidence on the superiority of one technique over the other. Effectiveness of these techniques primarily depends on use-case under consideration, and further studies are required to evolve these data-level solutions into a more generic form.
The paper proposes a new framework called Stacking of Samplers for Imbalanced Learning (SS-IL) that aims to incorporate and integrate the prediction capabilities of sampling methods into a single ensemble framework via stacking. The study performs a comparative analysis of classifier performance to establish superiority of SS-IL by applying sampling strategies to an imbalanced churn dataset. The dataset is sourced from a hotel commerce platform company.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Results are encouraging and show that the framework leads to enhanced performance of churn classifiers. The remaining paper is organized as follows. Section 2 presents related literature on sampling techniques and ensemble methods that are applied to imbalanced datasets. Section 3 describes the methodology used in the study. This is followed by section 4, where details on the model evaluation measures are discussed. Section 5 provides an overview on the experimental setup. Results are presented in section 6. Finally, section 7 contains a brief conclusion.

II. RELATED WORK
Improving the performance of classifiers has always been the main objective of churn prediction-related experiments. However, most churn data sets are class imbalanced. Imbalanced datasets are defined as those in which there is a significant difference between distribution of positive class and negative class. As a result, sampling techniques are used to balance the target class while training the model. Sampling strategies help to balance the training data to facilitate seamless integration with machine learning classifiers. Additionally, researchers have also used ensemble techniques to improve predictive capability of models. These techniques are discussed in the following section.

A. SAMPLING TECHNIQUES
There are 2 main types of sampling strategies implemented for balancing training data: over-sampling and under-sampling. Over-sampling methods are a group of techniques for balancing the class distribution by either creating copies of the minority class, or by generating synthetic samples of the minority class. Random Over-Sampling (ROS) is the simplest over-sampling technique to balance skewed datasets. However, ROS often leads to overfitting of classifiers. Synthetic Minority Over-sampling Technique (SMOTE) is another over-sampling strategy that creates synthetic samples of the minority class based on the k nearest neighbor principle [1], [2]. A drawback of SMOTE is that the synthetically generated samples may introduce noise into the balanced dataset. A variation of SMOTE is k-means SMOTE. This algorithm over-samples the minority instances in clusters that have the highest number of minority samples. K-means SMOTE addresses the problem of noise in a balanced dataset. In another algorithm, called Adaptive Synthetic Sampling (ADASYN), areas with a low density of minority samples are given higher priority and synthetic minority class instances are generated in these areas [3]. A detailed study conducted in [4] presents a comparison of 85 versions of over-sampling strategies using 104 imbalanced datasets for evaluation. The findings reveal that the type of evaluation metric utilized has an impact on a model's performance. Additionally, the study helps to establish a baseline for defining key principles that lead to the best performance of a model.
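The interpolation idea behind SMOTE can be illustrated with a minimal pure-Python sketch. It is not the reference implementation from [1], [2]; the function name `smote_sketch`, the toy points, and the fixed seed are assumptions made only for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (illustrative)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two features
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, it stays inside the region spanned by the minority class; this is also why noisy minority samples can propagate into the balanced set.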
Under-sampling methods are another set of techniques that aim to balance the class distribution by reducing the number of samples from the majority class. Random Under-Sampling (RUS) is the simplest method in this group; it randomly selects and removes instances from the majority class. The main disadvantage of RUS is loss of information due to random instance selection. Additionally, there are heuristic methods of under-sampling that are based on sample significance and information content. Near Miss under-sampling [5] is one of the heuristic sampling strategies; it selects samples of the majority class on the basis of their average Euclidean distance from samples of the minority class. Another algorithm, called Condensed Nearest Neighbor (CNN), seeks to create a subset of the main dataset that preserves maximum information of the data [6], [7]. Tomek link, on the other hand, is a modified version of CNN. It identifies pairs of samples belonging to opposite classes that have the minimum Euclidean distance in the defined feature space. These pairs are called Tomek-links [8], [9]. The majority class samples are then removed from the identified Tomek-links. One-sided selection (OSS) [10] and Neighborhood Cleaning Rule (NCL) [11] are a few other popular under-sampling methods. The main drawback of these heuristic techniques is the lack of control over the number of majority instances to retain or delete in the dataset. In cases where the majority class is not diversely represented in feature space, application of heuristic under-sampling methods becomes a challenge.
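As a rough illustration of the Tomek-link rule, the sketch below treats a pair of opposite-class samples as a link when each is the other's nearest neighbour, then drops the majority member. The helper names and the toy data are hypothetical, and the brute-force search is chosen for readability, not efficiency.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, of opposite-class samples that
    are mutual nearest neighbours under Euclidean distance."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if y[i] != y[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links


def remove_majority_in_links(X, y, majority_label=0):
    """Drop only the majority-class member of every Tomek link."""
    drop = {idx for pair in tomek_links(X, y) for idx in pair
            if y[idx] == majority_label}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: label 1 is the minority (churner) class
X_toy = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
y_toy = [0, 0, 0, 1, 0]
links = tomek_links(X_toy, y_toy)
X_clean, y_clean = remove_majority_in_links(X_toy, y_toy)
```

In this toy set only samples 2 and 3 form a link, so a single majority sample is removed. The number of removed instances is dictated by the data rather than by the user, which is the lack of control noted above.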
Sampling strategies have been widely applied for churn prediction. In a study, researchers proposed an improved version of SMOTE to address the problem of class imbalance in a telecom dataset [12]. They utilized a multi-objective rain optimization algorithm to determine the best sampling rate for SMOTE. On the other hand, an under-sampling balancing technique was adopted in [13] where majority class is undersampled in each iteration of cross validation. Each iteration used different independent groups of under-sampled majority class.
Although application of sampling technique is common in churn prediction, their effectiveness mostly depends on the use-case under consideration. There is no conclusive evidence on the superiority of one technique over the other. As a result, there is a need to evolve these data-level solutions into a more integrated and generic form.

B. ENSEMBLE METHODS
Ensemble methods are very popular for enhancing a classifier's prediction capability. These methods are based on the paradigm of combining multiple classifiers for optimizing quality of prediction [14]. The framework utilizes the strength of each classifier in the ensemble, and minimizes their weaknesses, for better classification results. A general ensemble framework is presented in Fig. 1 (a). An ensemble consists of several base learners (BL) in the first level i.e., level 0. The base learner decisions are further aggregated using aggregation rules like majority voting, followed by classification decision. Bagging and boosting are common examples of ensemble techniques. There are numerous studies that provide evidence of improved performance in classification task by using ensemble framework, as compared to standalone models [15], [16].
Stacking is an advanced and more refined version of the ensemble framework, where the output of base learners is combined and used as features for aggregation and classification. It incorporates a multilevel hierarchical approach where the number of levels, the number of models, and the selected set of base learners (BL) can vary and are design-dependent [17]. Fig. 1 (b) is an example of a stacking framework that comprises n base learners in level 0. The base learner decisions are further aggregated at level 1 and utilized as attributes by another learner, called the meta classifier.
Ensemble methods and stacking have been deployed in few churn-related studies. Results are indicative of enhanced predictive capability of the framework [18]. However, in order to optimize the performance of an ensemble or stacking framework, it is important to focus on maximizing the ensemble diversity. This is a fundamental problem of ensemble methods. While ensemble diversity is not easy to quantify, information gain serves as an alternative that can be utilized to measure ensemble effectiveness. This study is an attempt to promote information gain in stacking by increasing span of attributes, using sampling strategies.

III. METHODOLOGY

A. DATASET
The dataset is obtained from the world's largest open hotel commerce platform company and is private in nature. It consists of 78 independent variables and 1 dependent variable that represents a binary target class. The attributes in this dataset comprise a customer's demographic and product usage variables. Features that represent the geographic location of a customer are one-hot encoded. Values present in numeric attributes are discretized to tackle outliers. There are 19,542 customer instances, with the minority class representing 14% of the dataset. The study is the first to utilize this dataset for churn analysis.
A time window technique is followed for deriving the target variable. Customer characteristics are tracked and used as features for a 9-month period in year 2021. In case a customer is active throughout this period, the instance is classified as a non-churner, or belongs to the negative class. If the customer discontinues service contract in this period, the instance is classified as a churner i.e., it belongs to the positive class. Independent variables related to product usage are derived based on the month of discontinuation, in case of churners; or the last day of the 9-month window, in case of non-churners.

B. PROPOSED FRAMEWORK SS-IL
A sampling-based stack framework called SS-IL is proposed for churn prediction. The framework exploits the ensemble paradigm for improving classifier performance. As discussed before, ensemble learning primarily fuses the decisions of several base classifiers to finally classify instances. Stacking is a special case of ensemble learning in which there are several base learners, or level 0 learners. The base learners are trained using the same training set [19]. However, in the proposed framework, varied training sets are used for level-0 classifiers, with an aim to increase the span of attributes and promote information gain in the ensemble through sampling. Stack framework is further characterized by the presence of an additional meta learner that is trained using the predictions of the level 0 learners. The meta learner learns the combination weights for all base level decision probabilities, and classifies instances. For a stack ensemble to perform well, it is important to promote information gain of features used for training the meta learner through level 0 base learners [19], [20]. The proposed framework is motivated by this rationale.

1) BASE LEARNERS
There are 6 models that are selected in this framework as base learners, namely, Random Forest (RF), k nearest neighbor (KNN), AdaBoost, Support Vector Machine (SVM-rbf kernel), Decision Tree (DT), and Logistic Regression (LR). The justification behind selection of these models as base learners is mainly to incorporate maximum diversity in the ensemble keeping in mind the varied nature of the algorithms. Each model is briefly described in the following section.

a: RANDOM FOREST (RF)
RF is an ensemble-centric classifier that fuses the decisions of individual decision trees [21]. Classification is based on the following steps:
Step 1: The algorithm uses bootstrap sampling to generate n training subsets.
Step 2: An out-of-bag (OOB) dataset is generated for each training subset.
Step 3: For each training subset T_i, decision trees are formed using splits of a smaller set of independent variables. The process of splitting is repeated until a leaf node is reached. This leads to n decision trees that constitute the random forest.
Step 4: For a new unseen sample X, each decision tree casts a vote predicting its class. Finally, RF assigns to the unseen instance the class that received the maximum votes from the decision trees.

b: K NEAREST NEIGHBOR (KNN)
KNN is a distance-based supervised learning algorithm. It assigns a class to an unseen instance X by calculating the Euclidean distance between X and the training samples to find its k nearest training neighbors [22]. The formula for calculating the Euclidean distance between two vectors P = (x_1, x_2, ..., x_m) and Q = (y_1, y_2, ..., y_m) is given by (1):
\[ d(P, Q) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \tag{1} \]
Here, m denotes the number of features that represent the sample and i = {1, 2, ..., m}. X is assigned the majority class of its k nearest neighbors.
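The classification rule described above can be sketched in a few lines of Python. The function name `knn_predict` and the toy training set are assumptions for illustration; ties between classes are resolved by whichever class is encountered first, which is a simplification.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Assign x the majority class among its k nearest training samples,
    using Euclidean distance."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set: label 1 = churner, 0 = non-churner
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0)]
train_y = [0, 0, 0, 1, 1]
near_churners = knn_predict(train_X, train_y, (4.5, 5.0), k=3)
near_others = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
```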

c: ADAPTIVE BOOSTING (AdaBoost)
AdaBoost is an ensemble learner that uses the boosting principle to classify instances. The model continuously learns from mis-classified instances by assigning them more weight during training iterations to improve predictive power [23]. The final class for an unseen instance X is calculated as a weighted majority vote of the base classifiers, as in (2):
\[ C(X) = \arg\max_{c} \sum_{t=1}^{T} \alpha_t \, \mathbb{1}\left[c_t(X) = c\right] \tag{2} \]
where t = {1, 2, ..., T} denotes the identifier of a base classifier of the algorithm, and α = {α_1, α_2, ..., α_T} represents the weights of the base classifiers, satisfying the condition \sum_{t=1}^{T} \alpha_t = 1. The decision of the t-th base classifier for instance X is represented by c_t(X).
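The weighted majority vote described above can be sketched as follows. This shows only the final aggregation step, not the boosting rounds that learn the weights; the function name `weighted_vote` and the example weights are hypothetical.

```python
def weighted_vote(decisions, alphas):
    """Return the class with the largest total weight.

    decisions: class labels c_t(X) from the T base classifiers
    alphas:    their weights alpha_t, assumed to sum to 1
    """
    totals = {}
    for label, alpha in zip(decisions, alphas):
        totals[label] = totals.get(label, 0.0) + alpha
    return max(totals, key=totals.get)


# One strong classifier can outvote two weaker ones
strong_wins = weighted_vote([1, 0, 0], [0.6, 0.2, 0.2])
# With equal weights the rule reduces to a plain majority vote
plain_majority = weighted_vote([1, 0, 0], [1 / 3, 1 / 3, 1 / 3])
```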

d: SUPPORT VECTOR MACHINE (SVM)
SVM is a learning algorithm that classifies data points using a hyperplane defined by a set of support vectors forming the best possible decision boundary separating classes with maximal margin [24]. The hyperplane is represented in (3):
\[ w \cdot x + b = 0 \tag{3} \]
where w is a vector that is orthogonal to the hyperplane, and x is the feature vector of a sample defined in n-dimensional space. In (3), b is a constant that minimizes the error on the training set.

e: DECISION TREE (DT)
DT is a tree-based classifier that is composed of nodes, branches and leaves [25]. Nodes represent the variables, branches are the rules applied on variables, and leaves represent the classification outcome. The algorithm is guided by optimization of either Gini impurity or entropy for each split in the growing phase of the tree. Gini impurity measures the amount of randomness in samples. It is calculated as
\[ Gini = 1 - \sum_{i=1}^{k} p_i^2 \tag{4} \]
Entropy is the information required to describe an entity and is defined as:
\[ Entropy = -\sum_{i=1}^{k} p_i \log_2 p_i \tag{5} \]
where i indexes the k classes, and p_i represents the probability of the i-th class. An unseen sample X = (x_1, x_2, ..., x_n) is assigned the class indicated at the leaf node of the decision tree that is reached by following the decision rules satisfied by the features of the sample.
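Both split criteria can be computed directly from the class labels reaching a node. The sketch below uses the standard definitions, Gini = 1 - Σ p_i² and entropy = -Σ p_i log2 p_i; the function names are illustrative and not taken from the study's code.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class among the given labels."""
    counts = Counter(labels)
    return [c / len(labels) for c in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


balanced_node = [0, 0, 1, 1]   # maximally mixed binary node
pure_node = [1, 1, 1]          # node containing a single class
```

A perfectly mixed binary node gives the maximum values (Gini 0.5, entropy 1 bit), while a pure node gives 0 for both, which is why a split that produces purer children is preferred.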

f: LOGISTIC REGRESSION (LR)
LR uses a statistical approach for classification based on regression equation derived from learning the training data.

2) META LEARNER
RF is used as a meta learner in SS-IL framework. The meta learner takes predictions of base learners as input, and utilizes it for classification task.

3) SS-IL FRAMEWORK
The data set is split into a training set and test set following 90% -10% distribution. Z-score transformation rule is applied to the numerical features and merged with the one-hot encoded geographic location-related demographic variables to form the complete training dataset. Thereafter, numerical variables in the test set are transformed using transformation parameters from the training set. The training set is used as an input to 3 systems, namely, an under-sampling cluster, an over-sampling cluster, and lastly, a sampling-free cluster. Each cluster contains 6 base learners.
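The point that the test set is scaled with parameters learned on the training set can be sketched as below. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is used.

```python
import statistics


def fit_zscore(column):
    """Learn the mean and (population) standard deviation on training data.
    Assumes the column is not constant, so the deviation is non-zero."""
    return statistics.fmean(column), statistics.pstdev(column)


def apply_zscore(column, mean, std):
    """Transform any column with previously learned parameters."""
    return [(value - mean) / std for value in column]


# Hypothetical numeric feature, split into training and test parts
train_col = [2.0, 4.0, 6.0]
test_col = [4.0, 8.0]

mean, std = fit_zscore(train_col)          # parameters come from training only
train_z = apply_zscore(train_col, mean, std)
test_z = apply_zscore(test_col, mean, std)  # no test statistics are used
```

Reusing the training parameters keeps information from the test set out of the preprocessing step.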
The under-sampling cluster uses Tomek link algorithm with sampling strategy set as ''not minority'' for removing the majority class instances. The under-sampled training set is thereafter used as an input to the 6 base learners in the first cluster. Similarly, the over-sampling cluster, over-samples the training set using the SMOTE algorithm with sampling strategy set as ''not majority'' with k = 5. The new over-sampled training set is further used as input to 6 base learners in the second cluster. The final cluster uses the original training set with imbalanced class distribution. The transformation of original training set into 2 modified training sets, and 1 original training set, is in stark contrast with the normal practice of employing a single training set as input to all base learners in stack frameworks.
The test set is classified by the 6 trained base learners in each of the 3 clusters. Hence, there are 18 classification decisions received for the test set. The out-of-sample predictions from level 0 learners are saved for each split of the 10-fold cross-validation process. After completion of 1 cycle of cross-validation for the full data set, a consolidated meta data set comprising 18 features obtained from out-of-sample level-0 predictions is created. This set is denoted by P = {p_{1(1)}, p_{1(2)}, ..., p_{m(18)}}, where p_{i(j)} denotes the classification decision for the i-th instance and the j-th base learner; m denotes the total number of instances in the dataset. The set P is further used to train the meta classifier. Performance is evaluated via a second 10-fold cross validation process. Fig. 2 is a graphical representation of this framework.
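The assembly of the meta data set can be sketched as a nested loop over folds, sampling clusters, and base learners. This is a simplified reading of the description above, not the authors' code: the toy samplers and learners stand in for SMOTE, Tomek link, and the six classifiers, and all function names are hypothetical.

```python
def kfold_indices(n, k):
    """Interleaved folds: fold f holds indices f, f + k, f + 2k, ..."""
    return [list(range(f, n, k)) for f in range(k)]


def build_meta_features(X, y, samplers, learners, k=10):
    """Return one row per instance and one column per (sampler, learner)
    pair, filled with out-of-fold predictions."""
    n = len(X)
    meta = [[None] * (len(samplers) * len(learners)) for _ in range(n)]
    for test_idx in kfold_indices(n, k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(n) if i not in held_out]
        train_y = [y[i] for i in range(n) if i not in held_out]
        col = 0
        for sample in samplers:          # one sampling cluster at a time
            s_X, s_y = sample(train_X, train_y)  # resample training part only
            for fit in learners:         # each base learner in the cluster
                predict = fit(s_X, s_y)  # fit returns a predict(x) callable
                for i in test_idx:
                    meta[i][col] = predict(X[i])
                col += 1
    return meta


# Toy stand-ins (assumptions): identity = sampling-free cluster,
# duplicate_minority = crude over-sampler used here instead of SMOTE.
def identity(X, y):
    return X, y


def duplicate_minority(X, y):
    extra = [(x, t) for x, t in zip(X, y) if t == 1]
    return X + [x for x, _ in extra], y + [t for _, t in extra]


def prior_learner(X, y):
    rate = sum(y) / len(y)               # predicts the training churn rate
    return lambda x: rate


def threshold_learner(X, y):
    positives = [x[0] for x, t in zip(X, y) if t == 1]
    cut = min(positives) if positives else float("inf")
    return lambda x: 1.0 if x[0] >= cut else 0.0


X_demo = [(float(i),) for i in range(20)]
y_demo = [1 if i >= 16 else 0 for i in range(20)]
meta = build_meta_features(X_demo, y_demo,
                           [identity, duplicate_minority],
                           [prior_learner, threshold_learner], k=5)
```

With 3 sampling clusters and 6 base learners, the same loop would yield the 18 meta features described above. Resampling is applied only to the training part of each fold, so the held-out predictions remain out-of-sample.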

IV. EVALUATION MEASURE
The study uses 5 parameters to evaluate the performance of the proposed framework.

A. PRECISION
Precision is defined as the proportion of predicted churners that are classified correctly. The rationale behind selecting this metric is related to the use case of this study. The hotel technology provider company has limited resources to focus on nurturing possible churners. Hence, optimization of precision is critical. It is calculated using (7):
\[ Precision = \frac{TP}{TP + FP} \tag{7} \]
where True Positive (TP) is the total number of churners that are identified correctly by the classifier. False Positive (FP) is the number of non-churners that are incorrectly classified as churners.
B. F1-SCORE
F1-score is the harmonic mean of precision and recall. Recall is defined as the proportion of true churners that are classified correctly. In the case of an imbalanced dataset, F1-score helps to rule out presence of model bias. F1-score and recall can be derived using (8):
\[ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}, \qquad Recall = \frac{TP}{TP + FN} \tag{8} \]
where False Negative (FN) is the number of churners that are incorrectly classified as non-churners.

C. AUC
The area under the receiver operating characteristic curve (AUC) is a metric that evaluates a classifier independent of any decision threshold. AUC can take values between 0 and 1; a value close to 1 indicates a good classifier. Hence, the metric is selected for an unbiased and threshold-independent evaluation of the framework.
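The confusion-matrix-based measures in this section can be computed as in the sketch below, which uses the standard formulas precision = TP/(TP+FP), recall = TP/(TP+FN) and F1 = 2PR/(P+R). The function name and the toy predictions are illustrative; returning 0 when a denominator is empty is a convention assumed here.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive (churner) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# A cautious classifier: it flags only one customer, and that one is right
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

The toy case gives perfect precision but a low F1-score, which is the pattern of a classifier biased towards the majority class.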

D. TOP DECILE LIFT
Top decile lift (TDL) is a meaningful evaluation parameter, particularly in the case of churn. TDL represents the number of times a classifier is better at predicting churn, compared to a random guess. The measure is particularly useful in cases where the focus is on the top 10% of the predicted churners. Considering the "limited resources" constraint for our existing use-case, TDL is appropriate as it enables an organization to effectively utilize scarce resources that are deployed to nurture the top 10% of the predicted churners [15]. Mathematically, it is derived using the formula below:
\[ TDL = \frac{Perc\_P_{dec}}{Perc\_P} \tag{10} \]
where Perc_P_dec represents the percentage of true churners in the top decile of predicted churners, and Perc_P denotes the percentage of true churners in the dataset.

E. INFORMATION GAIN IN ENSEMBLE
Information Gain has its roots in information theory and is closely related to the notion of entropy. Entropy is defined as the extent of randomness in a dataset. Let D denote a dataset with n samples, D = {(x_1, c_1), (x_2, c_2), ..., (x_n, c_n)}. Here x_i represents the feature vector of the i-th sample, and c_i is the corresponding class, c_i ∈ {0, 1}, for binary classification. For a randomly selected instance, the probability that it belongs to a class c_i is given by p(c_i) = ||c_i||/n, where ||c_i|| is the number of samples in D with class c_i [26]. The entropy of dataset D with respect to class variable c_i is given by (11):
\[ H(D) = -\sum_{i=1}^{k} p(c_i) \log_2 p(c_i) \tag{11} \]
where k = 2 for binary classification. The entropy of dataset D with respect to class variable c_i, given a discrete variable f, is as follows:
\[ H(D \mid f) = \sum_{v \in f} \frac{\|f_v\|}{n} \, H(D \mid f = v) \tag{12} \]
where ||f_v|| denotes the frequency of value v in feature f. Information gain (IG) of variable f can then be numerically derived as the reduction in entropy:
\[ IG(D, f) = H(D) - H(D \mid f) \tag{13} \]
This study utilizes the IG criterion of meta features for evaluating commonly used stack ensemble approaches. Since there are multiple variables in the training set of a meta learner in a stack, the overall information gain for all features of the training set is derived as follows. The probabilistic output of each level 0 base learner is transformed into a binary output using a threshold of 0.5. Thereafter, a new feature β is derived by concatenating the binary output of the level 0 classifiers for each sample. Information Gain for the training dataset is then calculated with respect to this new feature β.
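The procedure for the combined feature β can be sketched as follows. The function names are hypothetical, and treating a probability of exactly 0.5 as a positive decision is an assumption, since the text gives only the threshold value.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Entropy in bits of the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in class entropy after grouping samples by feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature, labels):
        groups[value].append(label)
    conditional = sum(len(g) / len(labels) * entropy(g)
                      for g in groups.values())
    return entropy(labels) - conditional


def concatenate_meta_outputs(prob_rows, threshold=0.5):
    """Build beta: one string per sample, e.g. '10' for two base learners."""
    return ["".join("1" if p >= threshold else "0" for p in row)
            for row in prob_rows]


# Hypothetical level-0 probabilities from two base learners for four samples
prob_rows = [[0.9, 0.8], [0.7, 0.6], [0.2, 0.1], [0.3, 0.4]]
labels = [1, 1, 0, 0]
beta = concatenate_meta_outputs(prob_rows)
ig = information_gain(beta, labels)
```

In this toy case β separates the two classes perfectly, so the information gain equals the full class entropy of 1 bit; a β that is constant across samples would give an information gain of 0.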

V. EXPERIMENTAL SETUP
In order to assess the lift in performance achieved by SS-IL, a set of 4 independent experiments is conducted. The algorithms and framework devised in this study are coded in Python using Spyder, which is a free integrated development environment available in Anaconda. A Windows 11 machine with an Intel Core i7-10510U 2.3 GHz processor and 16 GB RAM is used for the experiments. Cross validation is a robust technique for estimating model performance and is used in all experiments.

A. EXPERIMENT 1
The first experiment evaluates the base learners on the original, imbalanced training data. These classifiers are denoted by BL_ns. No sampling is applied to train these classifiers. Performance is recorded following a 10-fold cross validation process. The pseudo-code of this experiment is presented in Fig. 3.
The experiment uses dataset D with n samples. A 10-fold cross validation process is followed to create the train-test splits. For each split, classifiers BL_ns are trained using algorithm A and training set T, and predictions are generated on test set D_i. Precision, F1-score, AUC and TDL are recorded for each train-test split, and for each BL_ns. The results of this experiment are discussed in the next section.

B. EXPERIMENT 2
The second set of experiments aims to analyze the impact of 2 sampling strategies on the performance of the base learners. The pseudo-code of this experiment is presented in Fig. 4. The experiment uses dataset D with n samples. A 10-fold cross validation process is followed to split the dataset into 10 subsets. For each split, the training data T is first over-sampled to train base learners BL_os using algorithm A. Similarly, training data T is under-sampled to train base learners BL_us. The predictions on test set D_i are recorded for BL_os and BL_us. Precision, F1-score, AUC and TDL are recorded for each train-test split.
Experiments 1 and 2 are aimed to understand the impact of over-sampling and under-sampling on performance of base learners. The results are plotted in Fig. 5.
As can be seen in Fig. 5, application of over-sampling or under-sampling leads to a reduction in precision for almost all base learners. This implies that BL_ns classifiers capture the largest share of true churners among those that are predicted as churners. However, at the same time, these models record a low F1-score. This is indicative of a low recall value, which implies that these classifiers predict the majority of the instances as non-churners. Hence, even though precision is high for BL_ns, the total number of true churners that can be identified using these models is debatable. AUC, on the other hand, either improves slightly or remains the same for most models when trained with sampled data. It is observed that TDL improved in 50% of the base learners when trained with under-sampled data. Over-sampling is not effective in this case. However, overall, BL_ns (RF) records the highest TDL of 4.5. This means that out of the top 10% of customers that are predicted as churners by this model, 63% are correctly classified. Additionally, RF emerges as the best classifier in the second experiment.

C. EXPERIMENT 3
The third set of experiments aims to understand the effect of implementing a stacked framework on model performance. Here, BL_os and BL_us models are arranged in 2 independent and separate stacked ensembles, denoted by "Stacked OSBL" and "Stacked USBL", respectively. An RF model is used as the meta learner in both stack ensembles. This experiment follows the steps of a general stack framework as depicted in Fig. 6.
A stacking process begins with a dataset D consisting of n samples. It is used to train a given set of base learners BL. Let z_ij denote the classification decision of the j-th BL on a test set for the i-th sample. A new dataset D_meta is created using test set predictions for each split of a k-fold cross validation. Finally, meta learner ML is trained using D_meta. The classification decision of ML is denoted by p_meta, which is further used to measure precision, F1-score, AUC and TDL.
In experiment 3, there are 2 stack frameworks, called "Stacked OSBL" and "Stacked USBL", that are implemented following the guidelines described above. Performance of meta learner ML is recorded for evaluation. A 10-fold cross validation process is used to calculate average performance for both experimental settings. Additionally, the training sets used for the meta learners in Stacked OSBL and Stacked USBL are analyzed to derive the corresponding information gain achieved using meta-attributes. The results of this experiment are discussed in the next section.

D. EXPERIMENT 4
The final experiment represents our proposed framework SS-IL. The algorithm consists of base learners BL_ns, BL_os and BL_us, combined together in a single stacked ensemble. An RF model is used as the meta learner. The algorithm is described in Fig. 7.
SS-IL uses dataset D with n samples. The dataset D is split into k subsets. The training sets O(T), U(T) and T are used to train base learners BL_os, BL_us and BL_ns, respectively. O(T), U(T) and T denote the over-sampled, under-sampled, and non-sampled training sets. Predictions for test set D_i for each train-test split are cumulatively consolidated to produce a new dataset D_meta for training the meta classifier ML. D_meta is characterized by an increased span of attributes that contribute to the information gain in base learners through sampling. The meta learner ML is finally trained and evaluated using k-fold cross validation. In order to compute the information gain achieved by the increased span of attributes in SS-IL, the training set of ML is analyzed. Information Gain calculation for the training set of this meta learner follows the same process as described in Experiment 3.

VI. RESULTS AND DISCUSSION
To summarize, the study aims to investigate improvement in performance through sampling in a stacking framework. Hence, a series of 4 experiments is conducted in a phased manner. Experiment 1 investigates the performance when no sampling is used in the training phase. Experiment 2 analyzes the impact of introducing 2 sampling strategies, SMOTE and Tomek link, in the training data. Experiment 3 integrates sampling strategies with stacking to study performance variation, using Stacked OSBL and Stacked USBL. Experiment 4 finally incorporates sampling strategies to increase the span of attributes for the meta classifier in a single stack. This framework is called SS-IL. The experimental results are consolidated and graphically presented in Fig. 8 and the following conclusions are drawn.
Results show that the sampling strategies in isolation, or in combination with stack framework, have no significant impact on AUC. This is in agreement with claims of [27] that AUC is not sensitive to class distribution. In the experiments, SS-IL achieves the highest AUC of 86.4% that is indicative of an improved TPR, independent of class distribution.
Another observation is related to precision. RF as a classifier achieves the highest precision of 0.72 with no sampling applied to training data. This is a case of a biased classifier that predicts majority of instances as non-churners, leading to high precision at the cost of low recall. The claim is further supported by a low F1-score of 0.39 for the same classifier with no sampling applied for training. While application of sampling techniques leads to a reduction in precision of RF for the use case under study, F1-score has the opposite effect. This indicates reduction in bias of a classifier towards majority class. Sampling when combined with stacking, has a positive impact on both precision and F1-score. Proposed framework, SS-IL, further records an improvement in precision and F1-score at 0.69 and 0.49 respectively, and emerges as second-best classifier. This can be attributed to SS-IL's ability to reduce model bias by expanding and diversifying the size of the training set.
Results for TDL show a performance of 4.49 for RF when no sampling is applied during the training phase. Application of sampling techniques independently, or combined with stacking, does not show a positive impact on performance. However, SS-IL achieves the best TDL of 4.7 in the experiment. This essentially means that out of the top 10% of customers that are predicted as churners by SS-IL, 65% are identified correctly. As discussed before, for our use case, TDL is an important measure of performance evaluation due to the "limited resources" criterion that is previously described in section 4.
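As a quick consistency check, the reported lift can be converted back into the share of true churners in the top decile. This assumes that the 14% minority share reported for the full dataset also applies to the evaluated data.

```latex
\mathrm{Perc\_P_{dec}} = \mathrm{TDL} \times \mathrm{Perc\_P}
  = 4.7 \times 14\% \approx 65.8\%
\qquad\text{and}\qquad
4.49 \times 14\% \approx 62.9\%
```

Both values agree, up to rounding, with the roughly 65% reported for SS-IL and the 63% reported for RF without sampling.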
Information Gain for meta features in SS-IL is the highest, compared to Stacked OSBL and Stacked USBL approaches. The expanded set of meta variables employed in the training of meta learner leads to maximum reduction in the randomness of the dataset with respect to the target variable. The observation further substantiates the relevance of the attributes used for training the meta learner in SS-IL.
The competence of SS-IL is supported by a rank chart in Fig. 9. The proposed framework emerges as the best classifier in 3 out of the 5 evaluation parameters, proving its effectiveness in imbalanced learning for the use case under study. As the framework attempts to utilize the prediction capabilities of over-sampled, under-sampled and standalone models, it cancels out overfitting through the increased span of attributes in the ensemble aggregation process and emerges as a balanced classifier with maximum IG.
A. COMPUTATIONAL TIME
The computational training time taken for each experiment is recorded and presented in Fig. 10. As can be seen, SS-IL is not computationally efficient, recording the longest time for training base learners. There are 2 main reasons behind this. Firstly, the framework, in its current state, is sequential in nature. Due to limited computational resources for the conducted experiments, the base learners in the 3 clusters are trained sequentially. Secondly, SS-IL includes a computationally expensive base learner, SVM. Hence, although the time taken by SS-IL is of the order of a few minutes, the chosen base learners in the framework play a significant role in the computational time spent in the training phase. Parallel and distributed processing of the 3 clusters in SS-IL can make it computationally efficient.

VII. CONCLUSION
Imbalanced learning is a key challenge in churn prediction. Datasets related to churn are often skewed, and over-sampling and under-sampling methods are applied to balance the training data for effective imbalanced learning. Ensemble methods have shown promising results in many studies, proving their effectiveness in improving classifier performance. The new framework SS-IL aims to stimulate diversity with sampling-based base models, and cancel out overfitting by increasing the span of attributes used to train a meta classifier in a stack ensemble. A four-part series of experiments is conducted on a churn dataset obtained from a leading hotel commerce platform company. A set of 5 evaluation measures is used for assessing performance, namely, precision, F1-score, AUC, TDL, and information gain, the last of which is computed for the training sets used to train the meta learner. Results show that the framework positively contributes to the information gain of the training set used for the meta learner of the stack. SS-IL, although computationally expensive, is effective in predicting churn, achieving the highest AUC and TDL of 86.4% and 4.7, respectively.
The study has a few limitations. Firstly, the framework utilizes Tomek link and SMOTE as sampling strategies. Investigating the impact of other sampling methods on the performance of SS-IL is an interesting direction for future studies. Secondly, the framework has six models that are trained sequentially in each of the 3 clusters. This leads to relatively high computational time spent in training. Incorporating efficient base learners in the framework, and using parallel and distributed processing in the training phase, are a few ways to address this limitation in future. Thirdly, the study validates the framework for a specific use case pertaining to a hotel commerce platform company. Future investigations that use imbalanced datasets from other domains will help to establish the effectiveness of the framework.