ChurnNet: Deep Learning Enhanced Customer Churn Prediction in Telecommunication Industry

In the Telecommunication Industry (TCI), customer churn is a significant issue because the revenue of a service provider is highly dependent on the retention of existing customers. In this competitive market, it is essential for service providers to identify the concerns of their existing customers, since customers cancelling their services and switching to new providers brings no benefit to the original provider. In the context of TCI, numerous studies have attempted to predict customer churn; however, their performance evaluations show that there is still considerable room for improvement. Therefore, in this study, we propose a novel customer churn prediction architecture, namely ChurnNet, to predict customer churn in TCI. In the proposed ChurnNet, a 1D convolution layer is integrated with a residual block, a squeeze and excitation block, and a spatial attention module to improve the performance. The residual block aids in solving the vanishing gradient problem, while the squeeze and excitation block and the spatial attention module enable ChurnNet to understand the interdependencies between and within the channels respectively. To evaluate the performance, experiments are performed on 3 publicly available datasets. As the datasets have significant class imbalance issues, three data balancing techniques, SMOTE, SMOTEEN, and SMOTETomek, are applied. Using 10-fold cross-validation and after rigorous experimentation, it was found that ChurnNet performed better than the state of the art, obtaining 95.59%, 96.94%, and 97.52% accuracy on the 3 benchmark datasets respectively.


I. INTRODUCTION
The TCI has seen remarkable expansion and progress over the past few decades in terms of the level of competition, the number of operators, services, and so on. However, TCI faces significant customer turnover difficulties, which are considered a severe problem in this context due to strong competition, congested markets, a changing environment, and alluring, profitable offers [1]. It is very convenient for customers to switch both services and service providers because the market is saturated. There is also considerable variation and competitiveness in the services of the service providers. Customers who intend to stop taking the service of a company, or intend to take services from other companies, are considered churned customers from the perspective of the first service provider.
The associate editor coordinating the review of this manuscript and approving it for publication was Alba Amato.
In TCI, customer turnover severely decreases a company's profit and income [2]. Besides, obtaining new customers is far more costly than retaining existing ones [3]. Therefore, retaining customers for a longer period is very important because it helps telecommunication service provider companies generate more revenue.
Customer churn is an unavoidable consequence when a customer is unsatisfied with the company's services for a long period of time. Unsubscription by the customer does not occur overnight; rather, it is a deliberate act resulting from the accumulation of long-term disappointment. Therefore, it is vital for service providers to identify and overcome their limitations regarding customer service and satisfaction in order to retain dissatisfied customers. Besides, it is necessary for service providers to point out which customers are likely to churn and which customers are willing to stay with the provided services (non-churn), because this helps the telephone company (TELCO) revise its strategy to retain potential churners by reducing dissatisfaction. Therefore, customer churn prediction (CCP) is crucial for TELCOs because it is closely connected with brand value and revenue. For this purpose, telephone companies usually maintain standing reports regarding clients and services to track the continuation of their subscriptions. For a TELCO, the objective is to retain clients for the long term, which makes CCP so important in TCI [4].
The CCP problem has been studied in several research works, and many machine learning-based algorithms have been proposed for customer churn prediction [5], [6], [7], [8]. However, previous studies have struggled to achieve satisfactory outcomes in churn prediction due to the complex and diversified nature of customer churn behaviour. Traditional ML algorithm-based churn prediction models struggle to deal with complex non-linear relationships within the data [9]. Further, dealing with high-dimensional data requires manual feature engineering, which is labor-intensive and time-consuming [10], and sometimes prone to human biases. Furthermore, traditional ML methods such as SVM, Random Forest, and Decision Tree do not have any built-in mechanism for assigning attention to specific features to extract key information from churn datasets [11]. Also, in high-dimensional datasets it is essential to extract local information for better performance, which traditional ML algorithms lack. Along with machine learning approaches, deep learning methods have also been utilized in customer churn prediction in previous studies [12]. However, the deep learning-based models proposed in prior works are unable to provide explicit feature importance in churn prediction due to a lack of attention mechanisms. To improve feature extraction from the data, the attention mechanism helps capture the inter-spatial and inter-channel relationships of the features that vanilla CNNs and ANNs lack. Considering this, we design a unique deep learning architecture (ChurnNet) that can assign attention based on specific feature importance. Besides, it performs feature extraction automatically and is capable of finding complex non-linear relationships within features. For example, an ANN [13] struggled to perform satisfactorily on the IBM Telco dataset, whereas our proposed model performed well because it identifies the important features of the data. The main contributions of the study are:
• We introduce a novel attention mechanism-based customer churn prediction model, namely ChurnNet, to predict customer churn in the telecommunication industry. In the proposed ChurnNet, a One-Dimensional (1D) convolution layer is integrated with Residual blocks, Squeeze and Excitation blocks, and Spatial attention modules to enhance churn prediction performance.
• The datasets used in the research are highly imbalanced, so 3 data balancing techniques, the Synthetic Minority Over-sampling Technique (SMOTE), SMOTE with Edited Nearest Neighbors (SMOTEEN), and SMOTE-Tomek Links (SMOTETomek), are used. After empirical analysis, it was found that SMOTEEN provided better performance. Also, a 10-fold cross-validation technique is used to make ChurnNet more generalized.
• Along with hyperparameter tuning, extensive experiments are performed on 3 publicly available datasets: the IBM Telco Dataset, the Churn-in-Telecom Dataset, and the Churn-data-UCI Dataset. In the performance evaluation, ChurnNet obtained 95.59%, 96.94%, and 97.52% accuracy on the 3 datasets respectively.
The organization of the paper is as follows. Section II presents the literature review, which is followed by the methodology in Section III. Section IV presents the architecture of the proposed ChurnNet. Section V discusses the results, and Section VI gives the conclusion.

II. LITERATURE REVIEW
Customer churn has been addressed with many approaches, including machine learning, deep learning, and data mining, along with hybrid methodologies. Several machine learning classifiers such as Logistic Regression (LR) [14], Random Forest (RF), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM) [15], Naive Bayes (NB) [16], and Bayesian classifiers have been used commonly for the CCP problem. Besides, ensemble classifiers [17] such as CatBoost, XGBoost, AdaBoost [18], and Gradient Boosting (GBM) have also been proposed to address the problem. Deep learning strategies have also been applied to Customer Relationship Management (CRM) data in TCI to predict customer churn; for prediction tasks, CRM data has been used frequently [19]. A comprehensive review of prior work is beyond the scope of this paper, but a brief, relevant overview is given below.
Sana et al. [7] proposed 8 types of classifiers to deal with the CCP problem in TCI. Along with the classifiers, they applied 6 data transformation methods, namely Logarithmic transformation, Z-score, Discretization, Box-Cox, Weight-of-Evidence (WOE), and Rank, to observe the performance of the proposed classifiers. After evaluation through cross-validation techniques on 4 publicly available datasets, it was found that the WOE data transformation method had a significant impact on performance.
Using KNN and Pearson correlation, the authors of [20] developed an approach to deal with the serious issues of CCP in TCI. They evaluated the proposed model using only one dataset of 7043 instances. Without using any cross-validation techniques, they compared their proposed model with SVM and RF; their model performed better and achieved 97.78% testing accuracy.
Similarly, on a telco dataset, the study [21] focused on data mining techniques. Using cross-validation, they applied LR, MLP, SVM, DT, and backpropagation algorithms for classification tasks. Besides, they used the DBSCAN and K-means algorithms for clustering tasks, along with the Apriori and FP-Growth algorithms.
The authors of [22] suggested an integrated framework based on customer analytics to manage churn. Their proposed framework is dedicated not only to predicting churn but also to performing customer segmentation. They used 3 datasets, and 10-fold cross-validation was applied for generalization. Besides, they used the SMOTE technique to mitigate data imbalance issues. To predict churn, they applied baseline machine learning classifiers along with MLP. Further, Bayesian LR was introduced for analysis, and the K-means algorithm was used for customer segmentation tasks.
Amin et al. [23] presented a Just-in-Time (JIT) approach to tackle the customer churn problem. They used SVM as a base classifier and also used homogeneous and heterogeneous ensemble methods for the JIT approach. After evaluating the performance using 10-fold cross-validation on two publicly available telecom datasets, it was demonstrated that SVM performed better when applied in the heterogeneous ensemble compared to homogeneous ensembles or as a base classifier.
Similarly, the authors of [6] implemented 4 ensemble-based algorithms, RF, Decision Tree, GBM, and XGBoost, for classifying customer churn. However, this research reported only one evaluation metric, the Area Under the Curve (AUC). Among these algorithms, XGBoost performed best in terms of AUC (93%) when Social Network Analysis (SNA) was integrated into the feature selection, which was an important aspect of this research.
The study [24] used 4 imbalanced public telco datasets to predict customer churn. They used different oversampling techniques such as MWMOTE, SMOTE, TRKNN, MTDF, ICOTE, and ADASYN to deal with the class imbalance issue. The authors used 4 different algorithms, the Genetic algorithm, Learning from Example Module version 2 (LEM2), Rough Set Theory (RST), and covering algorithms, along with cross-validation to evaluate performance.
Wang et al. [25] proposed several machine learning approaches such as classification and regression trees (CART), partial decision trees (PART), and bagged CART for customer churn prediction in TCI. In their research, they used only one dataset, which was imbalanced; to resolve this issue, they used SMOTE. After evaluation through 10-fold cross-validation, it was found that CART had an accuracy of 0.7643, the highest among all the algorithms. As telco churn is cost-sensitive, the authors of [26] also introduced the CART model. After experimentation on real data, it was found that the proposed CART model performed well not only in classification tasks but also helped minimize the total cost of misclassification.
Herdian and Girsang [27] applied a machine learning and deep learning-based hybrid approach to the task of CCP. In the research, they used two public datasets of 7043 and 5000 instances respectively. They used Decision Tree (DT), ANN, and a hybrid model (DT-ANN) for classification purposes. In the result analysis, it was found that DT-ANN outperformed DT and ANN. Notably, they evaluated the performance of the algorithms without any cross-validation techniques and using only a single evaluation metric, AUC. Besides, the paper [28] followed an approach to predict customer churn in a cloud environment, where two layers of ANN were used for feature extraction. They used the Block-Jacobi SVD algorithm to reduce the dimensions of the extracted features. Later, the NB algorithm was used for classification tasks.
The study [29] applied a CNN to CCP. Using only a single telco dataset, it obtained 86.85% accuracy without applying k-fold cross-validation. In [11], the authors developed a model integrating both CNN and Bidirectional LSTM with a spatial attention mechanism to predict customer churn on a bank dataset. They compared the performance of the proposed model (AttnBLSTM-CNN) with BiLSTM-CNN, and in the performance evaluation it was found that AttnBLSTM-CNN performed better.

III. METHODOLOGY
The top-level overview of the customer churn prediction pipeline is presented in Fig. 1. It comprises several modules: dataset collection, data preprocessing, data balancing techniques, the ChurnNet classifier, and performance evaluation.

A. DATASET DESCRIPTION
In the research, we have used 3 publicly available telco datasets: the IBM Telco Dataset [34], the Churn-in-Telecom Dataset [35], and the Churn-data-UCI Dataset [36]. Dataset descriptions are given below.

1) IBM TELCO DATASET
This dataset consists of 7043 instances, where each instance refers to a customer. Each instance consists of 21 variables/attributes, of which 20 are independent and 1 is dependent. The dependent variable, namely Churn, determines whether the customer is churn or non-churn. In the IBM Telco Dataset, 26.54% of samples are churn.

2) CHURN-IN-TELECOM DATASET
This dataset consists of 3333 samples, where each sample represents a customer. Each sample has 21 attributes, of which 20 are independent and 1 is dependent. Among the independent attributes, 16 are numerical and 4 are discrete. In the dataset, 85.51% of samples are non-churn.

3) CHURN-DATA-UCI DATASET
The dataset contains 5000 samples of customers. Each sample consists of 20 variables, of which 19 are independent and 1 is dependent, namely Churn. In the dataset, 85.86% of samples are non-churn.

B. DATA PREPROCESSING
The following steps are used in preprocessing:
• Unnecessary attributes, such as those that indicate IDs or phone numbers, are dropped.
• Numerical missing values are replaced by 0; no categorical values are missing.
• To deal with categorical values, Label Encoding and One-Hot Encoding are performed for the three datasets.
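The preprocessing steps above can be sketched with pandas. This is a hypothetical illustration, the column names (`customerID`, `tenure`, `Contract`, `Churn`) are assumptions in the spirit of the IBM Telco schema, not the paper's exact code:

```python
import pandas as pd

# Toy frame standing in for a telco dataset (column names are illustrative).
df = pd.DataFrame({
    "customerID": ["A1", "A2", "A3"],
    "tenure": [1.0, None, 24.0],
    "Contract": ["Month-to-month", "Two year", "One year"],
    "Churn": ["Yes", "No", "No"],
})

# Step 1: drop identifier attributes.
df = df.drop(columns=["customerID"])

# Step 2: replace numerical missing values with 0.
df["tenure"] = df["tenure"].fillna(0)

# Step 3: label-encode the binary target; one-hot encode multi-valued categoricals.
df["Churn"] = df["Churn"].map({"No": 0, "Yes": 1})
df = pd.get_dummies(df, columns=["Contract"])
```

The same three steps apply unchanged to each of the three datasets, only the dropped identifier columns differ.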

C. DATA BALANCING TECHNIQUES
Since the three employed datasets have significant class imbalance issues, different data balancing techniques were required to mitigate the issue. Fig. 2, Fig. 3, and Fig. 4 portray the distribution of data after applying the different data balancing techniques to the three datasets.

1) SMOTE
The SMOTE algorithm works by producing synthetic data points for the minority class. SMOTE first locates the minority class, then randomly chooses a data point of that class and creates new points along the line segments between the selected point and its k nearest minority neighbours. This process is repeated until the class distribution is balanced.
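The interpolation step above can be sketched in a few lines of NumPy. This is a minimal illustration of the core SMOTE idea, not the implementation used in the paper (which would typically call a library such as imbalanced-learn):

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_sample(X_min, k=2, n_new=4):
    """Generate n_new synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        x = X_min[i]
        d = np.linalg.norm(X_min - x, axis=1)      # distances to other minority points
        neighbours = np.argsort(d)[1:k + 1]        # k nearest, skipping the point itself
        nb = X_min[rng.choice(neighbours)]
        gap = rng.random()                         # random point on the segment x -> nb
        synthetic.append(x + gap * (nb - x))
    return np.array(synthetic)

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_new = smote_sample(X_minority)
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority region rather than being arbitrary noise.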

2) SMOTETomek
SMOTETomek is a hybrid data-balancing method that combines oversampling and undersampling. SMOTE performs oversampling by creating synthetic data points for the minority class, while the removal of Tomek links, pairs of mutually nearest neighbours belonging to opposite classes, undersamples the data and cleans the class boundary.

3) SMOTEEN
SMOTEEN is a combination of the SMOTE method and Edited Nearest Neighbours (ENN). SMOTE creates synthetic data points for the minority class, and ENN eliminates observations whose class label disagrees with the majority label of their nearest neighbours, removing noisy and borderline samples.

IV. PROPOSED CHURNNET ARCHITECTURE
To the best of our knowledge, no such architecture has been proposed for customer churn prediction in TCI. It is also worth mentioning that the proposed ChurnNet was designed through extensive experimentation and hyperparameter tuning. The training configuration is shown in Table 3.

A. 1D CONVOLUTIONAL NEURAL NETWORK
A 1D Convolutional Neural Network (CNN) is a type of neural network used to process 1-dimensional data, as shown in Fig. 6. A 1D CNN works by performing the convolution operation, which is basically a dot product between a kernel/filter and the input data. The feature map is the output of the convolution operation, emphasizing certain features of the input data. Initially, in ChurnNet, the input tensor (1 × W × 1) (where W refers to the number of attributes) is passed into a 1D convolution layer that has 128 filters with a kernel size of 5. The stride is kept at 1, and zero padding is used. This convolution layer is used mainly to extract lower-dimensional features by creating feature maps. The output tensor (1 × W × 128) is passed to the following Residual Block. Later in the ChurnNet architecture, convolution layers with a similar configuration are used to extract the important features.
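The dot-product convolution described above can be sketched numerically. This is a minimal NumPy illustration of a single filter (the paper's layer has 128 such filters), not the authors' implementation:

```python
import numpy as np

def conv1d(x, kernel, stride=1, pad=0):
    """1-D convolution as used in deep learning (i.e. cross-correlation):
    slide the kernel over the padded input and take dot products."""
    x = np.pad(x, pad)                       # zero padding on both sides
    k = len(kernel)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(out_len)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # W = 5 attributes
kernel = np.ones(5) / 5.0                    # one kernel of size 5, as in ChurnNet
fmap = conv1d(x, kernel, stride=1, pad=2)    # pad=2 keeps the output length at W
```

With stride 1 and zero padding of 2, a size-5 kernel preserves the spatial length W, which is why the output tensor in the text stays (1 × W × 128).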

B. RESIDUAL BLOCK
To improve the performance of the model, residual blocks are introduced in the proposed architecture, each consisting of two subsequent symmetric convolution layers whose configuration is similar to the previous convolution layer. Besides, the Rectified Linear Unit (ReLU) activation function is used to limit the exponential computational growth of the model. Through the skip connection shown in Fig. 7, the input (x) of the residual block is connected to the output (y) of the second convolution layer of the block; the skip connection is used to mitigate the vanishing gradient problem [37]. The output of the residual block, with tensor shape (1 × W × 128), is passed to the following SE Block.
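The skip connection can be sketched as follows. A hypothetical minimal sketch: `conv1` and `conv2` stand in for the two symmetric convolution layers, here replaced by identity maps so the skip path is easy to trace:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, conv1, conv2):
    """y = ReLU(conv2(ReLU(conv1(x))) + x): the block's input is added back
    to its output, so gradients can flow through the identity path even if
    the convolutions saturate (sketch of the skip connection)."""
    y = relu(conv1(x))
    y = conv2(y)
    return relu(y + x)        # skip connection

# With identity "convolutions", the block reduces to ReLU(ReLU(x) + x).
out = residual_block(np.array([1.0, -2.0, 3.0]), lambda v: v, lambda v: v)
```

The design point is that even when `conv1`/`conv2` contribute little, the block can learn the identity, which keeps deep stacks trainable.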
C. SQUEEZE AND EXCITATION BLOCK
The SE block [38] is used in the proposed architecture to find the effective channels of the feature maps. The output U ∈ R^(1×W×128) of the residual block is passed to the squeezer of the SE block, where a Global Average Pooling (GAP) operation is performed, as shown in Fig. 8. By performing the GAP operation, the squeezer reduces the spatial dimension, squeezing each input channel into a single numeric value to form the channel descriptor.
Later the channel descriptor is passed to the exciter of the network which consists of two Fully Connected (FC) layers.
The ReLU activation function of the first FC layer helps the network understand the nonlinear connections between the channels. The second FC layer, with its Sigmoid activation function, helps capture the non-mutually-exclusive connections between the channels. The output of the second FC layer is multiplied by the input of the SE block, and the resulting output (1 × W × 128) is passed to the next spatial attention module.
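The squeeze-excite-rescale sequence can be sketched in NumPy. This is an illustrative sketch with toy identity weights (`w1`, `w2` are assumptions; a real SE block reduces and then restores the channel dimension):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(u, w1, w2):
    """Squeeze-and-excitation sketch for input u of shape (W, C):
    squeeze with global average pooling, excite with two FC layers
    (ReLU then sigmoid), then rescale each channel by its weight."""
    z = u.mean(axis=0)                        # squeeze: (C,) channel descriptor
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0))   # excite: FC -> ReLU -> FC -> sigmoid
    return u * s                              # channel-wise rescaling, shape (W, C)

u = np.ones((4, 2))        # W = 4 positions, C = 2 channels
w1 = np.eye(2)             # toy weights for illustration
w2 = np.eye(2)
out = se_block(u, w1, w2)
```

Each channel is scaled by a single learned value in (0, 1), which is how the block emphasizes informative channels and suppresses the rest.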

D. SPATIAL ATTENTION MODULE
The spatial attention mechanism [39] is used to focus on the most important regions of each channel in the feature maps. In the proposed model, the output of the SE block is passed as an input tensor to the spatial attention module, where two types of pooling operations, max pooling and average pooling, are performed, as shown in Fig. 9. The outputs of the pooling operations are concatenated to create a feature descriptor, which is then passed to a convolution layer to generate the spatial attention map; this map is multiplied with the input of the spatial attention module. The shape of the output tensor remains the same as that of the input tensor.
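The pool-concatenate-convolve-rescale flow can be sketched as follows. A hypothetical minimal sketch: the `conv` argument stands in for the learned convolution over the 2-channel descriptor:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(u, conv):
    """Spatial attention sketch for u of shape (W, C): pool across channels
    (max and average), concatenate into a 2-channel descriptor, map it
    through a convolution (stand-in callable), and rescale every position."""
    desc = np.stack([u.max(axis=1), u.mean(axis=1)], axis=1)  # (W, 2) descriptor
    attn = sigmoid(conv(desc))                                # (W,) attention map
    return u * attn[:, None]                                  # same shape as input

u = np.array([[1.0, 3.0], [2.0, 2.0]])
out = spatial_attention(u, lambda d: d.sum(axis=1))           # toy "convolution"
```

Where the SE block weights whole channels, this module weights spatial positions, which is why the two are complementary in ChurnNet.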

E. CLASSIFICATION LAYER
The output of the last spatial attention module is passed to the flatten layer. Then, to reduce overfitting, a dropout rate of 0.5 is applied. The output of the dropout layer is passed to the first dense layer, which consists of 128 neurons with the ReLU activation function. Its output is then passed to the last dense layer, which consists of 1 neuron with a sigmoid activation function, to perform binary classification of Churn and Non-Churn.
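The flatten-dropout-dense head can be sketched numerically. This is an illustrative forward pass with random toy weights (all shapes and names are assumptions, not the paper's trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classification_head(x, w1, b1, w2, b2, drop_mask):
    """Flatten -> dropout -> Dense(128, ReLU) -> Dense(1, sigmoid) sketch.
    drop_mask is the dropout mask; at inference time it is all ones."""
    x = x.ravel()                        # flatten layer
    x = x * drop_mask                    # dropout (rate 0.5 during training)
    h = np.maximum(w1 @ x + b1, 0.0)     # first dense layer with ReLU
    p = sigmoid(w2 @ h + b2)             # output: churn probability in (0, 1)
    return p.item()

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                  # toy feature maps (W=4, C=8)
w1 = rng.standard_normal((128, 32)) * 0.05       # 128 neurons, as in ChurnNet
w2 = rng.standard_normal((1, 128)) * 0.05        # single sigmoid output neuron
p = classification_head(x, w1, np.zeros(128), w2, np.zeros(1), np.ones(32))
```

A probability above 0.5 would be read as Churn, below as Non-Churn.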

V. EXPERIMENTAL RESULTS AND DISCUSSION
The performance of the proposed ChurnNet has been evaluated using several well-known performance metrics such as accuracy, precision, recall, F-Measures, AUC, and Matthew's Correlation Coefficient (MCC).
Accuracy is a measure of the overall correctness of the model. It can be computed as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision measures the model's ability to correctly classify the positive instances among the instances it labeled as positive:

Precision = TP / (TP + FP)

Recall measures the ability of the model to identify the positive instances out of all positive instances in the dataset:

Recall / Probability of Detection (POD) = TP / (TP + FN)

F-measure is the harmonic mean of precision and recall:

F-measure = (2 × Precision × Recall) / (Precision + Recall)

Probability of False Alarm (POF) is the fraction of negatives wrongly flagged as positive; its value should be as low as possible:

POF = FP / (FP + TN)

AUC is the area under the ROC curve, which plots POD against POF:

AUC = (1 + POD − POF) / 2

MCC is a metric that measures the effectiveness of binary classification:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Here, TP, FP, TN, and FN denote True Positive, False Positive, True Negative, and False Negative respectively.
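The metric definitions above can be sketched as one function over the confusion-matrix counts (a minimal illustration; in practice these would come from a library such as scikit-learn):

```python
import math

def churn_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics used in the paper (POF is the false-alarm rate)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                       # probability of detection (POD)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "pof": fp / (fp + tn),
        "mcc": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

m = churn_metrics(tp=40, fp=10, tn=45, fn=5)      # toy confusion counts
```

Unlike accuracy, MCC stays informative on imbalanced data, which is why it is reported alongside accuracy here.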

A. EXPERIMENTAL RESULTS
The proposed ChurnNet has been evaluated on 3 customer churn datasets using 10-fold cross-validation. All computations were conducted on the Google Colaboratory platform, and the source code is available at https://t.ly/1Oeo5. Tables 4, 5, and 6 present the empirical results of the proposed ChurnNet on the 3 datasets. For the three datasets, it was observed that ChurnNet performed reasonably well in terms of accuracy and AUC score even without any data balancing techniques. It obtained 79.33%, 90.52%, and 92.86% accuracy for the IBM Telco Dataset, Churn-in-Telecom Dataset, and Churn-data-UCI Dataset respectively when label encoding was used for the categorical values. An important observation from the experimentation is that the accuracy on the IBM Telco dataset increased slightly (to 79.44%) when one-hot encoding was performed instead of label encoding. However, for the other two datasets, label encoding provided better performance.
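The 10-fold protocol keeps the churn/non-churn ratio in every fold. A minimal sketch of stratified fold construction (illustrative; the round-robin dealing is an assumption about one simple way to stratify, not the paper's exact tooling):

```python
import numpy as np

def stratified_kfold_indices(y, k=10, seed=0):
    """Sketch of stratified k-fold: shuffle each class's indices, then deal
    them round-robin into k folds so every fold keeps the class ratio."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        for i, j in enumerate(idx):
            folds[i % k].append(j)
    return [np.array(f) for f in folds]

y = np.array([0] * 80 + [1] * 20)          # imbalanced toy labels (20% churn)
folds = stratified_kfold_indices(y, k=10)  # each fold keeps 2 churn samples
```

Each fold serves once as the test set while the remaining nine train the model, and the reported metrics are averaged over the ten runs.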
When the data balancing techniques were introduced, the performance on the 3 datasets improved significantly, not only in terms of accuracy but also in the other evaluation metrics. For the IBM Telco Dataset, the accuracy increased to 92.87% when SMOTEEN was applied. SMOTEEN also performed comparatively well across all evaluation metrics compared to SMOTE and SMOTETomek.
After hyperparameter tuning, we found that the performance of the proposed ChurnNet on the three datasets increased sharply. When the kernel size was increased from 3 to 5, the accuracy increased by around 1%, 2%, and 1% for the IBM Telco Dataset, Churn-in-Telecom Dataset, and Churn-data-UCI Dataset respectively. Altering the number of filters also showed improvements for the 3 datasets: accuracy reached 93.79%, 96.26%, and 97.01% for the IBM Telco Dataset, Churn-in-Telecom Dataset, and Churn-data-UCI Dataset respectively when 128 filters were used instead of 64 or 32. As shown in Tables 4, 5, and 6, using a Flatten layer in the proposed ChurnNet provided better performance than using a Global Average Pooling layer for the 3 datasets. Besides, incorporating the basic channel attention module of [39] in place of the SE block provided very decent performance; however, the inclusion of the SE block provided the best performance on the 3 datasets.
ChurnNet performed well with different optimizers (learning rate = 0.001) for the 3 datasets. Among them, ADAM obtained the best accuracy: 95.59% (IBM Telco Dataset), 96.94% (Churn-in-Telecom Dataset), and 97.52% (Churn-data-UCI Dataset). Apart from that, NADAM obtained 94.98%, 96.76%, and 97.38% accuracy for the IBM Telco Dataset, Churn-in-Telecom Dataset, and Churn-data-UCI Dataset respectively. RMSprop also performed well; across the different metrics its performance is very close to ADAM and NADAM for the 3 datasets.
ChurnNet achieved its highest performance with a learning rate of 0.001 (optimizer = ADAM). From the tables, it was found that as the learning rate decreased, the performance of ChurnNet also decreased, and this holds for the 3 datasets. For the IBM Telco Dataset, accuracy hit 95.59% when the learning rate was 0.001, and the other evaluation metrics were above 90%. For the Churn-in-Telecom Dataset and Churn-data-UCI Dataset, the accuracy was 96.94% and 97.52% respectively when the learning rate was kept at 0.001, with the other metrics also above 90%. The accuracy decreased to 95.21%, 96.74%, and 97.41% when the learning rate was 0.00001 for the 3 datasets. As reported in Tables 7, 8, and 9, the proposed model performed better than machine learning methods such as LR [40], SVM [41], and so on. Also, it performed better than deep learning methods such as MLP [42], CNN, LSTM, and GRU [43]. Apart from these, the proposed model outperformed ensemble approaches such as Bagging, Boosting [41], and so on.

VI. CONCLUSION
In TCI, customer churn prediction is a key factor from the perspective of the service provider. To deal with this issue, this study proposed ChurnNet, a CNN-based hybrid approach. The uniqueness of the ChurnNet architecture is that residual learning, a squeeze and excitation network, and an effective spatial attention module are incorporated with 1D convolution layers. The research was conducted on 3 publicly available datasets, where 3 data balancing techniques (SMOTE, SMOTETomek, SMOTEEN) were used to overcome the class imbalance problem of the datasets. After evaluation with 10-fold cross-validation, it was found that our proposed ChurnNet outperformed the state-of-the-art methods on the 3 benchmark datasets. At present, the research is conducted on a centralized data repository; if a customer service provider aims to predict churn in a decentralized environment, the research can be extended in the future through a federated learning approach. Also, different data transformation techniques can be used in empirical analyses to improve the performance of the proposed model. On top of that, as our proposed model is a black-box model, explainable AI (XAI) techniques can be utilized to explain the results produced by the ChurnNet model.

FIGURE 1. The top-level overview of the research.

FIGURE 2. Data distribution of different data balancing techniques for the IBM Telco dataset.

FIGURE 3. Data distribution of different data balancing techniques for the Churn-in-Telecom dataset.

FIGURE 4. Data distribution of different data balancing techniques for the Churn-data-UCI dataset.

FIGURE 6. Basic structure of a 1D convolutional neural network (CNN) design with a pair of convolutional layers.

FIGURE 8. Architecture of the squeeze and excitation block.

FIGURE 9. Architecture of the spatial attention module.

NOMENCLATURE
δ    ReLU activation function
σ    Sigmoid activation function
x_c  Output of the SE block
s_c  Scaling factor
s    Output of the excitation operation
u_c  cth channel of the input
z_c  cth element of the squeezed channels
z    Squeezed input

SOMAK SAHA received the B.Sc. degree from the Department of Computer Science and Engineering (CSE), BRAC University, Dhaka, in 2023. He has research experience in the area of agricultural automation. He is working on multiple research projects in applied machine learning, deep learning, and computer vision. His research interests include federated learning, adversarial machine learning, biomedical image processing, and healthcare informatics.

TABLE 1. Summary of literature review for customer churn prediction in the telecommunication industry.

TABLE 2. Description of the datasets.

TABLE 4. Performance of the proposed ChurnNet on the IBM Telco dataset.

TABLE 5. Performance of the proposed ChurnNet on the Churn-in-Telecom dataset.

TABLE 6. Performance of the proposed ChurnNet on the Churn-data-UCI dataset.

TABLE 7. Comparison with other related study methods for the IBM Telco dataset.

Table 7, Table 8, and Table 9 demonstrate the performance of the proposed ChurnNet compared to state-of-the-art methods for the IBM Telco Dataset, Churn-in-Telecom Dataset, and Churn-data-UCI Dataset respectively. It is visible that the proposed ChurnNet performed better than the compared machine learning, deep learning, and ensemble methods.

TABLE 8. Comparison with other related study methods for the Churn-in-Telecom dataset.

TABLE 9. Comparison with other related study methods for the Churn-data-UCI dataset.