A Survey on Churn Analysis in Various Business Domains

In this paper, we present churn prediction techniques that have been released so far. Churn prediction is used in the fields of Internet services, games, insurance, and management. However, since it has been used intensively to increase the predictability of various industry/academic fields, there is a big difference in its definition and utilization. In this paper, we collected the definitions of churn used in the fields of business administration, marketing, IT, telecommunications, newspapers, insurance and psychology, and described their differences. Based on this, we classified and explained churn loss, feature engineering, and prediction models. Our study can be used to select the definition of churn and its associated models suitable for the service field that researchers are most interested in by integrating fragmented churn studies in industry/academic fields.


I. INTRODUCTION
The term customer churn is commonly used to describe the propensity of customers to cease doing business with a company in a given time or contract [1]. Traditionally, studies on customer churn started from Customer Relation Management (CRM) [2]. It is crucial to prevent customer churn when operating services. In the past, customer acquisition was efficient relative to the number of churned customers. However, as the market saturated because of the globalization of services and fierce competition, customer acquisition costs rose rapidly [3], [4].
Reinartz, Werner, Jacquelyn S. Thomas, and Viswanathan Kumar. (2005) have shown that, for long-term business operations, putting effort into increasing the retention rate of all customers in terms of CRM is less efficient than putting effort into a small number of targeted customer acquisition activities [22]. Similarly, Sasser, W. Earl. (1990) has suggested that retained customers generally return higher margins than randomly targeted new customers [23]. Additionally, Mozer, Michael C., et al. (2000) have proposed that, in terms of net return on investment, marketing campaigns for retaining existing customers are more efficient than efforts to attract new customers [16]. Reichheld et al. (1996) have shown that a 5 percent increase in customer retention rate achieved 35 percent and 95 percent increases in net present value of customers for a software company and an advertising agency, respectively [29]. As such, churn prediction can be used as a method to increase the retention rate of loyal customers and ultimately increase the value of the company.
The associate editor coordinating the review of this manuscript and approving it for publication was Le Hoang Son.
Studies on customer churn have been proposed in various service fields. These studies on the churn analysis attempted to identify or predict in advance the likelihood that customers will churn using various indicators. The customer churn rate [5] is a typical customer churn analysis indicator. This refers to the ratio of subscribers who cancel a service to the total number of subscribers during a specific period [5], [32], [41]. The churn rate is the most widely used indicator for calculating the service retention period of subscribers in most service fields. Because of its importance and intuition, churn has been introduced in various service fields and developed to suit the characteristics of each field. Consequently, the research on the analysis of customer churn was fragmented according to each research field, thus the measurement criteria are all different. Currently, this is causing many problems. In the industry, communication costs arising from different churn criteria between service personnel in the process of fusing heterogeneous services (e.g., vehicle sharing service/insurance, online music service/department store) have been sharply increasing. Furthermore, since research on churn is simultaneously associated with two fields of engineering and business administration, it is not easy for researchers to describe two separate specialized fields on a single paper or to understand them.
In its early days, customer churn was used to define the customer's status in the CRM. The CRM is a business management method that first emerged as a way of increasing efficiency in the areas of retail, marketing, sales, customer service, and supply chain, and of increasing the customer value functions of the organization [2]. Since then, from the architectural point of view, the CRM has evolved and become divided into operational CRM and analytical CRM. The analytical CRM is focused on developing databases and resources containing customer characteristics and attitudes [24], [25], [30]. The analytical CRM was initially used for creating appropriate marketing strategies using customer status and customer behavior data, and, in particular, it has been used to fulfil the individual and unique needs of customers [26]. From this point on, IT and knowledge management related technologies have been utilized, and companies started applying dedicated technologies for the acquisition, retention, churn, and selection of customers [27]. Ever since IT technologies were implemented in the CRM, various companies have used such technologies in business areas including data warehouses, websites, telecommunications, and banking [28]. As described earlier, with studies on CRM claiming that increasing the retention rate of a small number of existing customers is more efficient than acquiring new customers, churn analysis has become one of the important personalized customer management techniques [20], [28], [59]. There were survey papers that collected and summarized churn analysis techniques in the telecommunications field [6]-[9]. However, these studies are limited to the telecommunications field, and the log data used for the churn analysis do not include time series features, retention and survival, and KPI (Key Performance Indicator) features.
There were also papers that applied various deep learning model-based churn analysis techniques to services from a computer science perspective [10], [11]. However, these studies are limited to deep learning algorithms, and lack descriptions of the underlying models and parameters. There are also a few survey papers on churn, yet they do not cover the latest deep learning techniques and cover only churn in specific industrial fields [12], [13]. The trend of building churn prediction models is changing, and performance is rapidly improving. However, because previous studies are fragmented, it is difficult for researchers to launch new research on churn. In order to address these issues, this survey paper describes the differences in the definition of churn in the fields of business administration, marketing, IT, telecommunications, newspaper publishing, insurance, and psychology, and compares differences in churn loss and feature engineering. In addition, we classify and explain the cases of churn prediction models based on this. Our study provides classification information for more detailed technologies on churn in a wider range than previous survey papers. Our research can reduce confusion about the churn criteria that are fragmented and utilized across multiple industry/academic fields, and can be of practical help in applying them to prediction models. In particular, this paper presents a deep learning model among machine learning techniques designed to solve non-contractual customer churn, which has recently appeared with the advancement of industries. The structure of this paper is as follows. Chapter II introduces typical definitions for churn in each business domain and their differences. Chapter III presents churn application cases in various business domains. Chapters IV and V introduce the losses and features used in churn analysis, respectively. Chapter VI introduces typical churn-based prediction models by classifying them according to each algorithm, and presents which algorithm is mainly used in each industry.

II. DEFINITION OF CHURN
Churn has been defined in various ways in multiple industries. In this chapter, we describe two typical types. Table 1 summarizes typical papers with different criteria for defining churn. In general, the dictionary definition of churn is known as the prolonged period of inactivity [14]. However, the criteria for 'inactivity' and 'prolonged' are different according to each research field. Such inconsistency is frequent because more and more modern services adopt loose subscription terms under competition. In the past, customer churn occurred explicitly through contractual cancellations. In modern services, including Internet and retail services, however, customer churn occurs frequently because customers' investment costs are low [19]-[21], [96]. These non-contractual customer churns occur due to the low switching cost for changing the service [31]. Thus, we can divide the criteria of churn into contractual churn and non-contractual churn. Each type is described below.
The first criterion is contractual churn. Contractual churn refers to the case in which a customer does not extend the contract even when the contract renewal date is reached [15], [16]. This churn means that a customer loses interest in the relevant service area and moves to a state where reentry is no longer possible. It is usually present in churn problems that arise when customers close their bank accounts or switch from one telecommunications carrier to another. In addition, contractual churn is frequently found in flat-rate services such as music and movie streaming services.
VOLUME 8, 2020
The second criterion is non-contractual churn. In general, in a non-contractual situation, customers can leave the service/contract without time constraints. From the service operation perspective, a criterion for churn is first constructed, and then a customer who meets this criterion is categorized as a churn customer. To do this, the number of days since the customer's behavior changed is counted [96]. When this period of inactivity or changed behavior exceeds the threshold, the customer is regarded as a churn customer. The period that is set as the threshold of inactivity is called the time window [17]. Defining non-contractual churn in this way makes it possible to infer the probability that customers will churn within a certain period. The time window method is frequently used these days when analyzing activity logs in a non-contractual situation. When a customer does not use a service for a certain period of time, this method regards the customer as churned. Internet services do not usually delete accounts. Therefore, an Internet service interprets a log-in as retention of the service, and interprets no access for a certain period of time as churn [17]. Fig. 1 schematically illustrates the non-contractual churn case with the time window method. The log of Fig. 1 was recorded for 10 weeks. The time window is set to 4 weeks from Week 4 to Week 7. Six users in Fig. 1 showed their activities in each week, and their activities were logged. In the time window period, users A, B, and C without any activity logs from Week 4 to Week 7 are regarded as churn, and the other users D, E, and F with activity logs are regarded as retention.
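The time window labeling described for Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `label_churn` and the per-user activity weeks are hypothetical and are not taken from Fig. 1 itself. Only the 10-week log, the Week 4 to Week 7 window, and the resulting churn/retention split follow the description above.

```python
from typing import Dict, Set


def label_churn(activity: Dict[str, Set[int]],
                window_start: int,
                window_end: int) -> Dict[str, bool]:
    """Return True (churn) for every user with no activity inside the time window."""
    window = set(range(window_start, window_end + 1))
    # A user is retained if at least one active week falls inside the window.
    return {user: not (weeks & window) for user, weeks in activity.items()}


# Hypothetical activity logs: the weeks (1 to 10) in which each user was active.
activity = {
    "A": {1, 2, 3},
    "B": {1, 2},
    "C": {2, 3, 8},      # active again after the window, but inactive inside it
    "D": {1, 4, 5},
    "E": {3, 6, 7, 9},
    "F": set(range(1, 11)),
}

labels = label_churn(activity, window_start=4, window_end=7)
churned = sorted(user for user, is_churn in labels.items() if is_churn)
print(churned)
```

In this sketch, a user who returns after the window (user C) is still labeled as churn for that window, which is one possible reading of the method; a production definition may treat such returning users differently.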
Churn analysis is usually performed to improve business outcomes. Therefore, in most churn prediction problems, the churn period is defined as a section in which customers' trust can still be restored. If the period after which a customer has completely churned is selected as the time window, the period for churn definition increases exponentially and provides no business gain, since changing the mind of customers who want to churn is deemed impossible [17]. The contractual churn mentioned above is close to customers' complete churn from a service. Therefore, these days the majority of log-based churn prediction problems use a probabilistic method to determine whether customers are churning and to give customers an incentive to reuse the service.
The criteria for setting the time window differ according to the characteristics of each service. Yang, Wanshan, et al. (2019) analyzed log data to define the churn period of mobile games, and the analysis showed that more than 95% of customers did not return once they had been absent for 3 consecutive days. They therefore set 3 days as the time window churn period [43]. Lee, Eunjo, et al. (2018) defined the period during which 75% of customers were continuously unconnected as a churn section by taking into consideration the characteristics of PC game services [17]. After collecting customers' unconnected periods, they drew a cumulative data graph and selected the section where more than 75% of customers churned as the time window. Fig. 2 schematically illustrates the cumulative data of consecutive unconnected days collected by Lee, Eunjo, et al. (2018). According to Fig. 2, the period during which 75% or more of customers were continuously unconnected was 14 weeks. Therefore, the time window churn period is 14 weeks: if a customer is unconnected for 14 weeks, he/she is regarded as churn. In general, in order to determine the time window period, researchers should collect customers' continuous unconnected periods and draw a cumulative graph.
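As a rough illustration of how a time window might be read off a cumulative graph of unconnected periods, the sketch below picks the smallest period that covers a chosen share of the observed periods. This is one possible reading of the procedure, not the exact method of the cited studies. The function name `time_window_from_gaps`, the sample periods, and the use of an empirical quantile are all assumptions.

```python
import math
from typing import Sequence


def time_window_from_gaps(gaps: Sequence[int], coverage: float = 0.75) -> int:
    """Return the smallest gap g such that at least `coverage` of the gaps are <= g."""
    if not gaps:
        raise ValueError("gaps must not be empty")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(gaps)
    # Number of observations that must lie at or below the chosen threshold.
    needed = math.ceil(coverage * len(ordered))
    return ordered[needed - 1]


# Hypothetical consecutive unconnected periods, in weeks, one per customer.
unconnected_weeks = [1, 2, 3, 5, 9, 14, 20, 30]

window_75 = time_window_from_gaps(unconnected_weeks, coverage=0.75)
window_95 = time_window_from_gaps(unconnected_weeks, coverage=0.95)
print(window_75, window_95)
```

A stricter coverage value gives a longer window, which mirrors the trade-off noted above: a longer window labels churn more reliably but leaves less room to win customers back.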
As described above, there are two customer churn types, which are contractual churn and non-contractual churn. Additionally, there are three churn observation criteria as follows: monthly, daily, and binary. The monthly and daily churn observations are related to the cycle in which the customer's status is updated in the database. The binary churn observation is acquired by manipulating this database. In general, binary churn is determined by the existence of contract in the contractual settings. In the non-contractual settings, the company defines the customer inactivity features, and when a customer meets the inactivity or disloyal customer feature, the customer is regarded as binary churn [96], [141]. The reason for having multiple ways of defining customer churn is to periodically monitor the customers' status changes. And through such observation, the expected net business value can be increased by predicting customer churn rates and providing possible churn customers with incentives to retain them from leaving [16], [23].

III. CHURN ANALYSIS IN VARIOUS BUSINESS FIELDS
The majority of the early studies on churn were conducted from a management perspective, especially CRM (Customer Relation Management) [30], [31]. CRM churn covers all churn problems that can occur in the process of customer identification, customer attraction, customer retention, and customer development. Modern churn prediction problems are mainly analyzed using log data. A log is trace data that remains when using Internet services. Therefore, the churn prediction models implemented using log data can be used for Internet services in various industries. There are 12 business fields that performed churn prediction. The cases of churn prediction for each business field are summarized in Appendix A.
The telecommunications industry accounts for the majority of previous studies on churn. Telecommunications services have high customer stickiness despite high customer acquisition costs. Therefore, if customer churn is prevented and appropriate incentives are provided, it is of great help in maintaining sales [16], [32]- [34].
The financial and insurance industries also predict customer churn. Zhang, Rong, et al. (2017) stressed the need to build churn prediction models and prevent churn, referring to high customer acquisition costs and high customer values in the insurance industry [11]. Chiang, Ding-An, et al. (2003) mentioned that customer values were high in the online financial market, and created churn scenarios from customers' financial product selections and selection sequences using the Apriori algorithm [35]. Larivière, Bart, and Dirk Van den Poel. (2004), based on the assumption that the customer group differed according to the financial product attribute, measured the survival time for each product and demonstrated that the likelihood of churn differed depending on the tendency of the customers who selected each financial product [36]. Zopounidis, Constantin, Maria Mavri, and George Ioannou. (2008) measured the switching rate of financial products and the survival period of customers for each product to discover attractive products [37]. Here, the shorter the survival period, the more frequently churn occurs, and this is used as an indicator of the need to supplement a financial product. Glady, Nicolas, Bart Baesens, and Christophe Croux. (2009) measured the customer lifetime values and the decrease in expected earnings over time as an indicator corresponding to customer loyalty [38]. During this process, machine learning was used to calculate the churn rate, which was used to estimate the customer lifetime values.
Later on, studies on churn have been actively conducted in the gaming field as in the telecommunication field. These services have a fast cycle of customer inflow and churn because of mass competition. However, if a single service is run for a long time, the service competition intensifies and the Customer Acquisition Cost (CAC) tends to increase [16], [39], [128]. As the CAC gets larger, the technology to predict and prevent churn becomes more crucial. Viljanen, Markus, et al. (2016) applied the survival analysis to mobile games and calculated the churn rate, similar to the churn prediction of financial services [40]. The game sector actively uses machine learning techniques when conducting research on churn because of the large volume of log data [10], [42], [43]. Milosevic, Milos, Nenad Živic, and Igor Andjelkovic. (2017) created a model predicting churn in the study on game churn, gave churn prevention incentives by finding out and dividing probable churn customers into A/B groups, and demonstrated actual effects statistically [44]. Runge, Julian, et al. (2014) conducted a similar study, and revealed that existing customers with a high possibility of churn had a higher marketing response rate when compared to general marketing targets [45].
Furthermore, the music streaming service field even held a competition to build a prediction model, and research on churn was also conducted in the Internet service and newspaper subscription fields. Newspaper subscription and music streaming services offer fixed-rate services, and customer churn coincides with the contract renewal period. On the other hand, because an Internet service goes into an inactive state whenever customers wish, contract renewal takes place in near real time. Research on churn prediction was also conducted in online dating, online commerce, Q&A services, and social network-based services [46]. There were some studies which approached customer churn from a psychological perspective. Borbora, Zoheb, et al. (2011) applied motivation theory to customers of MMO RPG games and found that customers churned when their motivation to play changed [47]. Yee, Nick. (2016) surveyed approximately 250,000 gamers, and showed that customers' attitudes toward games were clustered by country, race, and age [48].
In the marketing field, Glady, Nicolas, Bart Baesens, and Christophe Croux. (2009) used the features from a marketing perspective such as RFM (Recency, Frequency and Monetary) and CLV (Customer Life time Value) for churn prediction [38].
Studies on churn prediction were also conducted in the human resources and energy fields, although such studies are in the minority. Saradhi, V. Vijaya, and Girish Keshav Palshikar. (2011) conducted research on churn in the human resources field to reduce the retraining costs incurred when employees churn and to prove employee value [50]. Moeyersoms, Julie, and David Martens. (2015) estimated whether customers would churn to another energy supplier based on energy data and socio-demographic data provided to customers [51].

IV. CUSTOMER CHURN LOSS
Customer churn behavior is quantitative. However, it is difficult to directly relate customer churn to a decrease in sales. Therefore, of the studies on churn prediction, there is a study that introduces a method of calculating the loss of a single customer. In this way, we can calculate the value of a churn prevention model by multiplying the loss cost of one customer by the number of people who are prevented from churn with a churn prediction algorithm.

A. CUSTOMER ACQUISITION COST (CAC)
The customer acquisition cost (CAC) is the total cost spent until a customer is convinced to use a service. The CAC can be calculated by simply dividing all costs spent on acquiring customers, for example on marketing campaigns, by the number of customers acquired in the period in which the money was spent. The company measures the cost of acquiring a customer with the CAC, and in terms of return on investment (ROI) the CAC is the minimum value that the service must earn from a customer. The CAC occurs mainly through marketing activities. If a customer churns from a service, the company will have to recruit another customer by spending the CAC to maintain the service. According to the study conducted by Mozer, Michael C., et al. (2000), retained customers are known to provide a better return on investment than customers newly recruited through the CAC [16]. In this way, if customers who are likely to churn in the near future can be identified through churn prediction, a basis can be established for determining a suitable incentive for each customer while minimizing the CAC. In addition, by multiplying the number of customers planning to churn by the incentive cost to be provided to those customers, the cost of the business loss that can be incurred when the customers are not retained can be calculated through the churn prediction model [45], [53]. Therefore, some studies measured the CAC, and calculated it as the loss incurred when a customer churned [8], [31], [72].
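The arithmetic described above can be made concrete with a small sketch. All function names and figures below are hypothetical. The sketch also assumes, for simplicity, that each retained customer saves exactly one CAC, which is one way of reading the idea that a churned customer must be replaced by spending the CAC.

```python
def customer_acquisition_cost(total_acquisition_spend: float,
                              customers_acquired: int) -> float:
    """CAC = total acquisition spend / customers acquired in the same period."""
    if customers_acquired <= 0:
        raise ValueError("customers_acquired must be positive")
    return total_acquisition_spend / customers_acquired


def incentive_budget(predicted_churners: int,
                     incentive_per_customer: float) -> float:
    """Cost of offering an incentive to every customer predicted to churn."""
    return predicted_churners * incentive_per_customer


def avoided_replacement_cost(cac: float, customers_retained: int) -> float:
    """Acquisition spend avoided because retained customers need no replacement."""
    return cac * customers_retained


# Hypothetical figures for one campaign period.
cac = customer_acquisition_cost(50_000.0, 1_000)
budget = incentive_budget(predicted_churners=200, incentive_per_customer=10.0)
avoided = avoided_replacement_cost(cac, customers_retained=80)
net_gain = avoided - budget
print(cac, budget, avoided, net_gain)
```

With these illustrative numbers the campaign pays for itself only if enough of the targeted customers actually stay, which is why the quality of the churn prediction matters for the calculation.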

B. CUSTOMER LIFETIME VALUE (CLV)
The customer lifetime value (CLV) is the amount that a customer is expected to pay once acquired. The reason why this value is important is that the CLV is the expected earnings from the customer's use of a service when acquiring a new customer, and the CLV is a useful indicator for setting the upper bound when calculating customer-related costs. Marketing costs and incentives provided for customers who are going to churn are typical examples of customer-related costs [52]. The value of a retained customer is usually calculated with the CLV rather than the CAC, because with the CAC method the cost can vary depending on the company's policy and marketing timing. A retained customer is a customer whom the churn prediction model predicted to churn in the near future but who stayed after receiving incentives. There are multiple studies on methods of calculating the CLV and on applying the CLV to churn models. Verbraken, Thomas, Wouter Verbeke, and Bart Baesens. (2012) and Neslin, Scott A., et al. (2006) proposed formulas for calculating the net profit using churn rate, CAC, CLV, fixed operating cost, and incentive cost [54], [55]. Additionally, the same approach has been taken by Fader, Peter S., and Bruce GS Hardie. (2009) as well [149]. In the formula for deriving the CLV, the survival rate (retention) $\gamma$ for the customer's time period $t$ should be derived first. $\gamma$ can be derived through the probabilistic distribution as well. When $\gamma$ denotes a retention rate, $1 - \gamma$ expresses a churn rate. In this case, the expected survival time of a customer can be expressed as $\frac{1}{1-\gamma}$. Lastly, assuming that the profit contribution per customer (customer value) for period $t$ is expressed as $m$, the CLV can be obtained by $m\gamma^{t}$. Here, the value of $m$ may be different for each customer segment. When calculating $m$ for newly acquired customers, $m$ is derived by dividing the net profit from active customers for time $t$ by the number of active customers.
Further, the contribution of a specific segment can be calculated by dividing the net profit from the active customers of that segment during period $t$ by the number of active customers of that segment during period $t$. In the case where the customer value is discounted over time $t$, the discount rate is defined as $d$, and the discount factor at time $t$ is expressed as $\frac{1}{(1+d)^{t}}$. Ultimately, the CLV with a discount term for a given time $t$ can be expressed as $\frac{m\gamma^{t}}{(1+d)^{t}}$. Since the CLV should include the concept of lifetime, time $t$ can be generalized as follows.
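The generalized expression itself is not reproduced in the text above. A plausible reconstruction is sketched below, assuming that the per-period term $\frac{m\gamma^{t}}{(1+d)^{t}}$ is summed over all periods starting at $t = 0$ and that $m$, $\gamma$, and $d$ are constant over time. These are assumptions of this sketch, not a formula confirmed by the source.

```latex
\mathrm{CLV}
  = \sum_{t=0}^{\infty} \frac{m\,\gamma^{t}}{(1+d)^{t}}
  = m \sum_{t=0}^{\infty} \left( \frac{\gamma}{1+d} \right)^{t}
  = \frac{m\,(1+d)}{1+d-\gamma},
  \qquad 0 \le \gamma < 1,\; d \ge 0 .
```

Under these assumptions, setting $d = 0$ reduces the sum to $\frac{m}{1-\gamma}$, that is, the per-period contribution $m$ multiplied by the expected survival time $\frac{1}{1-\gamma}$ given above.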
As a way of calculating the loss of employee churn, Saradhi, V. Vijaya, and Girish Keshav Palshikar. (2011) calculated churn rates, and by using the method of multiplying customer value by the remaining survival time, the authors calculated the projected value in future time of churned employees who failed to fulfill the CLV [50]. Based on the formula for deriving net cashflow using survival time parameters suggested by Reinartz, Werner J., and Vijay Kumar. (2000), the study conducted by Glady, Nicolas, Bart Baesens, and Christophe Croux. (2009) used the approach of multiplying individual cashflows for entire product to calculate the CLV in the retail service field [38], [146].

V. FEATURE ENGINEERING
Churn is generally related to customers' last activity. However, predicting churn and compensating for it based only on the last log of service usage does not change customers' overall service usage patterns. Consequently, some studies show that short-term prediction and monetary rewards soon lead to another churn [10], [45]. Therefore, studies have emerged in recent years to develop other features that are as important as the last log the customer left before churning, or to discover potential churners by reprocessing the time series features.

A. DEVELOPING NEW FEATURE
Sifa, Rafet, Christian Bauckhage, and Anders Drachen. (2014) diagrammed the related signs leading to churn, and grouped the features corresponding to each diagram and managed them in Game field [56]. This study focused on detecting signs that led to churn rather than building churn models and comparing performance. They linked unmeasurable numbers such as key indicators of services, the number of complaints raised and psychological fluctuations to service features so as to measure them, and utilized them for the research on churn prediction.
Yang, Wanshan, et al. (2019) judged that the probability of churn would increase when the regularity of customers' service usage behavior was broken [43]. They added the change in the customers' service playtime distribution as a feature, built a machine learning model, and maintained that the feature was of great help in estimating churn. Hadiji, Fabian, et al. (2014) and Yang, Wanshan, et al. (2019) predicted churn using KPI features [42], [43]. They used indicators from business administration as churn prediction features since churn was related to management indicators. Runge, Julian, et al. (2014) associated the value of the service goods possessed by customers with customer churn. This study intensively used features related to customers' assets in the service (e.g., reserves, items) [45]. Paid goods, free goods, last purchase, last purchase date, and so on were used as assets. They assumed that the more goods a user had to use in the service, the larger the opportunity cost became, which would be a major indicator of churn.
In the finance field, Chu, Tsai, and Ho (2007) used customer demographic CRM features and business branch relationships to conduct churn prediction [137]. They predicted customer churn using customer information such as gender, zip code, and the customer's industry code, and service provider information such as tenure, time of service suspension, and average invoice. In order to predict customer churn in Pay-TV services, Burez, Jonathan, and Dick Van den Poel (2008) selected customer behavioral loyalty features and combined them with CRM features [132]. In addition to service information such as payment type and contract expiration month, the authors used demographic CRM features, including the customer's age, province, and customer type, and additional features for classifying disloyal customers, namely bad payment behaviors, number of notices to pay, and number of device deactivations.
Logs remain in the majority of services. However, not all logs are helpful in estimating churn. Mozer, Michael C., et al. (2000) maintained that in general the indicators that measured the quality of services were good data for estimating churn in telecommunication field [16]. Dror, Gideon, et al. (2012), by collecting responses such as like and dislike from the Internet service they ran, used them to predict churn. They explained that customer satisfaction was expressed as an emotional expression, which was a direct expression of service satisfaction [57].
Fader, Peter S., Bruce GS Hardie, and Ka Lok Lee. (2005) and Glady, Nicolas, Bart Baesens, and Christophe Croux. (2009) applied the RFM method, which is used for selecting loyal customers in the marketing field, to churn prediction [38], [49]. RFM stands for recency (latest service transaction), frequency (service transaction frequency), and monetary (customer's purchase size). These features are used to extract customers who carry business value. In terms of marketing, the RFM features are generally used to classify customers into five groups based on the RFM scores [140]. However, in the above two studies, the authors characterized RFM features as important features from which the net business value of the customer-service relationship can be derived, and conducted churn prediction based on the premise that loyal customers have higher service stickiness and switching costs. On the other hand, to conduct customer churn prediction, Tamaddoni Jahromi, Ali, et al. (2010) applied RFM features to the telecommunication field and added about 12 new features, including the latest telecommunication service subscription period, call frequency, and total call cost [20]. The authors also used mobile carrier-specific features including call time, number of incoming or outgoing calls, and total talk time between specific customers to conduct churn prediction. Further, Wei, Chih-Ping, and I-Tang Chiu. (2002) utilized RFM features in the telecommunication field as well [41]. By using the time length between the contract starting date and termination date, the authors set frequency of service use as the recency feature, payment type as the frequency feature, and payment type as the monetary feature. Additionally, as a mobile carrier-specific feature, the authors derived an influence feature, which indicates the number of distinct receivers the customer called in the outgoing call list. Buckinx, Wouter, and Dirk Van den Poel. (2005) applied RFM features to the retail field to conduct churn prediction [96]. The authors used the customer's most recent purchase or consumption time of the day (Recency), the number of purchases (Frequency), and the amount of spending (Monetary). As retail service-specific features, metadata such as customer-supplier relationship, buying categories, mode of payment, brand purchase behavior, and usage of promotions were used.
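To make the RFM features discussed above more tangible, the following sketch derives them from a simple transaction log. It is a generic illustration rather than the feature definition of any cited study. The function name `rfm_features`, the tuple layout of the log, and the choice of days since the last transaction as the recency measure are assumptions.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple

Transaction = Tuple[str, date, float]  # (customer_id, transaction_date, amount)


def rfm_features(transactions: Iterable[Transaction],
                 as_of: date) -> Dict[str, Dict[str, float]]:
    """Compute recency (days), frequency (count), and monetary (sum) per customer."""
    last_seen: Dict[str, date] = {}
    frequency: Dict[str, int] = defaultdict(int)
    monetary: Dict[str, float] = defaultdict(float)
    for customer_id, day, amount in transactions:
        if customer_id not in last_seen or day > last_seen[customer_id]:
            last_seen[customer_id] = day
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        customer_id: {
            "recency_days": (as_of - last_seen[customer_id]).days,
            "frequency": frequency[customer_id],
            "monetary": monetary[customer_id],
        }
        for customer_id in last_seen
    }


# Hypothetical transaction log.
log = [
    ("u1", date(2020, 1, 5), 20.0),
    ("u1", date(2020, 1, 20), 35.0),
    ("u2", date(2019, 12, 1), 10.0),
]
features = rfm_features(log, as_of=date(2020, 1, 31))
print(features["u1"], features["u2"])
```

Domain-specific features such as those added in the telecommunication and retail studies above would be appended to these three base columns in the same per-customer table.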

B. FEATURE MODIFICATION
Pure log data faithfully contains all customer information. However, churn prediction tends to be more accurate when the raw logs are processed first. This is because service indicators are generally sparse and have many outliers, and the data distribution is skewed to one side. Therefore, it is necessary to extract important information using an appropriate feature engineering technique for churn prediction.
In general, there have been few papers mentioning feature engineering know-how about churn. However, Zhang, Rong, et al. (2017) shared useful information when building an algorithm to predict churn from log data [11].

1) ONE-HOT ENCODING
One-hot encoding is primarily a feature engineering method for nominal categorical data. To apply a machine learning algorithm that is not tree-based to categorical data, the data must first be converted to numerical data. One-hot encoding is a technique used for this conversion. A one-hot encoding is a group of bits in which the only legal combinations of values are those with a single high (1) bit and all other bits low (0). One-hot encoding categorical data in this way produces an orthogonal feature space, with one dimension per category. Although alternative methods for processing categorical data exist, such as numerical encoding or binary encoding, each has its drawbacks: numerical encoding breaks the nominal nature inherent in categorical data and imposes a linear order between categories, and binary encoding likewise introduces unintended distances between categories. These changes have the side effect of the model learning an unintended continuity. Therefore, categorical data should be converted into one-hot encoded features.
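The encoding can be sketched in a few lines. The example below is only an illustration: the `payment_types` column and the helper name `one_hot_encode` are hypothetical and do not come from the surveyed papers.

```python
def one_hot_encode(values):
    """Map each nominal value to a vector with exactly one high (1) bit."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return categories, vectors

# Hypothetical nominal feature: the customer's payment type.
payment_types = ["card", "cash", "transfer", "card"]
categories, encoded = one_hot_encode(payment_types)
```

Because any two distinct categories map to vectors whose dot product is zero, no ordering or distance between categories is implied.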

2) BUCKETING
Outliers are problematic for service data. Bucketing (also called discrete binning or data binning) can be used for both categorical and continuous features. When the range of a feature's values is so wide that the resulting variance and sparsity make the feature hard to use in a model, bucketing converts it into a categorical feature with a small number of categories.
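A minimal sketch of bucketing a continuous feature with fixed cut points is given below. The feature values and the cut points in `spend_edges` are assumptions chosen for illustration; in practice the edges might instead be derived from quantiles of the data.

```python
from bisect import bisect_right

def bucketize(values, edges):
    """Return the bucket index of each value; `edges` must be sorted ascending."""
    return [bisect_right(edges, value) for value in values]

# Hypothetical monthly spend containing one extreme outlier.
monthly_spend = [0.0, 3.5, 12.0, 80.0, 95.0, 25000.0]
spend_edges = [1.0, 10.0, 100.0]  # assumed cut points: none / low / mid / high
spend_bucket = bucketize(monthly_spend, spend_edges)
```

The outlier of 25000.0 simply falls into the highest bucket, so it no longer dominates the scale of the feature.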

3) DATA IMPUTATION
Data imputation is the process of replacing missing data with substituted values. It is generally recommended to fill in missing data rather than leave it empty. Nimmagadda, Sravya, Akshay Subramaniam, and Man Long Wong. (2017) argued that filling in the missing data, rather than dropping it, improved performance [58]. Sifa, Rafet, et al. (2015) improved performance by using a semi-supervised learning technique because there was not sufficient data to solve the prediction problem [52]. Data imputation techniques may help to solve data imbalances and make the best of the information used in the model.
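One of the simplest imputation strategies is to replace missing entries with the median of the observed values. The sketch below illustrates only this basic strategy; it is not the method used in [58] or [52], and the column `sessions_per_week` is hypothetical.

```python
from statistics import median

def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [value for value in column if value is not None]
    fill = median(observed)
    return [fill if value is None else value for value in column]

# Hypothetical feature with two missing observations.
sessions_per_week = [3, None, 7, 1, None, 5]
imputed_sessions = impute_median(sessions_per_week)
```

The median is used here instead of the mean because service data is often skewed, but this choice is an assumption of the example rather than a recommendation from the cited studies.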

4) NORMALIZATION
Normalization is a data pre-processing technique that stabilizes the training of several machine learning algorithms. This process scales individual samples to have unit norm. The distribution of service data is generally skewed to one side. Without normalization, the machine learning model generally has to use a very small learning rate when searching for the optimum, resulting in a long training time. With normalized features, the model can be trained rapidly using a relatively large learning rate compared to training on the initial data. Nimmagadda, Sravya, Akshay Subramaniam, and Man Long Wong. (2017) used techniques other than 0-1 normalization, such as log-normalization or quantile normalization, to build a prediction model [58].
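To show why a log transform can help with skewed service data, the following sketch compares plain 0-1 (min-max) scaling with log-normalization followed by 0-1 scaling. The `playtime` values are hypothetical, and the use of `log1p` is an assumption of this example rather than the exact transform used in [58].

```python
import math

def min_max(values):
    """Scale values linearly to [0, 1]; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_normalize(values):
    """Compress a right-skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed feature: minutes of play time per customer.
playtime = [0.0, 1.0, 9.0, 99.0, 9999.0]
scaled_raw = min_max(playtime)
scaled_log = min_max(log_normalize(playtime))
```

With plain 0-1 scaling, the value 99.0 is squeezed below 0.01 by the single large value, whereas after the log transform it lands near the middle of the range.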

5) FEATURE EMBEDDING
Schweidel, David A., and George Knox. (2013) proposed a parsimonious model that integrates customer behavior data into latent attrition models in order to support direct marketing target selection [142]. This model enables dense embedding of sparse customer data by extracting latent attrition. In general, when customer behavior is stored as log data in the gaming, Internet service, and telecommunication service fields, the customer data often takes the form of high-dimensional sparse data. Hence, in the past, customer behavior data was simplified through a bucketing process. However, with the emergence of deep learning techniques, it has become possible to handle time-dependent high-dimensional sparse data. Unlike explicit methods such as bucketing, a deep learning algorithm can learn the latent behavior of customers who are about to churn in an end-to-end way. Moreover, a deep learning algorithm can generate latent features by compressing long-term features, which is a new technique of feature embedding. An autoencoder is one example of such a technique. An autoencoder is trained through an encoding and decoding process, during which latent vectors are generated. The latent features generated during this process compress high-dimensional sparse data into low-dimensional dense data. It has been suggested that a model that uses these vectors as the input of fully connected networks provides better prediction performance than a traditional model that uses the sparse features as is. For example, to predict the future demand of Uber customers, Zhu, Lingxue, and Nikolay Laptev. (2017) improved the performance of the prediction model by compressing sparse data through a long short-term memory autoencoder and then combining the compressed data with fully connected neural networks (FCNNs) [143]. This trend is found in churn prediction as well. Lee, Eunjo, et al. (2018) claimed that in the customer churn prediction competition hosted by the authors, the winning team showed significantly better performance than other teams by utilizing an autoencoder to compress the sparse data [10].
Zhang, Rong, et al. (2017) also used latent feature modification for churn prediction. They classified features into a memorization feature and a generalization feature depending on data attributes [11]. The memorization feature refers to data that is likely to be a latent feature among time series data. These types of data reveal the characteristics of the churn/no-churn group slowly, usually over a period of time. This data is modified by a deep learning model such as an LSTM, an embedding, or an autoencoder, and is thereby converted into a dense, low-dimensional latent feature. As opposed to the memorization feature, the generalization feature exhibits the characteristics of the churn/no-churn group with only short-term attributes. This feature can represent the attributes of the churn/no-churn group with a short-term section or with the value itself. The generalization feature data is used with shallow machine learning models such as logistic regression. The churn feature that combines the memorization feature and the generalization feature predicts churn by combining deep learning models and traditional machine learning techniques.
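The idea of compressing sparse behavior data into a dense latent vector can be illustrated with a deliberately simplified autoencoder. The sketch below is a linear autoencoder trained by full-batch gradient descent in plain Python; it is not the LSTM autoencoder of [143] nor the winning model of [10], and the data, the latent size, the learning rate, and all function names are assumptions made for this example.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def train_linear_autoencoder(data, latent_dim, epochs=500, lr=0.05, seed=0):
    """Fit encoder/decoder weights by gradient descent on the reconstruction error."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    enc = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(latent_dim)]
    dec = [[rng.uniform(-0.1, 0.1) for _ in range(latent_dim)] for _ in range(d)]
    losses = []
    for _ in range(epochs):
        grad_enc = [[0.0] * d for _ in range(latent_dim)]
        grad_dec = [[0.0] * latent_dim for _ in range(d)]
        loss = 0.0
        for x in data:
            z = matvec(enc, x)        # encode: dense latent vector
            x_hat = matvec(dec, z)    # decode: reconstruction of the input
            err = [xh - xi for xh, xi in zip(x_hat, x)]
            loss += sum(e * e for e in err)
            for i in range(d):        # gradient w.r.t. decoder weights
                for k in range(latent_dim):
                    grad_dec[i][k] += 2 * err[i] * z[k]
            for k in range(latent_dim):  # gradient w.r.t. encoder weights
                dz = sum(2 * err[i] * dec[i][k] for i in range(d))
                for j in range(d):
                    grad_enc[k][j] += dz * x[j]
        for i in range(d):
            for k in range(latent_dim):
                dec[i][k] -= lr * grad_dec[i][k] / n
        for k in range(latent_dim):
            for j in range(d):
                enc[k][j] -= lr * grad_enc[k][j] / n
        losses.append(loss / n)       # mean squared reconstruction error
    return enc, dec, losses

# Hypothetical sparse behavior vectors (six binary event indicators per customer).
behavior = [
    [1, 0, 0, 1, 0, 0], [1, 0, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0], [0, 0, 1, 0, 0, 1], [1, 0, 0, 1, 0, 0],
]
encoder, decoder, losses = train_linear_autoencoder(behavior, latent_dim=2)
latent = [matvec(encoder, x) for x in behavior]  # dense 2-dimensional features
```

The vectors in `latent` play the role of the dense latent features described above, which could then be passed to a downstream classifier together with other features.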

C. DEALING WITH IMBALANCED DATA
In a stable service, churned customers generally account for a small proportion of the customer base compared to retained customers. For example, suppose that 96 percent of the data is composed of retained customers and 4 percent of churned customers. If a prediction model trained on this data always predicts that customers are retained, the model still achieves 96 percent accuracy. In this case, although the accuracy of the model may be said to be high, the model is not effective in identifying the characteristics of churned customers. Thus, the ideal method is to adjust the churned data and retained data to have similar proportions. This is because a balanced dataset composed of churned and retained customers has higher noise tolerance than an imbalanced dataset, and hence it is more likely to yield decision boundaries for the minority group, which in this case is the churned customers [91]. Meanwhile, if a simple undersampling method is used to obtain a balanced dataset, the prediction model may not have enough data for training. To address this, Burez, Jonathan, and Dirk Van den Poel. (2009) utilized a method called the CUBE method to improve churn prediction performance [95]. In addition, Amin, Adnan, et al. (2016) obtained the IBM telecommunication dataset and applied oversampling to the imbalanced dataset. Subsequently, the authors used the resulting balanced dataset to train a churn prediction model [144]. The authors also compared and analyzed the performance of churn prediction models trained with widely used oversampling methods.
According to the study, the Mega-Trend Diffusion Function (MTDF) method provided the highest accuracy when compared with other techniques such as the Synthetic Minority Oversampling Technique (SMOTE), Adaptive Synthetic Sampling Approach (ADASYN), Majority Weighted Minority Oversampling Technique (MWMOTE), Immune Centroids Oversampling Technique (ICOTE), and the couples Top-N Reverse K-Nearest Neighbor (TRkNN) algorithm. Gui, Chun. (2017) applied undersampling, oversampling, and SMOTE methods to an imbalanced dataset from the telecommunication field and compared the performance of the resulting churn prediction models [145]. In the study, the author suggested that the SMOTE sampling technique provided the best prediction performance. Undersampling a churn dataset risks discarding useful information that expresses the characteristics of retained customers. Conversely, oversampling risks overfitting, because it replicates churned-group data whose variance is insufficient.
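The 96/4 example and the idea behind SMOTE-style oversampling can be sketched as follows. This is a simplified illustration, not the implementation used in [144] or [145]: the function `smote_like`, the neighbour count `k`, and the two-feature rows for churned customers are all assumptions.

```python
import random

# Class mix from the example in the text: 96 retained (0) and 4 churned (1).
labels = [0] * 96 + [1] * 4
always_retained = [0] * len(labels)  # a model that never predicts churn

accuracy = sum(p == y for p, y in zip(always_retained, labels)) / len(labels)
churn_recall = sum(p == 1 and y == 1
                   for p, y in zip(always_retained, labels)) / sum(labels)

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core idea of SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment between base and other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

# Hypothetical two-feature rows for the 4 churned customers.
churned_rows = [[1.0, 20.0], [2.0, 18.0], [1.5, 25.0], [3.0, 22.0]]
synthetic_rows = smote_like(churned_rows, n_new=92)
balanced_churn_count = len(churned_rows) + len(synthetic_rows)
```

The always-retained model reaches 96 percent accuracy while its recall on churned customers is zero, which is why accuracy alone is misleading here. Because the synthetic rows are interpolations of only four real rows, they add little new variance, which reflects the overfitting risk noted above.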

VI. CHURN PREDICTION MODELS
A. BUILDING CHURN PREDICTION MODELS
There are four approaches to building a churn prediction model: traditional machine learning, statistics, graph theory, and deep learning. In Appendix B-A, we summarize the papers that built churn prediction models according to these four categories.
The boundary between these four disciplines has been blurring in recent times. However, Breiman, Leo. described machine learning techniques as having developed alongside data mining since the advent of computers, whereas statistics has focused on mathematics-based hypothesis testing [63].
In statistics, probability models have mainly been used for churn prediction. In particular, probability models have traditionally been used for customer-base analysis [149]. In a customer-base analysis, the churn rate is applied to the survival time estimation when calculating the CLV. Figure 3 illustrates the appearance of churn prediction algorithms by year. A CLV prediction algorithm combines the calculation of a customer's expected revenue with a churn rate prediction model. To calculate CLV in contractual settings, a shifted-Beta-Geometric (sBG) model is used. The sBG model fits the retention rate at each period t by assuming that contract durations follow a shifted geometric distribution whose churn probability varies across customers according to a beta distribution. Accordingly, the sBG model allows a continuous interpretation of customer retention in contractual services, where retention is determined at discrete times [150]. In non-contractual settings, the repeat-buying behavior of customers has been expressed through the negative binomial distribution (NBD) [151]. Further, the time to churn is modeled by a gamma mixture of exponentials, which is also known as the Pareto (of the second kind) distribution. By combining the buying behavior and survival distributions, the CLV can be calculated; this method is simply referred to as the Pareto/NBD model. The Pareto/NBD method has been actively used as a probability model for deriving the CLV until recently [19], [147], [148]. As another method, the beta-geometric/beta-binomial (BG/BB) model can be used. In this model, the beta-geometric distribution fits the retention rate and the beta-binomial distribution fits the consumer purchasing behavior [149]. In non-contractual settings, customer churn tends to be continuous and probabilistic in nature, and it is not easy to define and trace.
To define customer churn in non-contractual settings, researchers have used time-series modification techniques, such as grouping records by customer ID to produce tidy data or calculating behavioral variances [114]. Using the features produced from this processed data, researchers then predict customer churn. The statistical models used here are based on statistical inference and hypothesis testing, and survival analysis with hazard methods is used to build churn prediction models. Machine learning techniques have also begun to be used for customer churn in non-contractual settings. Compared to statistical methods, machine learning techniques are robust to non-linear relationships between features and can learn heterogeneous effects when given diverse features. Graph theory treats churn as a mathematical relationship. It configures graph attributes by feature and by customer, and expresses their relationships as edges. Once a graph is built, churning customers are searched for through graph correlation analysis. Recently, deep learning techniques have emerged as a method of predicting customer churn. Deep learning is an extension of machine learning built on neural network algorithms. However, deep learning has many variants of its own and tends to be classified separately from conventional machine learning. The deep learning techniques often used for predicting customer churn mainly involve training on sparse customer data that has been densified, or a fully-connected neural network fed with latent vectors extracted from an autoencoder and concatenated with static features.
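Returning to the contractual setting, the sBG model mentioned earlier in this subsection can be sketched with its standard recursion: with beta parameters alpha and beta, P(T = 1) = alpha / (alpha + beta) and P(T = t) = P(T = t - 1) (beta + t - 2) / (alpha + beta + t - 1), which implies a retention rate of (beta + t - 1) / (alpha + beta + t - 1) in period t. The parameter values and the function name below are hypothetical and chosen only for illustration.

```python
def sbg_churn_and_survival(alpha, beta, horizon):
    """Return (P(T = t), S(t)) lists for t = 1..horizon under the sBG model."""
    churn_prob, survival = [], []
    p, s = alpha / (alpha + beta), 1.0
    for t in range(1, horizon + 1):
        if t > 1:
            # Recursion for the probability of churning exactly in period t.
            p *= (beta + t - 2) / (alpha + beta + t - 1)
        s -= p  # survival falls by the share of customers lost in period t
        churn_prob.append(p)
        survival.append(s)
    return churn_prob, survival

# Hypothetical beta parameters; in practice they are fitted to cohort data.
sbg_alpha, sbg_beta = 1.0, 3.0
churn_prob, survival = sbg_churn_and_survival(sbg_alpha, sbg_beta, horizon=5)

# Retention rate r(t) = S(t) / S(t - 1), with S(0) = 1.
retention = [survival[0]] + [survival[t] / survival[t - 1] for t in range(1, 5)]
```

Under these assumed parameters the retention rate rises from period to period, which reflects how heterogeneity in churn propensity makes the surviving cohort progressively more loyal.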
The deep learning model is a relatively recent analysis method for predicting churn. According to Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. (2016), deep learning is part of machine learning [64]. However, because its academic significance has recently grown, it has established itself as an academic field in its own right. This is also true for building churn prediction models. Lee, Eunjo, et al. (2018) reported that a model using deep learning predicted customer churn more accurately than a traditional machine learning model in game churn prediction [10]. Fig. 4 schematically represents the model of the team that won this competition. They summarized the features using the memorization and generalization techniques described in feature modification, and increased churn predictability by combining the deep learning model and the traditional machine learning model. Zhang, Rong, et al. (2017) compared the traditional machine learning model and the deep learning model on customer churn prediction problems in the insurance industry [11]. Fig. 5 is a schematic diagram of the Deep and Shallow model they built. Zhang, Rong, et al. (2017) classified and processed the features to be applied to the deep learning model, and then combined the results. In the study, they compared the deep learning-based churn prediction method they developed with traditional machine learning-based churn prediction algorithms. The Deep and Shallow model they built showed excellent churn prediction performance compared to other models. In this study, although deep learning is part of machine learning, it is used as a new breakthrough algorithm for churn prediction problems. Appendix B-A shows the classification of studies on churn based on churn prediction algorithms. Table 2 shows a summary of techniques classified by business field. We were able to confirm that the preferred modelling techniques differ depending on the business field.
Businesses with dense log data and easy access to customer information, such as the games and telecommunications industries, are applying relatively many deep learning techniques using big data, and this trend is growing quickly. In the financial and insurance sectors, since the log data is relatively small and the information obtained from customers does not change to a great degree, there are many statistical approaches using traditional machine learning models or survival analysis. The preferred model differs by business field because the types and cycles of the log data used in each business are different. In addition, each field tends to apply the churn model that best interprets its own log data.

C. PERFORMANCE EVALUATION
In general, machine learning models developed for churn prediction are evaluated using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve or the lift. The ROC curve is drawn by plotting sensitivity on the y-axis and the false positive rate on the x-axis. The ROC curve is a very robust measurement criterion that evaluates classifiers independently of class distribution and misclassification error cost. Here, the x-axis denotes the proportion of non-churn cases that were incorrectly classified as churn, and the y-axis denotes the proportion of churn cases that were classified correctly [121], [136]. Thus, an AUC close to 1 indicates that the churn prediction model accurately distinguishes between the characteristics of churn customers and non-churn customers [13], [45], [55], [95]. On the other hand, some churn studies have used a top 10% decile lift performance metric [13], [67]. Lift is a performance measure obtained by dividing the response rate in each fraction by the baseline response rate. When using the top 10% decile lift as in the above studies, the customer list, sorted in descending order of predicted churn probability, is divided into ten fractions. Subsequently, the lift value of each fraction is derived and the rate at which the curve descends is observed. Additionally, the top-decile lift is often used because it enables marketing budgets to be allocated proportionally to the customers who are most likely to churn according to the churn prediction model [95].
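Both metrics can be computed directly from labels and predicted scores, as in the sketch below. The AUC is computed through its rank interpretation (the probability that a randomly chosen churner is scored above a randomly chosen non-churner), and the lift is computed as the churn rate in the top 10% of scores divided by the overall churn rate. The labels, scores, and function names are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (churner, non-churner) pairs ranked correctly;
    tied scores count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def top_decile_lift(labels, scores):
    """Churn rate among the top 10% of scores divided by the overall churn rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda pair: -pair[0])]
    top = ranked[: max(1, len(ranked) // 10)]
    return (sum(top) / len(top)) / (sum(labels) / len(labels))

# Hypothetical example: 20 customers, 4 of whom churned (label 1).
eval_scores = [(20 - i) / 20 for i in range(20)]  # 1.00, 0.95, ..., 0.05
eval_labels = [1 if i in (0, 1, 3, 7) else 0 for i in range(20)]

auc = roc_auc(eval_labels, eval_scores)
lift = top_decile_lift(eval_labels, eval_scores)
```

In this example the two highest-scored customers are both churners, so the churn rate in the top decile is five times the overall churn rate of 20 percent.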

VII. CONCLUSION
In this study, we compared churn prediction analysis techniques that use log data. Churn analysis is used in the fields of Internet services and games, insurance, and management. Research on churn prediction usually starts from the goal of improving business outcomes. Therefore, a time window is used to select potential churning customers rather than measuring a customer's complete churn. Loss costs for customer churn are calculated by CAC or CLV. In the past, when predicting customer churn, researchers used survival analysis or time series analysis based on statistics, graph theory, and traditional machine learning algorithms. Churn prediction analysis using deep learning algorithms has recently emerged. Deep learning algorithms have been found to outperform other algorithms. This is likely due to large quantities of customer log data being collected via computers and the churn prediction model utilizing the entire set of this acquired data to make predictions. Deep learning algorithms can learn a customer's behavioral patterns from vast amounts of data through a layer-wise stacked neuron structure. Therefore, given minute timestamps and abundant observations, applying this data to deep learning algorithms for the generation of latent features is expected to produce better performance than conventional churn prediction models. This is because log data is now collected over longer periods, which gives deep learning algorithms an advantage over older algorithms in capturing customers' latent status. In other words, deep learning algorithms are in the spotlight today because of the vast amount of data used in modern churn prediction and their ability to capture minute changes. As mentioned earlier in the text, traditional churn prediction algorithms, including statistical methods, are still actively used today. This is because the churn prediction model with the best performance varies depending on the data format.
Churn prediction using deep learning is a new solution with a structure well suited to modern churn datasets. Therefore, to solve the problem at hand, readers will need to understand the format of their churn dataset and apply a suitable algorithm to the churn prediction problem.
Furthermore, we also outlined a performance evaluation method for comparing the various churn prediction algorithms used from the past to the present. Most churn prediction models are related to customer relationship management. For example, there may be performance differences depending on whether the churn prediction model is robust against false positives or false negatives. According to the research in this paper, many articles use AUC as a performance measurement method aside from standard precision. In general, as there are fewer churn customers than non-churn customers, a performance measure focused on churn customers is needed. The ROC curve is a graph of the rate at which the model correctly predicts churn customers against the rate at which retained customers are predicted to be churn customers. Therefore, it is a performance measurement method that focuses on the prediction of churn customers. In this study, we comprehensively compared the churn prediction problems. This paper helps researchers find a method that meets their needs among various churn prediction algorithms. Furthermore, this paper is expected to be used to improve services and build better churn analysis models.

VIII. LIMITATIONS AND ISSUES FOR FURTHER RESEARCH
Churn studies in different fields are undoubtedly helpful in grasping a comprehensive view of churn and in exploring various features to apply to churn models. However, as each study uses different sizes and types of features in its data, the set of studies presented in this paper cannot be compared on a common performance basis. Accordingly, although researchers may be able to discern through our study whether the model they constructed is widely used, they would not be able to determine which model is suitable and has the best performance for their study. Thus, in a future study, we intend to combine the feature engineering of the fields introduced in this paper with open churn datasets, construct various churn prediction models, including a deep learning model, and then conduct experiments comparing and evaluating the suitability of each model.

APPENDIX A CHURN ANALYSIS IN VARIOUS BUSINESS FIELD
See Table 3.

APPENDIX B CHURN PREDICTION MODELS
A. CHURN PREDICTION MODELS
See Table 4.