A Comparative Study on Contract Recommendation Model: Using Macao Mobile Phone Datasets

Bordering Mainland China, Macao is a city with a diverse population and a world-famous gambling industry. Its residents are not only local residents; a large share come from Mainland China. Living habits differ across these resident groups: some live and work in Macao most of the time, while others frequently cross the border between Macao and Mainland China. These differences result in different demands for mobile phone contract services. Machine learning and data mining are powerful tools used by telecom companies to analyze the behavior of their customers. This paper aims to build a contract service recommendation model suitable for Macao telecom companies, one that can accurately recommend the best-fit contract services according to call data records (CDRs). Based on data mining algorithms and a large number of comparative experiments, this study examines a variety of factors that affect contract recommendation, in order to find the combination of factors with the greatest impact on recommendation accuracy and thereby improve the efficiency of the model. In addition, this study compares five classification algorithms: Bayesian, logistic, random trees, decision tree (C5.0), and KNN (k-nearest neighbor). The results are analyzed with different metrics, such as gain, lift, ROI (return on investment), response, and profit. Experimental results demonstrate that the best classifier is the decision tree (C5.0); the contract services obtained by this method achieve a recommendation accuracy close to 90%, which successfully reaches the expected goal.

but the key is whether the recommended contracts are appropriate. In the past, marketers did not use scientific, reasonable methods to recommend contract packages. Such recommendations depend on subjective human judgment, and the selected contracts are not necessarily the best. Therefore, this paper proposes a data mining method to help customers choose contracts.
Many studies have applied machine learning algorithms to CDRs. Previous work mainly focuses on customer clustering, urban population mobility, and customer churn early warning. Contract service recommendation is one of the important applications of machine learning, yet it has received relatively little attention so far. This paper uses customer data from Macao as the dataset because Macao's diversified population makes it a typical and significant research subject. Studying the differences in communication behavior between Macao residents and Mainland Chinese residents in Macao can help a mobile phone operator better understand customers' preferences and the basis on which they choose contract services. It helps telecom companies recommend contracts to customers more accurately and design more popular packages. In short, the main contributions of this paper are as follows:
• According to the location information in CDRs, each CDR is categorized into local airtime, international direct dial (IDD), roaming, local data traffic, Mainland China data traffic, Hong Kong data traffic, and short message service (SMS). This data consolidation method is consistent with the design principles of contract services and can more accurately reflect the real preferences of customers.
• Using a decision tree model, we rank the factors by importance score and, according to this ranking, determine the best combination of input factors through a large number of experiments.
• Five classification algorithms, including Bayesian, logistic, random trees, decision tree (C5.0), and KNN (k-nearest neighbor), are compared and analyzed. Through different evaluation methods, such as gain, lift, ROI, response, and profit, decision tree (C5.0) is shown to be the optimal contract recommendation algorithm.
This paper is organized as follows: Section 2 briefly presents related work on several relevant concepts, such as mobile phone dataset construction, city user classification, and the machine learning classifiers used in this study. Section 3 describes the experimental setup and discusses the experimental results with comparisons. Section 4 concludes the study.

A. MOBILE PHONE DATASET CONSTRUCTION
With the rapid development of mobile communication in recent years, mobile phones have become increasingly popular. The rapid growth in the number of mobile phones has had a huge impact on economic and social life. Alongside mobile phones, mobile sensors have been widely adopted because of their convenience; these sensors record personal conversations, movements, and activities [2]. Due to the wide application of smartphones, sensor information has become very abundant. Many studies [3], [4] have collected large human behavior datasets to further understand human interaction. These data are used to monitor urban transport and human activities [5], [6]. For example, they help to monitor population movement in urban areas [7]-[9] or to understand the spread of diseases in real time [10]-[12]. The widespread use of mobile applications also enables financial transactions [13] and entertainment [14] through mobile devices.
CDR has four major types [15]: voice call, SMS, MMS (Multimedia Messaging Service), and data traffic. In the past, voice calls and SMS were representative of traditional communication. With the popularity of the mobile Internet and smartphones, the MMS business has been eliminated, while data traffic has increased; meanwhile, voice calls and SMS remain stable with a slight reduction. A voice call or SMS record [16] contains the encrypted cell phone numbers of the caller and callee, a timestamp giving the date and time of the call, the duration, and the initial cell tower ID [17]. With the latitude and longitude of a local cell tower dataset, the caller's and callee's geographic locations can be recovered. Some operators separate a voice call into two independent records, an outbound and an inbound call [18]. The former only contains the caller's initial location, while the latter contains the callee's location [19]. When both caller and callee are clients of the same operator, both locations are provided. Data usage records, in comparison with voice call records, do not include a callee's phone number, but add the data usage, measured in KB. Data usage records occur more frequently than voice calls and SMS, because smartphones stay online all the time and apps connect to the Internet every few minutes in the background, even when users are not using their phones [20]. In the last decade, studies on CDRs have focused more on voice call records, which are considered an ideal dataset for social relationship [21] and mobility research [22]-[24]. However, with the recent high-speed development of social networks such as Twitter, Facebook, and Weibo, many researchers have published social relationship work using data from social networks instead of voice call records [25]-[27]. Nevertheless, in the human mobility and urban activity research field, CDRs with data traffic still show great value.
In countries where smartphones are popular, data usage records provide a low-cost, near-real-time source of location data.
Another kind of valuable information in CDRs is individual cell phone user data, including user name, age, gender, contract plan, registration ID certificate type, billing address, and average revenue per user (ARPU) [28]. However, for personal privacy reasons, most CDR data used in the literature are anonymized, although some studies obtained individual social characteristics with the help of local operators. For example, Frias-Martinez studied the gender characteristics and automatic gender recognition of mobile phone users in developing economies based on behavioral, social, and mobility variables, using CDRs of about 10,000 users from developing countries whose gender was known a priori [29], [30].

B. CITY USERS' CLASSIFICATION
For analyzing users' calling behavior tendencies, Macao's mobile users can be divided into residents, commuters, and visitors, which facilitates convenient reading and reasoning over big data. The data come from the mobile network and record a user's position during a call. The behavior categories, derived from the definitions of the demographic agency, are described below for a given reference area A [31]:
• A Resident is an individual who lives and works in A, so his/her presence in A is significant across all time periods.
• A Commuter is an individual who lives in a different zone B but works/studies in zone A. It is expected that the presence in Zone A is almost entirely concentrated on work/study days and work/study hours.
• A Visitor is an individual who lives and works/studies outside of A and visits A only once or occasionally.

C. CLASSIFICATION METHODS
Supervised learning is applied to classification problems. Its learning process is divided into two stages: training and testing. In the training stage, a training dataset with known target class labels is used to construct a classification model. Then, in the testing stage, instances with unknown target class labels are classified by the generated model. The classification algorithms considered are listed in Table 1 and briefly introduced as follows.

1) BAYESIAN CLASSIFICATION
Bayesian classification [32] predicts the probability of class membership. The naive Bayes algorithm assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. Naive Bayes scales well with the number of predictors and rows and builds models quickly. The predicted probability is derived from Bayes' theorem: the probability of event X given event Y, P(X|Y), is directly proportional to the probability of event Y given event X multiplied by the prior probability of X, P(Y|X)P(X) [33]. The Bayesian formula is shown below:

P(X|Y) = P(Y|X) P(X) / P(Y)
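As a minimal sketch of how a naive Bayes classifier applies this formula, the following toy Python implementation estimates P(class) and P(feature|class) from counted frequencies with Laplace smoothing and picks the class with the largest posterior. The feature names and toy data (heavy data user, frequent roamer) are hypothetical, chosen only to echo this paper's setting; this is not the exact model used in the study.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-feature value counts per class."""
    prior = Counter(labels)
    cond = defaultdict(Counter)  # (feature_index, class) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            cond[(j, y)][v] += 1
    return prior, cond

def predict_nb(prior, cond, row):
    """Return the class maximizing P(y) * prod_j P(x_j | y), Laplace-smoothed."""
    total = sum(prior.values())
    best, best_p = None, -1.0
    for y, c in prior.items():
        p = c / total
        for j, v in enumerate(row):
            counts = cond[(j, y)]
            p *= (counts[v] + 1) / (sum(counts.values()) + 2)  # smoothing
        if p > best_p:
            best, best_p = y, p
    return best

# Hypothetical binary features: (heavy_data_user, frequent_roamer)
X = [(1, 1), (1, 0), (0, 0), (0, 0), (1, 1)]
y = ["cross-border", "local", "local", "local", "cross-border"]
prior, cond = train_nb(X, y)
print(predict_nb(prior, cond, (1, 1)))  # "cross-border"
```

Because of the independence assumption, each feature contributes one multiplicative factor, which is why naive Bayes scales linearly in the number of predictors.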

2) RANDOM TREES
Random trees is an ensemble classifier that contains many decision trees. Each node of a decision tree represents a resolution rule on a subset of the attributes. Results are produced by voting over the trees, a process that minimizes overtraining. A study [34] revealed that random trees are effective in cost-sensitive learning, sampling techniques, and predicting customer churn.

3) K-NEAREST NEIGHBOR (KNN)
KNN is an instance-based classifier that operates on stored known instances [35]. Given an unknown sample, the classifier searches the pattern space for the K training samples closest to it, where proximity is defined by Euclidean distance. The unknown sample is then assigned to the most common class among its K nearest neighbors. The Euclidean distance between samples x and y is shown below:

d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )
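A minimal sketch of this procedure in Python follows: compute the Euclidean distance above to every training sample, take the K nearest, and vote. The toy feature vectors (local airtime, roaming minutes) and labels are hypothetical illustrations, not the study's actual features.

```python
import math

def euclidean(a, b):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train_X, train_y, x, k=3):
    """Label x by majority vote among its k nearest training samples."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: euclidean(p[0], x))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

# Hypothetical feature vectors: (local airtime minutes, roaming minutes)
X = [(300, 5), (280, 2), (50, 120), (60, 150), (310, 8)]
y = ["local", "local", "cross-border", "cross-border", "local"]
print(knn_predict(X, y, (55, 130)))  # "cross-border"
```

Note that because distance dominates the vote, features on very different scales would normally be normalized first; the toy data above skips that step for brevity.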

4) DECISION TREE
This method uses a tree structure to establish the classification model by dividing the dataset into smaller and smaller subsets. Leaf nodes represent decisions. A decision tree classifies cases according to their feature values: each internal node represents a feature to be tested, and each branch represents a value of that feature. Cases are classified starting from the root node and routed according to their feature values. Decision trees can process both categorical and numerical data [36].

5) LOGISTIC REGRESSION
Logistic regression is a probabilistic statistical model used to predict a categorical variable. The predicted variable depends on one or more explanatory variables, either numerical or nominal [37]. A study [38] shows that, after data transformation, logistic regression performs well. Assume there are m samples (x_i, y_i), i = 1, 2, . . . , m, where y_i ∈ {−1, 1} is the binary class label of sample i. Then, in logistic regression for binary classification, the occurrence probability of the positive class is modeled as:

P(y_i = 1 | x_i) = 1 / (1 + exp(−(w^T x_i + b)))

D. C5.0 ALGORITHM
C5.0 is a newer decision tree algorithm developed by Quinlan on the basis of C4.5 [54]. It contains all the functions of C4.5 and integrates a series of new techniques, the most important of which is ''boosting'' [55], used to improve the accuracy of sample identification. However, C5.0 with boosting is still a work in progress and is not yet available in practical applications. The C5.0 algorithm has many desirable characteristics:
• Large decision trees can be viewed as a set of rules that are easy to understand.
• The C5.0 algorithm handles noisy and missing data.
• The C5.0 algorithm addresses overfitting and error pruning.
• In classification, the C5.0 classifier can identify which attributes are relevant to the classification and which are independent of it.
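The logistic model in the subsection above can be sketched in a few lines of Python: the class probability is the sigmoid of a weighted sum of the features. The weights, bias, and feature meanings (roaming minutes, Mainland data traffic) below are purely hypothetical values chosen for illustration, not fitted parameters from this study.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prob(x, w, b):
    """P(y = 1 | x) under the logistic model: sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical weights: roaming minutes and Mainland data traffic (MB)
# push a customer toward the cross-border contract (class y = 1).
w = [0.02, 0.01]
b = -3.0
p = logistic_prob([120, 200], w, b)  # linear score: 0.02*120 + 0.01*200 - 3.0 = 1.4
print(round(p, 3))
```

In practice the weights w and bias b would be fitted by maximum likelihood over the m training samples rather than set by hand.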

E. INFORMATION GAIN
A decision tree is a collection of branches, leaves (indicating, e.g., the quality of a credit rating), and nodes (specifying the tests to be performed). Decision trees and rules classify observations described by variables. They recursively divide a region of the variable space into subregions according to the variable carrying the largest amount of information. As the criterion for selecting the most informative variable, the information gain ratio criterion is adopted [56]. Other variable selection methods include the chi-square contingency table statistic, the Gini coefficient, and the G-statistic.
The information gain criterion is based on Shannon's information theory, which states that the information content of an event is inversely related to its probability and can be measured in bits as the negative base-2 logarithm of that probability. An observation belongs to group C_i (e.g., good or bad credit) with an a priori probability P(C_i). The information (number of bits) required to identify the group of an observation in dataset S is the entropy:

I(S) = − Σ_i P(C_i) log2 P(C_i),

where I(S) is the information entropy of dataset S and P(C_i) is the a priori probability that an observation belongs to group C_i. Next, a test is performed on variable A to create branches and partition dataset S into n subsets S_1, . . . , S_n. The expected information requirement after partitioning is then:

E(A) = Σ_{j=1..n} (|S_j| / |S|) I(S_j).

By branching on variable A, less information is needed to separate the two groups of good and bad credit; this difference is called the information gain, or mutual information:

Gain(A) = I(S) − E(A).

To build a decision tree effectively, it is important to branch on the variables that yield the most information. However, selecting variables by raw information gain tends to favor tests that partition the data into many small subsets. Therefore, the normalized gain ratio criterion is normally preferred:

GainRatio(A) = Gain(A) / SplitInfo(A), where SplitInfo(A) = − Σ_{j=1..n} (|S_j| / |S|) log2 (|S_j| / |S|).

The induction of a decision tree is a process of minimizing entropy by selecting variables as branches, which can equivalently be regarded as maximizing information gain. When a node of the tree contains observations from a single group only, its entropy is zero, which means the classification decision is fully determined for the observations belonging to that node [57]. This process is repeated until both groups are fully separated. A variable can be used multiple times in a tree.
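The entropy and information gain quantities above can be computed directly. The short sketch below implements I(S) and Gain(A) for nominal attributes on a toy dataset in which attribute 0 separates the two groups perfectly and attribute 1 carries no information; the data are illustrative, not from the paper.

```python
import math

def entropy(labels):
    """I(S) = -sum_i P(C_i) log2 P(C_i)."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(rows, labels, j):
    """Gain(A) = I(S) - sum_v (|S_v|/|S|) I(S_v) for attribute index j."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(r[j] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[j] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data: attribute 0 separates the classes perfectly, attribute 1 does not.
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = ["good", "good", "bad", "bad"]
print(information_gain(rows, labels, 0))  # 1.0: fully informative
print(information_gain(rows, labels, 1))  # 0.0: uninformative
```

A tree inducer in the C4.5/C5.0 family would evaluate this quantity (or the gain ratio) for every candidate attribute at each node and branch on the best one.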

F. SPSS MODELER
IBM SPSS Modeler is a data mining and text analysis software application developed by IBM. It is used to build predictive models and perform analysis tasks. Its visual interface allows users to apply statistical and data mining algorithms without programming. The Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology is a general modeling process that can be applied to various industrial and commercial problems. Following CRISP-DM, we adopt a five-phase modeling cycle:
• Problem understanding, including problem objectives, assessment, data mining objectives, and project planning. This is the most important stage of data mining.
• Data understanding and preparation, including the collection of raw data, data exploration, verification of data quality, and the selection, cleaning, construction, integration, and discretization of data.
• Modeling. This phase includes selecting modeling techniques, generating test designs, establishing and evaluating models.
• Assessment. Evaluate how data mining results help an analyst achieve their goals. This phase includes evaluating the results, reviewing the data mining process, and determining the next step.
• Deployment. Focus on integrating new knowledge into daily business processes to solve initial business problems. This phase includes planning deployment, monitoring and maintenance, generating final reports, and reviewing projects.

III. EXPERIMENTAL STUDY
This part uses the user data of Macao telecom operator to carry out the modeling process based on the decision tree model and evaluate the model.

A. MODELLING APPROACH
The modeling approach is shown in Figure 1. It begins with exploratory data analysis, which summarizes the data visually and finds patterns and anomalies; the data are then preprocessed to handle incomplete, inconsistent, and/or missing values, and feature engineering is performed to select the features most likely to affect the target class attribute.

1) EXPLORATORY DATA
Single-variable frequency analysis is used for exploratory data analysis (EDA), describing the key characteristics of each attribute, including minimum and maximum values, mean, standard deviation, etc. It is also used to generate value distributions and to identify missing values and outliers.

a: DATA SELECTION AND IMPORT
In this experiment, the data mining sources include two parts: customer information and call data records. In particular, the ARPU and CDR values in Table 2 are averages over three months.

b: DATA INTEGRATION
Data integration refers to the combination of data from multiple data sources, which may include multiple databases or files, into a complete data collection. Data integration is not a simple concatenation of data but a unified, standardized processing of heterogeneous data. In this paper, we consolidate CDRs to facilitate data analysis. Through data integration, Table 3 is obtained from Table 2.
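The consolidation step can be sketched as a simple roll-up: raw per-event CDRs are summed per customer into one usage profile per category. The record layout and category names below (e.g. `local_airtime`, `mainland_data_mb`) are hypothetical placeholders echoing the categories described in the text, not the operator's actual schema.

```python
from collections import defaultdict

# Hypothetical raw CDR rows: (customer_id, usage_category, amount)
cdrs = [
    ("A01", "local_airtime", 12.5),
    ("A01", "mainland_data_mb", 300),
    ("A01", "local_airtime", 7.0),
    ("B02", "idd_minutes", 4.0),
    ("B02", "sms", 1),
]

def integrate(records):
    """Roll raw CDR events up into one usage profile per customer."""
    profile = defaultdict(lambda: defaultdict(float))
    for cust, category, amount in records:
        profile[cust][category] += amount
    return {c: dict(v) for c, v in profile.items()}

print(integrate(cdrs)["A01"])  # {'local_airtime': 19.5, 'mainland_data_mb': 300.0}
```

Each resulting profile corresponds to one row of the consolidated table, with one column per usage category.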
c: DATA UNDERSTANDING
First, this study compares the ARPU characteristics of different age groups, as shown in Figure 2. Users aged 45-59 form the largest group, indicating that the users in this batch of experimental data skew older. In the ARPU distribution, users with ARPU between MOP100 and MOP150 are the most numerous, and more than half of the customers have an ARPU below MOP150. This study also compares the differences in communication habits between Macao residents and Mainland Chinese residents. Macao residents are local residents, working and living in Macao, while the Mainland Chinese residents are commuters, working or studying in Macao; most of them do not live in Macao but in Zhuhai, a neighboring city in Mainland China. Because Mainland Chinese residents often travel between Macao and Mainland China, their demands for mobile data and roaming calls in Mainland China are much stronger than those of Macao residents. As can be seen from Figure 3, in terms of data traffic and roaming call minutes in Mainland China, the demand from Mainland Chinese residents is about twice that of Macao residents, while IDD and local airtime ARPU are almost the same.
The reason for this phenomenon is that telecom companies provide two types of contracts: a Macao local contract, which covers only local airtime and Macao local mobile data, and a cross-border contract, in which the voice and data allowances can be used in both Macao and Mainland China. Each contract type is divided into four tiers, as shown in Table 4: the larger the value, the higher the tier and the higher the corresponding cost.

2) DATA PREPROCESSING
Datasets can contain noise, missing values, and inconsistent data, so data preprocessing is essential to improve data quality and time efficiency. Building a prediction model on raw data is inadvisable, because it produces weak results, and some machine learning algorithms cannot extract meaningful information from it. The data preprocessing techniques in this study include data cleaning, integration, discretization, and reduction. Data transformation is the process of normalizing and aggregating data to further improve the efficiency and accuracy of data mining. The nominal-to-binomial operator is used to convert the selected nominal values to their equivalent binomial values. The fields transformed from nominal to binomial include gender, ID type, and contract type. Table 5 shows the value ranges and descriptions of the transformed data.
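The nominal-to-binomial transformation can be sketched as mapping each nominal field to 0/1 flags. The field names and toy customer records below are hypothetical, standing in for the gender, ID type, and contract type fields mentioned in the text.

```python
def nominal_to_binomial(records, field, positive_value):
    """Map a nominal field to 0/1 flags (1 where the field equals positive_value)."""
    return [1 if r[field] == positive_value else 0 for r in records]

# Hypothetical customer records with nominal fields.
customers = [
    {"gender": "F", "contract_type": "cross-border"},
    {"gender": "M", "contract_type": "local"},
    {"gender": "M", "contract_type": "cross-border"},
]
print(nominal_to_binomial(customers, "gender", "F"))             # [1, 0, 0]
print(nominal_to_binomial(customers, "contract_type", "local"))  # [0, 1, 0]
```

A nominal field with more than two values would get one such flag per value (one-hot encoding); for binary fields like gender, a single flag suffices.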

c: DATA REDUCTION
Data reduction is the process of reducing the representation of the data while still producing similar results. Discretization (or binning) is applied to numeric attributes by converting and grouping continuous values into discrete categories (or interval levels), because some data mining algorithms accept only nominal attributes and cannot process numeric ones. Likewise, using raw numeric data in a model may adversely affect its performance. For example, the ARPU, age, and net days of use in a dataset can be discretized. However, in our experiments, discretization was found to have little effect on computation speed, so the discretization step was skipped.
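Binning as described above amounts to mapping each numeric value to the label of the interval it falls in. The bin edges and labels below are hypothetical ARPU bins in MOP chosen for illustration; the study itself ultimately skipped this step.

```python
def discretize(value, edges, labels):
    """Map a numeric value to the label of the first interval whose upper edge covers it."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]  # values above the last edge fall in the top bin

# Hypothetical ARPU bins (MOP): (-inf,100], (100,150], (150,300], (300,inf)
arpu_edges = [100, 150, 300]
arpu_labels = ["low", "medium", "high", "very_high"]
print([discretize(v, arpu_edges, arpu_labels) for v in [80, 120, 500]])
# ['low', 'medium', 'very_high']
```

After this mapping, a numeric attribute such as ARPU becomes a nominal attribute that any of the classifiers discussed earlier can consume.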

3) FEATURE ENGINEERING
Feature engineering transforms raw data into features that help improve the overall performance of the prediction model. In most cases, reducing the number of features improves the accuracy of the algorithm: features that may influence the target attribute are retained, while features likely to cause overfitting are discarded. Feature selection is applied to reduce the number of predictors, which reduces errors and improves accuracy. The dataset is then divided into training and test samples: the mining model is built on the training set, and its correctness is verified on the test set.
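The train/test division mentioned above (a 70/30 split is used later in the evaluation) can be sketched as a seeded shuffle-and-cut. The toy rows and labels are placeholders; the split ratio matches the one reported in the evaluation section.

```python
import random

def train_test_split(rows, labels, test_ratio=0.3, seed=42):
    """Shuffle indices reproducibly, then cut into training and test portions."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])

# Hypothetical dataset of 10 samples.
rows = [(i, i * 2) for i in range(10)]
labels = ["local" if i < 5 else "cross-border" for i in range(10)]
X_tr, y_tr, X_te, y_te = train_test_split(rows, labels)
print(len(X_tr), len(X_te))  # 7 3
```

Fixing the seed makes the split reproducible across runs, which matters when comparing classifiers on the same partition.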

4) MODEL BUILDING
In this study, we use the automatic modeling function in SPSS Modeler, which provides a fast modeling and verification process. Since preprocessing (replacement of missing values, data conversion, etc.) has already been carried out manually before modeling, model validation can be performed directly.
SPSS Modeler reports the importance of each feature with respect to the target attribute; the weights are computed using the Pearson correlation coefficient, as displayed in Figure 4. From the ranking of the factors that affect contract recommendation, combined with the content of a contract, we find that the last several factors are of low importance and are not considered in any contract, so they should have little effect on recommendation accuracy. But if these input values are removed, will the accuracy of contract recommendation actually be affected? To answer this question, the following tests are done. We use the under-sampling method to deal with class imbalance, in order to achieve a high classification rate and avoid bias toward majority-class examples [58]-[61]. The C5.0 algorithm is used for modeling, contract type is taken as the target value, and multiple tests are run to determine which input factors achieve the highest accuracy under different input conditions:
Test 1. With contract type as the target value and all other factors as input values, the accuracy is 86.33%.
Test 2. Excluding Hong Kong data traffic, SMS, and gender, the accuracy drops to 85.2%.
Test 3. Excluding ID type and age, the accuracy increases to 88%.
Test 4. Excluding local airtime, the accuracy increases to the highest value, 88.4%.
After determining the input factors, the decision tree is built with the C5.0 classification algorithm.
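The under-sampling step mentioned above can be sketched as randomly dropping majority-class samples until every class has as many samples as the smallest class. The toy 9:3 imbalanced dataset below is hypothetical; SPSS Modeler performs the equivalent balancing internally.

```python
import random
from collections import Counter

def undersample(rows, labels, seed=7):
    """Randomly keep only n_min samples per class, where n_min is the
    size of the smallest class, so all classes end up balanced."""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    n_min = min(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for y, rs in by_class.items():
        for r in rng.sample(rs, n_min):
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels

# Hypothetical imbalanced labels: 9 majority vs. 3 minority samples.
rows = list(range(12))
labels = ["A"] * 9 + ["B"] * 3
_, balanced = undersample(rows, labels)
print(Counter(balanced))  # Counter({'A': 3, 'B': 3})
```

Balancing before training prevents the classifier from trivially predicting the majority contract type for everyone, at the cost of discarding some majority-class data.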

B. MODEL EVALUATION
In this paper, we choose the best input values and the best classification algorithm by comparing different algorithms on contract recommendation accuracy. In addition, the rationality of the algorithms is verified by evaluating gain, lift, ROI, profit, and response.
The dataset is divided into training and test samples: 70% of the data is used as training samples to build the model, and 30% is used as test samples to verify its correctness. If the prediction results meet the predetermined business objectives, the model can be applied in practice: using the telecom enterprise's existing customer resources together with the model's predictions, the most suitable package can be selected for each customer in order to reduce customer churn.
This section compares the performance of different classifiers to select the best method for contract recommendation. We compare five classification algorithms (Bayesian, logistic, random trees, C5.0, and KNN) on contract recommendation accuracy, based on five indicators: gain, lift, ROI, profit, and response. In this way, the optimal contract recommendation algorithm can be determined.

1) GAIN
For gain, the ideal behavior is to reach a very high cumulative gain quickly and then approach 100%. As shown in Figure 6, C5.0 is the algorithm that reaches the highest cumulative gain earliest and approaches 100%.
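A cumulative gain value at a given percentile can be computed as sketched below: rank customers by the model's predicted score and measure what fraction of all positive responders falls in the top slice. The scores and response labels are hypothetical toy data, not the study's results.

```python
def cumulative_gain(scores, actuals, percentile):
    """Fraction of all positives captured in the top `percentile` of
    customers, ranked by descending predicted score."""
    ranked = [a for _, a in sorted(zip(scores, actuals), reverse=True)]
    k = max(1, int(len(ranked) * percentile))
    return sum(ranked[:k]) / sum(actuals)

# Hypothetical predicted scores and actual outcomes (1 = recommendation accepted).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actuals = [1, 1, 1, 0, 1, 0, 0, 0]
print(cumulative_gain(scores, actuals, 0.5))  # top 50% captures 0.75 of positives
```

Plotting this value over all percentiles yields the gain curve of Figure 6; a random model would follow the diagonal, while a good model climbs steeply toward 100% early.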

2) LIFT
For lift, it is ideal to maintain a high lift value for a period, or decline slowly for a period, and then quickly drop to 1. As shown in Figure 7, C5.0 is slightly ahead of random trees; both maintain a high cumulative lift over a long interval and then rapidly decline to 1. The other algorithms do not perform as well as C5.0 and random trees.

3) ROI
For ROI, the ideal behavior is to maintain a high cumulative ROI over an interval and then drop rapidly to the general level. Similar to the lift results, Figure 8 shows that C5.0 and random trees are in the lead, and the other algorithms do not perform as well.

4) RESPONSE
For response, the ideal situation is to maintain a period of high cumulative response, and then decline rapidly. From Figure 9, it can be seen that C5.0 and Random Trees are in the lead.

5) PROFIT
The ideal profit curve rises rapidly in the early stage and, after peaking around the 50% quantile, descends rapidly. As shown in Figure 10, C5.0 performs best.

6) RECOMMENDED ACCURACY
The comprehensive comparison of the recommendation models is shown in Table 7. The C5.0 decision tree classification method achieves high accuracy, reliability, and stability in modeling contract recommendation, and it can recommend contracts well according to customers' consumption behavior.
This section introduced the method, processing flow, and evaluation of contract recommendation based on the decision tree algorithm. First, the collected customer consumption data were selected, cleaned, integrated, transformed, and discretized into data suitable for mining. Then, SPSS Modeler was used to build and evaluate the C5.0 model. Finally, several other classification models were evaluated and compared. The experimental results show the effectiveness of this method.

IV. CONCLUSION
Based on the results obtained in this study, it is concluded that CDRs truly reflect customers' communication behavior preferences and affordable price levels. Using this, telecom enterprises can understand customer behavior more accurately, recommend the best-fit contract, improve customer satisfaction, and reduce the churn rate.
This study applies exploratory data analysis and feature engineering to a telecom dataset and uses these techniques to improve the performance of five classifiers for contract recommendation. It also discusses the prediction results of the different classifiers. Experimental results demonstrate that C5.0 outperforms the other classifiers on almost all evaluation metrics. Finally, this study recommends C5.0 as the algorithm for contract recommendation modeling.
At the same time, some problems remain to be studied. First, does ARPU really reflect a customer's consumption level? In real life, we observe a phenomenon we call ''consumption inhibition'': when the data allowance of a customer's contract is about to be exhausted, the customer reduces mobile Internet usage for the remaining days of the month to avoid charges beyond the contract. Such customers do not resist spending more money to upgrade the contract service but are reluctant to exceed the cost of the current contract. How can the real data consumption demand and acceptable price of such customers be estimated? Second, if a customer has multiple phone numbers with different telecom operators, how can the customer's real ARPU be calculated? Third, in the actual contract recommendation process, apart from price and contract content, are there other important factors that affect contract selection, such as signing a more expensive contract because of a new mobile phone, or upgrading a contract because a particular APP requires more data?
In the future, we plan to study the behavioral phenomenon of ''consumption inhibition'' and to develop an algorithm that can estimate the potential consumption level of customers. We will also focus on the correlation between APPs and mobile data usage, and on clustering and promotion recommendation based on mobile Internet content data.