Predicting airline additional services consumption willingness based on high-dimensional incomplete data

Predicting passengers' purchase willingness helps airlines promote ancillary services; however, the datasets stored in passenger travel information systems are often high-dimensional and incomplete. This study develops a method for predicting airline additional service consumption willingness from high-dimensional, incomplete datasets using a triple-layer hybrid PSO-XGBoost model, which consists of an incomplete data processing layer, a high-dimensional data processing layer, and a predicting layer. The first two layers convert the raw dataset into a complete, low-dimensional dataset, which is then input into the predicting layer to train the XGBoost model, with hyperparameters optimized by the PSO algorithm under 10-fold cross-validation. The experimental results show that the proposed method outperforms traditional machine learning models, achieving the highest AUC of 0.9879. The findings help predict passengers' consumption intentions for airline additional services and support efficient, low-cost precision marketing by airlines.


I. INTRODUCTION
Offering additional services brings airlines new profit growth, yet the topic has received little attention and related studies are scarce. The existing literature on consumption willingness for airline additional services mostly collects data through questionnaires [1], [2] or scenario hypotheses [3], [4] and applies consumer psychology theory to build traditional statistical models. Studies [5], [6] mine customer opinion survey data to analyze the factors affecting customer satisfaction, using methods such as exploratory factor analysis and classifiers such as Naive Bayes. Such survey data are subjective and limited, however, and real passenger travel data have not been used in any of these studies. The work in [7] uses the Potluck Problem method to predict cargo demand for a given airline on a given route, but the method has not been applied to exploring customer consumption intentions. With the development of modern information technologies such as Internet Plus and big data, the real travel records of a large number of passengers now provide an objective basis for accurately mining passengers' willingness to consume additional services.
Differences in data storage structures and access methods among passenger service information subsystems have caused a large amount of missing and redundant passenger travel data [8]. In addition, because the consumer group for additional services is small, data collection is insufficient, which leads to an incomplete and high-dimensional air passenger travel dataset and further reduces prediction accuracy. However, these problems have received little attention, and few solutions have been proposed.
Given this situation, seat selection, regarded as the most valuable additional service for long-distance passengers, is chosen as the research object, and a dataset specific to seat selection is used. The eXtreme Gradient Boosting (XGBoost) model is introduced into the field of consumer behavior prediction for airline additional services, and a triple-layer XGBoost model optimized by the Particle Swarm Optimization (PSO) algorithm is proposed for prediction. An incomplete data processing layer and a high-dimensional data processing layer are set up to address the severe missingness, imbalance, and high dimensionality of the dataset, respectively. The processed dataset is then input into the predicting layer.
The three-layer PSO-optimized XGBoost prediction model proposed in this paper accurately predicts passengers' purchases of paid seat selection ancillary services on an incomplete and high-dimensional airline passenger travel dataset, with the advantages of fast speed and high accuracy. This study can support airlines' decisions on precision marketing of ancillary services. Additionally, this paper mines real airline passenger travel data and paid seat selection consumption data to build a characteristic portrait of passengers who pay for seat selection, which provides a scientific basis for designing airlines' ancillary service marketing campaigns.
The rest of this paper is organized as follows. Section II describes in detail the incompleteness and high dimensionality of airline passenger travel datasets. Section III specifies the design of the proposed three-layer PSO-optimized XGBoost prediction model, including the theoretical basis of the model, the data pre-processing procedure, and the working process of the three-layer model. Section IV evaluates the effectiveness of the incomplete data processing layer, the high-dimensional data processing layer, and the PSO optimization, and demonstrates the superiority of the proposed method by comparing it with single-layer traditional machine learning models. Conclusions are presented in Section V.

II. AIR PASSENGER TRAVEL DATA
The study focuses on exploring the classification of passengers by their consumption willingness for seat selection with data obtained from 23,432 passenger travel samples collected by an airline from 2016 to 2020.
The dataset contains 652 features as inputs, such as passenger gender, flight number, count of seat selections, preferred flights, and accumulated mileage, and a binary output indicating whether a passenger has purchased seat selection services. The dataset includes five types of features, as illustrated in Table I. Table I shows that the input space is high-dimensional, containing many redundant features as well as various aspects of passengers' flight-related information. This not only makes it difficult for experts to identify the critical factors behind consumption intentions for additional services but also reduces prediction accuracy and efficiency. The missing data rate and variance distribution of the collected dataset are shown in Figure 1. Note that the missing values in the raw dataset have already been replaced with 0 by the data storage system, so a 0 in a numerical feature cannot be identified as either missing or real. Therefore, the information loss in the dataset is reflected by both the missing rate of categorical features and the variance distribution of normalized numerical features. Figure 1(a) shows that only about 5 % of the categorical features are relatively complete, and the remaining 95 % have over 60 % missing data. Figure 1(b) illustrates that about 25 % of numerical features have near-zero variance (less than 10^-10), and most of the others have small variance. Owing to these traits, the dataset suffers serious information loss.
There are 1,475 positive samples (passengers who purchased the seat selection additional service) in the dataset, accounting for only 6.29 %. This is far fewer than the number of negative samples, so the class distribution is imbalanced. The imbalance affects the model's judgment of the target, since a model tends to assign test samples to the majority class to keep accuracy high. However, this is not what airlines expect.
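The effect of this imbalance on accuracy can be made concrete with a short calculation. The sketch below uses only the two counts stated in the text; the "always predict no purchase" classifier is a hypothetical worst case introduced for illustration, not a model from the paper.

```python
# Counts taken from the text: 23,432 samples, 1,475 of them positive.
n_total = 23_432
n_pos = 1_475
n_neg = n_total - n_pos

# Hypothetical degenerate classifier: every passenger is labeled "no purchase".
tp, fp = 0, 0
tn, fn = n_neg, n_pos

positive_share = n_pos / n_total
accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"positive share: {positive_share:.2%}")  # 6.29%
print(f"accuracy:       {accuracy:.2%}")        # 93.71%
print(f"recall:         {recall:.2%}")          # 0.00%
```

Such a classifier reaches high accuracy while never identifying a single buyer, which is why recall and AUC matter more than accuracy for this task.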
In summary, the dataset has two prominent problems: high dimensionality and incompleteness, the latter comprising both missingness and class imbalance. The accuracy and credibility of the prediction results will be greatly reduced if these problems are not addressed.

III. METHODS
Because the dataset is incomplete and high-dimensional, a method based on a triple-layer eXtreme Gradient Boosting model optimized by Particle Swarm Optimization (PSO-XGBoost) is proposed to predict passengers' consumption willingness for the seat selection additional service. The architecture is shown in Figure 2.
First, the raw dataset is roughly cleaned and encoded so that the model can focus on the prominent problems and generalize better. The cleaned dataset is then input into the triple-layer model, in which PSO-XGBoost is used as the base model. Second, within the triple-layer model, the incomplete data processing layer (IDP-layer) handles missing data by imputation and class imbalance by resampling. The dimension of the complete dataset is then reduced in the high-dimensional data processing layer (HDP-layer) according to the feature importance obtained by XGBoost. Finally, in the predicting layer (P-layer), the processed dataset is used to train the XGBoost model. The Particle Swarm Optimization algorithm, a modern heuristic, is applied to tune the hyperparameters of XGBoost so as to improve the accuracy obtained by 10-fold cross-validation. The resulting optimal XGBoost classifier is used to predict the willingness of new passengers to purchase seat selection.

A. Design of the basic model
The base model of the proposed prediction method, PSO-XGBoost, is a hybrid model [9]-[11] that combines the Particle Swarm Optimization (PSO) algorithm and the eXtreme Gradient Boosting (XGBoost) model.
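XGBoost [12] improves on the traditional gradient boosting tree and is widely used in the field of consumer behavior prediction owing to its superior generalization performance, prediction accuracy, and parallel computing speed [13]. XGBoost is an additive model composed of CART trees, whose objective function comprises a loss function and a regularization term. In the standard formulation of [12], it is defined as
$$\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where $l(\cdot)$ represents the loss function between the label $y_i$ and the prediction $\hat{y}_i$, $f_k$ is the $k$-th tree, and $\Omega(\cdot)$ is a regularization term based on tree complexity (the number of leaves $T$ and the leaf weights $w_j$), which helps reduce the risk of over-fitting.
A second-order Taylor expansion of the loss function is then introduced, and the optimal weight of each leaf node, $w_j^{*} = -G_j / (H_j + \lambda)$, is solved iteratively to obtain the final objective function, as follows,
$$\mathrm{Obj}^{*} = -\frac{1}{2}\sum_{j=1}^{T}\frac{G_j^2}{H_j+\lambda} + \gamma T$$
where $G_j$ and $H_j$ are the sums of the first- and second-order derivatives of the loss over the samples in leaf $j$.
The PSO algorithm [14] is a modern heuristic algorithm adopted to solve optimization problems. The basic idea is to use the position of the $i$-th particle to represent a candidate solution to the problem and to use a fitness function value to evaluate the quality of positions. All particles update their velocity through (4) and their position through (5) by sharing their individual and global best positions with the others, and finally gather in the region of the extremum after multiple iterations. Written here in the commonly used inertia-weight form, the updates are
$$v_i^{t+1} = \omega v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right) \tag{4}$$
$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{5}$$
where $x_i^{t}$ and $v_i^{t}$ are the position and velocity of particle $i$ at iteration $t$, $p_i$ is the best position found by particle $i$, $g$ is the best position found by the swarm, $\omega$ is the inertia weight, $c_1$ and $c_2$ are acceleration coefficients, and $r_1$ and $r_2$ are random numbers drawn uniformly from $[0, 1]$.
Machine learning models contain many hyperparameters that must be set externally because of their complex construction. Appropriate hyperparameters are crucial to the results, since they can improve performance while reducing the risk of overfitting. Manual search and traversal (grid) search are the conventional tuning methods, but both are time-consuming and arduous. Many studies have verified that adopting the PSO algorithm to search for near-optimal hyperparameters makes models more automated and robust [13]. Therefore, in this study, XGBoost optimized by PSO is used as the base model. Five hyperparameters of XGBoost are selected for optimization, as shown in TABLE II.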
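The velocity and position updates referred to as (4) and (5) can be sketched in a few lines. This is a minimal illustration assuming the common inertia-weight form of PSO; the inertia weight `w`, the coefficients `c1` and `c2`, the swarm size, the bounds, and the toy objective are all illustrative assumptions, not the paper's settings. In the paper the fitness would be the cross-validated XGBoost score (to be maximized), whereas this sketch minimizes a simple stand-in function.

```python
import random

def pso_minimize(fitness, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `fitness` over a box given by `bounds` = [(low, high), ...]."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # individual best positions
    pbest_val = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best position

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to own best + pull to swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clipped to the search bounds.
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = fitness(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def toy_loss(x):
    """Stand-in for a validation loss with its optimum at (0.3, 6.0)."""
    return (x[0] - 0.3) ** 2 + (x[1] - 6.0) ** 2

# Two hypothetical hyperparameters with illustrative ranges.
best, best_val = pso_minimize(toy_loss, bounds=[(0.01, 1.0), (1.0, 10.0)])
print(best, best_val)
```

Each fitness evaluation would correspond to one full cross-validated training run, so the swarm size and iteration count directly set the tuning cost.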

B. Data cleaning and encoding
Before being input into the triple-layer model, the raw dataset is first cleaned and encoded. The following steps are carried out:
(1) Rough cleaning. Filling features that have an extremely high missing rate introduces more noise, while retaining features with low variance adds little information and reduces efficiency. Therefore, features with these defects are deleted directly to preliminarily reduce the dimension of the dataset.
(2) Date feature handling. The dataset includes many date-related features, such as travel time and preferred travel month, which contain rich information but have no direct relationship with consumption willingness. Considering that passenger flow and physical fatigue largely affect a passenger's requirement for seat comfort, the departure year and month variables are mapped to passenger flow, represented by the corresponding share of samples. Additionally, the departure time variable is binarized into a fatigue-prone period (from 20:00 to 8:00 the following day) and a non-fatigue period (from 9:00 to 19:00). Other year and month variables are replaced by the average of the departure year or month values.
(3) Categorical feature encoding. The dataset after step (1) contains 34 categorical features, such as passenger gender and flight cabin. Since machine learning models perform better on numerical features, the categorical features should be transformed into numerical counterparts by encoding [15], [16]. Among them, the flight cabin is label-encoded [17] because its values are ordinal. Passenger gender and flight destination are one-hot encoded [17], which maps a 1-dimensional N-value feature to an N-dimensional binary feature, because they are nominal and have low cardinality. The remaining features are processed by mean encoding [18] because of their high cardinality and unordered values.
After the above processing, the number of features in the dataset is reduced from 652 to 137, and no feature remains that is extremely incomplete, low-variance, or categorical. The cleaned dataset is then input into the triple-layer model.
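The three encodings named in step (3) can be illustrated on a few toy records. This is a sketch under stated assumptions: the records, the category values, and the cabin ordering are hypothetical, and the mean encoding shown is the plain per-category label mean without the smoothing or out-of-fold estimation that is usually added in practice to limit target leakage.

```python
from collections import defaultdict

# Hypothetical toy records; values are illustrative, not from the dataset.
rows = [
    {"cabin": "economy",  "gender": "F", "pref_city": "A", "label": 0},
    {"cabin": "business", "gender": "M", "pref_city": "B", "label": 1},
    {"cabin": "economy",  "gender": "M", "pref_city": "A", "label": 1},
    {"cabin": "first",    "gender": "F", "pref_city": "C", "label": 0},
]

def label_encode(values, order):
    """Ordinal feature: map each value to its rank in a given order."""
    rank = {v: i for i, v in enumerate(order)}
    return [rank[v] for v in values]

def one_hot_encode(values):
    """Nominal, low-cardinality feature: one binary column per category."""
    cats = sorted(set(values))
    return [[1 if v == c else 0 for c in cats] for v in values], cats

def mean_encode(values, labels):
    """High-cardinality feature: replace each category by its mean label."""
    total, count = defaultdict(float), defaultdict(int)
    for v, y in zip(values, labels):
        total[v] += y
        count[v] += 1
    return [total[v] / count[v] for v in values]

labels = [r["label"] for r in rows]
cabin_codes = label_encode([r["cabin"] for r in rows],
                           order=["economy", "business", "first"])
gender_onehot, gender_cats = one_hot_encode([r["gender"] for r in rows])
city_means = mean_encode([r["pref_city"] for r in rows], labels)

print(cabin_codes)    # [0, 1, 0, 2]
print(gender_onehot)  # [[1, 0], [0, 1], [0, 1], [1, 0]]
print(city_means)     # [0.5, 1.0, 0.5, 0.0]
```

Mean encoding keeps a high-cardinality feature in a single column, whereas one-hot encoding would add one column per category and work against the dimensionality reduction done later.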

C. The incomplete data processing layer
The IDP-layer addresses the missing data and class imbalance of the air passenger travel dataset.
In the first stage, the missing values are imputed. Past studies often impute missing data with uniform constant values [13], which does not perform well on severely incomplete datasets. Machine learning models have recently been widely used for imputation, since they take the other, complete features of each sample into account so that the imputed values are closer to the real values [19]. Therefore, PSO-XGBoost is adopted to impute missing data in this stage. The following steps are carried out:
(1) Sort the features with missing values in ascending order of missing rate.
(2) Take the feature with the lowest missing rate as the prediction target, use the samples in which this feature is observed as the training set to train the model, and then use the samples in which this feature is missing as the test set, predicting the corresponding values with the model.
(3) During training and prediction, temporarily fill the missing values of the remaining incomplete features with 0.
(4) Update the dataset by imputing the target feature with the predicted values.
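The imputation steps above can be sketched as a loop over the incomplete features. Two assumptions are made here: that the procedure is repeated for each feature in the sorted order (the natural reading of step (1)), and that a simple 1-nearest-neighbour regressor stands in for the PSO-XGBoost model used in the paper so that the sketch stays self-contained. The toy matrix is hypothetical, and `None` marks a missing entry.

```python
def impute_sequentially(rows, fit_predict):
    """Impute missing entries feature by feature, least-missing first."""
    data = [r[:] for r in rows]  # work on a copy
    n_feat = len(data[0])
    miss_rate = {j: sum(r[j] is None for r in data) / len(data)
                 for j in range(n_feat)}
    # Step (1): incomplete features in ascending order of missing rate.
    order = sorted((j for j in range(n_feat) if miss_rate[j] > 0),
                   key=miss_rate.get)
    for j in order:
        others = [k for k in range(n_feat) if k != j]

        def inputs(r):
            # Step (3): other missing entries are temporarily treated as 0.
            return [0.0 if r[k] is None else r[k] for k in others]

        train = [r for r in data if r[j] is not None]
        test = [r for r in data if r[j] is None]
        # Step (2): train on rows where feature j is observed, predict the rest.
        preds = fit_predict([inputs(r) for r in train],
                            [r[j] for r in train],
                            [inputs(r) for r in test])
        # Step (4): write the predictions back into the dataset.
        for r, p in zip(test, preds):
            r[j] = p
    return data

def nearest_neighbour(train_x, train_y, test_x):
    """Stand-in regressor (1-NN); the paper uses PSO-XGBoost at this point."""
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [train_y[min(range(len(train_x)), key=lambda i: dist(train_x[i], x))]
            for x in test_x]

toy = [
    [1.0, 1.0, 5.0],
    [2.0, 2.0, 6.0],
    [2.1, None, 6.1],
    [1.1, 1.1, None],
    [1.9, 1.9, None],
]
filled = impute_sequentially(toy, nearest_neighbour)
print(filled)
```

Imputing the least-missing feature first means that each later model is trained on inputs that already contain fewer temporary zeros.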
The best coefficient of determination (R^2) and the best mean-square error percentage obtained when imputing each missing feature are shown in Figure 3. The 30 features containing missing values in the preprocessed dataset are imputed using the PSO-XGBoost model. As can be seen from the figure, the R^2 of most imputed features is close to 1, and the mean-square error percentages are below 1 %. This indicates that the method imputes the missing features well and can improve the prediction accuracy of the model without degrading the overall quality of the data.
In the second stage, the dataset is resampled to balance the classes. In the related literature, random undersampling, random oversampling, or the synthetic minority oversampling technique (SMOTE) [15] is often used for resampling. Considering that the former two approaches carry the risk of data loss or model overfitting, SMOTE [20], which is based on the K-nearest-neighbor idea, is applied to balance the dataset. Figure 4 shows the sample distribution of the dataset before and after SMOTE processing on the two feature dimensions with the lowest missing rate; the number of positive samples (purchases of the seat selection service) increases, and the sample distribution becomes balanced after resampling. As a result, the cleaned dataset is changed from incomplete and imbalanced to complete and balanced.
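The core idea of SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration rather than the reference algorithm of [20]: base samples are drawn at random instead of generating a fixed number per sample, and the toy points, `k`, and the target size are hypothetical.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment from x to nb
        synthetic.append([u + gap * (v - u) for u, v in zip(x, nb)])
    return synthetic

minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
majority_size = 10  # hypothetical size of the majority class
new_points = smote(minority, n_new=majority_size - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))  # 10
```

Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing samples.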

D. The high-dimensional data processing layer
The HDP-layer addresses the high dimensionality of the air passenger travel dataset in the following three steps.
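In the first step, the feature importance is calculated using XGBoost and ranked. Feature importance is a by-product of the XGBoost training process and is calculated from the gain of each node split. In the standard XGBoost formulation [12], the gain is given by
$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$
where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of the first- and second-order derivatives of the loss over the samples in the left and right child nodes, $\lambda$ is the regularization coefficient on the leaf weights, and $\gamma$ is the complexity penalty for adding a leaf.
In the second step, the features are taken in order of importance, and the loss of XGBoost is calculated on the dataset containing the top $k$ features for every $k$ from 1 to 137. The number of features with the lowest loss is then selected as the optimal solution.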
In the third step, the top-ranked features are selected to form the processed dataset, which serves as the final input to the P-layer. Figure 5 illustrates the relationship between the number of features and the cross-entropy loss of XGBoost; the loss of the classifier is lowest when the number of features is 37 or 90. The top 37 features are therefore extracted in consideration of operating efficiency. The ranking of the top 37 features and their importance are shown in Figure 6.
As can be seen from Figure 6, the top 37 features consist of 19 passenger history airline preference features, 8 passenger history consumption information features, 7 passenger history travel information features, 2 current flight information features, and 1 basic passenger information feature. Passenger history airline preference features account for 51.13 %, indicating that passengers' travel habits and preferences largely determine their willingness to purchase paid seat selection ancillary services. The top two features are the number of times the passenger sat in a middle seat in the past three months (seat_middle_cnt_m3) and the number of times the passenger selected a seat in the past year (select_seat_cnt_y1), indicating that a passenger's seat preference has the most significant impact on the willingness to pay for seat selection. Therefore, airlines should focus on mining passengers' airline preferences, in particular their seat preferences, and deliver targeted marketing advertisements to passengers with different flight and seat preferences.
Meanwhile, statistics show that, in terms of quantity, the features related to preferred travel route (pref_line) and preferred arrival city (pref_city) make up the largest proportion, 29.72 %. In terms of the time dimension, most of the features describe passengers' long-term travel preferences or information: 45.9 % cover data within 2 years and 21.62 % cover data within 3 years. Airlines should therefore focus on studying the long-term travel characteristics of passengers with different preferred routes and cities in order to set up precision marketing of paid seat selection ancillary services.
As a result, a complete, reduced-dimensional processed dataset with 37 input features and a binary label is formed and used as the final input to the P-layer for training and prediction.
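The HDP-layer steps can be sketched as a split-gain function plus a search over the number of top-ranked features. The gain follows the standard XGBoost split-gain expression; the gradient sums, the importance values, and the toy loss function are hypothetical, and `loss_for_features` stands in for training XGBoost on a feature subset and returning its cross-entropy loss. Ties in the loss are resolved toward fewer features, mirroring the choice of 37 over 90 described in the text.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children (standard XGBoost form)."""
    def score(g, h):
        return g * g / (h + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

def select_top_k(importance, loss_for_features):
    """Rank features by importance and keep the top-k prefix with lowest loss."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    losses = {k: loss_for_features(ranked[:k]) for k in range(1, len(ranked) + 1)}
    best_loss = min(losses.values())
    best_k = min(k for k, v in losses.items() if v == best_loss)  # fewest features
    return ranked[:best_k], losses

# Hypothetical gradient sums for one candidate split.
gain = split_gain(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=3.0)

# Hypothetical importances; the two named features appear in the paper,
# but these values and the "noise" features do not.
importance = {"seat_middle_cnt_m3": 0.4, "select_seat_cnt_y1": 0.3,
              "noise_a": 0.2, "noise_b": 0.1}

def toy_loss(features):
    """Stand-in for the loss of a model trained on `features`."""
    useful = {"seat_middle_cnt_m3": 0.5, "select_seat_cnt_y1": 0.3}
    n_noise = sum(1 for f in features if f not in useful)
    return 1.0 - sum(useful.get(f, 0.0) for f in features) + 0.01 * n_noise

selected, losses = select_top_k(importance, toy_loss)
print(round(gain, 4))  # 3.7083
print(selected)        # ['seat_middle_cnt_m3', 'select_seat_cnt_y1']
```

Searching only the prefixes of the ranking requires one model fit per candidate size (137 in the paper) rather than one per feature subset.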

E. Prediction layer
Based on the complete and low-dimensional processed dataset, consumption willingness for the seat selection additional service is predicted by XGBoost in the P-layer. The hyperparameters of XGBoost are specified by the PSO algorithm, and the average model accuracy over 10-fold cross-validation is used as the fitness function value to guide the evolutionary direction of the particle swarm. The final optimal model is then applied to predict new airline passengers' willingness to purchase the seat selection additional service.

IV. RESULTS AND DISCUSSION
Performance measures are essential instruments for evaluating the reliability and validity of models. In the evaluation phase, five commonly used metrics are selected to assess the effect of the IDP-layer, the HDP-layer, and the use of PSO to specify hyperparameters. Finally, the triple-layer PSO-XGBoost model is compared with existing single-layer machine learning models.

A. Evaluation metrics and methods
Five commonly used evaluation indexes for classification problems are selected for various evaluations, including accuracy (Acc), precision (Pre), recall (Rec), F1 score (F1), and area under ROC curve (AUC).
However, Acc is not informative when the sample distribution is imbalanced, and Pre, Rec, and F1 each reflect only part of a model's performance. By contrast, AUC is recognized as a reliable tool for evaluating machine learning models in a variety of situations. Therefore, among all these metrics, this study focuses primarily on the AUC value.
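The five metrics can be computed directly from labels, predictions, and scores. The sketch below uses the usual confusion-matrix definitions and the rank-based form of AUC (the probability that a random positive is scored above a random negative, with ties counted as one half); the labels, scores, and the 0.5 threshold are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    acc = (tp + tn) / len(pairs)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pre * rec / (pre + rec) if pre + rec else 0.0
    return {"Acc": acc, "Pre": pre, "Rec": rec, "F1": f1}

def auc(y_true, scores):
    """Rank-based AUC: share of positive/negative pairs ordered correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted purchase probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics)
print(round(auc_value, 4))  # 0.8889
```

Unlike the other four metrics, AUC is computed from the scores alone and does not depend on the classification threshold, which is one reason it is less sensitive to class imbalance.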
To limit the influence of a particular dataset split on the performance evaluation, the training time overhead and model performance under different division ratios were considered [21]. A 10-fold cross-validation method was adopted to evaluate the performance of the model: in each fold, 90 % of the dataset is used as the training set and 10 % as the validation set. The model was trained and validated 10 times, and the average over the 10 runs was used as the score of the model on each metric.
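The 10-fold procedure can be sketched as an index splitter plus an averaging helper. The shuffling seed and the `score_fold` callback are assumptions made for illustration; in the paper, `score_fold` would train XGBoost on the training indices and return a metric on the validation indices. The sample count is the one stated for the raw dataset.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

def cross_val_mean(score_fold, n_samples, k=10):
    """Average a per-fold score over all k folds."""
    scores = [score_fold(train, val)
              for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(23_432, k=10))
train0, val0 = splits[0]
print(len(train0), len(val0))  # 21088 2344

# Hypothetical score function: the validation share of each fold.
val_share = cross_val_mean(lambda tr, va: len(va) / (len(tr) + len(va)), 23_432)
print(round(val_share, 3))  # 0.1
```

Every sample is used for validation exactly once, so the averaged score does not depend on one particular 90/10 split.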

B. Effect evaluation of incomplete data processing layer
To verify the validity of the IDP-layer, datasets are processed by different methods and used to train the XGBoost model; the experimental results are shown in TABLE III. According to the results, the Rec value of XGBoost is significantly improved on SMOTE-balanced datasets, which indicates that balancing the dataset substantially improves the model. Meanwhile, the missing value imputation method based on PSO-XGBoost is superior to the traditional imputation method, with the AUC score reaching 0.9587 on the balanced dataset, which shows that the quality of the dataset is improved after processing by the IDP-layer.

C. Effect evaluation of high-dimensional data processing layer
To evaluate the effect of dimensionality reduction on the dataset, the XGBoost model was trained on the dataset before and after the HDP-layer, and the model performance is shown in TABLE IV. Training on the dataset reduced by the HDP-layer significantly lowers the training time overhead while keeping the model performance stable.

D. Evaluation of the effect of PSO optimization
To evaluate the effectiveness of the PSO algorithm in optimizing the XGBoost model, the performance of models using different optimization methods on the processed dataset is shown in TABLE V. The table shows that the model with default hyperparameters performs poorly, while all metrics of the optimized models are improved. Among them, the hyperparameters obtained by the PSO algorithm perform on par with those found by a fine traversal search, but the search time is greatly reduced, which improves the efficiency of model optimization.

E. Effect comparison of triple-layer PSO-XGBoost model
To verify the effectiveness of the proposed method, the triple-layer PSO-XGBoost model is compared with other widely used machine learning models, including Logistic Regression [22], Random Forest [23], BP Neural Network [24], Naive Bayes [25], Decision Tree [26], Support Vector Machine, K Nearest Neighbor, Long Short Term Memory, and a single-layer XGBoost model. For these baselines, the dataset is only cleaned, encoded, and mean-imputed. The performance of the models is shown in TABLE VII.
As can be seen from TABLE VII, the latter three ensemble tree-based models perform very well, with AUCs above 0.8. The proposed triple-layer PSO-XGBoost model shows the best performance, with an AUC of 0.9879, which demonstrates that the proposed method can accomplish the prediction task well on a high-dimensional and incomplete dataset. The AUCs of both Logistic Regression, a linear model, and Naive Bayes, which assumes independence between features, are lower, indicating an evident non-linear relationship between passengers' consumption willingness for the seat selection additional service and the 37 features mentioned, as well as correlation among the features. Among the comparison models, all except Naive Bayes show high Acc but low Rec, indicating that these models classify the vast majority of samples as negative (not purchasing additional services), while Naive Bayes classifies the majority of samples as positive. The comparison illustrates that a dataset without targeted processing causes large fluctuations in model performance, which further confirms the stability of the proposed triple-layer model.

V. CONCLUSIONS
In this paper, a prediction method of airline additional services consumption willingness based on a triple-layer XGBoost model optimized by PSO is proposed, and the following conclusions are drawn.
(1) In the IDP-layer, compared with the imbalanced dataset imputed by traditional methods, the dataset imputed by the PSO-XGBoost model and balanced by SMOTE yields significantly improved prediction performance.
(2) In the HDP-layer, the XGBoost model is used for feature extraction and dimension reduction, which greatly reduces the running time and improves the efficiency while ensuring the stability of the model.
(3) The modern heuristic PSO algorithm is used to optimize the XGBoost model. Compared with the traditional traversal search method, the time cost is greatly reduced, and the performance is significantly improved over the XGBoost model with default hyperparameters.
(4) Comparative experimental results show that the proposed triple-layer PSO-XGBoost model performs better than single-layer XGBoost and other widely used machine learning models such as the BP neural network. The proposed model not only greatly improves the prediction accuracy but also reduces the training time, and can therefore meet the demand for predicting passengers' willingness to purchase additional services based on civil aviation passenger information big data systems.

DECLARATION OF COMPETING INTEREST
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper significantly.