Data-driven Long-landing Event Detection and Interpretability Analysis in Civil Aviation

Long-landing events (LLEs) reduce the available runway length and increase operating costs; they are primarily caused by improper operation of the aircraft, i.e., by human error. The pilot's operations are comprehensively recorded by the quick access recorder (QAR) during aircraft takeoff and landing, and analyzing the QAR data can reveal the causes of LLEs. Traditionally, domain experts detect LLEs by manually setting thresholds on uni-dimensional data. However, this approach cannot detect defects in the pilot's maneuvering technique because it does not consider the potential mutual information between different features in the large amount of data. To overcome these problems, this paper proposes a data-driven LLE detection and causation analysis workflow that automatically mines and analyzes this information. Firstly, a dataset is established from the QAR data of 2002 flights, considering the landing phase of the aircraft. Subsequently, a hybrid feature selection (HFS) method combining unsupervised and supervised algorithms is implemented to select the features that are highly relevant to LLEs. A categorical Light Gradient Boosting Machine (LGBM) tuned with Bayesian optimization (LGBMBO) is then used as the detection model. Furthermore, the model is interpreted using SHapley Additive exPlanations (SHAP) to analyze the marginal effect of the key parameters on LLEs. The experimental results demonstrate that the proposed HFS-LGBMBO model reduces computational cost while achieving better performance. They also show that LLEs can be avoided during the landing phase by maintaining an appropriate descent speed, an appropriate aircraft altitude, and a proper descent angle.


I. INTRODUCTION
The aircraft landing phase is the period with the highest accident rate, since it is closely associated with the geographical environment of the airport, weather conditions, the pilot's operational expertise, and the level of maintenance of the aircraft parts [1]. Accident data analysis indicates that the Runway Excursion (RE) category, which occurs during the landing phase, is the most common type of accident reported each year [2]. REs accounted for 14.9% of all accidents and 50% of all fatal accidents, with 44 fatalities in 2019 [3]. The long-landing event (LLE) is one of the corresponding RE accident indicators. LLEs occur when the aircraft's landing distance lies beyond the normal range, thus reducing the available length of the runway and considerably increasing the risk. Even worse, the aircraft may run off the runway if the pilots lack the necessary proficiency, leading to unpredictable consequences [4], [5]. Therefore, detecting LLEs and their causes can help prevent RE accidents, benefit airline management and pilot training, and support the development of aviation safety regulations.
The Quick Access Recorder (QAR) is an onboard data recording device which collects the entire flight information via sensors placed throughout the aircraft during the flight phases. Research on the QAR has gained considerable attention from civil aviation safety experts for the evaluation of flight quality and the detection of risk events, owing to its low cost and portability [6]. The Exceedance Detection (ED) method is one of the most efficient methods for monitoring QAR data; it comprises various trigger logic expressions on uni-dimensional data, based on operational manuals, training programs, and risk assessment procedures [7]. However, this approach depends on the experience of domain experts, making it difficult to efficiently utilize the potential mutual information between different features in the huge amount of data; it thus fails to identify the defects caused by the pilot's maneuvering technique [8]-[11]. Consequently, the trade-off between the ability to detect anomalous events and the inference of their causality presents a considerable drawback in the analysis of large amounts of aviation data.
This study attempts to break this problem down into three sub-problems: selecting the features most relevant to LLEs from a large number of features; determining whether the proposed anomaly detection model is optimal when compared with other machine learning algorithms; and, lastly, excavating the correlation between key parameters and LLEs. This study solves these problems by employing a data-driven workflow based on an interpretable machine learning approach. The main contributions of this paper are as follows:
• The QAR data from 2002 flights are collected to build a dataset of LLE anomalies that occur during the landing phase.
• A fast and accurate Hybrid Feature Selection (HFS) algorithm, which combines the advantages of both unsupervised and supervised algorithms, is adopted for the QAR data to reduce data redundancy and determine the most relevant features.
• This study tunes the hyper-parameters of the Light Gradient Boosting Machine (LGBM) by incorporating Bayesian Optimization (BO) to improve the anomaly detection capability of the model and to overcome the problem of sample imbalance in the abnormal data.
• It employs an interpretable machine learning framework, SHapley Additive exPlanations (SHAP), to excavate the correlations between the parameters within the anomaly detection model.
The rest of the paper is structured as follows. Section 2 presents a description of the related works. Section 3 depicts the proposed system architecture, including dataset construction, feature engineering, anomaly detection model, and interpretability analysis. Section 4 presents the experimental results, and explains and analyzes the models using SHAP. Lastly, Section 5 presents the conclusion.

II. LITERATURE REVIEW
A. ANOMALY DETECTION APPROACHES FOR AVIATION DATA
Extensive research has been conducted to overcome these limitations. Based on the findings of previous studies, anomaly detection models for aviation data can be classified into three categories: clustering-based models, regression-based models, and classification-based models. Clustering-based models employ an unsupervised learning algorithm which analyzes the similarity between aviation data based on different rules. The Multiple Kernel-based Anomaly Detection (MKAD) algorithm, developed by Das et al. [12], is one of the first unsupervised methods designed for anomaly detection in aviation. The MKAD can detect potential safety anomalies in very large databases of discrete and continuous data based on kernel functions and One-Class SVM (OC-SVM) [13]. Li et al. [14] applied a cluster-based Anomaly Detection (ClusterAD) method, based on the DBSCAN algorithm, to multidimensional flight data; unlike the MKAD, this algorithm can automatically identify multiple types of flight operation patterns. Puranik et al. [15] used a nearest-neighborhood method based on the Local Outlier Factor (LOF) to quantify the anomaly degree of flight data clustering results.
A regression-based model takes multiple independent variable parameters as inputs to fit a dependent variable, and outliers are identified by comparing the difference between the model's predictive output and the ground truth of the dependent variable. The Semi-Markov Switching Vector AutoRegressive (SMS-VAR) model proposed by Melnyk et al. [16] performs regression fitting of multidimensional flight data sequences to detect anomalies based on the dissimilarities between the model predictions and the data observations. Chao et al. [17] selected the 15 parameters most relevant to landing among the 260 parameters of the QAR data, developed a deep prediction model based on Long Short-Term Memory (LSTM), and then used the Mean Square Error (MSE) as an evaluation criterion to identify hard landing events. Tejas et al. [18] used Random Forest (RF) models to establish a framework for online predictive modeling of key landing parameters and identified anomalies using quartiles of the MSE between the model output and the true value.
The classification-based model differs from the clustering and regression models in that it requires anomaly labels assigned by experts in the aviation industry. However, the classification model is advantageous because the performance of each model can be compared objectively; consequently, it presents considerable potential for anomaly detection in aviation data. Memarzadeh et al. [19] developed a Convolutional Variational Auto-Encoder (CVAE), which is an unsupervised deep generative model, for anomaly detection in high-dimensional aviation time-series data. Rey et al. [20] implemented an XGBoost model at multiple airports to detect runway anomalies; this model was sufficiently accurate to select the contributing factors and support an extensive analysis.

B. ANOMALY ANALYSIS APPROACHES FOR AVIATION DATA
Research has also been conducted on mining aviation data for factors which contribute to aviation accidents. Wang et al. [4] identified the key flight parameter characteristics of LLEs by analyzing QAR data and proposed preventive measures from the pilot operation perspective. They developed logistic regression and linear regression models to further analyze the correlation between the touchdown distance and these flight parameter variables. In 2018, Wang et al. [21] analyzed the effects of the pilot's critical flare operation on long- and hard-landing events based on the QAR data of 293 flights, employing two regression models to analyze the potential correlations between the flare operation and the landing performance. David et al. [22] applied Bayesian networks to airborne recorded flight data to perform an effective quantitative risk assessment of aviation events by selecting features directly related to the known risk causal factors. The approaches presented in the previous studies are difficult to implement uniformly in a single model. The anomaly detection approach can accurately detect anomalies but cannot interpret them. Conversely, the anomaly analysis approach can explain anomalies but presents limitations such as small data samples and non-uniform evaluation indexes. Consequently, evaluating the effectiveness of such models for large-scale data applications is difficult. To address this problem, this paper combines the advantages of these two approaches to integrate risk event detection and cause analysis.

III. METHODOLOGY
A. SYSTEM ARCHITECTURE
In this study, it is assumed that the aircraft is in good operating condition, that the pilots are able to maneuver the aircraft normally, and that the data recorded by the QAR are consistent with the pilot's maneuvers, so that the relationship between LLEs and pilot operation characteristics can be investigated. In addition, the influence of external factors, such as runway conditions and weather conditions, is not the focus of this paper. The proposed anomalous event detection and analysis scheme consists of four steps: dataset construction, HFS-based feature selection, LGBMBO-based anomaly detection, and analysis of the experimental results, including performance evaluation and interpretability analysis, as shown in Figure 1.

B. DATASET CONSTRUCTION
1) Parameters
The parameters are selected based on two aspects: the types of parameters in the QAR data and the parameters related to the landing phase. The QAR records four types of flight parameters: alarm parameters, environment parameters, status parameters, and operation parameters. Among these, the status and operation parameters are the most important indicators, as they represent the pilot's operational expertise. The landing phase of the aircraft can be divided into three phases: the glide slope phase, the flare-out phase, and the deceleration taxiing phase [23]. During the glide slope phase, the aircraft transitions from gliding to an approximately level flight state, generally employing the small-throttle sliding method. During the flare-out phase, the aircraft mainly adjusts its speed, attitude, and angle for landing. Lastly, during the deceleration taxiing phase, the aircraft's drag must be increased to reduce the speed as quickly as possible and ultimately shorten the landing glide distance. The glide slope phase is affected by the vertical height of the aircraft as it passes the runway threshold, the gliding angle, and other factors. The deceleration taxiing phase after touchdown is affected by the touchdown speed of the aircraft, the deceleration devices, the slope of the airport runway, the wind speed on the ground, and other factors. Consequently, the parameters which characterize the flight dynamics of the aircraft during the landing phase are selected from the aircraft status parameters and the operation parameters of the QAR data. These parameters can be classified into velocity-type, position-type, status-type, engine-type, and deceleration-type. The relevant parameters, 151 columns in total, are then selected from more than 2000 parameters in the QAR data; the selected parameter types and examples are presented in Table 2.

2) Data extraction
The LLEs typically occur during the approach phase, which is the most common phase for aviation accidents. The aim of the data extraction process is to select the landing phase from the entire flight data frame. This process is performed according to the same rules for each flight in this study: each flight must have the same dimensions and a unified sampling point. During the landing phase, the unified point is defined as the time stamp of the maximum aircraft braking value. For each flight, 49 time stamps are taken before the landing point and ten time stamps after it to establish a feature vector matrix for the flight; a total of 60 time stamps are therefore extracted per flight. The range of the values taken must include the aircraft's altitude, which is 600 feet from the taxiway. For each flight f, the vector at time t (t ∈ (0, 60)) is defined in (1):

V_f(t) = [k_t^1, k_t^2, ..., k_t^P]   (1)

where k_t^p denotes the value of the p-th flight parameter at time step t.
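The windowing rule above can be sketched as follows. This is a minimal sketch using a toy flight array; the column index of the braking parameter and the array layout are illustrative assumptions, not the actual QAR schema:

```python
import numpy as np

def extract_landing_window(flight, brake_col=0, before=49, after=10):
    """Extract a fixed window of time stamps around the 'unified point'.

    `flight` is a 2-D array of shape (time_steps, num_parameters); the
    unified point is the time stamp with the maximum braking value
    (the braking column index is a hypothetical layout assumption).
    Returns a (before + 1 + after, num_parameters) feature matrix.
    """
    t_star = int(np.argmax(flight[:, brake_col]))  # unified point
    start, stop = t_star - before, t_star + after + 1
    if start < 0 or stop > flight.shape[0]:
        raise ValueError("flight record too short around the unified point")
    return flight[start:stop]

# Toy flight: 200 time steps, 5 parameters; braking peaks at t = 120.
rng = np.random.default_rng(0)
flight = rng.normal(size=(200, 5))
flight[120, 0] = 10.0
window = extract_landing_window(flight)
print(window.shape)  # (60, 5): 49 stamps before, the landing point, 10 after
```

Stacking one such window per flight yields the feature vector matrix of Equation (1) for every flight in the dataset.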

3) Labelling of anomalous events through regularized matching
FIGURE 2. Data extraction process

The QAR database and the abnormal events database are obtained using the airline's decoding software, AirFase. The QAR database includes multi-dimensional time-series data such as flight registration numbers, airline registration numbers, aircraft type numbers, flight execution dates, parameter records, and flight operation times. The AirFase software creates the abnormal events database by analyzing the documents of the Flight Standards Department of the Civil Aviation Administration of China (CAAC) [24] and the airline QAR monitoring standards; this database contains the flight registration number, aircraft type number, abnormal phase, abnormal event occurrence time, abnormal level, and so on. In the abnormal events database, an LLE is defined as a landing in which the landing point is more than 700 meters from the start of the runway. Lastly, the QAR and abnormal events databases are matched to obtain the labeled abnormal datasets through regularization filtering, flight number matching, abnormal time determination, abnormality labeling, and so on.
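The matching-and-labeling step can be sketched as below. The column names and the normalization rule are hypothetical stand-ins for the actual AirFase export schema:

```python
import re
import pandas as pd

# Hypothetical schemas; the actual AirFase export columns differ.
qar = pd.DataFrame({
    "flight_reg": ["B-1234 ", "b-5678", "B-9012"],
    "landing_time": ["2021-03-01 10:05", "2021-03-01 11:20",
                     "2021-03-02 09:15"],
})
events = pd.DataFrame({
    "flight_reg": ["B-5678", "B-9012"],
    "event": ["LONG_LANDING", "NORMAL"],
})

def normalize_reg(s: str) -> str:
    # Regularization filter: upper-case and strip everything except
    # letters, digits, and '-' so the two databases key-match cleanly.
    return re.sub(r"[^A-Z0-9-]", "", s.upper())

for df in (qar, events):
    df["flight_reg"] = df["flight_reg"].map(normalize_reg)

# Left-join the event database onto the QAR records, then binarize.
labeled = qar.merge(events, on="flight_reg", how="left")
labeled["label"] = (labeled["event"] == "LONG_LANDING").astype(int)
print(labeled[["flight_reg", "label"]])
```

A real pipeline would additionally match on the abnormal-event time window, as described above.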

C. HYBRID FEATURE SELECTION
Several sensors perform the same function in an aircraft; this redundancy ensures that when one sensor fails, the others remain in working condition. While this redundant mechanism ensures the safety of the aircraft, it adds unnecessary complexity to the data analysis. Therefore, this paper combines the Unsupervised Feature Selection (UFS) and Supervised Feature Selection (SFS) algorithms to design an HFS algorithm for high-dimensional aviation data.

1) Unsupervised feature selection to eliminate redundant features
The UFS evaluates the relevance of the features by analyzing the internal structure between the features [25]. The unsupervised filtering method based on the Spearman rank correlation coefficient (SRCC) filters redundant features without using label information. Spearman's correlation assumes that the data are at least ordinal and that one variable is monotonically related to the other [26]; the QAR data satisfy these assumptions. Therefore, variables with highly monotonic relationships can be filtered out by the SRCC as a way to remove redundant features from the QAR data. The SRCC takes values ranging from -1 to 1. Values close to 1 represent a strong monotonically increasing relationship between two variables, while values close to -1 represent a strong monotonically decreasing relationship; a value close to 0 indicates no correlation between the two variables [27]. The absolute value of the correlation matrix is taken because both monotonically increasing and monotonically decreasing correlations are relevant in the current dataset. The SRCC is defined as

ρ = 1 − (6 Σ_{i=1}^{N} d_i²) / (N(N² − 1))

where d_i = A_i − B_i is the rank difference between the i-th elements of the two variables a and b, and N is the total number of elements, which in this instance is the number of rows of the QAR data. If the absolute correlation coefficient of two variables is close to 1, the two variables are strongly correlated, and one of them can be considered redundant. The correlation threshold is set to 0.95 based on the experiments conducted in this study.
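The SRCC-based redundancy filter can be sketched with pandas, which computes Spearman correlations directly. The toy feature names are hypothetical; only the 0.95 threshold comes from the text:

```python
import numpy as np
import pandas as pd

def drop_redundant(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop one feature from every pair whose |Spearman rho| > threshold."""
    corr = df.corr(method="spearman").abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(1)
x = rng.normal(size=500)
df = pd.DataFrame({
    "ias": x,                                       # toy airspeed feature
    "gs": x * 1.8 + 0.01 * rng.normal(size=500),    # near-monotone copy
    "pitch": rng.normal(size=500),                  # independent feature
})
reduced = drop_redundant(df)
print(list(reduced.columns))  # ['ias', 'pitch']: the redundant copy is gone
```

Because Spearman uses ranks, the near-monotone duplicate is detected even though its scale differs from the original feature.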

2) Supervised feature selection to select the most relevant features
The SFS selects the features most relevant to the class labels, resulting in better performance on the selected subset [28]. This study proposes a high-performance automatic supervised feature selection method, obtained by cross-validation using the eXtreme Gradient Boosting (XGBoost) and Recursive Feature Elimination (RFE) methods. XGBoost is an ensemble machine learning algorithm based on the gradient boosting decision tree (GBDT) framework [29]. GBDT is a boosting type of ensemble learning, in which individual models are trained sequentially, each model learning from the errors of the previous one [30]. For an input sample x_i, the GBDT prediction is

ŷ_i = Σ_{k=1}^{K} f_k(x_i),  f_k ∈ F

where F represents the set of all weak learners. Compared with GBDT, XGBoost introduces a regularization term in the cost function to control the complexity of the model:

Ω(f_k) = γT + (1/2) λ ||w||²

where T is the number of leaves of the tree and w is the vector of leaf weights.
The objective function of XGBoost is composed of two parts, training loss and regularization, as represented in (5):

Obj = Σ_{i=1}^{n} l(y_i, ŷ_i) + Σ_{k=1}^{K} Ω(f_k)   (5)
A second-order Taylor expansion of the loss function is then performed to approximate it; by approximating the objective function, the whole model converges faster. The optimal objective value corresponding to a given tree structure is

Obj* = −(1/2) Σ_{j=1}^{T} G_j² / (H_j + λ) + γT

where G_j and H_j represent the cumulative sums over leaf j of the first-order and second-order partial derivatives of the approximated loss function, respectively. XGBoost then calculates the importance of each feature by summing the number of splits on that feature across all trees, in order to select the important features.
RFE is a wrapper feature selection method: it ranks features by their importance coefficients, removes the least important ones, and repeats the process on the remaining feature set until all features have been considered [31]. Firstly, the estimator is trained on the initial set of features to determine the importance of each feature. The least important features are then removed from the current set. This process is repeated recursively on the pruned set until the desired number of features is reached. Combining XGBoost with RFE yields an embedded feature selection method, whose procedure is as follows:
Step 1. The XGBoost model is trained on the original features, and each feature is assigned a weight, i.e., its feature importance.
Step 2. The features with the smallest absolute weights are removed from the feature set.
Step 3. This process is repeated recursively until the number of remaining features reaches a preset feature count threshold.
Step 4. The performance of each selected feature subset is evaluated, and the feature subset with the best performance is selected.
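The steps above map directly onto scikit-learn's `RFECV`, sketched below. The paper uses XGBoost with 5-fold cross-validation; here `GradientBoostingClassifier` is a dependency-free stand-in (with `xgboost` installed, `XGBClassifier` drops in unchanged), and `cv=3` just keeps this toy example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

# Toy labeled data standing in for the QAR feature matrix.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=3, random_state=0)

selector = RFECV(
    estimator=GradientBoostingClassifier(n_estimators=50, random_state=0),
    step=1,               # Step 2: drop the least important feature per round
    cv=3,                 # cross-validated subset evaluation (paper: 5-fold)
    scoring="roc_auc",    # Step 4: pick the subset with the best ROC_AUC
)
selector.fit(X, y)
print(selector.n_features_)   # size of the best-scoring feature subset
```

`selector.support_` then gives the boolean mask of selected features, matching the automatic subset-size selection described in Section IV.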

D. ANOMALY DETECTION MODEL
1) LGBM model for anomaly detection
The LGBM algorithm is an ensemble algorithm based on the gradient boosting framework, proposed by Microsoft Research. The LGBM achieves higher accuracy and faster model training than XGBoost and is better optimized for the construction of decision trees [32], although the main computational principles of the two algorithms are identical. The improvements the LGBM makes to decision tree construction lie in the GOSS algorithm, the EFB algorithm, the histogram algorithm, and the leaf-wise leaf node generation method. The GOSS algorithm is used to reduce the training sample size, the EFB algorithm is used to reduce the sample dimension (i.e., the number of sample features), and the histogram algorithm is used to identify the best splitting point. In particular, the leaf-wise growth approach is computationally less expensive and avoids overfitting when compared with the XGBoost approach of building decision trees using level-wise leaf node growth.

2) Bayesian optimization algorithm for hyperparameter tuning
Tuning the LGBM manually is tedious and complicated and can deteriorate the performance of the algorithm. Therefore, in this study, the parameters are adjusted automatically using the Bayesian algorithm to ensure that the LGBM algorithm achieves the best results. The BO works by constructing a posterior distribution over functions (a Gaussian process) that best describes the function to be optimized [33]. As the number of observations grows, the posterior distribution improves, and the model focuses on the region of the parameter space with the most promising validation scores.
The BO contains two core parts: the probabilistic surrogate model (PSM) and the acquisition function (AC). The PSM can be divided into parametric and non-parametric models based on the parameters of the model. The Gaussian process (GP), which is a non-parametric model with a variable number of parameters when compared with the parametric model, can better describe the unknown objective function and strongly fits the function due to its greater flexibility and scalability.
The acquisition function (AC) is an important basis for determining the next evaluation point; this study uses the Gaussian Process-Upper Confidence Bound (GP-UCB), as defined in (7):

x_{t+1} = argmax_x [ µ_t(x) + √β_t · δ_t(x) ]   (7)

where the parameter β_t represents a constant balancing exploration and exploitation, µ_t(x) represents the posterior mean, and δ_t(x) represents the posterior standard deviation.
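A minimal GP-UCB loop is sketched below on a toy 1-D objective, standing in for the LGBM validation score as a function of a single hyperparameter; in practice a BO library would evaluate the real cross-validation score instead:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    return -(x - 0.3) ** 2          # toy target with its maximum at x = 0.3

grid = np.linspace(0.0, 1.0, 201).reshape(-1, 1)   # candidate points
X_obs = np.array([[0.0], [1.0]])                   # initial evaluations
y_obs = objective(X_obs).ravel()
beta = 2.0                                         # exploration weight

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                              normalize_y=True)
for _ in range(10):
    gp.fit(X_obs, y_obs)                           # posterior over objective
    mu, sigma = gp.predict(grid, return_std=True)
    ucb = mu + np.sqrt(beta) * sigma               # GP-UCB acquisition
    x_next = grid[np.argmax(ucb)]                  # next evaluation point
    X_obs = np.vstack([X_obs, [x_next]])
    y_obs = np.append(y_obs, objective(x_next)[0])

best = float(X_obs[np.argmax(y_obs)][0])
print(round(best, 2))
```

The loop alternates between fitting the GP surrogate and maximizing the UCB acquisition of Equation (7), converging toward the optimum while still probing uncertain regions.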

E. MODEL INTERPRETATION
This paper uses the machine learning interpretability tool SHAP to analyze the LGBMBO model. SHAP comprises several machine learning interpretation strategies proposed by Lundberg and Lee in 2017 [34]. It is derived from a game-theoretic concept called Shapley values, which can interpret the output of any machine learning model, whether parametric or not. SHAP explains the behavior of a machine learning model by reverse-engineering the output of any predictive model. SHAP interprets the model's predicted value as the sum of the attribution values of each input feature, expressed as a linear additive feature attribution model:

g(x′) = ϕ_0 + Σ_{j=1}^{M} ϕ_j x′_j

where f(x) represents the original model, g(x′) represents the explanatory model, x′ represents the simplified input for variable x, and M represents the number of input features.
In the SHAP interpretable model, ϕ_0 represents the mean SHAP base value over all training samples, ϕ_j represents the SHAP value corresponding to variable j, and x′_j takes the value 1 when variable j is selected and 0 otherwise.
The SHAP value ϕ_j is defined in (8):

ϕ_j = Σ_{S ⊆ N\{j}} [ |S|!(M − |S| − 1)! / M! ] · [ f(S ∪ {j}) − f(S) ]   (8)

where N represents the set of all features, S represents a subset of N that does not contain feature j, and M represents the number of input features.
The SHAP value is used to measure the degree of influence of a feature on the predicted result. A positive SHAP value indicates that the variable promotes the predicted value for the sample, whereas a negative SHAP value indicates that it restrains the predicted value. A larger absolute SHAP value indicates that the variable has a greater influence on the predicted value for the sample.
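Equation (8) can be verified directly by brute force on a tiny model. The sketch below enumerates all feature subsets, which is only feasible for a handful of features; libraries such as `shap` (e.g., its TreeExplainer) compute the same quantity efficiently for tree ensembles like the LGBMBO model:

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values per Eq. (8), by enumerating all subsets."""
    N = list(range(n_features))
    M = len(N)
    phi = []
    for j in N:
        others = [i for i in N if i != j]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Combinatorial weight |S|!(M-|S|-1)!/M! from Eq. (8).
                w = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                total += w * (value_fn(set(S) | {j}) - value_fn(set(S)))
        phi.append(total)
    return phi

# Toy additive model: v(S) is the sum of contributions of features in S,
# so each Shapley value should equal that feature's own contribution.
contrib = {0: 2.0, 1: -1.0, 2: 0.5}
v = lambda S: sum(contrib[i] for i in S)
phi = shapley_values(v, 3)
print(phi)  # approximately [2.0, -1.0, 0.5]
```

For this additive toy model, each ϕ_j recovers the feature's individual contribution exactly, illustrating the additivity property SHAP relies on.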

IV. EXPERIMENTAL RESULTS
A. EVALUATION MEASURES
In this paper, the confusion matrix description is shown in Table 3. For unbalanced data, a single accuracy metric cannot sufficiently reflect the performance of the model, because the accuracy (Acc) of the majority and minority sample classifications must be considered together. Therefore, the F1 score (F1) and the Matthews correlation coefficient (MCC) are used to analyze the confusion matrix, and the area under the curve (AUC) of both the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve is used to determine the performance of the classifier.
The following definitions give the evaluation indicators of the classifier:
1) Acc is defined as:
Acc = (TP + TN) / (TP + TN + FP + FN)
2) F1 is defined as:
F1 = 2 × Precision × Recall / (Precision + Recall), where Precision = TP/(TP + FP) and Recall = TP/(TP + FN)
3) MCC is defined as:
MCC = (TP/N − S × P) / √(P × S × (1 − S) × (1 − P))
where N, S, and P are defined as follows:
N = TN + TP + FN + FP,  S = (TP + FN)/N,  P = (TP + FP)/N
4) The ROC curve plots the TPR on the y-axis against the FPR on the x-axis, and the PR curve plots the precision rate on the y-axis against the recall rate on the x-axis, where
TPR = TP/(TP + FN),  FPR = FP/(FP + TN)
The areas enclosed by the ROC curve and the PR curve are ROC_AUC and PR_AUC, respectively.
Acc, F1, ROC_AUC, and PR_AUC lie in the range [0, 1], and MCC lies in the range [-1, 1]; values closer to 1 indicate better algorithm performance and vice versa.
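These definitions can be written out directly from the confusion-matrix counts, which is useful for cross-checking library output on the imbalanced LLE test set. The counts below are illustrative, not the paper's results:

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    """Acc, F1, and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # = TPR; FPR would be fp / (fp + tn)
    f1 = 2 * precision * recall / (precision + recall)
    # Equivalent closed form of the N, S, P expression in the text.
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, f1, mcc

# Hypothetical counts for an imbalanced test set.
acc, f1, mcc = metrics(tp=90, tn=1600, fp=40, fn=20)
print(round(acc, 4), round(f1, 4), round(mcc, 4))
```

Note how, with few positives, Acc stays high even when F1 and MCC reveal a much more modest detection quality, which is why the paper reports all three.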

B. EXPERIMENTAL SETTINGS
In this experiment, a total of 20 gigabytes (GB) of QAR data were collected from 2002 flights of an airline during actual operation. Each flight yielded 60 time stamps during the landing phase through resampling, for a total of 120120 sample points. In the constructed anomaly dataset, the ratio of long-landing data to normal data is 1:17, indicating that the dataset is unbalanced. For the dataset processing, the original dataset is first divided into a training set and a test set through stratified sampling in the ratio of 8:2. Subsequently, the HFS algorithm is used to select the appropriate features on the training set, while the test set uses the feature matrix obtained after the hybrid feature selection to ensure the objectivity of the test results. The experiment is conducted on a computer with Windows 10, a 2.9 GHz Intel i7 processor, an RTX 3080 GPU, and Python 3.8.10, in a Visual Studio Code environment using Jupyter notebooks.
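The stratified 8:2 split can be sketched as below; the exact class counts are illustrative, chosen to total 120120 samples at roughly the 1:17 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# ~1:17 positives-to-negatives over 120120 sample points (illustrative).
y = np.array([1] * 6674 + [0] * 113446)
X = np.arange(len(y)).reshape(-1, 1)      # placeholder feature matrix

# stratify=y keeps the anomaly proportion identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(y_tr), len(y_te), round(float(y_te.mean()), 4))
```

Without stratification, a random 20% slice of so rare a positive class could end up noticeably over- or under-represented in the test set, biasing the reported F1 and MCC.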

C. FEATURE SELECTION ANALYSIS
1) Depiction of the result of feature selection
The feature number is reduced from 151 dimensions to 62 dimensions after the SRCC deletes the redundant features from the raw data. Figure 3 presents the comparison between the raw data and the SRCC features. In the heat map, the darker the color, the greater the correlation between the features. It can be observed from Figure 3(a) that there are more dark areas in the heat map of the raw data, indicating that there are more redundant values in the raw data. From Figure 3(b), it can be observed that the dark areas are significantly reduced after the SRCC processing and the colors become significantly lighter, indicating that the redundant values are effectively eliminated once the raw data is subjected to the SRCC processing.
The XGB+RFE, a supervised feature selection method, is used to select the features associated with anomalous events after removing the redundant values using the UFS approach. This study adopts a cross-validation approach which can automatically select the best subset of features, ensuring that the proposed algorithm automatically selects the appropriate features. Furthermore, the ROC_AUC score with 5-fold cross-validation is used as the objective function. Figure 4 presents the selected key feature parameters. It shows the different feature subset sizes and their cross-validation scores; the point where the vertical line crosses the curve is the point with the largest cross-validation score, and the horizontal coordinate corresponding to the vertical line is the best feature subset selected. It can be observed from Table 4 that the algorithm selects a best feature subset size of 16. Figure 5 presents the heat map obtained after the HFS processing; it can be observed that there are fewer data dimensions and a lower correlation between features after the XGB+RFE processing when compared with the SRCC processing.

2) Feature selection performance comparison
This study compares different feature selection methods in terms of both performance and computation time to demonstrate the necessity of feature selection. Classical feature selection algorithms such as XGBoost, Random Forest (RF) [35], and Principal Component Analysis (PCA) [36] are selected and then compared with the algorithm proposed in this study. When using RF for feature selection, the features with a feature importance greater than 0.01 are selected. When using PCA for feature dimensionality reduction, the decomposition vectors with a cumulative explained variance greater than 95% are selected, which corresponds to n_components = 12. The classifier used is the LGBM algorithm without hyperparameter optimization, with the original (default) parameter values. Table 5 and Figure 6 present the comparison results.
The information presented in Table 5 and Figure 6 allows a comparison in two respects: performance and feature selection time. Firstly, in terms of performance, comparing the original data with the SRCC-processed data shows that Acc, F1, MCC, ROC_AUC, and PR_AUC do not vary significantly. Furthermore, the comparison of HFS with the other feature selection algorithms demonstrates that the HFS method achieves the best performance, with an Acc of 98.04%, an F1 of 78.20%, and MCC, ROC_AUC, and PR_AUC values of 0.7863, 0.9955, and 0.9408, respectively. The HFS effect is most pronounced for the F1 score and the MCC. The F1 scores of HFS are 4.59%, 3.037%, 2.84%, 1.13%, 60.35%, and 0.4% higher than those of the raw data, SRCC_FS, XGB_FS, RF_FS, PCA, and XGB+RFE_FS, respectively. The MCC values of HFS are 0.04, 0.0268, 0.0252, 0.0098, 0.5357, and 0.0032 higher than those of the raw data, SRCC_FS, XGB_FS, RF_FS, PCA, and XGB+RFE_FS, respectively. Regarding feature selection time, PCA is the fastest at 1.4 s, while XGB+RFE_FS is the slowest at 4301.8 s. Among the three methods that use the XGBoost feature selection algorithm, XGB_FS, XGB+RFE_FS, and HFS take 40.5 s, 4301.8 s, and 760.4 s, respectively. The following conclusions can be drawn from the above comparison:
1. For the data processed by the SRCC, the data dimension is significantly reduced compared with the original data, but the anomaly detection performance does not deteriorate;
2. The HFS method, which combines the SRCC and XGB+RFE, significantly improves the anomaly detection performance compared with the SRCC algorithm alone and reduces the feature selection time by 82.3% compared with the XGB+RFE algorithm;
3. The HFS method uses the fewest features and performs better than the other feature selection algorithms.
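A minimal sketch of a two-stage hybrid selection in this spirit is shown below: a Spearman rank-correlation (SRCC) filter removes highly correlated features, and recursive feature elimination then keeps the strongest remaining ones. GradientBoostingClassifier stands in for XGBoost, and the data, the 0.9 correlation threshold, and the target feature count are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the QAR feature matrix.
X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_redundant=2, random_state=0)

# Stage 1 (SRCC filter): scan features in order and drop any feature whose
# absolute Spearman correlation with an already-kept feature exceeds 0.9.
rho = np.abs(spearmanr(X).correlation)
keep = []
for j in range(X.shape[1]):
    if all(rho[j, k] < 0.9 for k in keep):
        keep.append(j)
X_filt = X[:, keep]

# Stage 2 (wrapper): recursive feature elimination with a boosted-tree
# ranker (GradientBoosting as a stand-in for XGBoost).
rfe = RFE(GradientBoostingClassifier(random_state=0),
          n_features_to_select=8).fit(X_filt, y)
selected = [keep[i] for i, s in enumerate(rfe.support_) if s]
print("selected feature indices:", selected)
```

The cheap filter stage shrinks the feature pool so that the expensive wrapper stage runs on far fewer features, which is the source of the time saving reported above.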

D. ANOMALY DETECTION MODEL ANALYSIS
1) BO hyperparameter
When the LGBM algorithm is used for LLE detection, its hyperparameters are first tuned with a BO algorithm. Given a preset hyperparameter space, the BO obtains the optimal hyperparameters of the LGBM model through iterative calculation, as shown in Table 6.
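The iterative BO loop can be illustrated with a minimal Gaussian-process surrogate and an expected-improvement acquisition function, tuning a single learning-rate hyperparameter. GradientBoostingClassifier stands in for LGBM, and the search space, data, and iteration counts are illustrative assumptions:

```python
import numpy as np
from scipy.stats import norm
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def objective(lr):
    # Cross-validated accuracy of the booster for a given learning rate.
    clf = GradientBoostingClassifier(learning_rate=lr, n_estimators=50,
                                     random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

rng = np.random.default_rng(0)
lo, hi = 0.01, 0.3                       # preset hyperparameter space
obs_x = list(rng.uniform(lo, hi, 3))     # initial random evaluations
obs_y = [objective(v) for v in obs_x]

for _ in range(5):                       # BO iterations
    gp = GaussianProcessRegressor().fit(np.array(obs_x)[:, None], obs_y)
    cand = rng.uniform(lo, hi, 100)
    mu, sd = gp.predict(cand[:, None], return_std=True)
    # Expected-improvement acquisition over the best score so far.
    best = max(obs_y)
    z = (mu - best) / (sd + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)
    nxt = cand[np.argmax(ei)]
    obs_x.append(nxt)
    obs_y.append(objective(nxt))

best_lr = obs_x[int(np.argmax(obs_y))]
print(f"best learning rate = {best_lr:.3f}, CV accuracy = {max(obs_y):.3f}")
```

Unlike grid or random search, each new trial point is chosen where the surrogate predicts the greatest expected improvement, which is why BO typically needs far fewer objective evaluations.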

2) Anomaly detection model performance comparison and analysis
The idea is to choose the best among the best, which can be divided into two steps. First, several baseline models without optimization are compared experimentally and the best model is selected. Then, the hyperparameters of this best model are optimized. In this study, ten groups of basic machine learning algorithms without optimization are selected for comparison experiments. The parameters of the ten machine learning classifiers are as follows: (7) the solver of LR is lbfgs, and class_weight is balanced; (8) NB and MLP use default parameters. Table 7 and Figure 7 present the comparison results. It can be observed from these data that the tree-based models achieve better performance than the other machine learning models, indicating that tree-based models are more suitable for aviation data anomaly detection; their performance ranking is Tree < GBDT < RF < XGB < LGB. Among the ten groups of basic models, LGB performs best, with an Acc of 98.04%, an F1 of 78.20%, and MCC, ROC_AUC, and PR_AUC values of 0.7862, 0.9951, and 0.9408, respectively. Therefore, LGB is selected for further hyperparameter optimization. After the LGB algorithm is enhanced by BO, the F1, MCC, and PR_AUC values improve significantly, increasing by 20.35%, 19.47%, and 4.38%, respectively.
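The five evaluation metrics used throughout the comparison can be computed with scikit-learn, as sketched below for a few of the baseline families. The synthetic imbalanced data is a stand-in for the LLE labels, and average precision is used here as the PR_AUC estimate:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data, mimicking rare LLE labels.
X, y = make_classification(n_samples=600, n_features=12, weights=[0.9],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Tree": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "LR": LogisticRegression(solver="lbfgs", class_weight="balanced"),
}
scores = {}
for name, clf in models.items():
    clf.fit(Xtr, ytr)
    pred = clf.predict(Xte)
    prob = clf.predict_proba(Xte)[:, 1]
    scores[name] = {
        "Acc": accuracy_score(yte, pred),
        "F1": f1_score(yte, pred),
        "MCC": matthews_corrcoef(yte, pred),
        "ROC_AUC": roc_auc_score(yte, prob),
        "PR_AUC": average_precision_score(yte, prob),
    }
for name, s in scores.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

On imbalanced anomaly labels such as LLEs, F1, MCC, and PR_AUC are the more informative columns, since accuracy alone can stay high even when the minority class is missed.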

E. MODEL INTERPRETABILITY ANALYSIS
After the HFS and LGBMBO algorithm model is applied to the training set, the SHAP tool is used to visualize the model feature importance and local dependency maps. Feature importance is a global interpretability tool, defined as the mean magnitude of the SHAP values over all feature value mappings, which indicates the degree of influence on LLEs. In Figure 8, the vertical axis lists the feature names, and the horizontal axis represents the magnitude of the SHAP value. The feature names are ranked from top to bottom by their importance to the predicted results; it can be observed that the distance from the start of the runway ranks among the most important features.
The partial dependence plot (PDP) is a local interpretability tool that can be used to view each parameter's marginal contribution to the predicted outcome. On the PDP, each point represents a row of data; the horizontal coordinate represents the feature value, and the vertical coordinate indicates the effect of this feature value on the final landing result. A value greater than 0 on the vertical coordinate indicates that it promotes the LLE, whereas a value less than 0 indicates that this point corresponds to a normal event. The whole graph shows the mapping between the feature distribution and the SHAP values.
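The SHAP values underlying these plots are Shapley values of the model's output. A minimal exact computation for a tiny model illustrates the definition and the efficiency property; replacing "absent" features by a background mean is one common convention (in practice the shap library's tree explainer would be used on the full model, so everything here is an illustrative sketch):

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
background = X.mean(axis=0)      # reference values for "absent" features

def value(x, subset):
    # Model output with only `subset` features taken from x,
    # the rest replaced by the background mean.
    z = background.copy()
    z[list(subset)] = x[list(subset)]
    return clf.predict_proba(z[None, :])[0, 1]

def shapley(x):
    # Exact Shapley values: weighted average marginal contribution of
    # each feature over all subsets of the remaining features.
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(x, S + (i,)) - value(x, S))
    return phi

phi = shapley(X[0])
# Efficiency property: contributions sum to f(x) - f(background).
gap = (clf.predict_proba(X[0][None, :])[0, 1]
       - clf.predict_proba(background[None, :])[0, 1])
print(np.round(phi, 3), round(gap, 3))
```

The efficiency property is what makes the bar lengths in Figure 8 and the point heights in the dependence plots directly comparable: each feature's SHAP value is its additive share of the deviation from the baseline prediction.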
In this study, to conveniently observe the flight status of the aircraft during landing, the trend of each parameter is examined in two dimensions: altitude and the runway plane. The altitude is referenced to the radar altitude, with the airport plane as the altitude origin, while the runway plane takes the starting point of the runway as its origin. A distance greater than 0 indicates that the aircraft is flying toward the runway, while a distance less than 0 indicates that the aircraft is flying from the starting point of the runway toward its endpoint. These two dimensions are analyzed for several key parameters that affect the flight state: ground speed, brake, glide slope deviation, and glide angle.
The yellow line in Figures 9, 10, and 11 represents the regression line of the parameter distribution, i.e., the overall trend of the data, while the black dashed line is the horizontal line at which the SHAP value is 0. In these figures, if the yellow line is below the black dashed line, the LLE is suppressed at that point; otherwise, the occurrence of the LLE is promoted. It can be observed from Figure 9 that the slope of the regression line is greater when the aircraft is at high altitude than at low altitude, and the higher the aircraft speed, the more likely an LLE is to occur. Therefore, maintaining the appropriate airspeed during the landing phase is crucial.
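The yellow trend line and its crossing of the SHAP = 0 line can be sketched as a simple least-squares fit; the feature values and SHAP values below are simulated for illustration, not taken from the QAR dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical dependence-plot data: ground speed vs. its SHAP value.
speed = rng.uniform(120, 160, 200)                  # knots (illustrative)
shap_vals = 0.02 * (speed - 140) + rng.normal(0, 0.05, 200)

# Regression (trend) line, as drawn in yellow on the plots.
slope, intercept = np.polyfit(speed, shap_vals, 1)
# Where the trend line crosses SHAP = 0: speeds above this value tend to
# promote an LLE, speeds below it tend to suppress one.
crossover = -intercept / slope
print(f"slope = {slope:.4f}, SHAP = 0 at about {crossover:.1f} kt")
```

Reading off this crossover point is exactly how the plots translate into operational advice such as "keep the approach speed below the value where the trend line turns positive".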
The brake value is an angle variable: the greater the brake value, the greater the braking force. It can be observed from Figure 10 that the regression lines of the brake value distributions are typically below the black dashed line, indicating that braking has an inhibitory effect on the LLE, which is consistent with physical logic. In fact, the brakes mitigate the effects of LLEs, i.e., they increase the runway length actually available to the aircraft and reduce the scheduling risk. Figure 11 depicts the effect of the glide slope deviation on the LLE. The glide slope deviation and the glide slope differ by their vertical offset. Despite the multipath effect and the near-field effect, the general trend of the glide slope deviation can be observed. Figure 11 shows that, if the glide slope deviation is less than 0 during the flare-out phase, i.e., the yellow line is below the black dashed line, the aircraft is below the ILS glide path, which suppresses LLEs, and vice versa. The actual glide angle of the aircraft is related to the altitude of the aircraft and its distance from the ILS glide-path localizer, so the pilot should pay attention to both the aircraft altitude and the distance from the glide-path localizer. Figure 12 illustrates the glide angle of the aircraft, from which it can be observed that most aircraft land along a 2.8-degree glide angle. This is consistent with the rule in the flight manual that the glide angle of the aircraft typically lies between 2.5 and 4 degrees. Therefore, the pilot must ensure that the glide angle lies within the appropriate range during the approach phase; essentially, a low glide angle may result in an LLE.
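The relation between altitude, distance, and glide angle mentioned above reduces to simple geometry; a minimal sketch (the units and the 150 m / 3000 m example values are illustrative, not from the dataset):

```python
import math

def glide_angle_deg(radio_alt_m, dist_to_ref_m):
    """Approximate glide angle from radar altitude and the horizontal
    distance to the glide-path reference point (simplified geometry)."""
    return math.degrees(math.atan2(radio_alt_m, dist_to_ref_m))

# Example: 150 m radar altitude at 3000 m from the reference point.
angle = glide_angle_deg(150, 3000)
print(f"{angle:.2f} deg")   # about 2.86 deg, inside the 2.5-4 deg range
```

A shallower angle at the same distance means a lower altitude and a longer float before touchdown, which is the geometric intuition behind "a low glide angle may result in an LLE".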

V. CONCLUSION
Risk event detection and cause analysis are both essential for flight safety and accident prevention. Risk event detection is a prerequisite for risk cause explanation: only a model with a high detection rate can convincingly explain the causes of risk. Conversely, risk cause explanation supplements risk event detection and reflects the practical value of detection. The workflow proposed in this paper balances the ability to detect anomalous events against the inference of their causes.
The proposed HFS-LGBMBO model reduces the computational cost and achieves better performance. Moreover, the SHAP analysis demonstrates that LLEs can be avoided during the landing phase by maintaining an appropriate descent speed, an appropriate aircraft altitude, and a proper descent angle, which is consistent with the requirements of aviation regulations and pilot training manuals.
Despite the strengths of the workflow proposed in this study, some limitations should be noted. The current work does not account for some external environmental factors, such as weather and runway conditions, which may pose a risk of misjudgment. Furthermore, the lack of temporal correlation among events makes it difficult to depict a causal chain of accidents. In future work, aviation daily monitoring data will be used to explore more challenging scenarios, such as assessing the pilot's flight quality and constructing aviation abnormal-event knowledge graphs for causal inference of events.