Recognition of Aircraft Wake Vortex Based on Random Forest

The analysis of aircraft wake vortices is of great significance for improving airspace utilization. Traditional manual methods cannot recognize wake vortices with high accuracy across large volumes of data. To overcome this shortcoming, a fast automatic method based on Random Forests (RF) is proposed. The development of the model is outlined as follows: (1) A wake vortex dataset covering various aircraft types, measured by a Wind3D 6000 LiDAR, was collected at Chengdu Shuangliu International Airport from Aug. 16, 2018 to Oct. 10, 2018. (2) The optimal parameters were determined by grid search, aided by visualization of the characteristic values of wake vortices, to obtain the optimal RF model, giving high efficiency as well as improved accuracy. In terms of the evaluation metrics, the experimental results showed that the method can effectively recognize wake data in different situations and exhibits good robustness.


I. INTRODUCTION
Wake turbulence is a by-product of the lift generated by an aircraft. It is invisible and hazardous to following aircraft, especially during the take-off and landing phases of flight [1]. The study of wake vortices can be used to standardize wake turbulence separation, which ensures flight safety by avoiding wake encounters. In the context of limited airspace capacity, however, conservative wake turbulence separation is the main factor constraining any increase in airspace capacity.
Computational fluid dynamics (CFD) and flow field measurement are the two primary methods for studying the behavior of aircraft wake [2]. CFD supports the consistent study of wake vortex behavior under various environmental conditions, but it has not had a significant impact on the aircraft spacing changes that would improve airport capacity [3]. The high precision and resolution of LiDAR make it the most effective tool for field measurement. In particular, the new wake turbulence separation criteria rely on LiDAR measurements and techniques, which have become effective tools for predicting and characterizing aircraft wake vortices under different environmental parameters at different flight stages [4]-[7].
However, the wind field data generated by LiDAR also contain non-wake wind field data, which makes direct use impossible. In addition, the relative position of the aircraft and the LiDAR is difficult to measure, so a data set may not be matched to the correct flight. Even if ADS-B or QAR data are used, the large amount of data provided by LiDAR requires considerable manual processing. Moreover, owing to the influence of sensor scanning and environmental parameters, traditional manual recognition cannot deliver high-accuracy results on large volumes of wake vortex data. Therefore, the development of an efficient technique for automatic annotation of wake vortex data is essential.
The associate editor coordinating the review of this manuscript and approving it for publication was Abdullah Iliyasu.
Machine learning is an effective means of dividing LiDAR data into non-wake and wake data, but its performance depends on how the LiDAR data are visualized and on the configuration of the color map at different speeds. When data are converted into images, inappropriate color mapping may lead to poor performance and information loss [8]-[10].
To address these issues, the k-nearest neighbor (KNN) [11] and support vector machine (SVM) [12] methods have been used. This study provides a new perspective on LiDAR data classification with a model based on random forests (RF) [13]. RF is an ensemble supervised machine learning algorithm that aggregates the outputs of multiple prediction models to improve classification accuracy. Briefly, combining multiple classifiers decreases variance, especially in the case of unstable classifiers, and produces more reliable results [14]. Compared with other machine learning methods, the advantages of RF can be summarized as follows [15]:
· It requires no dimensionless processing, so it can process multiple forms of data and is suitable for data with certain missing attribute values.
· It can handle high-dimensional and complex data, which allows it to overcome data multicollinearity.
· It is robust to noise and outliers.
· It can directly obtain the contribution of each index to the total risk, avoiding subjective, error-prone manual assignment.
Since the results of LiDAR products can be supported by information content that extends the gray-scale data, RF is a good prospective tool for wake vortex recognition. In such situations, the proposed model is used in a complementary way. The present data set was collected at Chengdu Shuangliu International Airport (CSIA) from August 16, 2018 to October 10, 2018. The experimental study shows that the robustness to outliers and noise in our RF-based method is useful for the recognition of wake vortices.
The remainder of the paper is organized as follows. The data sources of this work and the proposed methodology are introduced in Section II. Experimental results are presented in Section III, and conclusions are drawn in Section IV.

A. DATA SOURCES
The data were obtained using a pulse Doppler LiDAR. The LiDAR detection mode is divided into PPI (plan position indicator) mode and RHI (range height indicator) mode. The data were collected in RHI mode, a schematic diagram of which is shown in Figure 1. The LiDAR model was the Wind3D 6000, and its specific parameters are listed in Table 1. As shown in Figure 2, the LiDAR was located at point A, and the threshold of runway 02R was at point B. The extended runway center line intersected the RHI detection plane at point C. The distance from the LiDAR to point C was 503 m, and the distance from the runway threshold to point C was 1,468 m. More than 270,000 detection data were collected from August 16, 2018 to October 10, 2018, covering the A320 series, A330 series, B738 and GLF4.

B. METHODOLOGY
Limited by the principle of the Doppler effect, the LiDAR can only measure velocity along the line of sight (LOS). The LOS velocity, also called the radial velocity, is sampled discretely, so the set of radial velocities of the wind field is defined as $V_r$:

$$V_r = \{\, v_r(r_i, \theta_j) \mid i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n \,\}$$

where $r_i$ is the i-th range (i.e., the distance from the LiDAR), $\theta_j$ is the j-th angle (i.e., the pitch angle of the LiDAR), and $v_r(r_i, \theta_j)$ is the radial velocity of the point at coordinate $(r_i, \theta_j)$; n and m are the number of discrete values of the pitch angle and the number of range gates obtained by LiDAR detection scanning, respectively. The LiDAR used for field detection had 56 range gates. The azimuth angle is fixed when the LiDAR is in RHI mode. A fast Fourier transform of the echo received by the LiDAR at each pitch angle was used to obtain the radial wind velocity at each range gate. A negative value denotes velocity toward the LiDAR, and a positive value denotes velocity away from it. When the LiDAR completes one sweep over the set range of elevation angles, the radial velocity wind field of one cross-section is obtained. Figure 3 shows part of the wake data detected by LiDAR as it evolves over time, where the color bar represents the radial velocity, the wake appears as a pair of red and green regions, and (A-F) show the time evolution of the same wake of an A320 aircraft.

1) FEATURE EXTRACTION
Three features were used in this paper, namely the velocity envelopes over ranges and over angles, and the average background wind field. $D_r(r_i)$, $D_a(\theta_j)$ and $V_R$ are introduced as follows:

$$D_r(r_i) = \max_{j} v_r(r_i, \theta_j) - \min_{j} v_r(r_i, \theta_j)$$

$$D_a(\theta_j) = \max_{i} v_r(r_i, \theta_j) - \min_{i} v_r(r_i, \theta_j)$$

$$V_R = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} v_r(r_i, \theta_j)$$

where $D_r(r_i)$ and $D_a(\theta_j)$ are the maximum differences in radial velocity at the i-th range $r_i$ and the j-th angle $\theta_j$, respectively. The core position of the vortices can be determined from the difference between the maxima (the positive envelope) and the minima (the negative envelope). $V_R$ is the average background wind field, i.e., the average value of the radial velocity wind field of a cross-section.
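The three features can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes a cross-section is stored as a 2-D array with one row per range gate and one column per pitch angle, and the function name `extract_features`, the 40 pitch angles and the synthetic vortex are all hypothetical.

```python
import numpy as np


def extract_features(v_r):
    """Return (D_r, D_a, V_R) for one cross-section.

    v_r: radial velocities, shape (m range gates, n pitch angles).
    The array layout is an assumption made for this sketch.
    """
    v_r = np.asarray(v_r, dtype=float)
    # D_r(r_i): max minus min radial velocity over all angles, per range gate
    d_r = v_r.max(axis=1) - v_r.min(axis=1)
    # D_a(theta_j): max minus min radial velocity over all ranges, per angle
    d_a = v_r.max(axis=0) - v_r.min(axis=0)
    # V_R: mean radial velocity of the whole cross-section
    v_bg = v_r.mean()
    return d_r, d_a, v_bg


# Synthetic cross-section: 56 range gates (as in the text) x 40 angles (assumed)
rng = np.random.default_rng(0)
scan = rng.normal(loc=-2.0, scale=0.3, size=(56, 40))
scan[20, 15] += 6.0  # crude positive lobe of a made-up vortex
scan[20, 17] -= 6.0  # crude negative lobe

d_r, d_a, v_bg = extract_features(scan)
print(d_r.shape, d_a.shape, int(d_r.argmax()))
```

In this toy example the largest value of `d_r` falls at the range gate where the synthetic velocity pair was injected, which is the behavior the envelope features are meant to capture.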
In the construction of the labeled data set with ground-truth wake vortex annotations, the function $D_r(r_i)$ was used to determine whether a wake vortex exists in a scanning measurement, in a manner similar to Equation 1 of [16]. Figure 4 shows the data with three pronounced bimodal distributions for A333, GLF5 and A332, respectively. The range bins in which the two peaks appear are identified as the wake vortex range bins.
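The text does not specify how the two peaks of the range profile are located. As one possible reading, the sketch below looks for the two strongest local maxima above a threshold. The function name `find_vortex_pair`, the threshold `min_height` and the sample profiles are hypothetical and are not taken from the paper.

```python
import numpy as np


def find_vortex_pair(d_r, min_height):
    """Return the indices of the two strongest local maxima, or None.

    A bimodal range profile is taken here as a hint that a vortex pair is
    present; this rule is an assumption made for illustration only.
    """
    d_r = np.asarray(d_r, dtype=float)
    inner = d_r[1:-1]
    # Interior local maxima: higher than the left neighbour, not lower than the right
    is_peak = (inner > d_r[:-2]) & (inner >= d_r[2:])
    peaks = np.flatnonzero(is_peak) + 1
    peaks = peaks[d_r[peaks] >= min_height]
    if peaks.size < 2:
        return None
    strongest = peaks[np.argsort(d_r[peaks])[-2:]]
    return tuple(sorted(int(k) for k in strongest))


bimodal = [1.0, 1.2, 1.1, 4.0, 8.0, 5.0, 2.0, 1.5, 6.0, 9.0, 4.0, 1.0, 1.0]
flat = [1.0, 1.1, 1.0, 1.2, 1.1, 1.0]
print(find_vortex_pair(bimodal, min_height=3.0))
print(find_vortex_pair(flat, min_height=3.0))
```

A practical labeling tool would probably also constrain the spacing between the two peaks, which this sketch leaves out.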

2) RF MODEL
RF is an ensemble learning method for classification that combines several relatively simple estimators (i.e., decision trees) to form a cumulative effect [17]. Compared with other current algorithms, RF can assess the importance of each feature in classification, achieve excellent accuracy and run efficiently on large datasets. It has been applied in data mining, big data, bioinformatics, and other fields.
The RF algorithm employs the bootstrap sampling method (bagging) to build decision trees: N training groups are drawn repeatedly and at random, each covering about 2/3 of the original data. In the construction of each tree, m features, called the random subspace of the M feature variables (m ≤ M), are randomly selected for splitting the internal nodes. Finally, the predictions of the N decision trees are put to a vote to determine the category of a new sample. An RF model with N decision trees was built in this study; the test set was used to score the performance of the model, and the importance of each feature was ranked. The feature importance score (IS) is calculated as follows. For each decision tree in the RF, the corresponding out-of-bag data are used to calculate its out-of-bag error (errOOB1). Random noise is then added to the feature under evaluation in all out-of-bag samples, and the error is calculated again and denoted errOOB2. The larger the increase from errOOB1 to errOOB2, averaged over the N trees, the more important the feature.
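The out-of-bag importance procedure can be sketched as follows. Two assumptions are made here and are not confirmed by the source: the "random noise" is realized by permuting the feature among the out-of-bag samples, and the score is the mean of errOOB2 - errOOB1 over the trees. The data are synthetic, and the function name `oob_importance` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def oob_importance(X, y, n_trees=50, seed=0):
    """Mean increase in out-of-bag error after perturbing each feature."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    total_increase = np.zeros(n_features)
    for _ in range(n_trees):
        # Bootstrap sample: draw n_samples indices with replacement
        boot = rng.integers(0, n_samples, size=n_samples)
        # Out-of-bag samples: those never drawn for this tree
        oob = np.setdiff1d(np.arange(n_samples), boot)
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # random feature subspace at each split
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[boot], y[boot])
        err_oob1 = np.mean(tree.predict(X[oob]) != y[oob])
        for f in range(n_features):
            X_noisy = X[oob].copy()
            # Assumed form of "noise": shuffle feature f among OOB samples
            X_noisy[:, f] = rng.permutation(X_noisy[:, f])
            err_oob2 = np.mean(tree.predict(X_noisy) != y[oob])
            total_increase[f] += err_oob2 - err_oob1
    return total_increase / n_trees


# Synthetic data: with shuffle=False, columns 0-1 are informative, 2-5 are noise
X, y = make_classification(n_samples=400, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
scores = oob_importance(X, y)
print(np.round(scores, 3))
```

On this toy data the informative columns should receive clearly higher scores than the pure-noise columns.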

3) WAKE VORTEX RECOGNITION
The flow chart of the aircraft wake vortex recognition algorithm based on RF model is shown in Figure 5, and each part of the algorithm flow is described in detail in Section C.

C. EXPERIMENT

1) EXPERIMENTAL ENVIRONMENT
The experiments were performed on a workstation with 48 GB of RAM and an i7-9700 CPU, using PyCharm 2020.1.1, Anaconda 3 and Python 3.8. The scikit-learn package was used for the classification tasks of the RF model.

FIGURE 6. Visualization of cross-section data in gray scale. For all plots, the x-axis denotes the scan range and the y-axis denotes the scan pitch angle. (A) shows a weak wake, (B) a strong wake, (C) a strong wake that is sinking, (D) a relatively uniform background wind field, (E) a relatively strong background wind field, and (F) an uneven background wind field.

VOLUME 10, 2022

2) DATA PREPROCESSING
The detected aircraft types included the A320 series, A330 series, B738, GLF4, and others. The 270,000 LiDAR samples collected in good weather were considered, each sample corresponding to a given scan pitch angle in degrees. The effective detection data were screened at pitch angles from 0 to 10 degrees. A cross-section of the LiDAR data can then be visualized as a gray-scale image, as shown in Figure 6, where (A-C) are images of wakes and (D-F) are images of non-wakes.

3) MODEL TRAINING
The data set used in the experiment comprised 2,500 cross-sections, from which the feature values were extracted and labeled. It included 540 aircraft wake samples and 1,960 non-wake samples. Preserving this label ratio, 2,000 samples were randomly selected as the training set of the RF model, and the remaining 500 samples formed the test set.
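A split that preserves the label ratio can be obtained with scikit-learn's `train_test_split` and its `stratify` argument. The sketch below uses the class counts stated above with a placeholder feature matrix. Whether the authors used this exact call is not stated, so it should be read as one plausible way to produce such a split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts given in the text: 540 wake (1), 1,960 non-wake (0)
y = np.concatenate([np.ones(540, dtype=int), np.zeros(1960, dtype=int)])
# Placeholder feature matrix (3 random columns, purely illustrative)
X = np.random.default_rng(0).normal(size=(y.size, 3))

# stratify=y keeps the wake / non-wake ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=500, stratify=y, random_state=0
)
print(len(y_train), len(y_test), int(y_test.sum()))
```

With these counts the stratified test set holds 108 wake and 392 non-wake samples, which agrees with the test-set composition reported in Section III.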
To obtain the optimal parameters of the RF model, the grid search tool GridSearchCV [18] was used to find the parameter combination that gives the best classification performance.
The key terms of GridSearchCV are briefly summarized as follows [19]:
· Estimator: an object implementing the scikit-learn estimator interface, i.e., the classifier to be trained.
· Parameter grid: a Python dictionary that maps parameter names to lists of candidate values. All parameter combinations are tested to find the one with the best accuracy.
· Cross-validation: a procedure that evaluates learning models by resampling the available data. Its purpose is to check the performance of a model on unseen data, which gives a less biased and less optimistic estimate than a single train-test split.
The specific simulation parameters are detailed in Table 2, where GridSearchCV was used to optimize RF model parameters.
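The three terms above map directly onto the arguments of `GridSearchCV`. The following sketch runs a small grid on synthetic data. The candidate values, the 5-fold setting and the accuracy scoring are assumptions made for illustration, and the ranges actually searched are those of Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the labeled feature data (class imbalance is illustrative)
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           weights=[0.78, 0.22], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Parameter grid: parameter names mapped to lists of candidate values (assumed)
param_grid = {
    "n_estimators": [30, 90],
    "max_depth": [5, 7],
    "min_samples_leaf": [1, 13],
    "max_features": ["sqrt"],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),  # classifier to be tuned
    param_grid=param_grid,
    cv=5,                # 5-fold cross-validation (assumed)
    scoring="accuracy",  # selection criterion (assumed)
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

After fitting, `search.best_estimator_` is the model refitted on the full training set with the selected parameters, and it can be evaluated on the held-out test set as shown.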

4) PERFORMANCE ASSESSMENT
In this section, Accuracy, Recall, Precision, F1_score, the receiver operating characteristic (ROC) and the area under the ROC curve (AUC) [20]-[23] were used as performance metrics to evaluate the classification performance of the RF model when applied to the recognition of aircraft wake vortices. Accuracy is the ratio of the number of correctly predicted samples to the total number of samples; Precision is the ratio of the true positives (TP) to the total number of samples predicted as positive; Recall is the ratio of the TP to the actual total number of positive samples; the F1_score conveys the balance between Precision and Recall; the ROC curve shows how the true positive rate (TPR) and the false positive rate (FPR) change as the decision threshold is varied, and a larger AUC indicates more accurate recognition. The formulas of Accuracy, Recall, Precision and F1_score are:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{F1\_score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TN, FP and FN denote the numbers of true negatives, false positives and false negatives, respectively.

III. RESULTS
In this section, experiments on the wake vortex data set are reported. For a general comparison, the experimental results of KNN [11], SVM [12] and RF were considered, and the results are summarized in Table 3.
With the default parameters, the RF model classified well, with a recall rate of 0.83, an accuracy rate of 0.89 and an F1-score of 0.90. The results of RF with the default parameters matched those of 5-fold cross-validation and SVM in terms of recall, while RF exceeded 5-fold cross-validation, KNN and SVM on the other evaluation metrics. The RF model with the optimal parameters obtained by grid search performed better than its default counterpart, giving consistently improved recognition accuracy. Figure 7 shows the confusion matrix of the RF model with the optimal parameters. The test set consisted of 500 samples, including 108 wake samples and 392 non-wake samples. Of the 108 wake samples, 96 were predicted correctly; of the 392 non-wake samples, 379 were predicted correctly.
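As a worked check, the metrics can be recomputed from the confusion-matrix counts quoted above using the standard definitions. The per-class breakdown and the macro average in this sketch are additions for illustration, since the paper does not state which averaging scheme its F1_score uses.

```python
# Counts read from the confusion matrix described in the text
tp, fn = 96, 108 - 96    # wake samples: correctly / incorrectly classified
tn, fp = 379, 392 - 379  # non-wake samples: correctly / incorrectly classified


def precision_recall_f1(tp, fp, fn):
    """Standard definitions for one class treated as positive."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = (tp + tn) / (tp + tn + fp + fn)
wake = precision_recall_f1(tp, fp, fn)      # wake as the positive class
non_wake = precision_recall_f1(tn, fn, fp)  # non-wake as the positive class
macro_f1 = (wake[2] + non_wake[2]) / 2

print(f"accuracy = {accuracy:.3f}")
print(f"wake:     precision = {wake[0]:.3f}, recall = {wake[1]:.3f}, F1 = {wake[2]:.3f}")
print(f"non-wake: precision = {non_wake[0]:.3f}, recall = {non_wake[1]:.3f}, F1 = {non_wake[2]:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

The accuracy evaluates to 0.95, in line with the reported value. The wake-class F1 is about 0.885, while the macro-averaged F1 is about 0.93, so the reported F1_score of 0.93 would be consistent with macro averaging if that is what was used.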
The ROC curves are presented to evaluate the generalization performance of the RF model. Figure 8 shows the ROC curves of the RF model under the default parameters and the optimal parameters. The AUC values of 0.96 and 0.97, respectively, show that the RF model is highly robust under both parameter settings.
Finally, an experiment was conducted on the classification of the 108 wake samples by aircraft category: light, medium and heavy aircraft. Table 4 lists the results of the quantitative evaluation.

IV. CONCLUSION
An effective method for processing aircraft wake data and extracting features is proposed in this paper. The aircraft wake data, measured by pulse Doppler LiDAR, were collected at Chengdu Shuangliu International Airport from August 16, 2018 to October 10, 2018. By screening the effective detection data by pitch angle, a total of 2,500 cross-sections were visualized as gray-scale images that show obvious wake vortex characteristics and are highly recognizable. Experimental results demonstrated that the algorithm based on the random forest classification scheme can effectively recognize aircraft wake data. With the parameters n_estimators = 90, max_depth = 7, min_samples_leaf = 13, min_samples_split = 48 and max_features = 'sqrt' found by grid search, the best classification was obtained: the Accuracy was 0.95, the F1_score was 0.93 and the AUC was 0.97.

His research interests include big data, machine learning, artificial intelligence, and aviation safety.
YUANFEI LENG received the B.S. degree in traffic and transportation engineering from Chang'an University, Xi'an, in 2020. He is currently pursuing the M.S. degree in transportation with the Civil Aviation Flight University of China.
His research interests include artificial intelligence in air traffic management and aircraft wake interval reduction technology.
XIAOLEI ZHANG received the Ph.D. degree in control theory and control engineering from Nankai University, Tianjin, China, in 2014.
From 2015 to 2017, he was a Postdoctoral Research Fellow with Shantou University Medical College, Shantou, China. From 2017 to 2021, he was a Lecturer with the College of Air Traffic Management, Civil Aviation Flight University of China, Guanghan, China. He is currently working with the Second Affiliated Hospital of Shantou University Medical College, Shantou, China. He is the author of more than 25 articles. His current research interests include digital signal processing, biomedical engineering, machine learning, and computer vision.
Dr. Zhang was a recipient of the Postdoctoral Support Project of Yangfan Talent Plan from Guangdong Province, in 2016 and 2017.