Application of a PCA-ANN Based Cost Prediction Model for General Aviation Aircraft

The major objective of this paper is to build a cost prediction model for general aviation aircraft using artificial neural network (ANN) and principal component analysis (PCA) methods. A total of 22 samples of general aviation aircraft collected from the literature are utilized to train and test the model. In the PCA, the eigenvalues of PC1 and PC2 are 6.987 and 1.529, respectively, indicating that they carry the strongest interpretation of the original variable information; they are therefore retained as the cost influencing variables used to train the ANN model. Pure multiple linear regression (MLR), stepwise regression (SR) and ANN models are built for comparison. The comparative results reveal that the ANN method estimates better than the MLR and SR models when the data exhibit multi-collinearity. Combined with PCA, the ANN model is optimized, with MAPE, MAE, R and RMSE values for the training and testing samples of 0.009 and 0.015, 1.222 and 3, 0.9999 and 0.9994, and 1.667 and 3.416, respectively. Finally, a more accurate and practical prediction model is developed. More importantly, this research can provide an important reference for general aviation aircraft companies in terms of product cost planning and corporate sales strategy.


I. INTRODUCTION
The competition among general aviation aircraft production and sales companies mainly revolves around the control of aircraft product prices, product quality, and production cycles. Therefore, an accurate development cost prediction model that can quickly formulate a reasonable product sales price is extremely important for good decision-making and for improving the competitiveness of companies. Although the early manufacturing and design cost (development cost) of general aviation aircraft accounts for only part of the life cycle cost (LCC), as proposed by Yeh and Deng [1], it determines the overall trend of the LCC. However, due to the difficulty of data collection, the late start of the industry, and the multi-collinearity between variables, the relevant cost prediction models are limited [2], and there is currently no specific model to forecast general aviation aircraft cost.
In the real world, cost data exhibit multi-collinearity and are sensitive to noise, the number of variables, and specific (outlier) points. Numerous studies [2]-[4] have demonstrated that the quality of the data can affect the prediction accuracy of a model. Since the 1960s, multiple linear regression (MLR) models have been widely utilized to predict aircraft cost. MLR is a statistical analysis method that determines the linear effect of independent variables on a dependent variable. In recent decades, researchers have developed many MLR models, such as DAPCA, PRICE H and SEER H, to predict aircraft costs [5]. Although MLR is very popular and useful in many cost prediction problems, it still has drawbacks. For example, as proposed by [6], it is not suitable for systems with a large number of complex variables, and multi-collinearity among the variables leads to deviations in the interpretation of the results. Moreover, the authors in [7] reported that an aircraft is a relatively complex system with many parameters that affect the cost and are correlated with each other. In recent years, researchers have tried to overcome this disadvantage of MLR by utilizing artificial neural network (ANN) models. Previous studies [8], [9] have demonstrated that the ANN method can obtain more accurate prediction performance owing to the deep learning ability of the model, which makes the ANN method unaffected by the multi-collinearity of cost data. The authors in [1] utilized the ANN method to predict product life cycle cost; the results indicated that the ANN method is feasible for predicting the cost of airframe structure manufacturing. The authors in [10] utilized the ANN method to predict the equipment cost of liquid crystal displays; they reported that the method is fast and independent of data types.
The authors in [11] studied the feasibility of the ANN method for new construction cost prediction, and their approach provided satisfactory prediction results. However, in previous research [5], we determined that the prediction results of the ANN method are poor for small sample sizes; other works [12], [13] reached the same conclusion.
In a prediction model, when multiple cost influencing variables are related to each other, that is, when the variable information exhibits multi-collinearity, many problems arise in the use of regression models, thereby affecting the prediction accuracy of the model. To solve this problem, the easiest way is to remove one or more variables that contribute little explanatory power to the dependent variable and are strongly correlated with the other variables. The remaining variables are utilized to build a prediction model, thereby effectively avoiding the effects of multi-collinearity and improving the prediction accuracy of the model. Researchers have utilized the forward selection method, the optimal subset selection method, the backward selection method and the stepwise selection (SR) method to reduce the number of variables. Among them, the SR method is probably one of the most commonly used in substantive and effectiveness studies [14], and it has the advantage of saving the analyst's time. SR is utilized to find and eliminate the parameters that cause multi-collinearity, so that the explanatory variables retained in the model are both important and mutually uncorrelated. The authors in [15] presented an SR model to estimate dominant modes, and the results proved the developed method to be effective. The authors in [16] utilized the SR method to estimate target parameters; they observed that inaccurate predictions result from missing and incomplete variable information. Meanwhile, the problem with this approach is that simply performing linear regression on the retained variables is not fully applicable to changes in the general aviation aircraft market, and the authors in [14] concluded that the SR method suffers from problems such as incorrect results and inherent bias in the process itself.
Based on the limitations of the SR method, we hope to explore a more practical and accurate variable filtering method that solves the multi-collinearity problem of the data while preserving as much variable information as possible, thereby improving the prediction accuracy of the model. Principal component analysis (PCA) is such an analysis method: it can effectively reduce the dimension of the parameters while retaining the effective influence components of all cost influencing parameters. At present, the PCA method plays an important role in image compression [17], face recognition [18], [19], and image representation [20], [21]. However, it has not received much attention in solving the problems of data multi-collinearity and improving the prediction accuracy of models. To our knowledge, no literature has reported the combination of PCA and ANN methods to solve the cost prediction problem for general aviation aircraft. Thus, a PCA-ANN based model can be considered to predict the development cost of general aviation aircraft. The main objective of this study is to verify whether the cost influencing parameters obtained through the PCA method are significant for ANN modeling, and whether the proposed PCA-ANN model has higher prediction accuracy than other existing models.
This work is organized as follows. Section 2 briefly introduces three cost prediction models: MLR, SR and ANN. The combined method and the modeling procedures of this research are presented in Section 3, followed by a case study comparing the training results of the different prediction methods in Section 4. In Section 5, the fitting and prediction results are summarized, and the feasibility of the PCA+ANN method is determined. Finally, concluding remarks and future works are discussed in Section 6.

II. COST PREDICTION MODELS
A. MULTIPLE LINEAR REGRESSION
MLR is a statistical method that models the relationship between two or more independent variables and a dependent variable through a linear equation [6]. In practical problems, when the dependent variable is affected by multiple factors, and the correlation between each factor and the dependent variable can be approximately regarded as linear, an MLR model can be established for prediction and analysis.
Assume that $X$ and $Y$ denote the matrix of independent variables $x_{ij}$ over $n$ samples and the vector of the dependent variable $y$ over $n$ samples, respectively. The MLR model can be represented by the following equation:

$$Y = XB + \varepsilon$$

where $B$ and $\varepsilon$ denote the regression coefficient vector and the random error vector, respectively. To estimate $B$, the least squares method selects the value $\hat{B}$ that minimizes the sum of squares of $\varepsilon$, which can be expressed as follows:

$$\hat{B} = \arg\min_{B}\,(Y - XB)^{T}(Y - XB)$$

Thus, we obtain

$$\hat{B} = (X^{T}X)^{-1}X^{T}Y$$

Finally, the MLR model can be converted into the following equation:

$$\hat{Y} = X\hat{B}$$

B. STEPWISE REGRESSION
The SR method introduces variables into the model one at a time. After each introduction, the model is subjected to the F test.
If an initially introduced explanatory parameter is no longer significant after the introduction of a subsequent explanatory variable, it is eliminated. This process is repeated until no further significant explanatory parameters can be introduced into the regression model. A detailed description of each phase is given in our previous work [5].
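The least-squares solution above can be sketched with NumPy. This is an illustrative example on made-up data, not the paper's cost data; the sample values and coefficients are invented for the demonstration.

```python
import numpy as np

# Hypothetical sample data: n = 6 samples, m = 2 cost drivers (illustrative only)
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.0],
              [4.0, 2.5],
              [5.0, 4.0],
              [6.0, 3.5]])
y = 2.0 + 0.5 * X[:, 0] + 1.5 * X[:, 1]   # exact linear relation for the demo

# Augment with a column of ones so B includes the intercept term
Xa = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate: B_hat = (X^T X)^{-1} X^T Y, solved without an explicit inverse
B_hat = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)

y_hat = Xa @ B_hat                        # fitted values Y_hat = X B_hat
print(B_hat)                              # ≈ [2.0, 0.5, 1.5]
```

Solving the normal equations with `np.linalg.solve` rather than forming the matrix inverse directly is the standard numerically safer choice.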

C. ARTIFICIAL NEURAL NETWORK
ANNs are complex network systems composed of many widely interconnected simple processing units (neurons). In practical applications, ANN models trained by the back-propagation (BP) algorithm are the most commonly used [5]. A BP model compares the output and target values and utilizes the gradient steepest descent method to reduce the model error. A typical ANN structure is shown in Fig. 1: the network contains three interrelated layers, namely the input, hidden and output layers. Each layer has its own initial weights, number of neurons, neuron function and biases. Summation and activation are the two functions performed by the neurons in each layer: first the weighted inputs are accumulated, and then the accumulated amount is passed through a squashing function to obtain the output. Information flows between neurons, with each layer receiving signals from the previous layer and transmitting its output to the next.

The BP algorithm is utilized to train the ANN model. The algorithm computes a cumulative error between the actual and predicted outputs and propagates the error backwards to adjust the weight values. The main purpose of network training is to minimize the network error. The stopping criterion for training is the mean square error (MSE) calculated at the output layer [1]. If the MSE reaches the set value, training of the ANN model is terminated; otherwise, the weights between the output and hidden layers are updated by back-propagating the error, and the weights between the hidden and input layers are updated in the same manner, until the MSE reaches the set value. In this way, the error between the desired and predicted output values is minimized.
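The forward pass, error computation and one steepest-descent weight update described above can be sketched in NumPy. This is a minimal one-hidden-layer illustration on synthetic data, not the paper's network; the layer sizes, learning rate and data are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 8 samples, 3 inputs, 1 output (all sizes are illustrative)
X = rng.normal(size=(8, 3))
y = X @ np.array([[0.4], [-0.2], [0.7]])       # target the network should learn

# One hidden layer with tanh activation and a linear output, as in Fig. 1
W1, b1 = rng.normal(scale=0.1, size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(scale=0.1, size=(5, 1)), np.zeros(1)
mu = 0.1                                        # learning rate

def forward(X):
    h = np.tanh(X @ W1 + b1)                    # hidden-layer output
    return h, h @ W2 + b2                       # network output (linear)

h, out = forward(X)
mse0 = np.mean((y - out) ** 2)                  # network error before the update

# Back-propagation: gradient of MSE w.r.t. each weight, then steepest descent
g_out = 2 * (out - y) / y.size                  # dMSE/d(out)
gW2 = h.T @ g_out
g_h = (g_out @ W2.T) * (1 - h ** 2)             # tanh'(z) = 1 - tanh(z)^2
gW1 = X.T @ g_h

W2 -= mu * gW2; b2 -= mu * g_out.sum(0)
W1 -= mu * gW1; b1 -= mu * g_h.sum(0)

_, out1 = forward(X)
mse1 = np.mean((y - out1) ** 2)
print(mse1 < mse0)                              # one BP step reduces the error
```

In practice this loop is repeated until the MSE falls below the set target or the epoch limit is reached, exactly as the stopping criterion above describes.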

III. OUR PROPOSED METHOD
The basic idea of PCA is to reduce the dimension of the variables. It transforms a multi-indicator problem into fewer comprehensive indicators while reflecting as much of the original information as possible, thereby simplifying the statistical characteristics of the variable system. The PCA method uses an orthogonal transformation to convert a number of correlated random variables into a smaller set of uncorrelated new variables through linear combinations [22]; these new variables are the principal components (PCs). Generally, the PCs have the following characteristics: (1) The number of PCs is much smaller than the number of original variables. Substituting the PCs for the original variables in modeling greatly reduces the amount of calculation in the model training phase.
(2) Unlike the simple filtering of original information performed by the SR method, the PCs are obtained by reorganizing the original data. Therefore, most of the original variable information is retained.
(3) The PCs are uncorrelated with each other. The new indicators (principal components) obtained by PCA can therefore be utilized to deal with the problems caused by overlapping variable information and multi-collinearity in data modeling.
In other words, we want to reduce the raw data of dimension $n \times m$ (where $n$ denotes the number of samples and $m$ the number of variables) to a smaller dimension $n \times k$ ($k$ denotes the number of components). In this new space the individual components are uncorrelated with each other, while preserving the largest possible variance (the larger the variance, the more information the principal component contains) [23]. Generally, PC1 contains the largest amount of information, that is, PC1 has the largest variance among all linear combinations of the original variables, and is called the first principal component. When the information contained in the first principal component is insufficient to represent the original variable information, the second principal component PC2 is selected. Note that, in order to reflect the original information efficiently, the information already present in PC1 must not be re-represented in PC2; that is, the components must be kept independent and uncorrelated. This relationship can be expressed by the covariance condition $\mathrm{cov}(PC_1, PC_2) = 0$. The mathematical expression of the PCA method is:

$$PC_i = \gamma_{i1}x_1 + \gamma_{i2}x_2 + \cdots + \gamma_{im}x_m$$

where $\gamma_{ij}$ represents the coefficient of the $j$th variable in the $i$th component, and $PC_i$ denotes the $i$th principal component.
In order to obtain the coefficient vector $\gamma_i = (\gamma_{i1}, \gamma_{i2}, \cdots, \gamma_{im})^{T}$, the covariance matrix can be used. We assume $X_{n \times m}$ denotes the original data matrix,

$$X_{n \times m} = (x_1, x_2, \cdots, x_m), \quad x_j = (x_{1j}, x_{2j}, \cdots, x_{nj})^{T}$$

The elements $s_{mj}$ of the covariance matrix $S$ of the original data matrix $X$ can be expressed as

$$s_{mj} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{im} - \bar{x}_m\right)\left(x_{ij} - \bar{x}_j\right)$$

where $\bar{x}_m$ and $\bar{x}_j$ represent the mean values of the $m$th and $j$th variables, respectively:

$$\bar{x}_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}$$
We assume that $\lambda_i$ and $\gamma_i$ denote the eigenvalues and orthogonalized unit eigenvectors of the covariance matrix $S$, respectively. The first $k$ eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k \ge 1$ of $S$ are the variances corresponding to the first $k$ PCs. Following the authors in [23], the first $k$ PCs are selected to achieve the dimension reduction from $X_{n \times m}$ to $PC_{n \times k}$. The eigenvector $\gamma_i$ corresponding to $\lambda_i$ gives the coefficients of $PC_i$ on the original variables:

$$PC_i = X\gamma_i = \gamma_{i1}x_1 + \gamma_{i2}x_2 + \cdots + \gamma_{im}x_m$$
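The eigen-decomposition steps above can be sketched in NumPy. The data here are synthetic and purely illustrative (four variables driven by two latent factors, so strong multi-collinearity is built in); the paper's actual analysis uses the ten cost drivers of Table 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: n = 60 samples, m = 4 variables driven by 2 latent factors
f1, f2 = rng.normal(size=(60, 1)), rng.normal(size=(60, 1))
X = np.hstack([f1, 0.9 * f1, f2, 0.8 * f2]) + 0.05 * rng.normal(size=(60, 4))

# Standardize, then eigen-decompose the covariance matrix of the z-scores
Z = (X - X.mean(0)) / X.std(0, ddof=1)
S = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]              # descending: lambda_1 >= lambda_2 >= ...
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Retain components with eigenvalue >= 1, as in the eigenvalue criterion of Sec. IV-E
k = int(np.sum(eigvals >= 1.0))
PC = Z @ eigvecs[:, :k]                        # n x k score matrix, columns uncorrelated
print(k)                                       # two latent factors -> k == 2 here
```

Because the eigenvectors of the symmetric covariance matrix are orthogonal, the retained score columns are exactly uncorrelated, which is the property the modeling relies on.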
Take $k$ and $PC_i$ as the number of nodes and the input values of the input layer, respectively. Let $c_{ij}$ represent the weights connecting the input and hidden layers, and $\alpha_j$ the biases of the hidden layer. The input value of the $j$th unit in the hidden layer can be expressed as:

$$y_j = \sum_{i=1}^{k} c_{ij}\,PC_i + \alpha_j$$

Consequently, the input value of the $u$th unit in the output layer can be expressed as

$$o_u = \sum_{j=1}^{p} d_{ju}\,a_j + \beta_u$$

where $p$ denotes the number of nodes in the hidden layer, $y_j$ denotes the input value of the hidden layer, $\beta_u$ the biases of the output layer, and $d_{ju}$ the weights connecting the hidden and output layers. The outputs of the hidden and output layers are expressed in Eqs. (14) and (15):

$$a_j = f(y_j) \tag{14}$$

$$b_u = f(o_u) \tag{15}$$
where $a_j$ represents the output value of the $j$th node in the hidden layer, $f$ is the transfer function, and $b_u$ is the output value of the $u$th node in the output layer. The network error (MSE) is then calculated as

$$MSE = \frac{1}{l}\sum_{u=1}^{l}\left(e_u - b_u\right)^2$$

where $e_u$ denotes the desired output value and $l$ is the number of nodes in the output layer. The back-propagation algorithm is utilized as the training algorithm to minimize the network error. With $\mu$ denoting the learning rate of the network, the revised weights $\hat{c}_{ij}$ and $\hat{d}_{ju}$ can be expressed as

$$\hat{c}_{ij} = c_{ij} - \mu\frac{\partial MSE}{\partial c_{ij}}, \qquad \hat{d}_{ju} = d_{ju} - \mu\frac{\partial MSE}{\partial d_{ju}}$$

When the error value or the number of training epochs reaches the network's set value, training stops and the prediction result is obtained. Otherwise, network training continues from Eq. (14). Fig. 2 shows the modeling procedure for combining the PCA and ANN methods.
Based on the multi-collinearity of general aviation aircraft cost data, MLR, SR and ANN models were established using pre-processed samples, and the best prediction model (the ANN model) was determined by comparing their estimation results. In parallel, the PCA method was utilized to reduce the dimension of the cost influencing variables; the applicability of the PCA method and the practicability of the combined model (PCA+ANN) were determined by observing the prediction performance of the combination of PCA and the single best prediction model. Fig. 3 shows the analysis method adopted in this research.

IV. DEVELOPMENT COST ESTIMATION CASE STUDY
A. DATA COLLECTION
Unlike military aircraft, general aviation aircraft started late and have short development times, and most cost data are kept for internal company reference only. As a direct result, cost data are difficult to obtain, personnel specializing in cost prediction are scarce, and no cost prediction model applicable to general aviation aircraft exists. We therefore focus on solving the cost prediction problem for general aviation aircraft. Cost data for 22 general aviation aircraft were collected from the open literature [24]. There are numerous cost drivers for general aviation aircraft. Combining the variables introduced by Qu [24] and Chen et al. [5], and referring to the cost drivers of military aircraft, 10 variables were selected as the main driving factors. The collected cost data are shown in Table 1.

B. DATA PRE-PROCESSING
The general aviation aircraft cost data collected in Table 1 may be inaccurate or inconsistent. In the cost prediction problem, each sample is discrete and independent, yet errors in individual cost data can affect the overall trend of the data. It is therefore crucial to pre-process the cost data and remove incorrect samples before building the model; data pre-processing improves the quality of the cost data and thereby the prediction accuracy of the cost prediction models. In this study, we use the judgment ellipse method in the SIMCA-P software for data pre-processing and screening. The 22 samples with 10 independent variables are processed in the software with the confidence level set to 95%. Fig. 4 shows the 3D (a) and 2D (b) scatter plots of the 22 data points obtained by the judgment ellipse method. As shown in Fig. 4, the 10th sample point lies outside the 95% confidence judgment ellipse; it is a specific (outlier) point and should be removed to improve the overall quality of the data.
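A judgment-ellipse style screen like the one SIMCA-P applies can be approximated by scoring each sample's Mahalanobis distance on the first two principal components against a 95% chi-square cutoff. The data below are a synthetic stand-in for Table 1 (with the 10th sample deliberately shifted), and the chi-square boundary is an approximation of the software's ellipse, not its exact implementation.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)

# Illustrative stand-in for Table 1: 22 samples x 10 variables,
# with the 10th sample (index 9) deliberately pushed out of the cloud
X = rng.normal(size=(22, 10))
X[9] += 6.0

# Project onto the first two principal components (as in the 2D plot of Fig. 4(b))
Z = X - X.mean(0)
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
T = Z @ Vt[:2].T                               # n x 2 score matrix

# Squared Mahalanobis distance of each score pair; the 95% chi-square cutoff
# (df = 2) approximates the judgment-ellipse boundary
S_inv = np.linalg.inv(np.cov(T, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', T, S_inv, T)
cutoff = chi2.ppf(0.95, df=2)
print(np.where(d2 > cutoff)[0])                # the shifted sample (index 9) is flagged
```

Samples whose score distance exceeds the cutoff fall outside the 95% ellipse and are candidates for removal, mirroring the treatment of the 10th sample above.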

C. PREDICTION PERFORMANCE CRITERIA
Several metrics have been utilized to measure the estimation performance of the different methods. We adopted the root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and correlation coefficient (R); their mathematical definitions are given in Table 2.
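The four criteria of Table 2 can be written out directly; the sample values below are invented purely to show the calculations.

```python
import numpy as np

def rmse(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

def mae(actual, pred):
    return float(np.mean(np.abs(actual - pred)))

def mape(actual, pred):
    return float(np.mean(np.abs((actual - pred) / actual)))

def r_value(actual, pred):
    return float(np.corrcoef(actual, pred)[0, 1])

# Illustrative actual vs. predicted costs
actual    = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 310.0])

print(rmse(actual, predicted))   # 10.0
print(mae(actual, predicted))    # 10.0
print(mape(actual, predicted))   # ≈ 0.0611
```

Smaller RMSE, MAE and MAPE and an R value closer to 1 all indicate better agreement, which is how the model comparisons in Tables 6 and 8 are read.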

D. VARIABLE CORRELATION ANALYSIS
The correlation analysis method is utilized to determine the relationships between the cost influencing variables, and thus whether the data are affected by multi-collinearity. The correlation coefficient (R) between two variables ranges from −1 to +1 and quantifies the linear correlation between the two parameters. The stronger the negative correlation between the two parameters, the closer the R value is to −1; conversely, the stronger the positive correlation, the closer the R value is to +1. Table 3 lists the correlation coefficients between the variables.
The higher the absolute R value, the greater the correlation between the two parameters. When strongly correlated cost influencing parameters are selected as independent variables to build the model, the model suffers from multi-collinearity and its prediction accuracy is affected. In general, R > 0.6 and R > 0.8 represent a certain and a strong correlation between variables, respectively. As listed in Table 3, many of the pairwise correlation coefficients are greater than 0.8; therefore, the great majority of the variables are strongly correlated. The same conclusion can be drawn from the scatter plots between the variables. Fig. 5 shows scatter plots for some strongly correlated variables.
The above analysis shows that the cost influencing variables have strong cross-correlations, so the data exhibit multi-collinearity.
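A correlation screen like the one behind Table 3 can be reproduced with `np.corrcoef`. The three stand-in drivers below (empty weight driving both wingspan and engine power) are invented to mimic the cross-correlation pattern; the actual values come from Table 1.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative cost drivers: empty weight w_e drives both span and power
w_e   = rng.uniform(500, 2000, size=21)
span  = 0.01 * w_e + rng.normal(0, 0.5,  size=21)
power = 0.10 * w_e + rng.normal(0, 10.0, size=21)

# Pairwise correlation matrix, as in Table 3
R = np.corrcoef(np.vstack([w_e, span, power]))

# Flag pairs with |R| > 0.8 (the "strong correlation" threshold), off-diagonal only
strong = (np.abs(R) > 0.8) & ~np.eye(3, dtype=bool)
print(R.round(2))
print(strong.any())    # True -> multi-collinearity present
```

Any off-diagonal entry above the 0.8 threshold is the signal that feeding these drivers directly into MLR will produce the collinearity problems described above.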

E. PCA MODELING ANALYSIS
Through the correlation analysis, we have established that the data exhibit multi-collinearity. The PCA method reduces the number of variables by dimensionality reduction while preserving as much of the data information as possible, thereby solving the problem of multi-collinearity among the variables. The eigenvalue, proportion of total variance and cumulative explained variance of each principal component are given in Table 4.
In Table 4, each of the ten PCs is a linear combination of the ten original cost influencing variables. In general, an eigenvalue ≥ 1 indicates that a PC contains a substantial share of the original variable information and has a strong interpretation of the original cost influencing variables, and it should therefore be retained. The scree plot of the eigenvalues of the different PCs is shown in Fig. 6.
As shown in Fig. 6, the eigenvalues of PC1 and PC2 are greater than 1, so they are retained as the new cost influencing variables. Table 5 shows the coefficient values (γij) of the eigenvectors of PC1 and PC2 calculated using Eqs. (7)-(11). Thus, the mathematical equations of PC1 and PC2 can be expressed as follows.

F. COST MODELING
After data pre-processing, the 10th sample was removed. Of the remaining 21 samples, 18 were randomly selected to train the models, while the remaining 3 samples were used to test them. In the training stage, the MLR, SR, ANN and PCA+ANN methods were utilized to establish development cost prediction models for general aviation aircraft.

1) MLR MODELING
In the MLR training stage, the predicted cost and the 10 raw cost influencing parameters are selected as the dependent and independent variables, respectively, yielding the mathematical model of Eq. (21). Based on the MLR method, the comparison between predicted and actual costs is shown in Fig. 7. It can be seen from Fig. 7 that the predicted cost of some sample points (samples 5, 6, 12, 13 and 17) differs greatly from the actual cost, which indicates that the multi-collinearity of the data has a negative impact on the prediction results of the MLR method.

2) SR MODELING
After screening the cost influencing parameters through the SR method, only $W_e$ remains to build the model. The mathematical equation can be expressed as:

$$c_{pre} = -34.001 + 0.32\,W_e \tag{22}$$

Based on the SR method, the comparison between predicted and actual costs is shown in Fig. 8. As shown in Fig. 8, the predicted results only stay within a general range of accuracy. This is because the model uses only the filtered variable $W_e$, which avoids the impact of data multi-collinearity on the model but inevitably loses variable information. Therefore, the MLR and SR methods each have their own shortcomings, which affect the fitting performance of the models.

3) ANN MODELING
In the ANN training stage, the ten cost influencing parameters and the predicted cost are included in the input and output layers, respectively. According to the hidden layer node selection principle proposed by Sodikov [25] (the minimum number of hidden nodes should be half of the total number of input and output nodes), nine is selected as the number of hidden layer nodes to obtain the best effect. The purelin and tansig transfer functions are employed for the output and hidden layers, respectively; their mathematical expressions are shown in Eqs. (23) and (24). Furthermore, the learning rate, maximum number of training epochs and target error (MSE) are 0.01, 1000 and 0.001, respectively.
Tansig function:

$$f(x) = \frac{2}{1 + e^{-2x}} - 1 \tag{23}$$

Purelin function:

$$f(x) = x \tag{24}$$

Based on the ANN method, the comparison between predicted and actual costs is shown in Fig. 9.
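The two transfer functions can be implemented directly; the tansig form used here is MATLAB's standard definition, which is numerically equal to the hyperbolic tangent.

```python
import numpy as np

def tansig(x):
    # MATLAB's tansig: 2 / (1 + exp(-2x)) - 1, numerically equal to tanh(x)
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def purelin(x):
    # MATLAB's purelin: the identity (linear) transfer function
    return x

x = np.linspace(-3, 3, 7)
print(np.allclose(tansig(x), np.tanh(x)))   # True
print(purelin(2.5))                          # 2.5
```

The squashing tansig gives the hidden layer its nonlinearity, while the linear purelin output leaves the predicted cost unbounded, which is why this pairing is typical for regression networks.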
Comparing the results in Figs. 8 and 9, we find that the estimation performance of the ANN method is significantly better than that of the SR method. This is because the ANN model not only handles the multi-collinearity of the data better by learning the relationships between the variables, but also does not lose data information. However, this is not always the case: when the total number of samples is small, the ANN model yields poor prediction results due to insufficient training. The authors in [5] also noted that the ANN model is not suitable for prediction with small sample sizes.

4) PCA+ANN MODELING
In the PCA+ANN training stage, PC1 and PC2 are included in the input layer, while the number of hidden layer nodes is set to 5. The remaining settings (output layer, transfer functions, learning rate, maximum number of training epochs and target error) are exactly the same as for the ANN model. Based on the PCA+ANN model, the comparison between predicted and actual costs is shown in Fig. 10. A comparison of Figs. 9 and 10 indicates that the fitting performance of the PCA+ANN method is better than that of the ANN method. In addition, the fitting accuracy of the ANN and PCA+ANN models can be compared via regression plots of the training data. Fig. 11 shows the regression plots of the training data for these two models obtained using MATLAB. A higher R value indicates a better fit; as can be seen from Fig. 11, the R value of the PCA+ANN model is 0.999, higher than that of the ANN model, indicating a better fit.

When the data exhibit multi-collinearity, the ANN method can handle the problem through deep learning analysis and then establish a corresponding nonlinear model to predict the cost of the test samples. However, the presence of multi-collinearity inevitably forces the model to spend more time learning the training samples. The variables obtained by PCA are independent of each other, which avoids the influence of multi-collinearity on the model. We therefore hypothesized that combining the PCA and ANN methods would reduce the training time of the model, thereby optimizing the ANN model. The training curves of the ANN and PCA+ANN models are shown in Fig. 12. The training stage is completed when the model reaches the MSE value of 0.001. It can be seen from Fig. 12 that the ANN and PCA+ANN models require 680 and 386 epochs, respectively, to complete training, which means the PCA+ANN model spends less time training on the data.
This is consistent with our hypothesis: on the training data, combining with the PCA method not only produces a better fit but also shortens the training time, thereby optimizing the ANN model. Table 6 compares the prediction performance of the four models.
As can be seen from Table 6, the PCA+ANN model has the minimum values of MAPE, MAE and RMSE and the largest R value, indicating that its fit to the training data is the best. In contrast, the MLR and SR models show relatively poor prediction accuracy on the training samples, because they are affected by the multi-collinearity of the data and by the severe loss of variable information, respectively. The ANN model, by comparison, obtains better prediction accuracy and is not affected by the multi-collinearity of the data or the number of variables.
In the testing stage, the remaining 3 testing samples are predicted by the MLR, SR, ANN and PCA+ANN models, respectively. The comparison between predicted and actual costs on the testing data is shown in Table 7. Table 8 compares the prediction performance of the four models.
As shown in Table 8, the MAPE, MAE and RMSE values of the PCA+ANN model are 0.015, 3 and 3.416, respectively, all lower than those of the other three models. This indicates that the PCA+ANN method has the best prediction performance on the testing data among the four models. Meanwhile, the R value of the PCA+ANN method is 0.9994, much closer to 1 than those of the other three models, which also implies the best prediction performance. Based on Tables 6 and 8, we make a comprehensive comparison of the prediction performance of each model. Fig. 13 shows the estimation performance of the four models to compare their predicted results more directly.
It can be seen from Fig. 13 that a smaller MAPE value is accompanied by smaller MAE and RMSE values and a larger R value, representing a better prediction performance. Among the four models, the PCA+ANN method performs best on both the training and testing samples. This is because combining with the PCA method not only greatly shortens the time the ANN model needs to learn the training data, but also reduces the complexity of the variables, so that the variable information can be expressed more accurately under a limited sample size, thus improving the prediction accuracy. In addition, the fitting and prediction performance of the single ANN model is superior to the MLR and SR models, which reflects the superiority of the ANN model in handling multi-collinear data.
The above analysis shows that when the data exhibit multi-collinearity, the ANN handles the problem better than the linear models and obtains more accurate predictions. By combining the ANN model with the PCA method, the R values of the training and testing samples increased from 0.9935 and 0.9946 to 0.9999 and 0.9994, respectively, reflecting the optimization of the prediction performance. This proves that the PCA+ANN method is practical and can be utilized to accurately predict general aviation aircraft development cost.

VI. CONCLUSION
In this research, a PCA+ANN based method is developed for estimating general aviation aircraft development cost. Through comparative analysis with the prediction results of the single MLR, SR and ANN models, the following conclusions are drawn.
(1) The prediction performance of the MLR method is affected by data multi-collinearity, and its prediction accuracy is poor. The SR method can solve the multi-collinearity problem of the cost data, but simply deleting variables before modeling inevitably causes a large loss of variable information, which also affects prediction accuracy. Compared with these two methods, the ANN method is not affected by the multi-collinearity of the data or the number of variables, and obtains a more satisfactory prediction result through deep learning of the data information.
(2) Combined with the PCA method, the complexity of the variable information is reduced and the ANN model is optimized. The training of the PCA+ANN model is shortened from the original 680 epochs to 386 epochs, which greatly improves the model fitting speed. At the same time, compared with the ANN model, the combined model delivers more accurate predictions on both the training and testing samples, as reflected by smaller MAPE, MAE and RMSE values and a higher R value.
As part of future work, more cost data will be collected to explore whether the PCA+ANN method generalizes and whether it is limited by data information or sample size. In addition, an SR+ANN model will be established to explore the applicable scope of the SR+ANN and PCA+ANN methods. We are committed to building a more accurate and convenient general aviation aircraft cost estimation software package to help companies improve their competitiveness while contributing to the promotion of a green society.