A Novel Fault Prediction Method of Wind Turbine Gearbox Based on Pair-Copula Construction and BP Neural Network

,


I. INTRODUCTION
According to the structure of the transmission system, wind turbines (WTs) are mainly divided into two categories: doubly-fed WTs with gearbox, and direct-drive WTs without gearbox [1]. Because of the WTs operate in harsh environments for a long time [2] and are affected by factors such as random wind loads, acceleration and deceleration shocks [3]- [6], the gearbox has become one of the key components with a high failure rate [7]. The maintenance cost of the gearbox is also relatively high, the repair time and cost are at the forefront of all major equipment. An effective way to reduce the cost of breakdown maintenance is to use condition monitoring technology for early detection of faults. Therefore, carrying out research on the condition monitoring and fault early warning of WT gearbox, and rationally adjusting the operation and arranging maintenance according to the healthy The associate editor coordinating the review of this manuscript and approving it for publication was Gongbo Zhou . degeneration trend of the gearbox are great significance for improving the reliability and reducing the maintenance costs.
The gearbox condition monitoring methods mainly include oil analysis, vibration analysis and SCADA data analysis. The essence of oil analysis is to analyze the physical and chemical properties of lubricating oil. The health state of the gearbox is monitored by indicators such as metal abrasive particles, which can directly reflect its mechanical deterioration and has a high accuracy [8]- [10]. However, the cost of metal abrasive particle sensors is high and the real-time performance of oil analysis method is poor. Thus, It is necessary to further investigate low-cost solutions [11]- [13].
Vibration analysis is to collect vibration signals from the sensitive parts of the gearbox's fault features. The characteristic factors are extracted and analyzed by timedomain and frequency-domain signal processing methods. Vibration analysis has high sensitivity and good real-time performance, and it can perform full life cycle condition monitoring and fault diagnosis on the gearbox. Li Z.X. et al.
proposed a periodic potential underdamping stochastic resonance (PPUSR) in the paper and used PPUSR to extract the characteristics of the gearbox vibration signals. The results showed that the proposed method can detect gear wear faults and tooth broken faults [14]. In the research of Li Y.B. et al., the vibration data was filtered by using the Vold-Kalman filter (VKF), and the fault features were extracted by using the refined composite multi-scale fuzzy entropy (RCMFE). Rolling bearing failure was adopted as an example to prove the effectiveness of the VKF-RCMEF method. Experimental results showed that the performance of the proposed method was better than that of RA-RCMFE and VKF-MFE, and it could accurately diagnose inner race fault, outer race fault and ball fault [15]. However, the early fault signal is weaker than when the fault is heavy, so the early fault signal is difficult to detect. Therefore, the above feature extraction method cannot achieve early fault prediction. In response to this problem, Lu L. et al. proposed to use deep belief network (DBN) to extract early fault features in vibration signals. Least squares support vector machine (LSSVM) was used to establish a fault prediction model, and early faults of gearbox were detected [16]. But most of the diagnostic methods based on vibration data are after-the-fact diagnosis. So, it is difficult to provide early warning for more than one day if the gearbox has a potential failure. In addition, additional dedicated sensors must be installed to acquire vibration signals, which will cause significant upfront investment and subsequent maintenance costs.
Among wind farms,the supervisory control and data acquisition (SCADA) system is widely adopted to monitor the operating status of WTs and components. The rich data foundation is provided by SCADA system for the research of fault prediction technology. So, mining fault information contained in SCADA data is gradually becoming a current research hotspot. SCADA data-based condition monitoring methods are mainly divided into three categories: The first category is based on temperature monitoring. The second category is based on power curves. The third category is based on machine learning methods.
In the studies of condition monitoring based on SCADA data, many studies take temperature as the monitoring target [17]. In [18], Gearbox oil temperature, nacelle temperature and rotor rotations were used as inputs to predict output power. In [19], rear bearing temperature, active power output, nacelle temperature, turbine speed were used as inputs. Rear bearing temperature was used as output. A normal behavior model was established based on artificial neural network (ANN). In [20], the bearing temperature was used as the predicted target variable, and Time Delay Neural Network (TDNN) was used to establish the generic models. In reference [21], bearing temperature was also selected as the prediction target variable, abnormal level index (ALI) was defined to quantify the abnormal level of prediction error of each selected model, and a fuzzy synthetic evaluation method was used to integrate the identification results. In [22], A kind of health index from the perspective of temperature-related parameters was developed by separating the statistics concerning the conformity of the predicted values of key temperature parameters within a certain time window from the measured values.
In recent years, there have been many studies based on SCADA data to achieve condition monitoring by comparing power curves. In [23], a method of evaluating the WT performance based on a kernel method was proposed. First, the kernel method was used to estimate the distribution of power data, and then the similarity index was used to evaluate the abnormality of power. In [24], Pandit used the Gaussian Process (GP) to model the power curve, and monitored the abnormal state of the WT by comparing the difference between the normal power curve and the abnormal power curve. In [25], Pandit used GP to model the power curve and realized the detection of absolute yaw error. The experimental results showed that the performance of the proposed method is better than the online power curve model and probabilistic assessment using binning. In [26], Sun Q.L. found that using the rotor speed-power and rotor speed-pitch angle curves can accurately monitor the abnormality of WT. The fault of the pitch system was accurately identified by the distance from the actual operating point to the theoretical curve. Although the functions of fault prediction and diagnosis can be realized through the above methods, fewer variables are used in the methods. A single variable is easily interfered by the external environment, which result in the consequence that a lot of noise is contained. Consequently, the weak fault information will be overwhelmed. The timeliness and accuracy of fault prediction are affected by above problems.
In order to make full use of SCADA data, mine fault information from multiple variables and achieve more accurate fault prediction, a variety of machine learning algorithms have been applied to the condition monitoring of WT. In [27], a WT fault detection method based on expanded linguistic terms and rules using non-singleton fuzzy logic was proposed. In [28], a multivariable power curve model was constructed with a modified Cholesky decomposition Gaussian process (GP) and validated with Supervisory Control and Data Acquisition (SCADA) data. In [29], through irregular space-division and nonlinear space-mapping, stepwise data cleaning procedure was proposed. On this basis, the optimized least squares support vector machine (LSSVM) was selected to model the WT power curve (WTPC). In [30], a new condition monitoring approach was introduced for extracting fault signatures in WT blades by utilizing the data from a real-time SCADA system, and a hybrid fault detection system based on a combination of Generalized Regression Neural Network Ensemble for Single Imputation (GRNN-ESI) algorithm was proposed. In [31], a first attempt to use Dempster-Shafer (D-S) evidence theory for the fault diagnosis of WT on SCADA data was presented. Due to the traditional parameter methods such as artificial neural networks and mixed Gaussian models, there are shortcomings such as poor generalization ability and limited ability to mine complex correlations among variables, which resulting in low modeling accuracy. Deep learning can make up for the above deficiencies and effectively use massive SCADA data. In [32], Wan proposed a deep feature learning (DFL) approach for wind speed forecasting because of its advantages at both multi-layer feature extraction and unsupervised learning. In [33], a framework based on deep neural network (DNN) was developed to monitor the conditions of WT gearbox and identified the impending failures. In [34], Jiang proposed a novel fault detection method based on denoising autoencoder (DAE). In [35], Wang established a normal behavior model of WTs based on deep belief networks. The optimized modeling method can capture the sophisticated nonlinear correlations among different monitoring variables, which is helpful to enhance the prediction performance. However, the principles of deep learning models at this stage are poorly interpretable and good parameter tuning skills are required to ensure the performance of algorithm, and further research on practical engineering applications is necessary. The disadvantage of deep learning is that the data samples of full fault types and full operating conditions are necessary to train a highprecision prediction model. There are two reasons why longterm data is difficult to save. On the one hand, the storage capacity of the SCADA system is limited. On the other hand, a large amount of data needs to be collected during operation. Therefore, the data samples of full operating conditions and full failure types are scarce. The actual data are mostly smallscale data sets close to the failure time, which makes the research of condition monitoring methods based on smallscale data sets have great significance and application value.
The innovations and main contributions of the proposed method are listed below: In this paper, the conditional mutual information and Pair-Copula model are introduced to tackle with WT fault prediction. With the powerful variable filtering capability of conditional mutual information, redundant variables can be removed while retaining useful variables to the greatest extent. This method is used to solve the problem of filtering out the model input variables from multiple variables, thereby modeling accuracy and failure prediction accuracy are improved. The Pair-Copula model can handle multiple variables and mine the correlation between multiple variables, the limitation that conventional Copula model can only handle two dimensional variables is overcome. The Pair-Copula model combines three input variables into one variable, which greatly reduces the complexity of modeling. A complete fault prediction model is established based on the combination of Pair-Copula model and BP neural network. In order to solve the problem that the conventional Pair-Copula model cannot process real-time data which must be required in fault prediction, a kind of improved Pair-Copula model combined with the kernel density estimation is used to calculate the real-time data. Finally, the method in [42] has been modified and applied to the determination of fault alarm threshold. This method can continuously update the threshold with change in operating conditions and has better accuracy and flexibility.
The rest of this paper is organized as follows. In Section 2, the proposed method is introduced, and the basic knowledge of each part in this method is explained in detail. Section 3 describes the modeling process and validates the effectiveness of the proposed method with actual SCADA data, the experimental results are discussed in section 4. At last, the conclusions are described in Section 5.

II. METHODS INTRODUCTION
According to Sklar's theorem, the marginal distributions of the random variable X = [X 1 , X 2 , · · · , X n ] is denoted as F 1 , F 2 , · · · , F n and the corresponding joint distribution function is denoted as F, then there is a Copula probability distribution function C such that the equation (1) holds for any X ∈ R n .
In equation (1), [X 1 , X 2 , · · · , X n ] is the input vector and [x 1 , x 2 , · · · , x n ] is a data point in the vector. n is the number of variables in the vector.

A. PRINCIPLE OF COPULA
Copula functions include the Normal-Copula, t-Copula, and Archimedean-Copulas function cluster. Only the Archimedean-Copulas function cluster can describe asymmetric correlation. The Archimedean-Copulas function cluster includes three functions: Gumbel-Copula, Clayton-Copula, and Frank-Copula. The principle and characteristic of each function are as follows [36]: Gumbel-Copula: In equation (2), U = (u 1 , u 2 , · · · , u n ) is the input vector and n is the number of input variables. θ = [1, +∞], when θ = 1, the random variables are independent of each other. When θ → +∞, the random variables are completely related. The distribution of Gumbel-Copula function is relatively sensitive to the upper tail correlation among variables. Frank-Copula: In equation (3), θ = (−∞, +∞) \ {0}, When θ is a positive number, the random variables are positively correlated. When θ is a negative number, the random variables are negatively correlated. The distribution of Frank-Copula function is relatively average and can well describe the overall correlation among variables.
Clayton-Copula: In equation (4), θ = (0, +∞), If θ → 0, random variables tend to be independent. If θ → +∞, random variables tend to be completely related. Clayton-Copula function has the opposite characteristics of Gumbel-Copula function and is sensitive to the lower tail correlation among variables.

B. PAIR-COPULA CONSTRUCTION
Aiming at the problem that the correlation among WT parameters is complex and difficult to accurately modeled, Pair-Copula is used to model the correlation between multiple parameters.
The Pair-Copula method is first proposed by Aas et al. [37]. The n dimensions Pair-Copula model has n − 1 layers, and the number of layers is denoted as T p (p = 1, 2, · · · , n − 1). Each layer has a root node which is connected to the other nodes, and the multivariate probability distribution is constructed by merging these nodes hierarchically. The structure is shown in Fig. 1. Each node in the figure is a binary Copula function. u i is the distribution function value corresponding to the input variable In equation (5) and (6), j = 2, 3, · · · , n − 1 and t = 1, 2, · · · , n − j. The structure of Pair-Copula contains multiple nodes corresponding to multiple types of binary Copula functions, which makes the flexibility and fitting accuracy of Pair-Copula model higher than conventional Copula functions [37].

C. FITTING ACCURACY EVALUATION
It is necessary to calculate the fitting accuracy in order to quantitatively evaluate the fitting effect of the probability model. In this paper, Euclidean distance (ED) is used as the evaluation index.
the calculation process of ED is shown in equation (7). One of data samples in the n-dimensional input vector X is denoted as (x 1i , x 2i , · · · , x ni ) (i = 1, 2, · · · , m). m is the number of data samples contained in X .
C em is the empirical distribution function, and C is the Copula function obtained by training. The ED is smaller, the distribution of the input vector is better fitted by the Copula function.

D. OVERALL FRAMEWORK
The overall framework of this paper is shown in Figure 2. During the training model phase, the Pair-Copula model is used to fit the complex correlation among the three input variables and construct a multivariate joint distribution based on historical SCADA data of normal state. The output value of the Pair-Copula model is called ''state parameter''. The gearbox bearing temperature at t − 1 moment T B (t − 1) and state parameter are taken as the inputs of BP neural network and the gearbox bearing temperature T B (t) is used as the prediction target. The BP neural network model is trained with historical SCADA data in normal state.
During the real-time monitoring phase, the Pair-Copula model and the BP model trained in the training model phase are used to predict the gearbox bearing temperature. The realtime SCADA data is used as input. The residual between predicted value and actual value is calculated. Whether the WT has potential faults can be judged by the residual exceeding threshold. Specific steps are as follows: (1) Variables selection: wind speed (WS), main shaft rotation speed (RS), active power (AP) and gearbox bearing temperature at t − 1 moment T B (t − 1) are selected as input variables, and gearbox bearing temperature T B (t) is selected as the predicted target.
(2) The Copula function's types of each node in the Pair-Copula structure are determined, and the optimal model parameters are calculated. The Pair-Copula model of the WT in normal state is trained, and the state parameter is calculated.
(3) BP neural network parameters are determined. State parameter and T B (t − 1) are taken as the inputs of BP neural network and the gearbox bearing temperature T B (t) is taken as the prediction target variable, the BP neural network is trained to produce the prediction model of the WT in normal state.
(4) The real-time state parameter can be calculated by inputting the real-time SCADA data into the Pair-Copula model trained in step (2). The predicted value of gearbox bearing temperature can be calculated by inputting the realtime T B (t − 1) and the real-time state parameter into the BP model trained in step (3).
(5) Calculating the residual of predicted value and actual value.   (6) The threshold is calculated from the residual in normal state, and is used to determine whether the gearbox has a potential fault.

III. CASE STUDY
Real SCADA data of a doubly-fed WT in Fujian, China is used to verify the actual effect of the proposed method. The rated power of the WT is 1.5 MW, the cut-in wind speed is 3 m/s, and the cut-out wind speed is 25 m/s. The SCADA system records operational data per 10 minutes. At 10:20 on July 13, the WT was shut down due to a gearbox bearing failure. The SCADA data was divided into two parts. The data of the first part was collected before the shutdown caused by the fault, there are 14,000 data in total, this part is called the fault data set (FDS). The data of the other part was collected after the WT was repaired, a total of 14,000 data, this part is called the health data set (HDS). When the two data sets are used for this research, the samples close to the cut-in wind speed and the samples exceeding the rated wind speed are removed. The number of remaining data samples is 9000. According to the order, 3500 samples are selected as training data samples, 1000 samples as testing data, and 4500 samples as verifying data. The two data sets are allocated in the same way, as shown in Table 1.

A. VARIABLES SELECTION
There are 19 parameters in the SCADA system, as shown in Table 2. The gearbox bearing failure is the actual example in this paper, furthermore, gearbox bearing temperature is the most relevant variable that can directly reflect the health state of the gearbox bearing. Hence, gearbox bearing temperature is selected as the dominant variable in the variable screening process. This is also why gearbox bearing temperature is selected as the predicted target variable in Fig.2. Then the conditional mutual information (CMI) is used to screen out the effective auxiliary variables from the other 18 parameters.
Information entropy is an index to measure the uncertainty of random variables. The information entropy H (M ) of random variable M is calculated by follow formula [38]: In the equation, f (m) is the probability distribution function of M . The information entropy of two-dimensional random variable (M , N ) is called joint entropy H (M , N ), and its calculation formula is as follows: f (m, n) is the joint distribution of (M , N ). If M is known, the information entropy of N is called conditional entropy H (N |M ), and the calculation formula is equation (10): In the equation, f (n |m ) is a conditional probability distribution of N . The mutual information I (M , N ) reflects the amount of information in random variable M containing random variable N . The calculation formula is equation (11): The larger the mutual information, the more N information is contained in M , and the stronger the correlation between M and N .
If the auxiliary variable set is a multidimensional random variable M = {M 1 , M 2 , · · · , M s }, the dominant variable is a one-dimensional random variable N . Then the conditional mutual information between the auxiliary variable M s and dominant variable N is I (N ; M s |M 1 , M 2 , · · · , M s−1 ). The calculation formula is as follows: In this paper, the dominant variable N is determined as gearbox bearing temperature. The auxiliary variables M 1 , · · · , M s correspond to the 18 variables in Table 2 except the dominant variable. According to formula (12), the conditional mutual information between 18 parameters and gearbox bearing temperature can be calculated, s = 1, 2, · · · , 18. The variables with conditional mutual information I > 0.4 are selected as the effective auxiliary variables. A total of 3 variables are screened, namely main shaft rotation speed, wind speed and active power. In addition, due to the specific heat capacity of the solid, the change of temperature will be affected by the temperature at the previous moment. Therefore, when gearbox bearing temperature T B (t) is predicted, the influence of the temperature at t − 1 moment T B (t − 1) should be considered.
Summarizing the above, a total of 4 variables (wind speed, active power, main shaft rotation speed, gearbox bearing temperature) are chosen to establish the normal behavior model for WT's gearbox bearing. Wind speed, active power, main shaft rotation speed, gearbox bearing temperature at t − 1 moment T B (t − 1) are used as input variables, and gearbox bearing temperature T B (t) is used as the predicted target variable, as shown in Fig.2.

B. ESTABLISHMENT OF PAIR-COPULA MODEL
The modeling process of Pair-Copula model is described in this section. The premise of Pair-Copula model is that the distribution and probability of input variables are known. However, the distribution of the entire samples cannot be reflected by a small number of data samples. If a large number of data samples are accumulated, the timeliness of fault prediction will be seriously reduced. To solve the above problems, the kernel density estimation (KDE) is proposed to estimate the probability of the real-time input data. Suppose x 1 , x 2 , · · · , x N is a set of normalized wind speed data with N samples. f (x) is the original probability density function of wind speed, andf (x) is KDE of f (x), expressed as:f where, h is the window width; K (u) is the kernel function; x i is the i-th wind speed data sample. The Gaussian kernel function is chosen in this paper, that is.
The estimatef (x) is expressed by equation (15): The estimation error is calculated by the root mean square error. The window width h plays a local smoothing role tô f (x). If h is too small, the estimation bias can be improved but the randomness will be increased, which resulting in an irregular shape off (x). If h is too large,f (x) will be too smooth and the detailed features of f (x) cannot be displayed sufficiently [39]. In this paper, the iterative method is used to determine the window width of KDE.
As shown in Fig.3, the probability densities of the three input variables (wind speed, active power, and main shaft rotation speed) are estimated by KDE, and then the three probability densities are input to the Pair-Copula model to calculate the intermediate variable ''state parameter'' which is a joint distribution of the three input variables and has no practical physical meaning.
If a key component of WT has potential fault, the relationship between the three input variables changes from the normal state, which is reflected in the distance that the threedimensional joint distribution deviates from the normal state. If the distance is large, it means that the health state of WT  is worse. If the distance is small, it means that the health state is good. Therefore, although state parameter has no practical physical meaning, it can represent the health state of WT. Fig.4 is the distribution of state parameter corresponding to HDS and FDS. It can be seen from the figure that when there is potential fault in the WT, the distribution of state parameter will change significantly. It indicates whether the WT is faulty can also be judged by comparing the distribution. However, this method needs to accumulate many data samples for each comparison. The accumulation of data samples takes a long time. The fault may have evolved from a minor fault to a serious fault within this period. Therefore, the distributed comparison method seriously reduces the timeliness of fault prediction and cannot meet the requirements of real-time fault prediction. Aiming at this problem, a combined model of Pair-Copula and BP neural network is proposed, which uses the residual change of predicted value and actual value to realize the real-time fault prediction function. The Pair-Copula model is used to fit the relationship among the three input variables and the three variables are merged into one variable, which is equivalent to that the number of input variables is reduced. The relationship between the inputs and output of the prediction model is simplified. Based on the above analysis, a learning algorithm with strong learning ability and simple structure can meet the modeling need. BP neural network is the basis of all neural networks and has good learning performance. Therefore, BP neural network is selected as the complementary model.
In the rest of this section, the modeling process of Pair-Copula model is depicted in detail. First, wind speed, main shaft rotation speed, and active power are used as inputs to the Pair-Copula model, and the correlation coefficients between two variables are calculated, as shown in Table 3. Main shaft rotation speed has a strong correlation with wind speed and active power, so main shaft rotation speed is selected as the input variable corresponding to the root node in Fig. 3. The three input variables are combined into one variable (state parameter) by the Pair-Copula model.
To determine the function types of the node Copula functions c 1,2 , c 1,3 , c 2,3|1 and the corresponding optimal parameters is the key process for establishing the Pair-Copula model.

1) CORRELATION BETWEEN MAIN SHAFT ROTATION SPEED AND ACTIVE POWER
The node c 1,2 is the connection point between u 1 and u 2 , u 1 corresponds to main shaft rotation speed, and u 2 corresponds to active power. The frequency histogram of main shaft rotation speed and active power is shown in Fig. 5. It can be seen that the correlation is asymmetric. According to the characteristics of the three types of Copula functions in section 2.1, only Gumbel-Copula, Frank-Copula and Clayton-Copula can be used to fit the asymmetric correlation. The correlation between main shaft rotation speed and active power is fitted by the three functions, respectively. Table 4 shows the optimization results of the parameters which are optimized by maximum likelihood estimation.  Before comparing various Copula functions, the definition of empirical Copula function must be introduced. Experience Copula's definition: Let (x i , y i ) (i = 1, 2, · · · , n) be a data sample taken from the two-dimensional vector (X , Y ), and the empirical distribution functions of X and Y are F (x) and G (y), respectively. The empirical Copula of u and v is calculated by the following formula [40].
where n is the number of data samples in the input vector; C e ( * ) is empirical Copula function; C ( * ) is the evaluated Copula function. The value of d is smaller, the distance between fitted distribution and empirical Copula is smaller, the fitting effect is better.
Analysis of Fig. 6 shows that the difference among each figure is small, indicating that the fitting results of the above three functions are very close to the experience Copula. Therefore, it is necessary to further evaluate the fitting effect VOLUME 8, 2020  where d G is the smallest, it can be determined that Gumbel-Copula is the most suitable function to fit the correlation between main shaft rotation speed and active power. It can be seen from Fig. 5 that the distribution of main shaft rotation speed and active power is denser at the upper tail, and the Gumbel -Copula function is characterized by a better description of the upper tail characteristic. This conclusion is consistent with the analysis result by the SED values.

2) CORRELATION BETWEEN MAIN SHAFT ROTATION SPEED AND WIND SPEED
The Copula function selection process for the node c 1,3 is similar to part (1). It can be seen from the frequency histogram of Fig. 7 that the correlation between main shaft rotation speed and wind speed at the lower tail is strong, and the correlation is also asymmetric. The fit results of the The d C is the smallest, indicating that the Clayton-Copula function has the best fitting effect. According to the characteristic of Clayton-Copula, the lower tail correlation can be described better by the function. The judgment result of the SED is consistent with the characteristic of Fig. 7.
The Copula function of the second layer node c 23|1 is determined in the same way as in part (1) and part (2). The frequency histogram of the first layer nodes c 1,2 and c 1,3 is shown in Fig. 9. The SED values determine that the optimal function of the node c 23|1 is Frank-Copula.

C. MODEL PERFORMANCE EVALUATION
In this paper, the combination of Pair-Copula model and BP neural network is used as a gearbox fault prediction model for WT. In order to prove the effectiveness of the proposed method, four methods are adopted to establish four models, and their respective performances are compared. The modeling methods and model types are as follows: (1) BP model. Wind speed, main shaft rotation speed, active power and T B (t −1) are taken as inputs, and gearbox bearing temperature is used as the target variable to train BP neural network and obtain corresponding prediction model. (2) SVM model. Wind speed, main shaft rotation speed, active power and T B (t − 1) are taken as inputs, gearbox bearing temperature is taken as the predicted target variable. The SVM is trained to obtain the corresponding prediction model.  In order to evaluate the accuracy of each model, three indexes are introduced, namely RMSE, MAE, and R2.The respective definitions are shown in following formulas [41].
where o i is the i-th actual value,ô i is the i-th predicted value, andō is the mean of actual values. If the values of RMSE and MAE are smaller, the error between predicted value and actual value is smaller. If the value of R2 is larger, the overall fitting effect is better. The prediction accuracies of the four models are shown in Table 5. It can be seen from the table, the prediction accuracy of BP is similar to that of SVM, and the accuracy of C-BP is similar to that of C-SVM. Moreover, the prediction accuracies of C-BP and C-SVM are significantly higher than the other two models, which indicates that the addition of Pair-Copula model can improve the prediction accuracy. The main reasons are summarized in two aspects: First, the multi-parameter joint distribution density function model is constructed and the complex correlation between multiple variables is captured by Pair-Copula model. The shortcomings of conventional learning algorithms in capturing complex correlations are compensated, and the prediction accuracy of prediction model is improved. Second, the dimension of the input variables is reduced by Pair-Copula model, which greatly reduces the complexity of the supplementary model. The prediction accuracy is indirectly improved. According to the results in Table 5, the prediction accuracy of C-BP is slightly higher than that of C-SVM, so C-BP is selected as the final prediction model.

A. THRESHOLD CALCULATION
The MD-Weibull method was proposed in [42] and used for the fault detection in notebook computer. In this paper, the method is modified to make it suitable for the need of this study. The modified MD-Weibull method is used to calculate the threshold.
MD is an unitless distance measurement method that can be used to calculate the degree which a single data sample deviates from the mean of sample set. In this paper, the residual data is used as the measurement object, and the MD value of each residual data sample is calculated. The residual data sequence is denoted as E, and the i-th data sample in the sequence is denoted as e i , where i = 1, 2, · · · , n. n is the number of data samples in the sequence. Before the MD value is calculated, the data needs to be normalized by using formula (21).
whereĒ is the mean of E and S i is the standard deviation. The calculation method is shown in the following formula: Since the residual data is a 1-dimensional vector, the calculation formula of MD can be simplified to following formula: The MD values of all elements in E are calculated by formula (23), and the MD data set is obtained. It can be seen from the histogram in Fig.10 that the distribution of the MD data set is basically in line with the Weibull distribution, so the 2-parameters Weibull distribution is used to fit the MD data set. The 2-parameters Weibull distribution can be expressed by formula (24). where β is the shape parameter and determines the graph form. η is the scale parameter, which determines the spread of distribution. The parameters ( β and η ) of Weibull distribution can be estimated by using maximum likelihood estimation.
The Weibull distribution's CDF is used to calculate the fault alarm threshold. The CDF is defined as Fig.11 is the CDF curve. V % in the figure is defined as the normal interval. That is, the residual value within this range is normal, and the residual value outside the range is abnormal. Fig.11 is taken as an example, assuming V % is equal to 70%, the corresponding MD value D V can be obtained by the following formula: In formula (26), v a is the corresponding fault alarm threshold when the normal interval is set to 70%. The normal interval V % can be set to any value within 0 − 100% according to actual situation. The three thresholds in Fig.12 correspond to three normal intervals of 70%, 80%, and 90% respectively. Fig.12(a) is the original residual data which is calculated by the C-BP model with HDS. Fig.12(b) is obtained after the original residual data is processed by sliding window method, where the window width w = 1000, the number of original residual data n o = 4500, The number of the data after processing n p = 3500, n p = n o − w. The definition of sliding window method is as follow: In formula (27), i = w, w + 1, · · · , n, r = i − w, e w is the residual value after sliding window processing. e is the original residual value. w is the window width. n is the number of original residual data. In engineering applications, the system can calculate the normal interval V % based on the peak value of the residual fluctuation when the WT is in a normal state, and then the alarm threshold is automatically calculated. This function can be achieved through simple programming without theoretical difficulty. Since different faults correspond to different monitoring index variables, the corresponding residuals and thresholds are different. In other words, the threshold and the type of failure are related. In this paper, the normal interval and threshold can be determined by analyzing Fig.12 (b). For the C-BP model, when the normal interval is set to 70%, it can ensure that the residual value of HDS is below the threshold. That is, when the WT is in healthy state, the false alarm will not occur. But the residual fluctuation of the BP model is relatively large. So, a higher threshold needs to be determined. In order to make an unified comparison of the two models, the normal interval is set to 95%, and the corresponding threshold is 0.0187, as shown in Fig.14 and Fig.15.

B. RESULT ANALYSIS
The actual value and predicted value of BP model based on HDS are compared in Fig.13(a). In order to clearly show the fluctuation of curves, 1-500 data points are selected for comparison. It can be seen from Fig.13(a) that although the overall trend of the predicted value and the actual value is consistent, the error between each data point is large and the prediction accuracy is also low. The predicted values and actual values of C-BP model are compared in Fig. 13 (b). By comparing Fig.13 (a) and (b), it can be found that the prediction accuracy of C-BP model is significantly higher than that of BP model, and the prediction error is significantly reduced. This result indicates that the accuracy of prediction model can be effectively improved by Pair-Copula model. Fig. 14 is the comparison of BP and C-BP prediction residuals based on HDS. The original residual data is processed by the sliding window method. As shown in Fig. 14, there is a large fluctuation in the residual curve of BP model, which seriously affects the result of fault prediction. In comparison, the variation of C-BP is small, and the overall curve is relatively stable. Since the current WT is in normal state, the residual curve should theoretically be stable overall and there is no upward trend. So, the residual curve of C-BP is more authentic and representative.
The residuals of BP and C-BP based on FDS are compared in Fig.15 and the threshold is same as Fig.14. The residual curve of C-BP shows an increasing trend, this is due to the fact that the deviation between predicted value and actual value of the gearbox bearing temperature increases with time. The upward trend of the curve indicates that the gearbox bearing temperature gradually deviates from the normal state, which makes the degradation tendency of the gearbox bearing is intuitively reflected. The residual curve corresponding to C-BP model exceeds the threshold at the 843th data point and then rises continuously, which can be used as an early alarm point for the WT. In contrast, the BP model's prediction residual curve fluctuates greatly, which cause the fault can not be predicted effectively. Although the curve exceeds the threshold at the 549th point, the hold time is very short and then falls below the threshold. So, this point cannot be used as a valid alarm point. At the position of the 2526th data point, the curve again exceeds the threshold and remains for a longer period of time and then falls below the threshold again. So, the 2,526th data point can barely be used as an alarm point  for the BP model. In summary, although both BP model and C-BP model can implement fault prediction, the performance of C-BP model is significantly better than that of BP model in terms of prediction accuracy and timeliness.
The SCADA data sampling interval is 10 minutes. According to the above paragraphs, the alarm point of the C-BP model is the 843th data point. From this point to the WT's shutdown, there are 2657 sample points, which is converted into 443 hours, about 18.5 days. The alarm point of the BP model is the 2,526th sample point, there are 974 sample points between the point and the shutdown, which is converted into 162 hours, about 6.8 days. The analysis result in this paragraph shows that the prediction effect of C-BP model is significantly better than that of BP model, which once again proves the effectiveness of the proposed method.

V. CONCLUSION
Based on the Pair-Copula model, a novel WT gearbox fault prediction method is proposed in this paper. The superiority and effectiveness of the proposed method are verified with actual SCADA data. The following is a summary of the paper work: (1) Conditional mutual information is introduced for variable screening, and 3 variables are selected from 18 variables. Through comparative analysis, it can be known that all 3 variables are effective auxiliary variables. Generator rotation speed is accurately eliminated as a redundant variable corresponding to main shaft rotation speed. The above results show that conditional mutual information can retain useful variables and accurately eliminate redundant variables.
(2) The good performance of Pair-Copula model in mining correlation among multiple variables and establishing multidimensional joint distribution is shown. Pair-Copula model is combined with SVM and BP neural network to form two combined models. The experimental results show that the prediction accuracies of the two combined models are significantly higher than the original models, and the key functions of Pair-Copula are fully reflected.
(3) The conventional Pair-Copula model cannot process real-time data, but fault prediction must require the function of real-time calculation. In order to solve this problem, kernel density estimation is adopted to modify the Pair-Copula model, so that it has the function of real-time data calculation.
(4) The proposed method is more suitable for small-scale data samples. Experimental results show that the proposed method has higher prediction accuracy and can identify potential faults earlier than conventional learning algorithms. Finding potential faults early and taking preventive measures can ensure the safe operation of WTs and reduce maintenance costs.
Although the gearbox bearing fault is taken as the case in this paper, the proposed method may also be effective in predicting other faults. It is the follow-up work to further explore the application of the proposed method in other faults. In addition, the research results of reference [43]- [48] show that air density has a significant effect on the output power of WTs. So, the SCADA data processing method which takes into account the influence of air density is also one of the future research directions. He is currently a Lecturer with North China Electric Power University. His current research interests include wind turbine condition monitoring, fault prediction, and fault diagnosis.