Research on Fault Diagnosis of Wind Turbine Based on SCADA Data

Effective early warning of wind turbine failures is of great significance for reducing the operation and maintenance costs of wind farms and improving power generation efficiency. At present, most wind farms are equipped with a supervisory control and data acquisition (SCADA) system, and SCADA data contain a large amount of hidden information that can be used for fault early warning. This paper uses the generator temperature and gearbox oil temperature in the SCADA data as the entry point for fault warning. Firstly, the eXtreme gradient boosting (XGBoost) algorithm is used to establish a regression prediction model of the normal temperature of wind turbine components. Then, the residual between the predicted value and the actual value is calculated, and the trend of the residual is monitored with an exponentially weighted moving-average (EWMA) control chart. Finally, by setting an appropriate threshold, the trend of the residual is used to judge the occurrence and development of the fault. This paper uses two fault detection methods, a fixed threshold and a dynamic threshold based on an adaptive algorithm, and compares their advantages and disadvantages. Based on the SCADA data of a wind farm in Inner Mongolia (China), this paper designs fault early warning tests for the wind turbine generator and gearbox. The experimental results show that, for the generator, the fixed threshold method can give the fault alarm 3 hours in advance, while the dynamic threshold method can give the fault alarm 4.25 hours in advance. For the gearbox, the fixed threshold method can give the fault alarm 2 hours in advance, while the dynamic threshold method can give the fault alarm 2.75 hours in advance.


I. INTRODUCTION
In recent years, with the continuing deterioration of the global ecological environment and the gradual depletion of fossil fuels, countries all over the world have increased their research on renewable energy [1], [2]. As a clean and pollution-free renewable energy source, wind energy has the advantages of wide distribution and huge reserves [3]. Wind power generation has therefore gradually become an alternative to traditional power generation [4], [5]. The global wind power industry has entered a period of rapid development, and the cumulative installed capacity of global wind power is increasing year by year [6]. The traditional maintenance strategy of wind power plants relies heavily on regular maintenance and after-failure maintenance, and the deployment of spare parts has a long cycle, which leads to a high cost of failure maintenance and has a large impact on the operation and maintenance economy of wind power plants [7]. Therefore, giving an early warning before a wind turbine failure occurs is of great significance for reducing the operation and maintenance cost of wind farms and for the long-term development of the wind energy industry [8], [9].
The associate editor coordinating the review of this manuscript and approving it for publication was Fanbiao Li.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
At present, many achievements have been made in the research on fault prediction and diagnosis of wind turbines [10], [11]. Research on wind turbine fault diagnosis based on the condition monitoring system (CMS) is relatively mature [12]. [13] analyzes vibration signals in the time domain and detects component faults. However, the CMS relies on high-performance sensors, acquisition cards and other hardware, so corresponding interfaces must be added to the wind turbine, which increases the cost and complicates actual operation. Fault diagnosis based on wavelet theory and neural networks is mature and widely used for wind turbine gearbox and blade fault diagnosis [14], [15]. However, this method needs additional vibration sensors and is only suitable for fault diagnosis of a limited set of parts. Fault diagnosis based on an analytical model requires a full understanding of the operating mechanism of the system, and can achieve an ideal diagnosis effect when an appropriate mathematical model is constructed [16]. A wind turbine, however, is a typical nonlinear dynamic system with a complex structure. It is difficult to establish an accurate mechanism model of the system, and such a model cannot be applied to other industrial systems. How to use more readily available data to study wind turbine faults has therefore become a new research hotspot. In recent years, more and more wind farms have used supervisory control and data acquisition (SCADA) systems to record the operating status of wind turbines, and SCADA data are widely used in wind turbine condition monitoring [17]. Hong Wang et al. proposed a deep belief network based on SCADA data for feature learning and classification to realize sensor fault detection [18].
In addition to learning spatial correlation information between several different variables, this method can also capture the temporal characteristics of each variable. However, the method has only been verified on a general benchmark model, and a large amount of real SCADA operating data is required for further verification. [19] fits a support vector machine (SVM) regression model of gearbox oil temperature using selected variables in the SCADA data as predictors, and uses the residual between the predicted value and the real value to predict gearbox failure in advance. However, this method has two shortcomings. Firstly, SVM is a non-linear method in essence. When processing a large number of samples, the performance of SVM degrades noticeably, and it is sensitive to missing values in the data. Wind farm SCADA data, however, accumulate quickly, so in the long run a regression prediction model based on the SVM algorithm is not suitable for processing large volumes of SCADA data. Secondly, the conclusion given in [19] is that the wind turbine fault can be warned of ten days in advance, which lacks practicability in real applications. If the wind turbine is stopped for maintenance ten days in advance because of a possible early failure, it will be out of service for a long time and a large amount of power generation time will be lost. Therefore, when using SCADA data for fault warning, attention should be paid to the large volume of SCADA data and to the setting of the fault warning thresholds.
The development of machine learning and multi-sensor data fusion technology provides new ideas for wind turbine status monitoring and fault warning [20], [21]. Aiming at the early warning of key wind turbine parts, this paper proposes an early warning method for key parts of the wind turbine based on SCADA data. The method uses XGBoost to establish a regression prediction model of the normal temperature of wind turbine components, and uses the trend of the residual between the predicted value and the actual value as the early warning indicator. When setting the alarm threshold, this paper adopts the EWMA principle to monitor the trend of the residual, and divides the normal, early warning and alarm intervals by setting control lines, so as to realize gradual, multilevel early warning for the key parts of the wind turbine. However, because the operating conditions of a wind turbine are complex, the SCADA parameters may deviate from their normal values within an acceptable range under the influence of uncontrollable objective factors. If the fault alarm system adopts a fixed threshold, a residual extreme value may exceed the threshold and lead to a false alarm. To solve the false alarm problem caused by a fixed fault threshold, this paper designs a dynamic threshold setting method based on the adaptive principle. To verify the effectiveness of the algorithm, this paper uses the 2019 SCADA data of a wind farm in Inner Mongolia (China) to conduct early warning experiments on wind turbine generator and gearbox failures. The results show that the fault early warning method based on the fixed threshold can detect the early fault characteristics of the wind turbine generator 7.25 hours in advance, send out the generator fault alarm 3 hours in advance, find the early fault characteristics of the gearbox 3.75 hours in advance, and send out the gearbox fault alarm 1.75 hours in advance. When the dynamic fault threshold method is used to determine the fault, it not only adapts to sudden residual extreme values and avoids false alarms, but also sends out the generator fault alarm 4.25 hours in advance and the gearbox fault alarm 2.75 hours in advance, which is 1.25 hours and 0.75 hours earlier, respectively, than with the fixed fault threshold.
The rest of this paper is organized as follows. Section II introduces the overall framework and main algorithms of the fault early warning method. Fault warning of the generator and of the gearbox is described in Sections III and IV, respectively. Section V discusses the advantages of XGBoost over the SVM regression prediction model. Finally, Section VI gives the conclusion and outlook of the paper.

II. ANALYSIS OF FAULT EARLY WARNING ALGORITHM

A. ALGORITHM OVERVIEW
The generator and gearbox are the most frequent fault locations of a wind turbine [17], [22]. To keep the generator running normally, the wind turbine generator set is fitted with a cooling system and a lubrication system that prevent the generator from overheating. The cooling system mainly uses a high-power cooling fan to draw the heat generated by the generator out of the nacelle. During heat dissipation, the temperature of the inner winding of the generator becomes high when a failure occurs, for example when the power of the fan motor decreases. In theory, under the same ambient temperature, wind speed and power, if the generator of one wind turbine is hotter, or heats up faster, than those of other turbines, the turbine should be stopped and the generator checked for faults. The lubrication system, in turn, minimizes the bearing wear of the generator to ensure its stable operation. Bearing damage is generally accompanied by a high bearing temperature. Hidden gearbox faults can be detected by the operator on duty by observing the temperature and temperature rise of the gearbox, or by comparing the temperature at the same position with that of other wind turbines. Usually, the wind turbine is cooled by a radiator and fan when it runs at high speed in strong wind. The gearbox oil temperature is usually controlled at 60 °C. However, if the gearbox oil temperature is observed to reach 75 °C or even approach 80 °C, it is necessary to check the gearbox for radiator blockage, temperature control valve failure, fan blade damage, bearing wear and other problems. The generator and gearbox have complex internal structures, and their failure mechanisms are diverse and coupled. However, most failures eventually lead to an abnormal temperature rise of the generator and of the gearbox oil [16], [23].
Therefore, the generator temperature and gearbox oil temperature in SCADA data can be used as the entry point of fault warning. This paper designs a fault warning algorithm based on temperature prediction of the key components of wind turbines. The fault warning algorithm is shown in Fig.1. The algorithm flow is as follows:
1) Get the modeling data set. The SCADA historical data of a normally operating wind turbine are selected to establish the normal temperature prediction model of the key parts of the wind turbine.
2) Select the characteristic quantities. The Pearson correlation coefficient (PCC) is used to determine the monitoring items related to the temperature changes of wind turbine components.
3) Establish a normal temperature regression prediction model based on the XGBoost algorithm. Using the temperature-related characteristic parameters selected in step 2, a regression prediction model of the normal temperature of wind turbine components is established with the XGBoost algorithm.
4) Predict the temperature. The established model is used to predict, in real time, the temperature of the key parts of the running wind turbine.
5) Fault diagnosis. A threshold is set to judge the trend of the residual error, and thereby the occurrence and development of the fault.

B. SELECT THE CHARACTERISTIC QUANTITY
The first task in constructing the temperature regression prediction model is to determine the input characteristic quantities [24]. Since the working state of the wind turbine is easily affected by weather conditions such as wind speed and wind direction, the temperature changes of its internal components are random and volatile. In this paper, the PCC between the temperature data of an internal component and the data of the other monitoring items of the wind turbine is calculated to determine which monitoring data are related to the component temperature change. The correlation coefficients are calculated using the SCADA data of a wind turbine under normal operating conditions as the experimental data set. The temperature data series of a given wind turbine part in the experimental data set is taken as X, and the other observed characteristic data series are taken as Y_i. The PCC R_i of X and Y_i is obtained by substituting the temperature series X and the observed characteristic series Y_i into Eq.1:

$$R_i = \frac{\sum_{j=1}^{m} (X_j - \bar{X})(Y_{i,j} - \bar{Y}_i)}{\sqrt{\sum_{j=1}^{m} (X_j - \bar{X})^2}\,\sqrt{\sum_{j=1}^{m} (Y_{i,j} - \bar{Y}_i)^2}} \tag{1}$$

where m is the number of samples in each series, and $\bar{X}$ and $\bar{Y}_i$ are the sample means. After n such calculations, n correlation coefficient values are obtained. According to Eq.1, the correlation coefficient lies in the interval [-1, 1], so its absolute value lies in [0, 1]. The greater the absolute value of the correlation coefficient of two vectors, the stronger the correlation between them. The relationship between the correlation value and the correlation strength adopted in this paper is shown in Table 1.
In this paper, the monitoring items with a correlation coefficient greater than 0.6 are selected as part of the input characteristics of the regression prediction model of wind turbine component temperature. In addition, because the temperature signal is continuous in time, this paper borrows the idea of time series prediction and adds the component temperature of the previous period to the input vector of the XGBoost temperature regression prediction model. The input vector of the temperature regression prediction model therefore consists of two parts: the observed characteristic parameters of the SCADA system related to the temperature of the component, and the average temperature of the component over the previous 15 minutes.
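The selection rule described above can be sketched as follows. This is a minimal illustration, not the paper's code: the series names and values are hypothetical toy data, the function names are invented for the example, and the rule is read as applying to the absolute value of the PCC.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant series carries no linear information
    return cov / (sx * sy)

def select_features(target, candidates, threshold=0.6):
    """Keep candidate series whose |PCC| with the target exceeds the threshold."""
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) > threshold}

def add_lag(target, lag=1):
    """Pair each target value with the value `lag` steps earlier (lag feature)."""
    return [(target[t - lag], target[t]) for t in range(lag, len(target))]

# Hypothetical toy data, not taken from the paper's SCADA records.
gen_temp = [60.0, 61.5, 63.0, 62.0, 64.5, 66.0]
candidates = {
    "active_power": [900.0, 980.0, 1100.0, 1020.0, 1200.0, 1300.0],
    "yaw_angle": [12.0, 3.0, 9.0, 15.0, 4.0, 11.0],
}
selected = select_features(gen_temp, candidates)
lagged = add_lag(gen_temp)  # (previous temperature, current temperature) pairs
```

Here only the strongly correlated series survives the 0.6 cut, and `lagged` shows how the previous-period temperature can be paired with the current target value.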

C. REGRESSION PREDICTION MODEL BASED ON XGBoost
XGBoost is a special gradient boosting decision tree (GBDT) algorithm, which is an improvement of the basic GBDT algorithm. GBDT is a machine learning algorithm in which multiple classification and regression trees (CART) are built iteratively, following the gradient boosting method and the idea of ensemble learning [25].
Like GBDT, XGBoost is composed of several CART trees. In GBDT training, the negative gradient (the first derivative) of the loss function of the previous model is used as the target to fit when constructing the next model. In XGBoost, the loss function is expanded with a second-order Taylor expansion (using the second derivative as well) to fit the tree model faster and better. The tree model of XGBoost can be represented by Eq.2.

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{2}$$
where $\hat{y}_i$ is the predicted value; $x_i$ is the i-th sample input; K is the total number of trees; F represents the function space of the decision trees (all CART trees); and $f_k$ is a function in the function space F. To learn this model, the objective function must be minimized. The objective function of XGBoost is shown in Eq.3:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{3}$$

where n is the number of samples. The objective function of XGBoost consists of two parts. The first part is the loss function l, which measures the difference between the predicted value and the real value. The second part is the penalty term $\Omega$ on model complexity, which is used to prevent overfitting. The expansion of $\Omega$ is shown in Eq.4:

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} \omega_j^2 \tag{4}$$

where γ is the regularization parameter on the number of leaf nodes, which is mainly used to inhibit further splitting of nodes; λ is the regularization parameter on the leaf node weights, which prevents them from becoming too large; T is the number of leaf nodes; and $\omega_j$ is the score of the j-th leaf node.
In constructing a CART decision tree, the XGBoost algorithm solves the problem of split feature selection with a greedy strategy, and obtains the predicted score of each leaf by finding the minimum of the objective function. When selecting split features, XGBoost greedily enumerates the objective function values and selects the feature with the minimum objective function value at the current step as the split feature. When calculating the predicted score of each leaf node, XGBoost minimizes the objective function, and the minimizing point is the predicted score of the leaf node.
Since XGBoost internally uses a gradient boosting strategy, the trees are not all obtained at once when the classification and regression trees are constructed. Instead, a new tree is added at each step, and each new tree corrects the predictions of the trees built so far. Assuming that the predicted value of the model after generating the t-th tree is $\hat{y}_i^{(t)}$, the construction process of the XGBoost model is shown in Eq.5.

$$\hat{y}_i^{(0)} = 0, \quad \hat{y}_i^{(1)} = \hat{y}_i^{(0)} + f_1(x_i), \quad \ldots, \quad \hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{5}$$
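The objective function of each layer is shown in Eq.6:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C \tag{6}$$

where C collects the terms that do not depend on $f_t$. The purpose of each layer of model construction is to find an $f_t$ that minimizes the objective function. The Taylor expansion of the objective function at $f_t = 0$ is approximately as shown in Eq.7:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C \tag{7}$$

where $g_i = \partial_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is the first derivative and $h_i = \partial^2_{\hat{y}_i^{(t-1)}} l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is the second derivative. By deleting the constant terms in the formula, the objective function of step t is as shown in Eq.8:

$$Obj^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{8}$$

By introducing $\Omega(f_t)$ into Eq.8 and writing $f_t(x_i) = \omega_{q(x_i)}$, where q maps a sample to the index of its leaf node, it is concluded that:

$$Obj^{(t)} \approx \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T \tag{9}$$

where $G_j = \sum_{i \in I_j} g_i$, $H_j = \sum_{i \in I_j} h_i$, and $I_j = \{i \mid q(x_i) = j\}$ is the set of samples in the j-th leaf node. Setting the derivative of the objective function with respect to $\omega_j$ to zero gives the minimizing point, which is the predicted score of the leaf node:

$$\omega_j^* = -\frac{G_j}{H_j + \lambda} \tag{10}$$

The minimum value of the objective function is obtained as shown in Eq.11.

$$Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{11}$$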
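As a worked illustration of the leaf score and of the greedy split choice, the sketch below evaluates the standard XGBoost expressions on four toy samples. It assumes a squared-error loss l = 0.5 (y - ŷ)², for which g = ŷ - y and h = 1; the loss choice, the numbers and the function names are assumptions made for the example and are not taken from the paper.

```python
def grad_hess_squared_error(y_true, y_pred):
    """g_i and h_i for l = 0.5 * (y - y_hat)^2: g = y_hat - y, h = 1."""
    g = [p - y for y, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h

def leaf_weight(G, H, lam):
    """Optimal leaf score: w* = -G / (H + lambda)."""
    return -G / (H + lam)

def leaf_objective(G, H, lam):
    """Objective contribution of one leaf, without gamma: -0.5 * G^2 / (H + lambda)."""
    return -0.5 * G * G / (H + lam)

def split_gain(g, h, left_idx, right_idx, lam, gamma):
    """Objective reduction obtained by splitting one leaf into two leaves."""
    GL = sum(g[i] for i in left_idx)
    HL = sum(h[i] for i in left_idx)
    GR = sum(g[i] for i in right_idx)
    HR = sum(h[i] for i in right_idx)
    before = leaf_objective(GL + GR, HL + HR, lam) + gamma      # one leaf
    after = (leaf_objective(GL, HL, lam)
             + leaf_objective(GR, HR, lam) + 2.0 * gamma)       # two leaves
    return before - after

# Toy example: the previous model predicts 2.0 for every sample.
y_true = [1.0, 1.2, 3.0, 3.2]
y_prev = [2.0, 2.0, 2.0, 2.0]
g, h = grad_hess_squared_error(y_true, y_prev)

lam, gamma = 1.0, 0.1
w_left = leaf_weight(sum(g[:2]), sum(h[:2]), lam)    # pulls predictions down
w_right = leaf_weight(sum(g[2:]), sum(h[2:]), lam)   # pulls predictions up
gain = split_gain(g, h, [0, 1], [2, 3], lam, gamma)
```

A positive `gain` means the split lowers the regularized objective; a greedy tree builder would compare such gains over the candidate splits and keep the largest one.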
The core of the wind turbine fault warning algorithm is to establish a regression prediction model of the normal temperature of wind turbine components from the historical SCADA data of normal components. The construction process of the regression prediction model of wind turbine component temperature is shown in Fig.2. The specific construction process is as follows:
1) Construct the input data set. The model input parameters are extracted from the original SCADA monitoring data. The relevant parameters selected according to the PCC and the temperature value of the previous period are chosen, and the historical data of these parameters are used to build the data set.
2) Divide the data set. The sorted data set is divided into training, validation and test data sets in the ratio 7:2:1.
3) Initialize the XGBoost model. Set the model parameters, including the maximum depth of the decision trees, the learning rate of the model, the total number of training rounds, the number of threads used, and the learning objective and learning task.
4) Train the model with the training data set. In training, a first CART decision tree is constructed; the split features are then determined with the greedy algorithm so as to minimize the loss function, and the predicted scores of the leaf nodes are calculated to complete the construction of the next tree. By cycling through these steps, K trees with K classified feature nodes are finally constructed.
5) Tune the model parameters with the validation set. The parameters are adjusted repeatedly over multiple validation set prediction experiments, and the set of parameters with the highest prediction accuracy is selected as the final parameters of the model.
6) Verify the accuracy of the prediction model with the test data set.
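Steps 2 and 5 above can be sketched without any modeling library. The block below assumes the 7:2:1 split is taken over contiguous, time-ordered samples (the text only says the data are sorted), and `fit` and `validation_error` are hypothetical stand-ins for training and scoring a model.

```python
def split_7_2_1(samples):
    """Split an ordered list into training, validation and test parts (7:2:1)."""
    n = len(samples)
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]  # the remainder, roughly 10 %
    return train, val, test

def pick_best(param_grid, fit, validation_error):
    """Return the parameter set with the lowest validation error."""
    best, best_err = None, float("inf")
    for params in param_grid:
        model = fit(params)
        err = validation_error(model)
        if err < best_err:
            best, best_err = params, err
    return best, best_err

# Toy usage: the "model" is just its parameter dict, and the validation
# error is an invented function that happens to be smallest at depth 5.
samples = list(range(100))
train, val, test = split_7_2_1(samples)
grid = [{"max_depth": d} for d in (3, 5, 7)]
best, err = pick_best(
    grid,
    fit=lambda params: params,
    validation_error=lambda model: abs(model["max_depth"] - 5),
)
```

With the xgboost Python package, the settings listed in step 3 would plausibly correspond to `max_depth`, `eta`, `num_boost_round`, `nthread` and `objective`; that mapping is an assumption, since the paper does not list parameter names.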

D. FAULT DIAGNOSIS
By inputting the SCADA data of the running wind turbine into the above prediction model, the predicted temperature values of the wind turbine components under normal operation can be obtained. The residual between the predicted temperature value and the measured temperature value in the SCADA data represents the degree to which the current temperature state of the component deviates from the normal state. Therefore, by setting an appropriate threshold, the occurrence and development of a fault can be judged from the trend of the residual. Based on the exponentially weighted moving-average (EWMA) control chart principle and 3-sigma theory, this paper designs two fault threshold setting methods.

1) RESIDUAL TREND CHART BASED ON EWMA PRINCIPLE
EWMA is often used in statistical data processing. It takes the information of all previous observations into account through weighting coefficients, so as to reflect the recent trend of the target quantity [26], [27]. In this paper, a control chart based on the EWMA principle is used to monitor the trend of the residual value, and the normal, early warning and alarm intervals are separated by control lines. The expression of the EWMA control point value is shown in Eq.12:

$$v_t = \beta v_{t-1} + (1 - \beta) Re_t \tag{12}$$

where $Re_t$ is the residual at time t, and the coefficient β is the weight that the EWMA control chart assigns to historical data, β ∈ (0, 1]; this paper sets β = 0.9. $v_0$ is the mean value of the first four sampling points. In addition, since model prediction errors are unavoidable, processing the residual with EWMA not only reduces the fluctuation range of the residual values but also effectively reduces the number of false alarm points, making the alarm algorithm more stable and accurate.
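A minimal sketch of the EWMA smoothing follows. It assumes the recursion v_t = β v_{t-1} + (1 - β) Re_t with β weighting the history, and takes v_0 as the mean of the first four residuals; how the recursion is started from v_0 is an interpretation, and the residual values are hypothetical.

```python
def ewma(residuals, beta=0.9, init_points=4):
    """Smooth a residual series with an exponentially weighted moving average."""
    if len(residuals) < init_points:
        raise ValueError("need at least init_points residuals to initialise v_0")
    v = sum(residuals[:init_points]) / init_points  # v_0
    smoothed = []
    for r in residuals:
        v = beta * v + (1.0 - beta) * r  # history weighted by beta
        smoothed.append(v)
    return smoothed

# Hypothetical residuals: flat, then one isolated spike.
raw = [0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
smooth = ewma(raw)
```

The isolated spike of 10 is reduced to 1.0 in the smoothed series and then decays, which illustrates why single extreme points are less likely to cross a control line after smoothing.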

2) THE SETTING OF FIXED THRESHOLD BASED ON RESIDUAL TREND GRAPH
Firstly, the residual data set is obtained by calculating the residual between the predicted value and the measured value:

$$Re_t = X_t - Y_t \tag{13}$$

where $Y_t$ is the predicted value, $X_t$ is the measured value, and $Re_t$ is the residual value, so that a measured temperature above the prediction gives a positive residual. Then, the expected value E and standard deviation σ of the residual data set are calculated. The fixed threshold function is designed as shown in Eq.14.
According to the 3-sigma criterion, if the residuals obey a normal distribution, then 99.73% of the residual values are concentrated in the range (E − 3σ, E + 3σ), and almost all the values are in the range (E − 6σ, E + 6σ). Considering the inevitable error of the prediction model, a certain margin should be reserved when setting the threshold value. In this paper, the threshold value of 4 is set as the threshold of early fault warning, and the threshold value of 8 is set as the alarm line for abnormal component temperature. The temperature prediction residuals are calculated after the data of the actually running wind turbine are input into the model, and the calculated residuals are compared with the set thresholds. If the residual exceeds the early warning threshold, the wind turbine component is in the initial stage of failure. If the residual continues to increase to the alarm threshold, the wind turbine component is about to fail and an alarm for the component failure is necessary.
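The 3-sigma figure and the three-zone logic can be checked with a short sketch. The thresholds are passed in as plain numbers because the fixed threshold function itself is not reproduced here; the values 4 and 8 are those quoted in the text, and the residual series and function names are hypothetical.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

def classify(value, warn, alarm):
    """Map a smoothed residual to one of the three zones."""
    if value > alarm:
        return "alarm"
    if value > warn:
        return "early warning"
    return "normal"

def first_crossing(values, threshold):
    """Index of the first value above the threshold, or None if there is none."""
    for i, v in enumerate(values):
        if v > threshold:
            return i
    return None

coverage_3sigma = normal_coverage(3.0)  # about 0.9973

# Hypothetical smoothed residuals that drift upwards before a fault.
trend = [0.2, 0.5, 1.8, 3.9, 4.6, 6.1, 8.3, 9.0]
zones = [classify(v, warn=4.0, alarm=8.0) for v in trend]
warn_index = first_crossing(trend, 4.0)
alarm_index = first_crossing(trend, 8.0)
```

The indices of the first crossings are what the later sections convert into lead times before the recorded fault.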

3) THE DYNAMIC THRESHOLD SETTING METHOD BASED ON ADAPTIVE PRINCIPLE
The operating conditions of a wind turbine are complex. Under the influence of uncontrollable objective factors, the SCADA parameters of the wind turbine may deviate from their normal values while remaining within the acceptable range, which shows up as extreme points in the residual. If the fault alarm system adopts a fixed threshold, a residual extreme value above the constant threshold may lead to a false alarm. Building on the fixed threshold setting method, this paper designs a dynamic threshold setting method based on the adaptive principle. The specific steps are as follows:
Step 1: Set the data window size. According to the principle of the K-S test, if the p-value of the K-S test of two data sets is greater than 0.05, the two data sets can be considered to follow the same distribution [28], [29]. In this paper, following the K-S test principle, the length of the smallest data subset that reflects the characteristics of the original data set is selected as the size of the sliding window. As shown in Fig.3, first select a certain range of data from the beginning of the data as the data subset, and apply the K-S test to the subset and the original parent data set to test the similarity between the two. Then expand the subset range to the right step by step until the p-value between the subset and the parent set is greater than 0.05, and record the length of the subset at this point, which is the window size N.
Step 2: Calculate the threshold. As shown in Fig.4, according to the sliding window size N determined in the previous step, select the data within the range $\{Re_{i-N}, Re_{i-N+1}, \cdots, Re_i\}$ to calculate the threshold value. Since the temperature of the wind turbine changes slowly in the normal state, the amplitude of change of the residual value is small. Therefore, when setting the dynamic threshold, this paper fully considers the residual trend of the previous period within the window. This paper uses the method shown in Eq.15 to calculate the threshold from the residual values in the window.
where $Re_t$ is the residual value at time t, N is the size of the sliding window, and σ is the standard deviation of the residual under normal conditions.
Step 3: Move the data window frame by frame and set a new threshold according to step 2.
Step 4: Repeat step 3 to get the threshold values at all times, and then connect them to form an adaptive threshold graph fitting the trend of residual Re t .
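The four steps can be sketched as below under several stated assumptions. The two-sample K-S p-value is computed with a common asymptotic series approximation rather than a library routine; the starting subset length and the expansion step are hypothetical; and, because the threshold formula of Eq.15 is not reproduced here, `dynamic_threshold` uses a placeholder rule (window mean plus k times σ, with k = 3) that is only a stand-in for the paper's formula.

```python
import math

def ks_2samp(a, b):
    """Two-sample K-S statistic D and an asymptotic p-value."""
    a, b = sorted(a), sorted(b)
    n1, n2 = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n1 and j < n2:
        x = min(a[i], b[j])
        while i < n1 and a[i] <= x:
            i += 1
        while j < n2 and b[j] <= x:
            j += 1
        d = max(d, abs(i / n1 - j / n2))  # gap between the two empirical CDFs
    ne = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.2:
        return d, 1.0  # the series converges slowly here; the value is ~1
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))

def window_size(residuals, start=5, step=1, alpha=0.05):
    """Step 1: smallest prefix whose K-S p-value against the full set exceeds alpha."""
    n = start
    while n < len(residuals):
        _, p = ks_2samp(residuals[:n], residuals)
        if p > alpha:
            return n
        n += step
    return len(residuals)

def dynamic_threshold(residuals, n, sigma, k=3.0):
    """Steps 2 to 4 with a PLACEHOLDER rule: window mean + k * sigma."""
    thresholds = []
    for t in range(len(residuals)):
        window = residuals[max(0, t - n):t + 1]  # Re_{t-n}, ..., Re_t
        thresholds.append(sum(window) / len(window) + k * sigma)
    return thresholds

# Hypothetical residuals: a short unrepresentative start, then a stable pattern.
data = [5.0] * 10 + [math.sin(0.7 * k) for k in range(300)]
n = window_size(data)
thresholds = dynamic_threshold(data, n, sigma=0.5)
```

Because the first ten toy values are unrepresentative, the prefix has to grow past them before the test stops rejecting, which mirrors the expansion shown in Fig.3. The threshold series then follows the local residual level instead of staying constant.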

III. GENERATOR FAULT WARNING
To verify the reliability of the fault warning algorithm, this paper uses the SCADA data records of a wind farm in Inner Mongolia (China) as the experimental data. The rated power of the wind turbines used in the wind farm is 2000 kW, the cut-in wind speed is 3 m/s, the cut-out wind speed is 20 m/s, the impeller diameter is 110 m, and the data acquisition period of the SCADA system is 30 s. In this paper, a normally operating wind turbine and a wind turbine with a generator failure are selected and compared in the generator failure warning experiment.

A. GENERATOR TEMPERATURE RELATED PARAMETER SELECTION
This paper analyzes the data records of the SCADA system to find a wind turbine that has experienced a generator failure. The failure of this wind turbine occurred at 3:42:20 on October 20, 2019. To verify the validity of the temperature regression prediction algorithm, the data of another, normally operating wind turbine are selected for modeling, with a data recording interval from August 1, 2019 to October 31, 2019. After deleting null values, removing outliers and normalizing the original data, the generator temperature in the modeling data is taken as X and the other monitoring items as Y according to Eq.1. The PCC between each monitoring item and the generator temperature is calculated as shown in Table 2.

B. XGBOOST REGRESSION PREDICTION OF GENERATOR TEMPERATURE
Since the generator temperature changes slowly and the time span of the original data is large, prediction based on one sample point every 30 seconds would involve an excessive amount of data and slow down the prediction. Therefore, this paper predicts the average temperature of the generator every 15 minutes; that is, the monitoring item data input to the prediction model are 15-minute averages. The generator temperature regression prediction model was constructed according to the process shown in Fig.2. The minimum root mean square error (RMSE) of the final model was 0.484, and the minimum mean absolute error (MAE) was 0.335. The operating data of the normal wind turbine from October 25, 2019 to October 31, 2019 were used to predict the generator temperature during this period, and the results are shown in Fig.5.
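The averaging and the two error measures can be sketched as follows. With a 30 s acquisition period, one 15-minute average covers 30 consecutive samples; dropping an incomplete trailing block is an assumption of this sketch, and the numbers are toy values.

```python
import math

def block_average(samples, block=30):
    """Average consecutive blocks; 30 samples of 30 s each span 15 minutes."""
    return [sum(samples[i:i + block]) / block
            for i in range(0, len(samples) - block + 1, block)]

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy series: 60 raw samples (30 minutes) collapse to two 15-minute averages.
raw = [float(k) for k in range(60)]
averaged = block_average(raw)

toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
toy_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

RMSE penalizes the single large error more heavily than MAE does, which is why the two values reported for the model differ.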
The red curve in Fig.5 represents the predicted curve of generator temperature, and the green curve represents the actual temperature curve of the generator. When the wind turbine generator is in a normal operating state, the predicted temperature of the model fits the actual temperature of the generator well. By calculating the residual between the predicted value and the actual value of the generator temperature, the distribution of the predicted residual of the generator temperature during normal operation is obtained, as shown in Fig.6(a). It can be seen from the figure that, during normal operation of the wind turbine, the predicted residual of the model is roughly symmetric about zero. Although there are a few points with a larger absolute residual, the overall distribution is relatively uniform. The mean value of the residual data set is -0.01 and the standard deviation is 0.503. Using the kstest() function in Python 3.8, the K-S test of the residual data against the standard normal distribution gives a p-value of 0.125, which is greater than 0.05, so the residual data set can be regarded as conforming to a normal distribution. Fig.6(b) shows the residual distribution and the alarm thresholds after the residual is processed with the EWMA principle. In the figure, the blue curve represents the residual distribution curve after calculation, the green line represents the early warning threshold, and the red line represents the alarm threshold. When the residual is below the green early warning threshold, the generator temperature is normal. When the residual is between the green early warning threshold and the red alarm threshold, the generator temperature shows a rising trend that deserves attention. When the residual is above the red alarm threshold, the generator is about to break down, and the corresponding fault handling should be prepared.

C. GENERATOR FAULT WARNING
In this paper, the wind turbine with generator failure is selected for early warning experiment. The failure time of the wind turbine is 3:42:20 on October 20, 2019, and the time interval of temperature prediction data set is from October 14, 2019 to 7:00, October 20, 2019. Input the preprocessed data set into the temperature regression prediction model, and the temperature prediction results are shown in Fig.7. In Fig.7, the red curve represents the change curve of the predicted generator temperature, and the green curve represents the variation curve of the measured generator temperature. It can be seen from the figure that when the generator is in normal operation, the predicted value of the model is in good agreement with the actual value. When the generator is about to fail, the deviation between the predicted value and the actual value will gradually increase. At 3:42 on October 20, 2019, the SCADA system detected the generator failure and took braking measures to stop the generator. Therefore, the actual temperature of the generator in the figure dropped sharply after reaching the maximum value.

1) FAULT DIAGNOSIS BASED ON FIXED THRESHOLD
The residual between the predicted value and the actual value of the model is shown in Fig.8(a). When the generator is in normal operation, the residual is fairly evenly distributed. When the generator is about to fail, the residual gradually increases. Fig.8(b) shows the residual and the fault thresholds after EWMA processing.
It can be seen from Fig.8(b) that the residual when the generator is about to fail differs markedly from the residual under normal operation. To show the effect of the model on fault prediction more clearly, this paper enlarges the rising part of the residual mean curve to obtain Fig.9.
In Fig.9, the blue curve represents the residual distribution curve, the green line represents the early warning threshold, and the red line represents the alarm threshold.
It can be seen from Fig.9 that the residual first crosses the early warning threshold at the 44th sampling point and first crosses the alarm threshold at the 61st sampling point, while the actual fault point of the wind turbine is the 73rd sampling point. Since there is one sampling point every 15 minutes, the temperature regression early warning algorithm described in this paper detects the early features of the generator fault 7.25 hours (29 sampling points) in advance and raises the fault alarm 3 hours (12 sampling points) in advance.
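The conversion from sampling-point indices to lead time is simple arithmetic, sketched here in Python. The function names are illustrative. The indices 44, 61 and 73 and the 15-minute sampling interval come from the text.

```python
SAMPLE_MINUTES = 15  # one sampling point every 15 minutes

def first_crossing(values, threshold):
    """Return the 1-based index of the first value above threshold, or None."""
    for i, v in enumerate(values, start=1):
        if v > threshold:
            return i
    return None

def lead_time_hours(fault_index, detect_index, sample_minutes=SAMPLE_MINUTES):
    """Hours between a detection point and the fault point."""
    return (fault_index - detect_index) * sample_minutes / 60.0

# Sampling-point indices reported for the generator experiment.
print(lead_time_hours(73, 44))  # early warning lead time: 7.25
print(lead_time_hours(73, 61))  # alarm lead time: 3.0
```

`first_crossing` shows how such an index could be read off an EWMA-processed residual series once a threshold is fixed.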

2) FAULT DIAGNOSIS BASED ON ADAPTIVE DYNAMIC THRESHOLD
According to the method in section II, the adaptive window size is determined as 25 sampling points, and the dynamic threshold of generator fault is set as shown in Fig.10 (a). It can be seen from the figure that the change trend of dynamic threshold basically matches the change trend of residual, and the residual value exceeds the fault threshold value when the fault is about to occur. Similarly, in this paper, the curve of the rising part of the residual trend value is locally enlarged to obtain Fig.10 (b).
It can be seen from Fig.10(b) that before the fault occurs, the dynamic threshold slowly increases with the trend of the residual, and the residual crosses the dynamic threshold for the first time at the 56th sampling point. Therefore, the dynamic threshold raises the fault alarm 4.25 hours (17 sampling points) before the fault, which is 1.25 hours (5 sampling points) earlier than the fixed threshold.
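To make the idea of a window-based threshold concrete, the following Python sketch computes a threshold at each sampling point from a trailing window of earlier residual values. It is a generic rolling mean-plus-k-sigma rule used as a stand-in, not the adaptive algorithm of section II. The multiplier `k`, the function names, and the toy series are assumptions. Only the window size of 25 sampling points comes from the text.

```python
import statistics

def dynamic_threshold(values, window=25, k=3.0):
    """Threshold at each point from the trailing window of previous values.

    Returns a list aligned with `values`; entries are None until a full
    window of history is available.
    """
    thresholds = []
    for t in range(len(values)):
        if t < window:
            thresholds.append(None)
            continue
        history = values[t - window:t]          # excludes the current point
        mean = statistics.fmean(history)
        std = statistics.pstdev(history)
        thresholds.append(mean + k * std)       # assumed mean + k*sigma rule
    return thresholds

def first_exceedance(values, thresholds):
    """1-based index of the first point above its own threshold, or None."""
    for i, (v, th) in enumerate(zip(values, thresholds), start=1):
        if th is not None and v > th:
            return i
    return None

# Toy series: 40 points of small fluctuation, then a sustained rise.
values = [0.0, 0.1] * 20 + [0.5, 1.0, 1.5]
thresholds = dynamic_threshold(values, window=25, k=3.0)
print(first_exceedance(values, thresholds))  # 41
```

Since the threshold is recomputed from recent history, it follows slow changes in the residual level instead of staying at a value fixed in advance.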

IV. GEARBOX FAULT WARNING
To further verify the generality of the fault warning method, this paper also carries out early warning of a gearbox fault. A wind turbine with a gearbox fault is selected from the actual wind farm as the experimental object, and a normally running wind turbine is selected as the modeling object. The fault time of the faulty wind turbine is 18:32:33 on June 7, 2019, so the modeling period for the normal wind turbine is from May 1, 2019 to June 31, 2019. This paper uses the gearbox oil temperature in the SCADA data to represent the gearbox temperature [16].

A. GEARBOX TEMPERATURE RELATED PARAMETER SELECTION
Before using the model to predict the gearbox temperature of wind turbine, it is necessary to determine the monitoring quantity related to the gearbox temperature change. Through the calculation of PCC, the final selection of gearbox temperature related monitoring items is shown in Table 3.
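The PCC-based screening can be illustrated with a short Python sketch. The cut-off `min_abs_r`, the signal names, and the toy data are hypothetical and are not the entries of Table 3. The sketch only shows the mechanics of ranking candidate monitoring items by their correlation with the gearbox oil temperature.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def select_features(candidates, target, min_abs_r=0.5):
    """Keep candidate signals whose |PCC| with the target reaches min_abs_r."""
    scores = {name: pearson(series, target) for name, series in candidates.items()}
    return {name: r for name, r in scores.items() if abs(r) >= min_abs_r}

# Made-up toy signals; names are illustrative only.
oil_temp = [50.0, 52.0, 55.0, 59.0, 60.0]
candidates = {
    "active_power": [500.0, 650.0, 900.0, 1200.0, 1300.0],
    "nacelle_vibration": [0.3, 0.1, 0.4, 0.1, 0.3],
}
selected = select_features(candidates, oil_temp)
print(sorted(selected))  # ['active_power']
```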

B. REGRESSION PREDICTION MODEL OF GEARBOX TEMPERATURE
As in the generator fault warning, this paper predicts the average gearbox temperature every 15 minutes, that is, the monitoring data input into the prediction model are 15-minute averages. The gearbox temperature regression prediction model was built according to the construction process in Fig.2. The minimum RMSE and minimum MAE of the final model are 0.410 and 0.263, respectively.
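The 15-minute averaging of the monitoring data can be sketched as follows. The record layout of (timestamp, value) pairs, the function name, and the sample readings are assumptions made for illustration, since the raw SCADA export format is not described here.

```python
from collections import defaultdict
from datetime import datetime

def average_15min(samples):
    """Average (timestamp, value) samples into 15-minute bins.

    Returns a list of (bin_start, mean_value) sorted by bin start.
    """
    bins = defaultdict(list)
    for ts, value in samples:
        # Floor the timestamp to the start of its 15-minute interval.
        start = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
        bins[start].append(value)
    return [(start, sum(v) / len(v)) for start, v in sorted(bins.items())]

# Made-up oil temperature readings.
raw = [
    (datetime(2019, 6, 1, 0, 0, 0), 50.0),
    (datetime(2019, 6, 1, 0, 7, 30), 52.0),
    (datetime(2019, 6, 1, 0, 15, 0), 54.0),
    (datetime(2019, 6, 1, 0, 29, 59), 56.0),
]
for start, mean in average_15min(raw):
    print(start, mean)
```

Each output row then corresponds to one sampling point of the kind counted in the fault warning experiments.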
The historical data generated during normal operation of the wind turbine are input into the model, and the gearbox temperature prediction results under the normal state are obtained, as shown in Fig.11. The red curve in Fig.11 represents the predicted gearbox temperature, and the green curve represents the actual gearbox temperature. When the gearbox is in normal operation, the predicted temperature closely fits the actual gearbox temperature. The variation of the temperature residual during normal operation of the gearbox is shown in Fig.12(a). During normal operation of the wind turbine, the residual is roughly symmetric about zero. Although a few points have a large residual, the distribution is relatively uniform on the whole. The mean of the residual data set is 0.052 and the standard deviation is 0.338. Using the kstest() function in Python 3.8, this paper tests the residual data against the standard normal distribution. The resulting k-value is 0.062, which is greater than 0.05, so the residual data set is taken to conform to a normal distribution.
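A normality check of this kind can be sketched without external dependencies. In practice a library routine such as SciPy's `scipy.stats.kstest` would normally be used. The hand-written version below is a stand-in that standardizes the residuals with their own mean and standard deviation (an assumption about how the comparison with the standard normal distribution is made) and uses the asymptotic Kolmogorov formula for the p-value. Estimating the parameters from the same data makes the test approximate. The two data sets are synthetic.

```python
import math
import statistics

def ks_normal(residuals):
    """One-sample KS test of standardized residuals against N(0, 1).

    Returns (d, p): the KS statistic and the asymptotic Kolmogorov p-value.
    """
    n = len(residuals)
    mu = statistics.fmean(residuals)
    sigma = statistics.pstdev(residuals)
    z = sorted((r - mu) / sigma for r in residuals)
    std_normal = statistics.NormalDist()
    d = 0.0
    for i, value in enumerate(z):
        cdf = std_normal.cdf(value)
        # Largest gap between the empirical step function and the normal CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = math.sqrt(n) * d
    # Asymptotic survival function: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2).
    p, k = 0.0, 1
    while k <= 10000:
        term = math.exp(-2.0 * (k * lam) ** 2)
        p += 2.0 * (-1) ** (k - 1) * term
        if term < 1e-12:
            break
        k += 1
    return d, min(max(p, 0.0), 1.0)

n = 200
# Synthetic residuals built from normal quantiles (mean 0.052, std 0.338).
normal_like = [statistics.NormalDist(0.052, 0.338).inv_cdf((i + 0.5) / n) for i in range(n)]
# Synthetic, strongly skewed residuals built from exponential quantiles.
skewed = [-math.log(1.0 - (i + 0.5) / n) for i in range(n)]

d_norm, p_norm = ks_normal(normal_like)
d_skew, p_skew = ks_normal(skewed)
print(p_norm > 0.05, p_skew > 0.05)
```

A p-value above 0.05 means the hypothesis of normality is not rejected at the 5% level, which is the sense in which the residuals are said to conform to a normal distribution.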
The residual and threshold control areas processed by EWMA control principle are shown in Fig.12(b). The blue curve represents the residual mean change curve, and the red line represents the fault threshold.

C. GEARBOX FAULT WARNING
The data record of the wind turbine with the gearbox fault is extracted from the SCADA system and used as the fault test data set. The time interval of the data set is from June 1, 2019 to 19:00, June 7, 2019. After preprocessing, the data are input into the established temperature regression prediction model, and the temperature prediction results are shown in Fig.13. In Fig.13, the red curve represents the predicted gearbox temperature and the green curve represents the actual gearbox temperature. When the gearbox is in normal operation, the predicted value and the actual value fit well. When the gearbox is about to fail, the deviation between the predicted value and the actual value gradually increases.
At 18:32:33 on June 7, 2019, the SCADA system detected that the gearbox oil temperature was too high and took braking measures to stop the wind turbine and the gearbox. Therefore, in Fig.13, the actual temperature of the gearbox drops sharply after reaching its maximum value.

1) FAULT DIAGNOSIS BASED ON FIXED THRESHOLD
The residual between the predicted and actual values is shown in Fig.14(a). When the wind turbine is in normal operation, the residual is fairly evenly distributed, and when the gearbox is about to fail, the residual shows a gradual upward trend. The residual threshold control areas divided according to the EWMA control principle are shown in Fig.14(b). In Fig.14(b), the blue curve represents the residual distribution curve, the green line represents the early warning threshold, and the red line represents the alarm threshold. Compared with the residual of the gearbox under normal operation, the residual fluctuates strongly when the gearbox is about to fail and also shows a clear upward trend. To show the effect of the model on fault prediction more clearly, the rising part of the residual curve is locally enlarged to obtain Fig.15.
As can be seen from Fig.15, the residual first crosses the early warning threshold at the 54th sampling point and first crosses the fault alarm threshold at the 61st sampling point, while the gearbox fault point is the 69th sampling point. With one sampling point every 15 minutes, the temperature regression early warning algorithm described in this paper detects the early fault characteristics of the gearbox 3.75 hours (15 sampling points) in advance and raises the gearbox fault alarm 2 hours (8 sampling points) in advance.

2) FAULT DIAGNOSIS BASED ON ADAPTIVE DYNAMIC THRESHOLD
According to the method in section II, the adaptive window size is determined as 23 sampling points, and the dynamic threshold of the gearbox fault is set as shown in Fig.16(a). It can be seen from the figure that the trend of the dynamic threshold basically matches the trend of the residual, and the residual exceeds the fault threshold when the fault is about to occur. Compared with Fig.14(b), the dynamic threshold is not triggered by isolated peaks of the residual and therefore avoids false alarms.
Similarly, the rising part of the residual trend curve is locally enlarged to obtain Fig.16(b). It can be seen from Fig.16(b) that before the fault occurs, the dynamic threshold slowly increases with the trend of the residual, and the residual crosses the dynamic threshold for the first time at the 58th sampling point. Therefore, the dynamic threshold raises the fault alarm 2.75 hours (11 sampling points) before the fault, which is 0.75 hours (3 sampling points) earlier than the fixed threshold.

V. DISCUSSION
To evaluate the performance of the temperature prediction model, this paper uses the RMSE and MAE of the model output as evaluation criteria. The RMSE is given by Eq.15:
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2} \tag{15}$$
where $y_i$ is the true value, $\hat{y}_i$ is the predicted value, and $m$ is the number of samples in the test set. The MAE is given by Eq.16:
$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right| \tag{16}$$
where the symbols have the same meaning as in Eq.15.
To verify the prediction accuracy of the proposed algorithm, this paper builds a temperature prediction model based on SVR from the experimental data as a control experiment. The wind turbine historical data from the generator early warning experiment are taken as the experimental data set. The data set is divided into a training set, a validation set and a test set, which are used to train the SVR temperature prediction model, tune the model parameters and verify the prediction effect, respectively. The final parameters of the SVR temperature prediction model are "C = 5", "kernel = RBF" and "gamma = 0.01". The comparison between the test-set results of the SVR model and those of the XGBoost temperature regression prediction model is shown in Fig.17.
It can be seen from Fig.17 that the RMSE and MAE obtained by using XGBoost to establish the regression prediction model are smaller than those obtained by using SVR modeling. The smaller the RMSE and MAE are, the more accurate the regression prediction results are. Therefore, the regression prediction model based on XGBoost described in this paper has the advantage of high accuracy.
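The two error metrics used in this comparison, RMSE and MAE, follow their standard definitions and can be computed as in the Python sketch below. The function names and the sample temperatures are illustrative and are not taken from the experiments.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error over m paired samples."""
    m = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / m)

def mae(y_true, y_pred):
    """Mean absolute error over m paired samples."""
    m = len(y_true)
    return sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / m

# Made-up actual and predicted temperatures.
y_true = [60.0, 61.0, 62.0, 63.0]
y_pred = [60.5, 60.5, 62.0, 64.0]
print(rmse(y_true, y_pred))  # sqrt(0.375), about 0.612
print(mae(y_true, y_pred))   # 0.5
```

Because RMSE squares the errors before averaging, it penalizes occasional large deviations more heavily than MAE, which is why the two are usually reported together.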

VI. CONCLUSION AND PROSPECT
Early fault warning can effectively reduce the operation and maintenance costs of wind farms and improve the efficiency of power generation. To address the frequent faults in the generator and gearbox of wind turbines, this paper proposes a fault early warning method for key wind turbine components. In this method, XGBoost is used to establish the normal temperature regression prediction model of wind turbine components, and the trend of the residual between the predicted value and the actual value is used as the early warning index. In feature selection, in addition to using PCC to determine the monitoring data, this paper borrows the idea of time series prediction and selects the temperature of the component in the previous period as a feature, so that the selected features better reflect the temperature characteristics of the component. This paper uses the XGBoost algorithm to construct the temperature regression prediction model. Compared with the regression prediction model constructed with the SVR algorithm, XGBoost has higher prediction accuracy and is better suited to the large volume of SCADA data. In the fault state assessment, this paper uses a control chart based on the EWMA principle to monitor the trend of the residual, and divides the normal, early warning and alarm intervals by setting control lines. However, because the residual contains extreme values, a fixed fault threshold may lead to false alarms. To address this problem, this paper proposes a dynamic threshold setting method based on an adaptive algorithm, which avoids false alarms caused by extreme residual values and also raises the fault alarm earlier. The fault warning method proposed in this paper has wide applicability and has been verified in the fault warning of the generator and the gearbox.
In theory, this method can be applied to other industrial systems with similar multi-sensor data structures.
Future work will proceed in two directions: feature selection and overall fault warning of the wind turbine. On the one hand, because the SCADA system monitors a large amount of data, a method for automatically selecting the best features should be considered in order to improve the speed and accuracy of feature selection. On the other hand, although the method in this paper can predict faults of the generator and gearbox, the wind turbine is a typical nonlinear, multi-coupled system in which the relationships between different faults are complex and strongly coupled. How to use SCADA data to accurately predict the overall fault of the wind turbine remains a great challenge.