A SCADA-Data-Driven Condition Monitoring Method of Wind Turbine Generators

Changes in sensor measurement parameters of wind turbine SCADA systems usually do not provide reliable early alarms. To detect early faults or abnormal conditions of wind turbine generator components, a wind turbine generator condition monitoring framework based on the fusion of cascaded SAE abnormal condition monitoring and LightGBM abnormal condition classification is proposed. The framework consists of two parts. The first part is a cascaded SAE abnormal condition monitoring method with strong anti-interference capability, designed because early anomalies are easily drowned out. The cascaded SAE is trained with polynomial features and original features. The isolation forest is used to determine the alarm threshold of the reconstruction error between the input and output of the cascaded SAE. The operating condition of the wind turbine is judged by comparing the reconstruction error with this threshold. The second part is the abnormal condition classification based on LightGBM. The optimal parameters of LightGBM are searched by Bayesian optimization to build a LightGBM multi-class abnormal condition classification model. The results of the case study show that the proposed condition monitoring has high anomaly recognition capability: the cascaded SAE method has strong anti-interference properties and can capture the early abnormal conditions of wind turbine generators; LightGBM trains faster than other classifiers while maintaining abnormality classification accuracy.


I. INTRODUCTION
The harsh operating environment of wind farms results in wind turbines having a high failure rate [1]. The O&M cost of onshore wind turbines accounts for about 10%-15% of the total wind farm revenue [2]. Generator system failure is one of the main causes of wind turbine downtime and accounts for 37% of all fault downtime [3], [4]. Therefore, it is important to diagnose generator faults as early as possible to reduce downtime and maximize productivity. Condition monitoring is the process of determining if there are any abnormalities in the operating condition of wind turbines and when they occur; abnormality identification determines the type of abnormality or time-varying behavior [5], [6]. The abnormal condition of a wind turbine may develop into a permanent failure or may be able to recover to its original state after some time.
An effective wind turbine condition monitoring system requires the installation of numerous high-frequency sampling sensors. The wind turbine supervisory control and data acquisition (SCADA) system is capable of collecting and remotely or locally monitoring the operating parameters of the entire wind farm generator, and is characterized by fast signal changes and numerous operating parameters. The fault or abnormal characteristics of wind turbines are implicit in the SCADA variables that characterize their operating status, so the abnormal information carried by the SCADA variables can be mined to monitor the status of the wind turbine or to raise alarms on abnormal status.
(The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.)
Fault alarms for a range of subassemblies of wind turbines are performed using wind speed statistics in [7]. A fault detection algorithm based on Gaussian process is proposed based on SCADA data with operational variables (pitch angle and rotor speed) as inputs to an additional model in [8]. The effectiveness of wind turbine fault alarms can be improved by processing SCADA data or extracting certain features, such as processing actual SCADA imbalanced data [9], [10] and using NOFRFs approach to extract damage sensitive features [11].
The principal component analysis is used to select a set of partial variables containing the variation characteristics of the original data to locate wind turbine faults [12]. Models such as generative adversarial networks, transfer learning, and convolutional auto-encoders can be applied to fault detection scenarios with small samples or the same type of wind turbines [13], [14]. Long short-term memory networks and spatio-temporal multiscale neural networks considering spatio-temporal characteristics can effectively capture the fault information of wind turbines in SCADA data [10], [15], [16]. The method of generating reference space or constructing residuals based on the normal behavior of wind turbines detaches the abnormal conditions from the normal data by measuring the difference between normal and abnormal conditions of wind turbines [17]-[19]. Different variants of auto-encoders and neural networks are widely used for wind turbine fault diagnosis [18], [20].
Many studies can detect anomalies in wind turbines, but no further identification of the detected anomalies is done. Classification algorithms can be used to analyze fault data features to identify specific faults in wind turbines [21], [22]. Optimized support vector machine classifiers can be better implemented for wind turbine fault diagnosis [23], [24]. Various ensemble learning algorithms (e.g., random forest and XGBoost) using decision trees as base learners can provide wind turbine condition monitoring schemes [25]-[29]. The XGBoost algorithm has been further improved in terms of the loss function, regularization, and parallelization processing, and has better classification performance, which can improve the accuracy of fault identification more effectively [30]. A wind turbine condition monitoring method based on multi-feature monitoring parameter information fusion is proposed using an optimization scheme based on the Bayesian optimization algorithm and XGBoost feature weight measurement [28]. Although XGBoost has high classification accuracy, its training speed is not advantageous.
For wind turbine generator abnormal condition monitoring and abnormal condition classification, this paper builds a condition monitoring framework based on cascaded stacked auto-encoder (SAE) and LightGBM to achieve early abnormal condition capture and fast and accurate abnormal condition classification of wind turbines. Firstly, the normal wind turbine SCADA data are used to train the cascaded SAE network and the alarm threshold is determined by the isolation forest. For the wind turbine SCADA data containing abnormal conditions, the magnitude of reconstruction error and alarm threshold are compared to determine whether the wind turbine is abnormal or not. Then Bayesian optimization is used to search the hyperparameters of the LightGBM multiclassification model to identify different anomaly types. Finally, the effectiveness of the proposed method is verified with actual failure cases of wind turbines.
The paper is organized as follows. Section II describes the wind turbine abnormal condition monitoring model. Section III elaborates on the wind turbine abnormal condition classification model. Section IV trains the wind turbine condition monitoring framework. Section V analyzes the effectiveness of the model with examples. Section VI briefly presents the conclusions of this paper.

II. WIND TURBINE ABNORMAL CONDITION MONITORING
To detect abnormal operating conditions of wind turbine generator bearing assemblies promptly and improve the accuracy of abnormality monitoring, a cascaded stacked autoencoder (SAE)-based abnormality monitoring model is constructed. Auto-Encoder (AE) adjusts the model parameters in an unsupervised learning manner so that the model output reconstructs the input as accurately as possible. An AE consists of an encoder and a decoder, where the encoder extracts abstract features of the data; the decoder is the inverse of the encoder and constructs reconstructed values close to the input. The reconstructed values have the same physical meaning as the input values.
The data under normal conditions are selected as the input for training the AE. In this paper, the normalized SCADA data under normal operation of wind turbines are selected as the input of the AE. Let $x$ denote the cross-sectional data of the wind turbine SCADA measurement points at a given moment. The encoder and decoder are expressed as
$$h = \sigma_1(W_1 x + b_1)$$
$$\hat{x} = \sigma_2(W_2 h + b_2)$$
where $x$, $h$ and $\hat{x}$ are the input of the input layer, the output of the hidden layer and the output of the output layer of the AE, respectively; $\sigma_1$ and $\sigma_2$ are the activation functions of the encoder and decoder; $W_1$ and $W_2$ are the weights of the different layers; and $b_1$ and $b_2$ are the biases of the different layers.
The training process of the AE is the process of adjusting the parameter set $\{W_1, W_2, b_1, b_2\}$ to minimize the distance metric function $\mathrm{dist}(x, \hat{x})$ between the input and output. The SAE is formed by connecting several AEs in series, which can extract the higher-order features of the input data layer by layer and reduce the dimensionality of the input data layer by layer.
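The encoder, decoder and distance metric described above can be illustrated with a minimal sketch. It is not the network used in this paper: the 4-2-4 layer sizes, the hand-picked weights, the identity decoder activation and the use of MSE as the distance are all assumptions chosen so that the arithmetic can be followed by hand.

```python
def relu(v):
    """Element-wise ReLU activation."""
    return [max(0.0, a) for a in v]

def affine(W, x, b):
    """Return W x + b, with the matrix W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, x)) + bias for row, bias in zip(W, b)]

def autoencode(x, W1, b1, W2, b2):
    """Encoder followed by decoder; returns the hidden code and the reconstruction."""
    h = relu(affine(W1, x, b1))   # encoder with ReLU activation
    x_hat = affine(W2, h, b2)     # decoder with identity activation (assumption)
    return h, x_hat

def mse(x, x_hat):
    """Mean squared error, used here as the distance between input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

# Hypothetical weights: the code keeps variables 0 and 1, and the decoder
# assumes variable 2 follows variable 0 and variable 3 follows variable 1.
W1 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 1.0, 0.0, 0.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 0.0],
      [0.0, 1.0]]
b2 = [0.0, 0.0, 0.0, 0.0]

x_normal = [0.2, 0.5, 0.2, 0.5]     # respects the assumed correlation
x_abnormal = [0.2, 0.5, 0.9, 0.5]   # variable 2 deviates from variable 0

err_normal = mse(x_normal, autoencode(x_normal, W1, b1, W2, b2)[1])
err_abnormal = mse(x_abnormal, autoencode(x_abnormal, W1, b1, W2, b2)[1])
```

In this toy setting the sample that respects the assumed correlation is reconstructed exactly, while the sample that breaks it yields a nonzero error, which is the property the monitoring method relies on.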
The operating condition of the wind turbine is coupled with each SCADA variable. Considering the high-dimensional features of SCADA feature variables and the influence of polynomial features among variables on the SAE, a combined model of two cascaded SAEs is constructed in this paper to improve the effectiveness of anomaly monitoring. To obtain the high-dimensional features and interrelated features of the SCADA feature variables $x$, the polynomial and interaction features of $x$ are generated. For example, the $n$-th order polynomial features between two variables $a$ and $b$ are
$$[1,\; a^{n} b^{0},\; a^{n-1} b^{1},\; a^{n-2} b^{2},\; \ldots,\; a^{0} b^{n}]$$
The input to the first SAE is the polynomial feature $x'$ (which does not contain the constant 1 and the variable $x$ itself), and the output $\hat{x}'$ is obtained by training the SAE. The reconstruction error $e'$ between the original input $x'$ and the output reconstructed value $\hat{x}'$ is defined as the Euclidean norm of the difference between the two, i.e.
$$e' = \lVert x' - \hat{x}' \rVert_2$$
The input of the second SAE is the feature $[x, e']$, i.e., the original SCADA feature variable $x$ combined with the reconstruction error $e'$ of the first SAE. The reconstruction error $e$ between the input and output of the second SAE is used as the monitoring variable to measure whether the wind turbine is abnormal or not. The structure of the wind turbine abnormal condition monitoring model is shown in Section IV. Cascading the SAEs enables the reconstruction error to describe the operating state of the wind turbine more accurately, and reduces the cases in which a large fluctuation of the wind turbine is mistakenly identified as abnormal.
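The data flow of the cascade can be sketched as follows. The two `reconstruct_*` callables are hypothetical stand-ins for trained SAEs (here they simply return a fixed "normal" pattern), and the two-variable sample is invented for illustration; only the chaining of the first-stage error into the second-stage input follows the description above.

```python
import math

def euclidean_error(v, v_hat):
    """Euclidean norm of the difference between an input and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_hat)))

def cascade_error(x, x_poly, reconstruct_first, reconstruct_second):
    """Return (first-stage error, second-stage error) for one SCADA sample."""
    e_first = euclidean_error(x_poly, reconstruct_first(x_poly))
    z = list(x) + [e_first]   # second-stage input: original features plus first-stage error
    e_second = euclidean_error(z, reconstruct_second(z))
    return e_first, e_second

# Hypothetical stand-ins for the two trained SAEs: each returns the pattern
# it would have learned from normal data, regardless of the input.
reconstruct_first = lambda v: [0.1]
reconstruct_second = lambda z: [0.2, 0.5, 0.0]

x_normal = [0.2, 0.5]
x_abnormal = [0.2, 0.9]

# The single interaction feature of two variables is their product.
_, e_normal = cascade_error(x_normal, [x_normal[0] * x_normal[1]],
                            reconstruct_first, reconstruct_second)
_, e_abnormal = cascade_error(x_abnormal, [x_abnormal[0] * x_abnormal[1]],
                              reconstruct_first, reconstruct_second)
```

The second-stage error of the abnormal sample combines the deviation of the raw variable with the deviation already seen in the interaction feature.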
Under normal operating conditions of wind turbines, there is a stable correlation between their SCADA variables. When an abnormality occurs in the generator components, it is manifested as an abnormal value of one or several SCADA variables, resulting in a large deviation of SAE reconstruction value. With the further aggravation of the abnormality, the reconstruction error value also increases gradually, so the reconstruction error can be used to judge whether the wind turbine is abnormal. When the reconstruction error exceeds the alarm threshold, it can be judged that the wind turbine enters the abnormal alarm condition, which may be the development stage of the early fault of the wind turbine.
In this paper, the alarm threshold for wind turbine anomaly monitoring is determined by the isolation forest algorithm. The isolation forest is similar to the random forest, but both the feature used for a division and the split point are chosen at random each time, rather than based on information gain or the Gini index. During the tree building process, if a sample reaches a leaf node quickly (i.e., the distance from the leaf to the root is short), this sample may be anomalous. Anomalous samples can be isolated with fewer random feature splits than normal samples. The reconstruction error of the second SAE is used as the input of the isolation forest, and the alarm threshold is determined from the values of the detected reconstruction error anomalies.
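The observation that anomalous samples are isolated by fewer random splits can be checked with a small one-dimensional sketch. It is a simplified illustration rather than a full isolation forest: there is only one feature (the reconstruction error), no anomaly score normalization, and the data, tree count and subsample size are assumptions.

```python
import random

def isolation_depth(point, sample, rng, max_depth=50):
    """Count random splits until `point` shares its interval with at most one sample value."""
    current = list(sample)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        low, high = min(current), max(current)
        if low == high:   # identical values cannot be separated further
            break
        split = rng.uniform(low, high)
        # Follow the branch that contains the point, as in a tree traversal.
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth

def mean_isolation_depth(point, data, n_trees=200, subsample=64, seed=0):
    """Average isolation depth of `point` over many random subsamples of `data`."""
    rng = random.Random(seed)
    size = min(subsample, len(data))
    total = 0
    for _ in range(n_trees):
        total += isolation_depth(point, rng.sample(data, size), rng)
    return total / n_trees

# Hypothetical reconstruction errors: 63 small values and one large one.
errors = [0.10 + 0.0005 * i for i in range(63)] + [0.90]
typical_depth = mean_isolation_depth(errors[31], errors)
outlier_depth = mean_isolation_depth(0.90, errors)
```

With these values the large error is separated after far fewer splits on average than an error from the middle of the bulk, which is why a short path length is read as evidence of an anomaly.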

III. WIND TURBINE ABNORMAL CONDITION CLASSIFICATION
To determine the type of abnormal condition found by the cascaded SAE, this paper uses LightGBM to predict the type of possible early fault or abnormal condition of the wind turbine. LightGBM is an improved algorithm based on the Gradient Boosting Decision Tree (GBDT): it builds multiple decision trees (base learners) and combines their outputs to obtain the final result [31].
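Given a data set $\mathcal{D} = \{(x_i, y_i)\}$ with $n$ examples and $m$ SCADA features ($|\mathcal{D}| = n$, $x_i \in \mathbb{R}^m$), LightGBM predicts the abnormal condition type of wind turbines as $\hat{y}_i$. Here $y_i$ is the target value corresponding to $x_i$, i.e., the abnormal condition category of the wind turbine, and $x_i$ denotes the selected SCADA features.
LightGBM uses $K$ additive functions to predict the final classification target, i.e.
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$
where $\mathcal{F} = \{ f(x) = \omega_{q(x)} \}$ ($q: \mathbb{R}^m \to \{1, \ldots, T\}$, $\omega \in \mathbb{R}^T$) is the function space composed of decision trees. $q$ denotes the structure of each tree that maps an example to the corresponding leaf index. $T$ denotes the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure $q$ and leaf weights $\omega$.
In order to learn the set of functions in the model, the learning objective function of LightGBM is
$$\mathcal{L} = \sum_{i=1}^{n} l(\hat{y}_i, y_i) + \sum_{k=1}^{K} \Omega(f_k)$$
where $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. $\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$ is the regularization term that penalizes the complexity of the model and prevents overfitting. $\gamma$ and $\lambda$ are regularization coefficients. Let $\hat{y}_i^{(t)}$ be the prediction of the $i$-th instance at the $t$-th iteration. LightGBM adds a new function $f_t$, i.e., uses a stepwise forward additive model, to maximize the reduction of the following objective function:
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\big(\hat{y}_i^{(t-1)} + f_t(x_i),\, y_i\big) + \Omega(f_t)$$
A second-order approximation can be used to quickly optimize this objective function:
$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n} \Big[ l\big(\hat{y}_i^{(t-1)}, y_i\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)$$
where
$$g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\big(\hat{y}_i^{(t-1)}, y_i\big), \quad h_i = \partial^2_{\hat{y}_i^{(t-1)}}\, l\big(\hat{y}_i^{(t-1)}, y_i\big)$$
are the first- and second-order gradients of the loss function. Let $I_j = \{ i \mid q(x_i) = j \}$ be the instance set of leaf $j$. For a fixed structure $q(x)$, the optimal weight of leaf $j$ and the corresponding optimal objective function value (with the constant term dropped) are
$$\omega_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$$
$$\tilde{\mathcal{L}}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\big(\sum_{i \in I_j} g_i\big)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$$
Assume that $I_L$ and $I_R$ are the instance sets of the left and right nodes after the split. Letting $I = I_L \cup I_R$, the structure loss reduction after the split,
$$\mathcal{L}_{split} = \frac{1}{2} \left[ \frac{\big(\sum_{i \in I_L} g_i\big)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\big(\sum_{i \in I_R} g_i\big)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\big(\sum_{i \in I} g_i\big)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma$$
can be used to determine whether to divide and to identify the division candidates.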
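As a worked illustration of the optimal leaf weight and the structure loss reduction, the sketch below evaluates both quantities for one candidate split. It assumes the standard second-order gradient boosting expressions (leaf weight equal to minus the gradient sum divided by the Hessian sum plus λ; gain equal to half the difference between the child scores and the parent score, minus γ). The gradient values, λ and γ are hypothetical.

```python
def leaf_weight(g, h, lam):
    """Optimal leaf weight for the instances with gradients g and Hessians h."""
    return -sum(g) / (sum(h) + lam)

def leaf_score(g, h, lam):
    """Squared gradient sum over regularized Hessian sum for one node."""
    return sum(g) ** 2 / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam, gamma):
    """Structure loss reduction obtained by splitting a node into left and right children."""
    parent = leaf_score(g_left + g_right, h_left + h_right, lam)
    children = leaf_score(g_left, h_left, lam) + leaf_score(g_right, h_right, lam)
    return 0.5 * (children - parent) - gamma

# Hypothetical gradients: the left instances pull predictions up, the right ones down.
g_left, h_left = [-1.0, -1.0], [1.0, 1.0]
g_right, h_right = [1.0, 1.0], [1.0, 1.0]

w_left = leaf_weight(g_left, h_left, lam=1.0)
w_right = leaf_weight(g_right, h_right, lam=1.0)
gain = split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.1)
```

Because the two groups have opposite gradient signs, the parent score is zero and the split is clearly worthwhile; a split is kept only when the gain is positive.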
LightGBM's base classifier is the classification and regression tree based on histogram algorithm, which divides feature values into many bins and searches for split points on the bins. LightGBM abandons the level-wise decision tree growth strategy and uses the leaf-wise algorithm with depth restrictions. The LightGBM algorithm contains two innovative techniques, which are the gradient-based one-side sampling and the exclusive feature bundling, respectively, that enable the model to handle large-scale data and features more efficiently.
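The histogram idea, in which feature values are placed into bins and split points are searched only on bin boundaries, can be sketched as follows. This is a simplified illustration with equal-width bins and hypothetical gradients; it does not reproduce LightGBM's own binning, leaf-wise growth, gradient-based one-side sampling or exclusive feature bundling.

```python
def histogram_best_split(values, g, h, n_bins=4, lam=1.0):
    """Search split points on equal-width bin edges using accumulated gradient statistics."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0   # fall back to 1.0 if all values are equal
    g_bins = [0.0] * n_bins
    h_bins = [0.0] * n_bins
    for v, gi, hi in zip(values, g, h):
        b = min(int((v - low) / width), n_bins - 1)   # the maximum goes into the last bin
        g_bins[b] += gi
        h_bins[b] += hi
    g_total, h_total = sum(g_bins), sum(h_bins)
    parent = g_total ** 2 / (h_total + lam)
    best_edge, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(n_bins - 1):   # candidate split after bin b
        g_left += g_bins[b]
        h_left += h_bins[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (g_left ** 2 / (h_left + lam)
                      + g_right ** 2 / (h_right + lam) - parent)
        if gain > best_gain:
            best_edge, best_gain = low + (b + 1) * width, gain
    return best_edge, best_gain

# Hypothetical feature values with a change in gradient sign between 4 and 5.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
g = [-1.0] * 4 + [1.0] * 4
h = [1.0] * 8
edge, gain = histogram_best_split(values, g, h)
```

Only three candidate boundaries are evaluated here instead of one per distinct feature value, which is the source of the speed-up on large SCADA data sets.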

IV. WIND TURBINE CONDITION MONITORING

A. MODEL FRAMEWORK
In order to detect abnormal conditions during wind turbine operation and to identify the type of that abnormality, a wind turbine condition monitoring framework that incorporates cascaded SAE and LightGBM is designed, as shown in Figure 1.
The anomaly monitoring model based on two cascaded SAEs is trained using wind turbine normal operation data. The isolation forest algorithm is used to mine outliers in the reconstruction error produced by the second SAE, and the alarm threshold is determined from these outliers. If the reconstruction error is greater than the alarm threshold, the operating status of the wind turbine is abnormal. The wind turbine anomaly data are used as input to the cascaded SAE model to calculate the reconstruction error. The input SCADA data samples are labeled as normal or abnormal by comparing the reconstruction error with the alarm threshold. By adding labels to SCADA data with known abnormal types, the LightGBM classification model can be trained to further determine the fault category of the abnormal data screened by the cascaded SAE.
The idea of the wind turbine condition monitoring framework is: based on the wind turbine SCADA data, the cascade SAE monitors whether the wind turbine is abnormal in a certain period, and LightGBM further identifies the specific type of the abnormality.

B. MODEL TRAINING
The inputs for training the cascaded SAE are multiple SCADA characteristic variables of wind turbines under normal conditions. In this paper, three types of faults in generator bearing components of wind turbines are collected and 10 important SCADA variables are selected, namely, wind speed $v$, grid-side power $P$, generator front bearing temperature $T_a$, generator rear bearing temperature $T_b$, grid-side three-phase voltages $U_1 \sim U_3$, and grid-side three-phase currents $I_1 \sim I_3$, as shown in Table 1.
The cascaded SAE-based abnormal condition monitoring model for wind turbines requires training two SAE networks. Different models of wind turbines need to be trained separately for their respective anomaly monitoring models and to determine the alarm thresholds. Before training the first SAE in the anomaly monitoring model, polynomial features need to be constructed.
The theoretical maximum power output of the wind turbine is
$$P_{\max} = \frac{1}{2} \rho A v^3 C_p$$
where $\rho$ is the air density, $A$ is the swept area of the rotor, and $C_p$ is the wind energy utilization coefficient, whose maximum value is 0.59 according to Betz's law. In an actual wind turbine, $C_p$ is below this limit and usually takes a value of 0.35∼0.45. Since the theoretical maximum power output of the wind turbine is proportional to the third power of the wind speed, $v^3$ is added as a characteristic variable. Removing the power $P$ from the ten variables in Table 1 and adding $v^3$ gives the ten base feature variables. Since wind turbine power is related to a variety of parameters, polynomial features of degree 2 are generated from the wind turbine SCADA base features. The constructed polynomial features retain only the combined features between different features, i.e., they do not contain the 0-th power feature (i.e., 1), the features themselves (i.e., $a$, $b$), or the products of a feature with itself (i.e., $a^2$, $b^2$). After generating the polynomial features, the correlation coefficient between each polynomial feature and $P$ is computed, using the wind turbine power generation as the reference variable. The features with correlation coefficients greater than 0.8 are selected, and finally 35 polynomial features are retained.
These 35 polynomial features are used as the input of the first SAE, and the output dimensions of each layer in the encoding process of this SAE are set to 1000, 500, 250, and 50, respectively. The activation function is ReLU and the loss function is MSE. The model parameters are adjusted with the Adam optimizer. The loss function curves during SAE training are shown in Figure 2, which indicates that the model fits both the training and validation sets well.
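The feature construction and selection step can be sketched as below. The sketch generates pairwise interaction features only (no constant, no single variables, no squares) and keeps those whose Pearson correlation with the power exceeds 0.8. The three base variables, the four samples and the power values are invented for illustration and are not the data of this paper.

```python
from itertools import combinations
import math

def interaction_features(row):
    """Pairwise products of distinct variables in one SCADA sample (a dict)."""
    names = sorted(row)
    return {f"{a}*{b}": row[a] * row[b] for a, b in combinations(names, 2)}

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_features(rows, power, threshold=0.8):
    """Names of interaction features whose correlation with power exceeds the threshold."""
    feats = [interaction_features(r) for r in rows]
    selected = []
    for name in feats[0]:
        column = [f[name] for f in feats]
        if pearson(column, power) > threshold:
            selected.append(name)
    return selected

# Hypothetical samples: cubed wind speed, a bearing temperature and a per-unit voltage.
rows = [
    {"v3": 27.0, "Ta": 50.0, "U1": 1.00},
    {"v3": 64.0, "Ta": 40.0, "U1": 0.99},
    {"v3": 125.0, "Ta": 55.0, "U1": 1.01},
    {"v3": 216.0, "Ta": 45.0, "U1": 1.00},
]
power = [100.0, 240.0, 470.0, 800.0]

selected = select_features(rows, power)
```

In this toy data the two products involving the cubed wind speed track the power closely and are kept, while the product of temperature and voltage is discarded.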
Letting the reconstruction error between the input and output of the first SAE be the feature variable $e'$, the 10 features in Table 1 together with $e'$ (i.e., $[x, e']$) are used as the input of the second SAE. The output dimensions of each layer in the encoding process of the second SAE are set to 500, 250, and 50, respectively, and the other parameters are the same as for the first SAE. The reconstruction error (denoted by Re) constructed by the anomaly monitoring model of the two cascaded SAEs is shown in Figure 3.
The alarm threshold for cascaded SAE anomaly monitoring is determined by the isolation forest algorithm, and the proportion of anomalies in the sample is set to 0.02. The outliers that the isolation forest detects in the distribution of the reconstruction error data are shown in Figure 4. The average value of all outliers detected by the isolation forest is taken. In order to reduce the misjudgment of the condition monitoring caused by large fluctuations of the wind turbine during normal operation, the alarm threshold is set to twice this average value. In this paper, if the reconstruction error is greater than the alarm threshold for more than 30 s (for wind turbines with a 1 s acquisition interval), the wind turbine is considered to be operating abnormally from that time on.
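The alarm rule can be sketched as follows, assuming the threshold is twice the mean of the outlier reconstruction errors and that, at a 1 s sampling interval, "more than 30 s" means more than 30 consecutive samples above the threshold. The outlier values and the error series are hypothetical.

```python
def alarm_threshold(outlier_errors, factor=2.0):
    """Alarm threshold as a multiple of the mean reconstruction error of the outliers."""
    return factor * sum(outlier_errors) / len(outlier_errors)

def first_alarm_index(errors, threshold, min_duration=30):
    """Index at which the error has exceeded the threshold for more than
    `min_duration` consecutive samples, or None if that never happens."""
    run = 0
    for i, e in enumerate(errors):
        run = run + 1 if e > threshold else 0
        if run > min_duration:
            return i
    return None

# Hypothetical outlier errors found on normal data.
threshold = alarm_threshold([0.18, 0.20, 0.22])

# Hypothetical 1 s error series: a 10 s fluctuation, then a sustained excursion.
errors = [0.1] * 50 + [0.5] * 10 + [0.1] * 20 + [0.5] * 40
alarm_at = first_alarm_index(errors, threshold)
```

The short 10 s excursion is ignored, and the alarm is raised only once the second excursion has persisted beyond the 30-sample window.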
The actual anomaly labels are added to the anomaly feature samples $[x, e']$ detected by the cascaded SAE, and the labeled samples are used as the input for training LightGBM. Because the proportion of data corresponding to each anomaly label in the wind turbine SCADA data is uneven, this paper uses stratified sampling to ensure that the proportion of label samples used for LightGBM training is the same as in the original data set. Bayesian optimization search and 5-fold cross-validation are used to determine the optimal LightGBM hyperparameters in a single search, as shown in Table 2. The trained LightGBM can output the predicted anomaly types.
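The effect of stratified sampling on the folds can be sketched as below. The round-robin assignment without shuffling is a simplification, and the label counts are hypothetical; the Bayesian hyperparameter search itself is not shown.

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Split sample indices into k folds that preserve the label proportions."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        # Deal the indices of each label round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical, unevenly distributed anomaly labels.
labels = ["F1"] * 50 + ["F2"] * 30 + ["F3"] * 20
folds = stratified_folds(labels, k=5)
fold_counts = [
    {name: sum(labels[i] == name for i in fold) for name in ("F1", "F2", "F3")}
    for fold in folds
]
```

Each fold then contains the three labels in the same 5:3:2 ratio as the full set, so every cross-validation split sees all anomaly types.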

V. CASE STUDY

A. CASCADE SAE ABNORMAL CONDITION MONITORING
The reconstruction error calculated by the cascaded SAE anomaly monitoring model is small under the normal operation of the wind turbine. During abnormal periods or early fault development, the reconstruction error increases suddenly or creeps upward. Comparing the reconstruction error value with the alarm threshold can identify whether an early failure or anomaly exists in the wind turbine. The abnormality monitoring capability of the cascaded SAE is analyzed with actual cases of 3 different wind turbines in wind farms.
A wind turbine in Case 1, which collects SCADA cross-section data every 1 s, was shut down on site at 2015-10-17 01:11:15 for a generator front bearing temperature overrun fault. The cascaded SAE found the anomaly at the red dot in Figure 5 (i.e., 2015-10-16 13:03:21), 12:07:54 ahead of the on-site shutdown (the alarm threshold set by the isolation forest is 0.3786). The generator front bearing temperature overrun fault is triggered when the generator front bearing temperature stays above 95 degrees Celsius for 5 seconds. In this paper, in order to reduce the false alarms generated by operating fluctuations, the moment at which the reconstruction error has remained above the alarm threshold for more than 30 s is taken as the moment when the hidden trouble occurs in the wind turbine.
As can be seen from Figure 5, the reconstruction error of this wind turbine falls back to normal in the sample index interval [110000, 130000] after the red point. By analyzing the curves of wind speed $v$ and generator front bearing temperature $T_a$ (as shown in Figure 6), the wind turbine was found to be abnormal near the red point moment. Larger increases and decreases in $T_a$ occurred, but $T_a$ did not meet the conditions for this wind turbine to trigger the temperature overrun fault. In the sample index interval [110000, 130000], the cascaded SAE judged that the abnormality of the wind turbine had disappeared.
A single SAE was constructed as a comparison to the cascaded SAE model; the output dimensions of each layer of its encoding process are 500, 250, and 50, and its inputs are the variables in Table 1. The wind turbine anomaly monitoring of the single SAE is shown in Figure 7, which amplifies the larger fluctuations of this wind turbine in the sample index interval [90000, 130000]. At this point, the variable $T_a$ physically directly reflects the fluctuations of this wind turbine and dominates the trend of the reconstruction error. Although a single SAE can detect the abnormal condition of wind turbines earlier than the cascaded SAE, it is less robust and more affected by the larger fluctuations.
A wind turbine in Case 2 records cross-section data every 1 s and was shut down on site at 2015-07-24 04:50:00 due to a generator rear bearing temperature overrun fault. The cascaded SAE model of this wind turbine was constructed, and the isolation forest determined its alarm threshold to be 0.4218. The reconstruction error of the cascaded SAE exceeded the alarm threshold at 2015-07-24 02:06:42, and the wind turbine was judged to have an abnormality. The fault was thus found 02:43:18 earlier than on site, as shown in Figure 8. The generator rear bearing temperature overrun fault is triggered when the generator rear bearing temperature stays above 95 degrees Celsius for 5 seconds.
For the single SAE model (with the same structure as the single SAE in Case 1), shown in Figure 9, the time to detect anomalies is similar to that of the cascaded SAE. The single SAE is less resistant to interference than the cascaded SAE, and both have roughly the same reconstruction error variation trend.
The variable $T_b$ physically directly reflects this fault of this wind turbine. After the sample index value of 45000, the trend of the reconstruction error is similar to the trend of $T_b$, as shown in Figure 10. For the generator bearing temperature overrun fault, attention needs to be paid to the trend of the bearing temperature.
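The early-warning lead times reported for Case 1 and Case 2 follow directly from the detection and shutdown timestamps, as the short check below shows. The helper name `lead_time` is introduced here for illustration only.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def lead_time(detected, shutdown):
    """Time between the cascaded SAE alarm and the on-site shutdown."""
    return datetime.strptime(shutdown, FMT) - datetime.strptime(detected, FMT)

# Timestamps as reported for Case 1 and Case 2.
case1 = lead_time("2015-10-16 13:03:21", "2015-10-17 01:11:15")
case2 = lead_time("2015-07-24 02:06:42", "2015-07-24 04:50:00")
```

Here `str(case1)` evaluates to `12:07:54` and `str(case2)` to `2:43:18`, matching the lead times stated for the two cases.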
A wind turbine in Case 3 records cross-section data every 10 min and was shut down at 2015-10-23 17:00:00 due to damage to the front and rear bearings of the generator. The cascaded SAE model of this wind turbine was constructed, and the isolation forest determined its alarm threshold to be 0.3702. The reconstruction error of the cascaded SAE exceeded the alarm threshold at 2015-10-21 07:50:00, and the wind turbine was judged to have an early fault. Given these two timestamps, the fault was found 2 days 09:10:00 earlier than on site, as shown in Figure 11.
As shown in Figure 12, the single SAE model basically cannot detect the early failure of this wind turbine; it only detects anomalies in very short intervals. Its abnormality monitoring capability is worse than that of the cascaded SAE. The trends of wind speed, $T_a$ and $T_b$ of this wind turbine are shown in Figure 13. The values of $T_a$ and $T_b$ fluctuate within the normal range before the failure shutdown of this wind turbine.
When early faults occur in wind turbine operation, the correlation between variables is destroyed. However, the cascaded SAE model still reconstructs the abnormal data according to the correlation learned from normal operation, which leads to an increase in the reconstruction error value, so the abnormal condition or early faults of wind turbines can be monitored. The samples with reconstruction error greater than the alarm threshold are given normal or fault labels and used to train the LightGBM algorithm, which then classifies the abnormal condition categories.

B. LIGHTGBM ABNORMAL CONDITION CLASSIFICATION
After the samples whose reconstruction errors exceed the alarm threshold have been detected, the LightGBM model is trained with the above three types of faults to identify the corresponding anomaly categories. LightGBM is trained with the anomaly data detected by the cascaded SAE for these three wind turbines, covering three types of fault conditions: generator front bearing temperature overrun fault $F_1$, generator rear bearing temperature overrun fault $F_2$, and generator front and rear bearing damage $F_3$.
The wind turbine anomaly identification results are shown in the confusion matrix in Table 3. The values in the table indicate the number of entries of wind turbine SCADA data. The diagonal entries of the confusion matrix are the numbers of correct classifications, and the off-diagonal entries are the numbers of incorrect classifications in the corresponding row or column. Table 3 shows that LightGBM was able to identify faults $F_1$ and $F_2$ 100% of the time, and for fault $F_3$, there were very few times when it was incorrectly classified as fault $F_2$. The reason for this could be the similar fluctuations in the SCADA characteristics of the wind turbines for both early faults after the reconstruction error exceeded the threshold.
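How per-class identification rates are read off a confusion matrix can be sketched as follows. The counts below are hypothetical and are not the values of Table 3; rows are assumed to be the true classes and columns the predicted classes.

```python
def per_class_recall(matrix):
    """Fraction of each true class (row) that was classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

def overall_accuracy(matrix):
    """Share of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical counts for three fault classes (rows: true, columns: predicted).
confusion = [
    [100, 0, 0],
    [0, 80, 0],
    [0, 3, 57],
]
recall = per_class_recall(confusion)
accuracy = overall_accuracy(confusion)
```

In this example the first two classes are always identified, while a few samples of the third class fall into the second column, which is the same pattern of confusion described above.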
The cascaded SAE anomaly monitoring algorithms are constructed in the same way for these three fault types of wind turbines. Different types of classifiers are trained based on SCADA anomaly data with added anomaly labels. LightGBM was compared with support vector machine (SVM), decision tree, random forest, and XGBoost, as shown in Table 4.
Table 4 shows that LightGBM has better classification accuracy than the other classifiers. Random forest and XGBoost are both decision-tree-based algorithms, and their accuracy is close to that of LightGBM. The decision tree trains faster than LightGBM because of its low model complexity, but its accuracy is lower. XGBoost's accuracy is closest to that of LightGBM and it trains faster than other classifiers, but its training time is still 6 times longer than that of LightGBM. LightGBM therefore has certain advantages in both training time and classification accuracy, which verifies its effectiveness for wind turbine abnormal condition classification.

VI. CONCLUSION
In order to detect early faults or abnormal conditions of wind turbine generator components in a timely manner, this paper designs a framework for wind turbine generator condition monitoring based on cascaded SAE and LightGBM, and verifies the effectiveness of this framework through case studies. By training the cascade SAE with normal operation data of wind turbines and setting appropriate alarm thresholds for different wind turbines, the abnormal data capture of early faults of generator components is achieved. The problem of difficulty in obtaining early fault samples in the field is also solved. The cascaded SAE is less affected by the fluctuation of wind turbine operation than the single SAE and has a certain anti-interference capability. The parameters of the LightGBM anomaly classification model are determined by Bayesian optimization search and combined with stratified sampling and 5-fold cross-validation. Compared with other classifiers, LightGBM has higher classification accuracy and faster training speed, and can accurately identify wind turbine anomaly categories.