Fault diagnosis methods based on machine learning and its applications for wind turbines: a review

With the increase in the installed capacity of wind power systems, the fault diagnosis and condition monitoring of wind turbines (WT) have attracted increasing attention. In recent years, machine learning (ML) has played a crucial role as an emerging technology for the fault diagnosis of wind power systems. Even though ML methods have shown great potential in dealing with issues related to the fault diagnosis of WT, several challenges remain in many respects. In this paper, typical ML-based fault diagnosis methods for wind power systems are thoroughly reviewed in terms of both theoretical fundamentals and industrial applications, covering traditional machine learning (TML), artificial neural networks (ANN), deep learning (DL) and transfer learning (TL), following the development line of ML technologies. The advantages and disadvantages of the various methods are analyzed and discussed. Meanwhile, a distribution diagram of the ML methods applied to WT fault diagnosis is provided to support the discussion, and the existing challenges in applying ML-based fault diagnosis to wind power generation systems are presented. Moreover, prospects for future research directions are provided.

With the increasing consumption of fossil fuels and the gradual deterioration of environmental problems, there is an urgent need for clean and renewable energy sources. Wind energy is irreplaceable in modern energy structures owing to its rapid growth. Wind power accounts for 20% of the world's total electricity, and WT are receiving increasing attention as the core components of wind power generation [1], [2]. Usually, wind power generators are installed in remote or offshore areas where access is inconvenient, and the gearbox is generally installed in a nacelle tens of meters above the ground.

Consider two samples $x_1$ and $x_2$ lying on the two margin boundaries of a linear classifier $f(x) = w \cdot x + b$, such that $f(x_1) = 1$ and $f(x_2) = -1$; the margin then equals:
$$\rho = \frac{2}{\|w\|}.$$
Thus, maximizing the margin is equivalent to minimizing $\|w\|^2$ or $\frac{\|w\|^2}{2}$. Then, to achieve the optimal hyperplane, the SVM solves the following optimization problem:
$$\min_{w, b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w \cdot x_i + b) \geq 1, \quad \forall i = 1, 2, \cdots, l.$$
The transformation of this optimization problem into its corresponding dual problem gives the following quadratic problem:
$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j (x_i \cdot x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} y_i\alpha_i = 0,\ \alpha_i \geq 0,\ \forall i = 1, 2, \cdots, l,$$
where $\alpha_i$ is the Lagrange multiplier. The solution of the previous problem gives the parameter $w = \sum_{i=1}^{l} y_i\alpha_i x_i$ of the optimal hyperplane. Hence, in dual space, the decision function becomes:
$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} y_i\alpha_i (x_i \cdot x) + b\right).$$
The SVM maps the eigenvectors of the low-dimensional input space into a high-dimensional feature space through a kernel function, so that samples that are linearly inseparable in the original space become separable.
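To make the formulation above concrete, the following minimal Python sketch trains a kernel SVM on synthetic two-class features; the data, the RBF kernel, and the parameter C are illustrative assumptions rather than settings from any reviewed study.

```python
# Minimal illustrative sketch: kernel SVM for two-class fault classification.
# The synthetic data and RBF kernel are assumptions for demonstration only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic classes of 2-D "feature vectors" (e.g., vibration features).
X_healthy = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
X_faulty = rng.normal(loc=3.0, scale=1.0, size=(100, 2))
X = np.vstack([X_healthy, X_faulty])
y = np.array([1] * 100 + [-1] * 100)  # labels y_i in {+1, -1}

# The RBF kernel plays the role of the implicit high-dimensional mapping.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# The decision function corresponds to sum_i y_i alpha_i K(x_i, x) + b.
print(clf.decision_function(X[:5]))
print(clf.predict(X[:5]))
```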

The Bayes rule indicates how the information of known probability density functions, $p(x|\omega_i)$, and a priori probabilities, $P(\omega_i)$, can be used to calculate the a posteriori probabilities:
$$P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}.$$
This expression can be readily evaluated if the densities $p(x|\omega_i)$ are modeled as Gaussian distributions. In this case, we have:
$$\ln p(x|\omega_i) = -\frac{1}{2}(x - \mu_i)^{\top}\Sigma_i^{-1}(x - \mu_i) - \frac{1}{2}\ln|\Sigma_i| - \frac{d}{2}\ln 2\pi,$$
where $\mu_i$ is the mean, $\Sigma_i$ is the covariance matrix, and $\frac{d}{2}\ln 2\pi$ is a constant. All the necessary information about each class and feature cluster is contained in the mean vector and the covariance matrix: the center of each cluster is determined by the mean vector, and the shape of the cluster by the covariance matrix. As for classification tasks, in cases where all the relevant probabilities are known, Bayesian decision-making considers how to choose the best class label based on these probabilities and the losses of misjudgment.
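As an illustration of this decision rule, the sketch below models each class with an estimated mean vector and covariance matrix and selects the label maximizing $p(x|\omega_i)P(\omega_i)$; the synthetic data and the assumed priors are for demonstration only.

```python
# Minimal sketch of a Gaussian Bayes classifier: each class is modeled by a
# mean vector and covariance matrix, and the class with the largest posterior
# p(x|w_i) * P(w_i) is selected. Data and priors are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
classes = {
    "healthy": rng.normal(0.0, 1.0, size=(200, 2)),
    "faulty": rng.normal(2.5, 1.5, size=(200, 2)),
}
priors = {"healthy": 0.9, "faulty": 0.1}  # assumed a priori probabilities

# Estimate the mean vector and covariance matrix of each class.
models = {
    name: multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X.T))
    for name, X in classes.items()
}

def bayes_decision(x):
    # Choose the label maximizing p(x|w_i) * P(w_i); p(x) cancels out.
    scores = {name: m.pdf(x) * priors[name] for name, m in models.items()}
    return max(scores, key=scores.get)

print(bayes_decision(np.array([0.1, -0.2])))  # expected: "healthy"
print(bayes_decision(np.array([3.0, 3.5])))   # expected: "faulty"
```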

However, Bayesian diagnosis methods also have drawbacks: the diagnosis fails when a fault category has never been recorded in the data, and the diagnostic accuracy decreases when the input data attributes are not independent of each other. Moreover, the diagnostic system needs to know the prior probabilities in advance, which are usually obtained from a hypothetical model. If the selected model is unreasonable, the diagnostic accuracy decreases.

A hidden Markov model (HMM) involves two kinds of variables. The first one is a state variable, which indicates the state of the system at a precise moment; because the system often changes between these states and the states cannot be observed directly, the state variables of the system are hidden. The second is an observation variable, which represents the signal measured from the hidden state. The transition between states in the basic HMM depends on a certain probability, namely the state transition probability matrix, and it is assumed that this probability matrix and the states in the diagnostic model do not change over time.

An HMM with $N$ states can be expressed as follows: $N$ is the number of states of the HMM, with the state $b_t \in \{S_1, S_2, \cdots, S_N\}$ at moment $t$; $M$ is the number of possible observations for each state; $\pi = (\pi_1, \pi_2, \cdots, \pi_N)$ denotes the probability distribution of the initial state, where $\pi_i$ can be described as $\pi_i = P(b_1 = S_i)$ and satisfies the normalization condition $\sum_{i=1}^{N}\pi_i = 1$; $A$ is the state transition probability matrix and $B$ is the observation probability matrix. The HMM can thus be compactly denoted as $\lambda = (\pi, A, B)$.

By combining multiple base learners, ensemble learning can obtain better diagnostic results than a general single learner.
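To ground the notation $\lambda = (\pi, A, B)$, the following sketch evaluates the likelihood of a discrete observation sequence with the standard forward algorithm; the two states and all parameter values are illustrative assumptions.

```python
# Minimal sketch of the forward algorithm for a discrete HMM lambda = (pi, A, B).
# The two states ("healthy", "degraded") and all parameter values are
# assumptions chosen only to illustrate the recursion.
import numpy as np

pi = np.array([0.9, 0.1])            # initial state probabilities
A = np.array([[0.95, 0.05],          # state transition probability matrix
              [0.10, 0.90]])
B = np.array([[0.8, 0.2],            # observation probability matrix:
              [0.3, 0.7]])           # rows = states, cols = symbols {low, high}

def sequence_likelihood(obs):
    # alpha[i] = P(o_1..o_t, state_t = S_i | lambda)
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Observation sequence of vibration levels: 0 = low, 1 = high.
print(sequence_likelihood([0, 0, 1, 1, 1]))
```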
The K-means clustering algorithm partitions the samples by minimizing the objective:
$$J = \sum_{i=1}^{n}\left\|x_i - \mu_{c_i}\right\|^2,$$
where $x_i$ represents the $i$th sample, $c_i$ is the cluster to which $x_i$ is assigned, and $\mu_{c_i}$ is the center of that cluster.

For the fuzzy C-means (FCM) method, suppose we have a dataset $X$ and we want to classify the data in it. If these data are divided into $c$ classes, the corresponding $c$ class centers are $c_i$; each sample $x_j$ belongs to a certain class, and its membership of $c_i$ is denoted as $u_{ij}$. Then we define the FCM objective function and its constraints as follows:
$$J_m = \sum_{i=1}^{c}\sum_{j=1}^{n} u_{ij}^{m}\left\|x_j - c_i\right\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{c} u_{ij} = 1,\ \forall j = 1, 2, \cdots, n,$$
where $m$ is the fuzziness factor of the membership, generally set to 2, and $u_{ij} \in [0, 1]$.
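A minimal sketch of the resulting alternating optimization is given below; the closed-form membership and center updates are the standard FCM iterations, and the synthetic data, $c = 2$, and $m = 2$ are assumptions.

```python
# Minimal sketch of fuzzy C-means with the standard alternating updates:
# center update from the memberships, then membership update from distances.
import numpy as np

def fcm(X, c=2, m=2.0, n_iter=100, eps=1e-9):
    rng = np.random.default_rng(0)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                       # enforce sum_i u_ij = 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)   # c_i update
        d = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2) + eps
        # u_ij = 1 / sum_k (d_ij / d_kj)^(2 / (m - 1))
        U = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
    return U, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
U, centers = fcm(X)
print(centers)            # estimated class centers
print(U.argmax(axis=0))   # hard cluster assignment per sample
```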

The results showed that concurrent faults could be effectively diagnosed.

(b) Fault diagnosis applied to the bearing of WT.

The K-means CA was used to process the outliers of the monitoring data. A summary of the applications of TML for WT fault diagnosis is presented in Table 1.

The BPNN is a multilayer feedforward network trained by an error back-propagation algorithm. The training process includes two parts: the forward propagation of the diagnosis network and the reverse fine adjustment of the network parameters. The error propagation in the opposite direction distributes the output error through the hidden layers back to the input layer, assigning the error to the units of each layer. The learning rule applies the gradient descent method to continuously adjust the weights and thresholds of the network through backpropagation, as shown in Fig. 4, thus minimizing the sum of squared errors of the network. Given a training sample $(x_i, y_i)$, where $x_i \in \mathbb{R}^d$ includes $d$ features and $y_i \in \mathbb{R}^l$ includes $l$ health states, the output of the $h$th hidden layer is expressed as:
$$(x_i^h)_j = \sigma_h\left(\sum_{k=1}^{n_{h-1}}(\omega_j^h)_k\,(x_i^{h-1})_k + b_j^h\right), \quad j = 1, 2, \cdots, n_h,$$
where $(x_i^h)_j$ is the output of the $j$th neuron in the $h$th hidden layer, $x_i^0 = x_i$, $n_h$ is the number of neurons in the $h$th hidden layer, $\sigma_h$ represents the activation function of the $h$th hidden layer, $n_{h-1}$ is the number of neurons in the $(h-1)$th hidden layer, $\omega_j^h$ contains the weights between the neurons of the previous layer and the $j$th neuron in the $h$th hidden layer, and $b_j^h$ is the bias of the $h$th hidden layer. The predicted output of the BPNN is:
$$(\hat{y}_i)_k = \sigma_{out}\left(\sum_{j}(\omega_j^{out})_k\,(x_i^H)_j + b_k^{out}\right),$$
where $(\hat{y}_i)_k$ is the predicted output of the $k$th neuron in the output layer, $\sigma_{out}$ is the activation function of the output layer, and $\omega_j^{out}$ and $b^{out}$ are respectively the weights and bias of the output layer. When given a certain training sample $(x_i, y_i)$, the optimization objective of the BPNN is to minimize the error between the predicted output and the target one:
$$\min_{\omega, b}\ \frac{1}{2}\left\|\hat{y}_i - y_i\right\|^2.$$
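The sketch below trains such a multilayer feedforward network by backpropagation with gradient descent; the feature dimensions, hidden sizes, and the synthetic "health states" are illustrative assumptions.

```python
# Minimal sketch of a BPNN-style diagnosis model: a multilayer feedforward
# network trained by error backpropagation with gradient descent.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))                 # 300 samples, d = 8 features
# Hypothetical health states 0/1/2 derived from the first feature.
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])

net = MLPClassifier(hidden_layer_sizes=(16, 16),  # two hidden layers
                    activation="tanh",            # sigma_h
                    solver="sgd",                 # gradient descent updates
                    learning_rate_init=0.01,
                    max_iter=1000)
net.fit(X, y)
print(net.score(X, y))                        # training accuracy
```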

For the extreme learning machine (ELM), if the training dataset is given as $\{(x_i, t_i)\ |\ i = 1, 2, \cdots, N\}$, where $x_i$ is the $i$th data instance and $t_i$ is the label of the $i$th data instance, the output of the hidden layer can be written as $h(x) = [h_1(x), h_2(x), \cdots, h_L(x)]$, where $h_i(x)\ (i = 1, 2, \cdots, L)$ is the output of the $i$th hidden layer node and is not unique. Generally, $h_i(x)$ can be described as follows:
$$h_i(x) = g(w_i, b_i, x),$$
where $g(\cdot)$ is the activation function, with the Sigmoid function and the Gaussian function commonly used, and $w_i$ and $b_i$ are the parameters of the hidden layer nodes. Then the output of the "generalized" single hidden layer feedforward neural network ELM is:
$$f(x) = \sum_{i=1}^{L}\beta_i h_i(x) = h(x)\beta,$$
where $\beta = [\beta_1, \beta_2, \cdots, \beta_L]^{\top}$ is the vector of output weights.

The RBFNN is a single hidden layer feedforward neural network that uses a radial basis function as the activation function of the hidden layer neurons, which is a local response function. The radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin or from an arbitrary point $c$, called the central point. That is, the RBF can be described as follows:
$$\varphi(x) = \varphi(\|x - c\|).$$
In the application of this method, the number of center points has a strong influence on the performance of the network. The RBFNN has a strong nonlinear fitting ability and can model relationships that are difficult to analyze in the system. Moreover, it has a fast convergence speed and better generalization ability.
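The passage above stops before the training step; in the usual ELM formulation, the hidden parameters are drawn at random and fixed, and the output weights are obtained in closed form as $\beta = H^{\dagger}T$, where $H$ is the hidden layer output matrix. The sketch below follows that recipe with Gaussian (RBF-style) hidden nodes; the data, $L$, and the kernel width are assumptions.

```python
# Minimal sketch of an ELM with Gaussian (RBF-style) hidden nodes:
# hidden parameters are random and fixed, and the output weights are
# solved in closed form as beta = pinv(H) @ T (the usual ELM recipe).
import numpy as np

rng = np.random.default_rng(3)
N, d, L = 200, 4, 50                 # samples, features, hidden nodes
X = rng.normal(size=(N, d))
T = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)  # toy targets

centers = rng.normal(size=(L, d))    # random centers c_i (not trained)
width = 1.0                          # shared Gaussian width (assumption)

def hidden(Xb):
    # h_i(x) = exp(-||x - c_i||^2 / (2 * width^2)), a Gaussian RBF node
    dist2 = ((Xb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-dist2 / (2.0 * width ** 2))

H = hidden(X)                        # N x L hidden layer output matrix
beta = np.linalg.pinv(H) @ T         # closed-form output weights
pred = hidden(X) @ beta
print(np.mean((pred > 0.5) == T))    # training accuracy of the sketch
```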

The SOM process involves four major components: initialization, competition, cooperation, and adaptation. In the initialization process, all connection weights are initialized with small random values.

In the competitive process, if the input space is $D$-dimensional, the input data can be written as $x = \{x_i : i = 1, \cdots, D\}$ and the connection weights between the input units $i$ and the neurons $j$ in the computation layer can be written as $w_j = \{w_{ji} : j = 1, \cdots, N;\ i = 1, \cdots, D\}$, where $N$ is the total number of neurons. The discriminant function can be defined as the squared Euclidean distance between the input vector $x$ and the weight vector $w_j$ for each neuron $j$:
$$d_j(x) = \sum_{i=1}^{D}(x_i - w_{ji})^2.$$
The neuron whose weight vector comes closest to the input vector is declared the winner.

In the cooperative process, a topological neighbourhood for the neurons in the SOM is defined as:
$$T_{j, I(x)} = \exp\left(-\frac{S_{j, I(x)}^{2}}{2\sigma^{2}}\right),$$
where $S_{ij}$ is the lateral distance between neurons $i$ and $j$ on the grid of neurons and $I(x)$ is the index of the winning neuron.

A special feature of the SOM is that the size $\sigma$ of the neighbourhood needs to decrease with time, which is defined as:
$$\sigma(t) = \sigma_0 \exp\left(-\frac{t}{\tau_\sigma}\right),$$
where $\sigma_0$ and $\tau_\sigma$ are hyperparameters.

In the adaptive process, the winning neuron and its neighbours have their weights updated. In practice, the appropriate weight update equation is:
$$\Delta w_{ji} = \eta(t)\, T_{j, I(x)}(t)\, (x_i - w_{ji}),$$
where $t$ is a time epoch and $\eta(t) = \eta_0 \exp(-t/\tau_\eta)$ is the learning rate; $\eta_0$ and $\tau_\eta$ are hyperparameters.
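The four steps can be combined into the following minimal training loop; the grid size, decay constants, and data are illustrative assumptions.

```python
# Minimal sketch of SOM training implementing the four steps above:
# random initialization, winner selection by squared Euclidean distance,
# Gaussian topological neighbourhood, and neighbourhood-weighted updates.
import numpy as np

rng = np.random.default_rng(4)
D, N_side, T_max = 3, 5, 500          # input dim, 5x5 grid, training epochs
W = rng.random((N_side * N_side, D))  # initialization: small random weights
grid = np.array([[r, c] for r in range(N_side) for c in range(N_side)])
sigma0, tau_sigma, eta0, tau_eta = 2.0, 200.0, 0.5, 500.0

X = rng.normal(size=(100, D))         # toy input data

for t in range(T_max):
    x = X[rng.integers(len(X))]
    # Competition: winner = argmin_j sum_i (x_i - w_ji)^2
    winner = np.argmin(((x - W) ** 2).sum(axis=1))
    # Cooperation: Gaussian neighbourhood around the winner on the grid.
    S2 = ((grid - grid[winner]) ** 2).sum(axis=1)
    sigma = sigma0 * np.exp(-t / tau_sigma)
    T_neigh = np.exp(-S2 / (2.0 * sigma ** 2))
    # Adaptation: delta_w = eta(t) * T(t) * (x - w)
    eta = eta0 * np.exp(-t / tau_eta)
    W += eta * T_neigh[:, None] * (x - W)

print(W[:3])  # a few trained weight vectors
```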

In one application, the support vector regression method was proposed to predict the energy output, and the SOM method was used to reduce the dimensionality of the high-dimensional data. The experimental parts were based on real wind energy time series.

The structure of ART includes layers F1 and F2: layer F1 is the comparison layer and layer F2 is the recognition layer. The neurons of each sublayer of F1 work in a shunting model and operate in instantaneous balance. $V_i\ (i = 1, \cdots, M)$ represents the activity of the neurons and can be calculated as:
$$D\frac{dV_i}{dt} = -V_i + (1 - AV_i)J_i^{+} - (B + CV_i)J_i^{-},$$
where $A$ and $D$ are constants, $B, C = 0$, $J_i^{+}$ represents the stimulation, and $J_i^{-}$ represents the inhibition. Under the instantaneous-balance assumption, the solution of $V_i$ is:
$$V_i = \frac{J_i^{+}}{1 + AJ_i^{+}}.$$
The winning neuron $y$ is obtained by layer F2 selecting from the result of F1, which is calculated as:
$$y = \arg\max_j T_j, \qquad T_j = \sum_{i=1}^{M} z_{ji}V_i,$$
where $T_j$ is the similarity of neuron $j$ of F2 and $z_{ji}$ represents the bottom-up connection weight from neuron $i$ of F1 to neuron $j$ of F2 (a simplified algorithmic sketch of this competitive selection is given at the end of this subsection).

Another reviewed study combined an ANN with a multi-population genetic algorithm (MPGA). This is an effective work that provides a learning-based reliability analysis method for complex equipment.
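Since the shunting dynamics above are usually collapsed into an algorithmic form in practice, the sketch below implements a simplified ART1-style procedure (binary inputs, a choice function for F2, a vigilance test, and template update by intersection); it is an abstraction under stated assumptions, not the exact network described above, and beta and rho are illustrative.

```python
# Simplified ART1-style sketch: binary inputs, winner selection by a choice
# (similarity) function, a vigilance test, and template update by intersection.
import numpy as np

def art1(inputs, rho=0.7, beta=0.5):
    templates = []                       # learned category templates z_j
    labels = []
    for x in inputs:
        # F2 choice: T_j = |x AND z_j| / (beta + |z_j|), largest first.
        order = sorted(range(len(templates)), key=lambda j: -(
            np.minimum(x, templates[j]).sum() / (beta + templates[j].sum())))
        for j in order:
            match = np.minimum(x, templates[j]).sum() / x.sum()
            if match >= rho:             # vigilance test passed: resonance
                templates[j] = np.minimum(x, templates[j])
                labels.append(j)
                break
        else:                            # no category matched: create one
            templates.append(x.copy())
            labels.append(len(templates) - 1)
    return labels, templates

X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
labels, _ = art1(X.astype(float))
print(labels)   # cluster indices assigned to each binary pattern
```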

A summary of the applications of ANN for WT fault diagnosis is shown in Table 2. However, the ANN models above usually contain only one hidden layer, which limits their ability to learn deep hierarchical representations from raw monitoring data.

For the restricted Boltzmann machine (RBM), the joint distribution of the visible units $v$ and the hidden units $h$ is defined through an energy function $E(v, h; \theta)$, where $\theta = \{\omega, a, b\}$ represents the parameters of the RBM. Then, the marginal distribution of the visible units can be calculated as:
$$P(v; \theta) = \frac{\sum_{h} e^{-E(v, h; \theta)}}{\sum_{v}\sum_{h} e^{-E(v, h; \theta)}}.$$

For the autoencoder (AE), the encoder network maps an input sample $x_i$ to a hidden representation $h_i$:
$$h_i = \sigma_f(\omega x_i + b),$$
where $\sigma_f$ is the activation function of the encoder network and $\theta = \{\omega, b\}$ contains the training parameters of the encoder network. The reconstructed sample $\hat{x}_i$ can be obtained by the decoder network, which is expressed as follows:
$$\hat{x}_i = \sigma_g(\omega' h_i + b'),$$
where $\sigma_g$ is the activation function of the decoder network and $\theta' = \{\omega', b'\}$ represents the training parameters of the decoder network. In order to reconstruct the original input as well as possible, the optimization objective of the AE focuses on minimizing the error between the input samples and the reconstructed ones:
$$\min_{\theta, \theta'}\ \frac{1}{n}\sum_{i=1}^{n}\left\|x_i - \hat{x}_i\right\|^2.$$
The gradient descent algorithm is still utilized for the network training.

Fig. 6 shows a typical RNN schematic diagram. The RNN has both internal feedback and feedforward connections between processing units. The internal feedback connections can maintain the state of the hidden nodes and provide memory for the network. The output of the network is related not only to the current input but also to the internal state of the previous network, reflecting good dynamic characteristics. The output $O_t$ and the value of the hidden layer $S_t$ can be calculated as:
$$S_t = \sigma(UX_t + WS_{t-1}), \qquad O_t = g(VS_t),$$
where $t$ is the moment, $U, V, W$ are the parameters of the network, and $X_t$ represents the instance vector at moment $t$. In the whole training process, the same parameters $W$ are used at each time step, so the RNN fully considers the temporal association between samples.
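The recurrence can be written out directly; the sketch below runs the forward pass $S_t = \tanh(UX_t + WS_{t-1})$, $O_t = VS_t$ with parameters shared across time, on an assumed random sequence with illustrative dimensions.

```python
# Minimal sketch of the recurrent forward pass S_t = tanh(U X_t + W S_{t-1}),
# O_t = V S_t, with the same parameters shared across all time steps.
import numpy as np

rng = np.random.default_rng(5)
d_in, d_hid, d_out, T_len = 4, 8, 3, 10
U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(d_out, d_hid))

X = rng.normal(size=(T_len, d_in))   # a sequence of feature vectors X_t
S = np.zeros(d_hid)                  # initial hidden state S_0

outputs = []
for t in range(T_len):
    S = np.tanh(U @ X[t] + W @ S)    # hidden state carries memory
    outputs.append(V @ S)            # output depends on current state

print(np.array(outputs).shape)       # (10, 3): one output per time step
```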

A summary of the applications of DL for WT fault diagnosis is given in Table 3. Moreover, the high-quality diagnosis results in the cited studies demonstrate the potential of DL methods for WT fault diagnosis.

V. TRANSFER LEARNING (TL)

In the previous methods, the training sets and testing sets for model construction are assumed to follow the same distribution, which often does not hold in practical WT monitoring. Kandaswamy et al. [127] introduced an approach to improve diagnostic performance by transferring knowledge learned in a source domain to the target task. A summary of the applications of TL for WT fault diagnosis is given in Table 4. With TL, it is possible to quickly adapt to changes in the data in a new working condition, so that the diagnosis model can have a higher diagnostic accuracy.
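As a generic illustration of parameter transfer, the sketch below reuses a source-domain network, freezes its feature layers, and retrains only a new classifier head on scarce target-domain data; it is a minimal sketch under stated assumptions, not the method of any specific cited study, and all shapes and data are hypothetical.

```python
# Minimal sketch of parameter-transfer fine-tuning: reuse a network trained on
# a source domain, freeze its feature layers, and retrain only the classifier
# head on (scarce) target-domain data.
import torch
import torch.nn as nn

# Pretend "source_model" was trained on abundant source-domain data.
source_model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # feature extractor (to be frozen)
    nn.Linear(32, 4),               # classifier head (to be replaced)
)

for p in source_model[:4].parameters():
    p.requires_grad = False         # freeze the transferred feature layers
source_model[4] = nn.Linear(32, 3)  # new head for 3 target fault classes

opt = torch.optim.Adam(source_model[4].parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X_tgt = torch.randn(64, 16)         # small target-domain dataset (assumed)
y_tgt = torch.randint(0, 3, (64,))
for _ in range(50):                 # fine-tune only the new head
    opt.zero_grad()
    loss = loss_fn(source_model(X_tgt), y_tgt)
    loss.backward()
    opt.step()
print(loss.item())
```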

In terms of DT, the C4.5 algorithm was used in Ref.