A Transformer Fault Diagnosis Method Based on Parameters Optimization of Hybrid Kernel Extreme Learning Machine

Dissolved gas analysis (DGA) is a widely used method for diagnosing internal transformer defects. Traditional single intelligent diagnostic methods cannot efficiently process the large amounts of incomplete defect information in DGA data, which affects the accuracy of fault diagnosis. To this end, this paper proposes a transformer fault diagnosis method based on the optimization of the kernel parameters and weight parameters of a kernel extreme learning machine (KELM). First, based on Mercer's theorem, we combine the radial basis kernel function with the polynomial kernel function to construct a new hybrid kernel function. Then, the gray wolf optimization (GWO) algorithm and the differential evolution (DE) algorithm are combined to improve the diversity of the gray wolf population, enhance the search ability of GWO, and prevent GWO from falling into a local optimum during the iterative process. Finally, the kernel parameters and weight parameters of the hybrid kernel function are optimized by using the modified gray wolf optimization (MGWO) algorithm. The International Electrotechnical Commission Technical Committee (IEC TC) 10 transformer fault data and a constructed hybrid feature set are used as the input of the model for simulation and analysis, and transformer fault data collected at a site are used for training and verification. The simulation results on the two sets of data show that the method can accurately and effectively diagnose transformer faults and has a higher fault diagnosis accuracy than traditional methods.


I. INTRODUCTION
A power transformer is an important part of a power system. It undertakes the tasks of power conversion and power transmission, and plays a vital role in the power system. Since transformer failures may induce very large economic losses, formulating corresponding measures in advance according to the transformer status, and timely detecting and accurately determining latent transformer faults, are of great significance for extending the life of a transformer and improving the safety, reliability, and economy of a power system.
The associate editor coordinating the review of this manuscript and approving it for publication was Huaqing Li.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Dissolved gas analysis (DGA) is an important tool for detecting early faults of oil-filled transformers. Once an oil-immersed transformer fails, the insulating materials in the transformer, such as the internal solid and liquid materials, may chemically decompose due to electrical and thermal stress, leading to gas release [1]. Therefore, we can detect abnormal states of a transformer by identifying the composition and content of the dissolved gas in the oil, and further determine the fault type, severity and development trend. Traditional fault diagnosis methods mainly include the ratio method [2], [3], key gas method [4], triangle method [5], and pentagon method [6]-[8]. Although they are simple and effective, these methods still have many problems, such as inconsistent diagnostic results and low accuracy, which reduce the reliability of fault analysis.
In recent years, with the continuous development of artificial intelligence (AI) technology, machine learning and pattern recognition methods have been widely used in transformer fault diagnosis. Artificial neural networks [9], [10], support vector machines (SVM) [11], fuzzy logic [12], Bayesian neural networks [13], adaptive network-based fuzzy inference systems [14], deep belief networks [15] and other technologies have been applied to transformer fault diagnosis tasks. These methods compensate for the shortcomings of traditional DGA methods, directly or indirectly improve the accuracy of fault diagnosis, and provide new ideas for transformer fault diagnosis.
The extreme learning machine (ELM) proposed by Guang-Bin Huang et al. in 2004 provides a unified learning platform with a wide range of feature mapping types, and thus can approximate any continuous objective function, making classification of disjoint areas possible. Since they have better scalability, stronger generalization and faster learning speeds than SVMs [16], ELMs have been widely used in fault diagnosis, such as transmission line fault diagnosis [17], rolling bearing fault diagnosis [18]-[20], and analog circuit fault diagnosis [21], [22]. In the field of transformer fault diagnosis, Malik and Mishra [23] first applied principal component analysis to International Electrotechnical Commission Technical Committee (IEC TC) 10 DGA data to find the most relevant variables and then used an ELM to classify early transformer faults. Compared with fuzzy logic and artificial neural networks, their approach has better diagnostic accuracy. Li et al. [24] first used an adaptive evolutionary ELM to optimize the weights and biases of a network and then utilized an arctangent transformation to change the structure of the experimental data, which improved the generalization ability of the algorithm. Huang et al. [25] proposed an ELM power transformer fault diagnosis technology based on multiscale information fusion. Specifically, multiple ELM models with different numbers of nodes are employed to generate initial diagnosis results, and multiscale information fusion is then used to fuse the initial results into a final fault diagnosis result. Although the ELM has advantages, including fast calculation speed, strong generalization ability, and resistance to overfitting, this method selects the optimal number of hidden-layer neurons by trial and error, and thus may not fully reveal the sample information; it also tends to overfit when the number of hidden-layer neurons is too large.
A kernel-based ELM may be an effective way to address the above problem. However, the performance of a kernel-based ELM is sensitive to the kernel parameters and the regularization coefficient. This paper combines the modified gray wolf optimization (MGWO) algorithm with the hybrid kernel ELM (KELM) to construct a transformer fault diagnosis method. In the proposed method, a hybrid kernel function formed as a linear combination of a global kernel function and a local kernel function is proposed to improve the learning ability and generalization ability of the KELM. Moreover, a modified GWO is proposed by combining differential evolution (DE) [26] with gray wolf optimization (GWO) to optimize the parameters of the hybrid kernel; thus, an optimized network structure of the hybrid KELM is achieved, improving the accuracy of transformer fault diagnosis. Finally, the experimental results on the IEC TC 10 DGA data and our own data show that the KELM algorithm classifies the five types of transformer faults more accurately than the SVM algorithm.

II. TRANSFORMER FAULT DIAGNOSIS BASED ON A KERNEL EXTREME LEARNING MACHINE
A. KERNEL EXTREME LEARNING MACHINE
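The ELM output function in the case of a single output node is:
$$f(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x) = h(x)^T \beta \qquad (1)$$
where $\beta_i$ is the output weight between the i-th node of the hidden layer and the output layer and $\beta = [\beta_1, \ldots, \beta_L]^T$ is the output weight vector. $G(a_i, b_i, x)$ is the output of the i-th hidden layer node, and the node parameters $(a_i, b_i)$ are randomly generated. $h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)]^T$ is the output vector of the hidden layer relative to the input $x$. The output vector $h(x)$ is a feature map that maps data from the n-dimensional input space to the L-dimensional hidden layer feature space H. After introducing the kernel function, the kernel matrix of the KELM can be defined as:
$$\Omega_{ELM} = HH^T, \qquad \Omega_{ELM}(i, j) = h(x_i)^T h(x_j) = K(x_i, x_j) \qquad (2)$$
where $H = [h(x_1), \ldots, h(x_N)]^T$ is the hidden layer output matrix of the $N$ training samples. The output function of the ELM classifier can be further written as:
$$f(x) = h(x)^T H^T \left(\frac{I}{\lambda} + HH^T\right)^{-1} T = [K(x, x_1), \ldots, K(x, x_N)] \left(\frac{I}{\lambda} + \Omega_{ELM}\right)^{-1} T \qquad (3)$$
where $I$ is the identity matrix, $\lambda$ is the regularization coefficient, and $T$ is the training set label. With this formulation, we do not need to know the specific form of the feature map $h(x)$; the kernel function alone is used for the output calculation. Thus, the random generation of weights and biases is avoided, and there is no need to set the number of hidden layer neurons $L$.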
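As a rough illustration of how a KELM is trained and queried, the sketch below solves the regularized kernel system once and then scores new samples with kernel evaluations only. It assumes the common form in which the identity matrix is divided by the regularization coefficient, one-hot class labels as the target matrix, and a plain Gaussian kernel; the names `rbf_kernel`, `KELM`, `X_demo` and `y_demo` are hypothetical and the data are synthetic.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix with entries exp(-gamma * ||A[i] - B[j]||^2)."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

class KELM:
    """Minimal kernel ELM classifier (illustrative sketch, not the authors' code)."""

    def __init__(self, kernel, lam):
        self.kernel = kernel  # callable returning a kernel matrix
        self.lam = lam        # regularization coefficient (assumed to enter as I / lam)

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.X_train = X
        self.classes = np.unique(y)
        # One-hot target matrix T, one column per class.
        T = (y[:, None] == self.classes[None, :]).astype(float)
        omega = self.kernel(X, X)  # kernel matrix of the training set
        # Solve (I / lam + Omega) alpha = T instead of forming an explicit inverse.
        self.alpha = np.linalg.solve(np.eye(len(X)) / self.lam + omega, T)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        scores = self.kernel(X, self.X_train) @ self.alpha
        return self.classes[scores.argmax(axis=1)]

# Synthetic two-class data, used only to exercise the sketch.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
y_demo = np.array([0] * 20 + [1] * 20)
model = KELM(rbf_kernel, lam=10.0).fit(X_demo, y_demo)
print(model.predict(np.array([[0.0, 0.0], [5.0, 5.0]])))
```

Note that no hidden-layer size appears anywhere in the sketch: only the kernel and the regularization coefficient have to be chosen.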

B. HYBRID KERNEL FUNCTIONS
Traditional kernel functions include global kernel functions and local kernel functions. A global kernel function emphasizes the influence of the data as a whole on the kernel function; hence, its generalization performance is strong, but its learning ability is weak. A local kernel function emphasizes the influence of the data near the key point on the kernel function; hence, it has a strong learning ability, but its generalization performance is weak. According to Mercer's theorem, a nonnegative linear combination of Mercer kernels is still a Mercer kernel [27]. To obtain both a strong learning ability and a strong generalization ability, two different types of kernel functions are merged by means of variable weights to construct a hybrid kernel function.
The Gaussian kernel function $K(x, y) = \exp(-\gamma \|x - y\|^2)$ can map the input space to an infinite-dimensional feature space. The function structure is simple, the convergence speed is fast, and the learning ability is strong. Therefore, the Gaussian kernel function is selected as the local kernel function [28]. In the function, $\gamma$ is the width control parameter, which controls the radial range of the function. When the input sample changes over a large range, the polynomial kernel function $K(x, y) = ((x \cdot y) + \eta)^d$ still has a large impact on the sample and has a good generalization performance. Therefore, a polynomial function is used as the global kernel function, and the hybrid kernel function after fusion is:
$$K(x, y) = \omega \left((x \cdot y) + \eta\right)^d + (1 - \omega) \exp\left(-\gamma \|x - y\|^2\right) \qquad (4)$$
where $\omega$ is an adjustable global kernel function weight parameter, which defines the relative contribution of the global kernel to the hybrid kernel function. It assigns different linear weights to the global kernel function and the local kernel function.
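A minimal sketch of the hybrid kernel follows, assuming the local kernel receives the complementary weight 1 - ω and using the polynomial degree d = 3 that the paper fixes later in its search-space experiments. The function name `hybrid_kernel` is hypothetical.

```python
import numpy as np

def hybrid_kernel(A, B, gamma, eta, omega, d=3):
    """Weighted sum of a polynomial (global) and a Gaussian (local) kernel.

    A and B hold one sample per row. omega weights the polynomial kernel;
    the Gaussian kernel is assumed to receive the weight (1 - omega).
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    B = np.atleast_2d(np.asarray(B, dtype=float))
    inner = A @ B.T                                   # pairwise inner products
    poly = (inner + eta) ** d                         # polynomial kernel
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * inner
    rbf = np.exp(-gamma * np.maximum(sq, 0.0))        # Gaussian kernel
    return omega * poly + (1.0 - omega) * rbf

# Setting omega = 0 recovers the pure Gaussian kernel, omega = 1 the pure polynomial kernel.
K_demo = hybrid_kernel([[1.0, 0.0]], [[1.0, 0.0]], gamma=1.0, eta=1.0, omega=0.25)
print(K_demo)
```

Keeping both weights nonnegative is what lets the combination remain a valid Mercer kernel.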

C. FAULT DIAGNOSIS MODEL
Transformer fault diagnosis is essentially a multiclass classification problem. This paper uses a hybrid kernel function ELM as a classifier to extract the internal features of transformer fault data. The constructed fault diagnosis model mainly includes the following steps:
(1) Sample collection: Transformer DGA data containing various fault types are collected to form a fault sample set.
(2) Feature selection: The hybrid feature set is used as the input of the fault diagnosis model.
(3) Normalization processing: To eliminate the differences in scale among the features, the feature data are normalized with formula (5) so that the normalized sample values lie in the range [0, 1], which also increases the calculation speed of the model:
$$x^* = \frac{x_i - x_{i\min}}{x_{i\max} - x_{i\min}} \qquad (5)$$
where $x^*$ is the value after normalization, $x_i$ is the value before normalization, and $x_{i\max}$ and $x_{i\min}$ are the maximum and minimum values of the data before normalization, respectively.
(4) Sample division: Five-fold cross validation is used to randomly divide the samples into a training set and a test set. Cross validation can be used to effectively evaluate the performance of the trained model and improve the stability and generalization ability of the model.
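The normalization and sample-division steps can be sketched as follows. This is an illustration under stated assumptions: the minimum and maximum are taken per feature over the whole sample set, constant features are mapped to 0 to avoid division by zero, and the folds come from a seeded random permutation. The function names `min_max_normalize` and `five_fold_split` are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) to [0, 1] using its own minimum and maximum."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Guard against constant features, which would otherwise divide by zero.
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

def five_fold_split(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for random k-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for k in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, folds[k]

# Example: the 117 IEC TC 10 samples would give test folds of 23 or 24 samples each.
for train_idx, test_idx in five_fold_split(117):
    print(len(train_idx), len(test_idx))
```

In a stricter protocol the minimum and maximum would be computed on each training fold only; the text does not say which variant was used.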

III. PARAMETER OPTIMIZATION BASED ON THE MODIFIED GRAY WOLF ALGORITHM
A. MODIFIED GRAY WOLF OPTIMIZATION ALGORITHM
GWO is a swarm intelligence algorithm based on gray wolves' social hierarchy and hunting behaviors [29]. In traditional GWO, every subsequent wolf pack is derived from the initial wolf pack, which means that the accuracy of the GWO algorithm depends largely on the initial wolf pack. If the randomly generated initial wolf pack is poorly placed, the algorithm will converge prematurely, and the accuracy will be reduced. In the later stages of the iteration, when each individual is close to the prey, the population loses diversity and the algorithm can easily stagnate. The DE algorithm is therefore introduced to improve GWO: its mutation, crossover, and selection operations force GWO to jump out of local optima.
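In the GWO algorithm, each gray wolf represents a candidate solution in the population. The optimal solution is defined as alpha (α), the second-best solution is defined as beta (β), the third-best solution is delta (δ), and the remaining candidate solutions are assumed to be omega (ω) wolves. In GWO, the optimization process is guided by α, β, and δ. The location update process is shown in formula (6)-formula (13):
$$D = |C \cdot X_p(t) - X(t)| \qquad (6)$$
$$X(t+1) = X_p(t) - A \cdot D \qquad (7)$$
$$A = 2a \cdot r_1 - a \qquad (8)$$
$$C = 2 r_2 \qquad (9)$$
$$a = 2 - \frac{2t}{t_{\max}} \qquad (10)$$
$$D_\alpha = |C_1 \cdot X_\alpha - X(t)|, \quad D_\beta = |C_2 \cdot X_\beta - X(t)|, \quad D_\delta = |C_3 \cdot X_\delta - X(t)| \qquad (11)$$
$$X_1 = X_\alpha - A_1 \cdot D_\alpha, \quad X_2 = X_\beta - A_2 \cdot D_\beta, \quad X_3 = X_\delta - A_3 \cdot D_\delta \qquad (12)$$
$$X(t+1) = \frac{X_1 + X_2 + X_3}{3} \qquad (13)$$
where $t$ represents the current iteration number; $t_{\max}$ is the maximum number of iterations; $A$ and $C$ are coefficient vectors; $r_1$ and $r_2$ are random vectors with components in [0, 1]; $X_p$ is the position of the prey and $X$ is the position of a gray wolf; and $X_\alpha$, $X_\beta$ and $X_\delta$ represent the current positions of α, β and δ, respectively.
After each position update, the gray wolf population generates a mutated individual $M_{i,t}$ with formula (14):
$$M_{i,t} = X_{best,t} + F \cdot \left(X_{r_1^i,t} - X_{r_2^i,t}\right) \qquad (14)$$
where the parameters $r_1^i$ and $r_2^i$ are mutually distinct integers randomly generated within the range [1, NP]; $F$ is the scale factor; and $X_{best,t}$ is the wolf pack individual with the best fitness value in the t-th generation group.
After the mutation stage is over, the test individual $U_{i,t} = (u_{i,t}^1, \ldots, u_{i,t}^D)$ is obtained through the crossover operation, which is shown in formula (15):
$$u_{i,t}^j = \begin{cases} m_{i,t}^j, & rand_j \le CR \ \text{or} \ j = j_{rand} \\ x_{i,t}^j, & \text{otherwise} \end{cases} \qquad (15)$$
where $CR$ is the crossover rate; $rand_j$ is a random number uniformly distributed in [0, 1]; and $j_{rand}$ is a randomly generated integer that ensures that the test individual differs from $X_{i,t}$ in at least one parameter.
Finally, the individuals who enter the next generation are selected through the greedy algorithm. The selection operation is shown in formula (16):
$$X_{i,t+1} = \begin{cases} U_{i,t}, & f(U_{i,t}) \le f(X_{i,t}) \\ X_{i,t}, & \text{otherwise} \end{cases} \qquad (16)$$
where $f$ is the fitness function, written here for a minimization problem. The pseudocode of the modified gray wolf algorithm is shown in Table 1.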
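To make the interplay between the GWO position update and the DE operators concrete, the sketch below runs one MGWO generation at a time on a simple minimization problem. It is an illustration under assumptions, not the authors' implementation: the control parameter is assumed to decrease linearly from 2 to 0, mutation uses the best individual as the base vector, positions are clipped to the search bounds, and the names `mgwo_generation`, `minimize` and `sphere` are hypothetical.

```python
import numpy as np

def mgwo_generation(X, f, t, t_max, F, CR, lower, upper, rng):
    """One generation: GWO position update, then DE mutation, crossover and greedy selection."""
    NP, D = X.shape
    fit = np.array([f(x) for x in X])
    leaders = X[np.argsort(fit)[:3]].copy()      # alpha, beta and delta wolves
    a = 2.0 - 2.0 * t / t_max                    # assumed linear decrease from 2 to 0
    new_X = np.empty_like(X)
    for i in range(NP):
        moves = []
        for leader in leaders:
            A = 2.0 * a * rng.random(D) - a      # coefficient vector A
            C = 2.0 * rng.random(D)              # coefficient vector C
            dist = np.abs(C * leader - X[i])     # distance to this leader
            moves.append(leader - A * dist)      # candidate position guided by this leader
        new_X[i] = np.clip(np.mean(moves, axis=0), lower, upper)
    new_fit = np.array([f(x) for x in new_X])

    base = new_X.copy()                          # frozen copy used to build mutants
    best = base[np.argmin(new_fit)]
    for i in range(NP):
        r1, r2 = rng.choice(NP, size=2, replace=False)   # two distinct random indices
        mutant = best + F * (base[r1] - base[r2])        # mutation around the best wolf
        mask = rng.random(D) <= CR
        mask[rng.integers(D)] = True             # j_rand: at least one gene from the mutant
        trial = np.clip(np.where(mask, mutant, base[i]), lower, upper)
        trial_fit = f(trial)
        if trial_fit <= new_fit[i]:              # greedy selection (minimization)
            new_X[i], new_fit[i] = trial, trial_fit
    return new_X, new_fit

def minimize(f, dim, lower, upper, NP=20, t_max=100, F=0.5, CR=0.5, seed=0):
    """Run MGWO generations from a uniform random initial pack and return the best wolf."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lower, upper, size=(NP, dim))
    for t in range(t_max):
        X, fit = mgwo_generation(X, f, t, t_max, F, CR, lower, upper, rng)
    best = int(np.argmin(fit))
    return X[best], float(fit[best])

def sphere(x):
    """Simple unimodal test function with its minimum of 0 at the origin."""
    return float(np.sum(x ** 2))

best_x, best_f = minimize(sphere, dim=5, lower=-10.0, upper=10.0)
print(best_f)
```

The DE stage only ever replaces a wolf with a trial that is at least as good, so it adds diversity without undoing the progress made by the position update.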

B. MGWO EFFICIENCY TEST
To verify the optimization performance of MGWO, 13 benchmark functions [30], [31] are used to test the algorithm. The MGWO algorithm is compared with the particle swarm optimization (PSO) and traditional GWO algorithms. The test functions are shown in Table 2, in which the function dimension, variable search range and function optimal value are abbreviated as Dim, Range and f min, respectively. The Dim, the population size, and the maximum number of iterations of the 13 benchmark functions are set to 30, 30, and 1000, respectively. To avoid random errors, each test function was run independently 30 times, and its average results and standard deviations were recorded. The test results are shown in Table 3.
As seen from the table, MGWO is significantly better than PSO for 10 out of the 13 benchmark functions. Moreover, the results of MGWO are significantly better than those of traditional GWO on all the selected functions. In short, the modified GWO algorithm has a more powerful search performance and can avoid local optima very well.

C. PARAMETER OPTIMIZATION BASED ON MGWO
The parameter setting has a great influence on the classification accuracy of the KELM. As the parameters change, the accuracy of the fault diagnosis model also changes. The parameters of the hybrid KELM are optimized by using the modified GWO algorithm. The specific optimization steps are as follows:
(1) Set the initial parameters, including the number of gray wolf individuals NP, the maximum number of iterations t max, the parameter dimension D, the scale factor F, and the crossover rate CR.
(2) Set the value range of each parameter λ, γ, η, ω of the hybrid KELM. Initialize the gray wolf population so that each gray wolf individual corresponds to a set of parameters.
(3) Calculate and rank the fitness of the gray wolf population. The top three individuals are the α wolf, β wolf and δ wolf.
(4) Initialize random numbers r 1 and r 2, and use formulas (8), (9) and (10) to calculate A, C, and a, respectively. Formulas (11), (12), and (13) are used to update the gray wolf population positions.
(5) Perform the mutation operation according to formula (14) to generate mutant individuals, and then perform crossover operations between the mutant individuals and the gray wolf population according to formula (15) to obtain test individuals.
(6) According to formula (16), judge whether each test individual is retained.
(7) Calculate the fitness of the updated population and sort the population. If the termination condition is met, the optimal individual and its fitness are output. Otherwise, return to step (4).
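Steps (1) to (3) can be sketched as below: each wolf is a four-dimensional vector that decodes to one parameter set of the hybrid KELM, and the pack is ranked by fitness to pick the three leaders. Several things here are assumptions made for illustration only: the bounds in `BOUNDS` are placeholders, λ, γ and η are encoded as base-2 exponents, and `toy_fitness` merely stands in for the cross-validated accuracy of the classifier.

```python
import numpy as np

# Hypothetical search bounds: base-2 exponents for lambda, gamma, eta; omega used directly.
BOUNDS = [(-5.0, 5.0), (-5.0, 5.0), (-5.0, 5.0), (0.0, 1.0)]

def init_population(NP, bounds, rng):
    """Draw NP wolves uniformly inside the given per-dimension bounds."""
    low = np.array([b[0] for b in bounds])
    high = np.array([b[1] for b in bounds])
    return low + rng.random((NP, len(bounds))) * (high - low)

def decode(wolf):
    """Map a wolf position to the hybrid KELM parameters (assumed log2 encoding)."""
    e_lam, e_gamma, e_eta, omega = wolf
    return {"lam": float(2.0 ** e_lam), "gamma": float(2.0 ** e_gamma),
            "eta": float(2.0 ** e_eta), "omega": float(omega)}

def rank_leaders(population, fitness):
    """Return the three best wolves and their fitness, where higher fitness is better."""
    scores = np.array([fitness(decode(w)) for w in population])
    order = np.argsort(-scores)
    return population[order[:3]], scores[order[:3]]

def toy_fitness(params):
    """Placeholder for cross-validated accuracy; peaks at lam = 2, gamma = 0.5, omega = 0.3."""
    return -(np.log2(params["lam"]) - 1.0) ** 2 \
           - (np.log2(params["gamma"]) + 1.0) ** 2 \
           - (params["omega"] - 0.3) ** 2

rng = np.random.default_rng(1)
pack = init_population(20, BOUNDS, rng)
leaders, leader_scores = rank_leaders(pack, toy_fitness)
print(decode(leaders[0]), leader_scores)
```

Searching over exponents rather than raw values is one common way to cover ranges that span several orders of magnitude; the paper does not state which encoding it uses.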

IV. SIMULATION EXPERIMENT ANALYSIS
A. TRANSFORMER FAULT TYPES AND SIMULATION EXPERIMENT DATA
According to IEC 60599, transformer fault types are divided into five types: partial discharge (PD), discharges of low energy (D1), discharges of high energy (D2), thermal faults of low and medium temperatures (T1 and T2), and thermal faults of high temperatures (T3). The IEC TC 10 transformer fault data are used to train and test the model, and transformer fault data collected in China are used to test the generalization performance of the diagnostic model. A total of 117 sets of IEC TC 10 transformer fault data and 370 sets of domestic transformer fault data were collected. The distribution of the fault samples is shown in Table 4.

B. FEATURE SET SELECTION
Feature selection is the key to classification models. For transformer fault diagnostic models, different diagnosis methods use different feature combinations. AI methods often use DGA gas content as the input of the diagnostic model, whereas traditional DGA diagnostic methods often use the dissolved gas ratio as a feature combination. To obtain the core attributes contained in the sample data, based on the feature combination corresponding to the above methods, this paper proposes a hybrid feature set. The feature combinations corresponding to different diagnostic methods are shown in Table 5.
The five traditional feature sets in Table 5 are used to perform fault diagnosis on the IEC TC 10 transformer fault data, and they are compared with the proposed hybrid feature set. The SVM method is used to carry out the simulation test, and the average cross-validation accuracy of the fault diagnosis task is shown in Table 6. Table 6 shows that the diagnostic accuracy is the highest when the hybrid feature set is used as the input for fault diagnosis.

C. SEARCH SPACE
The generalization performance of the KELM is closely related to the parameters of the kernel function. To obtain good generalization performance and improve the convergence speed, it is necessary to select an appropriate parameter search space. Performance tests are performed on the radial basis kernel function parameters (λ, γ) and the polynomial kernel function parameters (λ, η). On the same data set, with $\lambda, \gamma, \eta \in \{2^{-24}, 2^{-23}, \ldots, 2^{24}, 2^{25}\}$, each of the pairs (λ, γ) and (λ, η) has 2500 different combinations, and the polynomial kernel parameter is fixed at d = 3. The simulation results are shown in Fig. 2 to Fig. 5. Fig. 2 is a grid diagram of the KELM classification accuracy using the radial basis kernel function. Fig. 3 is a contour map of the KELM classification accuracy using the radial basis kernel function. Fig. 4 is a grid diagram of the KELM classification accuracy using the polynomial kernel function. Fig. 5 is a contour map of the KELM classification accuracy using the polynomial kernel function.
In Fig. 2 to Fig. 5, the deeper the yellow of a dot is, the higher the fault diagnosis accuracy is; the deeper the purple of a dot is, the lower the fault diagnosis accuracy is; and the red dots are the points with the maximum fault diagnosis accuracy. Fig. 2 to Fig. 5 show that the selection of the parameter combinations (λ, γ) and (λ, η) has a significant impact on the accuracy of fault diagnosis. Only in a very narrow range can the fault diagnosis accuracy reach its maximum. If the parameters are selected improperly, the accuracy decreases sharply. The classification accuracy contours and the maximum points of the classification accuracy are combined to select an appropriate parameter range. From Fig. 3, the radial basis kernel function parameter search range is selected as $\lambda \in [2^{-5}, 2^{5}]$ and $\gamma \in [2^{-5}, 2^{5}]$, and from Fig. 5, the polynomial kernel function parameter search range is selected as $\lambda \in [2^{0}, 2^{10}]$ and $\eta \in [2^{-10}, 2^{0}]$. The resulting parameter search ranges of the two kernel functions are shown in Table 7.
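The coarse grid used in this subsection is easy to enumerate: 50 powers of two per parameter give 50 × 50 = 2500 combinations per kernel. The sketch below builds that grid and scans it with a scoring callback. The names `GRID`, `grid_search` and `toy_score` are hypothetical, and `toy_score` is only a stand-in for the cross-validated KELM accuracy.

```python
import math
from itertools import product

EXPONENTS = range(-24, 26)               # exponents -24, -23, ..., 24, 25
GRID = [2.0 ** e for e in EXPONENTS]     # 50 candidate values per parameter

def grid_search(evaluate):
    """Scan every (lam, kernel_param) pair and return the best pair and its score."""
    best_pair, best_score = None, float("-inf")
    for lam, kernel_param in product(GRID, GRID):
        score = evaluate(lam, kernel_param)
        if score > best_score:
            best_pair, best_score = (lam, kernel_param), score
    return best_pair, best_score

def toy_score(lam, kernel_param):
    """Placeholder objective with a single peak at lam = 2^2 and kernel_param = 2^-1."""
    return -abs(math.log2(lam) - 2.0) - abs(math.log2(kernel_param) + 1.0)

print(len(GRID), len(GRID) ** 2)
print(grid_search(toy_score))
```

A scan of this kind is affordable for two parameters but grows quickly with dimension, which is one reason to narrow the ranges first and hand the four-parameter search to MGWO.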

D. SIMULATION ANALYSIS
Based on the IEC TC 10 transformer fault data in Table 4, the hybrid feature set is used as input to test the MGWO-KELM model. The 117 groups of fault data are divided into five groups for five-fold cross validation, and the average accuracy of the five test groups is used as the fitness value. The kernel function parameters and weight search range during training are shown in Table 7. The relevant initialization parameters of the modified gray wolf algorithm are set as follows: the population size is 20, the maximum number of iterations is 100, the variable dimension is 4, the scaling factor F = 0.5, and the crossover rate CR = 0.5. The fitness curve of the MGWO-KELM model is shown in Fig. 6.
Based on the same combination of DGA features, the PSO-KELM, MGWO-SVM, and PSO-SVM models are used to diagnose transformer faults. The fitness curve of the PSO-KELM algorithm is shown in Fig. 7, the fitness curve of the MGWO-SVM algorithm is shown in Fig. 8, and the fitness curve of the PSO-SVM algorithm is shown in Fig. 9. As shown in Fig. 8 and Fig. 9, during the iterative process, the average fitness of the SVM algorithm fluctuates frequently, indicating that the SVM is too sensitive to its parameters and that subtle parameter fluctuations seriously affect the classification performance of the model. This sensitivity increases the difficulty of optimization. Fig. 7 and Fig. 9 show that the PSO algorithm has a poor search ability and is prone to falling into a local optimum. In addition, the PSO algorithm stalled early in the iteration and could not optimize the model parameters. Comparing Fig. 6 with Figs. 7-9 shows that the hybrid kernel ELM has a strong learning ability and generalization performance, leading to a higher classification accuracy during the training process. The modified GWO algorithm has a strong search ability and a faster convergence speed, and it needs only a few iterations to achieve the best network structure.
Fig. 10 to Fig. 13 show the classification results of the trained models on the IEC TC 10 transformer fault data. Fig. 10 shows the fault classification result of the MGWO-KELM model proposed in this paper, Fig. 11 shows the fault classification result of the PSO-KELM model, Fig. 12 shows the fault classification result of the MGWO-SVM model, and Fig. 13 shows the fault classification result of the PSO-SVM model. The diagnostic accuracies of the four models are listed in Table 8. Fig. 10 to Fig. 13 and Table 8 show that the accuracy of the MGWO-KELM algorithm proposed in this article is 88.89%, which is higher than the 83.76% obtained by the PSO-KELM, the 81.2% obtained by the MGWO-SVM, and the 78.63% obtained by the PSO-SVM. This shows that the hybrid KELM optimized by the modified gray wolf algorithm has a better fault diagnosis performance.
The proposed algorithm is further compared with the methods in the literature. In [11], the authors established an improved krill herd (IKH) algorithm to optimize an SVM transformer fault diagnosis model. Based on the same IEC TC 10 data set, the average testing accuracy of the IKHSVM in [11] reaches 85.71%. Compared with the results in [11], the test accuracy of the MGWO-KELM is higher, which verifies the validity of the model.
A paired t-test is used to determine whether there are significant differences between the MGWO-KELM algorithm and the other three algorithms. First, the error rates of the different algorithms on the five-fold cross-validation test sets are calculated. The t-test is then carried out on Difference i, the difference between the error rates of two algorithms on the i-th fold. The mean µ, variance σ 2 and t-statistic τ t of the differences between the different algorithms are calculated. The results are shown in Table 9. When α = 0.05 and k = 5, the critical value t α/2,k−1 = 2.776. The t-statistic values τ t of the three paired t-tests are all greater than 2.776. This shows that the MGWO-KELM algorithm is significantly better than the compared algorithms.
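The paired t-test on fold-wise error rates can be sketched in a few lines. The per-fold error rates below are made up purely for illustration and are not taken from Table 9; the statistic is assumed to be the mean difference divided by its standard error, with the sample variance computed over k − 1 degrees of freedom. The names `paired_t_statistic`, `errors_proposed` and `errors_baseline` are hypothetical.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(errors_a, errors_b):
    """t-statistic of the fold-wise differences errors_b[i] - errors_a[i]."""
    diffs = [b - a for a, b in zip(errors_a, errors_b)]
    k = len(diffs)
    mu = mean(diffs)          # mean difference over the k folds
    sigma = stdev(diffs)      # sample standard deviation (k - 1 in the denominator)
    return mu / (sigma / math.sqrt(k))

# Hypothetical error rates on five folds (illustrative values only).
errors_proposed = [0.10, 0.12, 0.11, 0.10, 0.13]
errors_baseline = [0.15, 0.18, 0.15, 0.15, 0.18]

CRITICAL_VALUE = 2.776        # two-sided critical value for alpha = 0.05 with 4 degrees of freedom
t_stat = paired_t_statistic(errors_proposed, errors_baseline)
print(t_stat, t_stat > CRITICAL_VALUE)
```

Pairing by fold removes the variation that both algorithms share because they are tested on the same split, which is why the test works on differences rather than on the two error-rate samples separately.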
The MGWO-KELM model is then applied to the 370 sets of collected domestic transformer fault data. The diagnosis results and accuracies for the different fault types are shown in Table 10. For T3 faults, 147 groups were correctly diagnosed and 8 groups were misdiagnosed: 2 as PD, 3 as D2, and 3 as T1 and T2. The accuracy for T3 faults is 94.8%, the highest among all fault types. For D1 faults, 29 groups were correctly diagnosed, and the accuracy was 61.7%.
The MGWO-SVM algorithm is used to diagnose the same 370 sets of transformer fault data, and a comparison of the simulation results is shown in Fig. 14. Compared with the SVM, the hybrid KELM algorithm has a significant improvement in the diagnostic accuracy of the five types of faults. The diagnostic accuracy of the MGWO-KELM model for the 370 sets of transformer fault data is 87.3%, which is similar to the results on the IEC TC 10 fault data; this further verifies the reliability and validity of the MGWO-KELM model.

V. CONCLUSION
In this paper, a transformer fault diagnosis model is established based on the KELM algorithm. The model is optimized in two respects, the hybrid kernel function and the MGWO algorithm, which further improves its performance. By comparison with other algorithms, the conclusions are as follows:
(1) The fault diagnosis model based on the hybrid KELM algorithm can accurately and effectively identify the type of transformer fault. Compared with the traditional SVM, it has a higher classification accuracy.
(2) The traditional GWO algorithm tends to fall into local minima or converge prematurely during the optimization process. The DE algorithm is used to improve the GWO algorithm and enhance the search performance of the gray wolf algorithm. Compared with the conventional PSO algorithm, the MGWO algorithm has a stronger search ability and a faster convergence speed.
(3) There are still some shortcomings in this research. For example, when the number of samples is small, the proposed model has difficulty distinguishing D1 and D2 faults, resulting in misclassification. In fact, transformer faults are also related to other factors, such as voltage levels, insulating oil types [32], oil temperatures, loads, and operating years [33]. In future work, the relationship between multisource fault information and transformer faults will be comprehensively considered to further enhance the accuracy and reliability of transformer fault diagnosis.