Power Transformer Fault Diagnosis Based on DGA Using a Convolutional Neural Network With Noise in Measurements

Fault type diagnosis is a very important tool to maintain the continuity of power transformer operation. Dissolved gas analysis (DGA) is one of the most effective and widely used techniques for predicting power transformer fault types. In this paper, a convolutional neural network (CNN) model is proposed based on the DGA approach to accurately predict transformer fault types under different noise levels in measurements. The proposed model is applied with three categories of input ratios: conventional ratios (Rogers' 4 ratios, IEC 60599 ratios, and Duval triangle ratios), new ratios (five gas percentage ratios and a new form of six ratios), and hybrid ratios (conventional and new ratios together). The proposed model is trained and tested on 589 dataset samples collected from electrical utilities and the literature, with noise levels varying up to ±20%. The results indicate that the CNN model with hybrid input ratios has superior prediction accuracy. The high accuracy of the proposed model is validated in comparison with conventional and recently published AI approaches. The proposed model is implemented in MATLAB R2020b.

Power transformers are among the most vital equipment in the electric power system. Early detection of transformer faults avoids discontinuity of the power network and reduces the loss of profits for electric utilities. Various faults in power transformers are generated due to the deterioration of their insulation system, which consists of insulating oil and impregnated paper. The insulation deterioration results from exposure of the transformer to several stresses, such as electrical, mechanical, and thermal stresses. These stresses lead to the formation of dissolved gases, some of which are combustible, such as hydrogen (H2), methane (CH4), ethane (C2H6), ethylene (C2H4), acetylene (C2H2), and carbon monoxide (CO), while others are incombustible, such as carbon dioxide (CO2) [1]-[3]. These dissolved gases help determine possible failures inside the transformer using dissolved gas analysis (DGA).
The associate editor coordinating the review of this manuscript and approving it for publication was Yu Wang.

According to the values of the dissolved gas concentrations and their ratios, theories and rules were established that link the proportions of these gases to the expected failure [4], [5]. In [4], the Key Gas method, the Dornenburg method, and Rogers' method were presented as three DGA techniques used to interpret the transformer fault based on the ratio limits between the combustible gases or their percentages of the total. In [5], transformer faults were divided into five types based on a dataset of transformers in service (the IEC TC 10 database), classified in a triangular form called the Duval triangle. These faults include the following types: (i) partial discharge (PD), producing small carbonized punctures in the paper; (ii) low energy discharge (D1), causing larger punctures in the paper and carbon particles in the oil; (iii) high energy discharge (D2), characterized by extensive carbonization and metal fusion; (iv) low and medium thermal faults (T1/T2), with oil temperature less than 300 °C for T1 and between 300 °C and 700 °C for T2; and (v) high thermal fault (T3), with oil temperature greater than 700 °C. These conventional methods can fail to interpret the transformer faults due to issues such as gas ratio combinations falling outside the predefined codes or, in the Duval triangle, dependence on only three combustible gases. Therefore, the diagnostic accuracy of such conventional methods in some cases is very poor.
New pentagon-based graphical representations [6], [7] enhanced the diagnostic accuracy of the conventional methods by considering the main five combustible gases in their diagnosis. These graphical representations achieve higher diagnostic accuracy than the Duval triangle method [8].
In [9], a heptagon-shaped graphical representation was developed that determines the transformer faults based on the main five gases together with CO and CO2.
Finally, deep learning approaches were implemented in [20]-[23] to predict the transformer fault types. In [20], a depth learning and Softmax classifier model was presented to predict the transformer fault types. The model presented in [21] was built on a deep belief neural network (DBNN) to diagnose the transformer fault types. In [22], a long short-term memory network combined with a DBNN (LSTM-DBNN) was implemented to detect the transformer fault types, whereas a deep parallel diagnostic (DPD) model was introduced in [23].
Noise in gas concentration measurements is one of the most critical issues that reduce the diagnostic accuracy of DGA methods. Accordingly, this noise should be considered when evaluating a diagnostic method. Noise can originate during oil sampling, sample storage, or gas separation and measurement. The noise due to sampling and storage can reach about 14%, while measurement noise lies in the range of 5% [24]. Most previous DGA methods did not deal effectively with noise in DGA measurements; addressing this gap is the main aim of this paper.
In this paper, we develop a noise-resistant DGA method using a convolutional neural network (CNN). Our main contributions are:
• augmenting the CNN training dataset with noisy points to improve the DGA diagnosis accuracy, and
• solving the DGA problem using different combinations of gas ratios and identifying the ratios that achieve the best diagnosis accuracy.

II. CONVOLUTIONAL NEURAL NETWORKS
In contrast to general neural networks, a convolutional neural network (CNN) contains one or more convolution layers that work as filters [25]. The filter function is applied to each neighborhood of nodes of the previous layer, producing a corresponding set of outputs each time. Fig. 1 illustrates the convolution process in one dimension. The grey nodes represent zero-valued padding at the edges of the input layer to simplify filter processing at the boundaries. The kernel is applied to each subset of neighboring nodes, called the input window, by sliding the convolution kernel along the input. Two important configuration hyperparameters define a convolution layer, namely the kernel size and the stride. The kernel size indicates the number of inputs processed by the convolution kernel in one application, while the stride is the number of nodes by which the filter is displaced after each application. Multiple convolution filters (kernels) can be applied in a single convolution layer to increase the learning capacity and extract more features.
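The interaction of kernel size, stride, and zero padding can be sketched in a few lines. The example below is plain Python for illustration only (the paper's implementation is in MATLAB), and `conv1d` is a hypothetical helper, not code from the paper:

```python
def conv1d(inputs, kernel, stride=1, pad=1):
    """Slide `kernel` over `inputs`, moving `stride` nodes per application,
    with `pad` zero-valued nodes (the grey nodes in Fig. 1) on each edge."""
    x = [0.0] * pad + list(inputs) + [0.0] * pad
    k = len(kernel)                  # kernel size: inputs read per application
    out, pos = [], 0
    while pos + k <= len(x):
        window = x[pos:pos + k]      # current input window
        out.append(sum(w * v for w, v in zip(kernel, window)))
        pos += stride                # displace the filter by the stride
    return out

# Kernel size 3 with stride 1 and one zero-pad per edge preserves length:
print(conv1d([1, 2, 3, 4], [0.5, 1.0, 0.5]))  # [2.0, 4.0, 6.0, 5.5]
```

With stride 2, the same kernel visits every other position and the output length is halved, which is the same subsampling mechanism that pooling layers exploit.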
Other types of layers are commonly used within a CNN [25]. Pooling layers are special convolution layers whose main purpose is reducing dimensionality, or subsampling. A max-pooling layer outputs the maximum input within its input window; similarly, an average-pooling layer outputs the average of its input window. Fully connected layers are important for mapping learned features into more comprehensive functions. Threshold layers such as rectified linear unit (ReLU) layers are often utilized to improve the nonlinear capacity of the model. Batch normalization layers are often applied to the output of convolution layers before the application of nonlinearities. Normalization improves the learning speed and dampens the effect of the random initial network weights.
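Max- and average-pooling follow the same sliding-window pattern; the plain-Python helper below is illustrative only, not the paper's MATLAB code:

```python
def pool1d(inputs, window, stride, mode="max"):
    """Subsample `inputs`: output the max (or average) of each input window."""
    out, pos = [], 0
    while pos + window <= len(inputs):
        w = inputs[pos:pos + window]
        out.append(max(w) if mode == "max" else sum(w) / window)
        pos += stride
    return out

x = [1, 3, 2, 5, 4, 6]
print(pool1d(x, window=2, stride=2))               # [3, 5, 6]
print(pool1d(x, window=2, stride=2, mode="avg"))   # [2.0, 3.5, 5.0]
```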
CNNs are expected to be suitable for DGA because of their high noise resilience. Although CNNs are considered complex and expensive to train, this is not a concern because training is performed offline, and the application of the resulting trained model is sufficiently efficient. Therefore, we focus on obtaining a trained model that achieves the highest possible classification accuracy.

III. PROPOSED METHODOLOGY
A. PROPOSED CONVOLUTIONAL NEURAL NETWORK ARCHITECTURE
The proposed CNN architecture is illustrated in Fig. 2. The input points of the CNN are treated as an image of dimension 9 × 1, suitable for predicting transformer fault types based on DGA. Then, two convolution stages are performed. Each stage starts with a convolution layer followed by a batch normalization layer and a ReLU threshold layer. A max-pooling layer is inserted between the two convolution stages. A final classification stage includes a fully connected layer followed by a Softmax layer and a classification layer. The proposed CNN was implemented and simulated using the MATLAB R2020b Deep Learning Toolbox [26].
The configuration parameters for each of the layers are summarized in Table 1. These parameters were fine-tuned through extensive trial-and-error simulations to maximize the CNN prediction accuracy. During training, the CNN weights are optimized using a stochastic gradient-based optimization algorithm known as Adaptive Moment Estimation (Adam) [26], [27], which builds on the idea of adding a momentum term to the weight update formula to reduce oscillation along the steepest descent. Adam also uses different learning rates for different weight vector elements based on a moving average of the first moment of the gradient and a moving average of its square. Finally, gradient clipping is employed to stabilize the training in the presence of gradient outliers. The algorithm is listed as Algorithm 1, and its hyperparameter settings are shown in Table 1.
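The per-weight Adam update described above can be sketched as follows. This is a generic textbook formulation in plain Python (the paper relies on the toolbox implementation), and the default hyperparameter values are common choices, not necessarily those of Table 1:

```python
import math

def adam_step(w, g, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, clip=None):
    """One Adam update for weight w with gradient g at step t (1-based).
    m and v are moving averages of the gradient and of its square."""
    if clip is not None:                       # gradient clipping for outliers
        g = max(-clip, min(clip, g))
    m = beta1 * m + (1 - beta1) * g            # first-moment moving average
    v = beta2 * v + (1 - beta2) * g * g        # second-moment moving average
    m_hat = m / (1 - beta1 ** t)               # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    return w - alpha * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimizing f(w) = w^2 (gradient 2w) starting from w = 5.0:
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.1)
print(round(w, 2))
```

The division by the square root of the second-moment estimate is what gives each weight vector element its own effective learning rate.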

Algorithm 1 Adaptive Moment Estimation Optimization Algorithm
Inputs: α (step size); β1, β2 ∈ [0, 1) (exponential decay rates)

B. DATASET

The combined dataset is made available as part of DGALab's public code repository [42], and a summary of the dataset sources and fault type distribution is given in Table 10 in Appendix A. The complete dataset is divided into two subsets. The first subset is used for training and represents 65% of the data (383 samples), randomly selected from the complete dataset. The second subset is used for testing and contains the remaining 35% (206 samples). Noise is introduced to each sample, R = {r_i}, i = 1, ..., 5, using the following equation adapted from [8]:

r_i,noisy = r_i (1 + m (2n_i − 1) / 100), i = 1, ..., 5    (1)
where m is the maximum noise level (in percent) and N = {n_i}, i = 1, ..., 5, is a 5 × 1 random vector with component values between 0 and 1. The maximum noise level is varied over the set {5, 10, 15, 20} to generate four additional sets of noisy samples corresponding to noise levels up to ±5%, ±10%, ±15%, and ±20%. After augmenting the original dataset with the generated noisy samples, the total number of training samples becomes 383 × 5 = 1915, while the total number of testing samples becomes 206 × 5 = 1030.
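The augmentation step can be sketched as below, assuming the element-wise reading of (1); the helper names are illustrative and not taken from the paper's code:

```python
import random

def add_noise(sample, m, rng=random):
    """Perturb each gas concentration by up to +/- m percent, per Eq. (1):
    r_i * (1 + m * (2*n_i - 1) / 100), with n_i uniform in [0, 1]."""
    return [r * (1 + m * (2 * rng.random() - 1) / 100) for r in sample]

def augment(dataset, levels=(5, 10, 15, 20)):
    """Append one noisy copy of every sample per maximum noise level."""
    out = list(dataset)                        # keep the noise-free originals
    for m in levels:
        out.extend(add_noise(s, m) for s in dataset)
    return out

# 383 training samples * (1 original + 4 noise levels) = 1915 samples:
train = [[100.0, 50.0, 20.0, 30.0, 5.0]] * 383  # H2, CH4, C2H6, C2H4, C2H2 ppm
print(len(augment(train)))                      # 1915
```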

C. PROPOSED INPUT RATIOS
The CNN is trained using the training samples but with different input ratios: (i) the conventional ratios, (ii) the new ratios, and (iii) the hybrid ratios (conventional and new ratios together), while the output of the CNN is the transformer fault type. Table 2 presents the different ratios used for training the CNN model. The transformer fault types diagnosed by the CNN are (i) partial discharge (F1), (ii) low energy discharge (F2), (iii) high energy discharge (F3), (iv) low thermal (F4, oil temperature less than 300 °C), (v) medium thermal (F5, oil temperature between 300 °C and 700 °C), and (vi) high thermal (F6, oil temperature greater than 700 °C). Fig. 3 presents the proposed methodology diagram. First, the five input dataset gases (H2, CH4, C2H6, C2H4, C2H2) in ppm are used to generate the noisy data with different noise levels up to ±20%. Then, the dataset transformation process presented in Table 2 is implemented. The transformation ratios {1} to {10} are selected, and the dataset is randomly divided into training and testing sets. The training dataset is used to train the CNN model. Finally, the training and testing datasets are applied to the generated CNN model to obtain the output diagnosis for both.
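As a concrete example of the transformation step, the 9-variable hybrid input {7} can be assembled as in this plain-Python sketch. The Rogers ratio definitions used here (CH4/H2, C2H2/C2H4, C2H4/C2H6, C2H6/CH4) and the small guard term `eps` are assumptions; the exact forms are those of Table 2:

```python
import math

def hybrid_inputs(h2, ch4, c2h6, c2h4, c2h2, eps=1e-9):
    """Five gas percentage ratios plus ln of the four Rogers ratios,
    forming the 9 x 1 input image fed to the CNN."""
    gases = [h2, ch4, c2h6, c2h4, c2h2]
    total = sum(gases) + eps
    pct = [100.0 * g / total for g in gases]        # percentage ratios
    rogers = [ch4 / (h2 + eps), c2h2 / (c2h4 + eps),
              c2h4 / (c2h6 + eps), c2h6 / (ch4 + eps)]
    return pct + [math.log(r + eps) for r in rogers]

x = hybrid_inputs(100, 50, 20, 30, 5)   # gas concentrations in ppm
print(len(x))                           # 9
```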

IV. RESULTS AND DISCUSSION
The convolutional neural network (CNN) is implemented using MATLAB R2020b. The CNN model is developed based on the training samples, and then its prediction accuracy is evaluated. Like other optimization-based methods, a CNN depends on random initialization, which means that different results are obtained each time the CNN is trained on the same dataset. Therefore, we train the CNN ten times for each of the ratios introduced in Table 2. Finally, the CNN model accuracy for each ratio is evaluated based on the mean value of the ten training results.
The CNN prediction accuracy can be estimated as follows:

Accuracy = (PT + NT) / (PT + NT + PF + NF) × 100%    (2)

where PT and NT are the positive and negative class true rates, respectively, and PF and NF are the positive and negative class false rates, respectively. The CNN loss can be expressed as follows:

Loss = −(1/n) Σ_{i=1}^{n} Σ_{j=1}^{C} k_ij ln(O_ij)    (3)

where n is the number of dataset samples, C is the number of classes, k_ij is the probability that the i-th sample belongs to the j-th class, and O_ij is the output of the Softmax layer for dataset sample i in class j.

Table 3 presents the statistical analysis of the prediction accuracy obtained through ten training attempts of the CNN model when its inputs are the new form ratios {5}. After each attempt, we observed the prediction accuracy for three cases: (i) using the training samples with all noise levels, (ii) using the testing samples with all noise levels, and (iii) using the complete dataset (both training and testing samples) for each maximum noise level (0%, ±5%, ±10%, ±15%, and ±20%) separately. The training accuracy varies from 96.8% to 98.1% with mean and STD values of 97.7% and 0.37%, respectively, while the testing accuracy varies from 92.8% to 94.6% with mean and STD values of 93.7% and 0.54%, respectively. The CNN model exhibits good accuracy for detecting the fault types from noisy samples with maximum noise levels up to ±20%. The mean values of the prediction accuracy over the ten training attempts are 97.4%, 97.2%, 96.5%, 96.1%, and 94.3% for the 0%, ±5%, ±10%, ±15%, and ±20% maximum noise levels, respectively. Fig. 4 illustrates the prediction accuracy and the loss against the iteration number during the second training attempt in Table 3. The results indicate that the training accuracy is near one hundred percent while the loss is near zero, which means good training of the CNN model. Fig. 5 compares the actual fault types (F1 to F6) against the predicted fault types to illustrate the prediction accuracy of the CNN model at the 0%, 5%, 10%, and 20% noise levels.
The results show that the CNN model has good detection accuracy at the 0%, 5%, 10%, and 20% noise levels, with high prediction accuracies of 96.9%, 96.6%, 96.3%, and 94.7%, respectively.
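The accuracy and cross-entropy loss metrics used in this evaluation can be computed as in the plain-Python sketch below; this is for illustration only (the paper uses the toolbox's built-in metrics):

```python
import math

def accuracy(pt, nt, pf, nf):
    """Prediction accuracy (%) from positive/negative true and false rates."""
    return 100.0 * (pt + nt) / (pt + nt + pf + nf)

def cross_entropy(targets, outputs, eps=1e-12):
    """Mean cross-entropy loss: targets k_ij (class membership per sample),
    outputs O_ij (Softmax probabilities), averaged over the n samples."""
    n = len(targets)
    return -sum(k * math.log(o + eps)
                for k_row, o_row in zip(targets, outputs)
                for k, o in zip(k_row, o_row)) / n

print(accuracy(90, 95, 10, 5))                  # 92.5
k = [[1, 0], [0, 1]]                            # true one-hot labels
o = [[0.9, 0.1], [0.2, 0.8]]                    # Softmax outputs
print(round(cross_entropy(k, o), 4))            # 0.1643
```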
The CNN model was trained with the different input ratios. To increase confidence in the results, the mean prediction accuracy over ten training and testing attempts at each noise level was calculated and is presented in Table 4. The results illustrate that the prediction accuracy with the new ratios as CNN inputs is better than that with the conventional ratios. Furthermore, the hybrid ratios {7} (five percentage ratios + ln[Rogers' 4 ratios]) give the highest accuracy of 95% over all testing samples and the highest prediction accuracy of 95.8% at noise levels up to ±20%. Fig. 6 presents the prediction accuracy and loss against the iteration number during one of the training attempts with input ratio {7}. The training accuracy is the closest to one hundred percent, while the loss is near zero, indicating good training of the CNN model. Fig. 7 compares the fault types (F1 to F6) predicted by the CNN model against the actual fault types. The results show that the CNN model performs well at the 0%, 5%, 10%, and 20% noise levels, with high prediction accuracies of 98.5%, 98.3%, 98%, and 96.6%, respectively. Fig. 8 presents a boxplot comparison of the CNN prediction accuracy with inputs {1}, {4}, and {7} at noise levels of 0%, ±5%, ±10%, ±15%, and ±20%. It illustrates that the prediction accuracy of the CNN with input {7} (hybrid input ratios: five percentage ratios + ln[Rogers' 4 ratios]) is superior to that with inputs {1} (conventional input ratios: Rogers' 4 ratios) and {4} (new input ratios: five gas percentage ratios).

V. MODEL VALIDATION
A. CNN MODEL AGAINST ANN WITH NOISY DATA
The performance of the proposed CNN model is compared with an artificial neural network (ANN). The ANN is built using the MATLAB R2020b ANN toolbox. The training dataset (1915 samples with the 9-variable input ratio {7}) is divided into three sets: 70% for training, 15% for validation, and 15% for testing. The ANN model is trained with different numbers of hidden neurons; 125 hidden neurons give the best prediction accuracy. The performance on the training, validation, and testing datasets is shown in Fig. 9. The training and testing results are summarized in Table 5.
Six dataset samples with different noise levels (0%, ±5%, ±10%, ±15%, and ±20%) are used in Table 6 as case studies for comparing the CNN and ANN models. The original samples are shaded in grey, followed by the noisy samples derived from the original sample using (1). The ACT column indicates the actual fault type, while the CNN and ANN columns indicate the corresponding diagnosis generated by each of the two methods. The results illustrate the effectiveness of the CNN model for different transformer fault types and all noise levels up to ±20%, while the ANN model fails to diagnose the transformer fault types at high noise levels (highlighted in bold).

B. CNN MODEL AGAINST MACHINE LEARNING MODELS WITH NOISY DATA
Three machine learning approaches are used for comparison: a decision tree (DT), a support vector machine (SVM), and an ensemble method (EN), all built using the MATLAB Classification Learner toolbox (R2020b). The CNN, DT, SVM, and EN methods were applied to the 9-variable input ratio {7}. Then, the generated models were compared using the testing samples. Fig. 10 presents the minimum error against the iteration number during the training stage of the three machine learning methods (DT, SVM, and EN). Ten-fold cross-validation is used during the training of the three machine learning methods. The optimization technique used to determine the optimal parameters of each method is Bayesian optimization, with the expected-improvement-per-second-plus acquisition function. The optimal parameters of the DT, SVM, and EN methods are introduced in Table 7. Table 8 presents the results of the proposed CNN model and of the DT, SVM, and EN methods during the training and testing stages. The results illustrate the superiority of the proposed CNN model over the other methods.

The results of the proposed CNN model are also compared with conventional and recently published AI methods at different noise levels. Table 9 presents the results of the best training attempt of the proposed CNN model with input ratio {7} side by side with the results of the Rogers' 4 ratios, IEC 60599, Duval triangle, conditional probability [8], modified Rogers' 4 ratios, modified IEC 60599 [2], and code-tree [19] methods. The results indicate the superiority of the proposed CNN model compared to the other methods. To facilitate the application of the proposed method by electrical engineers and experts in the field, it was implemented in the DGALab framework [42].

VI. CONCLUSION
This research proposed a CNN model to detect transformer fault types under different noise levels in measurements. The DGA data were collected from various resources, including the literature and electrical utilities. Noise was introduced to all samples at various levels up to ±20%. The dataset samples were randomly divided into two subsets: a training set with 65% of the data samples and a testing set with the remaining samples. Moreover, the data samples were presented to the CNN with different input ratios. The superiority of the CNN in detecting various fault types at different noise levels was evaluated through several indicators, as follows:
1- The prediction accuracy of the CNN with the input of five percentage ratios plus ln[Rogers' 4 ratios] was found to be superior compared to the other inputs.