Anomaly Detection for Partial Discharge in Gas-Insulated Switchgears Using Autoencoder

In this article, we propose a new anomaly detection method to detect the partial discharge in a gas-insulated switchgear. An autoencoder was used for anomaly detection and was modeled on the one-class classification problem. Based on the one-class classification scenario, in which the training data exploited the noise data only, the proposed autoencoder learned the low-dimensional latent information from the high-dimensional space of the input signal. Then, the reconstruction error was used as a fault indicator, and the threshold was determined using the partial discharge data. The performance of the proposed AE was verified by on-site noise and PRPD experiments, using an online UHF PD monitoring system in the real-world environment. The results showed that the proposed autoencoder not only achieved 86.75% detection performance for the on-site noise and partial discharge data in gas-insulated switchgears but also allowed better detection performance than the one-class support vector machine learning procedure by 40.5%.


I. INTRODUCTION
Power systems are rapidly expanding with increasing power demands and distributed energy resources. For utilities, owing to the aging of existing power systems, asset management is a requisite to extend the life of infrastructure assets and ensure the reliability of the power grid [1]. Several studies, such as event detection [2], reliability evaluation [3], [4], and fault diagnosis [5], have been conducted for asset management in power grids. Condition monitoring and diagnosis is a major part of asset management. Thus, online and offline measurements are performed to monitor the conditions of the power grid assets [5], [6]. A gas-insulated switchgear (GIS), applied to substations, is a major protection device for electric power facilities. A GIS is a valuable device in protecting, controlling, and isolating the equipment in a power grid in the case of an incident (e.g., power surge) [7], [8]. Internal defects can occur in a GIS during the process of transfer, installation, and operation [7]. When a failure occurs in a GIS, the impact of the accident is severe: recovery takes a long time and the power outage is prolonged. Defects in GISs cause partial discharges (PDs) that result in the breakdown of insulation [8]. Therefore, it is essential to avoid failure by detecting the PD of a GIS and addressing the defects in the GIS at an early stage [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Dazhong Ma.
According to IEC 62478, a PD occurrence in a GIS causes several phenomena, such as light emissions, acoustic waves, electromagnetic signals, and chemical reactions [11]. To measure these phenomena, various electrical, mechanical, and chemical methods have been used. The electrical methods use ultra-high frequency (UHF) sensors, the acoustic methods use acoustic sensors, and the chemical methods use the dissolved gas analysis technique [10], [12], [13]. In particular, the electrical method that uses UHF sensors has the advantage of high sensitivity for PD detection. In this study, the UHF method was utilized in the PD measurement system for condition monitoring and assessment of GISs [12].
In order to investigate the PD characteristics in a GIS, the time-resolved partial discharge (TRPD) and the phase-resolved partial discharge (PRPD) analysis methods were studied [14]. The TRPD-based method analyzed the time-domain, frequency-domain, and time-frequency-domain features from the PD pulses [14]-[16]. The PRPD-based method measured and analyzed the amplitude at each phase. Defect types were identified by analyzing the number of the PD pulses, maximum amplitude, or average amplitude in each phase [17]. In the PRPD method, we used the signal-processing techniques in the time-domain [18], frequency-domain [19], and time-frequency-domain [20] to obtain meaningful features. From these features, machine learning-based classifiers, such as neural networks [21], decision trees [22], k-nearest neighbors (k-NN) [23], and support vector machines (SVMs) [24], were trained for PD classification.
To improve the performance of fault detection, many deep learning models have been studied to extract features and classify the PDs automatically in an end-to-end manner. Deep neural networks have achieved state-of-the-art accuracies in multiple pattern recognition tasks in different domains, such as computer vision, speech recognition, and text classification [25]-[27]. To improve the fault detection accuracy using the PRPD method, various deep neural network models have been proposed based on popular neural network structures such as the convolutional neural network (CNN) [28], long short-term memory (LSTM) [29], and self-attention network [30]. However, most of the existing deep learning-based fault diagnosis methods are supervised, i.e., they must be trained on data with the corresponding fault labels [31], [32]. It is generally difficult to obtain labeled fault data on-site.
In this article, to overcome the lack of on-site labeled fault data for supervised learning, we propose a new anomaly detection method for GISs using autoencoders (AEs). An AE is a simple feed-forward neural network with multiple layers of hidden nodes, where the number of hidden nodes is usually smaller than the number of input nodes. The model trains the hidden nodes to reconstruct the input at the output without a labeled dataset by minimizing the reconstruction error [33]. The proposed AE learned the features directly from the unlabeled noise data during the training process. Noise and PRPD measurements of GISs were conducted on-site. Noise data were used in the training, validation, and test sets as well as in the new test set. The loss function was used to calculate the reconstruction errors in our AE model, and the hyperparameters of the proposed AE were determined using a validation set. The test set, which was composed of noise data and PD data, was used to determine the threshold for fault detection in a semi-supervised manner. Then, the detection performance was verified using a new test set that included the on-site noise and the PRPD data. The major contributions of this article are summarized as follows:
• Anomaly detection using AEs was applied for the first time to detect faults in semi-supervised learning. The proposed AE was trained using noise data from real-world environments, and fault data were used only to determine the threshold. This gives the proposed AE an advantage in real-world applications, because only noise data are needed during the training process.
• The proposed AE outperformed the one-class support vector machine (OCSVM), an anomaly detection method that required only one class of normal samples [34]. This was because AEs had the advantage of feature extraction based on the on-site noise data for GISs.
• The performance of the proposed AE was verified through on-site noise and PRPD experiments in real-world environments. The on-site PRPD data included seven types of faults that could occur in a GIS: crack, floating, free particle, protrusion on conductor (POC), protrusion on enclosure (POE), particle on spacer (POS), and void. Using the on-site noise and PRPD data, the proposed method exhibited a detection performance of 86.75% for on-site noise and PD data in GISs, which was 40.5% higher than that of the OCSVM.
The remainder of this article is organized as follows. We briefly introduce the anomaly detection in Section II. Section III presents the on-site noise and the PD measurements for GISs. In Section IV, an anomaly detection method using an AE is presented. Performance evaluations are presented in Section V. Finally, the article is concluded in Section VI.

II. ANOMALY DETECTION
Anomaly detection is a classic problem across multiple domains, ranging from scientific observations to financial transactions [23]. We define anomalies as items, events, or observations in the data that deviate significantly from other items, events, or observations in terms of behavior so as to arouse suspicion. Anomalies are also referred to as abnormalities, deviants, or outliers in the data mining and statistics literature [31]. Several attempts have been made to detect anomalies by defining a region that reflects normal behavior; any observation outside the defined region is regarded as an anomaly [35].
Supervised deep learning has been widely researched in many fields [32], [36], [37]. However, most supervised methods require labeled datasets of both the normal and the abnormal classes to train a deep supervised binary or multiclass classifier. Although the performance of supervised models for detecting anomalies has improved, they still face some problems in practice. First, it is difficult to define a normal region that contains all possible normal behaviors. Furthermore, the distinction between normal and abnormal behaviors is often unclear: an apparently anomalous observation near the boundary may in fact be normal, and vice versa. Second, the concept of normal behavior continues to evolve, and it might not be possible to adequately represent an existing notion of normal behavior. Third, labeled data are usually difficult and expensive to obtain, which leads to a lack of labeled training samples [35]. Finally, the data usually include noise that tends to be close to the real anomalies, making it difficult to identify and delete them.
VOLUME 8, 2020
In contrast to supervised learning, unsupervised learning works with unlabeled data [23], [31], [35]. Unsupervised approaches do not require labeled anomaly data, so they are more suitable for fault diagnosis, where the underlying features must be learned without a labeled dataset [31], [32]. The basic idea behind the unsupervised anomaly detection approach is to find an approximate model capable of capturing the normal behaviors of complex systems. The approximate model can then be used to mark anomalies if the deviation of the predicted behavior of the trained model from the actual observation exceeds a certain threshold. However, unsupervised techniques are difficult to train to convergence on high-dimensional data and are generally less accurate than supervised techniques.
Semi-supervised learning deals with a partially labeled dataset to detect anomalies [38], [39]. It has advantages of both supervised and unsupervised learning and good accuracy, even when the dataset is not fully labeled. In this study, we propose a semi-supervised anomaly detection method using AE. The proposed method uses unsupervised learning to learn the best possible representation of data and exploits supervised learning to determine the threshold.

III. ON-SITE NOISE AND PRPD MEASUREMENTS
In this section, we present the on-site noise and PD measurements using an on-line UHF PD monitoring system for GISs [40]. Fig. 1 shows a block diagram that is composed of a GIS, an internal UHF sensor, and a data acquisition system (DAS), for noise and PD measurements. The internal UHF sensor is used with a frequency range of 0.5 GHz to 1.5 GHz and a sensitivity of −14.5 dBm at 5 pC. Verification was performed by CIGRE TF 15/33.03.05 [7].

A. ON-SITE NOISE MEASUREMENTS
Noise was measured for 312 cases using a commercial on-line UHF PD monitoring system for on-site GISs in 6 substations in South Korea and other countries. Fig. 2 shows an example of on-site noise signals for 1000 power cycles and the corresponding 2D representation. The noise signals of the 1000 power cycles are accumulated to generate the 2D representation, in which the number of noise signals per 1000 power cycles is indicated by different colors. The noise signals are evenly distributed over the phases and power cycles, as shown in Fig. 2a.
The measured signal at the $m$-th power cycle can be defined as

$$x_m = [x_{m,1}, x_{m,2}, \ldots, x_{m,N}], \quad m = 1, 2, \ldots, M,$$

where $M$ is the number of power cycles and $N$ is the number of phase angles in a power cycle. In the matrix form, the measured signal is defined as $X$, which can be represented by a sequence of $M$ consecutive power cycles as

$$X = [x_1^{\top}, x_2^{\top}, \ldots, x_M^{\top}]^{\top} \in \mathbb{R}^{M \times N}.$$

[...] difficult to distinguish between noise and PRPDs, because there is information loss in the process of feature extraction by statistical parameters, such as mean(PDs) and max(PDs).
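The matrix form and the statistical summaries mentioned above can be illustrated with a minimal sketch. The random amplitudes are hypothetical stand-ins for measured UHF data, and the variable names are assumptions introduced here; only the shapes (M power cycles by N phase angles) follow the text.

```python
import numpy as np

M, N = 1000, 256  # power cycles and phase angles per cycle

rng = np.random.default_rng(0)
# Hypothetical stand-in for measured amplitudes: one row vector per power cycle.
cycles = [rng.random(N) for _ in range(M)]

# Stack the M row vectors into the measurement matrix X of shape (M, N).
X = np.vstack(cycles)

# Statistical features per phase angle collapse the M cycles into N values,
# which discards the cycle-by-cycle structure of the signal.
mean_per_phase = X.mean(axis=0)
max_per_phase = X.max(axis=0)

print(X.shape, mean_per_phase.shape, max_per_phase.shape)
```

Each summary keeps N values out of the M × N original entries, which is one way to see the information loss that motivates learning features directly from X.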

IV. PROPOSED SCHEME
In this section, we define the anomaly detection problem for PD diagnosis and describe the architecture of the proposed method to detect PRPDs in a GIS. The proposed model employed the autoencoder structure using a training process based on the noise data only and determined the threshold using the noise data and the PRPDs in a semi-supervised manner.

A. PROBLEM FORMULATION
In anomaly detection, we assumed that the training data contained normal data points only, and we identified whether a new sample was an anomaly.
Let $L(X): \mathbb{R}^{M \times N} \to \mathbb{R}_{\ge 0}$ denote a distance function mapping an input matrix to a non-negative value that shows how far a sample is from a normal state, where $M$ is the number of power cycles, $N$ is the number of phase angles, and $\mathbb{R}_{\ge 0} = \{x \in \mathbb{R} \mid x \ge 0\}$ is the set of non-negative real numbers. The higher the value of $L(X)$, the higher the chance that the corresponding data point is abnormal. For a given threshold value $\varepsilon > 0$, a sample is detected as an anomaly if $L(X) > \varepsilon$ and as normal if $L(X) < \varepsilon$, and we define the detection accuracy as the ratio of correctly classified samples.
Our goal was to learn the score function L(X) and the corresponding threshold ε to achieve the best detection accuracy for anomalies on the new test data, while minimizing the number of normal samples falsely identified as anomalies.
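The decision rule and the two quantities in this formulation can be sketched as follows. This is only an illustration under assumptions: the scores and labels are made-up values, the function names are hypothetical, and the accuracy is taken as the fraction of correctly classified samples, which is one reading of the definition above.

```python
def is_anomaly(score, eps):
    """Decision rule: a sample is flagged when its score exceeds the threshold."""
    return score > eps


def detection_accuracy(scores, labels, eps):
    """Fraction of samples classified correctly (label True means anomaly)."""
    correct = sum(is_anomaly(s, eps) == y for s, y in zip(scores, labels))
    return correct / len(scores)


def false_positive_rate(scores, labels, eps):
    """Fraction of normal samples that are falsely flagged as anomalies."""
    normal_scores = [s for s, y in zip(scores, labels) if not y]
    return sum(is_anomaly(s, eps) for s in normal_scores) / len(normal_scores)


# Made-up anomaly scores: three normal samples followed by three anomalies.
scores = [0.010, 0.015, 0.030, 0.050, 0.020, 0.040]
labels = [False, False, False, True, True, True]
eps = 0.023

acc = detection_accuracy(scores, labels, eps)
fpr = false_positive_rate(scores, labels, eps)
print(acc, fpr)
```

Raising ε lowers the false positive rate but can miss anomalies with small scores, which is the trade-off the threshold selection has to balance.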

B. AUTOENCODER ARCHITECTURE
An autoencoder is a neural network using a bottleneck structure. The goal of training the autoencoder was to minimize the difference between the reconstructed input and the original input [32]. Given $X \in \mathbb{R}^{M \times N}$, the encoder transforms $X$ to a latent representation $z \in \mathbb{R}^{a}$ and the decoder is trained to use $z$ to reconstruct the original input, where $a$ is the dimension of the latent vector [41]. Fig. 6 shows the detailed structure of the proposed AE. This model consists of two major components: an encoder (for encoding input) and a decoder (for reconstruction). The encoding part transforms the inputs into an internal representation and the internal representation is translated into the output by the decoding part.
For an input $X$, a flattened vector $x_{\text{Flattened}} \in \mathbb{R}^{MN}$ is obtained by flattening $X$. The $l$-th encoding layer is a fully connected layer and can be calculated as

$$h_e^{l} = \phi_e^{l}\left(W_e^{l} h_e^{l-1} + b_e^{l}\right),$$

where $W_e^{l}$ and $b_e^{l}$ are the weight matrix and the bias vector of the $l$-th encoding layer ($l = 1, 2, \ldots, L$), respectively, $\phi_e^{l}$ is the non-linear activation function, and $h_e^{1} = \phi_e^{1}(W_e^{1} x_{\text{Flattened}} + b_e^{1})$ is the output of the first encoding layer. By stacking multiple encoding layers, the latent vector is calculated as

$$z = h_e^{L}.$$

Similarly, the $k$-th decoder layer is a fully connected layer and can be calculated as

$$h_d^{k} = \phi_d^{k}\left(W_d^{k} h_d^{k-1} + b_d^{k}\right),$$

where $W_d^{k}$ and $b_d^{k}$ are the weight matrix and the bias vector of the $k$-th decoding layer ($k = 1, 2, \ldots, K$), respectively, $\phi_d^{k}$ is the non-linear activation function, and $h_d^{1} = \phi_d^{1}(W_d^{1} z + b_d^{1})$ is the output of the first decoding layer. Finally, the reconstructed input matrix $\hat{X}$ is calculated from the output of the last decoding layer, $h_d^{K}$, reshaped to $M \times N$.

The parameters of the proposed AE model were learned through the mini-batch $B$ to minimize the loss function, where the parameters included the hyperparameters, weight parameters, and bias parameters. The loss function of the $t$-th training sample is calculated as the reconstruction error $L(X_t)$ between $X_t$ and $\hat{X}_t$, where $\hat{X}_t$ denotes the reconstructed sample corresponding to the training sample $X_t$. The total loss $J$ is calculated by aggregating $L(X_t)$ over the samples in the mini-batch $B$. Before training, the input dataset in the range of $x_{\min}$ to $x_{\max}$ was normalized to $[0, 1]$, where $x_{\min}$ and $x_{\max}$ were the minimum and maximum values of the original dataset, respectively [32]. The weights and biases were updated based on the gradient information of the total loss. The Adam algorithm [41] was chosen as the gradient descent method because it requires only the first-order gradient to be calculated, thus reducing the calculation complexity.
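A forward pass through such a stack of fully connected layers can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: the layer widths, the random weights, and the helper names are hypothetical, ReLU is used for every layer, and the mean squared error is assumed as the reconstruction error because the exact loss formula is not recoverable from this text.

```python
import numpy as np


def min_max_normalize(X, x_min, x_max):
    """Scale values from [x_min, x_max] to [0, 1]."""
    return (X - x_min) / (x_max - x_min)


def relu(v):
    return np.maximum(v, 0.0)


def dense_stack(h, weights, biases):
    """Apply h <- relu(W h + b) for each fully connected layer in turn."""
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h


def init_layers(widths, rng):
    """Random weights and zero biases for consecutive layer widths (illustrative only)."""
    pairs = list(zip(widths[:-1], widths[1:]))
    weights = [rng.normal(0.0, 0.1, size=(n_out, n_in)) for n_in, n_out in pairs]
    biases = [np.zeros(n_out) for _, n_out in pairs]
    return weights, biases


rng = np.random.default_rng(0)
M, N, a = 8, 16, 4  # toy sizes; the experiments use M = 1000 and N = 256
enc_W, enc_b = init_layers([M * N, 32, a], rng)  # hypothetical encoder widths
dec_W, dec_b = init_layers([a, 32, M * N], rng)  # hypothetical decoder widths

X_raw = rng.uniform(-60.0, -20.0, size=(M, N))   # made-up raw amplitudes
X = min_max_normalize(X_raw, X_raw.min(), X_raw.max())

x_flat = X.reshape(-1)                              # flattened input vector
z = dense_stack(x_flat, enc_W, enc_b)               # latent vector
X_hat = dense_stack(z, dec_W, dec_b).reshape(M, N)  # reconstructed matrix

loss = float(np.mean((X - X_hat) ** 2))  # assumed MSE reconstruction error
print(z.shape, X_hat.shape, loss)
```

With untrained random weights the reconstruction is poor; training would adjust the weights and biases so that the loss on noise samples becomes small.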

C. ANOMALY DETECTION USING AUTOENCODER
The proposed AE aimed to find a compact representation of the input data distribution. AEs are generally data-specific, and their utility is restricted to data that are considerably similar to their training data [43]. For anomaly detection, the AE model was first trained on a noise-only dataset to regenerate noise signals. Thus, it is difficult for the AE model to reconstruct PRPD signals. Here, the reconstruction error was used as an anomaly score to indicate a potential anomaly, because the reconstruction errors of the PRPD signals are larger than those of the noise signals.
After training, we assumed that the trained AE model had learned a good representation of the normal signal pattern recorded by the UHF sensor. Therefore, the differences between the reconstructed signals and their corresponding inputs could be considered an important metric for detecting possible anomalies. An error threshold was identified based on the performance of the model on the test set, which contained both noise samples and PRPDs, to discriminate the PRPDs from the noise data. To determine the threshold, the receiver operating characteristic (ROC) curve on the test set was used in a semi-supervised manner [44]. The threshold ε was selected by minimizing the distance from the corresponding point on the ROC curve to the ideal point, at which TPR = 1 and FPR = 0. Here, the true positive rate (TPR) was defined as the probability that fault data were detected as faults, and the false positive rate (FPR) was defined as the probability of a false alarm, estimated using the noise data. Fig. 7 shows the semi-supervised learning method for anomaly detection.
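The threshold selection described here can be sketched as a search over candidate thresholds for the ROC point closest to (FPR, TPR) = (0, 1). The scores below are made-up reconstruction errors, the function names are hypothetical, and using the observed scores as candidate thresholds is an assumption of this sketch.

```python
import math


def roc_point(scores, labels, eps):
    """Return (FPR, TPR) when samples with score > eps are flagged as faults."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s > eps)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s > eps)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    return fp / neg, tp / pos


def select_threshold(scores, labels):
    """Pick the candidate threshold whose ROC point is closest to (0, 1)."""
    best_eps, best_dist = None, math.inf
    for eps in sorted(set(scores)):  # candidate thresholds: the observed scores
        fpr, tpr = roc_point(scores, labels, eps)
        dist = math.hypot(fpr - 0.0, tpr - 1.0)
        if dist < best_dist:
            best_eps, best_dist = eps, dist
    return best_eps, best_dist


# Made-up reconstruction errors: four noise samples, then five fault samples.
scores = [0.010, 0.012, 0.018, 0.030, 0.025, 0.035, 0.050, 0.060, 0.070]
labels = [False, False, False, False, True, True, True, True, True]

eps, dist = select_threshold(scores, labels)
print(eps, dist)
```

In this toy set one noise sample overlaps the fault scores, so no threshold reaches the ideal point and the search settles on the closest compromise.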

V. PERFORMANCE EVALUATION
This section presents the performance evaluation using on-site noise and PRPD measurements. Table 1 shows the number of samples for on-site noise and PRPDs in GISs, where seven types of PRPDs such as crack, floating, free particle, POC, POE, POS, and void PDs, are considered. Each experiment was performed with M = 1000 power cycles and N = 256 phase angles. Noise was measured using the UHF sensor in on-site fields. For semi-supervised learning, we divided the experimental dataset into training, validation, test, and new test sets [44]. The training set was 80% of the noise data and the validation set was 10% of the noise data.
To determine the threshold, we used 5% of the noise data and 50% of the PRPD data for the test set. Then, the detection accuracy was calculated using the new test set, which consisted of 5% of the noise data and 50% of the PRPD data.
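The split described above (80%, 10%, 5%, and 5% of the noise data, with the PRPD data divided equally between the test set and the new test set) can be sketched as follows. The random shuffling, the sample identifiers, and the function name are assumptions; the text does not state how samples were assigned or whether the PRPD split was stratified by fault type.

```python
import random


def split_dataset(noise, prpd, seed=0):
    """Split noise 80/10/5/5 and PRPD 50/50 into the four sets used in the study."""
    rng = random.Random(seed)
    noise, prpd = noise[:], prpd[:]  # copy so the caller's lists stay untouched
    rng.shuffle(noise)
    rng.shuffle(prpd)

    n = len(noise)
    n_train = n * 80 // 100
    n_val = n * 10 // 100
    n_test = n * 5 // 100
    half = len(prpd) // 2

    return {
        "train": noise[:n_train],                                        # noise only
        "val": noise[n_train:n_train + n_val],                           # noise only
        "test": noise[n_train + n_val:n_train + n_val + n_test] + prpd[:half],
        "new_test": noise[n_train + n_val + n_test:] + prpd[half:],
    }


# Hypothetical sample identifiers standing in for measured matrices.
noise_ids = [f"noise_{i}" for i in range(200)]
prpd_ids = [f"prpd_{i}" for i in range(40)]

sets = split_dataset(noise_ids, prpd_ids)
print({name: len(items) for name, items in sets.items()})
```

Keeping the new test set disjoint from the set used to pick the threshold is what allows the reported accuracy to be read as a held-out estimate.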
For the AE, we deployed the encoder and decoder as plain, fully connected neural networks. Each layer was followed by a rectified linear unit (ReLU) activation with no batch normalization. To optimize the hyperparameters, such as the batch size, number of layers, and latent vector size, the proposed model was trained using several combinations of parameters, and the best combination was selected. The AE model with L = 7 layers, a latent vector size of 32, a batch size of |B| = 16, and a learning rate of 0.0001 was chosen. The details of the proposed AE are listed in Table 2, where the total number of parameters to be trained is 131,408,736.
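The order of magnitude of the parameter count can be checked with a short calculation. Table 2 is not reproduced in this text, so the layer widths below are hypothetical and are not expected to match the reported total exactly; only the input size M × N = 256,000 and the latent size 32 come from the text.

```python
def count_dense_parameters(widths):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))


M, N = 1000, 256
# Hypothetical widths: input, encoder layers, latent vector, decoder layers, output.
hypothetical_widths = [M * N, 256, 128, 64, 32, 64, 128, 256, M * N]

total = count_dense_parameters(hypothetical_widths)
# Weights of the first and last layers alone, which touch the 256,000-wide input.
outer_weights = 2 * (M * N) * 256

print(total, outer_weights / total)
```

Under these assumed widths, more than 99% of the parameters sit in the first and last layers, which suggests why a fully connected AE on a 1000 × 256 input reaches a count on the order of 10^8.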
Reconstruction loss values were used to detect faults by selecting an appropriate detection threshold. This is because the AE model was trained using noise data only. It was expected that the loss values of the noise data would be lower than those of the fault data due to the differences in the data distribution. Fig. 9 shows the reconstruction error on the new test set, containing 50% of the fault data and 5% of the noise data. From these histograms, most noise samples are not detected falsely as fault data, whereas in some fault data, the reconstruction loss is higher than the threshold.

To analyze the errors of the proposed AE, Fig. 10 shows the scatter diagram of loss values on the new test set with threshold ε = 0.023 and data visualization on errors for noise, crack, and POS. For anomaly detection, the proposed AE separates most of the PRPDs from the noise samples, as shown in Fig. 10a. This is because the features are learned by a small number of nodes in the middle layer of the proposed AE using the on-site noise data for GISs to effectively reconstruct the input. Amplitudes of the noise sample in Fig. 10b are higher than those of the noise data in Fig. 2a. As shown in Figs. 10c and 10d, PRPDs for crack and POS have lower amplitudes compared to Figs. 3a and 3g, respectively. This explains why some crack and POS samples were detected as noise samples.

Table 3 lists the accuracy of the AE model compared to an OCSVM [46]. In our experiment, the OCSVM classifier used the RBF kernel, µ = 0.3, and γ = 'scale' after multiple trials using µ ∈ (0, 1] to find the best model [34]. The proposed AE model achieved 86.75% accuracy on the new test set, which was approximately 40.25% higher than that of the OCSVM. This is because the proposed AE could extract meaningful features from the noise data and reliably detect the PRPD.
The proposed AE outperformed the OCSVM for the free particle and POS types, successfully detecting 100% and 50% of the samples, respectively, whereas the OCSVM could not detect the free particle and POS types in our experiment. Only for the POE type did the OCSVM achieve 100% accuracy, which was 20% higher than that of our method. For the other classes, the proposed AE model was more accurate than the OCSVM, with differences of approximately 3.33%, 18.75%, 42.42%, and 73.81% for floating, crack, POC, and void, respectively. Notably, the proposed AE achieved this performance with far fewer false positive detections than the OCSVM: only 6.25% of the noise samples were falsely recognized as fault data.

To better understand what the model learned, we analyzed the features of the input and the latent vectors. Fig. 11 shows the t-distributed stochastic neighbor embedding (t-SNE) representation used to visualize a set of inputs composed of both noise and PRPDs, and their output extracted from the latent layer [47]. Here, t-SNE reduced the dimensions of the data from a multi-dimensional vector to only the top 2 components with maximum variation and visualized them such that similar objects were transformed into nearby points. Fig. 11a shows that numerous fault data are very close to the noise data and hence difficult to classify accurately. As shown in Fig. 11b, the floating, free particle, POC, and void types are separated from the noise after the encoding process because the proposed model extracted meaningful features for reconstruction. Some samples of the crack and POS types are close to the noise samples, which explains the low detection performance of the proposed AE for these types, as listed in Table 3. Also, the POE slightly overlaps with the noise samples, and hence the detection performance of the proposed AE for POE was 80%.

VI. CONCLUSION
In this article, we proposed an AE-based anomaly detection method to detect PRPDs in GISs. The proposed AE was trained solely with noise data, and the threshold was determined using noise and PRPD data in a semi-supervised manner. The proposed AE has the advantage of feature extraction based on the on-site noise data for GISs. It was verified using on-site noise and PD measurements from an online UHF PD monitoring system for GISs. The on-site PD data included seven types of faults: crack, floating, free particle, POC, POE, POS, and void. The experimental results revealed that the proposed AE achieved a detection accuracy of 86.75% and had a 40.25% higher detection performance than the OCSVM. The proposed AE can also be applied to offline anomaly detection based on noise and PRPD measurements from an offline system. In future studies, we intend to design artificial cells for analyzing the PRPD patterns considering various severities of faults and conduct further verifications of the proposed method for all severity levels.