Fault Detection on Insulated Overhead Conductors Based on DWT-LSTM and Partial Discharge



I. INTRODUCTION
Insulated conductors can improve the stability of power transmission and reduce the construction space compared with traditional bare conductors [1]-[3]. Therefore, insulated conductors are increasingly used in overhead power transmission [4], [5]. However, the insulated overhead conductor (IOC) poses a major challenge. When an IOC breaks and falls to the ground, or when something such as a branch hits it, no overcurrent results. Therefore, phase-to-ground and phase-to-phase faults usually cannot be detected by standard protective equipment [6]. This creates a long-term reliability threat in locations where trees and branches frequently hit the conductors or continuously push and bend them. Eventually, the power line may be damaged and broken, which causes a power outage or even a tree fire [7]. These faults give rise to partial discharge (PD) [8], [9]. The PD current is very small (about 6-10 amps) and easily disturbed by external background noise, which is the main obstacle to accurate detection [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Qingli Li.
In recent years, deep learning algorithms, such as the deep belief network, deep neural network, convolutional neural network, deep adversarial convolutional neural network, and transfer network, have been gradually applied to fault diagnosis [11]-[15]. In 2016, Li et al. proposed a convolutional neural network (CNN) with a deep architecture that extracts new features automatically to recognize ultra-high frequency (UHF) signals in gas-insulated switchgear (GIS) [16]. In 2017, a novel method based on the multi-kernel multi-class relevance vector machine (MMRVM) was proposed for partial discharge pattern recognition [17]. In 2018, Wan et al. proposed an approach to detecting PD patterns in GIS using long short-term memory (LSTM) and recurrent neural network (RNN) [18], [19]. In the same year, a deep learning method based on the stacked denoising autoencoder (SDAE) was presented in [20] for PD pattern recognition of different insulation defects in high-voltage cables. Based on deep learning, the network of [21] was constructed for straightforward pattern recognition. Through on-site detection and simulation experiments, image data sets of five partial discharge defects were established and comparative experiments were conducted. Adam et al. applied LSTM to identify different types of PD activity in insulated cables, using single PD impulses as input data. The experimental results show that the recognition accuracy of LSTM is slightly lower than that of the random forest method, but LSTM has the advantage of not requiring hand-crafted statistical features [22]. The spectrum form is usually adopted in online monitoring systems for partial discharge of power equipment. As the discharge time increases and the discharge intensity changes, the PD spectrum changes accordingly. When the PD spectrum fluctuates with time, its characteristic parameters (such as phase, amplitude, and number of discharges) also vary over time.
The discrete wavelet transform (DWT) can decompose the original signal and extract features at different resolutions [23]-[25]. LSTM has an excellent capability for mining time series information. This paper presents a novel method that combines DWT with a many-to-one LSTM and retains more time series detail of PD activity. The method is used to detect PD in IOC faults. The overall framework of the proposed method is shown in Figure 1. Firstly, the noise of the original signal is reduced using the wavelet method. Secondly, the noise-reduced signal is decomposed by DWT and signal features with different resolutions are obtained. Finally, the signal features are input into the many-to-one LSTM model, and the detection result is obtained.
Our contributions are summarized as follows:
(1) We propose an effective learning method for power grid fault diagnosis with noisy signals. The method is implemented by combining DWT and LSTM.
(2) Oversampling is used to solve the serious imbalance between the number of fault samples and the number of normal samples, and a comprehensive evaluation index is designed to evaluate the model performance.
(3) DWT is used to deal with the heavy noise in the original data, which improves the accuracy of the LSTM classifier.
(4) It is shown that the accuracy of the LSTM detection model can be improved by combining the different signal levels obtained from DWT decomposition. The best combination of levels is obtained through comparative experiments.
This paper is organized as follows. Section II introduces the data set. Section III describes the signal noise reduction based on DWT. Section IV describes the signal decomposition and feature extraction based on DWT. Section V describes the IOC fault detection based on the LSTM model. Section VI describes the experiments, results and discussion. Section VII concludes the paper.

II. ENET DATASET
The Technical University of Ostrava (VSB) devised a special meter to measure the voltage signal of the stray electrical field along the IOC in order to detect hazardous PD activity. In 2018, VSB released the ENET data set on Kaggle, which is the world's largest data science collaboration platform. The data set contains 8,711 labeled voltage signals from four different locations. Those locations represent deployment in a real environment (forested and hardly accessible terrain). Each signal is a 50 Hz voltage waveform that contains 800,000 data points and is pre-marked as PD (525 signals) or Non-PD (8,186 signals). Due to the large volume of the data set, the Hadoop Distributed File System (HDFS) storage format is used. Examples of the PD signal and the Non-PD signal are shown in Figure 2. It can be seen that the maximum and minimum values of the Non-PD signal are about 40 mV and −40 mV, and the fluctuation is relatively stable. In contrast, the maximum and minimum values of the PD signal are about 60 mV and −80 mV, and the signal fluctuation increases significantly.

III. THE SIGNAL NOISE REDUCTION BASED ON DWT
As an important signal processing method, the wavelet transform offers multi-layer and multi-resolution analysis. By scaling and translating the wavelet function, signal details in the time and frequency domains can be analyzed [26]-[28]. The wavelet transform mainly includes the continuous wavelet transform (CWT), the discrete wavelet transform (DWT) and the discrete wavelet packet transform (DWPT). CWT requires continuous integration and its calculation is complicated. DWT is the discretized form of CWT. Combined with the Mallat algorithm, the computational complexity of DWT can be reduced. The DWT expression is given in (1).
VOLUME 8, 2020

(1)

where $\phi_{j,k}(n)$ and $\psi_{j,k}(n)$ are the scaling function and the wavelet function, respectively, $J$ is the number of wavelet decomposition levels, $N$ is the total number of wavelet decomposition coefficients, and $a_j(k)$ and $d_j(k)$ are the approximation coefficients and the detail coefficients, respectively, which can be expressed as (2).
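For orientation, the textbook synthesis and analysis relations that match the symbols defined here can be sketched as follows. This is a standard form, not necessarily the exact one used in (1) and (2): the signal name $f(n)$, the $1/\sqrt{N}$ normalization and the summation ranges are assumptions.

```latex
% Synthesis: signal rebuilt from the level-J approximation and all detail levels
f(n) = \frac{1}{\sqrt{N}} \sum_{k} a_J(k)\, \phi_{J,k}(n)
     + \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \sum_{k} d_j(k)\, \psi_{j,k}(n)

% Analysis: coefficients as inner products with the scaling and wavelet functions
a_j(k) = \frac{1}{\sqrt{N}} \sum_{n} f(n)\, \phi_{j,k}(n), \qquad
d_j(k) = \frac{1}{\sqrt{N}} \sum_{n} f(n)\, \psi_{j,k}(n)
```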

A. NOISE REDUCTION ALGORITHM BASED ON DWT
High-frequency signals with small amplitude and low-frequency signals with large amplitude are obtained after a signal is decomposed by the wavelet transform. A series of estimated wavelet coefficients is obtained by mapping through the threshold function, and these estimates are used to reconstruct the noise-reduced signal. The signal noise reduction process is shown in Figure 3. The right part shows the change of the signal waveform during noise reduction: the blue curve represents the original signal, the red curve represents the signal after the high-pass filter, and the green curve represents the signal after DWT noise reduction. The left part is the signal processing flow. The noise reduction algorithm includes wavelet decomposition, threshold quantization and wavelet reconstruction. First, the raw voltage signal is input. Then the number of decomposition layers, the scale equation and the wavelet threshold are determined. With the chosen mother wavelet, the noise in the original signal is filtered out to obtain the useful signal components, which are then reconstructed. In order to obtain the optimal frequency resolution and noise reduction, the Daubechies6 wavelet function is adopted. The wavelet decomposition structure is shown in Figure 4. $Y$ is a discrete sequence of the noisy signal, and $a_j(k)$ and $d_j(k)$ are the approximation coefficients and detail coefficients at scale $j$ ($j = 1, 2, \ldots, n$), respectively. The original signal can be decomposed into $n$ detail coefficients $d_i$ and $n$ approximation coefficients $a_i$. In the threshold processing, $d_{ij}$ is the wavelet coefficient before thresholding and $\hat{d}_{ij}$ is its value after truncation. The noise reduction result is obtained from the threshold processing result and the $n$-th reconstructed signal.
Let $\phi(x)$ be a scaling function of the multi-resolution analysis, then where $k$ is a real number, $\phi(2x - k)$ is a standard orthogonal basis for the multi-resolution analysis, and $P_k$ is a scale function. The threshold $W$ is an important parameter in the noise reduction process [29], [30]. Based on [31] and the average absolute deviation, the threshold $W$ of decomposition layer $n$ is calculated as in (5) and (6).
The hard threshold method is used to deal with the approximate and detailed coefficients, as (7).
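The thresholding step can be sketched in a few lines of Python. The hard threshold follows the description in the text (keep a coefficient whose magnitude exceeds the threshold, zero it otherwise). The threshold estimate is an assumption: it uses the average absolute deviation mentioned above, rescaled by the constant 0.6745 and combined with the widely used universal rule $\sigma\sqrt{2\ln N}$, which may differ from (5) and (6). The function names are hypothetical.

```python
import math

def mean_abs_deviation(coeffs):
    """Average absolute deviation of the coefficients around their mean."""
    mu = sum(coeffs) / len(coeffs)
    return sum(abs(c - mu) for c in coeffs) / len(coeffs)

def estimate_threshold(detail_coeffs, n_samples):
    """Assumed rule: W = sigma * sqrt(2 ln N), with sigma = deviation / 0.6745."""
    sigma = mean_abs_deviation(detail_coeffs) / 0.6745
    return sigma * math.sqrt(2.0 * math.log(n_samples))

def hard_threshold(coeffs, w):
    """Keep a coefficient if its magnitude exceeds w, otherwise set it to zero."""
    return [c if abs(c) > w else 0.0 for c in coeffs]

# Toy coefficients: two large values standing out from small background values.
coeffs = [0.1, -0.2, 5.0, 0.05, -4.0, 0.0]
w = estimate_threshold(coeffs, n_samples=len(coeffs))
kept = hard_threshold(coeffs, w)
```

Hard thresholding leaves the surviving coefficients unchanged, which preserves pulse amplitudes; soft thresholding would shrink them toward zero instead.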
The two-scale equation of Daubechies wavelet is where Z is the set of integers.
The steps for constructing the coefficients $P_k$ of the two-scale equation are as follows: select a positive integer $n$ ($n \ge 2$) and a polynomial: where $\omega \in [0, 2\pi]$, $L$ is the maximum value of non-zero data in the sequence, $z$ is the noise signal, $R$, $Q(z)$, $P_n(y)$ are algebraic polynomials with real coefficients, $y$ is a real number in $[0, 1]$, and $H(z)$ is a high-pass filter with linear phase. As the number of decomposition layers $n$ increases, the smoothness of the Daubechies wavelet increases.

B. TEST AND SIMULATION
The noise of the three-phase signals of normal samples and fault samples is reduced, as shown in Figures 5 to 10, respectively. In each figure, the three panels of the first row show all data points of the signal, and the three panels of the second row show the first 10,000 data points so that the noise reduction details can be observed. The first column shows the original signal, the second column shows the signal after the high-pass filter, and the third column shows the signal after noise reduction. The noise reduction keeps the amplitude of an approximation or detail coefficient when it is greater than the threshold value and sets it to 0 when it is less than the threshold value. This helps to extract and classify the signal features.

IV. FEATURE EXTRACTION BASED ON DWT
When PD activity occurs, each vibration signal contains unique information about the particular condition of the IOC, which is called the fault feature frequency signal. In practice, IOCs tend to work under non-stationary power transmission, which results in additional aperiodic pulses. As a result, the traditional feature analysis method based on the envelope is no longer applicable. Therefore, DWT is chosen to decompose the signal and obtain signal features with different resolutions. The noise-reduced signal is input and decomposed by DWT, as calculated in equation (1). The DB4 wavelet is suitable for transient detection in electrical signals and is used as the mother wavelet function. It divides the signal into five detail components ($f_1$, $f_2$, $f_3$, $f_4$, $f_5$) and one approximation component ($A_1$). The five detail components represent the high-frequency content and the approximation component represents the low-frequency content. Signals with and without faults are decomposed using DWT, as shown in Figure 11.
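The structure of a five-level decomposition (five detail bands plus one approximation band) can be illustrated with a minimal pure-Python sketch. It uses the Haar wavelet as a stand-in because its filters are trivial to write out; this is an assumption made for illustration, whereas the paper uses DB4, whose longer filters would replace the sum and difference below. Function names are hypothetical.

```python
import math

def haar_step(signal):
    """One analysis step: split an even-length signal into approximation and detail halves."""
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels=5):
    """Return (approximation, [detail bands]), with the finest detail band first."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    details = []
    approx = list(signal)
    for _ in range(levels):
        # Each level re-splits the current approximation (Mallat-style cascade).
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

# A short synthetic signal; an 800,000-point signal is also divisible by 2**5.
signal = [float(i % 7) for i in range(64)]
approx, details = haar_decompose(signal, levels=5)
```

Each level halves the band length, so the detail bands of a 64-point signal have 32, 16, 8, 4 and 2 coefficients, and the final approximation has 2.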

V. IOC FAULT DETECTION BASED ON LSTM
LSTM is a sequence prediction model that predicts its output from the information embedded in a series of time steps. In recent years, researchers have applied LSTM to time series problems such as inventory, weather forecasting, and machine translation [32]-[34]. In these tasks, LSTM is usually superior to traditional machine learning models. Here it is used to detect IOC faults.

A. RECURRENT NEURAL NETWORK
Since the LSTM neural network is an enhanced recurrent neural network (RNN), the RNN is reviewed first.
The RNN can be regarded as a group of feedforward neural networks (FNNs) in which the hidden neurons of the previous time step are connected to the hidden neurons of the next time step. The hidden state $H_t$ is obtained by combining the hidden state of the previous iteration cycle, weighted by $W_h$, with the current input information, weighted by $W_x$. This process continues into the next iteration cycle, and so on. In this way, the RNN can take advantage of sequential information instead of regarding the signal as a combination of isolated points. The output of the current iteration cycle is based not only on the current input, but also on the information of the previous iteration cycle. The structure of the many-to-one RNN model is shown in Figure 12.

$X$ is the input and $H$ is the hidden-layer vector:
$$H_t = \tanh(W_h H_{t-1} + W_x X_t)$$
According to the chain rule, the network loss gradient is given by (12) and the derivative by (13). Combining (12) and (13), the gradient contains repeated factors of the form $\tanh'(W_h H_{t-1} + W_x X_t) \cdot W_h$. Since $\tanh' \le 1$, each factor may be less than 1 or greater than 1 depending on $W_h$, which causes the gradient to vanish or explode [35], [36]. This significantly affects the weight update and makes convergence difficult. LSTM uses gate control instead of the tanh activation function in the gradient flow, so it has better stability and performance.
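The vanishing and exploding behavior can be reproduced with a scalar toy RNN. The sketch below multiplies the per-step factors $\tanh'(W_h H_{t-1} + W_x X_t) \cdot W_h$ over a sequence; the scalar setting, the zero input sequence and the function names are assumptions chosen to keep the arithmetic transparent.

```python
import math

def rnn_forward(xs, w_h, w_x, h0=0.0):
    """Scalar RNN: H_t = tanh(w_h * H_{t-1} + w_x * X_t). Returns all hidden states."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_h * hs[-1] + w_x * x))
    return hs

def gradient_through_time(xs, w_h, w_x, h0=0.0):
    """Product over t of dH_t/dH_{t-1} = tanh'(w_h * H_{t-1} + w_x * X_t) * w_h."""
    hs = rnn_forward(xs, w_h, w_x, h0)
    grad = 1.0
    for t, x in enumerate(xs, start=1):
        pre = w_h * hs[t - 1] + w_x * x
        grad *= (1.0 - math.tanh(pre) ** 2) * w_h  # tanh'(z) = 1 - tanh(z)^2
    return grad

# With zero inputs and a zero initial state, tanh' = 1 and the product is w_h ** T.
steps = [0.0] * 50
vanishing = gradient_through_time(steps, w_h=0.5, w_x=1.0)
exploding = gradient_through_time(steps, w_h=1.5, w_x=1.0)
```

With 50 time steps, a recurrent weight of 0.5 drives the product toward zero, while 1.5 makes it grow by many orders of magnitude.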

B. LSTM ALGORITHM
Compared with the traditional RNN, LSTM introduces a specially designed unit that can accurately control the flow of hidden state information from one time step to the next [37]-[39]. The structure of the LSTM is shown in Figure 13, where $X_t$ and $H_t$ are the input vector and the network hidden state vector at time step $t$, respectively, and $C_t$ is the cell state vector stored in the external memory unit. The interaction among the cell state vector, the input vector and the hidden state vector is accomplished through the forget gate ($f_t$), the input gate ($i_t$) and the output gate ($o_t$).
The forget gate vector is calculated as
$$f_t = \sigma(W_f [H_{t-1}, X_t] + b_f)$$
where $[H_{t-1}, X_t]$ is the concatenation of the previous hidden state vector $H_{t-1}$ and the current input vector $X_t$, $W_f$ and $b_f$ are the weight and bias of $f_t$, which are determined by network training, and $\sigma$ is the sigmoid activation function. The flow of information in the vector $C_t$ is controlled by element-wise multiplication. The temporary state vector $\tilde{C}_t$ is calculated by
$$\tilde{C}_t = \tanh(W_c [H_{t-1}, X_t] + b_c)$$
where $W_c$ and $b_c$ are the weight and bias of $\tilde{C}_t$, and tanh is the hyperbolic tangent activation function. The input gate vector is calculated as
$$i_t = \sigma(W_i [H_{t-1}, X_t] + b_i)$$
where $W_i$ and $b_i$ are the weight and bias of $i_t$, which are determined by network training. The state $C_t$ at time step $t$ is updated as
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
The current hidden state is determined by the new state and the output gate $o_t$. Similar to $f_t$ and $i_t$, $o_t$ can be written as
$$o_t = \sigma(W_o [H_{t-1}, X_t] + b_o)$$
The hidden state of the current step $H_t$ is calculated as
$$H_t = o_t \odot \tanh(C_t)$$
$H_t$ is used to calculate the output of the current time step.
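A single LSTM time step and a many-to-one pass can be sketched in pure Python. The sketch follows the standard LSTM formulation with the symbols defined above; the parameter layout (a dictionary mapping gate names to weight and bias pairs) and all function names are assumptions made for illustration, not the configuration used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Compute W @ vec + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    concat = h_prev + x_t  # list concatenation: [H_{t-1}, X_t]
    f_t = [sigmoid(z) for z in affine(*params["f"], concat)]        # forget gate
    i_t = [sigmoid(z) for z in affine(*params["i"], concat)]        # input gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], concat)]  # temporary state
    o_t = [sigmoid(z) for z in affine(*params["o"], concat)]        # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def many_to_one(sequence, params, hidden_size):
    """Run the whole sequence and return only the final hidden state."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h

# Toy setup: hidden size 2, input size 1, so each W has 2 rows and 3 columns.
zero_w = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {g: (zero_w, [0.0, 0.0]) for g in ("f", "i", "c", "o")}
final_h = many_to_one([[0.3], [-0.1], [0.7]], params, hidden_size=2)
```

In a many-to-one classifier, the final hidden state would be passed to an output layer that produces the PD or Non-PD decision.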

VI. EXPERIMENTS AND DISCUSSIONS

A. FAULT SCENARIOS
Insulated conductors can improve the stability of power transmission and reduce the construction space compared with traditional bare conductors. Therefore, insulated conductors are increasingly used in overhead power transmission. However, the following two faults, which are difficult to detect, may occur in an IOC:
(1) The standard protection devices used for bare conductor systems are often unable to detect the phase-to-ground fault of an IOC. Because of the insulation cover, a phase-to-ground fault is unlikely to cause an overcurrent when the IOC breaks and falls to the ground. This creates a risky situation for people in close proximity to the fallen conductors.
(2) Something such as a tree branch hitting the conductors will not be detected by the upstream protection devices. This can become a reliability threat in locations where tree branches and similar objects frequently hit the conductors due to wind or continuously push and bend them. Eventually, the power line will be damaged, causing a power outage or starting a tree fire.

B. EVALUATION INDICATORS
Accuracy rate and error rate are two commonly used indexes for evaluating a classification model. However, they are not suitable for the imbalanced ENET data set, because they assume that each class of samples is equally important. In the ENET data set, it is more important to classify the minority class correctly than the majority class. Therefore, precision rate and recall rate are more suitable for ENET data analysis than accuracy rate and error rate. The minority class is recorded as positive examples and the majority class is recorded as negative examples. The labels of true results and predicted results are shown in Table 1. TP (True Positive) indicates that the prediction is positive and correct. TN (True Negative) indicates that the prediction is negative and correct. FP (False Positive) indicates that the prediction is positive and wrong. FN (False Negative) indicates that the prediction is negative and wrong.
The precision rate (P) is calculated as
$$P = \frac{TP}{TP + FP}$$
The recall rate (R) is calculated as
$$R = \frac{TP}{TP + FN}$$
The F1-Score is the harmonic mean of precision and recall:
$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
The Matthews Correlation Coefficient (MCC) is an index used to measure the performance of binary classification. It is a correlation coefficient between the actual classification and the predicted classification. In its definition, $(TP + FN)(TN + FP)$ is a constant for a given data set, because $TP + FN$ and $TN + FP$ are the numbers of actual positive and actual negative samples. The expression of the Matthews correlation coefficient is:
$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
In order to comprehensively evaluate the model performance, the precision rate, recall rate, F1-Score, and MCC are selected as the evaluation indexes.
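The four indexes can be computed directly from the confusion-matrix counts. The sketch below also reports accuracy to show why it is misleading here: a classifier that labels every ENET signal as Non-PD is right 8,186 times out of 8,711, yet its recall, F1-Score and MCC are all zero. Returning 0 when a denominator is zero is a convention assumed for this sketch, and the function name is hypothetical.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1, MCC and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1,
            "mcc": mcc, "accuracy": accuracy}

# A trivial classifier that predicts Non-PD for all 8,711 ENET signals.
always_negative = classification_metrics(tp=0, tn=8186, fp=0, fn=525)
```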

C. FUSION LAYER SELECTION BASED ON F1-SCORE
The fused signal data from different layers are used for IOC fault detection. The many-to-one LSTM structure is shown in Table 2. The features of each layer and their corresponding labels are shown in Table 3. There are six sets of comparative tests, as shown in Table 4. The signal data of different decomposition levels are obtained by DWT decomposition and then fed into the LSTM model in 3D format. For each decomposition level, 1000 groups of 3-phase signals are randomly selected as training data. There are 800,000 data points in each phase signal. When the number of time steps is set to 160, 800000/160 = 5000 data points are taken per time step for each phase signal. In order to reduce the input vector dimension, the average value of every 50 data points is taken as a new data point, so each time step of a phase signal consists of 5000/50 = 100 data points.
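The reshaping arithmetic can be checked with a short sketch. It follows one reading of the description above, in which a phase signal is cut into 160 consecutive time steps and each step is block-averaged in groups of 50 points; how the three phases and the decomposition levels are stacked into the final 3D tensor is not specified here and is left out. The function name is hypothetical.

```python
def to_lstm_steps(signal, time_steps=160, pool=50):
    """Reshape one phase signal into time_steps rows of block-averaged features."""
    step_len, rem = divmod(len(signal), time_steps)
    if rem or step_len % pool:
        raise ValueError("signal length must be divisible by time_steps * pool")
    rows = []
    for t in range(time_steps):
        chunk = signal[t * step_len:(t + 1) * step_len]
        # Average every `pool` consecutive points to shrink the feature dimension.
        rows.append([sum(chunk[i:i + pool]) / pool for i in range(0, step_len, pool)])
    return rows

# A ramp with 800,000 points stands in for one phase signal.
phase = [float(i) for i in range(800000)]
steps = to_lstm_steps(phase)  # 160 rows, each with 5000 / 50 = 100 features
```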
The F1-score of the different fusion layers is shown in Figure 14. It can be seen that the LSTM classifier with four layers of features ($A_1$, $f_2$, $f_3$ and $f_4$) obtains the best classification results. Therefore, this combination is selected as the input to the final classifier. In contrast, two layers of features cannot capture sufficiently fine-grained feature changes, while five layers of features may amplify less meaningful feature changes and cause overfitting, which reduces classification performance.

D. THE EFFECT OF NOISE REDUCTION AND OVERSAMPLING
In the ENET data set, there are 525 PD signal samples and 8,186 Non-PD signal samples. The number of samples of each type is shown in Figure 15, where 0 represents normal samples and 1 represents fault samples. The number of normal samples is about 16 times that of fault samples, so there is a serious imbalance between normal samples and fault samples. This imbalance is addressed with the Synthetic Minority Oversampling Technique (SMOTE), which is an improved scheme based on the random oversampling algorithm. Because random oversampling simply copies minority samples to increase their number, it easily leads to overfitting. The basic idea of the SMOTE algorithm is to analyze the minority samples and artificially synthesize new samples to add to the data set.
1) For each sample $x$ in the minority class, calculate its Euclidean distance to all other samples in the minority set and obtain its $k$ nearest neighbors.
2) Determine the sampling ratio $N$ according to the sample imbalance ratio. For each minority sample $x$, randomly select several samples from its $k$ nearest neighbors. Denote a selected neighbor by $x_n$.
3) For each randomly selected neighbor $x_n$, construct a new sample according to the following formula:
$$x_{new} = x + rand(0, 1) \times (x_n - x)$$
In order to evaluate the effects of noise reduction and oversampling, the F1 score is used for the test. The test results are shown in Figure 16. It can be seen that both oversampling and noise reduction improve the classification performance. If oversampling is not used, the classifier focuses on normal signals, which reduces its ability to identify fault signals. Wavelet noise reduction removes the noise contained in the signals, which reduces the difficulty of LSTM feature extraction and improves the classification accuracy.
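The three SMOTE steps can be sketched in pure Python using the standard interpolation $x + rand(0, 1) \times (x_n - x)$. This is a simplified illustration: it draws a fixed number of synthetic samples in total instead of applying a per-sample sampling ratio $N$, and the function name, the default $k$ and the fixed seed are assumptions.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Synthesize n_new samples by interpolating between minority samples and neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Step 1: k nearest minority neighbors of x by Euclidean distance.
        others = [p for p in minority if p is not x]
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        # Step 2: pick one neighbor at random.
        x_n = rng.choice(neighbors)
        # Step 3: interpolate on the segment between x and x_n.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic

# Four toy minority samples on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=10, k=2)
```

Because every synthetic sample lies on a segment between two real minority samples, the new points stay inside the region already occupied by the minority class instead of duplicating existing points.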

E. THE COMPARISON OF DIFFERENT ALGORITHMS
The loss value of the proposed model during training is shown in Figure 17. It can be seen that the loss and accuracy tend to converge when the number of iterations is about 80. Throughout the process, the loss on the validation data set drops faster and fluctuates less than the loss on the training data set, and it eventually converges to 0.08, which may be caused by the smaller size of the validation data set (training data : validation data = 7:3). The accuracy measured by the Matthews correlation coefficient is shown in Figure 18. It can be seen that it tends to converge when the number of training iterations reaches about 80. The training value is slightly higher than the validation value and reaches 0.86. The values of the different evaluation indexes are shown in Table 5. The fuzzy neural network (FNN) combines the advantages of neural network systems and fuzzy systems, and it is well suited to dealing with non-linearity and ambiguity. The Support Vector Machine (SVM) is a supervised learning algorithm used for data classification [40], [41]. XGBoost is an open source machine learning project that provides an efficient implementation of the GBDT algorithm. The MLR algorithm learns non-linear relationships between features directly in the original space. The proposed model is compared with five classifiers (FNN, SVM, XGBoost, MLR, and LSTM), and the results are shown in Table 6. It can be seen that the accuracy of the proposed model is better than that of the other models.

VII. CONCLUSION
To address the problem that phase-to-ground and phase-to-phase faults of the IOC are difficult to detect, this paper proposes a method that combines DWT and LSTM and detects these faults through partial discharge. The following conclusions are drawn.
(1) DWT can effectively reduce the noise of the original voltage signal, so that the signal features are more obvious.
(2) DWT is chosen to decompose the noise-reduced signal and obtain signal features at different resolutions. The DB4 wavelet is suitable for transient detection in electrical signals and is used as the mother wavelet function.
(3) In some time series problems, LSTM is usually superior to traditional machine learning methods and can enhance the PD recognition performance.
(4) Different algorithms are compared on the ENET data set. The results show that the DWT-LSTM method provides the best classification results as measured by the F1-Score.
The proposed method based on DWT-LSTM and partial discharge is suitable for IOC fault detection. In the future, we will use two methods to further improve the accuracy of fault detection. The first will use bidirectional LSTM to extract the features of fault samples and normal samples. The second will use few-shot learning, transfer learning and other means to solve the problem of rare fault samples.