Fusion of Infrared-Visible Images in UE-IoT for Fault Point Detection Based on GAN

In the development of the modern intelligent Ubiquitous Electric Internet of Things (UE-IoT), infrared thermal imaging plays an important role in the automated, non-intrusive early-warning detection of developing failures of critical assets in electrical power substations, such as transformers, disconnect switches, and capacitor banks. However, the low resolution and contrast of infrared images hinder the subsequent analysis and recognition of fault points. In contrast, visible images present abundant texture details of the equipment but carry no thermal information. To assist the detection of fault points, this paper proposes a generative adversarial network (GAN) based infrared and visible image fusion method that produces a composite image with enhanced edges and better quality. An edge loss function is added to represent perceptual edges. In the discriminator, the proposed method improves the texture similarity between the fusion image and the visible image by minimizing the Wasserstein distance in the VGG (Visual Geometry Group network) feature space. The experimental results show that the fault regions become more salient and the details are enhanced, which facilitates the reliable and accurate detection of fault points.


I. INTRODUCTION
Since the security and reliability of power infrastructure are more important than ever before, preventing the breakdown of power devices due to different faults has become a main concern. Considering the complex relationship between symptoms and faults, it is not accurate to perform fault diagnosis based on a single information source. Worse, it may lead to false or missed judgements [1]. Therefore, comprehensive analysis and evaluation of multiple information sources is needed. In recent years, the proliferation of the Ubiquitous Electric Internet of Things (UE-IoT) in the electric power system has led to the deployment of various sensor devices, which make traditional devices and environments more intelligent. Consequently, improved fault detection and remote monitoring can be implemented.
Intelligent inspection robots and unmanned aerial vehicles (UAVs) have been widely used as terminal nodes in the inspection of substations and overhead lines in the Internet of Things [2]. These robots and UAVs often carry infrared and visible-light dual sensors to collect images of relevant equipment and then send them to cloud platforms for analysis and processing [3]. An infrared sensor can measure the thermal radiation of an object non-intrusively in real time, and is therefore often used for fault detection [4]. Lopez-Perez and Antonino-Daviu [5] applied infrared images to the fault detection of induction motors. Zhou et al. [6] designed a fault diagnosis and remote detection system to realize the analysis and processing of infrared images of insulators. Qin et al. [7] used infrared thermal imaging technology to monitor the process of electrical performance testing and designed an infrared monitoring system for thermal batteries. Cui et al. [8] designed a system to monitor electrical equipment based on infrared images in the UE-IoT. Although infrared images are widely used for fault detection in the UE-IoT, relying on the single infrared image obtained by an infrared sensor is not accurate enough. The infrared image has fewer details and low contrast and resolution, which leads to inaccurate fault detection. In contrast, the visible image obtained by the optical sensor retains abundant texture information of the power equipment. Therefore, by performing edge computing at the terminal and using image fusion algorithms, the useful information from different sensors can be integrated, which can not only facilitate fault identification and detection [9] but also compress data and save communication traffic [10].
For decades, numerous infrared and visible image fusion methods have been proposed. These methods can be mainly divided into five categories: subspace [11], multi-scale transform [12]-[15], sparse representation [16], [17], hybrid model [18], [19], and convolutional neural network (CNN) [20]-[22] based methods. Subspace based fusion methods convert whole images into uncorrelated components with non-parametric techniques such as PCA or ICA, and the PCA or ICA coefficient matrices are then weighted by the Piella metric. Multi-scale transform based methods are the most widely used in image fusion; they decompose the source images into several sub-bands at different scales, apply certain fusion rules to fuse the sub-bands, and finally reconstruct the fusion image by performing the inverse transform [23]. Multi-scale transforms, including the pyramid transform [12], wavelet transform [13], non-subsampled contourlet transform (NSCT) [14], and non-subsampled shearlet transform (NSST), have all been introduced in image fusion. Although multi-scale transform based fusion methods were once the mainstream, they fail to capture texture details efficiently compared with edge features, because they adopt fixed basis functions. Therefore, ringing artifacts are often visible in their fusion images. Unlike the multi-scale transform based fusion methods, sparse representation based methods divide the source images into overlapping image patches and sparsely represent the patches as linear combinations of atoms from a learned over-complete dictionary. The representation coefficients are then fused according to designed fusion rules, and the fusion image is finally generated through a reconstruction algorithm. Sparse representation based methods can express structured features more efficiently, benefiting from a more flexible choice of basis functions. ASR [16] and CSR [17] are typical examples of this category.
Hybrid model-based methods synthesize the advantages of different schemes to improve the fusion quality. Ding et al. [18] introduced a fusion method based on the Shearlet transform and sparse PCA features.
In addition to the above methods, infrared-visible image fusion based on CNNs has also received great attention in recent years. Liu et al. [20] proposed a new method for infrared and visible image fusion based on a convolutional neural network. Li and Wu [21] studied an auto-encoder-decoder network, in which dense blocks were used in the encoder to extract features, and these features were fused in the decoder to obtain the fusion image. Zhang et al. [22] proposed a general image fusion framework based on CNNs, which adopted two convolutional layers to extract salient image features from multiple input images and then fused these features via feature fusion rules. Finally, two convolutional layers were used to combine and reconstruct these features to obtain the fusion image. CNN based fusion methods can extract more meaningful features and learn the filter parameters for the image fusion task.
The above-mentioned fusion methods apply the same representation framework to all input images, which is not appropriate for the specific fault detection task. In fact, the thermal information in the infrared image should be preserved in the fusion image, because it reflects the important thermal radiation of devices. On the other hand, the visible image shows the texture structures of devices, which should also be reinforced in the fusion image. In addition, most fusion methods in the literature choose fusion rules artificially, which is somewhat rigid and lacks adaptability.
Considering the application of image fusion to fault detection in the UE-IoT, this paper proposes a fusion method for infrared and visible images based on an improved Wasserstein Generative Adversarial Network (WGAN-gp). A GAN includes two parts, the generator and the discriminator, contending with each other in a minimax game. Later, WGAN [24] was proposed to improve performance and maintain the stability of the training process. GANs show great potential to generate realistic images and have been introduced in various vision tasks, such as image registration [25], image generation [26], super-resolution [27], and image denoising [28]. In our design, the generator is supposed to preserve the thermal information and the visible gradients as much as possible, while the discriminator takes the texture features in the visible image as the constraints. To make the GAN converge faster, a gradient penalty is added during the discriminator update.
The main contributions of this paper are as follows: (1) Our proposed image fusion method is based on GANs, which is a fully end-to-end model and the fusion image can be generated without explicitly designing fusion rules.
(2) The VGG network is employed in the update process of the generator and the discriminator, which helps improve the texture similarity between the fusion image and the visible image.
(3) Considering that the temperature of the equipment is an important basis for fault detection, we design a specialized loss function for infrared and visible image fusion in fault detection, in which the thermal information is given more weight than detail information.
The rest of this paper is organized as follows. The background and related works are presented in Section II. Section III describes the proposed fusion method. The experimental results and discussion are given in Section IV. Conclusions are presented in Section V.

II. RELATED WORK

A. THE UBIQUITOUS ELECTRIC INTERNET OF THINGS
The Ubiquitous Electric Internet of Things (UE-IoT) is a smart service system which implements modern information technology and advanced communication technologies. It not only realizes the interconnection of all things and human-computer interaction in all links of the power system, but also realizes the identification, perception, interconnection and control of grid infrastructure, people and their environment. The UE-IoT architecture is the same as other forms of IoT systems, and can be divided into four layers: the perception layer, network layer, platform layer, and application layer [29]. The network architecture is shown in Fig. 1.

1) PERCEPTION LAYER
The perception layer is the bottom layer of the ubiquitous electric Internet of Things, and it is also the hub that connects field devices and the network layer. Its main function is to realize information collection and signal processing [30]. Due to the large number of field device terminals to be connected to the perception layer, the terminal operating system is not uniform, and the data provided by the terminal needs to be integrated and processed by edge computing before reaching the platform layer.

2) NETWORK LAYER
The network layer is the second layer in the UE-IoT and consists of the access network and the transmission network, where the former undertakes the task of data access while the latter realizes the function of information transmission. As a link, the network layer connects the perception layer and the application layer; it is responsible for safely and reliably transmitting the information obtained by the perception layer to the application layer, after which information processing is performed according to different application requirements.

3) PLATFORM LAYER
The platform layer is located above the perception layer and the network layer, and below the application layer. It is the core of the UE-IoT, which is expected to integrate the huge amount of information resources in the network into a large interconnected network through computing power, and solve problems such as data storage, data mining, and privacy protection.

4) APPLICATION LAYER
The application layer is the top layer of the UE-IoT, and its role is to analyze the data transmitted from the network layer and apply it to some practical scenarios to provide users with a series of services.

B. GENERATIVE ADVERSARIAL NETWORKS
Generative adversarial networks (GANs) are algorithmic architectures proposed by Ian Goodfellow [31], which involve a generator and a discriminator. The generator is trained to produce plausible images with similar characteristics to the input samples, while the discriminator learns to distinguish the fake images from the real images and penalizes the generator for producing fake samples. This adversarial learning process between generator and discriminator continues until the generated data distribution cannot be distinguished from the real one.
Although the original GAN has achieved impressive results in generating realistic images, training a GAN model is challenging, facing problems of training instability and non-convergence. To solve these problems, many improved versions of GANs have been proposed. To make the adversarial training process controllable, the conditional Generative Adversarial Network (cGAN) [32] introduced conditional variables into the training process. To solve the gradient vanishing problem in original GANs, the Least Squares GAN (LSGAN) [33] adopted a least-squares loss function for the discriminator and thus performed more stably than the original GAN. It was argued [24] that the JS divergence used in the original GAN may result in gradient vanishing and a wrong optimization direction, so the Wasserstein distance was adopted instead of the JS divergence to measure the distance between two distributions, which greatly improved the stability of GANs. However, in practical applications, the Wasserstein GAN (WGAN) still suffers from unstable training and slow convergence, and its improvement over traditional GANs is not obvious in experiments. Gulrajani et al. [34] pointed out that such problems appear because WGAN uses weight clipping directly when enforcing the Lipschitz constraint. Thus, in the advanced version of WGAN, namely WGAN-gp, a gradient penalty is applied to meet the Lipschitz continuity constraint, which makes the training process more stable and converge faster than WGAN.

III. PROPOSED METHOD

A. MOTIVATION
Although the existing fault detection methods of power equipment in UE-IoT have achieved remarkable success, there still exist some problems. Most fault detection methods only consider infrared images of power equipment. However, the low resolution of infrared images and blurred target edges make it difficult to locate the fault accurately. On the other hand, the fast development of information technologies puts forward higher requirements for the accuracy and speed of fault detection technology for UE-IoT power equipment.
To integrate the thermal radiation distribution from infrared image and the texture details from visible image into one fusion image, we propose an infrared and visible image fusion method based on improved Wasserstein generative adversarial networks. In order to preserve more edge information in the visible image, we use the first three convolutional layers of VGG19 networks [35] as feature extractors, and calculate the edge loss between extracted features of the generated image and the visible image.

B. METHOD
We treat the fusion of infrared and visible images as an adversarial learning problem between the generator and the discriminator in a GAN. Our work can be divided into two phases, training and testing, as shown in Fig. 2. In the training phase, a pair of infrared and visible images is first stacked and fed into the generator. The generator extracts features from the input images, integrates them, and then generates the fusion image under the guidance of the loss function. Next, features of the fusion image and the visible image extracted by the trained VGG19 are fed into the discriminator. The discriminator minimizes the difference between the fusion image and the visible image in the feature space via the Wasserstein distance. This adversarial learning process between the generator and the discriminator continues until the maximum number of iterations of the adversarial network is reached. In the testing phase, we only input the infrared and visible image pair into the trained generator; the output of the generator is the final fusion image.

C. ARCHITECTURE

1) GENERATOR ARCHITECTURE
As shown in Fig. 3 (a), our generator is composed of five convolutional layers. First, the stacked infrared and visible images are fed into the first layer, where 7 × 7 filters are applied to extract low-level features. These features are then input to the following layer, in which 5 × 5 filters are used. In the third and fourth layers, we apply 3 × 3 filters to further extract high-level features. In order to keep the output image the same size as the input image, the filter size in the last layer is set to 1 × 1. Considering that down-sampling and padding operations lead to the loss of image details and artifacts around the image boundary [36], the convolutional layers use neither down-sampling nor padding. Besides, each convolutional layer adopts the leaky ReLU activation function. Furthermore, batch normalization is used in the first four layers but not in the last one. All parameter settings of the generator are shown in TABLE 1.
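Since the generator uses only valid (unpadded) convolutions, each k × k layer shrinks the feature map by k − 1 pixels in each dimension. The following plain-Python sketch (layer filter sizes from the text above) traces how a 78 × 78 padded input, as used later in training, shrinks back to 64 × 64 at the output:

```python
def valid_conv_out(size, kernel, stride=1):
    """Output size of one convolution with no padding."""
    return (size - kernel) // stride + 1

def generator_output_size(size):
    """Trace the input size through the five generator layers (7, 5, 3, 3, 1)."""
    for kernel in (7, 5, 3, 3, 1):
        size = valid_conv_out(size, kernel)
    return size

print(generator_output_size(78))  # -> 64
```

The total shrinkage is 6 + 4 + 2 + 2 + 0 = 14 pixels per dimension, which is exactly what the padding compensates for.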

2) DISCRIMINATOR ARCHITECTURE
The discriminator consists of four convolutional layers and a fully connected layer, as shown in Fig. 3 (b). The filter size of the first two layers is 5 × 5, whereas 3 × 3 filters are used in the last two layers. After each convolutional layer, we adopt layer normalization rather than batch normalization, because we penalize the discriminator with respect to each input independently, not the entire batch. In addition, we use the leaky ReLU activation function for the first four layers, and the stride of the filters is set to 2. It should be noted that the discriminator in traditional GANs is simply a classifier, which evaluates the probability that the input image is real. In contrast, the aim of the discriminator in WGANs is to measure the distance between the input image distribution and the real image distribution, which can be regarded as a linear regression problem. Thus, we use a linear layer as the last layer and remove the sigmoid layer. Parameter settings of the discriminator are listed in TABLE 1.

D. LOSS FUNCTION

1) GENERATOR LOSS
Since the generator is designed to generate a fusion image which preserves as much thermal information from the infrared image and detail information from the visible image as possible, we update the generator under the guidance of this information. The loss function of the generator is designed as:

L_G = L_adversal + θ · L_edge + γ · L_thermal

As shown above, the generator loss consists of three parts: the adversarial loss L_adversal, the edge information loss L_edge, and the thermal information loss L_thermal, where θ and γ are weighting factors. Considering that temperature variation is an important indicator for fault detection, we assume that the thermal information in the infrared image is more important than the details in the visible image for our purpose. Thus, we prefer to preserve the thermal information in the fusion image during the update of the generator, and we set θ to 1 and γ to 10.
L_thermal is the thermal information loss between the fusion image and the infrared image. Since the thermal information of the infrared image is represented by its pixel intensity [37], we define L_thermal as:

L_thermal = ‖I_f − I_r‖_F²

where ‖·‖_F denotes the matrix Frobenius norm, and I_f and I_r represent the fusion image and the infrared image, respectively.
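As a concrete illustration, the thermal loss is just the squared Frobenius norm of the pixel-wise difference. This is a minimal NumPy sketch (not the authors' code; images are assumed to be float arrays of equal shape):

```python
import numpy as np

def thermal_loss(fused, infrared):
    """L_thermal: squared Frobenius norm of the intensity difference."""
    diff = np.asarray(fused, dtype=np.float64) - np.asarray(infrared, dtype=np.float64)
    return np.linalg.norm(diff, ord="fro") ** 2

infrared = np.array([[0.2, 0.8], [0.5, 1.0]])
fused = np.array([[0.2, 0.7], [0.5, 1.0]])  # one pixel differs by 0.1
loss = thermal_loss(fused, infrared)        # 0.1**2 = 0.01
```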
To capture the texture and edge information in the visible image, the style loss proposed for the image style transfer task [38] is employed. It extracts feature maps with the VGG19 network and evaluates the difference between the feature maps of the fusion image and the visible image at different layers:

L_edge = Σ_k ‖G(ϕ_k(I_f)) − G(ϕ_k(I_v))‖_F²

where G(·) is the Gram matrix of a feature map and ϕ_k(·) denotes the feature map in the k-th layer of the trained VGG19. The size of the feature map in the k-th layer is C_k × H_k × W_k; for convenience of calculation, the feature maps are reshaped into C_k × H_k W_k. L_adversal represents the adversarial loss between the generator and the discriminator, which is defined as:

L_adversal = −(1/N) Σ_{n=1}^{N} D_θD(ϕ_K(I_f^n))

where I_f^n denotes the n-th fusion image in the training set, ϕ_K(·) represents the feature map of the last layer extracted by the VGG19, and D_θD(·) denotes the output of the discriminator.
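The Gram-matrix computation at the heart of the style/edge loss can be sketched as follows: the C × H × W feature map is reshaped to C × HW and multiplied by its transpose. This NumPy illustration is not the authors' code, and the 1/(C·H·W) normalization is an assumption borrowed from common style-transfer implementations:

```python
import numpy as np

def gram_matrix(feat):
    """Gram matrix of a C x H x W feature map, reshaped to C x (H*W)."""
    c, h, w = feat.shape
    f = feat.reshape(c, h * w)
    return f @ f.T / (c * h * w)  # C x C, normalized by the map size

def edge_loss(feats_fused, feats_visible):
    """Sum over layers of squared Frobenius distances between Gram matrices."""
    return sum(
        np.linalg.norm(gram_matrix(ff) - gram_matrix(fv), ord="fro") ** 2
        for ff, fv in zip(feats_fused, feats_visible)
    )
```

In practice the inputs would be VGG19 activations of the fusion and visible images; identical feature maps give a loss of exactly zero.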

2) DISCRIMINATOR LOSS
To solve the problems of unstable training and slow convergence in the GAN training process, WGAN-gp [34] added a gradient penalty term to the update of the discriminator.
Inspired by that, we also introduce the gradient penalty in our work, and fine-tune the loss function based on the original loss in [34]:

L_D = E[D_θD(ϕ_K(I_f))] − E[D_θD(ϕ_K(I_v))] + λ · L_penalty

where the first two terms are included in the original discriminator loss and λ is a penalty coefficient. L_penalty is the gradient penalty term of the discriminator, defined as:

L_penalty = E[(‖∇_{I_α} D_θD(I_α)‖₂ − 1)²]

where ∇ denotes the gradient operator, α is a random number in (0, 1), and I_α = α · I_f^n + (1 − α) · I_v^n. I_α can be regarded as a random sample between the visible image and the fusion image.
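A minimal NumPy sketch of the interpolation and the penalty formula follows. It is illustrative only: in a real implementation the critic gradient ∇D(I_α) would come from the framework's automatic differentiation, so here it is simply passed in as an argument:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def interpolate(fused, visible):
    """I_alpha: random point on the segment between a fused and a visible sample."""
    alpha = rng.uniform(0.0, 1.0)
    return alpha * fused + (1.0 - alpha) * visible

def gradient_penalty(grad):
    """(||grad||_2 - 1)^2, driving the critic's gradient norm toward 1."""
    return (np.linalg.norm(np.ravel(grad)) - 1.0) ** 2
```

The penalty is zero exactly when the critic's gradient has unit norm, which is how WGAN-gp softly enforces the 1-Lipschitz constraint.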

IV. EXPERIMENTS
To verify the advantages of the proposed fusion model, we compare our method with five other state-of-the-art methods: adaptive sparse representation (ASR) [16], cross bilateral filter (CBF) [39], latent low-rank representation (LATLRR) [40], ResNet and zero-phase component analysis (ResZca) [41], and visual saliency map with weighted least squares optimization (WLS) [19]. All these methods are implemented with their public codes, and the parameters are set as in the original papers. Experiments are performed on a workstation with a 3.2 GHz Intel Xeon CPU W-2104, a GeForce RTX 2080 GPU, and 8 GB of memory.

A. TRAINING DETAILS
We select 49 pairs of infrared and visible images from the TNO database as our training data. However, this is far from enough to train a good model, so we crop each image into a set of 64 × 64 image patches with an overlapping scheme (i.e., the stride is 16). Thus, we obtain 45027 pairs of infrared and visible image patches. Since there is no padding operation in the convolutional layers, we pad these patches to the size of 78 × 78 before feeding them into the generator. In this way, the generated fusion image patch has the same 64 × 64 size as the original patch. Both the generator and the discriminator use the Adam [42] solver with a learning rate of 0.002. The penalty coefficient λ is set to 15, the batch size is set to 1, and the maximum number of iterations of the GAN is set to 20000. The procedure and parameter settings are summarized in Algorithm 1.
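The overlapping cropping scheme can be sketched as follows (a minimal NumPy version for a single grayscale image; the function and its name are illustrative, not the authors' code):

```python
import numpy as np

def extract_patches(img, size=64, stride=16):
    """Crop all overlapping size x size patches with the given stride."""
    h, w = img.shape
    return [
        img[i:i + size, j:j + size]
        for i in range(0, h - size + 1, stride)
        for j in range(0, w - size + 1, stride)
    ]

patches = extract_patches(np.zeros((128, 128)))
# ((128 - 64) // 16 + 1) ** 2 = 25 patches for a 128 x 128 image
```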

B. FUSION METRICS
In recent years, numerous objective metrics have been proposed to evaluate the quality of the fusion image on the basis of the human visual system. Such metrics are usually carefully designed by measuring the details or distortion in the fusion image. Here, we compare the performance of the fusion methods under three evaluation indexes: structural similarity (SSIM) [43], mutual information (MI) [44], and the human perception inspired index Q_C [45]. The two source images are referenced as A and B, and the fusion image is denoted as F. The image variable X can be A or B.

1) STRUCTURE SIMILARITY
SSIM is a common model that measures the structural similarity between the source images and the fusion image. It models distortion as a combination of three factors: brightness, contrast, and structure. For a source image X and the fusion image F, the SSIM is defined as:

SSIM_XF = (2 μ_X μ_F + C_1)(2 σ_XF + C_2) / ((μ_X² + μ_F² + C_1)(σ_X² + σ_F² + C_2))

where μ and σ represent the mean and standard deviation of an image patch, σ_XF indicates the covariance between the source and fusion image patches, and C_1 and C_2 are small constants that stabilize the division. Sliding windows are applied to calculate the SSIM of the whole image. The closer the value of SSIM is to 1, the better the quality of the image.
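A global (single-window) version of the SSIM formula can be written directly. This NumPy sketch uses the conventional stabilizing constants C_1 = 0.01² and C_2 = 0.03² for images scaled to [0, 1]; both the constants and the single-window simplification are assumptions, since the paper applies SSIM over sliding windows:

```python
import numpy as np

def ssim(x, f, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global SSIM between a source image x and the fusion image f."""
    mu_x, mu_f = x.mean(), f.mean()
    var_x, var_f = x.var(), f.var()
    cov_xf = ((x - mu_x) * (f - mu_f)).mean()
    return ((2 * mu_x * mu_f + c1) * (2 * cov_xf + c2)) / (
        (mu_x ** 2 + mu_f ** 2 + c1) * (var_x + var_f + c2)
    )
```

By construction the value never exceeds 1, and an image compared with itself scores exactly 1.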

2) MUTUAL INFORMATION
Mutual information (MI) is a metric that measures how much information is transmitted from the source images to the fusion image. The MI metric is defined as:

Q_MI = MI_AF + MI_BF

where MI_AF and MI_BF are the mutual information between the fusion image and the source images A and B, defined as follows:

MI_XF = Σ_{x,f} P_XF(x, f) · log( P_XF(x, f) / (P_X(x) · P_F(f)) )

where P_XF represents the joint gray-level histogram of the source and the fusion image, and P_X and P_F denote the normalized gray-level histograms of the source image and the fusion image. A greater Q_MI value indicates better fusion image quality.
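The MI terms can be computed from gray-level histograms. A minimal NumPy sketch follows (the 256-bin quantization is an assumption; base-2 logarithms give the result in bits):

```python
import numpy as np

def mutual_information(x, f, bins=256):
    """MI_XF from the normalized joint and marginal gray-level histograms."""
    joint, _, _ = np.histogram2d(np.ravel(x), np.ravel(f), bins=bins)
    p_xf = joint / joint.sum()
    p_x = p_xf.sum(axis=1, keepdims=True)   # marginal of the source image
    p_f = p_xf.sum(axis=0, keepdims=True)   # marginal of the fusion image
    nz = p_xf > 0                           # skip zero cells to avoid log(0)
    return float(np.sum(p_xf[nz] * np.log2(p_xf[nz] / (p_x @ p_f)[nz])))

def q_mi(a, b, fused):
    """Q_MI = MI_AF + MI_BF."""
    return mutual_information(a, fused) + mutual_information(b, fused)
```

As a sanity check, the MI of an image with itself equals its entropy, and MI with a constant image is zero.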

3) HUMAN PERCEPTION INSPIRED INDEX
Q_C is a human perception inspired index; it measures the visual effect of the fusion image by comparing the contrast characteristics of the fusion image and the source images. The Q_C is defined as follows:

Q_C = λ_A · Q_AF + λ_B · Q_BF

where Q_AF and Q_BF represent the degree of contrast information of source image A and source image B retained in the fusion image, respectively, and λ_A and λ_B are the saliency-map weights of Q_AF and Q_BF.

C. OBJECTIVE COMPARISONS
In order to compare the fusion effects of the various methods in different environments, we select four pairs of infrared and visible images of power equipment as experimental images, showing a transformer, an insulator, an arrester, and a breaker. The sizes of these four pairs of images are 296 × 196, 496 × 500, 256 × 256, and 1080 × 900. The fusion results are shown in Fig. 4. For the ''transformer'', most methods integrate the information well in the fusion image, except that the CBF-based method presents unexpected blocking artifacts around the pillar of the insulating sleeve and the ASR-based method shows obvious speckles near the fuel tank edge of the transformer. As to the ''insulator'', the fusion image produced by LATLRR shows some artifacts around the transmission tower, and in the WLS-based fusion image the edges of the transmission tower are more blurred than in the other fusion results. ResZca and our proposed method perform well on the first two pairs of images, but the ResZca-based fusion image shows obvious pseudo-Gibbs effects in the neighborhood of the voltage stabilization ring in the third row and the wires in the fourth row, while our proposed method avoids such artifacts. The texture and edge information in the visible images is well preserved. Furthermore, the target in our fusion image is more salient than in all other methods, which indicates that our method keeps more thermal information from the infrared image.

D. QUANTITATIVE COMPARISONS
The quantitative comparison of the three metrics is shown in Table 2. For SSIM, the proposed fusion method obtains the largest values on the first three fusion images, which indicates that more texture and details are preserved in the fusion image. For MI and Q_C, the values of our last three fusion images are significantly higher than those of the other methods, especially on the last fusion image, which indicates that our method not only retains more information from the original images but is also friendlier to the human perception system. Furthermore, considering fault detection for power equipment in the UE-IoT, real-time performance is also a very important criterion. The time costs of the different methods are listed in Table 3. As we can see, our method takes much less time than the other algorithms, which means it is better suited to meet the real-time requirements of fault detection in the UE-IoT.

V. CONCLUSION
In this paper, we propose an infrared and visible image fusion method based on an improved Wasserstein generative adversarial network for fault point detection in the UE-IoT. We adopt the trained VGG19 model to extract the features of the generated fusion image and the visible image, and minimize the style differences between these features to ensure that the fusion image obtains enough edge details. Besides, considering the actual requirements of fault detection, we emphasize the importance of the thermal information during the training process. Our method is an end-to-end model, which bypasses designing fusion rules manually. To verify the performance of the proposed method, we compare it with five other methods. The experimental results show that the fusion image obtained by our method preserves more details from the source images and has better visual fidelity.