Sparse GANs for Thermal Infrared Image Generation From Optical Image

Thermal infrared (TIR) images are not influenced by illumination variations and can be captured in total darkness. These advantages give TIR technology wide application in surveillance systems and various defense systems. However, TIR images are too scarce for many applications because thermal infrared imaging equipment is expensive and demands strict imaging conditions. To address this problem, we propose a sparse generative model based on the pix2pix framework to produce synthetic TIR data from optical RGB images. Because TIR images contain little texture and color information, the model uses a U-net architecture but selects only part of the low-level and high-level information for symmetric connections. Specifically, we integrate intensity and gradient losses into the training objective, which helps the generator learn more of the characteristics of infrared images. Experiments on public datasets show that the proposed method can generate TIR data from optical images. Compared with the current pix2pix network, the method improves SSIM by over 6.5% and PSNR by over 1.2% on the public datasets. For daytime images, the SSIM gain reaches 7%. Meanwhile, the number of network parameters decreases by 13%.


I. INTRODUCTION
Image-to-image translation based on generative adversarial networks (GANs) has received increasing attention. It converts an image from one representation of a given scene to another and has found widespread application in tasks such as image colorization [1], [2], translating art paintings to photo-realistic images and the reverse [3]-[5], aerial images to maps [6], and daylight images to night images and the reverse [7]. Nevertheless, there is still little work on synthesizing thermal infrared datasets from optical images. Optical images present information similar to what a human can see but are sensitive to illumination variation, while TIR images make up for this weakness because they record differences in thermal radiation. In certain fields such as military and environmental monitoring, TIR images are more useful than optical images because they are independent of the quality of the environment. Compared with optical cameras, TIR equipment is not only expensive but also demands strict testing conditions, which makes TIR images less available than visible images. To overcome these restrictions, this paper applies GANs to construct TIR data from easily obtained optical RGB images. A supervised framework based on pix2pix GANs [6] is established for optical-to-TIR translation using labeled data.
The associate editor coordinating the review of this manuscript and approving it for publication was Mingjun Dai.
The pix2pix framework is the first GAN-based image-to-image translation work to produce strong results in the unimodal image prediction setting when there is spatial correspondence between input and output pairs [6]. It contains an encoder-decoder architecture. During training, the learned encoder attempts to pass enough information to the generator to resolve any ambiguities regarding the output mode. Some generators use symmetric dense skip connections, which preserve both low-level and high-level information well. Our goal is to generate TIR data from the optical domain. TIR images present rich geometric structure and material properties by receiving thermal radiation, but have fewer details such as color and texture compared with visible images. Dense skip connections are therefore not needed for optical-to-TIR translation. This work instead seeks a new generator architecture with sparse skip connections, in which only some of the low-level and high-level layers are symmetrically connected. This model reduces both the number of model parameters and redundant details.
Many efforts have been made to train a robust image-to-image translation system. The straightforward approach is to constrain network training with an objective. Some networks minimize the L2 distance between predicted and ground-truth pixels [8]. This tends to produce blurry results because the L2 distance averages all plausible outputs. To get clear structure and vivid details, some networks use an L1 loss in pixel space [9]. Besides the GAN loss, a perceptual loss has also been used in image-to-image translation tasks [10]. Although several kinds of losses are available to evaluate the difference between ground-truth and synthetic images, most current networks aim to produce vivid optical results. To get realistic TIR images, this work adds constraints to the objective that are suited to TIR data.
Gradient and intensity losses between the ground-truth and generated images are added to the objective, and SGD is then used to optimize the network parameters. With this strategy, we can synthesize high-quality TIR images. In summary, the main contributions of this paper to the style transfer problem are:
• We propose a sparse "U-net" GAN that strikes a better balance between style and content. The sparse skip connections need fewer model parameters and drop redundant details while retaining enough information to produce good TIR data.
• We introduce new loss terms covering content and style. Content similarity is evaluated by a gradient loss, and the thermal radiation difference is constrained by an L1 intensity loss. These two losses make the synthetic images more similar to the ground-truth TIR images. The comparison results show that optimizing the resulting objective yields better TIR data.
• We perform extensive and detailed experiments to verify the performance of the presented strategies. Our proposed model performs better than current state-of-the-art models on the public dataset with fewer network parameters.

II. RELATED WORK
Current research on converting optical images into TIR data can be divided into two groups. One relies on manual processing and hand-crafted features to produce the TIR style; the other uses deep learning networks to learn the mapping between different styles without manual intervention. Luo et al. [11] present a physical model to simulate the different thermal radiation features of different targets. They divide a gray image into small parts and then manually segment the target object from the background. After that, each area is manually assigned an infrared radiation value. Wu et al. [12] use the histogram difference between optical and infrared image pairs to convert optical images to infrared images. Li et al. [13] propose a neural network-based infrared image generation method that predicts the temperature of targets made of different materials, but manual segmentation is still needed. These methods can take into account rich physical and low-level features of different targets, but the required image segmentation makes it difficult to process large quantities of images.
On the other hand, GANs have achieved promising results in image style transfer. This framework is good at capturing the data distribution by learning from a large dataset. Mirza et al. propose conditional GANs (CGANs) [14] to learn the mapping between inputs and outputs using conditional convolutional neural networks. Isola et al. propose pix2pixGAN [6] based on CGANs. Pix2pixGAN mixes the GAN objective with an L1 distance to keep the output near the ground truth. Zhang et al. [15] use this framework to translate labeled RGB data to TIR data. To present different radiation features for different scenes, Li et al. [16] integrate an extra scene classification network into multi-branch generators based on the pix2pix network. Although this method can achieve good results, training and testing are time-consuming, and wrong classification results sometimes lead to unrealistic IR images. As pix2pix needs paired training data, which are usually costly to obtain, Zhu et al. present CycleGAN [17], an unsupervised image-to-image translation framework. It learns the mapping between two unpaired image domains with the aid of a cycle-consistency loss. Apart from CycleGAN, many other GAN variants [18]-[20] have been proposed to tackle the cross-domain problem. Zhang et al. [15] also use an unpaired image-to-image framework to help generate TIR data, but it sometimes gives worse results than the supervised GAN because unsupervised models are easily affected by unpredictable content during the translation stage.
In this paper, we also use pix2pix GANs to realize optical-to-TIR style translation since it is a powerful framework for image-to-image translation. To reduce network complexity and account for infrared characteristics, we modify the network structure to ignore part of the low-level details and add content losses to the objective.

III. THE PROPOSED METHOD
Based on pix2pix GANs, our model (see Fig.1) uses a "U-net" architecture [21] as the generator G. Similar to [22], we replace pooling layers with strided convolutional layers, because pooling loses useful information during dimensionality reduction. The discriminator D is a "PatchGAN" classifier that captures local style statistics. The two deep networks compete against each other: the generator tries to generate samples that resemble real TIR images, whereas the discriminator tries to detect whether the samples generated by G are real.

A. SPARSE U-NET GENERATOR
The generator G has an encoder-decoder architecture with symmetric skip connections. Since TIR images have less color and texture information, the dense connections of the current "U-net" framework include much redundancy and too many parameters. In this paper, we use sparse connections to simplify the model. Fig.2 shows the model framework. It includes 16 sub-models and four pairs of symmetrical connections: between the first and fifteenth, third and thirteenth, fifth and eleventh, and seventh and ninth layers.
Following the usual practice in deep learning [22], the encoder has eight convolutional layers with a 4 × 4 kernel, each followed by an activation layer. The stride of every convolutional layer is set to 2 and the padding to 1, so that each layer halves the spatial size and the symmetric decoder restores it, making the input and output the same size. In the encoder, we select LeakyReLU as the activation function to avoid dead neurons, with a non-zero slope a = 0.2. The decoder has eight transposed convolutional layers with a 4 × 4 kernel. We choose ReLU as its activation function to enhance the non-linearity of the model and speed up convergence. In this way, the number of network parameters in our sparse generator decreases from 54M for the dense U-net connections to 47M (a reduction of about 13%).
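The layer arithmetic above can be checked with a short sketch. It only traces feature-map sizes and does not build a trainable network. The 256 × 256 input size is an assumption (the text does not state it), and the names `conv_out`, `deconv_out` and `trace_sizes` are illustrative.

```python
# Sketch only: traces spatial sizes through the 16 layers of the generator.
KERNEL, STRIDE, PADDING = 4, 2, 1                   # values stated in the text
INPUT_SIZE = 256                                    # ASSUMPTION: not given in the text
NUM_ENCODER_LAYERS = 8
SKIP_PAIRS = [(1, 15), (3, 13), (5, 11), (7, 9)]    # layer pairs listed in the text


def conv_out(size):
    """Output size of a strided convolution."""
    return (size + 2 * PADDING - KERNEL) // STRIDE + 1


def deconv_out(size):
    """Output size of a transposed convolution with the same settings."""
    return (size - 1) * STRIDE - 2 * PADDING + KERNEL


def trace_sizes(input_size=INPUT_SIZE):
    """Return {layer index: output size} for layers 1..16."""
    sizes = {}
    size = input_size
    for layer in range(1, NUM_ENCODER_LAYERS + 1):            # encoder: layers 1..8
        size = conv_out(size)
        sizes[layer] = size
    for layer in range(NUM_ENCODER_LAYERS + 1, 2 * NUM_ENCODER_LAYERS + 1):
        size = deconv_out(size)                               # decoder: layers 9..16
        sizes[layer] = size
    return sizes


sizes = trace_sizes()
for enc, dec in SKIP_PAIRS:
    # A skip connection concatenates feature maps, so both sizes must agree.
    print(f"skip {enc:2d} <-> {dec:2d}: {sizes[enc]:3d} and {sizes[dec]:3d}")
print("generator output size:", sizes[16])
print(f"parameter reduction: {(54 - 47) / 54:.1%}")
```

Under these assumptions, each listed pair joins feature maps of equal spatial size, which is what a concatenating skip connection requires, and the output returns to the input resolution.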

B. ARCHITECTURE OF DISCRIMINATOR
It is well known that the L1 loss produces blurry results on image generation problems [23]. To model high-frequency crispness, we use a PatchGAN as the discriminator network, which focuses on penalizing structure in local image patches. Fig.3 shows the architecture of our discriminator. It includes 5 blocks with a 3-channel input. The convolutional layers have 64, 128, 256, 512 and 1 filters, respectively. As in the generator, the convolutional kernel is 4 × 4. The stride is set to 2 for the first to third layers and to 1 for the fourth and fifth layers, and the padding is 1. Each block except the fifth has a LeakyReLU activation layer to ensure that the parameters receive sufficient updates. In this work, the discriminator classifies whether local 70 × 70 image patches in the synthetic TIR image are real or fake, and averages all convolutional responses to provide the final 1D output.
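The 70 × 70 patch size follows from the kernel sizes and strides listed above. The sketch below computes the receptive field of one output unit by walking backward through the five blocks. The output-map size additionally assumes a 256 × 256 input, which the text does not state, and the function names are illustrative.

```python
# (kernel, stride) of the five discriminator blocks described in the text.
LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
PADDING = 1


def receptive_field(layers):
    """Walk from the output back to the input: rf = (rf - 1) * stride + kernel."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_size(input_size, layers, padding=PADDING):
    """Spatial size of the patch-score map for a square input."""
    size = input_size
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


print("receptive field:", receptive_field(LAYERS))        # 70
print("score map for a 256 input (assumed):", output_size(256, LAYERS))  # 30
```

Each entry of the resulting score map judges one 70 × 70 patch, and averaging the map gives the single real/fake output described above.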

C. OBJECTIVE OPTIMIZATION
We use the loss function in [6], [15] to guide the training process, which consists of two parts, expressed as:
$$L_{GAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]$$
Here, x, y and z represent the optical image, the ground-truth TIR image and random noise, respectively. G tries to minimize this objective while D tries to maximize it. Additionally, the pix2pix model includes a traditional reconstruction loss, the L1 distance, to ensure the quality of the generated synthetic images:
$$L_{L1}(G) = \mathbb{E}_{x,y,z}[\lVert y - G(x, z) \rVert_1]$$
Therefore, the full objective function is:
$$G^* = \arg\min_G \max_D L_{GAN}(G, D) + \lambda L_{L1}(G)$$
where λ is a weight that controls the contribution of the L1 loss.
Since TIR images capture information about ground objects through differences in their thermal radiation, image intensity can simulate the received radiation. To learn this characteristic, we add an intensity loss to the current objective. The intensity I is calculated from R, G and B, the color values of the image. The intensity L1 loss between the synthetic TIR image and the ground truth is described as:
$$L_{L1}(I) = \mathbb{E}_{x,y,z}[\lVert I(y) - I(G(x, z)) \rVert_1]$$
In addition, in order to retain the semantic information, we propose an adversarial content loss based on gradient information, which favors maintaining the local appearance and shape of different objects. We also formulate the gradient loss as an L1 distance to minimize the difference between the generated and ground-truth TIR images:
$$L_{L1}(GRA) = \mathbb{E}_{x,y,z}[\lVert GRA(y) - GRA(G(x, z)) \rVert_1]$$
Here GRA(y) and GRA(G(x, z)) are the gradient maps of the ground-truth and the synthetic TIR image, respectively. We define the final objective function as:
$$G^* = \arg\min_G \max_D L_{GAN}(G, D) + \lambda L_{L1}(G) + \lambda_{gra} L_{L1}(GRA) + \lambda_I L_{L1}(I)$$
where λ_gra and λ_I are likewise weights that control the relative importance of each loss.
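A minimal sketch of the two added loss terms follows, written in plain Python on nested lists for readability; a training implementation would use tensor operations. Two choices here are assumptions, because the text does not specify them: intensity is taken as the plain mean of R, G and B, and the gradient map is computed with forward differences on the intensity image. All function names are illustrative.

```python
def intensity(rgb):
    """Per-pixel intensity map. ASSUMPTION: plain mean of R, G and B."""
    return [[sum(px) / 3.0 for px in row] for row in rgb]


def gradient_map(img):
    """ASSUMPTION: forward differences, |dx| + |dy|, zero past the border."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            dx = img[i][j + 1] - img[i][j] if j + 1 < w else 0.0
            dy = img[i + 1][j] - img[i][j] if i + 1 < h else 0.0
            out[i][j] = abs(dx) + abs(dy)
    return out


def l1_distance(a, b):
    """Mean absolute difference between two equally sized 2-D maps."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def intensity_loss(real_rgb, fake_rgb):
    """L1 distance between the intensity maps of real and synthetic images."""
    return l1_distance(intensity(real_rgb), intensity(fake_rgb))


def gradient_loss(real_rgb, fake_rgb):
    """L1 distance between the gradient maps of real and synthetic images."""
    return l1_distance(gradient_map(intensity(real_rgb)),
                       gradient_map(intensity(fake_rgb)))


# Toy example: the synthetic image is the real one brightened by 3 everywhere.
real = [[(0, 0, 0), (30, 30, 30)], [(60, 60, 60), (90, 90, 90)]]
fake = [[tuple(c + 3 for c in px) for px in row] for row in real]
print(intensity_loss(real, fake))   # 3.0: a uniform brightness error
print(gradient_loss(real, fake))    # 0.0: edges are unchanged
```

The toy example shows why the two terms are complementary: a uniform brightness shift is penalized only by the intensity term, while a change in edges or contours would be penalized by the gradient term.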

A. IMPLEMENTATION DETAILS
1) DATASETS
As with the classical pix2pix model, we need paired optical and TIR images for training. We conduct extensive experiments on the KAIST multi-spectral pedestrian data set and other data sets. 64,447 paired images are chosen randomly from these data sets to train our model, and the remaining optical images are used to test the GAN network. The training and testing sets do not overlap.

2) TRAINING DETAILS
We train the generator and discriminator networks end-to-end in an adversarial manner, alternating one step on D with one step on G. The discriminator D tries to distinguish the synthetic TIR images produced by the generator from the ground-truth TIR images; that is, it tries to maximize L_GAN(G, D). When training the generator, we maximize log D(x, G(x, z)) rather than minimizing log(1 − D(x, G(x, z))). The generator G is thus trained to minimize the adversarial loss so that the synthetic image becomes more plausible. The generator also tries to minimize the weighted loss λ L_L1(G) + λ_gra L_L1(GRA) + λ_I L_L1(I) (λ = 100, λ_gra = 100 and λ_I = 30), computed from the output of forward propagation and the real labels of the images. The model parameters are then updated by back propagation with Stochastic Gradient Descent (SGD).
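The per-step objectives described above can be sketched as two scalar functions. This is an illustration under stated assumptions, not the authors' code: the discriminator outputs are treated as single probabilities in (0, 1), the small constant `EPS` is added only to keep the logarithm finite, and the names `discriminator_loss` and `generator_loss` are hypothetical. The loss weights are the ones given in the text.

```python
import math

LAMBDA_L1, LAMBDA_GRA, LAMBDA_I = 100.0, 100.0, 30.0   # weights stated in the text
EPS = 1e-12   # ASSUMPTION: small constant that keeps log() finite


def discriminator_loss(d_real, d_fake):
    """Negated GAN objective, so that minimizing it maximizes L_GAN for D."""
    return -(math.log(d_real + EPS) + math.log(1.0 - d_fake + EPS))


def generator_loss(d_fake, l1, l1_gra, l1_int):
    """Non-saturating adversarial term plus the three weighted L1 terms."""
    adversarial = -math.log(d_fake + EPS)   # i.e. maximize log D(x, G(x, z))
    return (adversarial
            + LAMBDA_L1 * l1
            + LAMBDA_GRA * l1_gra
            + LAMBDA_I * l1_int)


# One alternating step with made-up discriminator outputs and loss values.
print(round(discriminator_loss(d_real=0.9, d_fake=0.2), 4))
print(round(generator_loss(d_fake=0.5, l1=0.01, l1_gra=0.02, l1_int=0.1), 4))
```

In a training loop, the first function would drive the update of D and the second the update of G, one step each in alternation.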
The model is trained with a batch size of 3. A smaller learning rate is more suitable for TIR data as they have less detailed information than optical RGB images. Here, the learning rate for mini-batches is set to 0.0002. We do not use decay during training because the model does not overfit within two epochs. Momentum parameters are set to β1 = 0.5 and β2 = 0.999.

1) QUALITATIVE EVALUATION
To verify the performance of our method, we first visually compare three base generator networks. Fig.4 presents the results of translating optical images to TIR data, produced by generators with symmetrical low-level skip connections, high-level skip connections and the proposed skip connections, respectively. It shows that a generator with connections only at the first and third layers, or only at the fifth and seventh layers, produces blurred objects such as cars and trees, which sometimes cannot be distinguished from the background. The proposed method enhances the semantic features by using both low-level and high-level connections. The shapes and contours are much clearer and the synthetic images are more similar to the ground-truth TIR images. Table 2 shows that using both low-level and high-level features yields the highest SSIM and PSNR values.
To describe the detail information in the synthetic TIR images produced by our model with sparse skip connections and by the model with dense skip connections, Table 3 lists the average LBP values (local binary pattern, often used to describe local texture information [24]) of the synthetic TIR images in Fig.5. The TIR results of Fig.5 (a) from the left to …
To provide a more detailed analysis of the proposed method, we then compare the performance of different GAN models, including the pix2pix network used in [15], the unsupervised CycleGAN network [17] and two baselines that ablate L_L1(GRA) and L_L1(I), respectively. The model in [15] even fails to generate distinguishable outputs: the riding man is blurry and the cars cannot be separated. Without L_L1(GRA) or L_L1(I), the image quality is unsatisfactory. When trained with unpaired images, CycleGAN cannot transfer the characteristics of the thermal targets well; Fig.5 (d) shows that the riding man and the cars appear black. We observe in Fig.5 (e) that the synthetic TIR images generated by our model are visually realistic, with a higher translation quality. The L_L1(GRA) and L_L1(I) losses allow our model to learn reliable features with different attribute values that are universally applicable to producing realistic TIR images.
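The LBP texture measure reported in Table 3 can be illustrated with the basic 8-neighbor operator. The text does not say which LBP variant or neighbor ordering was used, so the version below (3 × 3 neighborhood, clockwise from the top-left, border pixels skipped) is an assumption, and the function names are illustrative.

```python
def lbp_code(img, i, j):
    """Basic 8-neighbor LBP code of interior pixel (i, j) of a grayscale image."""
    center = img[i][j]
    # ASSUMPTION: neighbors visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (di, dj) in enumerate(offsets):
        if img[i + di][j + dj] >= center:   # neighbor at least as bright
            code |= 1 << bit
    return code


def mean_lbp(img):
    """Average LBP code over all interior pixels."""
    h, w = len(img), len(img[0])
    codes = [lbp_code(img, i, j)
             for i in range(1, h - 1) for j in range(1, w - 1)]
    return sum(codes) / len(codes)


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]   # no texture: every bit is set
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]   # bright spot: no bit is set
print(mean_lbp(flat), mean_lbp(peak))       # 255.0 0.0
```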

2) QUANTITATIVE EVALUATION
It is necessary to measure quantitatively the difference between the translated images and the true ones. All quantitative results on the test dataset, in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [25], are shown in Figures 6-7 and Table 4. Figures 6 and 7 show the detailed PSNR and SSIM scores of different models after 5, 10 and 15 training epochs. The model using the source pix2pix network [15] reaches about 29 dB PSNR. After adding the L_L1(GRA) and L_L1(I) losses, PSNR improves slightly. In contrast, the CycleGAN model with unpaired images [17] has the lowest PSNR. The bar graphs in Figure 7 show that every model except CycleGAN reaches its highest SSIM value with the network trained for 10 epochs. Our model achieves an increase of 6.5%∼10.5% in SSIM compared with the source pix2pix model [15] and of 3%∼6% after adding the L_L1(GRA) or L_L1(I) loss. Though CycleGAN [17] obtains higher SSIM values as the number of epochs increases, it is time-consuming and gives no better results than the other models. Our proposed model enhances semantic consistency using intensity and gradient constraints that encourage appearance and structural similarity. The sparse connections not only maintain low-level and high-level features but also reduce the number of network parameters. This simplification is better suited to TIR images.
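For reference, PSNR can be computed from the mean squared error between two images as 10·log10(MAX² / MSE). The sketch below assumes 8-bit grayscale images stored as nested lists (so MAX = 255); the function name `psnr` is illustrative, and SSIM is omitted because it needs windowed statistics that would not fit a short example.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2-D images."""
    n = len(reference) * len(reference[0])
    mse = sum((r - t) ** 2
              for row_r, row_t in zip(reference, test)
              for r, t in zip(row_r, row_t)) / n
    if mse == 0:
        return math.inf            # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


real = [[10, 20], [30, 40]]
fake = [[11, 21], [31, 41]]        # every pixel is off by one gray level
print(round(psnr(real, fake), 2))  # 48.13
```

On this scale, the roughly 29 dB reported above corresponds to a root-mean-square error of about 9 gray levels for 8-bit images.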
In addition, Table 4 shows that the synthetic daytime TIR data have better quality than the nighttime data for every GAN model. Compared with the dense model [15], the SSIM increase reaches 7% for daytime synthetic TIR images but only 1.4% for night images. This is because lower temperatures at night cause weak differences in thermal radiation, and indistinguishable objects tend to produce low-quality synthetic results.

V. CONCLUSION
In this work, we propose an image-to-image translation model to learn the mapping from optical images to the TIR domain. We design a sparse U-net architecture that learns low-level and high-level features with fewer network parameters. To optimize this deep network, we introduce intensity and gradient losses into the objective to improve semantic and appearance consistency during translation. The experimental results on the current dataset show that our method works well. Although our method brings better results, there are cases where the synthetic nighttime TIR data differ from the real infrared images, which needs further study.
MIAO ZHANG received the B.S. degree in traffic information engineering from the Nanjing University of Aeronautics and Astronautics, China, in 2018, where she is currently pursuing the M.S. degree. Her current research interests include deep learning and image processing.
FENG ZHANG received the B.S. degree in traffic information engineering from the Nanjing University of Aeronautics and Astronautics, China, in 2018, where he is currently pursuing the M.S. degree. His current research interests include deep learning and image processing.