Region-Based Removal of Thermal Reflection Using Pruned Fully Convolutional Network

An image obtained from a thermal camera often contains a mirror reflection or shadow of an object cast on the surrounding ground, which is referred to as thermal reflection. Thermal reflections are sometimes connected to their objects in images, which makes it difficult to detect or recognize the object alone. Thermal reflections also sometimes occur on a wall near an object and, when they are not connected to the object, are detected as another object. Furthermore, the size and pixel values of a thermal reflection vary significantly with the reflecting medium and the surrounding temperature. In these cases, the patterns and pixel values of the thermal reflection and the object become similar and difficult to distinguish. However, there are insufficient studies on removing the thermal reflection of various kinds of objects in diverse environments. Therefore, in this paper, we propose a pruned fully convolutional network (PFCN)-based method that removes the thermal reflection of an object using the surrounding information while performing image transformation only within the region of the object. In experiments conducted using self-collected databases (Dongguk thermal image database (DTh-DB) and Dongguk items & vehicles database (DI&V-DB)) and open databases, the proposed method outperformed the state-of-the-art methods in removing thermal reflection.


I. INTRODUCTION
Typically, a long-wavelength infrared (LWIR) camera, which is often used in surveillance systems, measures electromagnetic radiation at wavelengths of 8-12 µm [1]. Most of the thermal radiation emitted by an object or body is infrared radiation, and the LWIR camera is commonly used to measure such heat information. Hence, an LWIR camera is also referred to as a thermal camera. A thermal camera can make objects at close and far distances visible in dark surroundings without an additional illuminator. Figure 1 shows the thermal camera, a visible light camera, thermal images, and the respective visible light images. However, as shown in Figure 2, there are thermal reflections (the areas marked by dotted lines) such as shadows or mirror reflections on the ground surface near the object in images obtained using a thermal camera in both indoor and outdoor environments. Such thermal reflections degrade the performance of object detection and recognition algorithms. However, very few studies have been conducted on the removal of thermal reflection. Therefore, we propose a novel method for the removal of thermal reflections by conducting image transformation only within a specific region of thermal images. Recently, various image transformation algorithms have been developed for deep-learning-based image processing tasks. In particular, image-to-image translation methods based on generative adversarial networks (GANs) have shown high accuracy. Normally, an entire image is transformed when transforming an object in an image. However, the accuracy of such a method is reduced because the background region is transformed along with the object. Thus, a method for increasing the accuracy of transformation is proposed.
The associate editor coordinating the review of this manuscript and approving it for publication was Qingli Li.
In the method, the transformation is conducted only within the region of an object that has been detected using deep learning. The surrounding information is also considered when transforming an image only within the region of an object. More specifically, thermal reflections in a thermal image are removed by transforming the heat of specific regions of the wall and surrounding floor to match the heat of the background. The remainder of this paper is organized as follows. Previous studies are discussed in Section II, and the contributions of this study are explained in Section III. The details of the proposed method are explained in Section IV. The experimental results and comparative experiments are discussed in Section V, and lastly, the conclusion of this study is provided in Section VI.
FIGURE 1. From left to right, the thermal camera with captured thermal images and the visible light camera with captured images, respectively. From top to bottom, images captured in daytime and at night, respectively.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORKS
The existing deep-learning-based image transformation methods can be divided into those that transform only a specific region of an image and those that transform an entire image. There are several GAN-based image-to-image translation methods [2]-[9]. In study [2], the authors developed a two-step unsupervised learning method that transforms images between different domains by using unlabeled images without specifying any correspondence between them, so as to avoid the cost of acquiring labeled data. In [3], an unsupervised image-to-image translation (UNIT) method based on GAN and variational autoencoders (VAEs) is proposed. The paper explains two limitations of the method. The first is that the transformation model is unimodal owing to the Gaussian latent space assumption. The second is that training can be unstable owing to the saddle point searching problem. In study [4], a triangle GAN that can be used for semi-supervised joint distribution matching is proposed. The approach learns the bidirectional mappings between two domains from a few paired sample images. In [5], StarGAN, a scalable method that can perform image-to-image transformation for multiple domains using only a single model, is proposed. The architecture of StarGAN allows simultaneous training on multiple image data sets with different domains within a single network. In [6], a GAN-based method that learns to discover relations between images in different domains (DiscoGAN) is proposed. DiscoGAN can generate high-quality images with transferred style without using any explicit pair labels and learns to relate images from very different domains. In study [7], GANs in the conditional setting are explored to design a conditional GAN (cGAN) that learns a conditional generative model. In [8], a coupled GAN (CoGAN) method for learning a joint distribution of multi-domain datasets is proposed.
In contrast to the existing methods, which require tuples of corresponding images from different domains in the training set, the CoGAN method learns a joint distribution without any tuple of corresponding image data.
In studies [2]-[8], an entire image was transformed using a deep learning network. In study [9], the network was trained by using an entire image and a corresponding mask image of objects simultaneously as inputs. The authors also reported that the image transformation methods of previous studies fail when there are multiple objects and when the shape of an object changes. Hence, the performance of image transformation was enhanced using the mask image. In study [10], a perceptual loss network (PLN)-based method was proposed in which image-to-image translation was performed while the image style was transferred. However, the above image-to-image translation methods transform an entire image; because the background region is transformed in addition to the region of the object, their accuracy is lower than when only the region of the object is transformed. Hence, we propose a method for increasing the accuracy of image transformation by transforming only the region of an object. In this study, a method for removing thermal reflection in images obtained using a thermal camera was examined. Typically, there are two problems in thermal images, namely thermal reflection [11]-[14] and the halo effect [15]-[17]. In study [11], a method of suppressing thermal reflection is proposed. In the method, experiments are conducted on both the reflection of visible light and the reflection of heat. Various polarizers and plates are used, and the change in the thermal reflection according to the angle of the plates is shown graphically. In this way, a thermal reflection suppression technique that accounts for the plate angles for various polarizers and materials is proposed. However, because the angle varies depending on the material of the plates, the suppression performance of this method is reduced when there are nearby floors or walls made of different materials.
In [12], a thermal reflection elimination method is proposed. The method uses Mask R-CNN [18] to detect thermal reflection regions in thermal images and eliminates thermal reflections based on the detected regions. It changes pixel values only in the detected regions to increase the accuracy of the transformation. In [13], two methods are proposed: one that classifies the material being observed in order to estimate improved surface temperature values, and one that detects and removes thermal reflections in thermal images. The detection is conducted using a background subtraction algorithm, and the removal uses weighted moving averages. In study [14], a novel reflection removal approach using the polarization properties of the reflection in thermal images is proposed. The method uses four input images with different polarization angles (0, 45, 90, and 135 degrees) for removing thermal reflections. These studies are conducted to remove thermal reflections without using deep learning.
The halo effect is explained in documentation provided by FLIR [15]. The halo effect in a thermal image is a circular region of high-intensity pixels that surrounds an object. In studies [16] and [17], methods for the detection of subjects in images with the halo effect were proposed using contour-based approaches. The methods are based on background subtraction that fuses contours obtained from a thermal image and a visible light image. However, these contour-based methods do not remove halo effects in images; rather, they are approaches for accurately detecting subjects in thermal images with halo effects. Moreover, other existing studies that have investigated detection [19]-[26], identification [27], [28], and recognition [29]-[31] in thermal images, as well as the survey study [32], did not consider these two problems. Therefore, we propose an image transformation method based on the regions of thermal reflection using deep learning.
In addition, previous studies [33]-[42] have addressed image inpainting, a task similar to the one conducted in this study. Image inpainting is used to fill damaged, deteriorated, or missing parts of an image. In study [33], a method for semantic image inpainting that generates the missing information by conditioning on the available data is proposed. The authors claim that in their method, inference is possible irrespective of how the missing information is structured, whereas the state-of-the-art learning-based methods require specific information about the holes in the training phase. In [34], a spatial region-wise normalization named region normalization (RN) is proposed to overcome a limitation of image inpainting networks. The authors show that the mean and variance shifts caused by full-spatial feature normalization (FN) limit the training of image inpainting networks. In [35], a method based on a deep generative model is proposed that can not only synthesize novel image structures but also explicitly utilize surrounding image features as references during network training to make better predictions. The model is a feedforward, fully convolutional neural network (FCN) that can process images with variable sizes and with multiple holes at arbitrary locations at test time. In study [36], a generative image inpainting approach for completing images with a free-form mask and guidance is proposed. The approach is based on gated convolutions learned from a huge number of images without additional labelling efforts. The authors presented user sketches as exemplar guidance to help users quickly remove distracting objects, modify image layouts, edit faces, clear watermarks, and create novel objects in images.
In [37], learnable bidirectional attention maps (LBAM) for image inpainting are proposed. The method uses an FCN to conduct image inpainting. In [38], a refined deep generative model-based method is proposed, in which a coherent semantic attention layer is designed to learn the relationship between features of missing information in images. The method uses an FCN to conduct image inpainting. In study [39], an architecture named Shift-Net is proposed for image completion that exhibits high speed with promising details via deep feature rearrangement. The study introduced a special shift-connection layer into the U-Net architecture. The method uses an FCN with a shift layer. In [40], an image inpainting model named PEPSI, which overcomes the limitation of the two-stage coarse-to-fine network using a joint learning scheme, is proposed. In study [41], a two-stage adversarial model, EdgeConnect, that comprises an edge generator followed by an image completion network is proposed. The method first generates edge information from a damaged image and then feeds the obtained edge information together with the damaged image into the second generator to produce the desired output. In [42], a PGGAN approach is proposed. The method includes a discriminator network that combines a global GAN (G-GAN) architecture with a patch GAN.
To address the limitations of previous works, we propose an image transformation method based on the regions of thermal reflection using deep learning. A summary of the comparison between the proposed method and previous image transformation methods is provided in Table 1.

III. CONTRIBUTIONS
This research is novel in the following four ways compared with previous works:
- This study is the first of its kind to remove thermal reflection in thermal images using deep learning.
- A general image processing method can remove thermal reflection by transforming an image only within the region where thermal reflection was detected; however, a method for transforming an image only within the detected region using deep learning does not exist currently. Therefore, we suggest a deep-learning-based method for transforming an image only within the region where thermal reflection is detected.
- In this study, a pruned fully convolutional network (PFCN), in which the heat information of surrounding walls and ground is considered, is newly proposed for transforming an image only within the area where thermal reflection is detected.
- The convolutional neural network (CNN) models for removing thermal reflection developed in this study are made available through [45] so that other researchers can evaluate their performance.

IV. PROPOSED METHOD
A. OVERALL PROCEDURE OF PROPOSED METHOD
In this section, the method proposed in this paper is explained in detail. In the proposed method, only a specific region of an image is transformed using the PFCN architecture, which is an improved version of the existing FCN [46]. Figure 3 shows the flowchart of the proposed method. A method for obtaining output images to be transformed by the PFCN is further explained in Section IV.B, whereas a method for removing thermal reflection using the output images obtained by the PFCN is further explained in Section IV.D. The thermal camera used in this study can obtain images at a speed of 30 frames per second (fps) [47]. It can measure temperatures from -40 °C to +80 °C, making objects visible in both light and dark environments. The database obtained using the thermal camera (each image has a depth of 14 bits and a size of 640 × 480 pixels [12]) was used in the experiment. A mask region CNN (Mask R-CNN) was used to detect the approximate region (input region image in Figure 3) of thermal reflection in input images, and a detailed explanation is provided in [12]. The input image I and the input region image R of the object are used as inputs to generate the output image O. R is the image to be transformed, and I is the image providing information on the surroundings. For example, when removing a shadow of an object in a visible light image, the pixel intensity within the region is transformed to be similar to the pixel intensity of the nearby ground or walls. Accordingly, I is used as an input to the FCN to extract the information on the surroundings. Two different structures were tested for the proposed method. The idea of the first structure (Figure 4(a)) is to generate O by extracting the features of I and R and combining them. The idea of the second structure (Figure 4(b)) is to update the convolution layers with the information extracted from I when transforming R to O. A concatenate layer is used when combining feature maps.
Feature maps extracted from Figure 4 Tables 6 and 7. The thermal images used in this study are one-channel gray scale images, not three-channel color images. In this study, a color mapping function [48] is used to map gray scale thermal images to color thermal images for accurately representing the information of the heat and surrounding temperature of the objects in images. In Tables 6 and 7 in Appendix, all the convolution layers are followed by the rectified linear unit (ReLU). In Table 6, (1×1) padding is used for conv2d_13, and (0 × 0) padding is used for the convolution layers. Furthermore, the filter size, stride, and padding are (1 × 1), (3 × 3), and (0 × 0), respectively, in Table 7.
In this study, a PFCN with enhanced performance is proposed instead of using FCN_V1 and FCN_V2 in Tables 6 and 7 in the Appendix as they are. The PFCN is a model in which the number of channels and parameters of the FCN is reduced using the pruning (network surgery) [49] technique. A detailed explanation is provided in Section IV.C. A model obtained by training the FCN can output a completely black image, as shown in Figure 5. This is due to the black area of the input region image in Figure 5. For example, when the input region image is input to the FCN structure, some of the feature maps extracted from the first convolution layer are completely black. Therefore, in the proposed method, the trained FCN model is pruned using the pruning function. Using the pruning function, the parameters that extract black feature maps as shown in Figure 5 are removed from the proposed structure. After fine-tuning the pruned architecture, the expected final output image can be obtained using the structure as shown in Figure 6. The structure of the PFCN is shown in Tables 8 and 9 in the Appendix.
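The pruning idea described above can be illustrated with a minimal NumPy sketch. It is not the pruning function used in this study: the function names `find_dead_channels` and `prune_conv_pair`, the tolerance `eps`, and the (kh, kw, in_channels, out_channels) kernel layout are all assumptions made for illustration. The sketch finds channels whose feature maps are completely black (zero) for the given inputs and removes the corresponding filters, together with the matching input slices of the next layer.

```python
import numpy as np

def find_dead_channels(feature_maps, eps=1e-6):
    """Return indices of channels that are (near) zero everywhere.

    feature_maps: array of shape (n_samples, height, width, channels),
    e.g. post-ReLU outputs of one convolution layer (assumed layout).
    """
    peak = np.abs(feature_maps).max(axis=(0, 1, 2))
    return np.flatnonzero(peak <= eps)

def prune_conv_pair(w_cur, b_cur, w_next, dead):
    """Drop output channels `dead` of one layer and the matching
    input channels of the following layer.

    Kernels are assumed to use (kh, kw, in_channels, out_channels) layout.
    """
    keep = np.setdiff1d(np.arange(w_cur.shape[-1]), dead)
    return w_cur[..., keep], b_cur[keep], w_next[:, :, keep, :]

# Toy example: a layer with 6 channels, two of which output only zeros.
rng = np.random.default_rng(0)
acts = rng.random((2, 8, 8, 6))
acts[..., [1, 4]] = 0.0                    # simulate black feature maps
w1 = rng.standard_normal((3, 3, 1, 6))     # hypothetical first-layer kernel
b1 = np.zeros(6)
w2 = rng.standard_normal((3, 3, 6, 4))     # hypothetical next-layer kernel

dead = find_dead_channels(acts)
w1_p, b1_p, w2_p = prune_conv_pair(w1, b1, w2, dead)
```

Because a removed channel contributes only zeros to the next layer for these inputs, dropping it leaves the next layer's output unchanged on them; fine-tuning afterwards, as in the text, adapts the remaining weights.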

C. DIFFERENCES BETWEEN FCN AND PFCN
The PFCN architectures proposed in this study differ from the existing FCN architectures in the following three ways:
- PFCN architectures have fewer channels than FCN architectures.
- PFCN architectures have fewer parameters than FCN architectures.
- PFCN architectures are optimized versions of FCN architectures. The optimization is conducted by removing parameters with little effect through pruning.

V. POST PROCESSING
Moreover, a masking operation is performed using the output image obtained by the PFCN at the post-processing step. When performing the masking operation, the thermal reflection region of the output image in Figure 6 is processed with the input image in Figure 5 as in Equation (1) to obtain the final output image in Figure 6.
$$\mathrm{Img}_{\text{final output}} = \mathrm{Img}_{\text{input}} \odot \mathrm{Img}_{\text{mask}} + \mathrm{Img}_{\text{output}} \quad (1)$$
In Equation (1), $\mathrm{Img}_{\text{input}}$, $\mathrm{Img}_{\text{mask}}$, $\mathrm{Img}_{\text{output}}$, and $\mathrm{Img}_{\text{final output}}$ are the input image, mask image, output image generated by the PFCN, and final output image, respectively. More specifically, the input image in Figure 5 is $\mathrm{Img}_{\text{input}}$, the output image obtained by the PFCN in Figure 6 is $\mathrm{Img}_{\text{output}}$, and the final output image is $\mathrm{Img}_{\text{final output}}$. The pixel values of $\mathrm{Img}_{\text{mask}}$ are either 0 or 1: those in the region of interest (ROI) are 0 and those of the background are 1, as shown in Figure 7. Moreover, the operator ($\odot$) is the Hadamard product (elementwise multiplication) [50], whereas the operator (+) is matrix addition.
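As a small worked example of Equation (1), the following NumPy sketch composes a final output from toy 2 × 2 arrays. The function name `compose_final_output` and the toy values are hypothetical, and the sketch assumes that the PFCN output is zero outside the ROI, which Equation (1) appears to require for the background to stay unchanged.

```python
import numpy as np

def compose_final_output(img_input, img_mask, img_output):
    """Equation (1): Hadamard product of input and mask, plus PFCN output.

    img_mask is 0 inside the ROI and 1 in the background.
    img_output is assumed (not stated in the text) to be zero outside the ROI.
    """
    img_input = np.asarray(img_input, dtype=np.float64)
    img_mask = np.asarray(img_mask, dtype=np.float64)
    img_output = np.asarray(img_output, dtype=np.float64)
    return img_input * img_mask + img_output

# Toy 2x2 example: only the top-right pixel belongs to the ROI.
img_input = np.array([[10, 20], [30, 40]])
img_mask = np.array([[1, 0], [1, 1]])
img_output = np.array([[0, 7], [0, 0]])

final = compose_final_output(img_input, img_mask, img_output)
```

Here the ROI pixel takes the PFCN value 7 while the three background pixels keep their input values. If the network output were not zero outside the ROI, multiplying it by (1 - mask) before the addition would be one way to enforce that.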

A. DESCRIPTION OF EXPERIMENTAL SETUP AND DATABASES
The database [12] used in this study consists of thermal images of objects at close and far distances in both dark and bright environments. The database was collected in both indoor and outdoor environments and also includes visible light images. The details of the database are shown in Tables 3 and 4 and Figures 8-11 of our previous work [12]. Figure 8 shows examples of the images in the database. The experiment was conducted as a two-fold cross validation. Specifically, half of the data were used for training, and the remaining data were used for testing. Then, the training data and testing data were switched, and the experiment was repeated. The results of the two runs were averaged to determine the testing accuracy. The training and testing of the algorithm proposed in this study were conducted on a desktop computer equipped with an NVIDIA graphics card (NVIDIA GeForce GTX TITAN X [51]), an Intel CPU (Core i7-6700 CPU @ 3.40 GHz (8 CPUs)), and 32 GB of RAM. The proposed method was implemented using the Python-based Keras application programming interface (API) with the TensorFlow backend engine [52] and the OpenCV library [53].
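The two-fold protocol described above can be sketched in plain Python as follows. The helper names `two_fold_splits` and `two_fold_average`, the random shuffling, and the `train_and_test` callback are assumptions for illustration; the text does not state how the data were divided into halves.

```python
import random

def two_fold_splits(items, seed=0):
    """Split items into two halves and return both train/test assignments.

    Shuffling with a fixed seed is an assumption, not taken from the paper.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    first, second = items[:half], items[half:]
    return [(first, second), (second, first)]

def two_fold_average(items, train_and_test, seed=0):
    """Run train_and_test(train, test) on both folds and average the scores."""
    scores = [train_and_test(train, test)
              for train, test in two_fold_splits(items, seed)]
    return sum(scores) / len(scores)

# Toy usage: the "score" is just the size of the test half.
splits = two_fold_splits(range(10))
avg = two_fold_average(range(10), lambda train, test: float(len(test)))
```

In practice `train_and_test` would train a model on the first argument and return a metric such as PSNR or SSIM measured on the second.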

B. TRAINING OF PFCN MODELS
When training the proposed models, the image size, batch size, number of training epochs, loss, learning rate, and optimizer are set to 224 × 224 × 1, 1, 1000, MSE (mean squared error [54]), 0.0001, and adaptive moment estimation (Adam) [55], respectively. The MSE loss is calculated between the pixels of the ground-truth image and those of the image restored by the PFCN, as shown in Equation (2). A larger MSE loss imposes a larger penalty on the current weights of the PFCN, whereas a smaller MSE loss indicates that the restored image is closer to the ground truth; the decrease of this loss during training is used to confirm the convergence of the PFCN.
The PFCN obtained after pruning the trained FCN was then fine-tuned for 100 epochs. Figure 9 shows the training loss of each method as the number of epochs increases. As the number of epochs increased, the training loss of both methods converged. In general, a cycle-consistent adversarial network (CycleGAN)-based method [44] is trained with unpaired reference data, whereas the proposed method uses ground-truth data for each input. Training was performed using the paired dataset of inputs and ground-truth images, as shown in Figure 10.

C. TESTING
1) TESTING RESULTS OF THERMAL REFLECTION REMOVAL
In this section, the proposed method is compared with the state-of-the-art methods. For the comparison, the accuracies of seven methods, namely CycleGAN [44], PLN [10], Mask R-CNN + PLN [12], the SegNet [43]-based removal method [12], the Mask R-CNN [18]-based removal method [12], FCN_V1 [46], and FCN_V2 [46], are compared with the accuracy of the method proposed in this study. Starting from the original parameters provided by the authors, the optimal parameters of these seven methods were obtained by further fine-tuning with the training dataset of our experimental data.
For fair comparison, the same training and testing data were used for both the previous methods and our method. To measure the accuracy, the similarity between the ground-truth image (GT(i, j)), in which thermal reflection was manually removed, and the image (Out(i, j)), in which thermal reflection was automatically removed by the algorithm, was evaluated. Three metrics, given in Equations (2)-(4), and the structural similarity index (SSIM) [56] were used to measure the accuracy.

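$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( GT(i, j) - Out(i, j) \right)^{2} \quad (2)$$
where M and N represent the image width and height, respectively. SNR and PSNR are the signal-to-noise ratio [57] and the peak-signal-to-noise ratio [58], respectively. Equation (5) expresses the mathematical formula of SSIM.
$$\mathrm{SSIM} = \frac{(2\mu_{r}\mu_{o} + S_{1})(2\sigma_{ro} + S_{2})}{(\mu_{r}^{2} + \mu_{o}^{2} + S_{1})(\sigma_{r}^{2} + \sigma_{o}^{2} + S_{2})} \quad (5)$$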
$\mu_{o}$ and $\sigma_{o}$ represent the mean and standard deviation of the pixel values of the ground-truth image, respectively, $\mu_{r}$ and $\sigma_{r}$ represent the mean and standard deviation of the pixel values of the restored image, respectively, and $\sigma_{ro}$ is the covariance of the two images. $S_{1}$ and $S_{2}$ are positive constants set so that the denominator does not become zero. Table 2 shows the comparison of the measured accuracies. A greater value in Table 2 indicates higher accuracy. As shown in Table 2, the accuracy of removing thermal reflection was the highest for the methods proposed in this study. Figure 11 shows the results of thermal reflection removal by the proposed method and by the state-of-the-art methods. The ground-truth image with thermal reflection removed manually is shown in Figure 11 (b). The results of the proposed method are shown in Figures 11 (j) and (k), whereas those of the SegNet-based removal [43], CycleGAN [44], PLN [10], Mask R-CNN + CycleGAN [46], Mask R-CNN-based removal [12], FCN_V1 [46], and FCN_V2 [46] are shown in Figures 11 (c)-(i). As shown in Figure 11, the accuracy of the removal of thermal reflection by the PFCN method proposed in this study is the highest in all cases. In the experiment conducted with the open database, the accuracy of the proposed method was higher than that of the state-of-the-art methods (Table 3). Figures 13 (a)-(k) show the source input image having thermal reflection, the ground-truth image with thermal reflection manually removed, and the results of thermal reflection removal by all the methods. As shown in Figures 13 (j) and (k), the images from which thermal reflection was removed by the proposed method were the most similar to the ground-truth image.
In this research, we did not measure detection accuracy, so there are no false acceptance or false rejection errors. Instead, we measured the quality of the images restored by our method by calculating the similarity between our restored images and the ground-truth images based on Equations (2)-(5). As shown in Tables 2 and 3 and Figures 11-13, the similarity between our restored images and the ground-truth images is higher than that between the images restored by the previous methods [10], [12], [43], [44], [46] and the ground-truth images, which confirms the superiority of our method for thermal reflection removal.
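The image-quality metrics used here can be sketched in NumPy as follows. Only the MSE and the symbols of SSIM are defined explicitly in the text, so several details are assumptions: the SNR is taken as signal energy over error energy in decibels, the PSNR uses a peak value of 255 (8-bit images), the constants `s1` and `s2` use the commonly cited defaults (0.01 × 255)² and (0.03 × 255)², and SSIM is evaluated once over the whole image rather than over sliding windows. The function names are hypothetical.

```python
import numpy as np

def mse(gt, out):
    """Mean squared error over all pixels."""
    gt = np.asarray(gt, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    return float(np.mean((gt - out) ** 2))

def snr_db(gt, out):
    """SNR in dB (assumed form): ground-truth energy over error energy."""
    gt = np.asarray(gt, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    return float(10.0 * np.log10(np.sum(gt ** 2) / np.sum((gt - out) ** 2)))

def psnr_db(gt, out, peak=255.0):
    """PSNR in dB; peak=255 assumes 8-bit pixel values."""
    return float(10.0 * np.log10(peak ** 2 / mse(gt, out)))

def ssim_global(gt, out, s1=6.5025, s2=58.5225):
    """Single-window SSIM from global means, variances, and covariance."""
    gt = np.asarray(gt, dtype=np.float64)
    out = np.asarray(out, dtype=np.float64)
    mu_o, mu_r = gt.mean(), out.mean()
    var_o, var_r = gt.var(), out.var()
    cov_ro = np.mean((gt - mu_o) * (out - mu_r))
    num = (2.0 * mu_r * mu_o + s1) * (2.0 * cov_ro + s2)
    den = (mu_r ** 2 + mu_o ** 2 + s1) * (var_r + var_o + s2)
    return float(num / den)

# Toy example: one pixel of the restored image differs by 10 gray levels.
gt = np.array([[0, 255], [255, 0]])
out = np.array([[0, 255], [255, 10]])
scores = {"mse": mse(gt, out), "snr": snr_db(gt, out),
          "psnr": psnr_db(gt, out), "ssim": ssim_global(gt, out)}
```

SNR and PSNR are undefined for identical images because the error term is zero; a practical implementation would handle that case explicitly.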

3) COMPARISONS OF PROCESSING SPEED OF THE FCN AND PROPOSED PFCN
In the next experiment, we compared the processing speed between FCN and the proposed PFCN for an input image. The experiment was performed on the desktop computer described in section V.A. As shown in Table 4, the proposed PFCN was faster than the conventional FCN.

4) COMPARISONS OF OBJECT DETECTION ACCURACY BY OUR METHOD WITH THE PREVIOUS METHOD
Although the background generated by the PLN method appears preferable to that generated by our method, many pixels inside the object are incorrectly recognized as background, as shown in Figures 11 (e) and 13 (e) compared with the ground-truth images in Figures 11 (b) and 13 (b). In contrast, these errors are much reduced in the result images of our method, as shown in Figures 11 (j), (k) and 13 (j), (k). To confirm these observations, we performed additional object detection experiments using Mask R-CNN [18] on the result images of the PLN method and our method. Accuracy was measured using five metrics: recall, precision, GlobalACC, F1 score, and Jaccard similarity [12]. Accuracy (ACC) is the percentage of correctly classified pixels for each class, as shown in Equation (6). Here, #TP, #TN, #FP, and #FN represent the number of true positive data, true negative data, false-positive data, and false-negative data, respectively. The positive and negative data represent the pixels of the object and the background, respectively. TP represents the data that are positive and correctly classified as positive data, whereas TN represents data that are negative and correctly classified as negative data. FP represents data that are negative but incorrectly classified as positive data, whereas FN represents data that are positive but incorrectly classified as negative data.
The global accuracy (GlobalACC) is defined as the ratio of correctly classified pixels to the total number of pixels. The F1 score is calculated based on precision and recall as shown in Equation (7). In this case, precision is calculated as #TP/(#TP+#FP), whereas recall is calculated as #TP/(#TP+#FN).
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (7)$$
For a class, the intersection over union (IoU) is the ratio of the correctly classified pixels to the total number of ground truth and predicted pixels in that class. The IoU score is also known as the Jaccard similarity, and it can be calculated with two sets X and Y as shown in Equation (8). In this case, X is the set of ground-truth pixels of the object, whereas Y is the set of detected pixels of the object.
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \quad (8)$$
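The pixel-level detection metrics described above can be computed from two binary masks as in the following NumPy sketch. The function name `detection_metrics` and the convention of returning 0 when a denominator is zero are assumptions for illustration, not taken from the text.

```python
import numpy as np

def detection_metrics(gt_mask, pred_mask):
    """Pixel-wise precision, recall, F1, Jaccard (IoU), and global accuracy.

    Both masks are binary: nonzero means object pixel, zero means background.
    """
    gt = np.asarray(gt_mask, dtype=bool)
    pred = np.asarray(pred_mask, dtype=bool)
    tp = int(np.sum(gt & pred))      # object pixels detected as object
    tn = int(np.sum(~gt & ~pred))    # background pixels detected as background
    fp = int(np.sum(~gt & pred))     # background pixels detected as object
    fn = int(np.sum(gt & ~pred))     # object pixels detected as background
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    global_acc = (tp + tn) / gt.size
    return {"precision": precision, "recall": recall, "f1": f1,
            "jaccard": jaccard, "global_acc": global_acc}

# Toy example: 2 object pixels in the ground truth, 1 of them detected,
# plus 1 false detection in the background.
gt_mask = np.array([[1, 1, 0], [0, 0, 0]])
pred_mask = np.array([[1, 0, 1], [0, 0, 0]])
metrics = detection_metrics(gt_mask, pred_mask)
```

In this toy case #TP = 1, #FP = 1, #FN = 1, and #TN = 3, so precision, recall, and F1 are all 0.5, the Jaccard similarity is 1/3, and the global accuracy is 4/6.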
As shown in Table 5, our restored image + Mask R-CNN showed a higher detection accuracy compared to that by PLN + Mask R-CNN for all five metrics. In Figure 14, the results of PLN + Mask R-CNN and our method + Mask R-CNN are compared. As shown in this figure, it is evident that the detection accuracy of our method + Mask R-CNN is higher than that of PLN + Mask R-CNN.

VI. CONCLUSIONS
In this study, methods for removing thermal reflection in thermal images of diverse objects were proposed. Specifically, a new method using a PFCN, which considers the heat information of nearby ground and walls, is proposed for transforming an image only within the region where thermal reflection is detected. In the PFCN model, unnecessary channels and parameters are removed from the trained FCN structure by pruning, and the performance of thermal reflection removal is improved despite having fewer parameters than the FCN model. The proposed method was compared against various state-of-the-art methods (SegNet-based removal, CycleGAN, PLN, Mask R-CNN + CycleGAN, Mask R-CNN-based removal, FCN_V1, FCN_V2), and the accuracy of removing thermal reflection using the proposed method was shown to be higher than that of the state-of-the-art methods in experiments conducted using our database and additional open databases. As shown in [33], the PLN and CycleGAN-based methods that were compared in our experiment have been used for image inpainting, and we can regard these methods as classical image inpainting algorithms. As shown in Tables 2, 3, and 5 and Figures 11-14, our method shows higher accuracy than these image inpainting methods. A border remains around the detected reflection because the mask generated by our PFCN is slightly smaller than the ground-truth reflection area. Nevertheless, it has little influence on the object detection accuracy, as shown in Table 5 and Figure 14.
To solve this problem of the remaining border, we can adjust the output threshold of the PFCN, which produces a larger mask and reduces the resulting border around the detected reflections. We will investigate this approach in future work. A method for transforming low-resolution thermal images to high-resolution images will also be examined in future research. Furthermore, a method for detecting the object, thermal reflection, and halo effect in thermal images and removing the thermal reflection and halo effect will be studied as well.
APPENDIX
See Table 6