Saturated Reflection Detection for Reflection Removal Based on Convolutional Neural Network

Single image reflection removal is a technique that removes undesirable reflections caused by glass from images. Various methods of reflection removal have been proposed, but unfortunately, they usually fail to remove reflections with very high pixel values. In this paper, we define these reflections as saturated reflections, discuss their characteristics, and propose a removal system. The proposed system detects areas of saturated reflections with our proposed convolutional neural network model and restores them with a conventional method of image estimation. In our experiments, the proposed system shows better peak signal-to-noise ratio scores and perceptual quality than conventional methods of reflection removal.


I. INTRODUCTION
Single image reflection removal has been actively studied, and many methods have been proposed over several decades [1]-[11]. When we take a picture through glass, reflections from the glass surface often appear in the resulting image. This phenomenon is undesirable: it degrades not only the image quality but also the accuracy of computer vision applications such as image recognition and object detection. To remove reflections from images, various methods have been proposed, typically based on optimization theory [1]-[11].
Recently, methods based on convolutional neural networks (CNN) have been actively proposed and show better results than conventional ones [12]-[31]. A method using a CNN was first proposed in [12]. A method that can be trained with unaligned images was proposed in [19], and a method that uses a generative adversarial network and a loss function based on gradient information was proposed in [16]. These state-of-the-art methods show better results than conventional ones, both perceptually and objectively.
Unfortunately, state-of-the-art methods usually fail to remove reflections with very high pixel values near saturation because they cannot recognize them as reflections. In this paper, these particular reflections are called saturated reflections. An example of saturated reflections is shown in the red box of Fig. 1(a), and its ground truth (GT) is shown in Fig. 1(b). Saturated reflections are usually caused by light sources and completely conceal the background information. Fig. 1(c) shows the result of applying the method of [19] to (a); it is observed that the method fails to remove the saturated reflection. These methods assume that the pixel values of images with reflections are the sum of the background and reflection images. Since this assumption is not valid for saturated reflections, these methods do not recognize them as reflections.
In this paper, we tackle this problem and introduce a removal system using our proposed detection method. First, we discuss the definition, characteristics, and removal procedure of saturated reflections. To remove saturated reflections, we propose a detection method based on a CNN. In the proposed method, pixels with very high values are detected by pre-processing, and then the proposed CNN model classifies the resultant pixels into backgrounds and reflections. The proposed CNN model is based on U-Net [32], [33] and uses high-frequency components of the input image as one of its inputs. The proposed system combines the proposed detection method with a conventional image restoration method for reflection removal. In our experiments, we compare the proposed method with humans in saturated reflection detection and the proposed system with state-of-the-art methods in reflection removal. The proposed system removes saturated reflections and achieves better PSNR scores than the state-of-the-art methods, as perceptually shown in Fig. 1(d).
The contributions of this paper are as follows:
• We indicate a new problem of reflection removal and clarify its characteristics, which contributes to the development of reflection removal.
• We propose a method and a system for removing reflections, including saturated reflections, and show their efficacy in experiments. The system can be directly combined with conventional methods to improve them.
• Through our experiments, we show the efficacy of the proposed CNN, which uses high-frequency components of images, for the detection of saturated reflections.
This paper is organized as follows: Section 2 discusses saturated reflections. The proposed method for detecting saturated reflections and the proposed system using it for reflection removal are explained in Sections 3 and 4, respectively. Experiments are described in Section 5, and this paper is concluded in Section 6.

II. SATURATED REFLECTIONS

A. DEFINITION
We define saturated reflections as reflections with very high pixel values near saturation. For example, when images are represented with 8 bits, the pixel values of saturated reflections are close to 255 in at least one color channel. By definition, saturated reflections usually saturate and eliminate the background information in their areas. Saturated reflections are typically caused by high-luminance objects such as light sources and hence appear white. Fig. 2 illustrates the reflection phenomena within glass [2]. Since the glass has some thickness, there are two reflection planes, and the light from one source reflects several times at these planes. Saturated reflections also follow this phenomenon.
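The per-channel criterion above can be sketched as follows. This is an illustrative snippet, and the cutoff of 250 is an assumed value near the 8-bit maximum rather than a threshold specified in this paper.

```python
import numpy as np

def saturated_mask(img, threshold=250):
    """True where at least one color channel is near the 8-bit maximum.

    img: uint8 array of shape (H, W, 3). The threshold of 250 is an
    assumed value close to 255, not one given in the paper.
    """
    return (img >= threshold).any(axis=-1)

# One dark pixel, one with a single saturated channel, one fully saturated.
img = np.array([[[10, 20, 30], [255, 0, 0], [255, 255, 255]]], dtype=np.uint8)
mask = saturated_mask(img)
```

The second pixel is flagged even though only its red channel saturates, matching the "at least one color channel" definition.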

B. CHARACTERISTICS
The state-of-the-art methods based on CNN usually fail to remove saturated reflections as mentioned in Section I [16], [17], [19], [21]. These methods assume that pixel values of input images are the sum of reflection and background images. Unfortunately, since the assumption is not valid for saturated reflections due to their definition, the methods usually recognize saturated reflections as light sources in backgrounds and avoid removing them. Therefore, to remove these saturated reflections, another technique is required.
To remove saturated reflections, it is required to first detect them and then estimate the pixel values of the background in their areas.
Since the pixel values of the background in areas of saturated reflections are eliminated, as mentioned above, they must be estimated from adjacent pixels. Moreover, accurate detection of the saturated areas is required for this estimation. Candidate areas can be detected straightforwardly by thresholding input images, but such areas can belong to either the reflection or the background. Therefore, a technique to classify candidate areas into reflections or backgrounds is required. Both steps, detection and estimation, are necessary for the removal of saturated reflections.
We presume that high-frequency components of images are useful for detecting saturated reflections, which is shown in this paper. Since the glass has some thickness, there are two reflection planes, and the light from one source reflects several times at these planes [2], as shown in Fig. 2. This phenomenon blurs objects acquired through reflection, and thus they are more blurry than those in the background [2], [17]. Therefore, high-frequency components and gradients are sometimes used for detecting reflections in conventional methods [1], [5], [6], [9], [12], [14], [16]. Specifically, since the light of saturated reflections has very high luminance, we presume that it reflects many times off the glass and its objects are strongly blurred. The pixel values of saturated reflections are smoothly attenuated toward their periphery, and thus the gradient information of images is useful for detection.
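The blur cue above can be made concrete with a toy example: a blurred edge spreads the same intensity rise over several pixels, so its maximum local gradient is smaller than that of a sharp edge. This sketch uses simple finite differences, not the high-pass filters of Table 2.

```python
import numpy as np

def max_gradient(patch):
    """Largest absolute finite-difference gradient in a 2-D patch."""
    gx = np.abs(np.diff(patch, axis=1))  # horizontal differences
    gy = np.abs(np.diff(patch, axis=0))  # vertical differences
    return max(gx.max(), gy.max())

# A sharp 0 -> 1 edge versus the same edge spread over four pixels.
sharp = np.zeros((8, 8))
sharp[:, 4:] = 1.0
blurred = np.tile(np.array([0.0, 0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0]), (8, 1))
```

Both edges rise by the same total amount, so a mean-gradient measure would not separate them; the maximum local gradient does, which is why gradient magnitude is a useful blur cue.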

C. DETECTION BY HUMANS
An experiment with humans was conducted to measure the human ability to recognize saturated reflections among candidate areas. To reduce the experimental cost for subjects, a simple procedure was used: a set of two images was shown side by side, where the left image was a natural image and the right image highlighted areas containing pixels near saturation. The highlighted areas are either saturated reflections or light sources in the background. Given an image set, subjects were asked to click the highlighted areas that they recognized as reflections. There was no time limit or constraint on the number of clicks. The subjects were 20 Japanese people, in their twenties and thirties, male and female, who were not familiar with the field of reflection removal. 30 images were used from datasets of natural images [3], [5], [15], [19]. Table 1 shows the results of this experiment, where the measures are defined as in [34], and µ and σ denote the mean and standard deviation over all subjects, respectively. From the recall scores, it is observed that humans sometimes recognize saturated reflections as light sources in backgrounds. From the precision scores, the human ability to recognize reflections is relatively high. Although the procedure of this experiment is simple and was conducted at low cost, humans cannot strictly distinguish saturated reflections among the presented areas. These scores can serve as a reference criterion for this task.

III. DETECTION METHOD FOR SATURATED REFLECTIONS
A. OVERVIEW

Fig. 3 shows an overview of the proposed method for detecting areas of saturated reflections. In pre-processing, a map of candidate areas M and high-frequency components C are calculated from an input image I by thresholding and by filtering with various high-pass filters, respectively. The proposed CNN model F_κ produces a map of initially detected areas D̂ from I and C as

D̂ = F_κ(I, C),

where κ denotes the learnable parameters of the proposed model. D̂ has real values and is binarized. Finally, a resultant map of detected areas D is calculated via post-processing from the binarized D̂ and M.

B. PRE- AND POST-PROCESSING
The pre- and post-processing of the proposed CNN model, mentioned in Section III-A, are explained here. M is a binary map calculated as

m_i = 1 if y_i ≥ Δ, and m_i = 0 otherwise,

where m_i and y_i are the i-th elements of M and the luminance of I, respectively. In the experiments of this paper, the luminance is the Y plane of I in YCbCr color space, and Δ = 0.95 when y_i ∈ [0, 1]. To remove outliers, candidate areas of three pixels or fewer are eliminated. C is calculated by separately applying the four filters shown in Table 2 to the luminance plane, where No. denotes the filter type and * means that filter C has no coefficients on the diagonals. If more than half of the pixels in a candidate area of M have values in D̂ greater than δ, the pixel values of that area in D are set to 1. In other words, let ω be an index set of the area and |ω| the number of its elements; then the pixel values of the area in D are defined as

d_i = 1 for all i ∈ ω if |{ j ∈ ω : d̂_j > δ }| > |ω| / 2, and d_i = 0 otherwise,

where d_i and d̂_i are the i-th elements of D and D̂, respectively. In this paper, δ = 0.5 when d̂_i ∈ [0, 1]. Finally, morphological transformations are applied to D; a dilation with a 7 × 7 kernel is used in this paper.
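The pre- and post-processing steps above (thresholding with Δ, removal of areas of three pixels or fewer, and the majority vote with δ) can be sketched as follows. The flood-fill labeling is an illustrative stand-in for whatever connected-component routine is actually used, and the final dilation is omitted.

```python
import numpy as np

DELTA = 0.95    # luminance threshold for candidate areas
MIN_AREA = 4    # areas of three pixels or fewer are removed as outliers
VOTE_THR = 0.5  # delta for binarizing the CNN output

def connected_areas(mask):
    """Label 4-connected areas of a boolean map with a simple flood fill."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for si in range(mask.shape[0]):
        for sj in range(mask.shape[1]):
            if mask[si, sj] and labels[si, sj] == 0:
                count += 1
                stack = [(si, sj)]
                labels[si, sj] = count
                while stack:
                    i, j = stack.pop()
                    for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
                        if (0 <= ni < mask.shape[0] and 0 <= nj < mask.shape[1]
                                and mask[ni, nj] and labels[ni, nj] == 0):
                            labels[ni, nj] = count
                            stack.append((ni, nj))
    return labels, count

def candidate_map(luminance):
    """M: luminance >= DELTA, with tiny areas discarded as outliers."""
    m = luminance >= DELTA
    labels, n = connected_areas(m)
    for k in range(1, n + 1):
        if (labels == k).sum() < MIN_AREA:
            m[labels == k] = False
    return m

def post_process(m, d_hat):
    """D: keep an area of M when more than half of its pixels exceed
    VOTE_THR in the CNN output d_hat (dilation omitted here)."""
    d = np.zeros(m.shape, dtype=bool)
    labels, n = connected_areas(m)
    for k in range(1, n + 1):
        idx = labels == k
        if (d_hat[idx] > VOTE_THR).mean() > 0.5:
            d[idx] = True
    return d

# Toy example: a 2x2 bright block survives; an isolated bright pixel does not.
lum = np.zeros((6, 6))
lum[1:3, 1:3] = 0.98
lum[5, 5] = 0.99
m = candidate_map(lum)
d_hat = np.zeros((6, 6))
d_hat[1:3, 1:3] = 0.9
d = post_process(m, d_hat)
```

The single bright pixel is dropped by the outlier rule, while the 2×2 block passes both the threshold and the majority vote.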

C. DETECTION CNN

1) Architecture
The architecture of the proposed network is inspired by U-Net [32], [33], and its details are as follows: Fig. 4 shows the architecture, where white, blue, green, and red boxes denote convolution layers, max pooling layers, transposed convolution layers, and squeeze-and-excitation blocks (SE-blocks) [35], respectively. Some layers have multiple input arrows in Fig. 4; in those layers, input features are concatenated along the channel direction and then processed. Table 3 shows the hyper-parameters of each layer, where Conv. and Pool. denote convolution and pooling, and Ch. rate means the multiplication factor from the number of input channels to the number of output channels. Conv. A is used at the beginning of each layer, Conv. B is used on the encoder side (the left half of Fig. 4), and Conv. C is used in the rest. Note that in the SE-block, the number of input and output channels is the same. All layers and blocks use Swish [36] as the activation function, followed by batch normalization [37]. Thanks to this architecture, the proposed network has high-resolution representations in both the spatial and channel directions and produces a map with accurate localization. U-Net is one of the most efficient and well-known structures for image segmentation and can increase the spatial resolution of features without degrading localization accuracy [32], [33]. The proposed network is constructed based on U-Net because this property is also necessary for the detection of reflections. We presume that the features of C come from a different domain than those of I, and therefore separate encoder networks are applied for feature extraction, as shown in the left half of Fig. 4. Moreover, to improve the representation of the proposed network, an SE-block is applied to the features of C at each resolution, because the SE-block improves the network representation by recalibrating features channel-wise at a low computational cost [35].
(Figure caption: "... are applied to the input image to produce the pre-removed image and the map of saturated reflection regions. From the results of the first step, an estimation method such as a conventional inpainting method produces the removed image.")
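The channel-wise recalibration performed by the SE-block [35] can be sketched as follows. This minimal version uses the ReLU-plus-sigmoid form of the original SE paper with illustrative random weights, whereas the proposed network uses Swish and batch normalization.

```python
import numpy as np

def se_block(features, w1, w2):
    """Squeeze-and-excitation: pool each channel globally, pass the
    pooled vector through a two-layer bottleneck, and rescale channels.

    features: (C, H, W); w1: (C // r, C); w2: (C, C // r). ReLU and
    sigmoid follow the original SE paper; the weights are illustrative.
    """
    z = features.mean(axis=(1, 2))          # squeeze: one value per channel
    s = np.maximum(w1 @ z, 0.0)             # excitation, hidden layer (ReLU)
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))     # gates in (0, 1) via sigmoid
    return features * s[:, None, None]      # channel-wise recalibration

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 5, 5))              # 4 channels of 5x5 features
w1 = rng.normal(size=(2, 4))                # reduction ratio r = 2
w2 = rng.normal(size=(4, 2))
y = se_block(x, w1, w2)
```

The gates lie strictly between 0 and 1, so each channel is attenuated rather than amplified; the spatial layout and the number of channels are unchanged, which matches the note that the SE-block preserves the channel count.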

2) Loss Function
For training, the loss function of the proposed network, L, is defined as the weighted sum of the cross-entropy loss L_bin and the focal loss L_FP [38]:

L = L_bin + λ L_FP,

where λ is a balancing parameter. Following their standard forms, L_bin and L_FP are defined as

L_bin = -(1/N) Σ_i [ g_i log d̂_i + (1 - g_i) log(1 - d̂_i) ],
L_FP = -(1/N) Σ_i (1 - p_i)^γ log p_i, with p_i = g_i d̂_i + (1 - g_i)(1 - d̂_i),

where d̂_i is the i-th element of the network output D̂, g_i is the corresponding GT label, N is the number of pixels, and γ is the focusing parameter of the focal loss [38].
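Assuming the standard binary cross-entropy and focal-loss forms, the combined loss can be sketched as follows; λ = 20 and γ = 3.0 are the values reported for training in Section V-A.

```python
import numpy as np

def bce_loss(d_hat, g, eps=1e-7):
    """Binary cross entropy between predictions d_hat and GT labels g."""
    d = np.clip(d_hat, eps, 1.0 - eps)
    return -np.mean(g * np.log(d) + (1.0 - g) * np.log(1.0 - d))

def focal_loss(d_hat, g, gamma=3.0, eps=1e-7):
    """Focal loss [38]: the factor (1 - p_t)^gamma down-weights easy pixels."""
    d = np.clip(d_hat, eps, 1.0 - eps)
    p_t = np.where(g == 1.0, d, 1.0 - d)    # probability of the true class
    return -np.mean((1.0 - p_t) ** gamma * np.log(p_t))

def total_loss(d_hat, g, lam=20.0, gamma=3.0):
    """L = L_bin + lam * L_FP, with the training values as defaults."""
    return bce_loss(d_hat, g) + lam * focal_loss(d_hat, g, gamma)

g = np.array([0.0, 1.0, 1.0, 0.0])
good = np.array([0.1, 0.9, 0.8, 0.2])   # confident, mostly correct
bad = np.array([0.9, 0.1, 0.2, 0.8])    # confident, mostly wrong
```

For confident correct predictions, p_t is close to 1 and the focal term nearly vanishes, so the focal loss concentrates the gradient on hard pixels such as rare saturated areas.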

IV. SYSTEM FOR REFLECTION REMOVAL USING THE PROPOSED METHOD
To remove reflections, including saturated reflections, we propose a system consisting of conventional methods and the proposed method of Section III. The overview of the proposed system is shown in Fig. 5, where I, Ī, and Î denote an input image, the pre-removed image, and the final removed image, respectively. The proposed method is applied to I to detect saturated reflections, and a state-of-the-art method of reflection removal is applied to I to remove the ordinary, non-saturated reflections, producing Ī. The saturated reflections in Ī are then removed by estimating the pixel values of the background, and D is used as the mask indicating these areas. We regard inpainting as one of the most suitable techniques for this estimation, and an inpainting method is used in the experiments of this paper. The proposed system can be improved straightforwardly by developing each component, which encourages studying each technique separately.
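The three-stage system can be sketched as a composition of interchangeable components; all three callables below are toy stand-ins for illustration, not the actual models used in the experiments.

```python
import numpy as np

def remove_reflections(image, removal_method, detect_saturated, estimate):
    """Sketch of the proposed system (Fig. 5) with its stages as callables.

    removal_method  : conventional reflection removal, I -> I_bar
    detect_saturated: the proposed detector, I -> binary mask D
    estimate        : background estimation such as inpainting,
                      (I_bar, D) -> I_hat
    """
    i_bar = removal_method(image)    # remove ordinary reflections
    d = detect_saturated(image)      # locate saturated reflections
    return estimate(i_bar, d)        # restore the masked areas

# Toy stand-ins: identity removal, fixed-threshold detection, and an
# "inpainting" that fills masked pixels with the unmasked mean.
img = np.array([[0.2, 0.2], [0.99, 0.2]])
out = remove_reflections(
    img,
    removal_method=lambda x: x,
    detect_saturated=lambda x: x >= 0.95,
    estimate=lambda x, m: np.where(m, x[~m].mean(), x),
)
```

Because the stages only communicate through images and a mask, each component can be replaced independently, which is the modularity argument made above.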

V. EXPERIMENTS

A. TRAINING OF PROPOSED CNN
To train the proposed network described in Section III-C, we artificially created 56000 images containing background light sources and saturated reflections. The Places365 dataset and high dynamic range (HDR) images [39]-[42] were used as backgrounds and reflections, respectively. To create artificial reflections R, HDR images were converted into low dynamic range (LDR) by gamma correction [34] and blurred by Gaussian kernels with zero mean and variances drawn randomly from [0, 2] [43]. A background image and R were added directly, and the resultant image X was cropped to a size of 256 × 256. The GT is a binary map G that indicates the areas of saturated reflections in X, calculated as

g_i = 1 if x_i ≥ Δ and r_i ≥ Δ, and g_i = 0 otherwise,

where g_i is the i-th element of G, x_i and r_i are the i-th elements of the luminance of X and R in YCbCr color space, respectively, and pixel values are in [0, 1]. The number of images in the training set was 25000. Random horizontal and vertical flipping were applied to the training set. The proposed CNN model was trained with the following hyper-parameters: momentum SGD is used as the optimization method [37], the momentum is 0.9, and the batch size is 64. The learning rate is initially set to 0.001 and multiplied by 0.1 every 25 epochs. The model was trained for 100 epochs. For the loss function explained in Section III-C2, we set λ = 20 and γ = 3.0.
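The synthesis procedure can be sketched as follows. The gamma value of 2.2, the reuse of Δ = 0.95, and the labeling rule requiring both X and R to be near saturation are our assumptions for illustration, and the random Gaussian blur of the reflection is omitted.

```python
import numpy as np

DELTA = 0.95  # saturation threshold, reused from the detection step

def synthesize(background, hdr_reflection, gamma=2.2):
    """Create one training pair: a composite X and its GT map G.

    Assumptions for illustration: gamma = 2.2 for the HDR-to-LDR
    conversion, and a pixel is labeled a saturated reflection when
    both X and R are near saturation. The blur step is omitted.
    """
    r = np.clip(hdr_reflection, 0.0, None) ** (1.0 / gamma)  # HDR -> LDR
    r = np.clip(r, 0.0, 1.0)
    x = np.clip(background + r, 0.0, 1.0)   # direct addition, then clip
    g = (x >= DELTA) & (r >= DELTA)         # GT map of saturated reflections
    return x, g

bg = np.full((2, 2), 0.3)                   # flat toy background
hdr = np.array([[10.0, 0.0], [0.0, 0.0]])   # one very bright HDR source
x, g = synthesize(bg, hdr)
```

The very bright HDR pixel clips to 1.0 after gamma correction, saturates the composite, and is the only pixel labeled in G.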

B. EVALUATION OF SATURATED REFLECTION DETECTION
For the evaluation of saturated reflection detection, the proposed method was compared with humans as follows: the proposed method was applied to the same 30 images used in the experiment with humans described in Section II-C. Although the proposed method produces a pixel-wise binary map, true and false results are counted area-wise. If more than half of the pixels in a candidate area have the value 1 in D before dilation, we define that the area is recognized as a saturated reflection by the proposed method; otherwise, it is recognized as a background object. Table 4 shows the detection scores of humans and the proposed method, where Prop. denotes the proposed method. From Table 4, the scores of humans are considerably higher than those of the proposed method, although humans were not asked to identify the position of the areas. Table 1 serves as a reference criterion for this task, and the proposed method needs improvement to achieve more competitive scores. Specifically, since the largest gap is in precision, reducing false detections is an area for improvement.
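The area-wise counting described above reduces each candidate area to a single decision, after which precision and recall follow their usual definitions. A minimal sketch:

```python
def area_scores(predicted, truth):
    """Area-wise precision and recall.

    predicted[i]: the area is classified as a saturated reflection;
    truth[i]: the area really is one. Each candidate area counts once.
    """
    tp = sum(p and t for p, t in zip(predicted, truth))
    fp = sum(p and not t for p, t in zip(predicted, truth))
    fn = sum(t and not p for p, t in zip(predicted, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four candidate areas: two real reflections, two background lights,
# with one background light falsely detected.
precision, recall = area_scores(
    predicted=[True, True, True, False],
    truth=[True, False, True, False],
)
```

Here one false positive drags precision down to 2/3 while recall stays at 1.0, illustrating why reducing false detections is the main avenue for improvement.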
To visually explain the performance of the proposed method, an example of its results is shown in Fig. 6, where the white pixels in the blue and green boxes of Fig. 6(b) are light sources of the background and saturated reflections, respectively. Fig. 6 contains all detection cases: true and false positives and negatives. Comparing Fig. 6(b) and (c), false positives, false negatives, and true negatives appear in the red, green, and blue boxes, respectively. The proposed method detects saturated reflections while avoiding background objects, and the proposed system produces natural images by removing saturated reflections, as shown in Fig. 1. However, since the proposed method has lower accuracy than ideal detection, it often detects background objects as reflections and misses saturated reflections. Specifically, as mentioned above, reducing false detections is necessary because they lead to the elimination of background objects.

C. EVALUATION OF REFLECTION REMOVAL
The proposed system is compared with state-of-the-art methods of reflection removal in this section. The methods [16], [19], [23], [25], [29] are used for comparison, which we refer to as GCNet [16], ERRNet [19], IBCLN [23], Kim et al. [25], and LRMNet [29] in this paper. Since the proposed system incorporates a method of reflection removal, each method is compared with the proposed system built on top of that method. For this experiment, the proposed system uses an inpainting method [44] as the estimation method shown in Fig. 5. 20 sets of images and their GTs are used from datasets of reflection removal [3], [5], [15], [19]; these images contain saturated reflections and background light sources.
The peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS) are used as objective measurements [43], [45]. Table 5 shows the PSNR, SSIM, and LPIPS scores, which are mean values over the 20 images, where "Only" means applying only the method to the images, whereas "+Prop." means using the proposed system with it. For each method, the proposed system produces better or comparable scores. Unfortunately, the differences are only slight because the number of pixels in saturated reflections is very small compared to the whole image; thus, the improvement only slightly influences the scores. Figs. 7-10 show the resultant images of reflection removal by the conventional methods and the proposed system. Figs. 9 and 10 show enlarged views of the areas bounded by the red boxes of the corresponding images in Figs. 7 and 8, respectively. These figures show that the conventional methods fail to remove saturated reflections, while the proposed system removes these reflections without degrading the image quality. However, some saturated reflections still remain in the output images of the proposed system. Fortunately, false positives of the proposed method are perceptually inconspicuous in the output images because they usually cover small areas and are naturally filled in by inpainting.
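For reference, PSNR for images with values in [0, 1] can be computed as follows; SSIM and LPIPS require their respective reference implementations and are omitted here.

```python
import numpy as np

def psnr(estimate, ground_truth, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((estimate - ground_truth) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

gt = np.zeros((2, 2))
est = np.full((2, 2), 0.1)    # a uniform error of 0.1 gives MSE = 0.01
```

Because PSNR averages the squared error over the whole image, a small saturated area that is fixed or left untouched changes the score only slightly, which explains the small differences in Table 5.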

VI. CONCLUSION
In this paper, we discussed the detection and removal of saturated reflections, proposed a detection method based on a CNN with several signal processing techniques, and introduced a removal system based on the proposed detection method and conventional removal methods. First, we discussed the definition, characteristics, and removal procedure of saturated reflections and showed the experimental results of humans recognizing saturated reflections. The proposed method detects candidate areas of saturated reflections by thresholding and classifies them into reflections and backgrounds using the proposed CNN with high-frequency components of images. The proposed system detects areas of saturated reflections with the proposed method and restores them using a conventional method of image restoration. In experiments using conventional methods of reflection removal, it is shown that the proposed system achieves better PSNR scores than the original methods. Moreover, the removal of saturated reflections by the proposed system is demonstrated perceptually.
Unfortunately, the proposed method has lower precision and recall scores in detection than humans. In future work, we hope to improve the proposed method by reducing false detections and to achieve detection scores comparable to those of humans.