HDR Image Reconstruction Using Segmented Image Learning

Converting a low dynamic range (LDR) image into a high dynamic range (HDR) image produces an image that closely depicts the real world without requiring expensive devices. Recent deep learning developments can produce highly realistic and sophisticated HDR images. This paper proposes a deep learning method that segments the bright and dark regions of an input LDR image and reconstructs the corresponding HDR image with a dynamic range similar to the real world. The proposed multi-stage deep learning network brightens the bright regions and darkens the dark regions, and the features with extended brightness ranges are combined to form the HDR image. Dividing the LDR image into bright and dark regions effectively recovers information in lost over-exposed and under-exposed areas, reconstructing a natural HDR image with color and appearance similar to reality. Experimental results confirm that the proposed method achieves an 8.52% higher HDR visual difference predictor (HDR-VDP) score and a 41.2% higher log exposure range than current methods. Qualitative evaluation also verifies that the proposed method generates images that are close in quality to the ground truth.


I. INTRODUCTION
Substantial image processing advances over the past few years have increased the need for technology that produces images closely resembling the real world. High dynamic range (HDR) is the most widely used image quality enhancement technology. Fig. 1 shows that HDR generates images with brightness ranges similar to human perception, compared with low dynamic range (LDR) images with low brightness. However, HDR images must be obtained with expensive devices that can operate beyond standard camera sensor limitations to acquire a wide dynamic range [1]; hence, most people have limited access to HDR images. To overcome these limitations, multiple exposure fusion (MEF) methods [2], [3] have emerged, which generate multi-exposure LDR images that are then merged to reconstruct an HDR image. However, since multiple images cannot be taken simultaneously, ghost artifacts that project the flow of time occur when merging the bracketed images [4]. Areas with missing information also occur in ranges outside a given exposure. Therefore, inverse tone mapping (ITM) methods [5] - [11] have been proposed to generate HDR images from single LDR images. However, previous ITM methods are insufficient for finding a function similar to the inverse camera response function (CRF), and obtaining parameters that satisfy all photographic environments is difficult. Deep learning methods have been proposed to overcome this problem, including convolutional neural network (CNN)-based MEF methods [12] - [18] and CNN-based ITM methods [19] - [23]. CNN-based MEF methods [12] - [18] can uniformly restore the saturated pixels by generating and merging multi-exposure images. However, they have difficulty restoring saturated pixels that do not exist in the brightness range of the multi-exposure images. CNN-based ITM methods [19] - [23] have emerged to reconstruct HDR images from a single LDR image.
These methods map the LDR image to the HDR image through a CNN that acts as an inverse CRF. Nevertheless, despite efforts to obtain near-original HDR images, effective restoration of saturated pixels is still limited. As shown in Fig. 2, the number of saturated pixels in the HDR image is minimal compared to the total number of pixels. Since image restoration typically uses loss functions based on the ℓ1 or ℓ2 distance or the mean squared error (MSE), weights are updated to reduce the global pixel value differences between the ground truth and inferred images. As a result, the network weights are not efficiently updated to restore the saturated pixels. Moreover, since over-exposed pixels have very high values and under-exposed pixels have very low values, it is difficult to restore all of the saturated pixels uniformly. Therefore, this paper proposes a deep learning ITM method that generates HDR images by learning from segmented LDR images. The main contributions of this paper are as follows:
1) The proposed method divides the input image into bright and dark regions and trains the two regions individually. Bright regions are learned to effectively restore over-exposed areas, and dark regions are learned to restore under-exposed areas. Therefore, the proposed approach can effectively restore all saturated pixels.

2) The proposed mask generation method for image segmentation avoids artifacts. The proposed mask allows smooth and clear image segmentation, avoiding boundary artifacts that occur when the bright and dark regions are combined.
3) The proposed multi-stage network structure increases the dynamic range and uniformly restores the saturated pixels. To extend the dynamic range, bright regions are brightened and dark regions are darkened. Then, the features are blended to generate the final HDR image.
The remainder of this paper is structured as follows: Section II discusses how the proposed method differs from previous ITM methods. Section III describes the algorithm, network structure, and loss function of the proposed method. Section IV describes the datasets and parameters used for training and testing. Section V discusses the qualitative and quantitative comparisons between the proposed ITM method and existing ITM methods. Finally, Section VI summarizes and concludes the paper and discusses future study directions.

II. RELATED WORKS
The restoration of HDR images from clipped LDR images is a complex problem. Various approaches have been studied in computer graphics to solve it, which can be classified into MEF and ITM methods.

A. MULTIPLE EXPOSURE FUSION METHODS
Debevec et al. [2] generated an HDR image by combining bracketed images. Mertens et al. [3] used a Laplacian pyramid and a Gaussian pyramid to fuse the contrasting features in multi-exposure images. These works [2], [3] established guidelines for the MEF method. However, the values of the bracketed images were not adequately mixed during the merging process, resulting in halo artifacts. CNN-based MEF methods have been proposed to optimally extract the weight values of bracketed images and generate HDR images with evenly merged brightness values. Endo et al. [12] used U-Net [24] to generate HDR images with wide brightness ranges by bracketing images with various exposure values. Similarly, Lee et al. generated HDR images using a chain network to create multiple exposure stacks [13] and a generative adversarial network (GAN) [14], [15]. Since MEF methods generally use more than three LDR images, they require extensive processing time, power consumption, and storage capacity. To overcome these problems, Ma et al. [16] proposed a fast MEF method that generates weight maps for downsampled bracketed images using a multiscale context aggregation network (CAN) [25]. However, this method [16] also requires a large amount of computation since many images are fed into the network. To minimize the shortcomings of the CNN-based MEF methods, two-exposure fusion (TEF) methods that use only two images have been proposed. Prabhakar et al. [17] proposed a network that fuses the luminance information of two differently exposed LDR images. Yin et al. [18] combined two exposure images through a prior-aware GAN, which consists of a content prior guided (CPG) encoder and a detail prior guided (DPG) decoder. However, since TEF methods [17], [18] require two exposure images, the quality of the generated images has a substantial impact on the results. Moreover, similar to the CNN-based MEF methods, it is difficult to restore information that is only observable at exposure values other than the preset ones.

B. INVERSE TONE MAPPING METHODS
Akyuz et al. [5] implemented HDR images using linear expansion based on gamma correction. However, efficient range expansion is difficult to achieve unless the input parameters are properly set. Banterle et al. generated HDR images using an expanded map and the median cut algorithm [6], along with a power function and an expansion map [7]. However, these approaches require parameters for density estimation, and the inferred HDR images have limited ability to restore over-exposed regions. Huo et al. [8] generated HDR images with fewer parameters using a physiological ITM algorithm based on the human visual system (HVS), and Masia et al. [9] proposed an automatic global ITM based on gamma expansion. Nevertheless, the functions employed were insufficient to reconstruct saturated pixels. In contrast, some methods process a single image by segmenting it into specific ranges. Didyk et al. [10] classified images into diffuse, reflection, and light regions and applied different enhancements to each area to generate HDR images with a high luminance range. Rempel et al. [11] used Gaussian filtering and an image pyramid to extract pixels with high luminance and assigned information to saturated pixels. These approaches could effectively restore the saturated pixels but were unable to obtain functions satisfying ground-truth-level color and texture. The traditional ITM methods generate HDR images by applying a function to LDR images. However, finding a function similar to the inverse CRF is very difficult. Several deep learning methods have recently emerged to solve this problem.
Eilertsen et al. [19] generated HDR images using an autoencoder network with a loss function that considered illuminance and reflectance. They generated HDR images with an increased dynamic range by blending images to convert network outputs and LDR images to linear ranges. However, in contrast to the bright regions generated by the CNN, the dark regions were generated using virtual camera curves. Therefore, they could not effectively restore all saturated regions.
Cai et al. [20] decomposed the input LDR image into a low-frequency luminance region and a high-frequency detail region using weighted least squares (WLS) [26]. The features of each region were then extracted through a two-stage network. Cai et al. [20] generated HDR images with increased contrast, even for low-exposure images. However, they did not effectively increase the dynamic range, since they focused on increasing contrast rather than restoring the saturated pixels.
Marnerides et al. [21] utilized regional and local information from LDR images in an end-to-end network divided into global, dilation, and local branches. All reconstruction is performed within the network without human intervention. Although this approach can obtain natural HDR images, it is limited in inferring the clipped pixel values of under- and over-exposed regions to achieve a dynamic range similar to the real world. Liu et al. [22] proposed a method for generating HDR images using multiple networks based on the conventional camera pipeline that produces LDR images. They implemented dequantization, linearization, hallucination, and refinement networks to reverse the LDR acquisition process using deep learning, obtaining the final output by passing the LDR image through these networks sequentially. However, since this method uses several networks, there is a significant risk that the inference will change substantially if even one network is not properly trained. Ye et al. [23] showed that bright and dark regions have different characteristics, suggesting that different restoration methods should be employed for the different regions. Although they achieved effective restoration for both under- and over-exposed regions, it was difficult to increase the dynamic range efficiently due to the lack of direct brightness enhancement. Most previous CNN-based ITM methods reconstruct HDR images by inputting LDR images to the network without significant pre-processing. Cai et al. [20] segmented the image through pre-processing, but only based on the frequency components. Eilertsen et al. [19] and Ye et al. [23] divided the image based on brightness values, but [19] divided the image after learning, and [23] segmented it just before the final HDR image was generated.
Therefore, it is difficult to claim that [19] and [23] can effectively restore the saturated pixels, since the CNN weights only consider creating an image similar to the ground truth image. Efficient reconstruction of saturated pixels is difficult when raw LDR images are used as inputs. Therefore, this paper proposes separating LDR images into bright and dark regions before input and then learning them in different ways. Fig. 3 shows the overall flow of the proposed method. The image threshold thr is calculated during pre-processing, and the mask M is generated. The LDR image is then segmented into bright and dark regions (I_B and I_D, respectively) using M, and I_B and I_D become the inputs of the proposed multi-stage network.

III. PROPOSED METHOD
The bright region I_B is learned by the Brightening block to increase its brightness values, and the dark region I_D is learned by the Darkening block to further reduce its low brightness values. Finally, the two feature maps with enhanced dynamic ranges are combined to obtain the final HDR image. The original image is also input into the network to support the calculations in the Brightening and Darkening blocks and the feature blending.
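The brighten-darken-blend flow described above can be sketched at a high level. The scaling factors and stub functions below are hypothetical stand-ins for the learned CNN blocks (whose adjustment strengths are inferred from the original image), not the paper's actual operations:

```python
import numpy as np

# Hypothetical stand-ins for the learned blocks; the real ones are CNNs.
def brighten(feat):
    return feat * 1.2            # Brightening block: raise bright-region values

def darken(feat):
    return feat * 0.8            # Darkening block: lower dark-region values

def blend(feat_bright, feat_dark):
    # Feature-blending stage: combine the two extended-range feature maps.
    return np.clip(feat_bright + feat_dark, 0.0, None)

def forward(i_b, i_d):
    """High-level flow of the proposed multi-stage network (sketch only)."""
    return blend(brighten(i_b), darken(i_d))
```

In the actual network the original LDR image is a third input that guides how strongly each block adjusts brightness; that conditioning is omitted in this sketch.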

A. PRE-PROCESSING: LDR IMAGE SEGMENTATION
Bright regions contain information for only the bright pixels, and dark regions contain information for only the dark pixels. As a result, the proposed network can focus on over-exposed and under-exposed pixels when the bright and dark regions, respectively, are fed into the network. Therefore, the input image is segmented into bright and dark regions during pre-processing to enhance the dynamic range. Otsu [27] proposed the representative image segmentation method, which separates the image into two classes using the threshold value that maximizes the pixel variance over the image. In most cases, the Otsu [27] method classifies the image clearly into two classes, C_1 and C_2. However, when the pixel values of the image are similar, the class boundaries are not clear because many pixel values are close to the threshold. To clarify the difference between the two classes C_1 and C_2, it is necessary to find the threshold that maximizes the sum of the variances between the two classes, and this threshold should be located at a valley point of the image histogram [28]. Therefore, this paper proposes a method to find the optimal valley point in the histogram. Before applying the proposed method, the brightness value is measured by converting the RGB image to the HSV color space, and a histogram is generated. Alg. 1 describes the proposed algorithm, and Fig. 4 shows an example of finding the optimal valley point. As described in Alg. 1, the first step is to increase the pixel value from the Otsu threshold value to 252 (Fig. 4 (a)). The second step is to find the valley points by checking the slope between the histogram value at the current pixel and the histogram values at the points k ∈ {±1, ±2, ±3}, which represent the distance from the current pixel (Alg. 1, lines 6-8). Third, the two classes are split at each valley point, and the sum of the class variances, σ², is calculated (Alg. 1, lines 10-14). Each valley point and its σ² are then added to the valley set A (Fig. 4 (b), (c)). Finally, the valley point with the largest σ² in the valley set is assigned as the threshold thr (Fig. 4 (d)).
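A minimal sketch of the valley search, assuming a 256-bin brightness histogram; the valley test at distances 1-3 and the variance criterion follow the description above, while the function names and fallback behavior are our own:

```python
import numpy as np

def between_class_variance(hist, t):
    """Class-separation criterion (as in Otsu's method) for a split at bin t."""
    levels = np.arange(len(hist), dtype=np.float64)
    w0, w1 = hist[:t].sum(), hist[t:].sum()
    if w0 == 0 or w1 == 0:
        return 0.0
    m0 = (levels[:t] * hist[:t]).sum() / w0
    m1 = (levels[t:] * hist[t:]).sum() / w1
    return float(w0) * float(w1) * (m0 - m1) ** 2

def otsu_threshold(hist):
    # Standard Otsu: the split that maximizes between-class variance.
    return max(range(1, len(hist)), key=lambda t: between_class_variance(hist, t))

def optimal_valley_threshold(hist, upper=252):
    """Scan from the Otsu threshold toward `upper`, collect valley points
    (bins no higher than their neighbours at distances 1..3), and return
    the valley with the largest between-class variance."""
    t0 = otsu_threshold(hist)
    valleys = [v for v in range(max(t0, 3), upper)
               if all(hist[v] <= hist[v + k] and hist[v] <= hist[v - k]
                      for k in (1, 2, 3))]
    if not valleys:
        return t0  # fall back to Otsu when no valley is found
    return max(valleys, key=lambda v: between_class_variance(hist, v))
```

On a clearly bimodal histogram the result coincides with the Otsu threshold; the valley search only moves it when the Otsu split lands away from a true histogram valley.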
The proposed mask M, generated from thr, is expressed in (1) as the squared, normalized difference between the image brightness V(x, y) and thr, where x and y denote the horizontal and vertical pixel coordinates, respectively. Since the square function is used in (1), a pixel with a brightness value slightly larger than thr has a very small mask value, which makes the boundary surface smooth when segmenting the image. Fig. 5 and Fig. 6 show the images segmented by the proposed method and by Otsu's [27] method. As shown in Fig. 5, the thr values of some images are the same as the Otsu threshold values. However, as shown in Fig. 6, if most image pixels have similar values, the Otsu threshold may not be located at the optimal valley point. As a result, the dividing boundary is created at a false valley point, as shown in Fig. 6 (d). In contrast, the proposed method can find a new optimal valley point that clearly separates the bright and dark regions, as shown in Fig. 6 (c). The generated mask M is then multiplied by the original LDR image to segment it into I_B and I_D. The proposed mask is similar to those previously used in [15], [19], and [23], but the proposed method enables more accurate image segmentation by setting thr. The generated mask is also non-linear, i.e., the square of the difference, producing smoother segmentation boundaries. Consequently, the input LDR image can be segmented into I_B and I_D, as shown in Figs. 5 and 6 (c).

B. NETWORK STRUCTURE
Fig. 7 shows the proposed multi-stage CNN structure. The network inputs are I_B and I_D, generated in pre-processing, and the original LDR image. I_B expands the dynamic range through the Brightening block (Fig. 8 (a)), which increases brightness, and I_D expands the dynamic range through the Darkening block (Fig. 8 (c)), which reduces brightness. The two feature maps from the distinct learning processes are subsequently combined properly to prevent artifacts. The proposed network structure includes three stages.
1) The dynamic range extension stage increases the brightness of the bright region and decreases the brightness of the dark region;
2) The global feature concatenation stage prevents outliers; and
3) The blending and generation stage combines the learned feature maps.
Details for each stage are explained in the following sections.

1) DYNAMIC RANGE EXTENSION
Fig. 8 shows that the Brightening (Fig. 8 (a)) and Darkening (Fig. 8 (c)) blocks are based on the residual channel attention block (RCAB) proposed by Zhang et al. [29], with squeeze parameters obtained from the original LDR image. For example, brightness should be adjusted more strongly for images taken outdoors under strong sunlight because most pixels have fairly bright values. In contrast, the brightness change should be slight for images taken indoors because they already have appropriate brightness. Thus, the feature brightness changes depending on the input image (Fig. 8 (b)). The squeeze parameters Z_B and Z_D are expressed in (2)-(5), where F is the feature map created after the original image passes through the convolution layer with the ReLU [30] activation function, and P_max, P_min, and P_avg denote maximum, minimum, and average pooling, respectively. The squeeze parameter Z_B for the Brightening block is obtained by maximum pooling with 4 × 4 pixels on F followed by average pooling to 1 × 1 pixels to properly represent brightness (4). Akyuz et al. [5] showed that the HVS is more sensitive to dark regions than to bright regions, suggesting that dynamic ranges should be scaled by different amounts depending on the region. In particular, Ye et al. [23] argued that texture and color information in dark regions is ambiguous, in contrast to bright regions; hence, under-exposed ranges may not be restored effectively if the brightness changes significantly. Reibel et al. [31] showed that dark regions captured with a camera contain more noise than bright regions; hence, unintended artifacts can appear if the change is severe when darkening the dark regions. Thus, the degree of change in I_D should be less than that in I_B. Hence, the squeeze parameter Z_D used in the Darkening block is obtained using minimum rather than maximum pooling (5). The final channel statistics s_B and s_D are determined from Z_B and Z_D in (6) and (7) as s = F_sig(F_ReLU(Z)), where F_ReLU is a convolutional layer employing the ReLU [30] activation function and F_sig is a convolutional layer employing the sigmoid activation function. As a gating function for rescaling the feature maps, we used the sigmoid as in [28]. The channel size is downscaled to 64/r at F_ReLU and then upscaled back to 64 at F_sig. The reduction ratio r for reducing the parameter overhead was set to 16, following Zhang et al. [29]. Finally, the feature maps F_B and F_D with adjusted brightness values are obtained in (8) and (9) by rescaling, with s_B and s_D, the results of the convolutional layers applied to I_B and I_D, respectively. F_B generated using (8) has increased brightness compared to I_B, and F_D generated using (9) has reduced brightness compared to I_D. In particular, the degree of change in F_D is smaller than that in F_B, preventing possible problems when restoring under-exposed ranges. When PReLU [32] is used as the activation function, negative parameters may remain in the feature map, which can cause F_B to become darker rather than brighter and F_D to become brighter. To prevent negative parameters, ReLU [30] is used as the activation function of the Brightening and Darkening blocks. Consequently, the bright image becomes brighter and the dark image becomes darker through the Brightening and Darkening blocks, respectively. Naturally inferred feature maps can be obtained since the dynamic range expansion is achieved within a single network.
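The gating mechanism can be illustrated with a toy numpy version. The dense weight matrices stand in for the learned 1×1 convolutions, and the pooling is simplified to a single global max/min, so this is a sketch of the channel-attention idea rather than the paper's exact block:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def squeeze_statistics(feat, w_down, w_up, pool="max"):
    """Channel-attention gate in the spirit of RCAB [29].
    feat: (H, W, C) feature map.
    w_down: (C, C // 16) and w_up: (C // 16, C) are hypothetical stand-ins
    for the learned down/up convolutions (reduction ratio 16)."""
    if pool == "max":          # bright side: emphasize the strongest responses
        z = feat.max(axis=(0, 1))
    else:                      # dark side: minimum pooling for a gentler change
        z = feat.min(axis=(0, 1))
    s = sigmoid(np.maximum(z @ w_down, 0.0) @ w_up)   # ReLU, then sigmoid gate
    return feat * s            # rescale each channel by its statistic
```

Because the gate passes through a sigmoid, each channel is scaled by a value in (0, 1), which is what keeps the brightness adjustment bounded and input-dependent.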

2) GLOBAL FEATURE CONCATENATION
Unwanted outliers can be generated in the final HDR image because the level at which brightness values change is determined automatically within the proposed deep learning network. Although the image is smoothly segmented during pre-processing, artifacts can appear around the segmentation boundaries when combining the two feature maps, since I_B and I_D are learned with different parameters. Therefore, we use the global branch structure proposed by Marnerides et al. [21] to generate a feature map with abstract features from the original image, which limits significant deformation of the image learned through this feature map. Fig. 7 shows the generated global image feature G with channel size 64, which is subsequently concatenated with the learned feature maps in (10)-(12), where F_conv is a convolution layer that uses the PReLU [32] activation function. Thus, G is used to combine the segmented feature maps F_B and F_D. Since the concatenated maps contain abstract information from the original LDR image, using them rather than F_B, F_D, and F_B + F_D alone prevents excessive outlier generation when increasing the dynamic range.

3) BLENDING AND GENERATION
Although G was used to concatenate F_B and F_D (see (12)), a deep learning network structure is required to properly blend the feature maps learned in different directions. Therefore, we merge the feature maps using DenseNet [33], which can effectively extract input image features using only a few parameters. Thus, the concatenated features pass through a 3-layer DenseNet, where the intermediate outputs are feature vectors and the final output is a vector with 64 features. DenseNet's growth rate is 32, half the input feature channel count. Finally, we create the HDR image, using a sigmoid function in the final layer to map the generated HDR image to the [0, 1] range.
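Dense connectivity with a fixed growth rate can be sketched as follows; `layer_fns` are hypothetical stand-ins for the convolutional layers, and the arrangement of the final output is an assumption rather than the paper's exact layout:

```python
import numpy as np

def dense_block(x, layer_fns):
    """Dense connectivity: each layer receives the concatenation of the
    block input and all previous layer outputs. The growth rate is the
    number of channels each layer adds."""
    feats = [x]
    for fn in layer_fns:
        feats.append(fn(np.concatenate(feats, axis=-1)))
    return np.concatenate(feats, axis=-1)
```

With a 64-channel input and three layers each emitting 32 channels (growth rate 32, half the input channels), the successive layers see 64, 96, and 128 channels, so every later layer can reuse all earlier features with few extra parameters.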

C. LOSS FUNCTION
The Weber-Fechner law [34] states that stimuli must change by a constant ratio rather than a constant quantity for human senses to detect intensity changes, because the perception of luminance variation is logarithmic; hence, loss functions should be handled in the logarithmic domain to create HDR images optimized for human perception. Therefore, we use the logarithmic loss function L(ĥ, h) = L_RGB(ĥ, h) + λ L_lum(ĥ, h), where ĥ is the HDR image estimated by the network, h is the ground truth HDR image, and λ is the loss function weight. We set λ = 0.3, which produced the best experimental results, and implemented the loss as a sum of L_RGB, comparing pixel values in RGB, and L_lum, comparing luminance, to reconstruct the HDR image as close to real life as possible, with L_RGB(ĥ, h) = |log(ĥ + ε) − log(h + ε)| (18) and the analogous luminance term (19). Both (18) and (19) are based on the ℓ1 distance and use a small constant ε to eliminate the singularity at zero pixel values. (18) computes the difference between the ground truth and predicted values in the RGB range, whereas (19) computes it in the luminance range. According to [1], the pixel values of an HDR image approximate photometric quantities, in contrast to an LDR image, and are generally linearly related to luminance. Therefore, it is reasonable to evaluate the HDR image in the luminance range, extracting luminance values from the RGB channels as Y = 0.2126 R + 0.7152 G + 0.0722 B, (20) where R, G, and B indicate the red, green, and blue channel pixel values, respectively. Finally, we obtain the absolute difference between the estimated and ground truth luminance after converting HDR pixel values to the luminance range using (20).
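A numpy sketch of the log-domain loss; the value of ε and the placement of λ on the luminance term are assumptions consistent with the description above:

```python
import numpy as np

EPS = 1e-6      # small constant to avoid log(0); exact value is an assumption
LAMBDA = 0.3    # loss weight reported in the text

def luminance(rgb):
    # Rec. 709 luma coefficients, Eq. (20)
    return 0.2126 * rgb[..., 0] + 0.7152 * rgb[..., 1] + 0.0722 * rgb[..., 2]

def hdr_loss(pred, gt):
    """Log-domain l1 loss: RGB term (18) plus weighted luminance term (19)."""
    l_rgb = np.abs(np.log(pred + EPS) - np.log(gt + EPS)).mean()
    l_lum = np.abs(np.log(luminance(pred) + EPS)
                   - np.log(luminance(gt) + EPS)).mean()
    return l_rgb + LAMBDA * l_lum
```

Working in the log domain means a fixed ratio error (e.g., predicting half the true value) is penalized equally everywhere in the dynamic range, matching the Weber-Fechner motivation.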

IV. EXPERIMENTS
A. DATASET
Datasets are an important factor in determining deep learning network performance. Moreover, objectivity is maintained by learning in environments with similar objects. Therefore, we used the most common HDR image reconstruction datasets: Funt [35], [36], Stanford [37], Ward [38], [39], and Fairchild [40]. Since these datasets include only HDR images rather than LDR-HDR pairs, we employed tone mapping [41] - [43] to synthesize corresponding LDR images, obtaining 4,935 total LDR images by randomly cropping 512×512 regions five times for each tone mapping. We also used the HDR-Real dataset organized by Liu et al. [22].

B. TRAINING
Training and testing were implemented using Python v.3.6 on an Intel i7-7820X CPU with an Nvidia GTX 1080 Ti GPU. We trained for 200 epochs on the 8,118 images, and the results were checked every 20 epochs to prevent overfitting. The initial learning rate was 0.00007, and we used the Adam optimizer [45]. Weights were initialized using Xavier initialization [46], the mini-batch size was 8, and 1,014 iterations were run for each epoch.

V. RESULTS
This section compares the results from the proposed and existing methods using the peak signal-to-noise ratio (PSNR) [47], structural similarity index measure (SSIM) [48], and multiscale SSIM (MS-SSIM) [49] metrics, which are the most common evaluation indicators. PSNR is calculated from the MSE, the Euclidean distance between pixels, with a higher value implying better similarity to the original. SSIM measures the similarity to the original image using the pixel variance and mean of the two images. MS-SSIM adds scale-space concepts to SSIM. Reconstructed images are structurally similar to the original images when SSIM and MS-SSIM are close to 1. PSNR, SSIM, and MS-SSIM express mathematical pixel differences between the original and the comparison image. However, HDR images should be measured based on human visual perception as well as mathematical differences. Therefore, in this paper, we also use the quality Q score of the HDR visual difference predictor (HDR-VDP) [50] as an evaluation metric, which is based on a human visual model covering all luminance environments and offers the highest level of objectivity for assessing HDR image quality. The log exposure range, log2(L_high) − log2(L_low), where L_high and L_low are the highest and lowest luminance, respectively, is used to evaluate the brightness range, the most critical element of an HDR image. The dynamic range of the HDR image is evaluated by comparing the log exposure range with reference methods, including traditional and previous deep learning methods. The traditional ITM methods include Akyuz et al. [5], Huo et al. [8], and Masia et al. [9], referred to as AEO, HUO, and MEO, respectively, and the deep learning methods include Endo et al. [12], Eilertsen et al. [19], Cai et al. [20], and Marnerides et al. [21], referred to as DrTMO, HDRCNN, SICE, and EXP, respectively.

A. QUANTITATIVE COMPARISON
Table 1 compares PSNR, SSIM, MS-SSIM, and HDR-VDP for the various methods. The proposed method achieves the highest score on all four evaluation metrics. In particular, structural detail is increased, returning higher SSIM and MS-SSIM scores than the other methods due to the increased contrast over a wider brightness range. The VDP-Q score is also the highest for the proposed method, i.e., the best HDR image is reconstructed with respect to human cognitive aspects. The deep learning methods [12], [19] - [21] use ground truth images mapped to the range [0, 1]. In contrast, the traditional ITM methods [5], [8], [9] multiply the LDR image by specific functions without a specified range to create the HDR image. Consequently, when mapping the HDR images of [5], [8], [9] to the range [0, 1] for objective comparison, the minimum and maximum pixel values are set to 0 and 1. The size of the log exposure range therefore becomes infinite, making an objective comparison with the deep learning methods impossible. Instead, Fig. 13 compares the AEO, HUO, and MEO luminance ranges with the proposed method by adjusting the exposure values. Table 2 compares the log exposure range between the proposed method and the existing deep learning methods [12], [19] - [21], where the entire test dataset (1,710 images) includes 307 night or dark-room images with low brightness and 322 daytime or bright-room images with high brightness. The mean log exposure range for the ground truth images is 16.15 stops, whereas HDR images generated by the proposed method achieve a mean log exposure range of 15.07 stops. The mean log exposure range of the proposed method is the closest to the ground truth, followed in decreasing order by HDRCNN, SICE, EXP, and DrTMO. The proposed method also achieves the best results for test images with high brightness, with a ground truth error of 2.4% and a mean log exposure range of 12.13 stops, which means that the saturated pixels are effectively restored even in images with many over-exposed regions. In addition, for test images with low brightness, the proposed method shows a mean log exposure range of 15.28 stops, which is 38.49% higher on average than those of the other deep learning methods [12], [19] - [21].
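The log exposure range in stops can be computed directly from a luminance map; the log2 ratio of the brightest to darkest luminance below is an assumed formalization of the metric described above:

```python
import numpy as np

def log_exposure_range(luminance_map, eps=1e-8):
    """Dynamic range in stops: log2 of the highest over the lowest luminance.
    Fully black pixels are excluded so the range stays finite, which mirrors
    the problem noted above for methods whose outputs reach exactly 0."""
    l = luminance_map[luminance_map > 0]
    return float(np.log2(l.max() + eps) - np.log2(l.min() + eps))
```

For example, a scene whose darkest nonzero luminance is 1 and brightest is 65,536 spans 16 stops, matching the order of magnitude of the ground truth figure (16.15 stops) reported above.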

B. QUALITATIVE COMPARISON
In this paper, we apply Reinhard [43] tone mapping to convert the HDR images into 8-bit LDR images for comparison. Qualitative image comparisons are evaluated for less saturated and highly saturated LDR image inputs. The dynamic brightness range is compared by increasing the HDR image exposure.

1) LESS SATURATED LDR IMAGES
To evaluate images in various environments, photographs taken in bright and dark places were compared. Fig. 9 compares the LDR image of a park in the daytime with the reconstructed HDR images. The sun shape, nearby tree details (red box), and bush color information (blue box) were damaged due to the bright light sources. The HDR image generated by the proposed method (j) was closest to the ground truth (b). In particular, it captured the sun shape (red box), lost to high brightness, most naturally and had the color most similar to the ground truth (blue box). The traditional ITM methods (c)-(e) generated HDR images that were similar to the input LDR image (a), but the deep learning methods (f)-(j) reconstructed higher-quality HDR images, with SICE (h) generating the second-best HDR image after the proposed method. However, SICE (h) did not accurately reconstruct the sun shape (red box), HDRCNN (g) and EXP (i) had lower overall contrast than the proposed method, and DrTMO (f) generated an overall blurred HDR image. Fig. 10 compares HDR images inferred from the LDR image of the park at night. Information regarding the streetlights (red box) and the grass illuminated by them (blue box) was lost, since the over-exposed ranges have a higher brightness than the under-exposed ranges. Similar to the daytime park images, the traditional ITM methods (c)-(e) generated HDR images that did not adequately restore the saturated pixel values for the lights (red box) and grass (blue box) and differed little from the input LDR image (a). DrTMO (f) generated an unnatural HDR image compared to the ground truth (b) in the process of merging the bracketed images. HDRCNN (g), SICE (h), and EXP (i) detected saturated pixels for the streetlights (red box) and grass (blue box) better than the other methods, but they exhibited significant sharpness loss compared to the ground truth (b).
In contrast, the proposed method (j) restored the saturated streetlight information (red box) closest to the ground truth (b) and also realized a similar grass color and tone (blue box). Thus, the proposed method restored the clearest HDR image.
2) HIGHLY SATURATED LDR IMAGES
Fig. 11 compares the outcomes for an LDR image that was severely clipped and therefore lost most of its information before being restored. The original LDR image (a) was taken inside a dark building, with considerable information loss in the relatively bright area outside the building (red box) compared to the interior.
As shown in Fig. 11, the traditional ITM methods (c)-(e), which do not use deep learning, did not properly restore the information outside the building (red box). In contrast, the deep learning methods effectively inferred the saturated pixels from the limited information, and in particular, the proposed method (j) restored the most information outside the building (red box) corrupted by over-exposure. DrTMO (f) synthesizes the HDR image from a stack of variously exposed images, effectively realizing the damaged region in the bright range and hence restoring more shapes than the other comparison methods. However, DrTMO (f) was quite blurred, with the overall color tone and contrast differing greatly from the ground truth (b). EXP (i) could not restore as many saturated pixels (red box) as DrTMO (f) but reconstructed a good HDR image overall, since it efficiently used local and global information from the input LDR image. SICE (h) extracted features from the low- and high-frequency information of the image and generated an HDR image with color and contrast close to the ground truth (b), but it could not restore the detailed information of the over-exposed pixels.

Fig. 12 shows an example in which the under-exposed information was damaged, except for very bright pixels, because the exposure time was very short. Most of the LDR image information was damaged, aside from the tree bulbs, making it difficult to visibly identify shapes in the image (red box). Generating an HDR image from such a damaged LDR image is a challenging task, and none of the comparison methods could generate images close to the ground truth (b). HUO (d) could not restore most of the information, and although MEO (e), DrTMO (f), SICE (h), and EXP (i) restored the statue (red box) to some extent, the restored images contain substantial noise and are severely blurred.
The images produced by AEO (c) and HDRCNN (g) were clearer than those of MEO (e), DrTMO (f), SICE (h), and EXP (i), but their colors deviated considerably from the ground truth (b). In contrast, the image reconstructed by the proposed method (j) achieved higher clarity and a color tone similar to the ground truth.

3) COMPARISON FOR DYNAMIC RANGE
The proposed method generates images with a wider dynamic range than the existing approaches (see Table 2). For a qualitative analysis of the dynamic range, we observed the HDR images while increasing the exposure in increments of 4 exposure value (EV) units. Instead of applying the Reinhard [43] tone mapping method for this comparison, we multiplied the HDR image by 255 and converted the values into integers. Fig. 13 shows that the proposed method retains the most information at all exposure values. All considered methods showed similar information when the exposure was increased by 4 EV. However, DrTMO (d) lost the most pixel value information when the EV was increased by 8, and AEO (a), MEO (c), HDRCNN (e), SICE (f), and EXP (g) lost almost all pixel value information when the EV was increased by 12, leaving only very blurry shapes. In contrast, HUO (b) and the proposed method (h) retained pixel information in most ranges, except in over-exposed regions, and the proposed method retained more detailed information than HUO (b). Both lost most of the pixel information when the EV was increased by 16, but the proposed method retained the most information relative to the original image. Thus, the HDR image generated by the proposed method had the widest luminance range and retained more information, even in high-exposure regions. In particular, the brightness range was not only extended but also covered both very low and very high brightness values.
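The exposure sweep described above can be reproduced with a simple scaling: increasing the exposure by n EV multiplies the linear radiance by 2^n before 8-bit quantization. The following is a sketch under the assumption that the HDR image is stored as linear floating-point values; the function name is illustrative:

```python
import numpy as np

def expose(hdr, ev):
    """Simulate an exposure increase of `ev` stops, then quantize to 8 bits."""
    scaled = hdr * (2.0 ** ev)                  # +1 EV doubles the radiance
    return np.clip(scaled * 255.0, 0, 255).astype(np.uint8)

hdr = np.linspace(0.0, 1.0, 16, dtype=np.float32).reshape(4, 4)
frames = {ev: expose(hdr, ev) for ev in (0, 4, 8, 12, 16)}
# Pixels that clip to 255 at high EV are the ones whose information is lost;
# a wider reconstructed dynamic range keeps more pixels below saturation.
```

A method that preserves detail at +12 or +16 EV, as reported for the proposed network, must therefore reconstruct radiance values spanning several orders of magnitude.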

C. COMPARISON WITH AND WITHOUT IMAGE SEGMENTATION
The main idea of this paper is to segment the image into bright and dark regions before training so that the network can focus on restoring over-exposed and under-exposed pixels. To examine whether the saturated pixels could be effectively restored, we compared the results of the methods with and without segmentation. In the method without segmentation, the whole LDR image is fed into the proposed network (Fig. 7) instead of the segmented bright and dark images. Table 3 compares the quantitative values of the methods with and without segmentation, showing that the saturated pixels were effectively restored with segmentation: it achieved higher scores on all metrics, most notably a PSNR 2.65 dB higher than that of the method without segmentation. Nevertheless, the method without segmentation still had an average log exposure range 33% higher than the other deep learning methods [12], [19]-[21] (see Table 2); even when only the global image was used, the proposed network could restore the saturated pixels relatively precisely. However, by dividing the image into two regions, the proposed network could focus more on the saturated pixels of each region. Therefore, segmenting the image is much more effective for restoring the saturated pixels. Fig. 14 shows HDR images reconstructed by the methods with and without segmentation. In both (a) and (b), when the segmented images are used as input, the color and shape information of the saturated pixels is restored in more detail, and the contrast of the image increases as the dynamic range improves.
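The segmentation contrasted here can be illustrated by a simple luminance threshold that splits the LDR image into bright and dark inputs. This is a hypothetical sketch; the paper's actual segmentation rule, threshold, and symbol names are not specified in this section:

```python
import numpy as np

def split_bright_dark(ldr, thresh=0.5):
    """Split an LDR image (float values in [0, 1]) into bright and dark regions.

    Masked-out pixels are zeroed so that each branch of a multi-stage network
    can focus on restoring the saturated pixels of its own region.
    """
    lum = ldr.mean(axis=-1, keepdims=True)      # crude per-pixel brightness
    bright_mask = lum >= thresh
    bright = np.where(bright_mask, ldr, 0.0)    # input to the brightening branch
    dark = np.where(bright_mask, 0.0, ldr)      # input to the darkening branch
    return bright, dark

ldr = np.random.rand(8, 8, 3).astype(np.float32)
bright, dark = split_bright_dark(ldr)
# Every pixel belongs to exactly one region, so the two parts sum to the input.
```

Because the two regions partition the image, the branch outputs can later be recombined without overlap, which is what makes per-region training feasible.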

D. COMPARISON WITH AND WITHOUT GLOBAL FEATURE CONCATENATION
To prevent artifacts from appearing, the global feature used in [21] is concatenated with the features in Eqs. (10), (11), and (12). We compared the results with and without the global feature in the proposed network, which uses the segmented images as input, to verify whether the global feature reduces the number of artifacts. Table 4 compares the quantitative evaluation with and without the global feature in the proposed network: when the global feature was used, the PSNR was 4.1% higher than without it. In other words, using the global feature could improve the overall HDR image quality, since the number of artifacts on the segmentation boundary could be reduced. Fig. 15 shows the HDR images reconstructed by the proposed network with and without the global feature. As shown in Fig. 15, when the global feature was not used, artifacts appeared at the division boundary near the reflected light area of the ceiling. However, when the global feature was used, the boundary of the reflected light area of the ceiling showed a natural shape without any artifacts.
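Concatenating a global feature with spatial feature maps is commonly done by broadcasting the global vector over the spatial grid and stacking along the channel axis. The following is an illustrative NumPy sketch; the actual layer shapes and feature names of the proposed network are not given in this section:

```python
import numpy as np

def concat_global(feat, global_vec):
    """Tile a global feature vector of shape (C_g,) across H x W and
    concatenate it with a local feature map of shape (H, W, C_l)."""
    h, w, _ = feat.shape
    tiled = np.broadcast_to(global_vec, (h, w, global_vec.shape[0]))
    return np.concatenate([feat, tiled], axis=-1)  # shape (H, W, C_l + C_g)

local = np.random.rand(16, 16, 32).astype(np.float32)  # toy local features
g = np.random.rand(64).astype(np.float32)              # toy global feature
fused = concat_global(local, g)
```

Since every spatial position receives the same global context, pixels on either side of a segmentation boundary share information, which is one plausible reason the concatenation suppresses boundary artifacts.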

VII. CONCLUSIONS
Recent developments in deep learning technology have substantially advanced ITM approaches, with many studies employing deep learning to reconstruct saturated pixels and regenerate high-quality HDR images [12]-[23]. However, the LDR image should be separated and learned according to brightness values to accurately reflect reality. HDR images reconstructed using the proposed method achieved average PSNR, SSIM, MS-SSIM, and HDR-VDP2 scores that were 30.4%, 42.8%, 42.9%, and 8.52% higher, respectively, than those of the previous methods. In particular, the proposed method provided a mean log exposure range 41.2% higher than the previous methods, confirming that its dynamic range was the most similar to the real world. Furthermore, the qualitative evaluations showed that the proposed method generated HDR images close to the original images, even when the input LDR image had lost considerable pixel information. Generating an image close to the real world using the limited information of a clipped LDR image remains difficult. The proposed method could not reconstruct irradiance perfectly, and LDR images with severely damaged exposure values still cannot generate HDR images close to the ground truth. We believe this issue can be resolved by classifying images into more classes for training: in addition to the bright and dark ranges, the LDR image could be further classified by brightness, helping the clipped ranges to be effectively reconstructed.
