Deep Perceptual Image Enhancement Network for Exposure Restoration

Image restoration techniques process degraded images to highlight obscure details or enhance the scene with good contrast and vivid color for the best possible visibility. Poor illumination conditions cause issues such as high-level noise, unlikely color or texture distortions, nonuniform exposure, halo artifacts, and lack of sharpness in the images. This article presents a novel end-to-end trainable deep convolutional neural network called the deep perceptual image enhancement network (DPIENet) to address these challenges. The novel contributions of the proposed work are: 1) a framework that synthesizes multiple exposures from a single image and utilizes the exposure variation to restore the image and 2) a loss function based on the approximation of the logarithmic response of the human eye. Extensive computer simulations on the benchmark MIT-Adobe FiveK and user studies performed using Google high dynamic range, DIV2K, and low light image datasets show that DPIENet has clear advantages over state-of-the-art techniques. It has the potential to be useful for many everyday applications, such as modernizing traditional camera technologies that currently capture images/videos with under/overexposed regions due to their sensors' limitations, helping users of consumer photography capture appealing images, or supporting a variety of intelligent systems, including automated driving and video surveillance applications.

applications, such as autonomous driving, security surveillance systems, search and rescue operations, and virtual and augmented reality environments. The quality of images becomes extremely important for these applications, and the systems' performance might be affected negatively by low-quality inputs.
Acquiring a high or optimum quality image is ideal but sometimes impractical. Specifically, smartphone cameras have considerably small apertures, limiting the amount of light captured, leading to noisy images in a low-lit environment [5]. The imaging sensor's linear characteristic fails to replicate the complex and nonlinear mapping achieved by human vision. Another issue that commonly restricts the performance of computer vision algorithms is nonuniform illumination. When the lighting source is not perfectly aligned and normal to the viewing surface, or if the surface is not planar, then the resulting image may have nonuniform illumination artifacts [6]. Another critical requirement for efficient image processing is global uniformity [6]. Similar objects or structures should appear the same within an image or in a series of images. This implies that the color content and the illumination must be stable for images acquired under varying conditions. Illuminations that cast strong shadows also cause problems. The edges and boundaries in an image need to be well defined and accurately located, implying that the image's high-frequency content needs to be preserved to have high local sensitivity. Vignetting is another common pitfall in many photos [7]. While it might be a desirable effect in some cases such as portrait mode photography, it is not ideal for various other use cases that require high accuracy and details. Furthermore, the compression algorithms used to store the images may cause some artifacts [8]. These factors affect the pleasantness of viewing the image and affect the usability of the images for computer vision algorithms and their ability to analyze them.
Traditionally, automatic image quality enhancement methods can be broadly classified into global enhancements and local enhancements. Global enhancement algorithms perform the same operation on every single image pixel, such as linear contrast amplification. Such a simple technique leads to saturated pixels in high exposure regions. To avoid this effect, nonlinear monotonic functions, such as μ-law, power-law, logarithmic processing [9], [10], gamma functions, and piecewise-linear transformation functions, are used to perform enhancements [11].
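The saturation problem of linear amplification, and the way a monotonic logarithmic mapping avoids it, can be illustrated with a minimal sketch. The function names, the gain of 2, and the value μ = 8 are illustrative assumptions and are not taken from the article.

```python
import math

def linear_gain(pixels, gain):
    """Global linear contrast amplification, clipped to the valid range [0, 1]."""
    return [min(1.0, gain * p) for p in pixels]

def log_map(pixels, mu):
    """Global mu-law style logarithmic mapping on [0, 1].

    The mapping is monotonic, sends 0 to 0 and 1 to 1, and therefore
    brightens dark pixels without pushing bright pixels past the limit.
    """
    return [math.log1p(mu * p) / math.log1p(mu) for p in pixels]

# Hypothetical normalized intensities, from dark to fully bright.
pixels = [0.05, 0.2, 0.5, 0.8, 1.0]

amplified = linear_gain(pixels, gain=2.0)   # the three brightest pixels all clip to 1.0
compressed = log_map(pixels, mu=8.0)        # all five pixels stay distinct
```

In this toy case the linear gain collapses three different input intensities to the same saturated value, whereas the logarithmic mapping keeps the ordering and the distinction between all pixels.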
One extensively used method to avoid saturation while improving the contrast is histogram equalization (HE) [12].

Fig. 1. Demonstration of the proposed DPIENet for a given ill-exposed input. This system sets a new SOTA benchmark in terms of measures such as PSNR, SSIM [2], GSSIM [3], and UQI [4].
Another local image enhancement technique is based on the Retinex theory [13], which assumes that the amount of light reaching the observer can be decomposed into two parts: 1) scene reflectance and 2) illumination components. These algorithms achieve better results than global methods by making use of the local spatial information directly and have become the forerunners for image enhancement. While methods based on Retinex such as MSR-CR [14] can effectively improve the sharpness of the image and increase the local contrast, they introduce the halation phenomenon at high contrast and amplified noise regions [15].
More recently, deep learning-based image enhancement methods have been used to mitigate these problems [16], [17]. These techniques allow for automatic parameter selection and training and have highly scalable architectures. They have been shown to outperform state-of-the-art (SOTA) methods in computer vision tasks, such as object detection, object recognition, segmentation, super-resolution, and enhancement. However, most of the deep learning networks are trained explicitly for either standard exposure images or low exposure images. Thus, they fail to achieve global uniformity for varying exposure inputs of the same scene.
This article proposes a deep learning-based perceptual image enhancement network (DPIENet) to address these issues. This network has a U-shaped structure similar to the U-Net architecture [18]. It consists of two stages: 1) a feature condense network (FeCN) that aims to acquire compact feature representation of the spatial context of the image and 2) a feature enhance network (FeEN) that performs nonlinear upsampling of the input feature maps to reconstruct an enhanced image. The architecture is equipped with skip connections between these two networks to use high-resolution image details during the reconstruction. An example of the result obtained using the network is illustrated in Fig. 1.
Some of the notable contributions of DPIENet include the following.
1) A unified network that ensures global uniformity by generating perceptually similar enhanced images for input images of both standard and low exposure settings. It utilizes dilated convolutions to preserve spatial resolution in convolutional networks and improve spatial image understanding. Furthermore, it incorporates a channel attention mechanism that adaptively rescales channelwise features by extracting the channel statistics to enhance the network's discriminative ability.
2) A classical log-based synthetic multiexposure image generation technique, the logarithmic exposure transformation (LXT), combined with trainable parameters to improve the performance of the network.
3) A novel loss function, the "multiscale human color vision (MHCV) loss." This loss aims at improving the quality of the reconstruction by considering human perception. It encourages the model to learn complicated mappings and effectively reduces undesired artifacts, such as noise, unrealistic color or texture distortions, and halo effects.
The remainder of this article is organized as follows. In Section II, related recent literature is reviewed. A detailed description of the DPIENet architecture and its analysis is provided in Section III. In Section IV, a brief description of the proposed MHCV loss is provided. Section V presents the training details and an ablation study with quantitative and visual experimental results. Section VI discusses the user study performed to measure human perceptual preferences. This section is followed by the computation complexity, application, and conclusion in Sections VII-IX, respectively.

II. RELATED WORK
Various methods have been adopted in the literature for enhancing the quality of images. Some of the early techniques include gray level slicing, contrast expansion, linear and nonlinear contrast stretching, and various forms of histogram processing [19]. Many extensions to HE-based methods, such as adaptive HE (AHE) [20], contrast-limited AHE [21], and dynamic HE [22], impose additional constraints while redistributing the luminous intensity of the histogram. However, such global enhancement methods may suffer from loss of details in some local areas because of the inherent nonuniformity present in the image.
Most Retinex-based methods, including MSR-CR [14], SSR [23], and HECUP [24], recover the reflectance and illumination components and typically employ varying amounts of the illumination component for enhancing images while preserving naturalness. There exist multiple variations and extensions of the Retinex-based approach, such as AMSR [25], which uses an adaptive weighting strategy; LIME [26], which only estimates the illumination component for low light image enhancement; and NPE [27], which balances the enhancement by utilizing the bio-inspired multi-image fusion framework for image enhancement. Other fusion-based frameworks [28], [29] have also been proposed.
Recently, deep learning-based methods have introduced powerful tools, such as end-to-end trainable networks, generative adversarial networks (GANs) [30], and deep autoencoders [31], to perform image enhancement tasks. In [32], an end-to-end deep learning-based method for photo adjustment was proposed. Ignatov et al. [33] created a dataset of images captured by smartphone cameras and a DSLR camera and used a GAN model to learn the mapping between the two sets of images. In [34] and [35], deep learning was used to approximate existing filters using fully convolutional networks (FCNs). While the methods mentioned above are all supervised, meaning they need paired images to learn the mapping, an unpaired deep learning model for image enhancement was proposed in [36]. This model uses an [...] [31], utilize autoencoders to extract features from low-light images. They adaptively adjust the image brightness without overamplification or saturation artifacts, thus achieving both image enhancement and denoising.
Furthermore, a few inverse tone mapping techniques utilize deep learning to improve the image's perceptual quality. Eilertsen et al. [47] used the U-Net structure operating in the logarithmic domain to generate a high dynamic range (HDR) output. Endo et al. [48] utilized UNet-based autoencoders to synthesize a set of low dynamic range (LDR) images with varying exposures to mimic exposure bracketing. These LDR images are then fused using a classical method to generate the HDR output.

Table I [...] where I is an input image of any arbitrary size (m, n). This network addresses the image-to-image translation problem, transforming an input image with color rendition, ill exposure, and unrealistic color issues into an enhanced output image with the desired characteristics. Accordingly, DPIENet comprises three main components: 1) logarithmic-based exposure transformation; 2) joint local and multiblock global feature extraction; and 3) dynamic channel attention (DCA) blocks. These components are tightly coupled and trained in an end-to-end fashion. For training, a novel loss is designed to obtain f(I). This loss aims at enhancing the desired characteristics by using reflectance and illumination components. Additional details of these components are provided in the following sections.

A. Logarithmic Exposure Transformation
To represent the wide range of luminance present in a natural scene, from bright and direct sunlight to dark and faint shadows, the exposure range of the image needs to be adjusted. An ideal enhanced image would preserve high-quality details in the shadows while retaining a good contrast in the bright regions. On the contrary, an image with nonuniform scene luminance will have a tradeoff between the bright and dark regions due to the limited exposure, which results in the loss of data in those regions. Various SOTA systems, such as HDR imaging, have been developed to combine multiple exposures into an image with a greater dynamic range of light. The main constraint of such systems is the requirement of multiple images across time with varying exposures. Inspired by the multiexposure mechanism of HDR imaging systems, a synthetic simulation of changes in exposure to generate a perceptually enhanced image from a single image is explored. Specifically, the synthetic set needs to contain both underexposed and overexposed images: in the underexposed images, the bright regions are well defined with proper contrast, while in the overexposed images, the finer details in the dark and shadow areas are highlighted. Consider an input image Î of any arbitrary size (m, n); the LXT of that image is generated by employing (1). This transform is derived using companding functions, such as μ-law and the power law, and it produces underexposed (U) and overexposed (O) images. In (1), α is a learnable parameter, and the value of $\gamma_x$ is empirically set to 1.75 and 0.75 for the underexposed and overexposed images, respectively, based on the tradeoff between the expansion of underexposed regions and the amount of detail in the overexposed areas.

Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
[Figure caption fragment] [...] [49]; and (c) visualizes the residual network with a DCA mechanism to emphasize more on significant features.
To simulate the overexposed image $I_O$, LXT maps the low-intensity values to a broader range of values while compressing the range of higher intensity values. Conversely, to obtain the underexposed image $I_U$, the LXT function expands the higher intensity regions and compresses the range of lower intensities. Fig. 3 shows the result of the operation for various values of α. Fig. 3(b) visualizes an overexposed image with $\alpha_O = 2$ and $\gamma_O = 0.75$, and Fig. 3(c) demonstrates an underexposed image with $\alpha_U = 0.5$ and $\gamma_U = 1.75$. As seen in Fig. 3(b), the details of the image in darker regions are much clearer, while in Fig. 3(c), the details in highlights are more pronounced. Fig. 4 shows the result of the companding operation for various values of α and γ. As seen in the figure, increasing α decreases the limit of higher intensity values and vice versa. Similarly, increasing γ decreases the expansion of lower intensity values.
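The expansion and compression behavior described above can be illustrated with a plain power-law curve. This is only a stand-in sketch: it is not the LXT of (1), it omits the learnable parameter α entirely, and it borrows just the two exponents 0.75 and 1.75 quoted in the text. The function name and the sample intensities are hypothetical.

```python
def power_law(pixels, gamma):
    """Apply a plain power-law (gamma) curve to intensities in [0, 1]."""
    return [p ** gamma for p in pixels]

# Hypothetical normalized intensities spanning shadows to highlights.
pixels = [0.1, 0.3, 0.5, 0.7, 0.9]

# gamma < 1 brightens: dark values are spread out, bright values are compressed.
over = power_law(pixels, 0.75)

# gamma > 1 darkens: bright values are spread out, dark values are compressed.
under = power_law(pixels, 1.75)

# Spread of the two darkest and the two brightest samples before and after mapping.
dark_gap_in = pixels[1] - pixels[0]
dark_gap_over = over[1] - over[0]
bright_gap_in = pixels[4] - pixels[3]
bright_gap_under = under[4] - under[3]
```

With these values, the gap between the two darkest samples grows in the synthetic overexposed version, and the gap between the two brightest samples grows in the synthetic underexposed version, which mirrors the qualitative behavior attributed to the overexposed and underexposed LXT outputs.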

B. Joint Fusion of Multiblock Global and Local Features
A novel approach to extract and fuse global and local features is provided in this section. Local features define a portion of information about the image in a specific region or single point [41]. In distinction, global features describe the entire image by considering all pixels in the image [42]. The global features provide information regarding the context of the entire image that can be integrated with local features to obtain visually pleasing results with lower artifacts [50]. For image enhancement, the global features could determine the type of scene, subjects in the scene, and lighting conditions to aid local adjustments in the image. In contrast, local features represent the local texture or object at a given location.
The extraction technique is inspired by the UNet architecture that is developed specifically for biomedical image segmentation [18] and ColorNet architecture that was utilized to colorize grayscale images automatically [51]. Both these architectures encompass an end-to-end encoder-decoder network. The UNet architecture focuses mainly on local features, thereby degrading the performance of image enhancement tasks that highly rely on global features [36]. On the contrary, ColorNet utilizes both local and global features; however, the network requires explicit scene labels for training purposes and requires an extra supervised network to compute global features. Both these networks utilize FCN to perform their respective tasks. Even though these networks perform reasonably well, the model efficiency and performance can be enhanced by incorporating a residual layer instead of the FCN block.
The proposed DPIENet comprises a novel FeCN and a novel FeEN. FeCN aims at producing local and global features. The local features are obtained through a series of layers, while the global features are extracted from every layer of the condense network rather than just the final layer. FeEN aims at reconstructing the enhanced image by exploiting skip connections from FeCN. A flow diagram of DPIENet with FeCN and FeEN is shown in Fig. 2.
1) Feature Condense Network: The condense network comprises feature groups denoted as $C^{g}_{l}$, where the group index g = 1, 2, ..., 8 and l indicates the index of the residual layer in that particular group, ranging over 1, 2, ..., n. For simplicity, the first feature extraction section is denoted by $C^{0}$; it consists of a convolutional (CONV) layer followed by batch normalization (BN) [52] and a SELU activation layer [53]. This layer extracts features from the image domain. The CONV layer employs a 3 × 3 kernel and produces 16 feature maps.
The basic structure of the residual layer used in groups $C^{1}$ to $C^{8}$ of the FeCN can be seen in Fig. 2(b). It is formulated in terms of the input feature map of the lth residual layer, the associated sets of weights $\omega_l$ and biases $b_l$, the combination of layers CONV→BN→SELU→CONV→BN, the SELU activation function S, and the identity map I. In groups $C^{2}$ to $C^{7}$, the first layer performs downsampling by striding instead of max pooling, since max pool layers lead to high-amplitude, high-frequency activations in the subsequent layers, which might increase gridding artifacts [54]. For image enhancement techniques, downsampling may cause loss of spatial information; however, it is required to understand the scenes and reconstruct the image with finer details. Eliminating downsampling increases the resolution of the feature maps; however, it reduces the receptive field in subsequent layers, thereby increasing context loss. To overcome this, dilated convolution is employed to adjust the receptive fields of feature points without decreasing the resolution of the feature maps [55]. It is used in all the layers in groups $C^{5}$ to $C^{7}$ instead of traditional convolution, as suggested by Yu et al. [54].
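The tradeoff between resolution and receptive field that motivates dilated convolution can be made concrete with a one-dimensional toy example. The following is a minimal sketch that assumes an odd kernel length and zero padding; the helper names are hypothetical, and nothing here reflects the actual layer configuration of DPIENet.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Same-size 1-D correlation with zero padding and a dilation rate.

    Assumes an odd kernel length so that the kernel has a well-defined center.
    """
    center = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Taps are spaced `dilation` samples apart around position i.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

# A unit impulse shows which output positions "see" the center sample.
impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
plain = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=1)
dilated = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=2)
```

Both outputs keep the input length of nine samples, but the response of the dilated version spreads over five positions (an effective kernel size of 5) instead of three, using the same three weights.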
Furthermore, to increase the representative power of the global features in the network, the output of the last layer (κ) of each condense group from $C^{0}$ to $C^{8}$ is connected to a global average pooling (GAP) layer. The GAP layer compresses the information of the residual layers, making it more robust to spatial translation. The outputs from each layer are concatenated, as shown in

$$y_{\text{fuse}} = C^{0}_{n};\; C^{1}_{n};\; C^{2}_{n};\; \cdots;\; C^{8}_{n}.$$

These features generate a total of $\left[\sum_{i=0}^{8} \varsigma\left(C^{i}_{\kappa}\right) \times 1 \times 1\right]$ features, where ς is the number of channels/feature maps. The stacked feature maps are then fed into a dense layer $D_0$, which pro- [...] Table II). The joint fusion comprises stacking the global features from $D_1$ and the local features from $C^{5}_{\kappa}$. This aids in incorporating global features into local features. Owing to this form of concatenation, the network is independent of any input image resolution restrictions.
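The resolution independence of the fused global descriptor follows from the pooling step: each channel collapses to one number regardless of its spatial size. A minimal sketch, with hypothetical helper names and made-up channel counts (two groups with 2 and 3 channels rather than the nine groups of the network), is given below.

```python
def global_average_pool(feature_maps):
    """Reduce a list of channels (each an H x W nested list) to one mean per channel."""
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = sum(len(row) for row in channel)
        pooled.append(total / count)
    return pooled

def fuse_global_features(groups):
    """Concatenate the pooled outputs of several groups into one vector."""
    fused = []
    for feature_maps in groups:
        fused.extend(global_average_pool(feature_maps))
    return fused

def constant_maps(channels, height, width, value):
    """Build toy feature maps filled with a constant (illustration only)."""
    return [[[value] * width for _ in range(height)] for _ in range(channels)]

# The same two hypothetical groups at two different spatial resolutions.
small = [constant_maps(2, 4, 4, 1.0), constant_maps(3, 2, 2, 2.0)]
large = [constant_maps(2, 8, 8, 1.0), constant_maps(3, 4, 4, 2.0)]

fused_small = fuse_global_features(small)
fused_large = fuse_global_features(large)
```

The fused vector has 2 + 3 = 5 entries in both cases, which is the sum of the channel counts and does not depend on the height or width of the feature maps.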
2) Feature Enhance Network: Once the local and global feature maps are concatenated, they are fed to the enhance network. The enhance network comprises feature groups denoted as $E^{g}_{l}$, where the group index g = 0, 1, ..., 4 and l indicates the index of the residual layer in that particular group, ranging over 1, 2, ..., n. The feature layers of the condense and enhance networks are symmetric across the fusion block, as shown in Fig. 2(a). For example, if the condense group $C^{2}$ contains two residual layers, then $E^{2}$ also consists of two residual layers.
In the case of the condense layer $C^{0}$, $E^{0}$ consists of just one residual layer. Each enhance group $E^{g}$ mainly consists of upsampling layers, compression layers, and residual layers. The input to each enhance group is the fusion of the feature maps from the previous enhance group and the output of the corresponding condense group. This helps in propagating context information to higher resolution layers. The upsampling layer consists of transposed convolutions with kernel size 2 × 2 and stride 2 × 2. This increases the resolution of the feature maps by a factor of 2. The compression layer consists of CONV→BN→SELU, wherein the kernel size of CONV is 1 × 1. This is used to compress the feature dimensions by a factor of 2. The compressed feature maps are then fed to the residual layers for further processing. Finally, the output of the group $E^{0}$ is connected to a CONV layer with kernel size 3 × 3, and residual learning is adopted by adding the input image to the output of this layer.
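The shape bookkeeping of one enhance group can be checked with simple arithmetic. The sketch below assumes the standard output-size relation for a transposed convolution without padding, output padding, or dilation; the function names and the example tensor shape (128 channels, 16 × 24) are hypothetical and not taken from the article.

```python
def transposed_conv_size(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one axis.

    Assumes no output padding and a dilation of 1.
    """
    return (size - 1) * stride - 2 * padding + kernel

def enhance_group_shape(channels, height, width):
    """Shape after 2x2/stride-2 upsampling followed by a 1x1 compression by 2."""
    up_height = transposed_conv_size(height)
    up_width = transposed_conv_size(width)
    # The 1x1 convolution leaves the spatial size unchanged and halves the channels.
    return channels // 2, up_height, up_width

shape_out = enhance_group_shape(128, 16, 24)
```

With a kernel of 2 and a stride of 2, the relation reduces to exactly twice the input size, so the hypothetical (128, 16, 24) input becomes (64, 32, 48).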

3) Dynamic Channel Attention Mechanism:
Most deep learning-based image enhancement techniques treat all the feature maps equally, which may not be appropriate in many real-world cases. Among the feature maps generated by the residual layers, a few might contribute more than the rest. Moreover, the learned filters in the residual layers have a local receptive field, and each filter output exploits the contextual information outside of its subregion very poorly. Thus, a mechanism is required to recalibrate features such that more emphasis is placed on the feature maps with better mapping than on the less essential feature maps. Researchers have offered tentative work to apply attention in deep neural networks [56]-[58], ranging from localization and understanding in images [59] to sequence-based networks [60]. However, these attention mechanisms are not yet mature for low-level vision tasks such as image enhancement.
This mechanism's main objective is to assign different values to various channels according to their interdependencies in each convolution layer. Thus, to increase each channel's sensitivity, an intuitive way is to access the global spatial information by using average pooling over the entire feature map. The channel attention mechanism can be formulated in terms of the following quantities: the input feature map, which has ς channels/feature maps of dimensions H × W; $W_{\downarrow}$ [$b_{\downarrow}$], the weight [bias] of the compression convolution, which reduces the dimension by a factor of r; $W_{\uparrow}$ [$b_{\uparrow}$], the weight [bias] of the expansion convolution, which increases the dimension by a factor of r; S, the SELU activation function; and σ, the sigmoid activation function. The GAP output can be regarded as the fusion of local descriptors whose statistics express the entire feature map [56].
The channel attention mechanism comprises convolutions with kernel size 1 × 1 along with the sigmoid activation. This aids in learning the nonlinear interaction between the channels and ensures that multiple channels with informative maps are emphasized more [56]. As the number of channels/feature maps ς in the condense and enhance networks keeps varying, the gating mechanism needs to be adjusted to accommodate these changes. The factor r is a hyperparameter that varies the capacity of the gating mechanism. The ratio r was formulated as $r = \varsigma_i / 4$, where $\varsigma_i$ denotes the number of channels/feature maps at the input of the GAP layer.
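The gating pipeline described above (GAP, compression, SELU, expansion, sigmoid, channelwise rescaling) can be sketched in plain Python. This is an illustrative squeeze-and-excitation style sketch, not the article's implementation: the exact composition order is assumed from the description, the weights are arbitrary constants rather than learned values, and all function names are hypothetical. Note that, read literally, r = ς/4 together with a reduction by a factor of r implies a bottleneck of ς/r = 4 channels whenever ς is a multiple of 4.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(vector, weights, biases, activation):
    """A 1x1 convolution on a 1x1 spatial map is a dense layer: weights[out][in]."""
    return [activation(sum(w * v for w, v in zip(row, vector)) + b)
            for row, b in zip(weights, biases)]

def channel_attention(feature_maps, w_down, b_down, w_up, b_up):
    """Rescale each channel by a gate computed from the channel statistics."""
    # Squeeze: global average pooling gives one statistic per channel.
    squeezed = [sum(sum(row) for row in ch) / sum(len(row) for row in ch)
                for ch in feature_maps]
    hidden = dense(squeezed, w_down, b_down, selu)   # compression by a factor of r
    gates = dense(hidden, w_up, b_up, sigmoid)       # expansion by a factor of r
    scaled = [[[g * v for v in row] for row in ch]
              for ch, g in zip(feature_maps, gates)]
    return scaled, gates

channels = 8
r = channels // 4            # ratio as stated in the text (assumes a multiple of 4)
hidden_size = channels // r  # bottleneck width

# Toy 3x3 feature maps; channel c is filled with the constant c + 1.
feature_maps = [[[float(c + 1)] * 3 for _ in range(3)] for c in range(channels)]
# Arbitrary constant weights, for illustration only.
w_down = [[0.01] * channels for _ in range(hidden_size)]
b_down = [0.0] * hidden_size
w_up = [[0.1] * hidden_size for _ in range(channels)]
b_up = [0.0] * channels

scaled, gates = channel_attention(feature_maps, w_down, b_down, w_up, b_up)
```

Each gate lies strictly between 0 and 1 because of the sigmoid, and the output keeps the channel count and spatial size of the input, so the block can be dropped into a residual layer without changing tensor shapes.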

IV. MULTISCALE HUMAN COLOR VISION LOSS
Several loss functions, such as L1, L2, cosine similarity measures [61], and perceptual and adversarial losses [36], have been investigated for various computer vision tasks. These perform reasonably well, but losses based on dense pixelwise image differences lead to poor perceptual quality [33]. In [47], an HDR cost function that treats illumination and reflectance separately was proposed. However, the method utilized only the information around the predicted image's saturated areas to compute the loss. This pixelwise blending-based cost function is ineffective for image enhancement tasks that require global and local adjustments. Thus, in this article, a multiscale loss function that works on the principle of the Retinex theory is proposed. According to this theory, the low-frequency information of the image represents the global naturalness, and the high-frequency information represents the local details. By decomposing the image into a low-frequency luminance component and a high-frequency detail component, the loss function incorporates both the local and global information. This loss is motivated by the nearly logarithmic response of the human visual system (HVS) in areas with a large luminance range, which follows the Weber-Fechner law [62].
The loss is constructed under the assumption that the image can be decomposed into illumination and reflectance components. The illumination component L defines the global deviations in an image, while the reflectance R represents the details and colors. In combination, these components modulate the reconstruction of a perceptually enhanced image $P_e = L \times R$. For simplicity of exposition, consider the case in which the loss function consists of a single scale; the extension to multiple scales is straightforward. Consider a predicted image I and a ground-truth image T of any arbitrary size (m, n). The log-based illumination component is defined through a convolution, denoted ⊗, with a surround filter of scale σ; the image operand is I for the illumination component of the predicted image and T for that of the ground-truth image. The value of σ cannot be theoretically modeled and determined [63]. The choice of the right scale σ for the surround filter is crucial for single-scale Retinex. This difficulty can be overcome by utilizing multiscale Retinex, which seems to afford an acceptable tradeoff between a good local dynamic range and a good color rendition. Thus, empirically, the σ values were set to 0.5, 1, 2, 4, and 8. The log-based reflectance component is constructed by taking the difference between the image and the illumination component, as formulated in (6). The resulting MHCV loss function using these two components is defined in (7). Equal weight is given to the illumination and reflectance components, as both the global variations of illumination and the local colors and details are very important for the successful reconstruction of enhanced images.
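The structure of such a multiscale illumination/reflectance loss can be sketched on a one-dimensional signal. Several choices below are assumptions rather than the article's definitions: the illumination is taken as the logarithm of a Gaussian-blurred signal, the two terms are compared with a mean absolute difference, the scales are averaged, and a small constant guards the logarithm. Only the five σ values and the equal weighting of the two terms come from the text; all names are hypothetical.

```python
import math

SIGMAS = (0.5, 1.0, 2.0, 4.0, 8.0)  # scales listed in the text
EPS = 1e-6                          # assumption: avoids log(0)

def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian surround filter truncated at about 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, sigma):
    """Convolve with the Gaussian kernel, replicating the edge samples."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)
            acc += w * signal[idx]
        out.append(acc)
    return out

def log_components(signal, sigma):
    """Log illumination (smoothed signal) and log reflectance (signal minus illumination)."""
    illum = [math.log(v + EPS) for v in blur(signal, sigma)]
    refl = [math.log(s + EPS) - l for s, l in zip(signal, illum)]
    return illum, refl

def mean_abs_diff(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mhcv_like_loss(pred, target, sigmas=SIGMAS):
    """Equal-weight sum of illumination and reflectance differences, averaged over scales."""
    total = 0.0
    for sigma in sigmas:
        illum_p, refl_p = log_components(pred, sigma)
        illum_t, refl_t = log_components(target, sigma)
        total += mean_abs_diff(illum_p, illum_t) + mean_abs_diff(refl_p, refl_t)
    return total / len(sigmas)

# A hypothetical ground-truth signal and a prediction that is uniformly half as bright.
target = [0.2, 0.4, 0.6, 0.8, 0.6, 0.4, 0.2, 0.3]
darker = [0.5 * v for v in target]

loss_same = mhcv_like_loss(target, target)
loss_darker = mhcv_like_loss(darker, target)
```

In this toy setting, a prediction that is uniformly half as bright differs from the target almost entirely in the illumination term (by about log 2 at every scale), while the reflectance term stays near zero, which illustrates how the decomposition separates global exposure errors from local detail errors.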

V. EXPERIMENTAL RESULTS
This section provides the performance evaluation of the DPIENet. After outlining the experimental settings, chosen datasets, and training details, the performance comparisons with SOTA methods are provided to demonstrate the effectiveness and generality of the DPIENet.

A. Dataset
For training, validation, and testing purposes, the MIT-Adobe FiveK dataset [64] is employed. This dataset contains 5000 photographs taken with SLR cameras by various photographers. These photographs cover a broad range of scenes, objects, subjects, and lighting conditions. Each image was retouched by five well-trained photographers using global and local adjustments. Among these retouchers, the result of photographer C was selected as ground truth because these photographs received a high rank in the user study [64]. The untouched images were considered as input images. These consisted of images with standard exposure (S), captured with default camera settings, and images with low exposure (L), obtained with simulated low exposure settings. The dataset was split into three partitions: 4000 images for training, and 500 images (250 low + 250 standard exposure) each for validation and testing. All the images from this dataset were downsized to 512 pixels along the long side for training, validation, and testing purposes.

B. Training Details
For training, RGB input patches of size 256 × 256 along with the corresponding ground truth were considered. The training data were augmented using random horizontal and vertical flips and 90° rotations about the center of the image. According to [53], the ideal initialization for SELU is mean 0 and standard deviation $\sqrt{1/n}$. However, this setting caused the gradients to explode. To stabilize the network, the standard deviation was set to $\sqrt{0.1/n}$. For training the model, the AdaBound optimizer [65] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-8}$, and $\gamma = 1 \times 10^{-3}$ was employed. The batch size was set to 20. The learning rate was initialized as $1 \times 10^{-3}$, and the final learning rate was set to 0.1. The network was trained for a total of $2.85 \times 10^{6}$ updates, and a multistep learning rate scheduler was used to decrease the learning rate by a factor of 0.1 at $9.5 \times 10^{5}$, $1.9 \times 10^{6}$, and $2.375 \times 10^{6}$ iterations. For training, the proposed MHCV loss was employed instead of the L1 and L2 losses. Minimizing L2 is generally preferred as it maximizes the PSNR. However, based on a series of experiments conducted, the MHCV loss provides better convergence than the L1 or L2 loss. The evaluation of this comparison is provided in the next section.

[Table caption fragment] [...] Blue text indicates the second-best performance for the respective input settings. This demonstrates that the proposed DPIENet performs significantly better than SOTA techniques.

Fig. 5. Visual comparisons with respect to the ground truth. Zoom-in regions are used to illustrate the visual difference. DPIENet not only restores the details but also avoids discoloration. The SOTA techniques tend to exhibit a few artifacts, such as variation in color (for example, DPE-UL tends to shift the color from red toward orange, and DPED-Blackberry introduces a green color), overenhancement (for example, FLLF and FIP overenhance the details, which look dark), and blurriness (for instance, the DPED-Sony image looks smoothed).
Note: UL stands for unsupervised learning, and SL stands for supervised learning.
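The learning-rate schedule and the reduced initialization scale quoted in the training details can be written down as a short sketch. It makes two assumptions: that the drop takes effect at the milestone step itself, and that n denotes the number of inputs to a unit (fan-in). It models only the multistep decay, not the internal bounds of the AdaBound optimizer, and the function names are hypothetical.

```python
import math

# Milestones quoted in the text: 9.5e5, 1.9e6, and 2.375e6 iterations.
MILESTONES = (950_000, 1_900_000, 2_375_000)
TOTAL_UPDATES = 2_850_000

def multistep_lr(step, base_lr=1e-3, milestones=MILESTONES, factor=0.1):
    """Learning rate after multiplying by `factor` at every milestone reached."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** passed

def init_std(fan_in, gain=0.1):
    """Standard deviation sqrt(gain / n); gain=1.0 gives the sqrt(1/n) setting."""
    return math.sqrt(gain / fan_in)

lr_start = multistep_lr(0)
lr_end = multistep_lr(TOTAL_UPDATES)

# Example fan-in for a hypothetical 3x3 convolution over 16 input channels.
std_reduced = init_std(3 * 3 * 16)
std_default = init_std(3 * 3 * 16, gain=1.0)
```

The three milestones sit at one third, two thirds, and five sixths of the 2.85 million updates, so the rate falls from 1e-3 to about 1e-6 by the end of training, and the reduced initialization shrinks the default standard deviation by a factor of sqrt(0.1), roughly 0.32.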

C. Benchmark Results
DPIENet is compared with other SOTA algorithms using measures such as PSNR, SSIM [2], GSSIM [3], and UQI [4]. These measures are applied to all the RGB channels of the image. All these measures assess the image quality based on a given reference benchmark image that is assumed to have the desired quality [66]. A higher value indicates that the enhanced image is closer to the ground truth.
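As a reference point for the full-reference measures listed above, PSNR can be computed directly from the mean squared error. The sketch below uses the standard definition on flattened pixel lists (all RGB values concatenated) with intensities assumed to lie in [0, 1]; the function name and the toy values are illustrative.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by 0.1, so the MSE is 0.01 and the PSNR is about 20 dB.
score = psnr([0.0, 1.0], [0.1, 0.9])
identical = psnr([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```

Because PSNR is a logarithmic function of the MSE, a gain of 1 dB corresponds to reducing the MSE by roughly 21 percent.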
The ablation tests comprise experiments exploring different designs and exposure settings. The quantitative performance of different models is provided in Table III. When the LXT and DCA mechanisms are removed from the network, the performance is relatively low. For example, in terms of PSNR, DPIENet without LXT and DCA reaches 21.84 dB; when LXT is added, it increases to 23.31 dB. When both LXT and DCA are combined, it reaches 24.21 dB. This indicates that the proposed LXT+DCA mechanism, along with stacking, is much more powerful than the residual block-stacking method alone and boosts the PSNR by roughly 2.4 dB (from 21.84 to 24.21 dB).
Furthermore, to show the effectiveness of DPIENet with the MHCV loss, a comparison with existing losses, such as L1, L2, SSIM, cosine, and single-scale HCV loss, is also provided in Table III. These results were obtained by computing the PSNR on 500 images (a combination of both low and standard exposure) from the validation set. It can be inferred that the MHCV loss outperforms the L1 and L2 losses by a larger margin. The single-scale HCV loss performs fairly well; however, the PSNR fluctuates across scales; for example, when σ = 0.5, PSNR is 24.02 and when σ = 0.5, PSNR is 24.12. To overcome this variation, multiple sigma levels are utilized in MHCV, and it performs slightly better than the single-scale HCV loss.
The proposed network is compared with SOTA methods for standard and low exposure settings. For the standard exposure input setting, several recent competing methods, such as CLHE [40], FLLF [37], DPE supervised and unsupervised [36], DPED trained with Blackberry, iPhone, and Sony images [33], and FIP [34], were considered. Table IV demonstrates that DPIENet performs significantly better when compared to the other methods. The visual comparison is provided in Figs. 5 and 6. Fig. 5 illustrates that the enhanced colors of the DPIENet are very similar to the [...] [5], DIV2K dataset [68], and a database provided in [45] were utilized. The zoomed regions in both these images demonstrate the color and edge-preserving property of DPIENet when compared to the SOTA techniques, which tend to oversaturate, introduce variations in color, and induce blurriness.

Fig. 6. Real-world visual comparisons of DPIENet with the SOTA models. Zoom-in regions are used to illustrate the visual difference. In the first example, DPIENet successfully suppresses the noise, which is visible in CLHE, FIP, and FLLF. Furthermore, it does not have the halo artifacts that are introduced by DPE-UL and DPED. In the second example, the structural details of the building are preserved when compared to DPE-UL and CLHE. In the third example, the color of the leaves is preserved when compared to the other techniques. DPE-UL has introduced blue sky, which is not present in the input, and the leaves are yellow. In all the examples, DPED introduces blurring, while FIP and FLLF generate underexposed/darker images.
The quantitative results for low exposure settings are provided in Table IV. This indicates that the images are restored with superior quantitative performance. The visual comparison of this setting is illustrated in Fig. 7 (with ground truth) and Fig. 8 (real world). The network reconstructs a visually pleasing image that is close to the ground truth and mimics human perception while retaining natural color rendition. In comparison, the outputs of the SOTA techniques contain exposure artifacts, and their colors are less perceptually similar to the ground truth. Furthermore, the model is compared with the most recent deep learning-based competing low light IE techniques, such as MBLLEN [34], EnlightenGAN [35], DEEPUPE [36], GLADNet [32], and RetinexNet [30]. The proposed network reconstructs perceptually improved images with a higher correlation with the ground truth when compared to the other models.
The merged images from the Google HDR [5] dataset were utilized to show the effectiveness of DPIENet on real-world images. This dataset contains 153 sets of images; each set comprises a merged image and a final reconstructed image along with a reference frame. As DPIENet aims at exposure correction, the merged images were used as inputs to the systems. To compute the quality, no-reference quality measures, such as CRME [70], Brisque [71], and Divine [72], were utilized. Comparative results are provided in Table V. Due to the supervised training of DPIENet, it has to be noted that it tries to enhance the image so that it is close to the reference image, and thus, it is not optimized for

VI. USER STUDY
The user study follows the practice provided in [72]. A paired comparison is adopted to assess the perceptual quality using Qualtrics [73]. For each test, each user was asked to select the preferred image from a pair of images. Using this setup, relative scores are obtained between the proposed DPIENet and the SOTA methods: CLHE, FLLF, DPED-iPhone, and DPE-unsupervised for standard exposure, and MBLLEN, GLADNet, RetinexNet, EnlightenGAN, and DEEPUPE for low exposure.
Fig. 8. Real-world visual comparison of DPIENet with SOTA low exposure methods. Zoom-in regions are used to illustrate the visual difference. In the first example, DPIENet produces visually pleasing realistic colors. DeepUPE and MBLLEN do produce realistic colors; however, they introduce exposure artifacts. In the second example, DPIENet produces images with better details (see zoomed shoe). In the third example, DPIENet provides better visible details and color, as seen in the zoomed regions. Overall, EnlightenGAN and RetinexNet tend to produce unrealistic colors. GLADNet introduces a hazy effect, and DEEPUPE and MBLLEN suffer from exposure-related artifacts.
For this study, five images per comparison were picked randomly from the Adobe FiveK dataset (testing and validation images) [64], NASA dataset [67], Google HDR [5], DIV2K dataset [68], and a database provided in [45]. Each participant was asked to compare 50 pairs of images. The users were instructed to consider the following aspects: 1) visible noise; 2) over or underexposure artifacts; 3) overenhancement; and 4) unrealistic color or texture distortions. For detailed analysis, the results from 45 participants were considered. The percentage of times that users chose DPIENet over the SOTA methods for both low and standard exposure images is provided in Fig. 9. The bar plot shows how often users preferred DPIENet versus each SOTA method. For example, DPIENet was chosen 64.44% of the time when compared to CLHE under standard exposure methods. On average, the proposed DPIENet is preferred by 76% and 79% of users for standard and low exposure settings, respectively. These averages are obtained by taking the mean of the graph bars of Fig. 9. The runner-up was CLHE for the standard exposure (S) methods and EnlightenGAN for the low exposure (L) methods. For further analysis, the global score was obtained by fitting the results of paired comparisons to the Bradley-Terry (BT) model [74]. The normalized zero mean BT score for both exposures is quantified in Table VI. These scores, along with the user study, show that the results of the proposed method have higher perceptual quality than existing SOTA methods.
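Fitting paired-comparison counts to the BT model can be sketched as below. This is an illustrative sketch only: it uses the standard minorization-maximization (Zermelo) update for the BT maximum-likelihood estimate, which may differ from the fitting procedure the authors used, and it assumes that "normalized zero mean" refers to log-strengths shifted to have zero mean. The function names and the toy counts are hypothetical and are not the study's data.

```python
import math


def fit_bradley_terry(wins, iterations=500, tol=1e-12):
    """Estimate BT strengths from a square matrix of win counts.

    wins[i][j] is the number of times method i was preferred over method j.
    Assumption: every method wins at least one comparison, so that all
    strengths stay strictly positive.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons involving i, each weighted by the current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        # Rescale so the strengths sum to n (the scale is arbitrary in BT).
        scale = n / sum(updated)
        updated = [s * scale for s in updated]
        change = max(abs(a - b) for a, b in zip(updated, strengths))
        strengths = updated
        if change < tol:
            break
    return strengths


def zero_mean_log_scores(strengths):
    """Convert strengths to log-scores shifted to have zero mean."""
    if any(s <= 0 for s in strengths):
        raise ValueError("all strengths must be positive")
    logs = [math.log(s) for s in strengths]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


# Toy usage: method 0 is preferred 3 times, method 1 once.
toy_wins = [[0, 3], [1, 0]]
toy_strengths = fit_bradley_terry(toy_wins)
toy_scores = zero_mean_log_scores(toy_strengths)
```

With only two methods, the fitted strength ratio equals the ratio of win counts (3 to 1 in the toy data), which gives a quick check that the update is implemented correctly.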

VII. CONCLUSION
In this work, a novel deep learning-based image enhancement method for exposure restoration is presented. The method is built on multiexposure simulation using LXT. The proposed DPIENet, which is an end-to-end mapping approach, comprises a condense network and an enhance network, which leverage the idea of residual learning to reach a larger depth. Furthermore, the skip connection between these networks aids in recovering spatial information while upsampling. In addition, to improve the network's ability to realize the context of the image, global features are exploited from each group in the condense network. A DCA mechanism to adaptively rescale channelwise features is employed to further boost the network's channel interdependencies. To obtain realistic images that correlate to human vision, a novel multiscale human vision loss is presented; it aids in accounting for the global variation in illumination, details, and colors. Extensive quantitative, qualitative, and user study evaluations conducted on the presented technique demonstrate that DPIENet's performance surpasses the existing methods and achieves SOTA results. Furthermore, DPIENet overcomes artifacts, such as halo effects, noise amplification in dark regions, and artificial color generation, which occur in a few existing techniques. As a part of the future work, the authors intend to test the accuracy of the system for various low-level computer vision tasks, such as super-resolution, image recoloring, and image denoising.

ACKNOWLEDGMENT
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Combat Capabilities Development Command, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.