High Dynamic Range Imaging via Visual Attention Modules

Thanks to High Dynamic Range (HDR) imaging methods, the scope of photography has recently changed profoundly. Such methods attempt to reconstruct, from Low Dynamic Range (LDR) images, the real-world luminosity that is lost due to the limitations of regular cameras. Although the State-Of-The-Art (SOTA) methods on this topic perform well, they mainly concentrate on combining different exposures and pay less attention to extracting the informative parts of the images. This paper therefore introduces a new model capable of incorporating information from the most visible areas of each image, extracted by a Visual Attention Module (VAM) built on a segmentation strategy. In particular, the model, based on a deep learning architecture, utilizes the extracted areas to produce the final HDR image. The results demonstrate that our method outperforms most SOTA algorithms.


Introduction
In the scope of photography, the real world contains an unlimited range of luminance. However, most devices can capture only a limited portion of that light. Therefore, the captured images are not desirable and contain saturated regions, in which some parts of the images are too dark (underexposed) or too bright (overexposed). These types of pictures are called LDR images. To cope with this problem, highly advanced cameras [1][2][3][4][5][6][7] with special sensors that capture more light can be used. However, such devices are generally too expensive and too heavy for daily life, and are instead mostly used in industry.
A possible resolution for this drawback is developing software algorithms called HDR imaging techniques. HDR images can be produced from a single image [8][9][10][11] or by fusing a stack of images with different exposures; these are called single- and multi-exposure methods, respectively. In single-image algorithms, an HDR image is produced from one LDR image. However, the generated picture might not be as informative as an HDR image produced from several LDR images, because the amount of detail in one single picture is limited compared to several images with different exposures. More precisely, [8] implemented an algorithm that only reconstructs the detail of bright saturated areas. However, that model can neither restore the detail of dark regions nor perform well when the bright saturation is too extensive. Thus, [12] first combined several LDR images and then fed the low-frequency response of the wavelet transform to the network to produce more detail in a shorter time.
Fortunately, multi-exposure methods are more effective and informative than single-exposure techniques. These methods perform well when the images are static [13,14], but when there is movement in the sequence of pictures, the ghosting problem emerges, which is largely solved in [15][16][17][18][19][20].
Deep learning has been a significant means of producing HDR images for the past decade. For instance, [8] produced an HDR picture in the logarithmic domain with the help of a deep neural network. Additionally, [21] used a neural network to reconstruct the detail of an image with a different exposure in each row in the irradiance domain. Unlike other multi-exposure methods, [13,14] used a neural network to produce synthetic LDR images with different exposures from a single image. Furthermore, [16] proposed to first align images with optical flow and then combine them with a deep neural network. Instead of using optical flow for alignment, [15] proposed two different neural networks, the first to align the images and the second to combine the aligned ones. Finally, [22] used a neural network to learn the relative relation between the inputs and the Ground Truth, using input images at different scales.
In this article, we exploit image segmentation via the Otsu method [23] in HDR imaging to extract the most visible areas of the images and help the model produce pictures with more detail. To this end, Visual Attention Modules (VAMs) are proposed to obtain such regions. Moreover, the Spatial and Attention modules are adopted from a State-Of-The-Art method, and a new architecture is designed and implemented for the Reconstruction stage, in which the visual attention and the reference image are used in the decoder part. Finally, although VAMs helped produce pictures with more detail and outperformed most State-Of-The-Art methods, the results still show a slight amount of noise carried over from the input images.
In section 3, the State-Of-The-Art in HDR imaging and related image segmentation is presented. In section 4, the proposed method is discussed in detail. Section 5 demonstrates the experimental results and a comparison with the State-Of-The-Art methods. Section 6 concludes this article with ideas for further work. Finally, the code will be available on the GitHub page.

Related work
In this section, we discuss the State-Of-The-Art methods in the Multi-Exposure category of HDR imaging (Section 3.1) and survey unsupervised Image Segmentation methods for extracting regions (Section 3.2).

Multi-Exposure Methods
[24] proposed a two-stage algorithm that extracts features from the input images in the first phase and merges them to produce the HDR image in the second. To cope with the noise introduced by the gamma correction operation on input images, i.e. the gamma-corrected Short-Exposure image becoming similar to the Medium-Exposure one, they used a U-net to extract noiseless features. Moreover, [25] implemented a model in which lower-scale images are used to reduce resource consumption, and a novel loss function was defined to focus more on motion. Furthermore, [26] forwarded features at different scales to deformable and spatial attention blocks to align images in the feature space and extract the features of specific areas of the input images. [27] proposed a model that first estimates the optical flow between the two input images at different scales and then fuses them to produce the final output. In [28], features are extracted at different scales and then processed by sampling and aggregation modules to align the pixels of the non-reference features.
The work [29] implemented a baseline with lower computational cost and acceptable results compared to the other State-Of-The-Art models. It uses a dual attention module, combining spatial and channel attention, to cope with misalignment and better learn the details of the produced areas. In [30], the authors proposed a model that first extracts features from input images with multi-scale encoding modules and then produces an HDR image with progressively dilated U-shape blocks.
[31] demonstrated that the ghosting problem mainly resides in the low-frequency signals, and therefore proposed a wavelet-based model that merges images in the frequency domain to avoid ghosting. [32] implemented an algorithm that extracts the dynamic areas of the images with the help of image segmentation and applies two separate neural networks to the static and dynamic scenes; the information is then merged to produce a ghost-free HDR image. In [33], a model based on bidirectional motion estimation was proposed, in which the optical flow between LDR images is estimated by motion estimation with cyclic cost volume and spatial attention maps, and an HDR image is eventually produced from the extracted local and global features. [34] implemented the first multi-bracket HDR pipeline using event cameras, merging the extracted features of the images and the events to produce an HDR image. [35] proposed a transformer-based baseline that uses a context-aware vision transformer to extract local and global features, modeling the movement of objects and the diversity of intensity.

Image Segmentation
Image segmentation is a crucial task in computer vision that partitions images into segments so that pictures can be analyzed more easily. It can be used not only for object recognition, detection, and medical purposes but also for extracting the regions of pictures with more detail. In [36], images were analyzed in the HSV color space to segment pixels based on Intensity or Hue value. Two luminance-based image segmentation methods were proposed: histogram division [37] and clustering based on a Gaussian Mixture Model (GMM) of the histogram [38]. Furthermore, [39] calculated an optimal valley point based on the slope between the histogram value of each pixel and the neighboring points, and used the computed valley point to segment regions. The literature on the topic is endless, depending on applications and methodologies, from level set methods [40] to graph cuts [41] to recent deep learning-based frameworks [42].

Overview
As noted in [43], it can be beneficial to first segment images based on exposure information, extract the best and most detailed regions from the Over- and Under-Exposure regions, and exploit this knowledge in reconstructing an HDR image. Following this idea, this paper proposes a model in which, with the help of image segmentation, regions with more detail are first segmented in the preprocessing stage. They are then fed to the model along with the input images to produce an HDR image with the help of VAMs.
Generally, the model can be divided into several sections. First, the input images are fed into the feature extraction module, and the extracted features then enter the attention and spatial alignment modules to cope with any possible misalignment. Simultaneously, the input images with their corresponding masks go to the VAM to extract the visible areas of the LDR images. Next, the outputs of the three modules are fed to the Reconstruction stage to produce the initial HDR image. Finally, the generated outcome, together with the features of the reference image, enters the refinement section to construct the final HDR image.

Preprocess
In this article, the inputs are three LDR images with different exposures, and the Medium-Exposure image is considered the reference image. Before feeding the input images to the model, they are first mapped to the HDR domain with the help of gamma correction. Finally, they are concatenated channel-wise with their corresponding LDR images:
Î_i = (I_i)^γ / t_i,  i = 1, 2, 3

where t_i is the exposure time of I_i, γ is the gamma correction parameter, which was set to 2.24, and Î_i is the gamma-corrected image.
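As an illustration, a minimal NumPy sketch of this preprocessing step follows; the function names and the [0, 1] input range are our assumptions, not part of the original implementation.

import numpy as np

def to_hdr_domain(ldr, exposure_time, gamma=2.24):
    # Map an LDR image in [0, 1] to the HDR domain: I_hat = I**gamma / t.
    return (ldr ** gamma) / exposure_time

def preprocess(ldr, exposure_time):
    # Concatenate each LDR image with its gamma-corrected version channel-wise.
    return np.concatenate([ldr, to_hdr_domain(ldr, exposure_time)], axis=-1)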

Segmentation
Most existing algorithms in HDR imaging focus on how the image is produced, and few pay attention to how the most helpful features can be extracted. Thus, in this research, the regions of the pictures with more detail are segmented and extracted as a preprocessing step and are fed to the proposed model along with the LDR images as the inputs.
Different methods, such as a neural network and the Otsu method, were tried for the image segmentation stage; however, the neural network resulted in overfitting. Thus, the Otsu method has been selected to segment the visible areas of the pictures. The images are converted into the YUV color space, and a threshold is calculated based on the histograms of the Short- and Long-Exposure images:
thresh_i = G(Y_i)

in which Y_i is the luminance channel of LDR image i, G(·) is the Otsu function, and thresh_i is the threshold value of image i.
In the Short-Exposure image, because most of the pixels are dark and the objective is to extract the regions with visible pixels, the values equal to or greater than the threshold are set to one, and the rest to zero, in the Short-Exposure mask:
mask_1(p) = 1 if Y_1(p) ≥ thresh_1, and mask_1(p) = 0 otherwise

where thresh_1 is the threshold value of the Short-Exposure image, and p is a pixel.
On the other hand, because most of the pixels in the Long-Exposure image are saturated and the visible pixels have the lowest values, the values less than the threshold are set to one, and the rest to zero, in the Long-Exposure mask:

mask_3(p) = 1 if Y_3(p) < thresh_3, and mask_3(p) = 0 otherwise

where thresh_3 is the threshold value of the Long-Exposure image.
By doing so, the masks of the areas with more detail are extracted and can help to produce an HDR image.
Generally, most of the pixels in the Short- and Long-Exposure images are too dark or too bright, respectively. Therefore, the locations of the areas with the most information are extracted and fed to the model. Doing so reduces the amount of computation and helps produce an HDR image with more detail. Fig. 1 demonstrates the segmented, visible regions of both Short- and Long-Exposure pictures. During experiments, three input images with different exposures were tried for image segmentation, in which, after obtaining the suitable areas of the Short- and Long-Exposure images, the remaining regions were extracted from the Medium-Exposure image. However, the acquired areas of the Medium-Exposure image were not meaningful, as most of them covered only a few pixels. Thus, there are two reasons for not using the Medium-Exposure image in the segmentation stage. First, it would be challenging to calculate a range for the visibility of its pixels. Second, the Medium-Exposure image is the reference image and is used directly in the neural network, so segmenting it is not necessary.
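The mask computation can be sketched with OpenCV's built-in Otsu thresholding; this snippet is illustrative, and the function names and uint8 input assumption are ours, not the authors'.

import cv2
import numpy as np

def exposure_mask(bgr_uint8, short_exposure):
    # Work on the Y (luminance) channel of the YUV color space.
    y = cv2.cvtColor(bgr_uint8, cv2.COLOR_BGR2YUV)[..., 0]
    # Otsu picks the threshold that minimizes intra-class variance.
    thresh, _ = cv2.threshold(y, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    if short_exposure:
        # Short-Exposure: keep the pixels at or above the threshold.
        return (y >= thresh).astype(np.float32)
    # Long-Exposure: keep the unsaturated pixels below the threshold.
    return (y < thresh).astype(np.float32)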

Proposed Method Structure
As shown in Fig. 2, the proposed algorithm consists of six stages, which will be discussed separately and in detail.

Feature Extraction
Fig. 3 illustrates the Feature Extraction block, in which a SepConv is applied to the image to extract 32 feature maps. Afterward, a Max Pool and an Average Pool are used to not only smooth the features and focus on the details but also pay more attention to the edges. Next, the outputs of the poolings are concatenated, and another SepConv + ReLU reduces the number of channels back to 32. Finally, the extracted features are upsampled to the size of the input image. The feature extraction can be written as

S_i = SepConv(X_i)
C_i = Concat(A(S_i), M(S_i))
F_i = Up(ReLU(SepConv(C_i)))

for i = 1, 2, 3, where X_i is the i-th concatenated input, A(·) and M(·) are the Max Pooling and Average Pooling functions, respectively, C_i is the output of the concatenation, and F_i is the output of the Feature Extraction block.
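A minimal Keras sketch of this block follows; the kernel sizes and pooling strides are not stated in the text and are assumptions here.

from tensorflow.keras import layers

def feature_extraction_block(x):
    s = layers.SeparableConv2D(32, 3, padding='same')(x)   # 32 feature maps
    mp = layers.MaxPooling2D(2)(s)                         # attends to edges
    ap = layers.AveragePooling2D(2)(s)                     # smooths features
    c = layers.Concatenate()([mp, ap])                     # C_i
    f = layers.SeparableConv2D(32, 3, padding='same', activation='relu')(c)
    return layers.UpSampling2D(2)(f)                       # F_i, back to input size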

Visual Attention Module
As mentioned, Image Segmentation is used in this article to help the model produce a better image. Therefore, as shown in Fig. 4, the input images are first multiplied element-wise by their corresponding masks. By doing so, the regions with more detail are kept, and those that are overly dark or too bright are removed. Next, the masked images are fed to the Feature Extractor, and the resulting features are added together element-wise. The VAM can be formally defined as follows:

features_L = F(multiply(mask_L, I_L)) (7)
features_H = F(multiply(mask_H, I_H)) (8)
V = features_L + features_H

where F is the feature extractor function, and V is the output feature of the VAM.
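A sketch of the VAM, reusing the feature_extraction_block sketch above; broadcasting the single-channel masks over the image channels is our assumption.

from tensorflow.keras import layers

def visual_attention_module(i_short, mask_short, i_long, mask_long):
    # Keep only the visible regions of each image (masks are 0/1 maps).
    feats_short = feature_extraction_block(i_short * mask_short)
    feats_long = feature_extraction_block(i_long * mask_long)
    return layers.Add()([feats_short, feats_long])  # V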

Spatial Alignment Module
Because the input LDR images are not aligned, the features extracted from the LDR images (without gamma correction) are fed to an ad hoc module for alignment. To this end, we use the same Feature-alignment Module as [30]. As can be seen in Fig. 5, a Conv + ReLU is first applied to the reference features, producing Ref_1. Next, a Conv + ReLU is applied to Ref_1, and the result is multiplied element-wise by the input LDR features, producing M_i (for i = 1, 3). Finally, another Conv + ReLU is applied to Ref_1, and the result is added element-wise to M_i. Formally, the operation in the module can be written as

Ref_1 = ReLU(Conv(F_r))
M_i = ReLU(Conv(Ref_1)) ⊙ F_i
O_i = ReLU(Conv(Ref_1)) + M_i,  i = 1, 3

where F_r denotes the reference features, ⊙ is the element-wise product, and O_i is the aligned output.

Attention Module

The Attention Module is almost identical to [30], in which, as shown in Fig. 6, attention maps are produced for the Short- and Long-Exposure images so that they can be merged with the reference image as guidance. The features of the gamma-corrected images are concatenated with the reference features, and then SepConv + ReLU and SepConv + Sigmoid operations are applied:

A_i = Sigmoid(SepConv(ReLU(SepConv(Concat(f_i, f_r)))))

where f_i and f_r are the features of the gamma-corrected and reference images, respectively.
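The two modules can be sketched as follows; the kernel sizes and channel counts are assumptions, and multiplying the attention map back onto the non-reference features follows the common attention design rather than an explicit statement in the text.

from tensorflow.keras import layers

def spatial_alignment(ref_feats, ldr_feats):
    ref1 = layers.Conv2D(32, 3, padding='same', activation='relu')(ref_feats)
    m = layers.Multiply()([
        layers.Conv2D(32, 3, padding='same', activation='relu')(ref1),
        ldr_feats])                                          # M_i
    a = layers.Conv2D(32, 3, padding='same', activation='relu')(ref1)
    return layers.Add()([a, m])                              # O_i

def attention(f_i, f_r):
    c = layers.Concatenate()([f_i, f_r])
    c = layers.SeparableConv2D(32, 3, padding='same', activation='relu')(c)
    att = layers.SeparableConv2D(32, 3, padding='same', activation='sigmoid')(c)
    return layers.Multiply()([f_i, att])                     # guided features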
Figure 7: The overall scheme of the Reconstruction stage.

Reconstruction
All the extracted features from the modules are concatenated and fed to the Reconstruction stage. As shown in Fig. 7, the input is merged and new features are produced with the help of four encoder blocks. Next, each decoder block receives features from the encoder along with the features of the reference image and the VAM. Finally, a SepConv + ReLU produces the output of the stage. Each encoder block (Fig. 8, left) first applies SepConv, Batch Normalization, and ReLU layers to its input. Afterward, similar to the Feature Extraction module, Max and Average Poolings are used. Finally, their outputs are concatenated and sent to the next block.
Moreover, each decoder block (Fig. 8, right) takes three inputs: the features of the VAM, the features of the reference image, and the output of the previous block. First, Average Pooling is applied to the first two inputs to match the size of the previous block's output, and they are then concatenated with it. Finally, SepConv + ReLU and Upsampling are applied, respectively. The output of the Reconstruction stage may still contain blurry, saturated, or dark areas; therefore, a refinement section has been added to cope with such possible issues with the help of the features of the reference image.
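A sketch of the two blocks; the filter counts and the pooling factor used to match resolutions are assumptions.

from tensorflow.keras import layers

def encoder_block(x, filters=32):
    x = layers.SeparableConv2D(filters, 3, padding='same')(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    mp = layers.MaxPooling2D(2)(x)
    ap = layers.AveragePooling2D(2)(x)
    return layers.Concatenate()([mp, ap])

def decoder_block(prev, vam_feats, ref_feats, filters=32, pool=2):
    # Downscale the VAM and reference features to the previous output's size.
    v = layers.AveragePooling2D(pool)(vam_feats)
    r = layers.AveragePooling2D(pool)(ref_feats)
    x = layers.Concatenate()([prev, v, r])
    x = layers.SeparableConv2D(filters, 3, padding='same', activation='relu')(x)
    return layers.UpSampling2D(2)(x)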

Refinement
As Fig. 9 illustrates, a SepConv + ReLU is first applied to the features of the reference image to reduce the number of feature maps. After concatenating the inputs, SepConv and SepConv + ReLU are applied, respectively. The process is repeated two more times, and eventually a Conv + Sigmoid is applied to produce the final image in sigmoid space. The process in Refinement is represented in pseudo-code in Algorithm 1.
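Since Algorithm 1 is given only in outline, the following Keras sketch shows one plausible reading of the stage; the channel counts are assumptions.

from tensorflow.keras import layers

def refinement(recon, ref_feats):
    x = recon
    for _ in range(3):  # the initial pass plus two repetitions
        # Reduce the reference feature maps, then concatenate and convolve.
        r = layers.SeparableConv2D(16, 3, padding='same',
                                   activation='relu')(ref_feats)
        x = layers.Concatenate()([x, r])
        x = layers.SeparableConv2D(32, 3, padding='same')(x)
        x = layers.SeparableConv2D(32, 3, padding='same', activation='relu')(x)
    # Final image produced in sigmoid space.
    return layers.Conv2D(3, 3, padding='same', activation='sigmoid')(x)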
Notice that, in this research, the Ground Truth images are mapped from HDR space into sigmoid space. Based on our experiments, transforming the values into sigmoid space helps the network converge more easily. The reason for changing the space is that the values in HDR space are too large, and a model with a low number of parameters cannot learn to produce an HDR image correctly; by mapping them to sigmoid space, the proposed model outperforms the same model trained in HDR space.
Experiments and Results

Dataset
A new dataset was collected for the HDR Imaging Challenge [44,45]. This dataset provides two types of pictures (Single-Exposure and Multi-Exposure images); only the Multi-Exposure images were used in this research. More specifically, the dataset includes images from [46] that were generated as follows. First, HDR images were captured natively by two Arri Alexa cameras with a mirror rig; then, their corresponding LDR images were generated synthetically with noise sources. The dataset contains approximately 1500 HDR/LDR image pairs for the training set, 40 for the validation set, and 200 for the test set, at a resolution of 1900x1060. However, in this research, we randomly selected 200 images of the training set as a test set and trained the model with the remaining roughly 1300 pairs.

Implementation Details
The highlights of the model are summarized briefly in Table 1. The weights of the model were initialized randomly; no pre-trained weights were used. The remaining details of the proposed method are discussed in the following subsections.

Loss function
The Mean Absolute Error (MAE) loss function is used to train the model. The difference is that the Ground Truth is first mapped to the sigmoid domain, and MAE is then calculated in sigmoid space between the Ground Truth and the output of the model:

GT_n = sigmoid(GT) (15)
L = MAE(GT_n, ŷ)

where GT_n is the Ground Truth image in the new domain, and L is the loss between the Ground Truth and the output ŷ. Furthermore, after training the model in sigmoid space, the inverse sigmoid is used to re-map the output to HDR space:

HDR = ln(ŷ / (1 − ŷ))

where HDR is the output in HDR space and ŷ is the image in the sigmoid domain.
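A sketch of the loss and the inverse mapping; the clipping epsilon is our addition to keep the logarithm finite.

import tensorflow as tf

def sigmoid_mae_loss(gt_hdr, pred_sigmoid):
    gt_n = tf.sigmoid(gt_hdr)                        # GT mapped to sigmoid space
    return tf.reduce_mean(tf.abs(gt_n - pred_sigmoid))

def inverse_sigmoid(y_hat, eps=1e-6):
    y_hat = tf.clip_by_value(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return tf.math.log(y_hat / (1.0 - y_hat))        # back to HDR space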

Training
Flipping the images vertically or horizontally is used for augmentation during training. Before being fed to the model, the images are resized to 256x256. Resizing is used instead of producing patches because some patches generated from the masks may be totally black or completely white, which causes the model to pay less attention to the Short-Exposure images.
Moreover, the batch size and the number of epochs are set to 16 and 100, respectively. The Adam Optimizer is used with an initial learning rate of 0.001, which is reduced by a factor of 0.1 if the validation accuracy does not improve. Finally, the whole model is implemented in the Tensorflow (Keras) framework and trained on a DGX-A100 GPU.
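The stated settings translate to roughly the following Keras calls, reusing the sigmoid_mae_loss sketch above; model, x_train, y_train, x_val, and y_val are assumed placeholders, and val_loss is monitored here as a stand-in for the validation metric mentioned in the text.

import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=sigmoid_mae_loss)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.1)
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          batch_size=16, epochs=100, callbacks=[reduce_lr])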

Validation
During validation, the images are first padded from 1900x1060 to 1920x1080 and then fed to the model without any augmentation.
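A NumPy sketch of this padding step; the symmetric reflect padding mode is an assumption, since the text does not specify how the borders are filled.

import numpy as np

def pad_for_inference(img, target_h=1080, target_w=1920):
    h, w = img.shape[:2]
    ph, pw = target_h - h, target_w - w
    return np.pad(img, ((ph // 2, ph - ph // 2),
                        (pw // 2, pw - pw // 2), (0, 0)),
                  mode='reflect')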

Quantitative Comparison
The results in this paper are compared with the State-Of-The-Art methods by PSNR in the HDR and tone-mapped domains. PSNR-µ is the tone-mapped version, in which the images are tone-mapped with the µ-law. The results are also compared with the State-Of-The-Art methods in GMACs and number of parameters. As mentioned in [45], the challenge comprised two tracks, Fidelity and Low Complexity. In the first, the methods were required to obtain the highest µ-PSNR while keeping GMACs below 200. In the second, the GMACs value had to be reduced below that of the baseline method while keeping PSNR and µ-PSNR roughly equal to the baseline. The proposed method has been compared with the GSANet [24], DRHDR [26], and Vien et al. [33] methods. As Table 2 shows, the proposed method has the highest PSNR and the second highest µ-PSNR. On the other hand, Vien et al. [33] has the lowest GMACs value, and GSANet ranks second lowest. In terms of the number of parameters, GSANet has the fewest, and the proposed method comes second among the algorithms. For further study, the proposed method was trained and tested in both HDR and sigmoid spaces to check which space is superior for training the model. As Table 3 demonstrates, the proposed method in sigmoid space outperformed the version trained in the HDR domain; moreover, during training, the model in sigmoid space converged more quickly.
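The two evaluation metrics can be sketched as follows; the µ = 5000 value is the common choice in the HDR literature and is an assumption here.

import numpy as np

def psnr(gt, pred, peak=1.0):
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def mu_law(x, mu=5000.0):
    return np.log(1.0 + mu * x) / np.log(1.0 + mu)

def psnr_mu(gt, pred):
    # PSNR computed between the tone-mapped (mu-law) images.
    return psnr(mu_law(gt), mu_law(pred))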

Qualitative Comparison
As can be seen in Fig. 10, the images produced by our method are better reconstructed than those of DRHDR and An et al. More specifically, Fig. 10 shows the results of ours, DRHDR [26], An et al. [33], and GSANet [24]. The output of An et al. in the first scene has distortion in the bright areas, and it is visible that the algorithm cannot correctly restore the details there; there is also some degradation in the dark regions. Although DRHDR performed well and reconstructed both areas, it was not able to recover the details in over-saturated areas. For instance, looking at the red and green boxes, that model did not reconstruct the details of the hands and the shirt, while the proposed method produced more detail in these two regions. The image produced by GSANet shows significant detail and is almost similar to ours; more precisely, although both methods reconstruct the shirt nicely, GSANet recovers more detail in the hand than ours.
Additionally, in the second scene, DRHDR and An et al. were unable to reconstruct the branches that were only visible in the Short-Exposure image and restored only a part of them. In contrast, the proposed method and GSANet performed well in this regard. Finally, looking at the last scene, it is visible that the proposed method outperformed the first two algorithms and reconstructed more detail in both dark and bright areas, as the details of the sky show.
Furthermore, although the segmentation helped the model produce better results, the method might encounter two possible issues. First, due to plausible noise in the input images, using segmentation to extract visible areas may also pick up the noise, and the produced image might become noisy. Second, although spatial alignment and attention modules are used to avoid possible ghosting, if the input images contain severe movement, the output might also exhibit ghosting: the segmentation is applied to the Short- and Long-Exposure images to extract their visible areas, so some parts of the images might not be aligned. For future research, we would like to investigate methods that use segmentation while avoiding such noise or misalignment.

Conclusion
In this article, we proposed a new method for HDR imaging with the help of image segmentation. More specifically, we first applied the Otsu method to the Short- and Long-Exposure images to acquire the areas with more detail. Afterward, the input images, along with the segmentation outputs, were fed to the model to produce the HDR image. The results show that the proposed method outperformed the State-Of-The-Art and generated more detail. However, the proposed model is not free of issues: in the case of noise or misalignment in the input images, the output might contain a slight amount of noise or misalignment due to the extracted areas of the input images. Therefore, for future research, we would like to focus on investigating these two problems.

Figure 1: Produced masks of the Short- and Long-Exposure images.

Figure 2: The overall pipeline of the proposed method.

Figure 3: The structure of the Feature Extraction Block.

Figure 4: The structure of the Visual Attention Module (VAM).

Figure 5: The structure of the Spatial Alignment Module.
Figure 6: The structure of the Attention Module.

Figure 8: The structure of the blocks in the encoder (left) and the decoder (right).

Figure 9: The structure of the Refinement Stage.

Figure 10: Qualitative comparison with the State-Of-The-Art. The first row of each scene contains the short, medium, and long exposure images, respectively. The second row includes ours, DRHDR, An et al., and GSANet outcomes, respectively.

Table 1: Brief highlights regarding the training and validation settings for the proposed method.

Table 2: Comparison with the State-Of-The-Art methods. The bold numbers are the best values, and the underlined ones are the second best.

Table 3: Comparison between the proposed method in HDR and sigmoid spaces.