Deep Gradual Multi-Exposure Fusion Via Recurrent Convolutional Network

The performance of multi-exposure image fusion (MEF) has recently been improved by deep learning techniques, but several problems remain to be overcome. In this paper, we propose a novel MEF network based on a recurrent neural network (RNN). Multi-exposure images carry different useful information depending on their exposure levels. To fuse them complementarily, we first extract the local detail and global context features of the input source images, and the two kinds of features are fused separately. A weight map is learned from the local features so that the fusion reflects the importance of each source image. Adopting an RNN as the backbone network enables gradual fusion, in which each additional input further improves the fused result. In addition, information can be transferred to the deeper levels of the network. Experimental results show that the proposed method reduces fusion artifacts and improves detail restoration performance compared to conventional methods.


I. INTRODUCTION
Multi-exposure fusion (MEF) has been a popular approach for HDR image generation. It fuses several low dynamic range images obtained by capturing the same scene at different exposure levels. Due to the limited dynamic range of digital camera sensors, some regions of the scene may be under-exposed or over-exposed (even leading to saturation), depending on the exposure time. Thus, multi-exposure images can be complementarily combined to generate a single HDR image that contains the whole dynamic range of the scene. In general, over-/under-exposed images lose detailed information, which leads to low contrast and quality. Therefore, MEF aims to produce a high-quality fused image with better brightness and restored detail.
Since MEF was proposed by Mertens et al. [1], many studies have been conducted in the literature. Conventional MEF approaches can be divided into spatial domain based and transform domain based methods. The spatial domain based methods fuse multi-exposure images directly in the spatial domain. The weights for their fusion are calculated by analyzing MEF images from various perspectives such as contrast ratio, saturation [1], image block [2], and gradient [3]. In contrast, in the transform domain approach, source images are first transformed into another domain and the fusion is then performed there. A variety of such methods have been proposed, including wavelet [4], [16], multi-scale decomposition [6], [15], and sparse representation [7], [8]. However, the performance of both MEF approaches is fundamentally limited in that they mainly rely on hand-crafted features for image fusion. Further performance improvement requires well-designed feature extraction and fusion rules, which is a challenging task.
The associate editor coordinating the review of this manuscript and approving it for publication was Hengyong Yu.
Recently, convolutional neural networks (CNNs) have been widely used for image fusion [9], [12], [23]. While this CNN-based fusion approach achieves better performance than non-deep-learning methods, there are still some challenges to be overcome. First, the deep learning framework is used only in a limited part of MEF, such as feature extraction, while conventional fusion strategies such as the weighted sum are still used unchanged. In addition, the quality of the fused image is deteriorated by using features extracted from only limited information (e.g., the Y channel) of the source image. As a result, detail restoration performance is degraded in over-/under-exposed regions. Finally, it was observed through experiments that artifacts such as local dark regions often occur in fused images, in particular in regions whose brightness differs significantly between the multi-exposure images.
In this paper, we propose a novel CNN-based MEF architecture, which is called Deep Gradual Multi-Exposure Fusion via Recurrent Convolutional Network (DGMEF-RNN). In general, deep learning based MEF methods go through feature extraction, fusion, and reconstruction. In the proposed network, both global context and local detail features are extracted using dilated convolution filters. Fusion and reconstruction are done through an RNN and a residual network, respectively. Unlike conventional methods, in which an auto-encoder is mainly adopted as the backbone, we propose a novel RNN-based fusion model. An RNN builds a connection between the output and the next input of the network. As shown in Fig. 1, the stepwise fusion process of the proposed RNN-based network allows long-range dependency, so that the information of each source image is transmitted to deeper levels of the network to generate a high-quality fused image. The fusion module consists of two blocks. The global context fusion block naturally fuses global components such as the color and style of the source images. The local detail fusion block preserves the detail components of the source images by learning appropriate weights for fusion. The result of the proposed method is richer in color and better illuminated in all regions, thus more fully revealing the details of the source images with higher saturation.
The contributions of the paper are summarized as follows: (1) We propose a novel RNN-based MEF architecture, which sequentially transmits global context information to the entire network. As far as we know, this is the first work to implement the sequential fusion of multiple exposed images with an RNN. (2) The proposed method strengthens the detail feature fusion of source images through the learned weight map and effectively restores the local detail components of the fused image.
The rest of the paper is organized as follows. Section II introduces related works on CNN-based multi-exposure image fusion. Section III describes our RNN-based multi-exposure image fusion architecture. Section IV verifies the effectiveness of our proposed MEF method visually and quantitatively. Finally, Section V provides the concluding statements.

II. RELATED WORKS
Deep neural networks have recently been applied to various image fusion problems. Liu et al. [9] studied convolutional sparse representation (CSR) for image fusion, where a fusion weight map is learned to distinguish the focused and unfocused regions of the source image. Liu et al. [10] proposed a medical image fusion method based on a CNN, which is used to generate a weight map representing the extent of pixel activity in the source image. They also introduced a local similarity-based strategy to adaptively adjust the fusion rules for the decomposed coefficients. In the above methods, the CNN is only adopted to generate a weight map that incorporates pixel activity information, and the entire fusion process is still performed in the traditional way using multi-scale image pyramids.
Li and Wu [11] proposed DenseFuse for the fusion of infrared and visible images. Dense blocks are used in the encoder, and a new fusion strategy is proposed to fuse the feature maps. The feature maps of the source images are combined in the fusion layer using two manually designed fusion strategies (addition and l1-norm). DenseFuse uses a non-referential metric (structural similarity index measurement and Euclidean distance) as the loss function for unsupervised learning. Li et al. [12] decomposed source images into base parts and detail content. The base parts are fused by weighted averaging. For the detail content, deep learning networks are used to extract multi-layered features. These features are used to generate multiple candidates of fused detail content using l1-norm and weighted average strategies. Finally, the two parts are combined for reconstruction. Motivated by the availability and effectiveness of the generative adversarial network (GAN), Ma et al. proposed FusionGAN [13] to fuse infrared and visible images. The fused image produced by the generator is forced to restore more of the details existing in the visible image by applying the discriminator to distinguish differences between them. Kalantari and Ramamoorthi [14] proposed to obtain tone-mapped and ghost-free fused images from multi-exposure images through a CNN. They collected a static set of low dynamic range (LDR) images and then fused them into a high dynamic range (HDR) image using a simple triangle weighting scheme. In this way, fusion research is being conducted in various fields. Recently, Xu et al. [25] proposed a method that solves several fusion problems at once, including multi-modal, multi-exposure, and multi-focus fusion.
Deep learning was first introduced into the field of multi-exposure image fusion by DeepFuse [23]. It was designed as an encoder-decoder based image fusion architecture. DeepFuse uses the MEF-SSIM [33] metric as a loss function and trains the network through unsupervised learning. In DeepFuse [23], the CNN is applied only to the Y channel for feature extraction and reconstruction, and the fusion rules for the chrominance channels are still designed manually. However, this manual fusion of chrominance channels may fail to restore color information accurately. Recently, MEF-Net was proposed by Ma et al. [24]. It learns a high-resolution weight map from the source images using a context aggregation network based on dilated convolution filters and a guided filter. Although it exhibits very high performance, artifacts are often observed in regions whose brightness differs significantly among the source images. This degradation is substantially reduced in the proposed RNN-based MEF method, as demonstrated in the experimental results.

III. THE PROPOSED METHOD
This section describes the overall network architecture and fusion modules of the proposed method in detail.

A. NETWORK ARCHITECTURE
An RNN is characterized by recursive connections between its units. It is distinguished by its hidden state (memory), through which information from prior inputs influences the current input and output. This structure makes it possible to store the current state inside the neural network so that time-varying dynamic features can be modeled. In the proposed method, we design a multi-exposure fusion architecture that exploits these characteristics. We use the RNN structure as the backbone network because it suits the nature of multi-exposure images. Multi-exposure images have different brightness for the same scene, and accordingly, each image contains different information (e.g., brightness, detail, and color). We observed that the way brightness increases across multi-exposure images is similar to the way information changes over time in time series data. In addition, in order to generate a high-quality fused image, it is important to fuse multi-exposure images naturally. To this end, we propose an RNN-based step-by-step fusion structure. When multi-exposure source images with different brightness are fed as inputs, the fusion proceeds step by step by leveraging the features of the fused result from the previous step. The current output is used as the next input and is then fused with the next new input, consequently leading to natural fusion and gradual enhancement.
Fig. 2 illustrates the entire architecture of the proposed method. Four multi-exposure source images are used as the input of the RNN. The set of four source images is denoted by $I_n$ ($n = 0, 1, 2, 3$), and they are arranged so that the brightness of the images increases sequentially, as shown in Fig. 2.
The proposed RNN fusion network generates a fused image through three stages: initial feature extraction (dilated encoding block, DEB), the fusion module, and reconstruction. In the DEB, the global context feature and the local detail feature of the source image are extracted. The extracted features are fused through the global context fusion block (GCFB) and the local detail fusion block (LDFB), and the fused feature is then transferred to the next fusion module. Finally, the fused features generated in each fusion module are concatenated and reconstructed. Each stage is described in order below.
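The step-by-step fusion described above can be pictured as a fold over the exposure sequence. The following is a minimal sketch rather than the authors' implementation: `extract_features` and `fuse_step` are hypothetical stand-ins for the DEB and one fusion module (here an identity mapping and a plain average), each "image" is a flat list of made-up pixel values, and initializing the state with the features of the first image is an assumption.

```python
from typing import List, Sequence

Feature = List[float]

def extract_features(image: Sequence[float]) -> Feature:
    """Hypothetical stand-in for the DEB: the feature is the image itself."""
    return [float(p) for p in image]

def fuse_step(state: Feature, new_feature: Feature) -> Feature:
    """Hypothetical stand-in for one fusion module: average state and new input."""
    return [(s + f) / 2.0 for s, f in zip(state, new_feature)]

def gradual_fusion(images: Sequence[Sequence[float]]) -> List[Feature]:
    """Fuse images one at a time and return the fused feature after every step."""
    features = [extract_features(img) for img in images]
    state = features[0]                  # assumption: start from the first image
    outputs = []
    for feat in features[1:]:
        state = fuse_step(state, feat)   # previous output feeds the next step
        outputs.append(state)
    return outputs

# Four toy "exposures" ordered from dark to bright (values are made up).
exposures = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.6], [0.8, 1.0]]
steps = gradual_fusion(exposures)
print(len(steps))   # 3 fusion steps for 4 inputs
```

The list returned by `gradual_fusion` plays the role of the per-module fused features that are concatenated before reconstruction.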

B. DILATED ENCODING BLOCK
In order to generate a high-quality multi-exposure fusion image, it is important to acquire and restore detail information from each source image. In addition, the context information of the source images, such as color and style, must be fused naturally. Therefore, in this paper, the dilated encoding block (DEB) is adopted to extract the local detail feature and the global context feature of the source image. As shown in Fig. 3, the DEB uses the context aggregation network (CAN) proposed by Yu and Koltun [37]. The CAN structure exploits dilated convolution filters to gradually expand the receptive field of the convolution. In this paper, we construct a convolution filter for each step, as shown in Table 1. We employ adaptive normalization and the leaky rectified linear unit (LReLU) right after each convolution. The local detail feature ($L_n$) of the source image is generated by concatenating the features of steps 1 and 2, which are extracted with small receptive fields. The feature extracted with the relatively wide receptive field in the last step is defined as the global context feature ($G_n$).
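To illustrate how dilation expands the receptive field, the sketch below computes the receptive field after each layer of a stack of stride-1 dilated convolutions. The 3x3 kernel size and the doubling dilation rates are assumptions chosen for illustration only; the actual per-step configuration is the one given in Table 1.

```python
def receptive_fields(kernel_size, dilations):
    """Receptive field (in pixels per side) after each stride-1 dilated conv layer."""
    rf = 1
    fields = []
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer adds (k - 1) * dilation pixels
        fields.append(rf)
    return fields

# Assumed configuration: 3x3 kernels with doubling dilation rates.
print(receptive_fields(3, [1, 2, 4, 8]))   # [3, 7, 15, 31]
```

Early layers see only a few pixels, which fits their use for local detail, while the last layer covers a much wider area, which fits its use for global context.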

C. FUSION MODULE
The proposed fusion module consists of the local detail fusion block (LDFB) and the global context fusion block (GCFB). The LDFB and the GCFB fuse the local detail features ($L_n$) and the global context features ($G_n$) extracted by the DEB, respectively. The fused global context and local detail features are then sequentially transferred to the next RNN fusion module.
The LDFB is structured as shown in Fig. 4. In conventional methods, it is difficult to reconstruct the details of the fused image in heavily saturated and dark areas, and the restoration of texture details leaves room for improvement. To improve the detail reconstruction performance of the fused image, we generate a weight map to reinforce the local details of the source images, and the local detail features are fused by a weighted sum. As shown in Fig. 4, each weight map is generated using the local detail feature ($L_{n+1}$) extracted from the new source image at each stage of the RNN and the $ML_n$ transferred from the fusion module at the previous stage.
[Figure caption fragment: the results of the compared methods (Mertens09 [1], GFF [19], GGIF [20], SPD-MEF [21], MEF-Opt [22], DeepFuse [23], MEF-Net [24], and U2Fusion [25]) and the proposed DGMEF-RNN are shown on the upper right. The bottom rows are the highlighted regions corresponding to the marked boxes in the ground truth.]
The GCFB for global context fusion is designed as shown in Fig. 5. Like the LDFB, the GCFB takes the global context feature ($G_{n+1}$) extracted from the newly arrived source image at each stage of the RNN and the $MG_n$ transferred from the previous stage of the fusion module. The two are concatenated and then passed through four convolution filters to generate a fused global context feature.
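The weighted-sum fusion used in the LDFB can be sketched as follows. This is an illustration under stated assumptions, not the block of Fig. 4: in the paper the weight maps are learned by the network, whereas here the per-position scores are given by hand, and normalizing them with a softmax so that the two weights sum to one is an assumption. The names `softmax_pair` and `weighted_fuse` are hypothetical.

```python
import math

def softmax_pair(a, b):
    """Turn two raw scores into two positive weights that sum to 1."""
    m = max(a, b)                       # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def weighted_fuse(prev_feat, new_feat, prev_score, new_score):
    """Per-position weighted sum of the previous fused feature and the new feature."""
    fused = []
    for p, n, sp, sn in zip(prev_feat, new_feat, prev_score, new_score):
        wp, wn = softmax_pair(sp, sn)
        fused.append(wp * p + wn * n)
    return fused

prev_feat = [0.2, 0.9]   # stands in for the feature carried over from the previous stage
new_feat = [0.8, 0.1]    # stands in for the feature of the new source image
fused = weighted_fuse(prev_feat, new_feat, [0.0, 5.0], [0.0, -5.0])
print(fused)
```

At the first position the scores are equal, so the result is the plain average; at the second position the previous feature receives almost all of the weight, so the fused value stays close to 0.9.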

D. RECONSTRUCTION
The reconstruction network for restoring the fused image consists of five ResBlocks and one convolution filter. It is fed with the concatenation of the output features ($F_n$) of the RNN fusion modules, as shown in Fig. 2. Note that $F_n$ is the fusion of $MG_n$ and $ML_n$.

E. LOSS FUNCTION
The loss function $L$ used in the proposed method is a combination of the l1 loss $L_{L1}$, the structural similarity index (SSIM) [35] loss $L_{ssim}$, and the MEF-SSIM [34] loss $L_{mef}$, with a weight $\lambda$. $L_{L1}$, $L_{ssim}$, and $L_{mef}$ are losses between the fused image and the ground truth. This loss is used to reinforce the local detail information of the fused image and to fuse the global context information naturally.
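As a rough illustration of how such a combined loss can be assembled, consider the sketch below. The exact form of the combination is not recoverable from the text, so the form $L = L_{L1} + \lambda (L_{ssim} + L_{mef})$ is an assumption, as is defining each similarity loss as one minus the corresponding score. The SSIM and MEF-SSIM scores are passed in as precomputed numbers, and all function names are hypothetical.

```python
def l1_loss(pred, target):
    """Mean absolute error between two equally sized lists of pixel values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def total_loss(pred, target, ssim_score, mef_ssim_score, lam=1.0):
    """Combined loss sketch.

    ASSUMPTION: L = L_L1 + lam * (L_ssim + L_mef), with L_x = 1 - score.
    The placement of the weight is illustrative, not taken from the paper.
    """
    loss_ssim = 1.0 - ssim_score
    loss_mef = 1.0 - mef_ssim_score
    return l1_loss(pred, target) + lam * (loss_ssim + loss_mef)

# Toy values: two pixels, made-up similarity scores.
value = total_loss([0.5, 0.5], [0.5, 1.0], ssim_score=0.9, mef_ssim_score=0.8, lam=0.5)
print(value)
```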

IV. EXPERIMENTAL RESULTS
In this section, we first compare the proposed DGMEF-RNN with conventional and recent MEF methods both qualitatively and quantitatively. We then conduct a series of ablation experiments to demonstrate the usefulness of DGMEF-RNN.

1) DATASET
To validate the performance of the proposed DGMEF-RNN, we perform qualitative and quantitative experiments on the publicly available datasets provided by [53] and [54], which contain multi-exposure sequences of indoor and outdoor, human-life, and day and night scenes, together with the corresponding high-quality reference images (ground truth). We use sequences of multi-exposure images that were captured under different exposure settings and have been accurately aligned. For each scene, four multi-exposure images with different brightness were selected, yielding a dataset of 270 scenes in total. We train DGMEF-RNN on 227 scenes and use the remaining 43 scenes for testing. During training, the resolution of the training images was reduced to 1/5∼1/7 while maintaining the aspect ratio in order to reduce the GPU memory cost. After this reduction, the image resolution remains at least about 500 to 700 pixels. The multi-exposure images of the 227 scenes and the corresponding ground truth images are cropped into more than 10,000 patches for the training data. All patches are of size 160 × 160.
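The patch extraction step can be sketched as a sliding window over each downscaled image. The text does not state the stride or whether patches overlap, so the non-overlapping stride of 160 below is an assumption, and the 500 × 700 image size is only an example within the range mentioned above. `patch_origins` is a hypothetical helper.

```python
def patch_origins(height, width, patch=160, stride=160):
    """Top-left corners (row, col) of all full patches that fit inside the image."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]

# Example: a 500 x 700 image yields a 3 x 4 grid of 160 x 160 patches.
origins = patch_origins(500, 700)
print(len(origins))   # 12
```

The same origins would be applied to all four exposures and to the ground truth of a scene so that the patches stay aligned.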

2) IMPLEMENTATION
To train our network, we used sequences of four multi-exposure source images and the corresponding ground truth images with a batch size of 8. The network is implemented using the PyTorch framework on a PC with two NVIDIA RTX 2080 Ti GPUs. For loss optimization, we adopted the Adam optimizer with a learning rate of $10^{-4}$, which is divided by 10 every 500 iterations. Finally, DGMEF-RNN is evaluated at full resolution during testing.
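The learning rate schedule described above is a step decay. The following sketch expresses it as a plain function of the iteration index; the function name and parameter names are hypothetical. In PyTorch, a comparable schedule is typically obtained with `torch.optim.lr_scheduler.StepLR` using `step_size=500` and `gamma=0.1`, stepped once per iteration, although the text does not state which mechanism was used.

```python
def learning_rate(iteration, base_lr=1e-4, step=500, factor=0.1):
    """Step decay: multiply the base rate by `factor` once per `step` iterations."""
    return base_lr * factor ** (iteration // step)

for it in (0, 499, 500, 1000):
    print(it, learning_rate(it))
```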

B. QUALITATIVE COMPARISON
The proposed DGMEF-RNN is compared with eight state-of-the-art methods: Mertens09 [1], GFF [19], GGIF [20], SPD-MEF [21], MEF-Opt [22], DeepFuse [23], MEF-Net [24], and U2Fusion [25]. Mertens09 [1] is one of the representative methods for MEF. GFF [19] is a guided filter-based fusion method, and GGIF [20] extends it to the gradient domain. SPD-MEF [21], MEF-Opt [22], and MEF-Net [24] are inspired by the MEF-SSIM proposed by Ma et al. DeepFuse [23] is the first work to propose deep learning-based MEF and has been used as a reference in many papers. U2Fusion [25] solves three fusion tasks at once: multi-modal, multi-exposure, and multi-focus. For all the comparison methods, we used the code and settings provided by the original authors. However, DeepFuse [23] accepts only a pair of under- and over-exposed images, so for a fair comparison its input is expanded to accept four source images. In addition, MEF-Net [24] is also trained using four source images. For U2Fusion [25], the experiment was conducted using the author-provided code and model. For a fair comparison, four multi-exposure images were used, and as described in the U2Fusion [25] paper, the multi-exposure images were fused two at a time, step by step.
Subjective comparisons were made with 43 image scenes, some of which are shown in Figs. 6-9. We analyzed the visual quality factors of the fused image such as brightness, color and detail restoration. Through these analyses, we can confirm the usefulness of LDFB and GCFB modules in the proposed network, and the merit of the RNN structure.
In terms of subjective image quality, the fused images in Figs. 6-9 confirm that the proposed method achieves high performance by restoring sufficient brightness and color. For DeepFuse [23], the fused image is generally grayish and suffers from insufficient color saturation. For U2Fusion [25], the color saturation of the fused image is also insufficient, and as shown in Fig. 6, the restoration of dark areas is degraded. For SPD-MEF [21], color artifacts occur in several experimental images. In contrast, the proposed method exhibits less color distortion. Furthermore, as shown in Fig. 8, the proposed method naturally restores the original luminance without the local dark region artifacts observed for GFF [19], MEF-Opt [22], and MEF-Net [24], which leads to improved texture detail. For GGIF [20], the detail reconstruction ability is good and there are few artifacts, but the global contrast decreases, as shown in Fig. 7.
As shown in Figs. 8 and 9, GFF [19], MEF-Opt [22], and MEF-Net [24] suffer from local dark region artifacts. Consequently, the fused image takes on an artificial and uncomfortable appearance. This is caused by inaccurate weight determination for each input image when the brightness of a local region differs significantly among the source multi-exposure images. For example, Fig. 10 (a) and (b) show the weight maps learned by MEF-Net [24] and the proposed method, respectively, for the experimental image in Fig. 8. MEF-Net [24] does determine the weight of each source image according to its important information, but for some local regions the weight is placed heavily on one specific source image. As a result, the images are not fused naturally in the boundary region. For this reason, local dark region or halo artifacts occur when multi-exposure images with a large brightness difference are fused. In the proposed method, this phenomenon is largely suppressed by the global context fusion block (GCFB) and the RNN-based fusion architecture.
Figs. 6-9 show the local detail restoration performance of the proposed method. It can be seen that the texture detail is preserved in the glass of Fig. 6, the walls and leaves of Fig. 7 and the windows of Fig. 9. The detail restoration performance of SPD-MEF [21] is also good, but it suffers from color distortion and halo artifacts. In addition, the proposed method has excellent restoration performances in very bright and dark regions such as the windows of Fig. 9 and the vases of Fig. 6.

C. QUANTITATIVE COMPARISON
We conducted a quantitative evaluation using the peak signal-to-noise ratio (PSNR), SSIM, and MEF-SSIM metrics, and the results are shown in Table 2. Red indicates the best performing MEF method, and blue indicates the second best. The proposed method shows the highest SSIM and PSNR scores. Its MEF-SSIM score is slightly inferior to those of GGIF [20] and MEF-Net [24], but visual artifacts occur in both of those methods, as described above. These experimental results confirm that the proposed method generally achieves superior performance compared to the conventional methods.
Table 3 shows the running time comparison. We compared the running times of DeepFuse [23], MEF-Net [24], U2Fusion [25], and the proposed method. The experiments were conducted on a PC with an i7-8700 CPU and an NVIDIA TITAN RTX GPU. Running time was measured using 42 test images of various resolutions. Table 3 reports the average running time to process 1 million pixels. On average, DeepFuse [23] is the fastest. However, the DeepFuse [23] network structure is relatively simple compared to the other methods, and thus the fused image quality is deteriorated. The proposed method takes relatively less running time than the existing methods. Considering both the fused image quality and the running time, we believe that the proposed method has merits over the existing methods.
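For reference, PSNR, one of the metrics reported in Table 2, can be computed as in the minimal sketch below. It assumes pixel values scaled to the range [0, 1] (so the peak value is 1.0) and images given as flat lists; the evaluation code actually used for Table 2 is not specified in the text.

```python
import math

def psnr(fused, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB between a fused image and its ground truth."""
    mse = sum((f - r) ** 2 for f, r in zip(fused, reference)) / len(fused)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy example: every pixel is off by 0.1, so the MSE is 0.01 and the PSNR is about 20 dB.
print(psnr([0.5, 0.5], [0.6, 0.4]))
```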

D. ABLATION STUDY
A number of ablation studies are conducted to verify the importance of each component of the proposed network. The fusion module of the proposed method is composed of the GCFB and the LDFB, and the LDFB is a module that strengthens local details in the fused image. To verify the effect of the LDFB, an experiment is performed without it. The results of this experiment are shown in Fig. 11 (a). Compared with the proposed DGMEF-RNN in Fig. 11 (b), the fused image is blurred on the whole and the reconstruction of texture details is degraded. This experiment confirms that the LDFB is effective in enhancing local details. The quantitative comparison is shown in Table 4. In addition, a comparison of Fig. 11 (a) and (b) shows that local dark region artifacts still do not occur even when the LDFB is excluded. That is, the aforementioned gradual fusion using the GCFB and the proposed RNN-based network architecture is effective for natural image fusion.
Next, experiments were conducted by changing the number of source images and the order of the source images at the RNN input (note that the input images are originally fed to the network in order of increasing exposure level). Fig. 12 illustrates the results of this ablation study. The proposed method by default accepts four source images. We evaluate the performance of the proposed network when the number of source images is two and three. As the number of source images decreases, the number of fusion modules configured in the RNN is reduced accordingly. Multi-exposure images carry different information depending on the exposure level. As shown in Fig. 12 (a), the source images $I_0$ and $I_1$ (which are excessively dark) have little information, so their contribution to the image fusion is marginal. Even though the source image information is insufficient, the overall structure of the fused image is correctly restored by the proposed method, but the color and detail restoration leaves room for improvement. Fig. 12 (b) and (c) show the fusion results obtained by adding the source images one by one. As confirmed in Fig. 12 (a), (b), and (c), the color and detail restoration performances gradually improve as source images are added. This confirms that the proposed network progressively enhances the fusion according to the number of input images. Fig. 12 (d) and (e) show the experimental results obtained by changing the order of the input source images. In (d), the inputs are fed in the reverse order, from (4) to (1). In (e), the order is random ($I_1$-$I_3$-$I_2$-$I_0$). From the results in (c), (d), and (e), it can be seen that there is no significant difference in the color and detail restoration performances of the fused image. This suggests that the DEB, which is used for feature extraction in the proposed method, accurately extracts the important information of each source image and transfers it to the deep levels of the RNN-based network.
These experiments show that the proposed RNN-based network can be easily extended with additional inputs and that the fusion improves accordingly.

V. CONCLUSION
We propose a novel multi-exposure image fusion method based on an RNN, called DGMEF-RNN. The goal of DGMEF-RNN is to transfer source image information throughout the entire network to maintain long-range dependency, and to generate a fused image with fewer artifacts. Moreover, we aimed to achieve good detail restoration in the fused image. To this end, we designed the GCFB and LDFB modules and demonstrated their superior performance through experimental results in comparison with conventional MEF methods. Our proposed network is trained with four multi-exposure images. Through comparisons with eight state-of-the-art MEF methods, it was confirmed that the proposed method outperforms them from both qualitative and quantitative perspectives.