A Novel Patch-Based Multi-Exposure Image Fusion Using Super-Pixel Segmentation

A novel multi-exposure image fusion method is proposed to address color distortion and detail loss through adaptive image patch segmentation. First, we use a super-pixel segmentation approach to divide the input images into non-overlapping image patches composed of pixels with similar visual properties. Then, the image patches are decomposed into three independent components: signal strength, image structure and intensity. The three components are fused separately based on the characteristics of the human visual system and the exposure level of the input images. In addition, guided filtering is used to remove the blocking artifacts caused by patch-wise processing. In contrast to existing methods that use fixed-size patches, the proposed method avoids blocking effects and preserves the color attributes of the input images. The experimental results show that the proposed method has advantages in both subjective and objective evaluation over state-of-the-art multi-exposure fusion methods.


I. INTRODUCTION
The dynamic range of a natural scene is much larger than that of images captured by ordinary consumer cameras [1]. This gap makes it difficult to retain all the content of a natural scene in a single image: many details are lost in both the under-exposed and over-exposed regions. There are two solutions to this problem: high dynamic range (HDR) imaging [2] and multi-exposure image fusion (MEF) [3]. HDR imaging usually consists of two main steps: HDR reconstruction and tone mapping [4]. First, multiple low dynamic range (LDR) images of the same scene are taken at different exposure levels, and the HDR image is reconstructed by inverting the camera response function (CRF). Then, in order to be displayed on ordinary equipment, the HDR image must be converted to an LDR image by tone mapping. HDR imaging can recover the whole dynamic range of the scene and make all the details visible in a single image. However, estimating the CRF is itself a difficult problem [5]. MEF provides a more efficient alternative that directly generates high-quality LDR images without an intermediate HDR image [6]. MEF takes a sequence of images with different exposure levels as input and synthesizes a fused image that is more informative and perceptually appealing than any of the input images [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.
Most existing MEF methods generate a fused image as a weighted average of input images with different exposure levels, so one of the key issues is to design a proper weighting scheme. An intuitive idea is to assign greater weights to clearer pixels. Different measures of pixel quality, based on different assumptions, have been proposed to calculate appropriate weights. However, weights obtained from a single pixel are susceptible to noise and easily produce visual artifacts in the fused image. In recent years, patch-based MEF methods have attracted more attention [8]. These methods divide multi-exposure images into fixed-size rectangular patches and perform image fusion patch by patch. In contrast to most pixel-wise MEF methods, patch-based MEF methods can improve the color fidelity of the fused image [9]. For patch-based MEF methods, the division of image patches is an important issue. MEF methods using fixed-size patches tend to cause problems such as color distortion and detail loss in the fused image. However, few researchers have studied the impact of patch division on the quality of the fused image.
To address this issue, we propose a novel patch-based multi-exposure image fusion method. Unlike existing methods that use fixed-size image patches, we utilize a super-pixel segmentation approach to divide the input images into non-overlapping image patches composed of pixels with similar visual properties. First, the best-exposed image is selected from the input multi-exposure images as the reference image, and a super-pixel segmentation approach [10], [11] is applied to the reference image to obtain the division of image patches. All of the input images use this division to obtain non-overlapping image patches. Then, the image patches are decomposed into three independent components: signal strength, image structure and intensity. Different fusion rules are designed according to the characteristics of these three components. In addition, in order to ensure the spatial consistency of the fused image, guided filtering [12] is used to refine the weight maps, the signal strength component and the intensity component. The three merged components are constructed using the refined weight maps. Finally, the fused image is reconstructed from the merged components.
To the best of our knowledge, the paper by Li et al. [8] is the work most similar to our method. Their method selects the optimal patch size based on the texture entropy of the input images. However, the shape and size of the image patches are still fixed within a fusion process. In contrast, the patch division approach used in the proposed method can change the shape of the image patches to fit the image content. The proposed fusion method brings two main advantages: 1) The patch division takes into account the characteristics of the image content. As a result, the proposed method avoids the blocking artifacts that are often generated by patch-based MEF methods. 2) The weight map is calculated patch by patch based on the characteristics of the human visual system (HVS) and the exposure level of the input images. Therefore, the color attributes of the input images are preserved as much as possible. We evaluate the proposed method by comparing it with 7 MEF methods on 24 sets of multi-exposure image sequences. The experimental results show that the proposed method produces better fused images.
The rest of the paper is organized as follows. In Section II, the existing MEF methods are briefly reviewed. Section III describes the proposed method in detail. The experimental results are discussed and analyzed in Section IV. Finally, a conclusion is given in Section V.

II. RELATED WORK
MEF methods can be classified into two categories: transform-domain and spatial-domain fusion [8]. Transform-domain methods proceed as follows. First, the input images are transformed into the transform domain. Then, the fused coefficients are obtained by applying fusion rules to the coefficients of the input images. Finally, the fused image is reconstructed by the inverse transform. The transforms often used in MEF include the pyramid transform [13], the wavelet transform [14] and the nonsubsampled contourlet transform [15]. However, methods of this kind may produce serious color distortion in the fused image [16].
Pixel-based methods evaluate the quality of each pixel in the input images and incorporate the best-quality pixels at each location into the fused image. Song et al. [17] proposed an image fusion method that integrates locally adaptive scene detail capture. This method first calculates the brightness, contrast and gradient of the input images, and then synthesizes the fused image using a probabilistic model that suppresses reversals in the image luminance gradients. Gu et al. [18] proposed a gradient-field-based image fusion method, which takes into account that the human visual system is sensitive to contrasts between pixel intensities rather than to their absolute values. The gradient values of the fused image are obtained by maximizing the structure tensor, and the fused image is derived from the fused gradient field. Li and Kang [19] proposed a weighted-sum-based multi-exposure image fusion method. This method first constructs the weight maps using three image features: contrast, brightness and color dissimilarity. The weight maps are then refined by recursive filtering, and the fused image is finally obtained as a weighted sum of the input images. Pixel-based image fusion is simple and easy to implement. However, these methods consider only single pixels and ignore the relationship between adjacent pixels, which tends to produce visual artifacts in the fused images [20].
To solve this problem, patch-based methods design fusion rules by considering image patches instead of single pixels. Goshtasby [21] proposed a fusion method that produces a fused image with maximum information content. This method partitions the input images into fixed-size image patches and uses information entropy to evaluate the amount of information in the image patches. The selected images are then blended based on a weight map. In the fusion method proposed by Zhang et al. [9], a contrast criterion is introduced to measure the quality of exposure and generate weight maps. The fused image is obtained by merging the input images with a weighted average scheme. Ma et al. [22] proposed a structural patch decomposition (SPD) approach for MEF. This method decomposes an image patch into three conceptually independent components and then fuses these three components separately. The fused image is reconstructed from the three fused components. Huang [23] improved the SPD method by designing a new fusion scheme based on the quality of image patches. In the above patch-based fusion methods, the size and shape of the image patches are fixed to facilitate subsequent processing. However, this results in image patches that contain pixels with different color and brightness characteristics. If the same fusion scheme is used to fuse pixels with different characteristics, the color or detail information of the fused image tends to be lost.

III. MULTI-EXPOSURE FUSION METHOD
In this section, the MEF method based on adaptive patch segmentation is presented in detail. Fig. 1 shows the framework of the proposed method. All input images are assumed to be already registered. First, the selected reference image is partitioned into non-overlapping image patches by super-pixel segmentation, and the segmentation result is applied to all other input images. Then, each image patch is decomposed into three independent components by structural patch decomposition: signal strength, image structure and intensity. According to the different characteristics of each component, we design the corresponding fusion rules and use these rules to obtain the fused components. Finally, the fused image is reconstructed from the three fused components.

A. ADAPTIVE IMAGE PATCH SEGMENTATION
Super-pixel segmentation approaches can divide an image into irregular image patches composed of pixels with similar visual properties. To divide the input images, we first select a reference image from the input multi-exposure images. In this paper, the image with the smallest number of under-exposed and over-exposed pixels among all the input images is used as the reference image $X_{\mathrm{ref}}$. Then, the reference image $X_{\mathrm{ref}}$ is divided into $N$ non-overlapping image patches using the super-pixel segmentation approach SLIC (Simple Linear Iterative Clustering) [24], which is described as follows:

$$\{P_n\}_{n=1}^{N} = \mathrm{Slic}(X_{\mathrm{ref}})$$

where $\mathrm{Slic}(\cdot)$ denotes the super-pixel segmentation operation and $P_n$ represents the spatial location of the $n$-th image patch.
In order to ensure that the segmentation of all input images is consistent, we directly apply the segmentation result of the reference image $X_{\mathrm{ref}}$ to the other input images, as shown in Fig. 2.
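As an illustration only, the reference selection and the reuse of a single label map across all exposures can be sketched in Python as follows. The exposure thresholds (0.05 and 0.95), the function names and the hand-written label map that stands in for the SLIC output are assumptions of this sketch and are not taken from the paper.

```python
def count_badly_exposed(image, low=0.05, high=0.95):
    """Count pixels darker than `low` or brighter than `high`.

    The two thresholds are assumed values for this sketch.
    """
    return sum(1 for row in image for v in row if v < low or v > high)


def select_reference(images):
    """Return the index of the image with the fewest under/over-exposed pixels."""
    return min(range(len(images)), key=lambda k: count_badly_exposed(images[k]))


def gather_patches(image, labels):
    """Group pixel values by super-pixel label.

    The same label map is reused for every exposure, so patch n covers
    the same pixel locations in all input images.
    """
    patches = {}
    for img_row, lab_row in zip(image, labels):
        for value, n in zip(img_row, lab_row):
            patches.setdefault(n, []).append(value)
    return patches


# Toy 2x4 grayscale exposures with values in [0, 1].
dark = [[0.01, 0.02, 0.30, 0.40], [0.02, 0.03, 0.35, 0.45]]
mid = [[0.20, 0.25, 0.70, 0.80], [0.22, 0.27, 0.75, 0.85]]
bright = [[0.60, 0.65, 0.97, 0.99], [0.62, 0.66, 0.98, 1.00]]
images = [dark, mid, bright]

# Hand-written label map standing in for the SLIC result on the reference image.
labels = [[0, 0, 1, 1], [0, 0, 1, 1]]

ref = select_reference(images)
patches_per_image = [gather_patches(img, labels) for img in images]
```

In practice the label map would come from a SLIC implementation run on the reference image; only the sharing of that map among the exposures is shown here.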

B. STRUCTURAL PATCH DECOMPOSITION
For all the image patches $\{X_k(P_n)\}_{k=1,\dots,K;\,n=1,\dots,N}$ obtained by segmentation, we decompose them into three independent components: signal strength, image structure and intensity, where $X_k(P_n)$ denotes the $n$-th image patch of the $k$-th input image. For convenience, $P_n$ is omitted in the rest of this paper because the image patches at different spatial locations are treated independently, and the image patch of the $k$-th input image at a given location is denoted simply by $x_k$. The patch decomposition is then defined as follows [22]:

$$x_k = \|x_k - \mu_{x_k}\| \cdot \frac{x_k - \mu_{x_k}}{\|x_k - \mu_{x_k}\|} + \mu_{x_k} = c_k \cdot s_k + l_k$$

where $\|\cdot\|$ denotes the $\ell_2$ norm of a vector and $\mu_{x_k}$ is the mean value of the image patch $x_k$. $c_k$, $s_k$ and $l_k$ represent the signal strength, image structure and intensity component, respectively.
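A minimal Python sketch of this decomposition for a single patch, flattened to a list of pixel values, is given below. The function names and the handling of a perfectly flat patch (zero structure vector) are assumptions of the sketch.

```python
import math


def decompose(patch):
    """Split a flattened patch into signal strength, structure and mean intensity."""
    mean = sum(patch) / len(patch)
    centered = [v - mean for v in patch]
    strength = math.sqrt(sum(d * d for d in centered))  # l2 norm of the mean-removed patch
    if strength == 0.0:
        # Flat patch: the direction is undefined; a zero vector is used here (assumption).
        structure = [0.0] * len(patch)
    else:
        structure = [d / strength for d in centered]  # unit-length vector
    return strength, structure, mean


def reconstruct(strength, structure, mean):
    """Invert the decomposition: x = c * s + l."""
    return [strength * s + mean for s in structure]


patch = [0.2, 0.4, 0.6, 0.8]
c, s, l = decompose(patch)
restored = reconstruct(c, s, l)
```

Because an irregular super-pixel patch is just a set of pixels, flattening it to a list loses nothing that the decomposition needs.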

C. WEIGHT MAP CONSTRUCTION
1) SIGNAL STRENGTH
Signal strength is defined as $c_k = \|x_k - \mu_{x_k}\|$, the $\ell_2$ norm of the mean-removed image patch. $c_k$ can be regarded as a measure of the amount of information that $x_k$ contains: the larger $c_k$ is, the more information the image patch $x_k$ contains. In order to preserve as much information from the input images as possible in the fused image, a "winner-take-all" fusion rule is adopted for the signal strength component. The weight map of the signal strength component is defined as follows:

$$w_{c,k} = \begin{cases} 1, & c_k = \max\limits_{1 \le j \le K} c_j \\ 0, & \text{otherwise} \end{cases}$$

where $w_{c,k}$ denotes the weight that determines the contribution of the signal strength component of the $k$-th input image to that of the fused image patch.
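The winner-take-all rule can be sketched in Python for one patch location as follows. Breaking ties in favor of the first maximum is a choice made for this sketch, and the strength values are invented for illustration.

```python
def winner_take_all(strengths):
    """Return weight 1 for the exposure with the largest signal strength, 0 otherwise.

    Ties are resolved in favor of the first maximum (an assumption of this sketch).
    """
    best = max(range(len(strengths)), key=lambda k: strengths[k])
    return [1.0 if k == best else 0.0 for k in range(len(strengths))]


# Signal strengths of the same patch in K = 3 exposures (illustrative values).
strengths = [0.12, 0.45, 0.30]
weights = winner_take_all(strengths)

# With binary weights, the fused strength equals the largest input strength.
fused_strength = sum(w * c for w, c in zip(weights, strengths))
```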

2) IMAGE STRUCTURE
The image structure component contains the texture and structural details of image patches. In order to preserve as much detail from the input images as possible in the fused image, we need to evaluate the richness of the perceptible detail contained in the image patches of the different input images and assign a larger weight to the image patch containing more detail. The just noticeable difference (JND) refers to the minimum amount of change that can be perceived by the HVS [25], [26]. We adopt a JND model to measure signal strength from the perspective of the HVS. The perceptible strength of image details can be regarded as a saliency weight $J_k(i,j)$, which is defined in [23], [27] in terms of the luminance weight $J^l_k(i,j)$ and the texture weight $J^t_k(i,j)$ at position $(i,j)$, together with a factor $K^{l,t}(i,j)$ that describes the overlap effect of the two weights and takes values from 0 to 1. The relation between the luminance weight $J^l_k(i,j)$ and the background luminance is modeled in two parts: $J^l_k(i,j)$ is a root function of the average background luminance for low background luminance and a linear function otherwise [23], [28]. In this model, $B(m,n)$ is a low-pass filter and $x_k(i,j)$ denotes the background luminance. The texture weight $J^t_k(i,j)$ is usually calculated from local spatial gradients. In this paper, we calculate the gradients in four directions and choose the strongest gradient as the texture weight $J^t_k(i,j)$ [23], [28]. Each directional gradient is obtained by convolving (operator $\otimes$) the image with a high-pass filter $g_h(i,j)$ in the $h$-th direction ($h = 1, 2, 3, 4$), as defined in [23], [28].
On the other hand, well-exposed pixels should be given a larger weight because they contain more meaningful information for the HVS than over-exposed and under-exposed pixels. In this paper, we use a Gaussian function of the gray value $x^{\mathrm{gray}}_k(i,j)$ of the pixel $x_k(i,j)$ to construct the exposure weight $E_k(i,j)$. The standard deviation $\sigma$ of the Gaussian, which describes the degree of spread of $x^{\mathrm{gray}}_k$, is an empirical value set to 0.2 in all experiments.
In summary, the weights of the structure component are constructed from the saliency weight $J_k(i,j)$ and the exposure weight $E_k(i,j)$.
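A possible form of the Gaussian exposure weight is sketched below in Python. The value $\sigma = 0.2$ comes from the text, while centering the Gaussian at the mid-gray value 0.5 and normalizing gray values to [0, 1] are assumptions of this sketch, as is the function name.

```python
import math


def exposure_weight(gray, sigma=0.2, target=0.5):
    """Gaussian weight that favors gray values close to `target`.

    sigma = 0.2 follows the text; target = 0.5 (mid-gray on a [0, 1] scale)
    is an assumed center for this sketch.
    """
    return math.exp(-((gray - target) ** 2) / (2.0 * sigma ** 2))


# Weights for an under-exposed, a well-exposed and an over-exposed pixel.
weights = [exposure_weight(g) for g in (0.0, 0.5, 1.0)]
```

Under these assumptions a mid-gray pixel receives the maximum weight of 1, and pixels at either end of the range receive the same, much smaller weight.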

3) INTENSITY
The weight of the intensity component is calculated based on the global mean intensity $\mu_k$ of the input image $X_k$ and the local mean intensity $l_k$ of the image patch $x_k$. When the average intensity of an image patch is close to the middle value of the intensity range, the patch is considered well exposed and is assigned a large weight. Otherwise, it is regarded as an under-exposed or over-exposed patch and is given a small weight. The weight map for the intensity component is defined as follows [6], [8], [22]:

$$w_{l,k} = \exp\left(-\frac{(\mu_k - 0.5)^2}{2\sigma_g^2} - \frac{(l_k - 0.5)^2}{2\sigma_l^2}\right)$$

where $\sigma_g$ and $\sigma_l$ are standard deviations that control the spreads along the $\mu_k$ and $l_k$ dimensions, respectively. This definition assumes that all pixels in an image have the same desired intensity, i.e., the middle value 0.5 of the intensity range. In fact, bright areas and dark areas have different desired intensities. Therefore, we adopt an adaptive expected intensity, where $p_g$ and $p_l$ are the global expected intensity and the local expected intensity, respectively. $X_D$ and $X_B$ are the mean values of the darkest and brightest input images, respectively, and $x_D$ and $x_B$ are the mean values of the darkest and brightest image patches containing pixel $(i,j)$, respectively. $\alpha$ and $\beta$ control the tradeoff between the prior desired intensity and the average intensity from the input images. In this paper, $\alpha$ and $\beta$ are set to 0.5. The modified weighting function replaces the fixed value 0.5 with these expected intensities:

$$w_{l,k} = \exp\left(-\frac{(\mu_k - p_g)^2}{2\sigma_g^2} - \frac{(l_k - p_l)^2}{2\sigma_l^2}\right)$$
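The effect of an adaptive expected intensity can be illustrated with the following Python sketch. The blending rule in `expected_intensity` (a convex combination of the prior 0.5 and the midpoint of the darkest and brightest means) is only one plausible reading of the description and is an assumption, as are the default spreads `sigma_g = 0.2` and `sigma_l = 0.5` and all function names.

```python
import math


def expected_intensity(darkest_mean, brightest_mean, tradeoff=0.5, prior=0.5):
    """Blend the prior desired intensity with the observed exposure midpoint.

    The exact blending rule is an assumption of this sketch; `tradeoff`
    plays the role of alpha (global) or beta (local).
    """
    midpoint = 0.5 * (darkest_mean + brightest_mean)
    return tradeoff * prior + (1.0 - tradeoff) * midpoint


def intensity_weight(global_mean, local_mean, p_g=0.5, p_l=0.5,
                     sigma_g=0.2, sigma_l=0.5):
    """Two-dimensional Gaussian weight around the expected intensities.

    With p_g = p_l = 0.5 this is the fixed-target weighting; sigma_g and
    sigma_l are assumed default values.
    """
    return math.exp(-((global_mean - p_g) ** 2) / (2.0 * sigma_g ** 2)
                    - ((local_mean - p_l) ** 2) / (2.0 * sigma_l ** 2))


# A bright region: its darkest and brightest patch means are 0.5 and 0.9.
p_l = expected_intensity(0.5, 0.9)
fixed = intensity_weight(0.5, 0.6)              # target fixed at 0.5
adaptive = intensity_weight(0.5, 0.6, p_l=p_l)  # target adapted to the region
```

In this toy case the adaptive target moves toward the brighter content, so a patch with local mean 0.6 is no longer penalized for being brighter than mid-gray.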

D. REFINEMENT
The weight maps of the $k$-th input image are constructed by combining the weight maps of all image patches of that image and are denoted $W_{c,k}$, $W_{s,k}$ and $W_{l,k}$, respectively. Similarly, the three component images of the input image are written as $C_k$, $S_k$ and $L_k$, respectively. In order to ensure spatial consistency, we refine the signal strength, the intensity component and the weight maps of the input images with a guided filter. Note that the structural components are not filtered, since they contain much of the detail of the image patches and filtering may cause a loss of fine detail. The guided filter is an edge-preserving filter that can smooth images without blurring edges. We use the grayscale versions of the input images as the guide images. The refined weight maps are obtained by applying the guided filter operation $GF(\cdot,\cdot,\cdot,\cdot)$ to each weight map. Similarly, the refined signal strength $C_k$ and intensity components $L_k$ are obtained using the guided filter.
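To make the refinement step concrete, the following Python sketch implements the standard guided filter on a one-dimensional signal. It is a simplified illustration rather than the implementation used in the experiments: the reduction to 1-D, the argument order (guide, input, radius, regularization) and the parameter values are assumptions.

```python
def box_mean(signal, radius):
    """Mean over a window of +/- radius samples, shrinking the window at the borders."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def guided_filter_1d(guide, src, radius, eps):
    """Filter `src` with a local linear model of `guide` (eps must be positive)."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(src, radius)
    corr_ii = box_mean([g * g for g in guide], radius)
    corr_ip = box_mean([g * p for g, p in zip(guide, src)], radius)
    a, b = [], []
    for mi, mp, cii, cip in zip(mean_i, mean_p, corr_ii, corr_ip):
        var_i = cii - mi * mi          # local variance of the guide
        cov_ip = cip - mi * mp         # local covariance of guide and input
        ak = cov_ip / (var_i + eps)    # slope of the local linear model
        a.append(ak)
        b.append(mp - ak * mi)         # offset of the local linear model
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * g + mb for ma, mb, g in zip(mean_a, mean_b, guide)]


# Grayscale guide with an edge, and a blocky weight map whose step is misaligned.
guide = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
weights = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
refined = guided_filter_1d(guide, weights, radius=2, eps=1e-3)
```

A larger `eps` gives stronger smoothing, while a small `eps` lets the output follow the edges of the guide more closely.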

E. FUSION
According to the refined weight maps, the refined signal strength, the refined intensity components and the image structure of the input images, the signal strength $\hat{C}$, image structure $\hat{S}$ and intensity component $\hat{L}$ of the fused image are calculated, respectively.
The fused image $\hat{X}$ is reconstructed from the three components by:

$$\hat{X} = \hat{C} \cdot \hat{S} + \hat{L} \qquad (20)$$
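The fusion and reconstruction of one patch location can be sketched in Python as follows. This sketch assumes one scalar weight per exposure and per component, normalizes the weights to sum to one, and renormalizes the merged structure vector to unit length; these simplifications and the function name are assumptions, since the method itself works with refined per-pixel weight maps.

```python
import math


def fuse_patch(strengths, structures, intensities, w_c, w_s, w_l):
    """Fuse K co-located patches; every argument is a list with K entries.

    Weights are assumed non-negative with a positive sum.
    """
    def normalized(w):
        total = sum(w)
        return [v / total for v in w]

    w_c, w_s, w_l = normalized(w_c), normalized(w_s), normalized(w_l)
    c_hat = sum(w * c for w, c in zip(w_c, strengths))
    l_hat = sum(w * l for w, l in zip(w_l, intensities))
    # Weighted sum of the unit-length structure vectors, renormalized to unit length.
    size = len(structures[0])
    mixed = [sum(w * s[i] for w, s in zip(w_s, structures)) for i in range(size)]
    norm = math.sqrt(sum(v * v for v in mixed))
    s_hat = [v / norm for v in mixed] if norm > 0.0 else mixed
    # Reconstruction: fused patch = strength * structure + intensity.
    return [c_hat * s + l_hat for s in s_hat]


a = 1.0 / math.sqrt(2.0)
fused = fuse_patch(
    strengths=[0.2, 0.4],
    structures=[[-a, a], [-a, a]],
    intensities=[0.3, 0.7],
    w_c=[0.0, 1.0],   # winner-take-all on signal strength
    w_s=[1.0, 1.0],
    w_l=[1.0, 1.0],
)
```

Here the fused patch keeps the larger signal strength (0.4), the shared structure direction, and the average of the two mean intensities (0.5).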

IV. EXPERIMENTS
We select 24 sets of multi-exposure source images to verify the performance of the proposed method. The test sets include various scenes such as day and night, indoor and outdoor settings, as listed in Table 1. All experiments are implemented in MATLAB R2015a on a computer with an Intel Core i5 3 GHz CPU, 4 GB of RAM and the Microsoft Windows 7 operating system.

A. EXPERIMENTAL RESULTS AND ANALYSIS
1) SUBJECTIVE ANALYSIS
In this section, we compare the proposed method with seven state-of-the-art MEF methods: BLP [28], DSIFT [29], FMMR [19], GF [30], EPS [31], Mertens09 [32] and SPD-MEF [22]. For an intuitive analysis of the experimental results, we select four of the 24 sets of fused images for demonstration. The results of the different methods on the "House" image sequence are shown in Fig. 3. The fused image obtained by the BLP method suffers from color distortion and sudden intensity changes. The fused image obtained by FMMR exhibits brightness inversion, i.e., the bookshelf is brighter than the region outside the window. The DSIFT and GF methods produce obvious color distortion, as shown in Fig. 3(c) and (e): the two chairs, which have the same color in the source images, exhibit distinctly different colors in the fused images. The fused image of the EPS method loses the color information and local details of the over-exposed area outside the window. In the fused images obtained by Mertens09 and SPD-MEF, the overall appearance is good, but the details of the scene outside the window are blurred. The fused image of the proposed method shows better detail preservation and brightness distribution than the other methods, and its color fidelity is also good.
Fig. 4 shows the fused images of "Tower" obtained by the eight methods. As can be seen from Fig. 4(b), the BLP method produces massive artifacts, especially in the cloud region. In the fused images produced by the DSIFT and GF methods, the right side of the tower is obviously brighter than the left, which is inconsistent with the input images. The FMMR, EPS and Mertens09 methods not only fail to preserve good contrast in the sky, but also cause color distortion in the lawn. EPS and Mertens09 also lose local detail of the tower. The SPD-MEF method performs excellently in color. Compared with the other seven fused results, the proposed method increases the overall contrast while preserving texture details, and the overall appearance of its fused image is appealing.
The fused images of "Chinese Garden" obtained by the eight methods are illustrated in Fig. 5. The BLP method produces an image with unnatural colors and artifacts. In the fused images obtained by FMMR, GF and Mertens09, the sky is dark. In the fused result of EPS, the color is pale. Although the SPD-MEF method performs well in color saturation and global contrast, the local detail of its fused image is blurred. The fused image of the proposed method not only contains rich details, but also achieves excellent global contrast and color saturation compared with the other MEF methods. In addition, it has a more natural appearance with respect to the human visual system.
Fig. 6 demonstrates the performance of the different methods on the "Window" image sequence. As shown in Fig. 6(b), there are obvious black shadows around the lamp. In the fused images obtained by the DSIFT and GF methods, the color of the bed is distorted. In the fused images of the GF and Mertens09 methods, the wall is too bright. The local details of the magnified area in Fig. 6(f) are blurred. In the fused image obtained by the SPD-MEF method, there are obvious black artifacts around the lamp. By contrast, the proposed method better preserves the details and color information and achieves the best overall visual effect.

2) OBJECTIVE EVALUATION
In order to quantitatively evaluate the performance of the proposed method, three objective criteria are used. The first criterion is mutual information (MI), defined as the sum of the mutual information between each source image and the fused image [37], [38]. The second criterion is the correlation coefficient (CC), which measures the degree of linear correlation between the fused image and the source images [39]. The third criterion is the standard deviation (SD) [39], which measures the contrast of the fused image. For all three criteria, a larger value indicates better image quality.
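The three criteria can be sketched in Python on flattened grayscale images as follows. The histogram-based MI estimate, the number of bins, the assumption that pixel values lie in [0, 1] and the function names are choices of this sketch. CC is computed here for a single pair of images, since how it is aggregated over several source images is not specified above.

```python
import math
from collections import Counter


def standard_deviation(pixels):
    """Population standard deviation of the pixel values (a contrast measure)."""
    mean = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mean) ** 2 for p in pixels) / len(pixels))


def correlation_coefficient(a, b):
    """Pearson correlation between two images of equal size (non-constant inputs)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def mutual_information(a, b, bins=8):
    """MI in bits estimated from a joint histogram; values assumed in [0, 1]."""
    def quantize(v):
        return min(int(v * bins), bins - 1)

    n = len(a)
    joint = Counter((quantize(x), quantize(y)) for x, y in zip(a, b))
    marg_a = Counter(quantize(x) for x in a)
    marg_b = Counter(quantize(y) for y in b)
    return sum((cnt / n) * math.log2(cnt * n / (marg_a[i] * marg_b[j]))
               for (i, j), cnt in joint.items())


def fusion_mi(sources, fused, bins=8):
    """Sum of the MI between each source image and the fused image."""
    return sum(mutual_information(src, fused, bins) for src in sources)


fused = [0.0, 0.0, 1.0, 1.0]
sources = [[0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
total_mi = fusion_mi(sources, fused)
```

In this toy case the first source is identical to the fused image and contributes one bit, while the second is statistically independent of it and contributes nothing.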
The MI value reflects the total quantity of information in the fused image that is obtained from the input source images. The comparison results of the 8 MEF methods are listed in Table 2, in which the largest MI value is shown in bold. The proposed method performs better than the other methods in most scenes. In other words, the proposed method outperforms the other methods in preserving information from the source images.
Table 3 lists the performance comparison of the proposed method with the 7 other MEF methods using the CC metric. CC measures the similarity between the source images and the fused image and ranges from -1 to 1. A larger CC value indicates that the fused image better preserves the information in the source images. Table 3 shows that the proposed method has the best performance. The fused images obtained by BLP perform poorly since they contain obvious artifacts, as shown in Figs. 3-6.
The comparison results for the SD metric are listed in Table 4. The proposed method shows the best performance on 13 sets of image sequences. For the rest of the image sequences, the proposed method ranks second in most cases. In general, the fused images obtained by the proposed method have better global contrast than those of the other methods.
To make it easier to compare the performance of the different MEF methods on the objective metrics, line charts are given in Fig. 7. Fig. 7 shows that the proposed method achieves the best performance in most cases with respect to all three metrics.

B. IMPACT OF PATCH DIVISION
To illustrate the effect of adaptive patch division, we compare the fused results obtained with different patch division schemes under the proposed fusion rule. Six patch configurations are used: fixed-size rectangular patches of 29 × 29, 21 × 21 and 15 × 15 pixels, and adaptive patches obtained by the SLIC algorithm with the number of super-pixels set to 200, 400 and 800, respectively. The fused images obtained using fixed-size patches are denoted Ours-29, Ours-21 and Ours-15, and those obtained using adaptive patches are denoted Ours-200, Ours-400 and Ours-800. For convenience of comparison, the input images are scaled to 512 × 341 or 314 × 512. When an image is divided into 200 patches, the average patch size is approximately 29 × 29, and so on. Fig. 8 shows the fused images of "Candle". The results in Fig. 8(b)-(d) are obtained using adaptive patches, and Fig. 8(e)-(g) show the fused images produced using fixed-size patches. With fixed-size patches, obvious blocking artifacts appear in the fused image as the patch size increases, while the fused images obtained using adaptive patches differ little from one another. Comparing the results of the adaptive and fixed-size schemes at similar patch sizes, we can see that the adaptive patch-based method performs better than the fixed-size patch-based method. Table 5 shows the average metrics of the 6 fusion schemes on the 24 sets of fused images. The smaller the image patch, the larger the MI and CC values and the smaller the SD values. When the patch sizes are approximately equal, all three metrics indicate that the adaptive patch-based method outperforms the fixed-size patch-based method.
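The correspondence between the number of super-pixels and the fixed patch sizes can be checked with a short calculation. The Python sketch below treats each super-pixel as a square of equal area, which is an idealization, and the function name is hypothetical.

```python
import math


def equivalent_patch_side(width, height, num_superpixels):
    """Side length of a square whose area equals the average super-pixel area."""
    return math.sqrt(width * height / num_superpixels)


# Average patch side for a 512 x 341 image split into 200, 400 and 800 super-pixels.
sides = {n: equivalent_patch_side(512, 341, n) for n in (200, 400, 800)}
```

The three side lengths each fall within one pixel of the fixed patch sizes 29, 21 and 15 used for comparison.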

V. CONCLUSION AND FUTURE WORK
In this paper, a multi-exposure image fusion method based on adaptive patches has been proposed. The proposed method uses a super-pixel segmentation approach to divide the input images into image patches composed of pixels with similar visual properties. Then, the image patches are decomposed into three independent components: signal strength, image structure and intensity. The three components are fused using different fusion rules, which are designed based on the characteristics of the HVS and the exposure level of the input images. To remove the blocking artifacts caused by patch-wise processing, a guided filter is applied to the signal strength component, the intensity component and the weight maps. As a result, the proposed method generates few blocking artifacts and preserves the color attributes of the input images well. The comparative experiments show that the proposed method outperforms state-of-the-art multi-exposure fusion methods in both subjective and objective evaluation.
Although the proposed method can produce high-quality fused images, it is not suitable for real-time application. In the future, both algorithm and implementation will be optimized to improve the efficiency of the fusion method.
SHUPENG WANG received the B.S. and M.S. degrees in communication and information engineering from the Xi'an University of Science and Technology, in 2000 and 2003, respectively, and the Ph.D. degree in pattern recognition and intelligent system from Xidian University, in 2009. He is currently an Associate Professor with the Xi'an University of Science and Technology. His current research interests include image processing and pattern recognition.
YAO ZHAO is currently pursuing the master's degree with the Xi'an University of Science and Technology, China. Her current research interests include multi-exposure image fusion and pattern recognition.
VOLUME 8, 2020