Multi-Focus Image Fusion With Point Detection Filter and Superpixel-Based Consistency Verification

An accurate and efficient measurement of pixel sharpness is a critical factor in most multi-focus image fusion methods. In our practice, we found that focused regions become more blurred than defocused regions when multi-focus images are blurred digitally. Based on this observation, a novel multi-focus image fusion method is presented in this paper. In the given fusion scheme, focused region detection is achieved with a point detection filter and a Gaussian filter, which has been shown to be more effective than other frequently used image clarity measures. Moreover, unlike other commonly used consistency verification techniques, we propose a superpixel-based consistency verification (SCV) method that integrates image superpixels to improve the fusion performance. Image superpixels can represent perceptually meaningful local image features. Two datasets of multi-focus images are used to conduct experiments. Experimental results demonstrate that the proposed method is competitive with, or even outperforms, state-of-the-art fusion methods in terms of both subjective visual perception and objective evaluation metrics.


I. INTRODUCTION
With the development of electronics and computer technology, multiple imaging sensors have been used synergistically in various image processing systems. Although there are many kinds of sensors, most optical cameras cannot capture a single sharp image with all relevant objects in focus, due to the limited depth-of-focus of optical lenses [1]. Multi-focus image fusion is an effective technology to solve this problem. In this technology, we first shoot multiple images of the same scene with different focus settings. Then the source images are merged to obtain a composite image in which all objects in the scene are in focus. The composite image, also known as the fused image, is more suitable for machine or human perception [2]. In recent years, multi-focus image fusion has been successfully applied in many image processing fields such as digital imaging, microscopic imaging, and machine vision [3]-[6]. (The associate editor coordinating the review of this manuscript and approving it for publication was Yongjie Li.)
During the past two decades, a great variety of multi-focus image fusion algorithms have been proposed, and these algorithms can be roughly divided into two categories: transform domain algorithms and spatial domain algorithms [7]. Transform domain algorithms share a 'decomposition-fusion-reconstruction' framework. The input images are first decomposed into a multi-scale domain. Then the transformed coefficients are fused by certain fusion rules. Finally, the fused image is reconstructed from the integrated coefficients. The pyramid transform [8], wavelet transform [9], contourlet transform [10], curvelet transform [11], and shearlet transform [12] are commonly selected as the decomposition method. Transform domain algorithms are usually treated as general image fusion methods that can be applied to a wide range of fusion tasks such as multi-focus, visible and infrared, and remote sensing image fusion. More recently, sparse representation (SR) has also attracted significant attention in image fusion [13]-[15]. In these methods, the source image is represented with sparse coefficients over an over-complete dictionary. Then, the coefficients are combined with the fusion rules. Finally, the fused image is reconstructed from the combined sparse coefficients and the dictionary. Apart from the selection of transform methods, the fusion rules designed for merging transformed coefficients in the high- or low-frequency domain also play a critical role in these methods, and much research has been conducted in this direction [16], [17].
Spatial domain-based fusion methods have also attracted numerous studies, since human eyes perceive images in the spatial domain. Directly averaging the pixel values of all source images is the simplest pixel-based fusion algorithm, but it leads to blurring, contrast reduction, and loss of original image information. Aiming to make full use of spatial context information, Li et al. first divided the source images into uniform blocks and maximized the spatial frequency (SF) [18] in each block [19]. However, the fusion result may exhibit undesirable block artifacts on object boundaries. To mitigate these artifacts, many region-based image fusion methods have been presented. The region-based approaches first segment the source images into regions rather than blocks, using image segmentation techniques such as normalized cut [20], the watershed transform [21], and enhanced linear spectral clustering [7]. Then the corresponding regions of the different source images are fused according to their clarity, measured by metrics such as SF, the energy of Laplacian of the image (EOL) [22], the sum-modified-Laplacian (SML) [23], and FSWM [24]. Obviously, the image segmentation techniques as well as the clarity measures have a great influence on the fusion results.
Recently, several state-of-the-art spatial domain fusion methods have been proposed, including multi-scale weighted gradient [25], guided filtering [26], and dense SIFT [2]. The dense SIFT method employed local dense SIFT feature descriptors as the activity level measurement and matched mis-registered pixels between the source images to improve the fusion quality at object boundaries [2]. The multi-scale weighted gradient-based image fusion method reconstructed the fused image by making its gradient as close as possible to the magnitude of the merged gradient, rather than employing a decision map [25]. Li et al. first employed the guided filter to optimize the weight coefficients of base layers and detail layers, and obtained satisfactory fusion results [26]. The use of the guided filter alleviates the misalignment of the decision map with object boundaries. Similarly, Qiu et al. proposed a novel focus region detection method based on the guided filter and the mean filter; the guided filter is used not only to solve the misalignment of the initial decision map with object boundaries, but also to refine rough focus maps in focus region detection [27]. Although these newly presented algorithms can improve the visual quality of the fused images, most of them involve relatively complex procedures and might produce artifacts around object boundaries due to inaccurate fusion decision maps. In addition, to overcome the issues of handcrafted methods, CNN-based approaches have recently been applied to image fusion [28], [29]. Liu et al. first introduced CNNs to fuse multi-focus images. They formulated multi-focus image fusion as a classification task and used their model to classify image patches as focused or defocused to obtain the focus map [28]. However, the use of empirical risk minimization and max pooling limits their fusion performance. Thus, Du et al. [29] presented a deep support value convolutional neural network (DSVCNN) by replacing empirical risk minimization and the max pooling layer with structural risk minimization and standard convolutional layers. Due to their strong capability in feature extraction as well as data representation, CNN-based methods also achieve state-of-the-art performance. However, these CNN-based methods are time-consuming and require large training datasets.
In most multi-focus image fusion methods, consistency verification is used as a post-processing technique to remove misjudgments from the final fused results. The majority filter is the earliest consistency verification technique, based on the idea that neighboring coefficients should maintain a highly consistent focus judgment [30]. For mathematical morphology-based fusion methods, focus misjudgments in the initial focus segmentation map are treated as noise or undesirable artifacts that can be easily removed by morphology operators [31]. However, nonlinear mathematical morphology risks removing details from the fused images if the morphology operator is designed irrationally. In the guided filtering-based image fusion method [26], although simple Laplacian coefficients are used as the saliency measure, the quality of the fused result is greatly improved by the guided filtering-based consistency verification. A reasonable consistency verification strategy can evidently improve the performance of most fusion schemes.
In this paper, a novel multi-focus image fusion method with a point detection filter-based focus measure and superpixel-based consistency verification is proposed. In experimental practice, we find that if an image is digitally blurred with a blur kernel, the local quality of the focused areas degrades more evidently than that of the defocused areas. Based on this observation, we can separate the focused areas of different images by comparing the image quality degradation between each raw image and its blurred version. We present a new approach to evaluate pixel sharpness by convolving the input image with a point detection filter. This approach is effective and easy to implement. The performance evaluation of the proposed clarity measure is presented in Section II-A. By comparing the quality changes of the source images, the focused and defocused areas are separated using the focus map, which is further optimized by SCV. Different from the region-based image fusion methods, the superpixel algorithm simple linear iterative clustering (SLIC) [32] is used for consistency verification. Finally, the fused image is constructed by integrating the focused regions of each source image according to the refined focus map. The effectiveness of our method is validated on two datasets under multiple objective quality metrics. The results are compared with those of six state-of-the-art multi-focus image fusion methods. Experimental results show that the proposed algorithm is competitive with these methods in terms of both subjective visual perception and quantitative evaluations.
To sum up, this paper makes the following contributions: (1) It presents a novel measure of image pixel sharpness. The strategy is inspired by the finding that the image quality of focused areas degrades significantly more than that of blurred areas when an image is digitally blurred. Furthermore, the presented sharpness evaluation strategy can be easily implemented using Gaussian filtering and a point detection filter.
(2) Inspired by the success of consistency verification in multi-focus image fusion, this paper proposes an extension of guided filter-based consistency verification that integrates image superpixels, which demonstrates better performance in visual perception.
(3) To demonstrate the efficacy of the fusion strategy and show its superiority over other state-of-the-art alternatives, experiments on two datasets of multi-focus images are conducted. The results demonstrate the superiority of the proposed method in terms of both subjective visual perception and objective evaluation metrics.
All frequently used symbols in this paper are listed in Table 1.

II. MULTI-FOCUS IMAGE FUSION FRAMEWORK
The schematic diagram of the proposed fusion algorithm is illustrated in Fig. 1, which shows the three main steps of the proposed fusion method. In the first step, the pixel sharpness of the input images is calculated by the proposed Gaussian filtering and point detection filter-based clarity measure. In the second step, the initial focus segmentation map S is constructed by comparing the sharpness of the input images, and the SCV is used to refine it. Finally, the fused image I_F is constructed by combining the source images I_A and I_B according to the final decision map D.

A. THE CLARITY MEASUREMENT WITH POINT DETECTION FILTER
In experimental practice, we found that if an image is digitally blurred with a blur kernel, the local quality of the focused areas degrades more evidently than that of the defocused areas. A defocused image can be modeled as the convolution of a sharp image with the point spread function (PSF) [33]. The PSF can be approximated by a Gaussian function g(x, σ), where the standard deviation σ = kc measures the amount of defocus blur; the smaller the standard deviation, the better the image quality. As the amount of defocus blur is estimated at edge locations, we model two blurred edges with Gaussian functions g(x, σ_1) and g(x, σ_2) as

f_1(x) = s(x) ⊗ g(x, σ_1),   f_2(x) = s(x) ⊗ g(x, σ_2),

where s(x) is the step function and ⊗ denotes convolution. Note that the edge is located at x = 0. f_1(x) and f_2(x) are the focused and defocused edges, since σ_1 < σ_2. Fig. 3 shows an overview of our blur estimation method. f_1(x) and f_2(x) are re-blurred using a Gaussian function g(x, σ_3):

f_3(x) = f_1(x) ⊗ g(x, σ_3),   f_4(x) = f_2(x) ⊗ g(x, σ_3).
It can be clearly noted that the differences between f_1(x) and f_3(x) are larger than those between f_2(x) and f_4(x).
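This behavior can be checked numerically. The sketch below is a minimal illustration, not the authors' code; the values σ_1 = 0.8, σ_2 = 3, and σ_3 = 2 are arbitrary choices satisfying σ_1 < σ_2. It blurs a step edge with two different Gaussians, re-blurs both with g(x, σ_3), and confirms that the sharper edge changes more:

```python
import numpy as np

def gauss_kernel(sigma, radius=20):
    """Normalized 1-D Gaussian kernel g(x, sigma)."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    return k / k.sum()

def blur(signal, sigma):
    return np.convolve(signal, gauss_kernel(sigma), mode="same")

s = np.zeros(201)          # step edge s(x), edge at the center
s[100:] = 1.0

f1 = blur(s, 0.8)          # "focused" edge,   sigma_1 = 0.8
f2 = blur(s, 3.0)          # "defocused" edge, sigma_2 = 3.0
f3 = blur(f1, 2.0)         # both re-blurred with sigma_3 = 2.0
f4 = blur(f2, 2.0)

d_focused = np.abs(f1 - f3).max()
d_defocused = np.abs(f2 - f4).max()
print(d_focused > d_defocused)   # the sharper edge degrades more
```

Intuitively, re-blurring the sharp edge raises its effective blur from σ_1 to sqrt(σ_1² + σ_3²), a large relative change, while the already-blurred edge barely changes shape.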
In order to verify this observation on 2D images, we selected a focused region (Fig. 4(b)) and a defocused region (Fig. 4(c)) from Fig. 4(a). Both were then blurred by a Gaussian blur with the standard deviation set from 0.5 to 4. The SF was calculated from each blurred result and is illustrated in Fig. 5, which plots the SF of the focused and defocused regions. It is evident that the SF of the focused region degrades more sharply than that of the defocused region. This is consistent with the 1D edge model.
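The 2D experiment can be reproduced in miniature. The snippet below is a hedged sketch with synthetic data, not the paper's experiment: it computes SF (the root of the mean squared row and column pixel differences [18]) for a textured "focused" patch and a pre-blurred "defocused" patch, before and after an identical extra Gaussian blur:

```python
import numpy as np

def spatial_frequency(img):
    """SF = sqrt(RF^2 + CF^2), with RF/CF the RMS of horizontal/vertical
    first differences [18]."""
    img = img.astype(float)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))  # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def gauss1d(sigma, radius=8):
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable 2-D Gaussian blur (zero-padded borders)."""
    k = gauss1d(sigma)
    tmp = np.apply_along_axis(np.convolve, 1, img, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, tmp, k, mode="same")

rng = np.random.default_rng(0)
texture = rng.random((64, 64))           # stands in for a focused region
focused = texture
defocused = gaussian_blur(texture, 2.0)  # stands in for a defocused region

# SF drop caused by the same extra blur applied to both regions
drop_focused = spatial_frequency(focused) - spatial_frequency(gaussian_blur(focused, 2.0))
drop_defocused = spatial_frequency(defocused) - spatial_frequency(gaussian_blur(defocused, 2.0))
print(drop_focused > drop_defocused)     # SF of the focused patch degrades more
```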
Inspired by this observation, we propose a new method to evaluate the sharpness of a pixel, as follows.
Step 1: Blur the input source images with a Gaussian filter. For both source images I_A and I_B, their blurred versions M_q are obtained with a Gaussian kernel as in (5):

M_q = I_q ⊗ H,   q ∈ {A, B},   (5)

where I_q denotes the source image and H denotes the 5 × 5 Gaussian kernel with standard deviation σ.
Step 2: In multi-focus images, the texture of the focused region is rich relative to the defocused region, which largely means that the differences between pixels in the focused area are larger than those in the defocused area. A point detection filter computes the sum of the differences between a center pixel and its surrounding pixels. Thus, when we convolve an image with a point detection filter, the larger the response, the sharper the pixel. The difference between the point detection responses of the original image and the blurred image yields the activity level map as in (6):

A_q = I_q ⊗ W − M_q ⊗ W,   (6)

where I and M are the original image and the blurred image, respectively, and W is the point detection filter defined in (7). The evaluation strategy is based on the idea that the image quality of focused areas degrades significantly more than that of blurred areas when an image is digitally blurred.
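A compact NumPy sketch of Steps 1–2 follows. It is our reading of the scheme, not the authors' code: the 3 × 3 mask `W` is the standard point detection mask assumed here for (7), and the activity map is taken as the point detection response of the source image minus that of its blurred version:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Plain 'same'-size 2-D convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img.astype(float), ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]
    out = np.empty(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

def gaussian_kernel(size=5, sigma=2.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

# Assumed point detection mask W: center 8, neighbors -1.
W = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], dtype=float)

def activity_map(I, size=5, sigma=2.0):
    """Step 1: blur I with the 5x5 Gaussian H to get M.
    Step 2: sharpness map A = I*W - M*W (our reading of the text)."""
    M = conv2d_same(I, gaussian_kernel(size, sigma))
    return conv2d_same(I, W) - conv2d_same(M, W)
```

On a focused region the magnitude |A| is large, because blurring destroys the fine texture that the point detection filter responds to; on an already defocused region the blur changes little and |A| stays small.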
We used the strategy in [22] to evaluate the validity of the novel sharpness measure. First, two multi-focus source images are decomposed into small blocks of equal size. Then, the proposed method as well as other widely used focus measures are used to calculate the sharpness of each image block. Finally, the block with the greater sharpness measure from each block pair is chosen to construct the composite image. We use the root mean square error (RMSE) between the composite image and the ground-truth reference image to evaluate the performance of the focus measures. Note that no consistency verification was used in this process, as it would directly disturb the results obtained by the focus measures.
RMSE is defined as

RMSE = sqrt( (1 / (M N)) Σ_{m=1}^{M} Σ_{n=1}^{N} (R(m, n) − F(m, n))² ),

where R and F denote the ground-truth reference image and the composite image, respectively, F(m, n) is the pixel intensity at position (m, n), and M and N denote the numbers of rows and columns of the image. Fig. 6 shows two test images, 'clock' and 'disk', with sizes 512 × 512 and 640 × 480, respectively. In the experiments, the block size is set to 32 × 32 pixels and the standard deviation of the Gaussian filter in (1) is set to 2. The main image clarity measures employed in the spatial domain include SF, EOL, SML, and FSWM. Since Huang et al. [22] have proven that SML and EOL perform better than the other common focus measures, we only compare with SML, EOL, and the neighbor distance filter (ND) proposed in [34]. Fig. 7 shows the selection maps and fused images obtained by ND, EOL, SML, and our method, respectively. Apparently, SML and our method perform better as clarity measures than ND and EOL. Fig. 8 shows the quantitative comparison of the different focus measures in terms of RMSE. The proposed focus measure achieves the lowest RMSE, which implies that the composite image obtained by our method is closest to the reference image.
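The evaluation protocol is easy to restate in code. The sketch below is an illustrative reconstruction, not the original experiment code; `sharp_A`/`sharp_B` stand for any per-pixel clarity measure, and the 32 × 32 block size follows the text:

```python
import numpy as np

def rmse(R, F):
    """Root mean square error between reference R and composite F."""
    return np.sqrt(np.mean((R.astype(float) - F.astype(float)) ** 2))

def block_composite(I_A, I_B, sharp_A, sharp_B, block=32):
    """For every block pair, keep the block whose summed sharpness is larger."""
    out = np.empty_like(I_A)
    for i in range(0, I_A.shape[0], block):
        for j in range(0, I_A.shape[1], block):
            sl = np.s_[i:i + block, j:j + block]
            out[sl] = I_A[sl] if sharp_A[sl].sum() >= sharp_B[sl].sum() else I_B[sl]
    return out
```

Running this with each candidate focus measure and comparing `rmse(reference, composite)` reproduces the ranking experiment; a lower RMSE means the measure selected the correct blocks more often.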
In this paper, we use the absolute value of the sharpness measure in (6) to yield an initial focus segmentation map as in (9):

S(m, n) = 1 if |A_A(m, n)| ≥ |A_B(m, n)|, and S(m, n) = 0 otherwise,   (9)

where S(m, n) = 1 denotes that the pixel in I_A is focused; otherwise the pixel in I_B is focused. Fig. 9(a) and (b) are two gray source images of the same scene. Fig. 9(a) is focused on the statue while Fig. 9(b) is focused on the background building. Fig. 9(c) and Fig. 9(d) are their Gaussian-filtered versions. Fig. 9(e)-(h) are the results of processing Fig. 9(a)-(d) with the point detection filter. As shown in Fig. 9(e) and (g), the background of the two images has changed little, but the foreground shows obvious changes. The same situation is illustrated in Fig. 9(f) and (h). Based on this observation, an initial focus segmentation map obtained by (9) is illustrated in Fig. 9(i), where the focused and non-focused areas are roughly separated.

B. SUPERPIXEL BASED FUSION MAP CONSISTENCY VERIFICATION
The initial focus map is usually not a perfect segmentation, as shown in Fig. 9(i). The essential cause of this situation is that focus segmentation is a high-level vision task based on a global understanding of the scene, whereas most focus measures are based on local statistics of pixel values. Therefore, the initial focus map is usually post-processed with consistency verification to remove misjudgments from the final fused results. In this paper, we comprehensively use superpixels, guided filtering, and morphology operators to realize robust consistency verification. More specifically, superpixel segmentation is used to cluster sets of image pixels that share similar visual characteristics; the resulting superpixels represent perceptually meaningful local image features. The processing flow of our consistency verification scheme is shown in Fig. 1. First, the initial focus segmentation map S is preprocessed with guided image filtering, with the source images serving as guidance images. This operation makes the focus map consistent with the content of the source images. Then, the filtered map is further segmented into small visual regions using SLIC, an adaptation of k-means for superpixel generation, which is faster and more memory-efficient than existing methods. The parameter k adjusts the desired number of approximately equally sized superpixels. Finally, the final decision map D is constructed as

D(m, n) = 1 if NR1_i > NR0_i for (m, n) ∈ R_i, and D(m, n) = 0 otherwise,

where R_i denotes the ith superpixel region, and NR1_i and NR0_i are the numbers of 1s and 0s, respectively, in R_i of the initial focus segmentation map S. Furthermore, we use morphological operators to refine the decision map D. An example of decision map acquisition is demonstrated in Fig. 9. The initial focus segmentation map S is shown in Fig. 9(i), in which the misjudgments appear as noise or cracks. Fig. 9(j) is the focus map processed by guided filtering. Fig. 9(k) is the final decision map obtained by superpixel-based consistency verification, which yields a smooth and edge-aligned labeling.
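The superpixel voting rule is straightforward to sketch. In the full method the label map comes from SLIC (e.g. `skimage.segmentation.slic` in scikit-image); the toy example below substitutes a hand-made 2 × 2-block label map so the snippet stays dependency-free:

```python
import numpy as np

def superpixel_vote(S, labels):
    """Majority vote inside each superpixel region R_i: set D to 1 iff the
    count of 1-pixels NR1_i in the initial map S exceeds the count NR0_i."""
    D = np.zeros_like(S)
    for i in np.unique(labels):
        region = labels == i
        nr1 = int(S[region].sum())
        nr0 = int(region.sum()) - nr1
        D[region] = 1 if nr1 > nr0 else 0
    return D

# Toy initial focus map with isolated misjudgments, and a stand-in label map.
S = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 1],
              [1, 1, 1, 0]])
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 3, 3],
                   [2, 2, 3, 3]])
D = superpixel_vote(S, labels)
# Isolated 0s and stray 1s are voted away within each region.
```

Because each superpixel is filled with a single label, the misjudged pixels inside a region are absorbed by the majority, which is exactly why the final map in Fig. 9(k) loses the noise and cracks of Fig. 9(i).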

C. FUSED RESULT
With the final decision map D, the fused image is obtained by (11):

I_F(m, n) = D(m, n) × I_A(m, n) + (1 − D(m, n)) × I_B(m, n),   (11)

where × denotes pointwise multiplication.
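The final composition in (11) is a one-liner; this sketch assumes D is a binary map of the same size as the source images:

```python
import numpy as np

def fuse(I_A, I_B, D):
    """I_F = D x I_A + (1 - D) x I_B, pixel-wise (Eq. (11))."""
    D = D.astype(float)
    return D * I_A + (1.0 - D) * I_B
```

Because D is binary, every fused pixel is copied unchanged from exactly one source image, so no new intensity values are introduced.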
Note that the proposed method can be easily extended to color image fusion. The input color images are converted into the YUV color space, and the intensity component Y is used as the gray input to obtain the clarity segmentation map. Then, we adopt the intuitive scheme of fusing the three original RGB color channels separately.
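For color inputs, only the decision map is computed on the luma channel; the composition itself runs per RGB channel. A hedged sketch follows (BT.601 luma weights are assumed for Y, and `decision_map` stands in for the full grayscale pipeline described above):

```python
import numpy as np

def luma(rgb):
    """Y component of YUV (BT.601 weights assumed)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def fuse_color(rgb_A, rgb_B, decision_map):
    """Compute D from the Y channels, then fuse R, G, B separately."""
    D = decision_map(luma(rgb_A), luma(rgb_B))   # H x W binary map
    D3 = D[..., None].astype(float)              # broadcast over channels
    return D3 * rgb_A + (1.0 - D3) * rgb_B
```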

III. EXPERIMENTS
In this section, we conduct numerical experiments to verify our main results. We first introduce the datasets used to evaluate the performance of the different methods. Then the influence of the three free parameters on fusion performance is analyzed. Finally, the performance of the proposed fusion method is compared with other state-of-the-art alternatives, and the experimental results are discussed in detail. The experiments are performed on an Intel(R) Core(TM) i5-8250U @ 1.60 GHz PC with MATLAB R2016a.

A. IMAGE DATASETS
In the experiments, we evaluate the proposed fusion algorithm on two image datasets. The first is a grayscale multi-focus dataset containing 6 pairs of grayscale images commonly used in related papers. The second is a color image dataset composed of 6 pairs of color multi-focus images of size 520 × 520, selected from the multi-focus image dataset ''Lytro'', which is publicly available online [35]. Fig. 10 shows the grayscale multi-focus image dataset and Fig. 11 shows the color multi-focus image dataset.

B. OBJECTIVE EVALUATION METRICS
In our experiments, we exploit seven evaluation metrics to objectively evaluate the performance of the fused images obtained by the different algorithms: the information theory-based metrics Q_MI [36] and Q_FMI [37], the image feature-based metrics Q_AB/F [38] and Q_E [39], the structural similarity-based metrics Q_Y [40] and Q_W [39], and the human perception-based metric Q_CB [41]. For all seven evaluation metrics, a larger value indicates a better fused result.

C. ANALYSIS OF THE FREE PARAMETERS
In this subsection, the influence of different parameters on objective fusion performance is analyzed using the image datasets shown in Fig. 10 and Fig. 11. The fusion performance is evaluated by the average values of Q_MI, Q_FMI, Q_AB/F, Q_E, Q_Y, Q_W, and Q_CB. The standard deviation σ in (1), the block size r in guided filtering, and the number of superpixels k in SLIC are the three free parameters of the proposed method. When analyzing the influence of σ, the block size r is preset to 10 and the number of superpixels k is set to 100. Likewise, when analyzing the influence of the block size r in guided filtering, we set the standard deviation σ to 2 and the number of superpixels k to 100. Similarly, when analyzing the influence of the parameter k in SLIC, the standard deviation σ is set to 2 and the block size r is set to 10. As shown in Fig. 12, the fusion performance levels off over a large range of each parameter. The evaluation metrics in Fig. 12(a)-(c) illustrate that our method does not rely heavily on the exact parameter choice. In this paper, the default parameters are set as σ = 2, r = 10, and k = 100.

D. EXPERIMENTAL RESULTS AND DISCUSSION
To confirm the effectiveness of the proposed method, we compare our algorithm with six state-of-the-art multi-focus image fusion approaches: the convolutional neural network-based image fusion algorithm [28], the multi-scale focus measures and generalized random walk-based image fusion algorithm [41], the guided filter and mean filter-based image fusion algorithm [27], the guided filtering-based image fusion algorithm [26], the image matting-based image fusion algorithm [43], and the boundary finding-based algorithm [44]. For convenience, these are abbreviated as CNN, MFGR, GFMF, GF, IM, and BF, respectively. The source code of CNN can be downloaded from [45], that of MFGR from [46], that of GFMF from [47], that of BF from [48], and that of IM and GF from [49]. The optimal parameters reported in the related publications are used in our experiments.
The qualitative evaluation of the multi-focus image fusion algorithms is completed by comparing the visual quality of the fused images. For the grayscale dataset, the Flower and Temple image pairs are selected for detailed comparison; their fusion results are shown in Fig. 13 and Fig. 15, respectively. The fusion results in both Fig. 13 and Fig. 15 show that all the algorithms achieve the purpose of multi-focus image fusion. In order to show the fusion results more clearly, Fig. 14 and Fig. 16 show the difference images obtained by subtracting the second source image, shown in Fig. 13(b) and Fig. 15(b), from each fused image, with the values of each difference image normalized to the range of 0 to 1. From Fig. 14(a)-(f), we can observe that the fused images of the compared methods produce significant undesirable block artifacts on the boundary between the flower and the background wall, and the fused image of BF loses some original image information in the lower part of the flower. As shown in Fig. 14(g), our method yields the highest visual quality among the compared methods. From Fig. 16(a)-(g), it can also be seen that, compared to the other approaches, our algorithm locates the boundary between the stone lion and the temple more accurately. Due to space limitations, only the fused results of the other gray image pairs produced by our proposed method are presented in Fig. 17, which also shows high visual quality. The object features from both source images are preserved clearly in the fused images, and no reconstruction artifacts are produced.
We also apply the proposed method to the color image dataset shown in Fig. 11. For each image pair in Fig. 11, the left image is near-focused while the right image is far-focused. Taking Fig. 11(a) as an example, in the left image the statue is in focus and clear, whereas the background wall is out of focus and blurred; in the right image, the situations of the statue and the background wall are reversed. Fig. 11(a) and Fig. 11(b) are also selected for detailed visual comparison. Fig. 18 and Fig. 20 show the corresponding fusion results of the proposed algorithm and the other state-of-the-art approaches. As shown in Fig. 18 and Fig. 20, all fusion results of our method and the other approaches demonstrate high visual quality. For a better comparison, the normalized difference images obtained by subtracting the second source image, shown in Fig. 18(b), from each fused image are provided in Fig. 19. From Fig. 19(a)-(f), we can observe that the fused images of CNN, MFGR, GFMF, GF, and BF are of overall high quality except that the regions around the statue are slightly blurred, while the fusion result of IM produces undesirable artifacts around the statue. The normalized difference image shown in Fig. 19(g) shows that our method yields the highest visual quality in the boundary regions between the statue and the background wall among the compared methods. Fig. 21 shows the normalized difference images obtained by subtracting the first source image, shown in Fig. 20(a), from each fused image. From the area enclosed by the green rectangle in Fig. 21, we can note that a small focused region in Fig. 20(a) is preserved by GF, IM, and MFGR, but their fusion quality around the boundary regions labeled by the blue rectangle is poor, especially for IM. As shown in Fig. 21(a), the fusion result of BF misclassifies regions on the koala, and the CNN method also suffers from some artifacts in the blue region. The normalized difference image shown in Fig. 21(g) demonstrates that the fused image of our method has the highest visual quality among all the compared methods. The fusion results of the other test color images in Fig. 11 are shown in Fig. 22. All objects in the scenes are preserved clearly in the fused images.
In addition to the visual comparison, the seven evaluation metrics Q_MI, Q_FMI, Q_AB/F, Q_E, Q_Y, Q_W, and Q_CB are used to evaluate the fusion performance objectively. As shown in the tables, the proposed method also gives the largest value in most of the examples. It should be noticed that the proposed method, GFMF, and GF give stable and better performance on Q_E; this is because the guided filtering-based methods can inherit global salient edges effectively. As shown in TABLE 6 and TABLE 7, the proposed method obviously outperforms BF, CNN, IM, and MFGR, and is competitive with GF and GFMF on the structural similarity-based metrics Q_Y and Q_W, which means that the fused images of our method, GF, and GFMF can well preserve the complementary information of the different source images. As shown in TABLE 8, our method also gives stable and higher values for the human perception-based metric Q_CB in most cases, which indicates that the fused images of the proposed method have better visual perception without introducing distortions. Although the proposed algorithm is not the most efficient multi-focus image fusion approach, the experimental results show that our method is competitive with the state-of-the-art algorithms in terms of both subjective visual perception and objective evaluation metrics. To evaluate the computational efficiency, the running time for processing the 12 source image sets with each algorithm is counted and listed in TABLE 9. In terms of average time, it can be noticed that the CNN method suffers from the highest computational cost, at 183.50 seconds. The GF method is more efficient than the other methods, at 0.40 seconds. The average time of our method is 3.02 seconds, which is also acceptable.

IV. CONCLUSION
This paper proposes a new multi-focus image fusion method that detects the focused regions of multiple source images using a point detection filter and a Gaussian filter, and presents a superpixel-based consistency verification scheme that introduces the SLIC algorithm into consistency verification to improve the fusion performance. Compared with several traditional multi-focus fusion methods, the proposed method achieves good results both subjectively and objectively.
YUEHUA LI received the B.S. degree in electronic engineering from the East China University of Technology, in 1997, and the master's degree in signal and information processing from Nanchang University, in 2007. She is currently a University Lecturer with the Department of Electrical Engineering, University of South China. Her current research interests include signal and information processing, image processing, and artificial intelligent control systems.
LIHUI PANG received the B.S. degree from the Henan University of Science and Technology, in 2008, and the Ph.D. degree from the University of Electronic Science and Technology of China, in 2015. She is currently a Lecturer with the University of South China. Her current research interests include blind signal processing, pattern recognition, and deep learning.