Fast Multi-Focus Image Fusion Using Inner Channel Prior

In this paper, we propose a simple but effective sharpness prior, the inner channel prior, to detect the focused areas of multi-focus images. The inner channel prior is a location feature of natural image sharpness, based on a key observation: the high-frequency information of the saturation channel is located inside the object. The inner channel prior indicates the focus degree inside objects, which yields better color fidelity and sharper objects. Using the proposed prior combined with multi-scale image fusion, we can directly obtain a sharp image with an extended depth of field in which all the objects are in focus. Results on a variety of defocused image sequences demonstrate the effectiveness of the proposed prior. In addition, the proposed prior is insensitive to overexposure and better maintains the color of multi-focus images.


I. INTRODUCTION
It is challenging to obtain an image in which all the captured objects are focused, owing to the finite depth of field (DOF) of photographic lenses. Only at a particular distance from the camera is an object in focus with acceptable sharpness, and the sharpness decreases gradually as the object moves away from the focal plane. To obtain an all-in-focus image, we need to fuse images captured of the same scene at different object distances.
Image fusion is highly desired in computational photography and computer vision applications, such as remote sensing [1], medical diagnosis [2], [3], micro-image fusion [4], biochemical analysis [5], image dehazing [6], [7] and microscopy [8]. The fusion of multi-focus images is a major application of image fusion. The main idea of multi-focus image fusion is to extract the important and essential information from the input images, and then fuse this information into one image that theoretically contains all the information of the input images [9], as shown in Fig. 1.
However, gathering focused objects is a challenging problem, since there is no gold-standard criterion to measure the sharpness of an object. Therefore, many focus measurements [10] have been proposed to distinguish sharp image blocks from defocused image blocks, including variance, energy of image gradient (EOG), Tenenbaum's algorithm (Tenengrad) and sum-modified-Laplacian (SML) [11]. Experiments [10] show that SML can provide better fusion performance than other focus measures. Meanwhile, how to fuse a sharp image with as many details as possible is also a tough problem. Image fusion has made significant progress [12] in recent years. Sun et al. [13] utilize the region-level method, Laplacian pyramids, to fuse multi-focus images, obtaining a sharp image with more original information than pixel-level methods. However, this method cannot obtain a clear image when fusing images with few details. Chen et al. [14] utilize an image matting algorithm, making full use of the spatial information to refine the weight maps with state-of-the-art performance, even in some cases of obvious misregistration. However, this method can make wrong judgements inside objects, because the image matting algorithm is based on edge information. In this paper, we propose a novel prior, the inner channel prior, for focus detection. The inner channel prior is based on the statistics of sharp images. We find that the color saturation varies inside an object, so the high-spatial-frequency information of the saturation channel is mainly located inside the object. In contrast, the high-spatial-frequency information of an image generally appears at the edge of the object [15]; we call this the edge channel, as shown in Fig. 2. Therefore, the high-frequency information of the saturation can directly produce a sparse focus map of the sharp object.

The associate editor coordinating the review of this manuscript and approving it for publication was Gangyi Jiang.
Then the guided filter [16] is applied to propagate the sparse map to the entire image. In addition, the high spatial frequency information of value channel, which mainly appears at the edge of the object, is used to deal with the blurry edge of multi-focus images.
The main contributions of this paper are as follows: 1) We propose the inner channel prior to measure the focus degree inside objects, avoiding the misjudgement caused by edge-information-based measurements. 2) Inspired by the sum-modified-Laplacian (SML), we select the guided filter instead of the sum operator to propagate the sparse sharpness map to the full image. This guided-filter-modified-Laplacian operator increases the focus detection accuracy. 3) We utilize the edge channel to refine the focus detection at the edges of objects and for tiny objects containing a small number of pixels. 4) Combined with the image pyramid fusion method, we can generate a high-quality fused image. Experiments demonstrate that our approach is physically valid and efficient. 5) Using thread programming, we process the R, G, B channels simultaneously, which reduces the time cost of the algorithm.

II. RELATED WORK
Multi-focus image fusion aims to acquire an all-in-focus image, which can be used in diverse image processing tasks [17]. In the literature, multi-focus image fusion (MFIF) methods are divided into three categories: spatial domain, transform domain and neural network [18]. Spatial domain algorithms directly operate on pixels with pixel-intensity based operations [19] and can be divided into pixel-based and region-based methods [20]. The dense SIFT method [21] is a typical spatial domain method. Spatial domain algorithms can be time-efficient but are sensitive to noise. Besides, they have inherent limitations, such as image blurring and spatial distortion [22], which lead to artifacts in the fused image. However, these problems can be tackled in the transform domain.
Transform domain algorithms convert the original images into a feature domain, in which the images can be fused with higher quality [23]. Representative transform domain algorithms include the nonsubsampled contourlet transform (NSCT) technique [24], the sparse representation technique [25], [26], the wavelet transform [27] and the gradient domain technique [28]. The gradient domain method transforms the source images into the gradient domain, where image fusion is implemented by processing the gradients; image reconstruction is then executed to obtain the sharp image. However, transform domain algorithms are time-consuming and computationally expensive.
Deep learning algorithms have become increasingly popular for MFIF in the past few years. A neural network is a feature model of deep learning that can learn to extract characteristic descriptors of an image at separate levels of abstraction [29]. Deep learning methods can achieve all-round good performance on the fusion metrics but do not present an obvious advantage in any single metric compared with traditional methods [20]. Despite the great advancement achieved in recent years, there are still several major issues in the area of multi-focus image fusion [12], [20], such as defocus effects, color distortion and processing efficiency. In this paper, we aim to cope with these challenging problems.

III. INNER CHANNEL PRIOR
The inner channel prior is based on the following observation on sharp images: an all-in-focus natural image has vivid color with a large saturation range, while a defocused image appears more muted and grey with a smaller saturation range. Meanwhile, the saturation of object color varies inside the object, so the high-spatial-frequency information of the saturation channel is mainly located inside the object (Fig. 3(b)), whereas the high-spatial-frequency information of the value channel primarily appears at the edge of the object (Fig. 3(c)).

A. DEFINITION
To formally describe this observation, we first introduce the HSV color model. Compared with RGB, which uses primary colors, HSV is closer to the way humans perceive color. The HSV model has three components: hue, saturation and value. This color space describes colors in terms of their saturation and their brightness value. According to [30], the value component is the part most commonly used in image processing. In this paper, however, we mainly exploit a novel characteristic of the saturation component.
The HSV model can be derived from the RGB model by the following relationships:

s = (max − min) / max,  v = max, (1)

where R, G, B are the values of the RGB channels, max and min denote the maximum and minimum of R, G, B, respectively, and s and v are the saturation and value components of the image. Since this prior is used for focus detection, we choose a focus measure to acquire the high-frequency information of the saturation channel. According to [10], we choose the ML (modified Laplacian) operator to perform the computation. The sparse focus map ML_in is given by

ML_in(x, y) = |2I_sat(x, y) − I_sat(x − 1, y) − I_sat(x + 1, y)| + |2I_sat(x, y) − I_sat(x, y − 1) − I_sat(x, y + 1)|, (2)

where I_sat(x, y) is the value of the saturation component of the input image I at pixel (x, y), and ML_in(x, y) is the sparse map of the inner channel. Then the guided filter [16] is applied to propagate the focus estimates from the sparse high-frequency map to the entire image (Fig. 3(e)), yielding the inner channel I_in, with the original input image I as the guidance image. The inner channel is defined as

I_in(i) = (1/|ω|) Σ_{k: i ∈ ω_k} (a_k I_i + b_k), (3)

where i is the pixel index, I_in(i) is the value of the inner channel at pixel i, and ω_k is a window centered at pixel k.
|ω| is the size of the window ω_k, and (a_k, b_k) are linear coefficients assumed to be constant in ω_k, given by

a_k = cov_k(I, ML_in) / (var_k(I) + ε),  b_k = mean_k(ML_in) − a_k · mean_k(I),

where cov_k, var_k and mean_k are computed over the window ω_k and ε is a regularization parameter. For the detailed derivation of Eq. (3), readers can refer to [31].
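In numpy terms, the saturation channel and its sparse ML map can be sketched as follows (an illustrative reconstruction, not the authors' code; the step of the ML operator is assumed to be 1, and edges are handled by replicate padding):

```python
import numpy as np

def saturation_channel(rgb):
    """Saturation s = (max - min) / max per pixel of an RGB image in [0, 1]."""
    mx = rgb.max(axis=2)
    mn = rgb.min(axis=2)
    # avoid division by zero for black pixels, whose saturation is defined as 0
    return np.where(mx > 0, (mx - mn) / np.where(mx > 0, mx, 1.0), 0.0)

def modified_laplacian(chan, step=1):
    """Sparse focus map ML(x, y) of a single channel (modified Laplacian)."""
    p = np.pad(chan, step, mode='edge')
    c = p[step:-step, step:-step]
    up, down = p[:-2 * step, step:-step], p[2 * step:, step:-step]
    left, right = p[step:-step, :-2 * step], p[step:-step, 2 * step:]
    return np.abs(2 * c - up - down) + np.abs(2 * c - left - right)
```

A pure red image has saturation 1 everywhere, a grey image has saturation 0, and the ML map of any constant channel is identically zero.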

B. FILTER PARAMETER
Two parameters in the guided filter influence the resultant image: the regularization ε and the radius r. ε is a regularization parameter penalizing large a_k. The edges of the output image are better preserved with smaller ε; in this paper ε = 10^-4, according to [31]. r is the radius of the window ω_k. The output image becomes smoother with a larger radius r, as can be seen in Fig. 4.
However, the radius cannot be too large, since a large radius smooths the edges of the fused image and degrades the quality of the resultant image, as shown in Fig. 5. So in this paper, we set the radius r to 15.
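A minimal single-channel guided filter with these parameters might look like the following sketch (our illustration of the standard algorithm of [16], with box means via an integral image; the fast subsampled variant is omitted for clarity):

```python
import numpy as np

def box_mean(a, r):
    """Mean over a (2r+1) x (2r+1) window with replicate padding."""
    p = np.pad(a, r, mode='edge')
    ii = np.cumsum(np.cumsum(p, axis=0), axis=1)
    ii = np.pad(ii, ((1, 0), (1, 0)))          # zero row/column for the integral image
    H, W = a.shape
    s = (ii[2*r+1:, 2*r+1:] - ii[:H, 2*r+1:]
         - ii[2*r+1:, :W] + ii[:H, :W])
    return s / (2 * r + 1) ** 2

def guided_filter(guide, src, r=15, eps=1e-4):
    """Propagate the sparse map `src` through the guidance image `guide`."""
    mean_I, mean_p = box_mean(guide, r), box_mean(src, r)
    corr_Ip, corr_II = box_mean(guide * src, r), box_mean(guide * guide, r)
    a = (corr_Ip - mean_I * mean_p) / (corr_II - mean_I ** 2 + eps)  # a_k
    b = mean_p - a * mean_I                                          # b_k
    # average the coefficients of all windows covering each pixel
    return box_mean(a, r) * guide + box_mean(b, r)
```

On a constant guidance image the local variance vanishes, so a ≈ 0 and the filter reduces to a plain box smoothing of the sparse map.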
Using the definition of the inner channel, our observation indicates that if I is a sharp image with clear objects, the signal of I's inner channel is mainly located inside the objects. This is mainly due to two factors: a) pure colors are rare in nature, so natural colors are mixtures with varying saturation; b) most objects are three-dimensional, so the saturation perceived by the sensor changes slightly with the angle of the incident ray.
To verify the effectiveness of the inner channel, we collect a sharp image database consisting of 150 images from heliconsoft.com and Flocker.com. Fig. 6 shows several sharp images and the corresponding inner channels and edge channels.
As shown in Fig. 6, the inner channel can indicate the focus degree of objects better than the edge channel, which can be clearly seen in the pictures of ''mite'' (the 3rd picture), ''wing'' (the 4th picture) and ''bee'' (the 5th picture). However, there are many edges and tiny objects in natural images, such as the 6th and 7th images, so we utilize the edge channel to refine the focus map of natural images.

IV. MULTI-FOCUS IMAGE FUSION
Multi-focus fusion produces the desired image by picking up only the best parts of the multi-focus image series. This procedure is conducted by a set of focus measures, which are then united into a full weight map. It is convenient to refer to the input image series as a focus stack. The focused image is then produced by collapsing the focus stack under the guidance of the full weight map.

A. FOCUS DETECTION
All the images in the focus stack contain defocused regions because of the finite DOF of the cameras. Such regions should be discarded, whereas the regions including sharp objects and details should be preserved. We fulfill this by carrying out the following metrics. A larger pixel value of a metric means the pixel is better focused.

1) INNER CHANNEL
We apply an ML operator combined with the guided filter to the saturation component of each image, yielding the indicator I_in. It tends to assign high weights to significant elements inside objects.

2) EDGE CHANNEL
There is a blurring effect located at the edges of objects, which needs to be addressed to generate a sharp fused image. Thus we introduce the edge channel I_ed, which is computed by an ML operator combined with the guided filter. This measure specifies the interesting elements at the edge of the object (Fig. 3(f)). Meanwhile, it compensates for the limitations of the inner channel, such as grey objects (saturation equal to zero) and tiny objects consisting of very few pixels.
The definition of the edge channel I_ed is similar to that of the inner channel I_in.
The sparse map of the edge channel is given by

ML_ed(x, y) = |2I_val(x, y) − I_val(x − 1, y) − I_val(x + 1, y)| + |2I_val(x, y) − I_val(x, y − 1) − I_val(x, y + 1)|, (4)

where I_val(x, y) is the value of the value component of the input image I at pixel (x, y), and ML_ed(x, y) is the sparse map of the edge channel.
Similarly, the guided filter propagates ML_ed to the entire image, yielding the edge channel:

I_ed(i) = (1/|ω|) Σ_{k: i ∈ ω_k} (a_k I_i + b_k), (5)

where i is the pixel index, I_ed(i) is the value of the edge channel at pixel i, ω_k is a window centered at pixel k, |ω| is the size of the window ω_k, and (a_k, b_k) are linear coefficients assumed to be constant in ω_k.

B. FULL WEIGHT MAP
For each pixel, we combine the information processed by the two separate measures to form a full focus map using summation, because we want to apply all the qualities defined by these measures. Similar to the weighted terms of a linear combination, we can alter the significance of each measure using the following power relationship:

W_{i,j,k} = (I_in_{i,j,k})^α + (I_ed_{i,j,k})^β, (6)

where α and β are the weighting exponents of the corresponding focus measures. The subscript i, j, k denotes pixel (i, j) in the k-th image. W_{i,j,k} (Fig. 3(d)) is the full focus map.

I_in_{i,j,k} aims to maintain the color of the object and obtain a sharp object, and I_ed_{i,j,k} aims to deal with blurry effects. Larger α and β indicate better corresponding performance, but also higher computational complexity and lower processing efficiency. The performance of image fusion with different β is shown in Fig. 7.
We cannot obtain an image with clear edges in the case of β = 0, whereas a clear image can be acquired when β = 1 or β = 2, as shown in Fig. 7.
α should be large in some cases, for example when the source images have different objects at the same location but the fused image only needs the object on top, as shown in Fig. 8. α = 1 and β = 1 can obtain an image with sharp objects and edges. Larger α and β would be computationally expensive, since the exponentiation requires repeated multiplication, so both α and β are set to one in this paper. The guided filter makes neighbouring regions share the values of the sparse map, which can also enhance the weight of defocused regions (Fig. 9(a)), so we emphasise the sharp objects and details (Fig. 9(b)) by the following function:

W_{i,j,k} = (I_in_{i,j,k} · ML_in_{i,j,k})^α + (I_ed_{i,j,k} · ML_ed_{i,j,k})^β, (7)

where W_{i,j,k} is the full weight map, α and β are the weighting exponents of the corresponding focus measures, and the subscript i, j, k denotes pixel (i, j) in the k-th image. I_in_{i,j,k} and I_ed_{i,j,k} are the values of the inner channel and edge channel, respectively, and ML_in_{i,j,k} and ML_ed_{i,j,k} are the sparse maps of the inner channel and edge channel, respectively.
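The combination of the two measures described above can be sketched as follows (the exact weighting form is our reconstruction of the paper's summation of power-weighted terms; α = β = 1 as chosen in the paper):

```python
import numpy as np

def full_weight_map(inner, edge, ml_in, ml_ed, alpha=1.0, beta=1.0):
    """Full weight map combining both focus measures by summation.
    inner/edge are the guided-filtered channels, ml_in/ml_ed the sparse maps;
    multiplying by the sparse maps re-emphasises genuinely sharp pixels that
    guided-filter smoothing has spread into defocused neighbourhoods."""
    return (inner * ml_in) ** alpha + (edge * ml_ed) ** beta
```

Summation (rather than multiplication) keeps the map nonzero wherever either measure fires, which matters for grey objects whose inner channel vanishes.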
The full weight map W i,j,k is used to guide the image fusion.

C. IMAGE FUSION
We use a technique inspired by Mertens et al. [32], which seamlessly fuses N images guided by N normalized weight maps and works at multiple scales using a pyramid image decomposition. We adapt this technique to our case. Define the l-th level in a Laplacian pyramid decomposition of the k-th image as L_k^l, and let G_k^l be defined analogously for a Gaussian pyramid decomposition.
We build the Laplacian pyramid of every RGB image for fusion, and the Gaussian pyramids of the saturation and value components for generating the full weight map. Then, we generate the full focus map W_k using the original-image level of the Gaussian pyramid, as it contains the full information. We separately compute the sparse focus map at each scale of the Gaussian pyramid, forming a sparse map pyramid Gs_k^l.
The sparse map pyramid is computed as

Gs_k^l = (ML_in,k^l)^α + (ML_ed,k^l)^β, (8)

where Gs_k^l is the l-th level of the k-th image of the sparse map pyramid, α and β are the weighting exponents of the corresponding focus measures, and ML_in,k^l and ML_ed,k^l are the corresponding sparse maps of the inner channel and edge channel at the l-th level of the k-th image.
To generate the Focus pyramid F_k^l, we build the Gaussian pyramid of W_k, forming a full weight map pyramid Gw_k^l. The Focus pyramid is obtained by multiplying Gs_k^l by Gw_k^l elementwise:
F_k^l = Gs_k^l · Gw_k^l, (9)

where F_k^l, Gs_k^l and Gw_k^l are the corresponding focus map, sparse map and weight map at the l-th level of the k-th image.
The mask pyramid is obtained by applying the following function to the Focus pyramid series of the image sequence:

M_{i,j}^l = argmax_k F_{i,j,k}^l, (10)

where M_{i,j}^l is the value of the l-th level of the mask pyramid at pixel (i, j), and argmax is the operator that returns the index k maximizing its argument. F is the focus map pyramid, and the subscript i, j, k denotes pixel (i, j) in the k-th image. Then the all-in-focus image pyramid is obtained under the guidance of the mask pyramid M: the intensity value at pixel (i, j) of the l-th level of the sharp image pyramid is taken from the same level and pixel of the k-th multi-focus image, where k is the value at pixel (i, j) of the l-th level of the mask pyramid. The full procedure is shown in Fig. 10. The R, G, B channels of the image are processed separately, using thread programming in Visual Studio 2017.
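A simplified single-channel sketch of the pyramid fusion with hard per-pixel argmax selection is given below (our illustrative reconstruction, not the authors' code: the 5-tap binomial kernel, the pyramid depth, and the use of a Gaussian pyramid of the full weight map as the selection score are our choices):

```python
import numpy as np

K5 = np.array([1., 4., 6., 4., 1.]) / 16.   # 5-tap binomial blur kernel

def _blur(img):
    """Separable blur with replicate padding."""
    p = np.pad(img, ((2, 2), (0, 0)), mode='edge')
    img = sum(K5[i] * p[i:i + img.shape[0], :] for i in range(5))
    p = np.pad(img, ((0, 0), (2, 2)), mode='edge')
    return sum(K5[i] * p[:, i:i + img.shape[1]] for i in range(5))

def _down(img):
    return _blur(img)[::2, ::2]

def _up(img, shape):
    big = np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)[:shape[0], :shape[1]]
    return _blur(big)

def gauss_pyramid(img, levels):
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(_down(pyr[-1]))
    return pyr

def lap_pyramid(img, levels):
    g = gauss_pyramid(img, levels)
    return [g[l] - _up(g[l + 1], g[l].shape) for l in range(levels - 1)] + [g[-1]]

def fuse_stack(stack, weights, levels=3):
    """stack / weights: lists of (H, W) images and their full weight maps W_k."""
    lps = [lap_pyramid(im, levels) for im in stack]
    wps = [gauss_pyramid(w, levels) for w in weights]
    fused = []
    for l in range(levels):
        mask = np.argmax(np.stack([wp[l] for wp in wps]), axis=0)  # per-pixel argmax
        layer = np.stack([lp[l] for lp in lps])
        fused.append(np.take_along_axis(layer, mask[None], axis=0)[0])
    out = fused[-1]                     # collapse the fused Laplacian pyramid
    for lap in reversed(fused[:-1]):
        out = _up(out, lap.shape) + lap
    return out
```

With two constant images and weights that always favour one of them, the collapsed result reproduces the favoured image exactly.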

V. EXPERIMENTS

A. QUANTITATIVE EVALUATION
Quantitative analysis has been performed using four widely used image metrics: the structural similarity based metric Q_s [39], the average gradient (AG) [40], the mean square deviation (MSD) [37] and the linear index of fuzziness (LIF) [41]. The four selected metrics are concisely introduced as follows.

1) STRUCTURAL SIMILARITY BASED METRIC Q_s
The metric Q s simultaneously takes the original images and edge images into consideration.
Q_W(A, B, F) = Σ_{w∈W} c(w)[λ(w)Q_0(A, F | w) + (1 − λ(w))Q_0(B, F | w)],
Q_s = Q_W(A, B, F) · (Q_W(A′, B′, F′))^α, (11)

where Q_0(A, F | w) refers to the UIQI [42] map obtained by a sliding-window operation over A and F. The saliency weight λ(w) is computed from the local variances of the source images, and c(w) is the normalized saliency over the family of all windows W. A′, B′ and F′ are the edge images of A, B and F, respectively, and α is a weight parameter controlling the weight of the edge term. A larger Q_s indicates better image fusion performance.
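The building block of this metric, the UIQI Q_0, can be sketched over a single global window (the full metric applies it in sliding windows with saliency weighting, which we omit here for brevity):

```python
import numpy as np

def uiqi(x, y, eps=1e-12):
    """Global (single-window) UIQI Q0 between images x and y:
    4 * cov(x, y) * mean(x) * mean(y) / ((var(x) + var(y)) * (mean(x)^2 + mean(y)^2))."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - mx) * (y - my)).mean()
    return 4 * cxy * mx * my / ((vx + vy) * (mx ** 2 + my ** 2) + eps)
```

For any non-constant image compared with itself the index is 1, its maximum; it degrades with loss of correlation, luminance shift, or contrast change.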

2) AVERAGE GRADIENT (AG)
The metric AG uses gradient information to measure the quality of fused images:

AG = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} sqrt( ((∂I(m, n)/∂m)^2 + (∂I(m, n)/∂n)^2) / 2 ), (12)

where ∂I(m, n)/∂m and ∂I(m, n)/∂n denote the gradients of the image in the horizontal and vertical directions, respectively, and M and N are the numbers of rows and columns of the image. A larger AG indicates more and sharper boundaries in the fused image.
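A sketch of this computation (one common AG variant; we approximate the partial derivatives by forward differences cropped to a common grid, an assumption on our part):

```python
import numpy as np

def average_gradient(img):
    """AG: mean root-mean-square of horizontal/vertical forward differences."""
    gx = np.diff(img, axis=0)[:, :-1]   # vertical differences on an (M-1) x (N-1) grid
    gy = np.diff(img, axis=1)[:-1, :]   # horizontal differences on the same grid
    return np.sqrt((gx ** 2 + gy ** 2) / 2).mean()
```

A constant image scores 0; a unit-slope ramp scores 1/sqrt(2), since only one of the two directional gradients is nonzero.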

3) MEAN SQUARE DEVIATION (MSD)
The metric MSD indicates the richness of image detail by computing the difference between the intensity and the mean intensity of the fused image; a larger MSD means a clearer resultant image.

MSD = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} (I(m, n) − Ī)^2, (13)

where M and N are the numbers of rows and columns of the image, m, n are the pixel indexes, and Ī is the mean intensity of the image.
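This is simply the variance of the intensities; a one-line numpy sketch:

```python
import numpy as np

def mean_square_deviation(img):
    """MSD: average squared deviation of intensities from the mean intensity."""
    return ((img - img.mean()) ** 2).mean()
```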

4) LINEAR INDEX OF FUZZINESS (LIF )
The metric LIF evaluates the enhancement of the resultant image. A smaller LIF refers to better enhancement of the fused image.
LIF = (2/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} min(p(m, n), 1 − p(m, n)),  p(m, n) = sin((π/2)(1 − I(m, n)/I_max)), (14)

where I(m, n) is the intensity of pixel (m, n), I_max is the maximum intensity, m, n are the pixel indexes, and M and N are the numbers of rows and columns of the image.

The quantitative comparisons over 20 pairs of test images are shown in Table 1; larger AG, MSD, Q_s and smaller LIF denote better performance. The best results are highlighted in bold. The proposed method remarkably outperforms the other methods in all quality metrics except the AG metric, because the DSIFT method introduces some edges near the focused/defocused boundary that are nonexistent in the source images, as shown in Fig. 11(d). The time complexity comparisons over the test images are also shown in Table 1. The BRW method is the fastest among the selected methods, costing 0.57 s to fuse two 520 × 520 images, while the proposed method costs 0.77 s to fuse two 2448 × 2048 images. When we use BRW to fuse two 1944 × 1296 images, it takes 4.61 s to obtain the fused image.
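The LIF computation can be sketched as follows (the sine-based membership function is one common formulation of the fuzziness index and is our assumption):

```python
import numpy as np

def linear_index_of_fuzziness(img, i_max=255.0):
    """LIF: 2/(MN) * sum of min(p, 1 - p), p = sin(pi/2 * (1 - I / Imax))."""
    p = np.sin(np.pi / 2 * (1 - img / i_max))
    return 2 * np.mean(np.minimum(p, 1 - p))
```

Pixels at the intensity extremes (0 or I_max) contribute nothing, while mid-grey pixels contribute the most, so a crisp high-contrast fusion drives LIF toward zero.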
We utilize thread programming to process the R, G, B channels simultaneously, which saves about two-thirds of the time. The fast guided filter [16] is utilized to maximize the time efficiency of the algorithm. Moreover, we use summation to combine the inner channel with the edge channel, which is more efficient than multiplication.
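The paper's implementation uses Visual Studio 2017 threads; an analogous per-channel dispatch can be sketched in Python (an illustration, not the authors' code; `per_channel_fn` is a hypothetical stand-in for the per-channel fusion pipeline):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_channels(rgb, per_channel_fn):
    """Run `per_channel_fn` on the R, G, B channels concurrently and restack.
    NumPy releases the GIL inside many array operations, so the three
    threads can genuinely overlap on heavy per-channel work."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        chans = list(pool.map(per_channel_fn, [rgb[..., c] for c in range(3)]))
    return np.stack(chans, axis=-1)
```

`pool.map` preserves channel order, so the output keeps the original R, G, B layout.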
B. QUALITATIVE ANALYSIS

1) LYTRO DATASET
Fig. 11 shows the visual comparison on the 'Lytro' dataset. We select three pairs of images to show the strength of the proposed method in three different situations: a focused/defocused boundary with a small defocused area, one with a large defocused area, and an acceptably sharp region.
The middle row ('Lytro-19') presents the power of the proposed method to cope with the acceptably sharp region. Source image A presents a small defocused region. CNN (Fig. 11(c)), GFDF (Fig. 11(e)), BRW (Fig. 11(f)) and NSCT-SR (Fig. 11(i)) fail in the case of fusing two images with large defocused areas. However, the proposed method (Fig. 11(j)) acquires a clearer area similar to that in source image A (Fig. 11(a)).
The bottom row ('Lytro-13') exhibits the ability of the proposed technique to address the focused/defocused boundary with a large defocused area. There is a sharp region consisting of grass in the source images, as shown in the enlarged region. BRW (Fig. 11(f)) and the proposed method (Fig. 11(j)) obtain the sharp region and maintain the sharp boundary between foreground and background. MGFF (Fig. 11(g)) and NSCT-SR (Fig. 11(i)) obtain the sharp region but a blurry edge.

2) COLLECTED DATASET
We collected four datasets that respectively include 22 defocused images of stone, 6 defocused images of flower, 8 defocused images of insect and 9 defocused images of metal mesh to show the strength of our method, as shown in Fig. 12-Fig. 15.

a: COLOR DISTORTION
Fig. 12 demonstrates the merit of the proposed method in coping with color distortion. Source image A (Fig. 12(a)) shows the sharp colorful area. BRW (Fig. 12(f)), GFDF (Fig. 12(e)) and the proposed method (Fig. 12(j)) obtain the sharp region with the right color. CNN (Fig. 12(c)), MGFF (Fig. 12(g)) and NSCT-SR (Fig. 12(i)) fail to acquire the clear region on the lower left area. MFF-Net (Fig. 12(h)) introduces color distortion in the green area, whereas DSIFT (Fig. 12(d)) introduces color distortion in the brown area.

b: BLURRY EFFECTS
Fig. 13 mainly shows the ability of the proposed technique to deal with blurry effects. Source image B (Fig. 13(b)) shows the sharp area. The results of CNN (Fig. 13(c)), GFDF (Fig. 13(e)), MGFF (Fig. 13(g)) and NSCT-SR (Fig. 13(i)) all have a white shade around the white flower. BRW (Fig. 13(f)) and DSIFT (Fig. 13(d)) introduce another tiny edge near the object. MFF-Net (Fig. 13(h)) and the proposed method (Fig. 13(j)) can acquire a sharp boundary similar to the source image.

c: TINY OBJECT
Fig. 14 presents the ability of our technique to cope with tiny objects. Source image B (Fig. 14(b)) shows the sharp object. BRW (Fig. 14(f)), GFDF (Fig. 14(e)) and the proposed method (Fig. 14(j)) obtain a sharper object compared with the others. MGFF (Fig. 14(g)) fails to acquire a clear image in this case.

d: OVEREXPOSURE
Source image A (Fig. 15(a)) shows the overexposure region and source image B (Fig. 15(b)) shows the clear area. Overexposure affects the focus metric, fusing more defocused regions into the resultant image, as shown in CNN (Fig. 15(c)), GFDF (Fig. 15(e)) and NSCT-SR (Fig. 15(i)). MGFF (Fig. 15(g)) and MFF-Net (Fig. 15(h)) fail to recognize the boundary of the overexposure region. Only DSIFT (Fig. 15(d)) and the proposed method (Fig. 15(j)) acquire a clear region similar to the source image.
The quantitative evaluation on the new dataset is shown in Table 2. The best results are highlighted in bold. The proposed method remarkably outperforms the other methods in all quality metrics except the AG metric, because the MGFF method fails to obtain a sharp image and introduces a lot of noise, as shown in Fig. 12-Fig. 15.

VI. CONCLUSION
In this paper, we have proposed a very simple but effective prior, called the inner channel prior, for multi-focus image fusion. The inner channel prior aims to detect the sharp region inside objects, whereas the edge channel aims to obtain the sharp region at the edges of objects. Combined with the edge channel, the inner channel can acquire the sharp areas of the whole image. Since the saturation is introduced in the inner channel, the proposed method can avoid the color distortion occurring in the fusion process. Meanwhile, the method is time-effective thanks to multi-threaded optimization. Experiments have shown the high efficiency and effectiveness of the prior compared with existing typical methods.
Since the inner channel prior is based on the HSV model, it would be invalid when the image sequences are gray-scale images. Gray-scale images don't contain the saturation component, so the inner channel prior can't be used to measure the focus degree inside objects. Moreover, as the inner channel is obtained by using guided filter, its accuracy may decrease when the halo artifacts are introduced by the guided filter.
Another common challenging problem in image fusion, misregistration, is not addressed in this paper. We leave it for future research.