Multi-Modal Image Fusion via Sparse Representation and Multi-Scale Anisotropic Guided Measure

Multi-modal image fusion plays an important role in various fields. In this paper, a novel multi-modal image fusion method based on robust principal component analysis (RPCA) is proposed, which consists of low-rank component fusion and sparse component fusion. In the low-rank component fusion part, a universal low-rank dictionary is constructed for sparse representation (SR), and the low-rank fusion is converted into sparse coefficient fusion by adopting batch-OMP. In the sparse component fusion part, an anisotropic weight map is constructed to express the salient structures of the images. Moreover, a multi-scale anisotropic guided measure is proposed to guide the fusion process, which extracts and preserves the scale-aware salient details of the sparse components. Finally, the multi-modal fusion is achieved by combining the two fusion parts. The experimental results validate that the proposed method outperforms nine state-of-the-art methods in multi-modal fusion at both gray-gray and gray-color scales, in terms of qualitative and quantitative evaluations.


I. INTRODUCTION
With the development of modern technology, the requirement for complete information acquisition is increasing, so multi-sensor systems play an important role in many fields. To obtain a better composite image for further visual and processing tasks, image fusion has become a research hotspot and has been widely employed in computer vision, military surveillance, medical imaging, remote sensing, and so on [1]-[5].
Current image fusion methods are mainly divided into spatial domain based and transform domain based methods according to their processing domain [6], [7]. In the spatial domain, the fused image is constructed by combining the input images at pixel level or block level. These methods mainly select salient pixels or regions with higher clarity to fuse the multi-modal images [8]. Direct fusion at pixel level decreases the edge contrast, and the region fusion effect relies too much on the segmentation of different details. Meanwhile, such methods are often subject to noise interference and blocking artifacts, although their computational complexity is low [9]. Zhang proposed to fuse infrared and visible images through infrared feature extraction and visual information preservation, a method especially designed for low-light circumstances [10]. In this fusion type, the image details may be preserved well; however, the continuity is hardly guaranteed.
The associate editor coordinating the review of this manuscript and approving it for publication was Inês Domingues.
The transform based methods generally adopt multi-scale analysis. These methods decompose the image into different subbands to extract edge details or salient structures at different scales. Then, the image fusion is converted into the fusion of these subbands by using some rules. Finally, the fused image is obtained through the inverse multi-scale transform. The major advantage of such methods is that more details of the multi-modal images are well preserved. The traditional methods include curvelet transform (CVT) [11], [12], discrete wavelet transform (DWT) [13], [14], shearlet transform [15]-[17], dual-tree complex wavelet transform (DTCWT) [18], [19] and nonsubsampled contourlet transform (NSCT) [20]-[22]. Different transforms have their own advantages in feature preservation, and much research has been performed to find optimal combinations. Li and Yang [23] proposed a multifocus image fusion method by combining CVT and wavelet transform. Chen et al. [24] proposed an image fusion method based on neighborhood characteristic and regionalization in the NSCT domain. However, the fused effect heavily depends on the fusion rule, and many rules do not take the consistency of global features into account, so finding an optimal transform basis is not easy. Zhao designed a visual weight map to extract saliency in decomposed scales [25]. Feng proposed a total variation model to retain the texture details and maintain the edges in infrared and visible image fusion [26]. Zhu proposed a new activity measure consisting of phase congruency, sharpness change and local sharpness change measures to realize a highly informative fusion of multiple high-pass subbands [27]. Zhu also proposed a total-variation based method to accomplish fusion on cartoon and texture components, respectively [8]. These studies extract the details and structures better in the transform space than in the original image by designing corresponding extraction rules.
Additionally, sparse representation (SR) has become popular in fusion because it exploits the self-similarity of natural images [28], [29] and reveals intrinsic characteristics of the human visual system (HVS). The main idea of SR is that a given signal can be represented by a linear combination of a few atoms in an overcomplete dictionary [30]. Yang first introduced SR into image fusion in [31]. Wei adopted SR in hyper-spectral and multispectral image fusion [32]. Liu proposed an image fusion method with convolutional sparse representation [33], and so on. Meanwhile, Group SR, Robust SR and Nonnegative SR have been proposed in recent years. However, the fusion effect of SR-based methods relies on the dictionary to a great extent, and the diversity of high-frequency signals restricts the dictionary construction. Moreover, SR based on a single image component lacks flexibility in fusion. Liu proposed a convolutional sparsity based morphological component analysis and achieved multi-component and global SRs of source images [34]. Thus, SR based fusion of multiple components can decrease the complexity of the dictionary and improve the adaptivity of the fusion.
Motivated by the methods above, we study the fusion scheme under robust principal component analysis (RPCA) [35] to pursue more complete background and abundant structure information in this paper. On one hand, instead of adopting SR on the original image, the SR based fusion is performed on the low-rank components of the RPCA result to preserve the background more flexibly and robustly. In this way, the overcomplete dictionary of the low-rank components can be constructed easily with higher representativeness and universality, and the problem of low-rank fusion is converted into finding the best fusion of the sparse coefficients encoded by batch-OMP (orthogonal matching pursuit). On the other hand, a multi-scale anisotropic guided measure is proposed to guide the sparse component fusion. Specifically, the anisotropic weight map is constructed by combining Laplace filtering, a dilation operation and heat diffusion. Then, the multi-scale guided map is constructed by using the designed multi-scale guided filter and taking the original images as the guidance images. Finally, the sparse component fusion is completed with the guidance of the multi-scale anisotropic guided map. Thus, the structure information can be preserved according to its contrast along a certain direction in the multi-scale sparse components, which is not fully considered in many existing coefficient fusion rules. Both multi-scale structure information and global spatial consistency are preserved by the proposed measure.
Specifically, the contributions of the proposed fusion scheme can be summarized as follows:
• An overcomplete dictionary of low-rank components is constructed, which gives rise to a robust and universal representation of the low-rank components. The batch-OMP is utilized to enhance the efficiency of sparse coding.
• The anisotropic weight map is constructed, which well respects the anisotropic directional structures and salient objects of the images. Moreover, a multi-scale guided filter is designed to extract the scale-aware structures and improve the spatial consistency.
• A multi-scale anisotropic guided measure is proposed to guide the sparse components fusion, which integrates the original and complementary information into final fusion result.
The rest of the paper is organized as follows. Section II presents the proposed methodology in detail. The experiment results and analysis are presented in Section III. Finally, the conclusions are given in Section IV.

A. FUSION FRAMEWORK
The key issues in multi-modal image fusion are how to extract the effective information in different bands and how to combine it. Taking a color image and a gray image as an example, the flow of the proposed fusion method is depicted in Fig. 1. The color image is converted into YUV space, in which the Y component represents the signal intensity and is used for fusion with the gray image. In detail, both are first decomposed into low-rank components and sparse components, which represent the background and the structure details, respectively. The low-rank components are encoded by batch-OMP under the overcomplete dictionary, and their corresponding coefficients are fused to produce the low-rank fusion result. Meanwhile, the sparse components are fused using the proposed multi-scale anisotropic guided measure. Finally, the fused low-rank component, the fused sparse component and the chrominance components (U and V) are combined to obtain the final fused image.

B. RPCA DECOMPOSITION MODEL
The core idea of RPCA [36] is that a data matrix can be represented as the superposition of a low-rank component and a sparse component under low-rank and sparse optimization criteria. An input matrix $X \in \mathbb{R}^{M \times N}$ generally contains background, structure information and noise, so it can be decomposed as:

$$\min_{L,S}\ \operatorname{rank}(L) + \lambda \|S\|_0 \quad \text{s.t.} \quad X = L + S \tag{1}$$

where $L$ is the principal matrix known to be low-rank, and $S$ is sparse because it represents salient details and noise. Although Eq. (1) is NP-hard, Wright et al. proposed a solution based on tractable convex optimization:

$$\min_{L,S}\ \|L\|_* + \lambda \|S\|_1 \quad \text{s.t.} \quad X = L + S \tag{2}$$

where $\|\cdot\|_*$ represents the nuclear norm of a matrix, $\|\cdot\|_1$ is the $l_1$-norm, and $\lambda$ is a weighting parameter that is usually set to $1/\sqrt{\max(M, N)}$. As shown in Fig. 2, the objects with high temperature difference can be modeled as the sparse component in the infrared image. A visible image contains abundant scene background information and texture details [36]. To separate the sparse details from the background more accurately, we adopt the IALM (inexact augmented Lagrange multipliers) method instead of the exact ALM (augmented Lagrange multipliers) method to solve the convex problem in Eq. (2) [37], in which an exact solution is not needed.

In the RPCA decomposition model, the higher $\lambda$ is, the greater the weight of the low-rank component, i.e., more image information is judged as background and the fused image lacks distinct details. As shown in Fig. 3, the IR image is decomposed into low-rank components (first row) and sparse components (second row) with different $\lambda$. It can be seen that the people, chimney, road and other salient objects are better extracted in the sparse components when $\lambda$ is set to 0.02. Therefore, an appropriate $\lambda$ is important to preserve enough background and structure details. In addition, $\lambda = 1/\sqrt{\max(M, N)}$ is suggested in [38], but it does not work well in many of our experiments. We find a parameter definition that achieves a better compromise between the fusion effects of the two components: the $\lambda$ of the original image with the lower average value is defined as $\lambda = \frac{10}{3 \times \min(M, N)}$, and that of the other image is defined as $\lambda = \frac{10}{3 \times \max(M, N)}$.
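The decomposition in Eq. (2) can be illustrated with a minimal inexact-ALM sketch. This is not the authors' MATLAB implementation: the function names `rpca_ialm` and `soft_threshold`, the dual-variable initialization, the penalty growth factor `rho` and the stopping tolerance are assumptions borrowed from the commonly used IALM recipe, and the default weighting is the usual $1/\sqrt{\max(M, N)}$.

```python
import numpy as np

def soft_threshold(x, tau):
    """Element-wise shrinkage: sign(x) * max(|x| - tau, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def rpca_ialm(X, lam=None, tol=1e-7, max_iter=500, rho=1.5):
    """Split X into a low-rank part L and a sparse part S (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))       # usual default weighting
    norm_fro = np.linalg.norm(X, "fro")
    norm_two = np.linalg.norm(X, 2)           # largest singular value
    # Assumed initialisation of the multiplier Y and the penalty mu.
    Y = X / max(norm_two, np.abs(X).max() / lam)
    mu = 1.25 / norm_two
    L = np.zeros_like(X)
    S = np.zeros_like(X)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding with threshold 1/mu.
        U, sig, Vt = np.linalg.svd(X - S + Y / mu, full_matrices=False)
        L = (U * soft_threshold(sig, 1.0 / mu)) @ Vt
        # Sparse update: element-wise shrinkage with threshold lam/mu.
        S = soft_threshold(X - L + Y / mu, lam / mu)
        residual = X - L - S
        Y = Y + mu * residual                 # multiplier update
        mu *= rho                             # increase the penalty
        if np.linalg.norm(residual, "fro") <= tol * norm_fro:
            break
    return L, S

# Toy data: a rank-3 background plus a few large sparse outliers.
rng = np.random.default_rng(0)
low_rank = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 40))
sparse = np.zeros((40, 40))
np.put(sparse, rng.choice(1600, size=80, replace=False), rng.uniform(-10, 10, size=80))
L, S = rpca_ialm(low_rank + sparse)
```

Passing a smaller `lam` moves more content into the sparse part, which mirrors the role of $\lambda$ discussed above.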

C. LOW-RANK COMPONENT FUSION
The basic assumption of SR is that a signal can be approximately represented by a linear combination of a few atoms from an overcomplete dictionary [23]. The goal of SR is to find the sparsest coefficients, i.e., those containing the fewest nonzero entries among all feasible solutions. Most existing methods directly collect natural image patches and train on them, but the resulting dictionary sometimes performs unsatisfactorily because the types and quantity of the training samples are far from sufficient. A learning-based approach can overcome this problem in dictionary construction; however, it needs constant attention from the experimenter. The main reason is that natural image patches are very diverse. We find that the similarity among low-rank component patches is higher than that among natural image patches, so we construct the overcomplete dictionary by training on low-rank patches to improve its robustness and flexibility. Suppose there are K training patches y of size √n × √n, which are arranged as column vectors. Note that their mean values should be subtracted before construction. The K-SVD [39] is selected for dictionary learning because of its simplicity and efficiency, and the dictionary D is learned so that each training patch is sparsely represented over D. The pre-fused images I_A and I_B are divided into √n × √n patches, ordered as column vectors and sparsely coded over D, where α_i denotes the sparse coefficient of the ith patch. In this process, OMP is a traditional and effective way for sparse coding. Thus, the original image patches are represented as sparse coefficients in a common dictionary D.
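The patch preparation described above (sliding window, column-vector ordering, mean subtraction) can be sketched as follows. The helper name `extract_patches` and the `step` parameter are hypothetical, and the dictionary training itself (K-SVD) is not shown.

```python
import numpy as np

def extract_patches(image, patch=8, step=1):
    """Return zero-mean patch vectors (one per column) and the removed means."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    cols = []
    for i in range(0, h - patch + 1, step):
        for j in range(0, w - patch + 1, step):
            # Flatten each patch into a vector of length patch * patch.
            cols.append(image[i:i + patch, j:j + patch].reshape(-1))
    V = np.stack(cols, axis=1)       # shape: (patch * patch, number of patches)
    means = V.mean(axis=0)           # one mean per patch
    return V - means, means

# A 10 x 10 image yields 3 x 3 = 9 overlapping 8 x 8 patches with step 1.
image = np.arange(100, dtype=float).reshape(10, 10)
V, means = extract_patches(image, patch=8, step=1)
```

The means are returned separately so that they can be added back after the coded patches are reconstructed.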
The sliding window technique is utilized to divide the training images and pre-fused images into patches of size √n × √n. Following a typical example [22], we verify the fusion results for √n = 2, 4, 8, 12, 16 in dictionary training, and choose √n = 8 as a compromise between computation and redundancy. After obtaining the dictionary D, the batch-OMP is adopted for sparse coding because its complexity is lower and it does not depend on the dictionary implementation. Because the low-rank component reflects the basic information of the background, we adopt the maximum principle for sparse coefficient fusion. The fused low-rank component L_F is then calculated from the fused coefficients and the dictionary D.
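A minimal sketch of the coding and fusion step follows, under stated assumptions: plain OMP stands in for batch-OMP, the "maximum principle" is interpreted as keeping, for each patch, the coefficient vector with the larger $l_1$-norm, and a toy identity matrix replaces the learned K-SVD dictionary. The function names `omp` and `fuse_max_l1` are illustrative only.

```python
import numpy as np

def omp(D, y, sparsity, tol=1e-10):
    """Plain orthogonal matching pursuit (not the batch variant)."""
    y = np.asarray(y, dtype=float)
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        if np.linalg.norm(residual) <= tol:
            break
        # Pick the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Re-fit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    if support:
        alpha[support] = coef
    return alpha

def fuse_max_l1(A, B):
    """Per patch (column), keep the coefficient vector with the larger l1-norm."""
    pick_a = np.abs(A).sum(axis=0) >= np.abs(B).sum(axis=0)
    return np.where(pick_a, A, B)

D = np.eye(4)  # toy orthonormal dictionary, stands in for a learned one
alpha_a = omp(D, np.array([3.0, 0.0, 1.0, 0.0]), sparsity=2)
alpha_b = omp(D, np.array([0.0, 2.0, 0.0, 0.0]), sparsity=2)
fused = fuse_max_l1(alpha_a[:, None], alpha_b[:, None])
patch_f = D @ fused            # fused patch rebuilt from the dictionary
```

In a full pipeline the patch means removed before coding would be fused and added back, and overlapping patches would be averaged when the image is reassembled.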

D. MULTI-SCALE ANISOTROPIC GUIDED MEASURE
The guided filter is based on a local linear model, which makes it suitable for applications such as image matting, upsampling and colorization [40], [41]. The guided filter is used to guide the fusion measure of the sparse component because of its good edge-preserving smoothing ability. In the proposed fusion framework, the detail guided measure plays an important role in constructing a weight map and guiding the sparse component fusion. To improve the performance of the detail guided measure, a multi-scale anisotropic guided measure is introduced in this section. The guided filtering principle can be described as:

$$Z_i = a_k Y_i + b_k, \quad \forall i \in r_k \tag{5}$$

where $Z_i$ and $Y_i$ are the ith pixel values of the output image $Z$ and the guidance image $Y$, and $r_k$ is the local window. The linear coefficients $a_k$ and $b_k$ can be estimated by:

$$a_k = \frac{\frac{1}{|r|}\sum_{i \in r_k} Y_i X_i - \mu_k \bar{X}_k}{\delta_k^2 + \eta}, \qquad b_k = \bar{X}_k - a_k \mu_k$$

where $X$ is the input image, $\delta_k^2$ and $\mu_k$ are the variance and the mean of $Y$ in $r_k$, $\eta$ is the regularization parameter, $|r|$ is the number of pixels in $r_k$, and $\bar{X}_k$ is the mean of $X$ in $r_k$. In Eq. (5), the value of $Z_i$ changes when the window size $r_0$ of $r_k$ is different. So, all the values of the coefficients $a_k$ and $b_k$ are first averaged if there are several different $r_0$. Thus, the output image is estimated as:

$$Z_i = \bar{a}_i Y_i + \bar{b}_i$$

where $\bar{a}_i$ and $\bar{b}_i$ denote the averaged coefficients.

The weight map is calculated to guide the fusion according to the actual features of the pre-fused images. However, the low-frequency information always causes much interference. Moreover, the detail extraction is not satisfactory because many filtering methods are isotropic, so the magnitude and direction are not fully considered in the filtering process. To overcome these two deficits, we propose to construct an anisotropic map. Firstly, the high-frequency details $H_n$ are obtained by applying Laplace filtering to the original images:

$$H_n = I_n * L$$

where $I_n$ is the nth original image and $L$ is a 3×3 filter taken from [41]. Then, we adopt a dilation operation on $H_n$ to avoid the inner loss of salient objects.
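Therefore, the weight map can better correspond to the decomposition images of the sparse component. The dilation is denoted by $H_n \oplus Se$, where $Se$ is a dilation template. Next, anisotropic heat diffusion is utilized to reveal the intrinsic structure and discover the local anatomical structure. The heat diffusion process over the manifold $\mathcal{M}$ is governed by the heat equation:

$$\left(\Delta_{\mathcal{M}} + \frac{\partial}{\partial t}\right) u(x, t) = 0$$

where $\Delta_{\mathcal{M}}$ is the Laplace-Beltrami operator of $\mathcal{M}$ and $u(x, t)$ is the heat distribution at point $x$ and time $t$. The heat kernel is one of the generic solutions and is defined as:

$$h_t(x, y) = \sum_{i} e^{-\lambda_i t} \phi_i(x) \phi_i(y)$$

where $\lambda_i$ and $\phi_i$ are the ith eigenvalue of the Laplacian matrix and its corresponding eigenvector. Thus, the heat diffusion based image of the nth source can be obtained, which preserves the anisotropic features.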
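The Laplace filtering and dilation steps that precede the heat diffusion can be sketched as below. Since the specific 3×3 filter and dilation template are not reproduced in the text, the sketch assumes the common 4-neighbour Laplacian kernel and a flat 3×3 square template, and it takes the absolute filter response as the detail strength; all three choices, as well as the helper names `filter3x3` and `dilate`, are assumptions. The heat diffusion step is not shown.

```python
import numpy as np

# Assumed 4-neighbour Laplacian kernel (the paper's own 3x3 filter is not shown).
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def filter3x3(image, kernel):
    """Correlate an image with a 3x3 kernel, replicating edges at the border."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    p = np.pad(image, 1, mode="edge")
    out = np.zeros((h, w))
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * p[di:di + h, dj:dj + w]
    return out

def dilate(image, radius=1):
    """Grey-scale dilation with a flat square template of side 2 * radius + 1."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    k = 2 * radius + 1
    p = np.pad(image, radius, mode="edge")
    windows = [p[di:di + h, dj:dj + w] for di in range(k) for dj in range(k)]
    return np.max(windows, axis=0)

# A single bright pixel: the Laplacian responds at it and at its 4 neighbours,
# and the dilation then spreads the strongest response over a 3x3 block.
image = np.zeros((7, 7))
image[3, 3] = 1.0
detail = np.abs(filter3x3(image, LAPLACIAN))
dilated = dilate(detail, radius=1)
```

The dilation fills the interior of a salient object, where the Laplacian response would otherwise be weak.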
The higher the value of the heat diffusion result is, the more weight the corresponding pre-fused image should take in fusion. Finally, the initial anisotropic weight map $W^t_{initial}$ for the tth pre-fused image is obtained from these heat diffusion results. The weight map at the first scale is calculated from the $W_{initial}$ map guided by the original image, and each subsequent weight map is calculated from the previous result:

$$W^1_{MGF} = G(W_{initial}, I), \qquad W^j_{MGF} = G(W^{j-1}_{MGF}, I), \quad j \ge 2$$

where $I$ is the original image used as the guidance image, $W_{initial}$ is the initial anisotropic weight map, $W^j_{MGF}$ is the guided filtering weight map at the jth scale, and $G(\cdot)$ represents the guided filter function.
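The multi-scale recursion described above, in which each scale filters the previous weight map under the guidance of the original image, can be sketched with a plain box-filter guided filter. Treating the window sizes 2, 4 and 8 as window radii, the value of the regularization parameter `eta`, the edge-replicating border handling and the names `box_mean`, `guided_filter` and `multiscale_guided` are assumptions made for illustration.

```python
import numpy as np

def box_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, computed with a summed-area table."""
    h, w = img.shape
    k = 2 * r + 1
    p = np.pad(img, r, mode="edge")
    s = np.pad(np.cumsum(np.cumsum(p, axis=0), axis=1), ((1, 0), (1, 0)))
    total = s[k:k + h, k:k + w] - s[:h, k:k + w] - s[k:k + h, :w] + s[:h, :w]
    return total / (k * k)

def guided_filter(X, Y, r, eta):
    """Filter input X under guidance Y with window radius r and regulariser eta."""
    mean_Y = box_mean(Y, r)
    mean_X = box_mean(X, r)
    var_Y = box_mean(Y * Y, r) - mean_Y ** 2
    cov_YX = box_mean(Y * X, r) - mean_Y * mean_X
    a = cov_YX / (var_Y + eta)           # local linear coefficient a_k
    b = mean_X - a * mean_Y              # local linear coefficient b_k
    # Average the coefficients of all windows covering each pixel.
    return box_mean(a, r) * Y + box_mean(b, r)

def multiscale_guided(W_initial, I, radii=(2, 4, 8), eta=1e-3):
    """Return one weight map per scale; each scale filters the previous map."""
    maps, W = [], np.asarray(W_initial, dtype=float)
    for r in radii:
        W = guided_filter(W, I, r, eta)
        maps.append(W)
    return maps

rng = np.random.default_rng(0)
I = rng.random((32, 32))                   # stand-in guidance image
W_initial = (I > 0.5).astype(float)        # stand-in initial weight map
maps = multiscale_guided(W_initial, I)
```

Larger radii smooth the weight map more strongly, while the guidance image keeps the transitions aligned with its edges.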
To explore the relationship between the window sizes and the multi-scale guided filtering results, we take the initial anisotropic weight map and the original IR image as the input and guidance map, respectively. In the first group of experiments, the scale is set to 3, and the window sizes are set to 2, 4 and 8. As shown in Fig. 4(c)-(e), the weight maps extract more distinct details, but bring in more smooth background information at the same time. The second group of experiments is performed with the window size r = 2, 4 and 8. The results indicate that the regions containing the high-frequency targets are well segmented, and the representation ability of edge details is improved, as shown in Fig. 4(f)-(h).
Once obtained, the weight maps are treated as the multi-scale anisotropic guided measure for sparse component fusion. The multi-modal images are merged through normalized summation at each scale, respectively.
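A minimal sketch of such a normalized summation at a single scale is given below. It assumes a per-pixel weighted average in which the weights of all sources are normalized to sum to one, with an equal-weight fallback where every weight vanishes; the function name `fuse_weighted` and the fallback rule are illustrative assumptions.

```python
import numpy as np

def fuse_weighted(components, weights, eps=1e-12):
    """Per-pixel normalised weighted sum of several sparse components."""
    components = np.stack(components).astype(float)
    weights = np.stack(weights).astype(float)
    total = weights.sum(axis=0)
    # Normalise the weights; use equal weights where all maps are zero.
    norm = np.where(total > eps,
                    weights / np.maximum(total, eps),
                    1.0 / len(weights))
    return (norm * components).sum(axis=0)

S_a = np.array([[1.0, 2.0], [3.0, 4.0]])
S_b = np.array([[5.0, 6.0], [7.0, 8.0]])
W_a = np.array([[1.0, 0.0], [1.0, 0.0]])
W_b = np.array([[0.0, 1.0], [1.0, 0.0]])
S_f = fuse_weighted([S_a, S_b], [W_a, W_b])
```

In this toy case the fused result takes the first source where only its weight is nonzero, the second source where only that weight is nonzero, and the average of both elsewhere.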
Similarly, the proposed method can be extended to fuse color and gray images. Direct fusion in the RGB space of the color image may cause color distortion. Considering human perception, the YUV space can be used to encode a color image into one luminance component (Y) and two chrominance components (U and V). Therefore, the YUV color space is adopted to accomplish gray and color image fusion. The conversion from RGB to YUV is a fixed linear transform of the three channels. Thus, the Y component is treated as the gray component of the color image and fused with the gray image. Finally, the fused color image is obtained through the inverse YUV transformation.
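The color handling can be sketched as follows. The conversion matrix is not reproduced in the text, so the widely used BT.601-style coefficients are assumed here, and the inverse is obtained numerically so that the round trip is exact. The function names and the `fused_luma` argument are hypothetical.

```python
import numpy as np

# Assumed RGB -> YUV coefficients (common BT.601-style values).
RGB2YUV = np.array([[0.299, 0.587, 0.114],
                    [-0.147, -0.289, 0.436],
                    [0.615, -0.515, -0.100]])
YUV2RGB = np.linalg.inv(RGB2YUV)

def rgb_to_yuv(rgb):
    """Convert an (..., 3) RGB array to YUV."""
    return np.asarray(rgb, dtype=float) @ RGB2YUV.T

def yuv_to_rgb(yuv):
    """Convert an (..., 3) YUV array back to RGB."""
    return np.asarray(yuv, dtype=float) @ YUV2RGB.T

def fuse_gray_color(rgb, fused_luma):
    """Replace the luminance of a color image and keep its chrominance."""
    yuv = rgb_to_yuv(rgb)
    yuv[..., 0] = fused_luma      # Y comes from the gray-gray fusion result
    return yuv_to_rgb(yuv)        # U and V are carried over unchanged

rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3))
luma = rgb_to_yuv(rgb)[..., 0]
restored = fuse_gray_color(rgb, luma)   # unchanged luma reproduces the input
```

Because only the Y channel is replaced, the chrominance of the color source is transferred directly to the fused image.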
In color-color image fusion, such as PET and SPECT fusion, the color images are first converted into YUV space. The fusion is then treated as luminance component (Y) fusion and chrominance component (U and V) fusion: the former is accomplished according to the gray-gray fusion method, and the latter selects the maximum value of the chrominance components as the fused result to preserve more color information. Finally, the color-color fused image is obtained through the inverse YUV transformation.

III. EXPERIMENT RESULTS AND ANALYSIS
To demonstrate the effectiveness of the proposed method in multi-modal fusion, we conduct extensive experiments. Here we select six typical pairs of multi-modal images, including three color-gray pairs, to analyze in detail, and another ten pairs to present further fusion results. The proposed method is compared with nine image fusion methods, i.e., the morphological difference pyramid (MDP) based method [43], the gradient pyramid (GP) based method [44], the SR based method [22], the DTCWT based method [19], the NSCT based method [20], the NSCT pulse coupled neural network (NSCT-PCNN) based method [45], Zhang's method [10], the ASR method [46] and the GFCE method [47]. The parameter settings of these methods are as follows. Four decomposition levels, the 'averaging' scheme for the low-pass subband and the absolute maximum choosing scheme for the high-pass subbands are selected for the MDP based method, the GP based method and the DTCWT based method. Specifically, 'legall' and 'qshift_06' are selected as the DTCWT decomposition filters. For the SR based method, the number of overlapped pixels is set to 6, and the max-L1 rule is selected for coefficient fusion. Three decomposition levels, 'pyrexc' for the pyramidal decomposition and 'vk' for the directional decomposition are adopted in the NSCT based method. Five decomposition levels, '9-7' for the pyramidal decomposition, 'pkva' for the directional decomposition and the maximum choosing scheme for the high-pass and low-pass subbands are adopted in the NSCT-PCNN based method. The parameters of Zhang's method, the ASR method and the GFCE method are set to the defaults in the respective publications. In our method, λ is defined according to Section II-B. The number of levels of the multi-scale anisotropic guided measure is set to three and its filter window sizes are set to 2, 4 and 8, respectively. Note that artifacts appear if the window size is too small or too large.
The Laplace filter is set as in [41], and a fixed dilation template Se is used. The experiments are implemented using MATLAB 2014b on a notebook PC with a 2.2 GHz Intel Core CPU and 4 GB memory.

A. QUALITATIVE EVALUATION
In image fusion, qualitative evaluation means visually comparing the fused images with the human visual system (HVS) according to the observers' prior knowledge. In this subsection, twenty people are selected as observers and two representative regions are enlarged as close-ups in each fused image for better comparison. Fig. 5 shows the fused images obtained with different methods for the ''UN camp'' images. All the methods achieve the fusion effect to a certain extent. The MDP based method can only extract details in some regions. For example, there are distortions at the house roof and its edge. The contrast between the human and the background is too low to distinguish the human in the GP and NSCT based methods. The consistency of the whole fused image is poor in the result of the SR based method. There are some artifacts around the human in the result of the DTCWT based method. There are too many low-frequency signals to highlight the details in the ASR method result. Overall, the proposed method performs better than the other comparison methods, except the GFCE method, on energy preservation and detail extraction. Fig. 6 shows the fused images obtained with different methods for the ''Road2'' images. The pedestrian disappears in the light in the result of the MDP based method. The image energies are low in the results of the GP based method, the NSCT based method and the ASR method, in which the contrast of the roof is not obvious enough. The objects, such as the car, the roof lamps and the pedestrian, are interfered with by the lights in the fused images of all the comparison methods. On the contrary, the proposed method achieves a better fusion result with a more prominent pedestrian and richer background information. Fig. 7 shows the fused results of the medical images. Artificially distorted traces appear at the brain boundary in the fused image of the MDP based method. In the NSCT based result, the high-frequency signals are too weak to be distinguished.
Similarly, the bright parts in the brain of source 2 are more or less lost in all the comparison methods. In contrast, [...] so weak to identify. The sky is not so continuous in the GFCE fusion result.
In Fig. 9, the background details of the far scene are preserved well in the results of the DTCWT, NSCT and NSCT-PCNN based methods, Zhang's method, the GFCE method and the proposed method. However, the MDP and GP based methods cannot handle the color information well. The yellow smoke is not continuous in the NSCT-PCNN based fusion image. Moreover, there is too much noise in the GFCE fusion result, which interferes with the grass and roof edges. The cloud contrast in the sky is very low in Zhang's fusion result. The proposed method yields high cloud contrast, preserves more details and maintains continuity.
In Fig. 10, a set of MR and PET image fusion results is taken as the example. Color distortions appear in the fused images of the comparison methods except the NSCT-PCNN based method, Zhang's method and the GFCE method. The color distortion in the SR based result is so severe that it is obviously not suitable for observation. The high-frequency signals of source 1 are interfered with by color in Zhang's fusion result. Overall, the NSCT-PCNN based method, the GFCE method and the proposed method obtain relatively satisfactory effects.

B. QUANTITATIVE EVALUATION
In order to assess the fusion more objectively, five fusion metrics, i.e., the information theory based metric (Q MI ), the gradient-based metric (Q G ), the phase congruency based metric (Q P ), the structural similarity based metric (Q E ) and Yang's metric (Q Y ), are adopted. The default parameters given in the related publications are adopted for these quality indexes [48], [49]. The quantitative performances of different methods are shown in Tables 1-6, in which the bold values indicate the greatest value in each row, i.e., the best performance among all the methods. We can see from the six tables that the proposed method obtains the highest score for most of the metrics.
Specifically, the whole image of the GP based method is smooth, especially at corners and edges, so its phase congruency based metric is higher than ours in Table 1. The GFCE method obtains a better effect on energy and high-frequency detail preservation than our method. In the Fig. 5 experiment, the light on the square in our fused image is too bright to express the structure details, so the Q E metric of the proposed method only ranks fourth. The lights of the cars and motors are preserved more in Zhang's fusion result, so its Q MI obtains a higher value in Table 2. However, too much light from the cars and motors is not beneficial for identifying these targets. In Table 3, the proposed method tends to preserve the bright parts from source 2, which covers up some low gray details. Therefore, many gradients in the fused image are lost, which leads to structure weakness, so Q G and Q Y only rank fourth. In Table 4, the OETEC fusion result of SR nearly fails because too many discontinuous blocks exist in the image. However, these bright blocks increase the amount of information and gradients, which are described by the Q MI and Q G metrics, respectively. In Table 5, the reason that GP gets a higher Q P is similar to that in Table 1. In Table 6, the Q E of the proposed method is nearly equal to the highest value, which indicates that the proposed method also obtains a good effect.
In the six groups of fusion experiments under five assessment metrics, there are 30 metric values in total, and the proposed method ranks first, second, third and fourth 17, 8, 1 and 4 times, respectively. According to [41], the five quality metrics should be considered together to evaluate the real performance. In our view, the fusion quality should be evaluated by combining qualitative and quantitative evaluations, rather than relying on a single one. In summary, it is demonstrated that the proposed method outperforms the nine comparison fusion methods and achieves a state-of-the-art fusion effect.
The main difference between the fused gray-gray and gray-color scale images is whether the color is well mapped into the gray details. In RGB images, the color and gray intensity are coupled together. With the proposed solution, which converts RGB to YUV, the gray-color fusion is decomposed into gray-gray fusion and color mapping. The color components (U and V) are directly mapped into the gray-gray fused image to obtain the final color fused image. Thus, the details and the color are preserved well in the fused result, as shown in Fig. 8-10.
The average time costs of all the methods are shown in Table 7. The resolutions of the six image pairs are 320 × 240, 256 × 256, 256 × 256, 340 × 255, 616 × 454 and 256 × 256, respectively. It can be seen that the proposed method only ranks sixth and its computation cost is much higher than that of the first five methods. The main reason for the high computation cost is that the sparse representation occupies 96.4% of the running time of the proposed method. However, the proposed method is more efficient than the other SR based methods, which indicates that the proposed low-rank component SR fusion is effective.
Furthermore, Fig. 12 and Fig. 13 give multi-modal image fusion examples at gray-gray and gray-color scales, respectively. The fusion results preserve the details and color information well and improve the visual observation quality. Therefore, the proposed method accommodates multi-modal image fusion well.

IV. CONCLUSION
In this paper, a novel multi-modal image fusion method is proposed by decomposing the fusion into low-rank component fusion and sparse component fusion. The low-rank components have similar features, so a universal low-rank dictionary is constructed and the batch-OMP is utilized to improve the effect and efficiency of sparse coding. To preserve more energy and details of the sparse components, a multi-scale anisotropic measure is proposed to construct multi-scale weight maps as the guidance for the sparse component fusion. Moreover, the multi-scale weight maps improve the spatial consistency and suppress artifacts and faults. The qualitative and quantitative evaluations verify the better performance of the proposed method on multi-modal image fusion. In the future, we will investigate how to expand the application area and select more appropriate parameters.