An Infrared and Visible Image Fusion Method Guided by Saliency and Gradient Information

Infrared and visible image fusion is a hot topic due to the perfect complementarity of their information. There are two key problems in infrared and visible image fusion. One is how to extract significant target areas and rich texture details from the source images, and the other is how to integrate them to produce satisfactory fused images. To tackle these problems, we propose a novel fusion framework in this paper. A multi-level image decomposition method is used to obtain the base layer and detail layer of the source image. For the fusion of the base layer, an ingenious fusion strategy guided by the saliency map of the source image is designed to improve the intensity of salient targets and the visual quality of the fused image. For the fusion of the detail layer, an efficient approach introducing enhanced gradient information is presented to boost the detail features and sharpen the edges of the fused image. Experimental results demonstrate that, compared with fifteen classical and advanced fusion methods, the proposed image fusion framework performs better in both subjective and objective evaluation.


I. INTRODUCTION
Multi-sensor image fusion is an enhancement technology that integrates the image information obtained by different kinds of sensors into one image, and it plays a vital role in computer vision tasks, such as target recognition, remote sensing, and surveillance [1]-[3].
Infrared images can distinguish targets from the background according to thermal radiation information and work well in day/night and all-weather conditions, but they have low resolution and weak details. Visible images contain detailed texture information and high spatial resolution, but they are easily affected by weather and lighting conditions [4]. Therefore, infrared and visible image fusion has become a very hot topic due to the perfect complementarity of their information. There are two key problems in infrared and visible image fusion. One is how to extract significant target areas and rich texture details from the source images, and the other is how to combine them to produce excellent fused images. This paper aims to explore an effective method to achieve satisfactory fusion of infrared and visible images.
The associate editor coordinating the review of this manuscript and approving it for publication was Qiangqiang Yuan.
Multifarious infrared and visible image fusion methods have been proposed in recent years, and they can be grouped into three dominant types: multi-scale transform-based, sparse and low-rank representation learning-based, and deep learning-based methods [5]-[7].
Multi-scale transforms have been developed for decades in the field of infrared and visible image fusion. The discrete wavelet transform is a typical multi-scale transform method [8], [9], which decomposes the input image into high- and low-frequency sub-images. Then, these sub-images are fused into a single image through appropriate fusion rules. The dual-tree complex wavelet transform (DTCWT) was proposed to overcome the shift variance and lack of directionality of the discrete wavelet transform [10]. To capture the abundant directional information of source images, the contourlet transform was proposed [11]. The nonsubsampled contourlet transform (NSCT) is a modified form of the contourlet transform, which is widely used in infrared and visible image fusion due to its flexibility and shift-invariance [12], [13]. However, the above-mentioned fusion methods need to transform images to the frequency domain, which increases the computational complexity [14].
To avoid image transformation, representation learning-based methods have attracted the attention of researchers. The most common methods are based on sparse representation (SR) [15] and dictionary learning [16], [17], which are consistent with the physiological mechanism of the human visual system. Nevertheless, image fusion methods based on SR suffer from high sensitivity to misregistration. To tackle this drawback, Liu et al. introduced the convolutional sparse representation (CSR) model to fuse multi-modal images. Even so, fused images obtained by SR-based fusion methods have insufficient texture details [18].
With the development of deep learning, many innovative image fusion methods based on deep learning have been designed [19]-[21]. The convolutional neural network (CNN) attracts much attention due to its powerful feature representation ability. Liu et al. utilize CNNs for the fusion of infrared and visible images [22]. However, CNN models usually require the ground truth of the training images, and building a ground truth is impractical for infrared and visible image fusion because there is no agreed standard for fused images. To solve this issue, Ma et al. constructed an end-to-end model named generative adversarial network for infrared and visible image fusion (FusionGAN) [4]. On this basis, they ameliorated the loss function of FusionGAN to increase the detail information and sharpen the edges of fused images [23]. There are two main drawbacks of deep learning-based fusion methods. One is that the network model is difficult to train when the number of images is limited, which is especially true for infrared and visible image fusion; the other is that a good network often depends on high-performance GPUs.
Recently, the latent low-rank representation (LatLRR) model has gradually attracted attention in image fusion. Fusion methods based on LatLRR can decompose source images into base and detail parts without transformation, which facilitates the design of fusion rules and thus the generation of high-quality fused images. In addition, these methods can fuse infrared and visible images without a complex training process or high-performance GPUs. Thus, fusion methods based on LatLRR are widely used in image fusion [14], [24], [25].
Inspired by this, we propose a novel framework for infrared and visible image fusion. The main challenge of infrared and visible image fusion is to generate a single image containing both salient target areas and texture details. This paper proposes a novel infrared and visible image fusion framework, which improves the fused image quality by introducing saliency and gradient information into the fusion strategy. To facilitate the design of the fusion strategy, we decompose the source image into detail and base layers. To improve the intensity of salient targets and the visual quality of the fused image, the saliency information of the source image is introduced to fuse the base layer. To boost the texture details of the fused image, gradient information is used to assist the fusion of the detail layer. Experimental results prove that the proposed image fusion framework outperforms other traditional and deep learning fusion methods.
The main contributions of this paper are summarized as follows.
1. A novel method based on multi-level image decomposition is proposed for infrared and visible image fusion.
2. For base layer fusion, an effective strategy guided by the saliency map is designed to increase the salient information and improve the visual quality of the fused image.
3. For detail layer fusion, an efficient method with enhanced gradient information is presented to increase the detail information of the fused image.
4. Compared with classical and state-of-the-art fusion methods, our proposed fusion framework has better performance in terms of both subjective and objective evaluation.
This paper is organized as follows. In Section II, the proposed infrared and visible image fusion framework is introduced in detail. In Section III, information about the dataset and parameter settings is given. In Section IV, the experimental results and analyses are provided. In Section V, conclusions and future work are presented.

II. THE PROPOSED INFRARED AND VISIBLE IMAGE FUSION METHOD
The proposed fusion framework is schematically presented in Figure 1, which is mainly composed of four parts: (1) image decomposition, (2) the fusion of base layer, (3) the fusion of detail layer, and (4) image reconstruction.
Image decomposition: a multi-level image decomposition method based on LatLRR is employed to obtain the base layer and detail layer of the source images, which facilitates the design of fusion strategies. After l levels of decomposition, the infrared and visible images are decomposed into base layer images b^l_A and b^l_B and detail layer matrices M^{1:l}_A and M^{1:l}_B. The base layer includes the primary intensity and brightness information of the input image, while the detail layer contains the texture details and feature information of the source images.
The fusion of base layer: an ingenious fusion strategy guided by the saliency map of the source image is designed. As shown in Figure 1, the salient regions of infrared and visible images differ. Because of their different imaging mechanisms, the infrared image mainly highlights the thermal radiation information of objects, whereas the visible image captures the spectral information reflected by different objects. Inspired by this, the weight matrix used to fuse the base layer is calculated from the saliency maps, which aims to retain suitable intensity information and improve the visual quality of the fused image simultaneously.
The fusion of detail layer: an effective approach is presented to achieve the fusion of the detail layer. First, the detail matrices of the infrared and visible images are fused based on a weighted-average rule. Then the enhanced gradient information is added to the detail layer to improve the contrast and sharpness of the fusion result.
Image reconstruction: the final step of the proposed fusion framework is reconstructing fusion results of base layer and detail layer to obtain the fused image. The four parts of our proposed method will be described in detail in the following sections.

A. IMAGE DECOMPOSITION
A successful fusion of infrared and visible images should integrate as much salient-target and detail information as possible into the fused image. Therefore, adequately extracting the intensity and texture content of the source images is a crucial step. Recently, a multi-level image decomposition method based on latent low-rank representation (LatLRR) was presented, which can divide input images into a base layer and a detail layer [14], [24]. The base layer mainly contains the intensity and brightness information of the source images, and the detail layer principally includes their texture and structure information. This decomposition is beneficial for designing the fusion strategies. Inspired by this, we utilize this approach to decompose the source images. The flowchart of image decomposition is shown in Figure 2.
The LatLRR model is described as Equation (1):

min_{Z,L,E} ||Z||_* + ||L||_* + λ||E||_1,  s.t.  X = XZ + LX + E  (1)

where λ is a positive balance factor set empirically; we set its value to 0.4 in this paper according to reference [26]. ||·||_* represents the nuclear norm, and ||·||_1 is the l1-norm. X denotes the observed data, which is the input image in this paper. Z and L are the low-rank and saliency coefficients, respectively, and E expresses the sparse noise. The low-rank part XZ, the saliency part LX, and the sparse noise part E can be obtained from Equation (1); the noise part is discarded in the image fusion operation.
The multi-level image decomposition method is constructed according to the LatLRR model in reference [14], which obtains the base layer and detail layer of the input image. The detail layer and base layer images are expressed as follows:

M^i = P × W(b^{i−1})  (2)
d^i = R(M^i)  (3)
b^i = b^{i−1} − d^i  (4)

where l is the highest decomposition level, and M^i and b^i represent the detail matrix and the base image at level i. Note that b^0 is the source image. After l levels of LatLRR decomposition, the source image yields l detail images d^{1:l} and one base image b^l. P is the projection matrix learned by LatLRR; its size is 16 × 16, and the decomposition level is set to {1, 2, 3, 4}. W(·) denotes a two-stage operator consisting of the sliding-window technique and reshuffling, and R(·) is the function that reconstructs the detail image d^i from the detail matrix M^i. The process of the W(·) operator is shown in Figure 3. The window size is 16 × 16, and the stride of the sliding window is 1. Before applying the sliding window, the source image is resized to (m̃, ñ) to guarantee that the width and height of the resized image are both integer multiples of 16. m̃ and ñ are calculated as Equations (5) and (6):

m̃ = m + 16 − mod(m, 16)  (5)
ñ = n + 16 − mod(n, 16)  (6)

where (m, n) is the size of the source image and mod is the function that calculates the remainder. Then, a window of size 16 × 16 slides over the resized image to obtain a series of image patches of size 16 × 16.
After the sliding window, each image patch is reshaped into a matrix k of size S_p × 1. Next, these matrices k are reshuffled into one matrix H of size S_p × C, where S_p = 16 × 16 = 256. C is calculated as follows:

C = ((m̃ − sqrt(S_p)) / S_d + 1) × ((ñ − sqrt(S_p)) / S_d + 1)  (10)

where sqrt(·) is the operation to get the square root and S_d is the stride value of the sliding window. The process of R(·) is similar to the inverse process of W(·).
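As an illustration, the W(·) operator described above can be sketched in NumPy. The function name and the use of zero-padding to reach integer multiples of 16 are our own assumptions for this sketch, not the paper's implementation:

```python
import numpy as np

def sliding_window_operator(img, patch=16, stride=1):
    """Sketch of the two-stage W(.) operator: slide a patch x patch
    window over the (padded) image and reshuffle every patch into one
    column of the matrix H of size S_p x C, with S_p = patch * patch."""
    m, n = img.shape
    # Pad so height and width are multiples of 16 (assumption: the
    # paper resizes; we zero-pad for simplicity), cf. Equations (5)-(6).
    m_t = m + (-m) % patch
    n_t = n + (-n) % patch
    padded = np.zeros((m_t, n_t), dtype=float)
    padded[:m, :n] = img
    cols = []
    for x in range(0, m_t - patch + 1, stride):
        for y in range(0, n_t - patch + 1, stride):
            cols.append(padded[x:x + patch, y:y + patch].reshape(-1))
    H = np.stack(cols, axis=1)  # shape (S_p, C)
    return H

# A 32 x 48 image yields C = (32-16+1) * (48-16+1) = 17 * 33 = 561 patches.
H = sliding_window_operator(np.arange(32 * 48, dtype=float).reshape(32, 48))
print(H.shape)  # (256, 561)
```

With stride 1, C matches Equation (10): ((m̃ − 16)/1 + 1) × ((ñ − 16)/1 + 1).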

B. FUSION OF BASE LAYER
The base layer can be obtained through image decomposition. As exhibited in Figure 4, as the decomposition level increases, the base layer image becomes smoother and the contrast between pixels decreases. As a result, the salient object regions cannot be retained integrally, the salient object edges are blurred, and the intensity information is greatly reduced. The traditional average rule only averages the pixel values of the infrared and visible base images, which cannot sufficiently integrate the intensity information of the base layer into the fused image. However, as exhibited in Figure 4, the salient regions of the infrared and visible images are highlighted in their saliency maps. Besides, the saliency map possesses strong contrast and clear object edges. Therefore, an innovative fusion strategy guided by the saliency map of the source image is designed to improve the intensity of salient targets and the visual quality of the fused image. This base layer fusion method consists of two steps: extraction of the saliency map and design of the fusion strategy. The two parts are described in detail next.

1) EXTRACTION OF SALIENCY MAP
The human visual system always pays more attention to the salient area of an image than to the background, which reduces the difficulty of tasks such as object detection, tracking, and recognition [27]. Pixels of salient structures, regions, and objects stand out from the surrounding neighbor pixels. Saliency detection aims to extract the visually salient regions of images. For image fusion, a suitable saliency detection method should simultaneously meet two requirements: clearly extract the salient region and preserve the edge and background information as far as possible. These two conditions guarantee rich intensity information and an integrated structure in the fused images. Inspired by [28], [29], a pixel-level saliency detection method named visual saliency map (VSM) is employed to obtain the saliency map of images in this step. The saliency map of an image is described as follows:

S(x, y) = Σ_{q=1}^{m} Σ_{t=1}^{n} ||I(x, y) − I(q, t)||  (11)

where the size of the input image is m × n, ||·|| represents the distance between the intensity values of two pixels, and I(x, y) and I(q, t) are the intensity values at the pixels (x, y) and (q, t), respectively.
In order to accelerate the computation and bring the saliency maps of infrared and visible images into the same order of magnitude, S(x, y) is normalized to [0, 1], which is rewritten as:

S̄(x, y) = (S(x, y) − min(S)) / (max(S) − min(S))  (12)

To prove the advantages and reasonability of the above-mentioned saliency extraction method, we compare it with four classical saliency detection algorithms: GS [30], SF [31], GBMR [32], and RBD [33]. As shown in Figure 5, the saliency maps of the GS, SF, GBMR, and RBD methods all detect the salient regions. However, these methods share a common deficiency: they overemphasize salient areas and filter out plenty of background information, which results in missing a large amount of information in the fused images. Compared with these four methods, VSM can not only extract the salient object well (such as the people in the infrared image) but also preserve the major content of the infrared and visible images. In summary, VSM is more suitable for image fusion.

2) DESIGN OF FUSION STRATEGY
To maintain sufficient intensity information of salient targets in the source image and improve the visual quality of the fused image, the saliency map is used as an important reference to calculate the weight matrix and guide the base layer fusion. The weight matrices for infrared and visible image fusion are provided by:

w_bA(x, y) = Ā(x, y) / (Ā(x, y) + B̄(x, y))  (13)
w_bB(x, y) = 1 − w_bA(x, y)  (14)

where Ā(x, y) and B̄(x, y) are the saliency values of the infrared and visible images at the pixel (x, y), and w_bA(x, y) and w_bB(x, y) are the weight values of the infrared and visible images at the pixel (x, y). Note that at a pixel (x, y) where Ā(x, y) + B̄(x, y) = 0, Equation (13) is undefined. In this condition, the saliency values of the infrared and visible images are equal at pixel (x, y); therefore, we set the weight value w_bA(x, y) to 0.5.
Therefore, the fused image of the base layer F_base is generated as follows:

F_base(x, y) = w_bA(x, y) × b^l_A(x, y) + w_bB(x, y) × b^l_B(x, y)  (15)

The details of the proposed base layer fusion strategy are described in Algorithm 1.

Algorithm 1 The Proposed Base Layer Fusion Strategy in This Paper
Input: a pair of aligned infrared and visible images;
Output: the fused image of base layer F_base;
1. Input images are decomposed by Equations (1)-(4) to obtain base layer images b^l_A and b^l_B;
2. Normalized saliency maps Ā and B̄ are obtained by Equations (11) and (12);
3. Weight matrices w_bA and w_bB are computed by Equations (13) and (14);
4. The fused image of base layer F_base is acquired by Equation (15).
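The saliency-guided weighting of the base layers can be sketched as follows; this is a minimal NumPy illustration of the per-pixel rule of Equations (13)-(15), including the 0.5 fallback where both saliency values are zero (the function name and toy inputs are ours):

```python
import numpy as np

def fuse_base(base_ir, base_vis, sal_ir, sal_vis):
    """Saliency-guided base layer fusion sketch: per-pixel weights
    proportional to the normalized saliency maps (Equations (13)-(14)),
    with w = 0.5 where both saliencies are zero, then the weighted sum
    of the base layers (Equation (15))."""
    total = sal_ir + sal_vis
    safe_total = np.where(total > 0, total, 1.0)   # avoid divide-by-zero
    w_ir = np.where(total > 0, sal_ir / safe_total, 0.5)
    w_vis = 1.0 - w_ir
    return w_ir * base_ir + w_vis * base_vis

base_ir = np.full((2, 2), 200.0)
base_vis = np.full((2, 2), 100.0)
sal_ir = np.array([[1.0, 0.0], [0.5, 0.0]])
sal_vis = np.array([[0.0, 1.0], [0.5, 0.0]])
F_base = fuse_base(base_ir, base_vis, sal_ir, sal_vis)
print(F_base)  # [[200. 100.], [150. 150.]]
```

Where the infrared saliency dominates, the fused pixel follows the infrared base layer; where both saliencies vanish, the result is the plain average.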

C. FUSION OF DETAIL LAYER
As shown in Figure 2, the detail layers of the input images are acquired by decomposition. Generally, the thermal radiation information produced by targets is contained in infrared images, which can be emphasized in the fused base layer. Meanwhile, the texture detail information of objects is included in visible images, which benefits target tracking and recognition due to its high spatial resolution. The gradient map of an image contains contour and edge information with strong sharpness and contrast. Some researchers have utilized gradient information in different ways to enhance image quality [34], [35]. To obtain fused images with rich texture details and sharpened edges, an efficient approach introducing enhanced gradient information is presented to perform the fusion of the detail layer, which is specifically introduced as follows.
After l levels of decomposition, the detail matrices M^{1:l} of an image of size m × n are obtained. The size of each detail matrix M^i is S_p × C, where S_p = 256 and C is calculated by Equation (10).
The weight for each pair of corresponding image patches can be written as follows:

w^i_{A,k} = ||re(M^i_{A,k})||_* / (||re(M^i_{A,k})||_* + ||re(M^i_{B,k})||_*)  (16)
w^i_{B,k} = 1 − w^i_{A,k}  (17)

where A and B denote the infrared and visible images, respectively, ||·||_* denotes the nuclear norm, which calculates the sum of the singular values of a matrix, and re(·) is the function that reorganizes the matrix M^i_k of size 256 × 1 into an image patch of size 16 × 16. The process of re(·) is shown in Figure 6.
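The nuclear-norm weighting of one patch pair can be illustrated as below; the function names are ours, and the toy columns stand in for the 256 × 1 detail columns of Equation (16):

```python
import numpy as np

def nuclear_norm(patch):
    """||.||_*: the sum of the singular values of a matrix."""
    return np.linalg.svd(patch, compute_uv=False).sum()

def detail_weights(col_ir, col_vis, patch=16):
    """Sketch of the patch-wise weights of Equations (16)-(17): each
    256 x 1 detail column is reorganized (re(.)) into a 16 x 16 patch
    and weighted by its nuclear norm; ties fall back to 0.5."""
    n_ir = nuclear_norm(col_ir.reshape(patch, patch))
    n_vis = nuclear_norm(col_vis.reshape(patch, patch))
    total = n_ir + n_vis
    w_ir = n_ir / total if total > 0 else 0.5
    return w_ir, 1.0 - w_ir

rng = np.random.default_rng(0)
strong = rng.normal(0.0, 1.0, 256)   # high-energy detail column
flat = np.zeros(256)                 # empty detail column
w_ir, w_vis = detail_weights(strong, flat)
print(w_ir, w_vis)  # 1.0 0.0
```

A patch with more structural energy (larger singular values) receives the larger weight, so the fused detail column is dominated by the source with richer local detail.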
Based on Equation (16), the fused detail matrix is provided as below:

M^i_{F,k} = w^i_{A,k} × M^i_{A,k} + w^i_{B,k} × M^i_{B,k}  (18)

The fused image of the detail layer at level i is expressed by Equation (19):

d^i_F = R(M^i_F)  (19)

Here, R(·) is the same function as in Equation (3), which reconstructs the detail image d^i from the detail matrix M^i.
To further improve the quality of the fused image, the enhanced gradient information of the visible image is added to the fused detail layer as supplementary content. A Gamma transformation is used to increase the contrast of the gradient map, which is denoted as follows:

G = (g + ε)^γ  (20)

where g and G are the input matrix and the matrix after Gamma transformation, respectively, ε is a very small constant term called the compensation factor, which ensures (g + ε) is non-zero, and γ is the Gamma coefficient. In this paper, the input of Equation (20) is the gradient map of the visible image, which can be solved by:

∇_B = |∇I_B|  (21)

where I_B is the visible image and ∇ is the gradient operator. According to Equations (20) and (21), the gradient map ∇̃_B after Gamma transformation can be expressed as follows:

∇̃_B = (∇_B + ε)^γ  (22)

Based on Equations (19) and (22), the fused image of the detail layer F_detail can be derived as follows:

F_detail = Σ_{i=1}^{l} d^i_F + ∇̃_B  (23)

The details of the proposed detail layer fusion strategy are given in Algorithm 2.
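The gamma-enhanced gradient of Equations (20)-(22) can be sketched as follows. We assume the gradient magnitude is scaled to [0, 1] before the Gamma transformation (an assumption of this sketch, not stated in the paper), so that γ > 1 suppresses weak gradients relative to strong edges; γ = 1.75 is the value selected in Section III-D:

```python
import numpy as np

def enhanced_gradient(vis, gamma=1.75, eps=1e-6):
    """Sketch of Equations (20)-(22): gradient magnitude of the visible
    image followed by a Gamma transformation. eps is the small
    compensation factor; scaling to [0, 1] is our assumption."""
    gy, gx = np.gradient(vis.astype(float))   # central differences
    grad = np.sqrt(gx ** 2 + gy ** 2)         # Equation (21)
    grad = grad / (grad.max() + eps)          # normalize before Gamma
    return (grad + eps) ** gamma              # Equations (20)/(22)

vis = np.zeros((8, 8))
vis[:, 4:] = 1.0                              # a single vertical edge
g = enhanced_gradient(vis)
print(g[0, 4] > g[0, 0])                      # edge pixels are emphasized
```

With γ > 1 on a [0, 1] range, near-zero gradients are pushed toward zero while strong edges stay close to one, which is the sharpening effect the text describes.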

D. IMAGE RECONSTRUCTION
Image reconstruction comprises two main parts in this paper. One part is to reconstruct the fused image of the detail layer, and the other is to reconstruct the final fused image from the fused images of the base layer and the detail layer. The final fused image F of the infrared and visible images is obtained as follows:

F = F_base + F_detail  (24)

III. EXPERIMENTAL DATASET AND SETTINGS

A. EXPERIMENTAL DATASET
In this paper, we test our method on the TNO Image Fusion Dataset [36] and the KAIST Dataset [37]. The TNO dataset contains many registered infrared and visible images of different scenes, which can be freely used for research purposes. Therefore, the TNO dataset is widely used in infrared and visible image fusion research. A sample of these image pairs is shown in Figure 7. KAIST is a multispectral pedestrian dataset, which contains abundant registered infrared-visible image pairs.

C. EVALUATION METRICS
To objectively analyze the fusion results of the proposed method, seven quality metrics are utilized.

1) ENTROPY(EN)
EN calculates the information richness of the fused image. The higher the EN is, the more information is contained in the fused image, and the better the quality of the fused image is [48]. The definition of EN is provided by:

EN = − Σ_{z=0}^{Z−1} p_z log2 p_z

where Z is the number of gray levels and p_z is the normalized histogram value of the corresponding gray level in the fused image.
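The entropy computation over the normalized gray-level histogram can be sketched as:

```python
import numpy as np

def entropy(img):
    """Shannon entropy (EN) of an 8-bit image: -sum p_z * log2(p_z)
    over the normalized gray-level histogram (zero bins excluded)."""
    hist = np.bincount(img.astype(np.uint8).ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

flat = np.zeros((16, 16))                      # one gray level -> EN = 0
halves = np.concatenate([np.zeros(128), np.full(128, 255.0)]).reshape(16, 16)
print(entropy(flat), entropy(halves))          # 0.0 1.0
```

A constant image carries no information (EN = 0), while an image split evenly between two gray levels carries exactly one bit per pixel.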

2) MUTUAL INFORMATION(MI)
MI computes the amount of information that is transferred from the source images to the fused image. A larger MI means a higher quality of the fused image [49]. MI is defined as below:

MI = MI_{A,F} + MI_{B,F}
MI_{S,F} = Σ_{s,f} p_{S,F}(s, f) log2 ( p_{S,F}(s, f) / (p_S(s) p_F(f)) )

where p_S(s) and p_F(f) are the marginal histograms of the source image S and the fused image F, respectively, and p_{S,F}(s, f) is the joint histogram of the source image and the fused image.
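The single-source term MI_{S,F} can be computed from the joint gray-level histogram as sketched below (8-bit images assumed; the reported MI is this quantity summed over both sources):

```python
import numpy as np

def mutual_information(src, fused, bins=256):
    """MI between one source image and the fused image, from the joint
    and marginal gray-level histograms."""
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(),
                                 bins=bins, range=[[0, 256], [0, 256]])
    p_sf = joint / joint.sum()                 # joint distribution
    p_s = p_sf.sum(axis=1, keepdims=True)      # marginal of the source
    p_f = p_sf.sum(axis=0, keepdims=True)      # marginal of the fused image
    mask = p_sf > 0                            # skip empty joint bins
    return float((p_sf[mask] * np.log2(p_sf[mask] / (p_s @ p_f)[mask])).sum())

img = np.arange(256, dtype=float).reshape(16, 16)
print(mutual_information(img, img))            # identical images: MI = entropy
print(mutual_information(img, np.zeros((16, 16))))  # constant image shares nothing
```

For identical images, MI equals the image entropy (here 8 bits, since all 256 gray levels occur equally often); a constant fused image shares no information with the source.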

3) AVERAGE GRADIENT(AG)
AG measures the degree of sharpness and clarity of the fused image. A large AG means the fused image has rich detail information and clear edges [50]. AG is calculated as below:

AG = (1 / (m × n)) Σ_{x=1}^{m} Σ_{y=1}^{n} sqrt( (ΔF_x(x, y)^2 + ΔF_y(x, y)^2) / 2 )

where ΔF_x and ΔF_y are the differences of F along the horizontal and vertical directions, and F(x, y) is the pixel value of the fused image of size m × n.
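A minimal sketch of AG using forward differences over the image interior (boundary handling varies across implementations; this choice is our assumption):

```python
import numpy as np

def average_gradient(F):
    """AG: the mean of sqrt((dFx^2 + dFy^2) / 2) over the image
    interior, using forward differences in both directions."""
    F = F.astype(float)
    dx = F[:-1, 1:] - F[:-1, :-1]   # horizontal difference
    dy = F[1:, :-1] - F[:-1, :-1]   # vertical difference
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))

ramp = np.tile(np.arange(8.0), (8, 1))   # slope 1 along x, 0 along y
print(average_gradient(ramp))            # sqrt(1/2) ~ 0.7071
```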

4) SPATIAL FREQUENCY(SF)
SF reflects the distribution of the intensity information and structural features in the fused image. A larger SF means the fused image has more texture details and sharper edges. It contains two parts: the spatial Row Frequency (RF) and the spatial Column Frequency (CF) [51]. SF is expressed as follows:

RF = sqrt( (1 / (m × n)) Σ_x Σ_y (F(x, y) − F(x, y − 1))^2 )
CF = sqrt( (1 / (m × n)) Σ_x Σ_y (F(x, y) − F(x − 1, y))^2 )
SF = sqrt( RF^2 + CF^2 )

where F(x, y) is the pixel value of the fused image.
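The row/column frequencies are the root-mean-square of first-order differences, as sketched here:

```python
import numpy as np

def spatial_frequency(F):
    """SF = sqrt(RF^2 + CF^2), where RF and CF are the RMS of the
    horizontal and vertical first-order differences, respectively."""
    F = F.astype(float)
    rf = np.sqrt(np.mean((F[:, 1:] - F[:, :-1]) ** 2))
    cf = np.sqrt(np.mean((F[1:, :] - F[:-1, :]) ** 2))
    return float(np.sqrt(rf ** 2 + cf ** 2))

checker = np.indices((8, 8)).sum(axis=0) % 2 * 255.0   # alternating pixels
print(spatial_frequency(checker))   # every step changes by 255 -> 255*sqrt(2)
```

A checkerboard is the extreme case: every neighboring pair differs by the full dynamic range, giving the maximal SF.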

5) STANDARD DEVIATION(SD)
SD indicates the spread of information in the image. A larger SD means that the fused image has higher contrast, a wider distribution of gray values, and richer information [52]. SD is defined as follows:

SD = sqrt( (1 / (m × n)) Σ_{x=1}^{m} Σ_{y=1}^{n} (F(x, y) − F_mean)^2 )

where F(x, y) is the pixel value of the fused image of size m × n and F_mean is the mean pixel value of the fused image.
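SD is simply the root-mean-square deviation of pixel values from the image mean:

```python
import numpy as np

def standard_deviation(F):
    """SD: RMS deviation of the pixel values from the mean pixel value."""
    F = F.astype(float)
    return float(np.sqrt(np.mean((F - F.mean()) ** 2)))

low_contrast = np.full((8, 8), 128.0)
high_contrast = np.concatenate([np.zeros(32), np.full(32, 255.0)]).reshape(8, 8)
print(standard_deviation(low_contrast), standard_deviation(high_contrast))
# 0.0 127.5
```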

6) SUM OF THE CORRELATIONS OF DIFFERENCES (SCD)
SCD measures the amount of complementary information contained in the fused image. The larger the SCD is, the higher the quality of the fused image is [53]. SCD sums the correlations between each source image and the difference between the fused image and the other source image:

SCD = r(F − I_B, I_A) + r(F − I_A, I_B)
r(D, I) = Σ (D − D̄)(I − Ī) / sqrt( Σ (D − D̄)^2 × Σ (I − Ī)^2 )

where r(·,·) is the correlation coefficient, and D̄ and Ī are the average pixel values of D and I, respectively.
VOLUME 9, 2021

7) VISUAL INFORMATION FIDELITY FUSION(VIFF)
VIFF expresses the visual information fidelity of the fused image; the higher the VIFF is, the better the visual quality of the fused image is [54]. VIFF is a metric with complex calculation, which can be simplified as follows:

VIFF = VID / VIND

where VID is the visual information with distortion, whereas VIND is the visual information without distortion. All the above-mentioned evaluation indicators are positively correlated with fusion quality; that is, the greater the metric value is, the better the fusion quality is.

D. PARAMETER SETTINGS
In this paper, the image decomposition level l is set to {1, 2, 3, 4}. In Section II-C, for the fusion of the detail layer, an efficient approach introducing enhanced gradient information is presented to increase the texture detail information and sharpen the edges of the fused image. A vital parameter of this approach is γ, the factor of the Gamma transform, which determines the degree of enhancement of the gradient map. Different γ values result in different fusion performance. In this part, we vary γ from 0.25 to 2.5 with an interval of 0.25. It is necessary to select an appropriate γ value for our proposed fusion framework based on the test images. Thus, we use the seven metrics to evaluate the performance of the proposed fusion method with different γ. The average values of the seven metrics over all fused images obtained by the proposed fusion framework with different γ and different levels l are shown in Figure 8.
In Figure 8, NON on the X-axis denotes the proposed fusion framework without γ-enhanced visible gradient information, which is taken as the reference for estimating the fusion quality of the proposed framework with different γ. Compared with NON, when γ = 0.25, the average values of the metrics (except AG and SF) are noticeably decreased, which illustrates that the fusion performance of the proposed framework is not improved. When γ belongs to [0.5, 1.5], the maximum average values of some metrics (including EN and MI) are still smaller than in the NON condition. When γ > 1.5, the average values of the metrics at levels 1 to 4 are all larger than NON, which shows that the fusion quality of the proposed framework is significantly increased.
In summary, if γ is set to an appropriate value, the fusion quality of the proposed fusion framework will be enhanced.
For intuitively and concretely evaluating the fusion performance of the proposed fusion method with different γ, we propose a max-comparison method to score γ, which is described as follows:

Score(γ) = Σ_{nm=1}^{NM} sgn( max(E_γ(nm)) − max(E(nm)) )  (36)
E_γ(nm) = (E_γ(nm, 1), E_γ(nm, 2), E_γ(nm, 3), E_γ(nm, 4))  (37)

where NM is the number of evaluation metrics; in this paper, NM is equal to 7. The nm values (from 1 to 7) correspond to the metrics EN, MI, AG, SF, SD, SCD, and VIFF, respectively. E_γ(nm) contains the nm-th evaluation metric values calculated from the fusion results obtained by the proposed method with γ-enhanced visible gradient information, and E_γ(nm, i) represents the evaluation metric value at level i. E(nm) contains the nm-th evaluation metric values calculated from the fusion results obtained by the proposed method without γ-enhanced visible gradient information. sgn(·) is the signum function, which is denoted as follows:

sgn(x) = 1 if x > 0; 0 if x = 0; −1 if x < 0

The higher the Score(γ) is, the better the fusion quality is. The max-comparison score results for different γ are shown in Table 1. When γ = 1.75, 2, 2.25, or 2.5, the scores are all 7, higher than for the other γ values. That is, the proposed fusion framework with γ = 1.75, 2, 2.25, or 2.5 can achieve good performance.
FIGURE 11. Experiments on ''men'' images of TNO dataset.
To further select the best γ from these four values, we propose a novel rank-score method based on [55], where K is the number of candidate γ values (K = 4) and k is the ranking of the fusion performance with γ. The rank scores for γ = 1.75, 2, 2.25, and 2.5 are presented in Table 2. When γ = 1.75, the rank score is the maximum among these four conditions, which means the fusion quality is the best. As a result, we set γ to 1.75 in our proposed fusion framework.

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this part, the fusion results of the proposed method are evaluated subjectively and objectively. We choose fifteen comparison methods and the seven evaluation metrics introduced in Section III to demonstrate the fusion performance of our method. To test the fusion performance of our proposed method, we run our method and the fifteen comparison methods on the TNO and KAIST datasets. In the field of image fusion, researchers generally select about 20 pairs of images to test an algorithm [46], [47]. Therefore, we select 20 infrared and visible image pairs from the TNO dataset and another 20 infrared and visible image pairs from the KAIST dataset for testing. The subjective and objective analyses are provided as follows.

A. EXPERIMENTS ON TNO DATASET

1) SUBJECTIVE EVALUATION
Three examples of the fused results on the TNO dataset are given in Figures 9-11. As shown in the red and blue boxes of Figures 9-11, the fused images of the CBF method contain much noise and unclear detail information. Compared with our proposed method, GTF, FusionGAN, and DenseFuse generate very few saliency features in the fused images. Although CVT, DTCWT, MSVD, IFEVIP, LatLRR, and DenseFuse can integrate some salient and texture information into the fused images, the edges are blurred and incomplete. By contrast, GFF, HMSD-GF, FCNN, GF, WLS, NestFuse, and the proposed fusion framework achieve better fusion than the others. Furthermore, our method can simultaneously integrate sufficient luminance and detailed structure information into the fused image. The visual quality of the fused images obtained by our method is clearly better than that of the others. In particular, as shown in Figure 9, the words in the red boxes obtained by our method are quite clear, with complete contours and sharpened edges. In addition, as exhibited in Figure 11, the men in the red boxes and the window in the blue boxes are retained well in the fused images of our method. As the image decomposition level increases (from 1 to 4), the saliency of the target regions is enhanced, and the contrast of the fused image is gradually improved.
In summary, compared with traditional and deep learning methods, our method can deliver fused images with stronger intensity of the salient targets, more detail information, sharper edges, and higher visual quality.

2) OBJECTIVE EVALUATION
The average evaluation metric values obtained by the proposed method and the comparison methods on the TNO dataset are shown in Table 3. In general, the average values of our proposed method (decomposition levels 1 to 4) on these metrics are acceptable. Specifically, the average values of AG, SF, and VIFF for our method (level 4) are much larger than those of the other methods. These values show that the proposed method can produce fused images with adequate detail information, clear edges, and high visual quality.
The average values of our method on these metrics are all in the top three, which indicates that our method achieves better image fusion than the other fifteen methods. The average values of EN, MI, AG, SF, and VIFF for our method (level 4) are the maximum among these comparison methods. These values demonstrate that our method can preserve sufficient information, enhance features (such as edges), and improve the visual quality of fused images. The average values of our method on the SCD metric are larger than those of most methods, which illustrates that the information in the fused images obtained by our method has credible complementarity.
To assess the proposed fusion framework intuitively, we give bar charts of the seven evaluation metrics for the different fusion methods on the TNO dataset in Figure 12. The different color bars represent the average evaluation metric values of the different fusion methods. The average values of our method on the metrics EN, MI, SD, and SCD are very high, which illustrates that the proposed method retains sufficient information. It can be seen that the average values on the metrics AG, SF, and VIFF are significantly larger than those of the other methods, which proves that the proposed fusion method can produce fused images with strong spatial structure and fine visual quality.

B. EXPERIMENTS ON KAIST DATASET
1) SUBJECTIVE EVALUATION
An example of the fused results on the KAIST dataset is given in Figure 13. Pedestrians and a fence are contained in the red boxes. As shown in Figure 13, the fused images of the MSVD and GTF methods have poorer visual quality than ours: the pedestrians and zebra crossings in the MSVD and GTF fusion results are not obvious, and the edge structures of the fence and zebra crossings are not clear. The pedestrians in the fused images of CVT, CBF, GFF, and HMSD-GF are not salient. The fused image produced by FusionGAN retains little texture information of the fence. As exhibited in Figure 13, our method, IEFVIP, FCNN, and LatLRR can generate fused images of high quality. In particular, compared with the other methods, the pedestrians and zebra crossings in the fused images obtained by our method are more salient, and the edges of the fence are sharper. In conclusion, the proposed fusion method can generate fused images with salient targets, clear texture details, sharpened edges, and high visual quality.

2) OBJECTIVE EVALUATION
Table 4 gives the average evaluation metric values obtained by the different fusion methods on the KAIST dataset. As shown in Table 4, the average values of our method (level 4) on AG, SF, SD, and VIFF are the best among these methods, which illustrates that the proposed infrared and visible image fusion method can produce fused images with clear edges and high visual quality. Although the average values of our method on EN, MI, and SCD are not in the top three, they are all acceptable, which indicates that the proposed method can preserve sufficient source image information.
To compare the proposed fusion method with the other methods more intuitively, Figure 14 provides bar charts of the seven evaluation metrics for the different fusion methods on the KAIST dataset; the different color bars represent the average metric values of the different methods. As shown in Figure 14, the average values of our method on AG, SF, SD, and VIFF are obviously larger than those of the other methods. The average values on EN, MI, and SCD are not the maximum, but they are still greater than those of many other methods. These values reflect the superiority of the proposed method. After the quantitative and qualitative analyses of the experimental results on the TNO and KAIST datasets, we can conclude that the proposed method achieves satisfactory image fusion and outperforms most existing fusion methods.

3) DISCUSSION ON TIME EFFICIENCY
Table 5 gives the average running time of the proposed fusion method and the comparison methods on the TNO and KAIST datasets. As shown in Table 5, the LatLRR method consumes more time to fuse a pair of images than many algorithms because it requires a sliding window during the fusion process. Our method is designed based on the LatLRR model, so it also takes a long time to fuse a pair of images. The average running times of LatLRR, FCNN, and NestFuse-n are larger than that of our method (level 1). As presented in Table 5, many methods require less running time than the proposed method; however, many of them still cannot meet the real-time requirement either.
Our work focuses on the quality of the fused images. Although the proposed infrared and visible image fusion method has a certain complexity, it performs better than many classical and state-of-the-art methods. Considering the above subjective and objective analyses, the proposed method delivers good performance and is suitable for applications without a real-time requirement. In future work, we will try to accelerate the image fusion speed of our method.

V. CONCLUSION
In this paper, we propose an effective method for infrared and visible image fusion, which can produce fused images with strong intensity information, high visual quality, rich texture details, and sharpened edges. The source images are decomposed into a base layer and a detail layer by the multi-level decomposition method. For the fusion of the base layer, a strategy guided by the saliency map is designed, which preserves suitable intensity information and improves the visual quality of the fused images. For the fusion of the detail layer, an approach built on the enhanced gradient information is constructed, which increases the detail information and sharpens the edges of the fused image. Moreover, extensive comparison experiments demonstrate the effectiveness and advantages of the proposed fusion framework. In addition, image fusion technology has been widely used in many fields of computer vision; therefore, in future work, we will try to apply the proposed fusion method to computer vision tasks such as target recognition.
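The two-layer pipeline summarized above — decompose, fuse the base layers under saliency guidance, fuse the detail layers under gradient guidance, and reconstruct — can be sketched as follows. This is a simplified illustration, not the exact method of this paper: a single Gaussian base/detail split stands in for the multi-level (LatLRR-based) decomposition, the saliency map is a hypothetical local-contrast proxy, and a per-pixel max-gradient rule stands in for the enhanced-gradient strategy.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def fuse(ir, vis, sigma=5.0):
    """Toy two-layer infrared/visible fusion (illustrative only)."""
    ir, vis = ir.astype(np.float64), vis.astype(np.float64)
    # 1) Decompose each source into a base layer and a detail layer.
    base_ir, base_vis = gaussian_filter(ir, sigma), gaussian_filter(vis, sigma)
    det_ir, det_vis = ir - base_ir, vis - base_vis
    # 2) Base fusion: weight by a (hypothetical) saliency map --
    #    here simply the infrared deviation from its mean intensity.
    sal = np.abs(ir - ir.mean())
    w = sal / (sal.max() + 1e-12)           # normalize weights to [0, 1]
    base = w * base_ir + (1 - w) * base_vis
    # 3) Detail fusion: keep, per pixel, the detail with the larger
    #    gradient magnitude (a stand-in for the enhanced-gradient rule).
    g_ir = np.hypot(sobel(det_ir, 0), sobel(det_ir, 1))
    g_vis = np.hypot(sobel(det_vis, 0), sobel(det_vis, 1))
    detail = np.where(g_ir >= g_vis, det_ir, det_vis)
    # 4) Reconstruct and map back to the 8-bit range.
    return np.clip(base + detail, 0, 255).round().astype(np.uint8)
```

The skeleton makes the division of labor explicit: the base fusion controls target intensity and overall appearance, while the detail fusion controls textures and edge sharpness — the two problems the proposed framework addresses separately.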