Fusion of Low-Quality Visible and Infrared Images Based on Multi-Level Latent Low-Rank Representation Joint With Retinex Enhancement and Multi-Visual Weight Information

Latent low-rank representation has been applied to multi-level image decomposition for infrared and visible image fusion with good results. However, when the source infrared and visible images are of low quality, the visual effect of the fused images remains unsatisfactory. To address this problem, this paper proposes an infrared and visible image fusion method based on multi-level latent low-rank representation combined with image enhancement and multiple visual weight information. First, each source image is decomposed by multi-level latent low-rank representation into detail parts (detail images and detail matrices) and base images. Then a nuclear-norm-based fusion strategy is used to fuse the detail matrices, and multi-visual weights determined by clarity, local contrast, and edge-corner saliency are used to fuse the detail images. These two fusion results are combined by weighted averaging to obtain the fused detail image. The base images are fused by an averaging strategy after Retinex-based enhancement. The final fused image is obtained by combining the fused detail image and the fused base image. Compared with other state-of-the-art fusion methods, the proposed algorithm shows better fusion performance in both subjective and objective evaluation.


I. INTRODUCTION
Image fusion is an important branch of multi-sensor information fusion. It aims to integrate images of the same scene captured at the same moment by different types of sensors, which helps describe the characteristics of the target scene more comprehensively [1]. Visible images capture scene information through reflected light. Under well-lit conditions the resulting image has high resolution, but it is difficult to capture all the information in the scene at night or under low-visibility conditions such as fog and haze. Infrared imaging detects the thermal radiation of the object itself, which highlights thermal targets that are occluded by other objects. This technique reduces interference from environmental factors, but its texture detail is insufficient and its contrast is low [2]. Infrared and visible image fusion can combine the advantages of these two imaging technologies. The fused image contains detailed texture information and clear infrared thermal radiation targets, which benefits subsequent target detection and other tasks [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
In recent years, driven by advances in fusion technology and the demands of practical applications, many infrared and visible image fusion algorithms have been proposed. Multi-scale transform methods are the most widely used. The transform-domain algorithms include the wavelet transform [4], [5], the shearlet transform [6], the curvelet transform [7], and the non-subsampled contourlet transform [8]. This type of method usually comprises three steps. First, each source image is decomposed into several levels. Then the decomposed layers are integrated by appropriate fusion rules. Finally, the fused image is obtained by the corresponding inverse transform. Recently, multi-scale transform-based methods have often been combined with neural networks to achieve better fusion performance [9], [10]. However, it remains challenging to select flexible basis functions that allow a data-driven choice of the best representation of the source images, and the adaptive selection of decomposition levels is still an open problem [1]. Hence, researchers have sought methods that process the source images without a transform, such as deep learning-based methods and representation learning-based methods.
Deep learning-based fusion methods designed for infrared and visible image fusion include the convolutional neural network (CNN) [11], DenseFuse [12], disentangled representation fusion (DRF) [13], and so on. The main drawback of deep learning-based methods is that they are difficult to train when training data are insufficient, especially in infrared and visible image fusion tasks, and that they pay very little attention to image decomposition [14].
Sparse representation is a commonly used method in the domain of representation learning. Image fusion based on sparse representation does not need to transform the image to a particular frequency domain. This kind of method uses a sliding window to divide the source image into blocks, which reduces the impact of image misregistration [15]. For example, Zhang et al. used overlapping sub-blocks of infrared and visible images for training to construct an over-complete dictionary [16]. To improve the fusion effect, sparse representation is usually combined with other tools, such as the shearlet transform [17] and image cartoon-texture decomposition [18].
Although sparse representation can overcome the deficiencies of multi-scale transforms and achieve better fusion performance, it still has limited ability to extract detail information [14]. Latent low-rank representation (LatLRR) [19] can effectively extract local and global structural information simultaneously. Li et al. used LatLRR for infrared and visible image fusion [20]. However, the LatLRR-based image decomposition method extracts only incomplete high-frequency information, which leads to unsatisfactory fusion results. To alleviate this issue, the base image decomposed by LatLRR was further decomposed into high-frequency and low-frequency components via the non-subsampled shearlet transform in [21]. Li et al. further proposed a LatLRR-based multi-level decomposition method (MDLatLRR) [14], which performs multi-level LatLRR decomposition and can extract more details from the source image. This paper focuses on fusion methods based on multi-level LatLRR decomposition.
MDLatLRR first decomposes the images by multi-level LatLRR and then averages the low-rank images. The significance coefficient matrices are fused block by block by a weighted fusion method based on the nuclear norm. Multi-level LatLRR decomposition can remove image noise, but it cannot effectively solve the clarity and contrast problems of blurred visible and infrared images, which results in an unsatisfactory visual effect. Based on Retinex theory, the reflection component can be extracted by removing the illumination, which enhances image details [22]. Recently, Retinex theory has been widely used to enhance low-quality visible [23]-[25] and infrared images [26]. Although Retinex enhancement improves image quality, some targets remain unclear or invisible in the enhanced image because they could not be captured in the source image due to occlusion or low visibility. Image fusion can combine the thermal radiation information in infrared images with the detailed texture information in visible images, so that missing targets become observable and vague targets become clearer. Combining image fusion with image enhancement can make full use of the advantages of both to improve the fusion effect [27]. We consider enhancement before fusion.
In general, a good fusion algorithm depends on the image decomposition method on the one hand and the fusion strategy on the other [28]. In MDLatLRR, the fusion strategy for the detail parts is based on the nuclear norm, which computes the sum of the singular values of each input patch to preserve 2D information from the source images. The nuclear norm alone is usually not sufficient to fully retain detail information and salient targets. The importance of other visual salient features in detail image fusion has also been verified in [28]-[31]. We therefore further consider other visual salient features for the fusion of the detail parts.
In summary, this paper addresses the fusion of blurred visible and infrared images by proposing a fusion method based on latent low-rank representation combined with image enhancement and multiple visual weights. After the multi-level LatLRR decomposition of the image, Retinex enhancement and the calculation of visual fusion weights are combined. The contributions of this paper are as follows:
(1) A novel visible and infrared image fusion framework, based on multi-level latent low-rank decomposition combined with image enhancement, multi-visual information, and the nuclear norm, is proposed for the fusion of low-quality visible and infrared images.
(2) Retinex enhancement is performed on the base image produced by MDLatLRR decomposition to improve the clarity of the final fused image.
(3) To enhance the saliency of the fused image, the fused detail image is obtained as a weighted sum of the multi-visual fusion result and the nuclear-norm fusion result. The multi-visual fusion weights are constructed from three types of visual information: edge-corner saliency, local contrast, and clarity.
The rest of the paper is organized as follows. In Section II, we introduce preliminaries, including image decomposition based on LatLRR and Retinex-based image enhancement. In Section III, two strategies for the combination of MDLatLRR decomposition and image enhancement are discussed. In Section IV, the proposed image fusion method is presented in detail. The experimental result and analysis are shown in Section V. Finally, in Section VI we draw our conclusions.

II. PRELIMINARIES
A. IMAGE DECOMPOSITION BASED ON LATENT LOW-RANK REPRESENTATION
The main idea of LatLRR [19] is to express the original data matrix as the sum of low-rank components (global structure), salient components (local structure), and sparse noise. The mathematical model of LatLRR can be expressed as
$$\min_{Z, L, E} \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E \tag{1}$$
where $\|\cdot\|_*$ denotes the nuclear norm, which is the sum of the singular values of a matrix, $\|\cdot\|_1$ is the $\ell_1$-norm, and $\lambda > 0$ is a balance coefficient. $X \in \mathbb{R}^{N \times M}$ denotes the observed data matrix, $Z \in \mathbb{R}^{M \times M}$ is a low-rank matrix that separates data from noise, $L \in \mathbb{R}^{N \times N}$ is the salient coefficient matrix, and $E \in \mathbb{R}^{N \times M}$ is a sparse noise matrix. Eq.1 can be regarded as a convex optimization problem with a nuclear norm that can be solved by the inexact Augmented Lagrangian Multiplier (ALM) algorithm [19]. When LatLRR is used for image decomposition, the image data must first be preprocessed to obtain the observed data matrix, in which each column corresponds to an image patch. Fig.1 shows the process of the single-level LatLRR decomposition of images, where $P(\cdot)$ denotes a two-stage preprocessing operator that first divides the input image $I$ into many image patches using a sliding window of size $n \times n$ with stride $s$, and then rearranges these patches into the columns of the observed data matrix. Stride $s$ means that the window is moved by $s$ pixels at a time in the x and y directions.
As Fig.1 shows, once the projection matrix $L$ is learned by LatLRR, the image $I$ is decomposed into a detail image $I_d$ and a base image $I_b$ by Eq.2 [14]:
$$V_d = L \times P(I), \quad I_d = R(V_d), \quad I_b = I - I_d \tag{2}$$
where $R(\cdot)$ is the operator that reconstructs the detail image from the detail part. When the detail image is restored from $V_d$, an averaging strategy is used to process the overlapping pixels. For convenience in the subsequent description, the operation in the dashed box in Fig.1 is denoted as DLatLRR. The framework of the multi-level version of DLatLRR (MDLatLRR) can be modified as in Fig.2, since it is easy to verify that $P(I_b) = P(I) - V_d$. To distinguish it from the MDLatLRR in [14], the multi-level LatLRR decomposition procedure shown in Fig.2 is named IMDLatLRR. As can be seen from Fig.2, the image $I$ is decomposed by $r$-level LatLRR, which outputs $r$ base images $I_b^k$, $r$ detail images $I_d^k$, and $r$ detail matrices $V_d^k$, $k = 1, 2, \cdots, r$.
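To make the operators $P(\cdot)$ and $R(\cdot)$ and the level-by-level decomposition concrete, a minimal NumPy sketch is given below. It is an illustration under stated assumptions rather than the implementation used in this paper: the projection matrix L is assumed to be already learned by LatLRR (a random placeholder stands in for it in the usage lines), each level is computed in the image domain from the previous base image, and all function names are hypothetical.

```python
import numpy as np

def extract_patches(img, n, s):
    """P(.): slide an n x n window with stride s; each patch becomes one column."""
    h, w = img.shape
    cols = [img[y:y + n, x:x + n].reshape(-1)
            for y in range(0, h - n + 1, s)
            for x in range(0, w - n + 1, s)]
    return np.stack(cols, axis=1)  # shape (n*n, number_of_patches)

def reconstruct(V, shape, n, s):
    """R(.): put each column back as a patch and average overlapping pixels."""
    h, w = shape
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    k = 0
    for y in range(0, h - n + 1, s):
        for x in range(0, w - n + 1, s):
            acc[y:y + n, x:x + n] += V[:, k].reshape(n, n)
            cnt[y:y + n, x:x + n] += 1
            k += 1
    # Pixels never covered by a window (possible when s > 1) stay at zero.
    return acc / np.maximum(cnt, 1)

def multilevel_decompose(img, L, n, s, r):
    """r-level decomposition: returns detail matrices, detail images, base images."""
    base = img.astype(float)
    detail_mats, detail_imgs, base_imgs = [], [], []
    for _ in range(r):
        Vd = L @ extract_patches(base, n, s)      # detail part of this level
        Id = reconstruct(Vd, base.shape, n, s)    # detail image of this level
        base = base - Id                          # base image of this level
        detail_mats.append(Vd)
        detail_imgs.append(Id)
        base_imgs.append(base)
    return detail_mats, detail_imgs, base_imgs

# Usage with a placeholder L (NOT a learned LatLRR projection matrix).
rng = np.random.default_rng(0)
image = rng.random((8, 8))
L_placeholder = 0.1 * rng.standard_normal((16, 16))
Vs, Ds, Bs = multilevel_decompose(image, L_placeholder, n=4, s=2, r=3)
```

By construction, the last base image plus the sum of all detail images gives back the input image, whatever matrix is used for L.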

B. RETINEX-BASED IMAGE ENHANCEMENT
The basic assumption of Retinex theory is that the original image $I(x, y)$ can be decomposed into the product of an illuminance component and a reflection component:
$$I(x, y) = L(x, y) \cdot R(x, y) \tag{3}$$
where $L(x, y)$ is the illuminance component, which determines the dynamic range of the image gray levels, and $R(x, y)$ is the reflection component, which represents the intrinsic features of the real-world scene. The purpose of Retinex-based image enhancement is to eliminate the influence of uneven illumination and to represent the intrinsic features of the image. According to single-scale Retinex, the reflection component can be obtained as follows:
$$\log R(x, y) = \log I(x, y) - \log\left[G(x, y) * I(x, y)\right] \tag{4}$$
where $*$ is the convolution operation and $G(x, y)$ is a filtering function.
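The single-scale Retinex idea can be sketched in a few lines of NumPy. The sketch below assumes a Gaussian filter for $G$, estimates the illuminance as the blurred image, and returns the log-domain reflection; the kernel radius, the edge padding, the small constant eps, and the function names are illustrative choices and not settings taken from this paper. Mapping the output back to a displayable range is left out.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Normalized 1-D Gaussian kernel with radius 3*sigma (an assumed cut-off)."""
    radius = max(int(3 * sigma), 1)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge replication at the borders."""
    k = gaussian_kernel_1d(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def single_scale_retinex(img, sigma, eps=1e-6):
    """log R = log I - log(G * I); eps avoids log(0) on dark pixels."""
    img = img.astype(float) + eps
    illuminance = gaussian_blur(img, sigma)
    return np.log(img) - np.log(illuminance)

# Usage on a synthetic image with values in (0, 1].
rng = np.random.default_rng(0)
image = rng.uniform(0.2, 1.0, size=(12, 12))
reflection = single_scale_retinex(image, sigma=2.0)
```

Because the illuminance estimate scales with the image, multiplying the input by a constant leaves the log-domain reflection essentially unchanged, which is the property that makes the method useful for unevenly lit images.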

III. IMAGE ENHANCEMENT AND DECOMPOSITION
Image enhancement can improve the clarity and contrast of images, thus improving the visual effect of fusion results.
After an image is decomposed by IMDLatLRR, the detail image is usually clearer and image blurriness caused by low illumination is mainly focused on the base image.
In this sense, image enhancement can be conducted on the base image after IMDLatLRR. This means there are two strategies for combining IMDLatLRR decomposition and image enhancement: (1) enhance the source image and decompose the enhanced image by IMDLatLRR; (2) decompose the source image by IMDLatLRR and enhance the base image after decomposition. Fig.3 shows the result of image reconstruction after the different processes, where Fig.3(b) is reconstructed after Retinex-based image enhancement followed by IMDLatLRR, and Fig.3(c) is reconstructed after IMDLatLRR followed by base image enhancement. The results in Fig.3 show that the order of IMDLatLRR and Retinex-based image enhancement has little effect on image reconstruction. We also ran an experiment to compare the fusion results of these two strategies, and the results show that the second strategy is slightly superior. For lack of space, this article focuses on the second strategy.

IV. PROPOSED FUSION METHOD
A. THE FRAMEWORK OF THE FUSION METHOD
The framework of our fusion algorithm is presented in Fig.4. Given an input infrared image $I_{IR}$ and a visible image $I_{VIS}$, IMDLatLRR is first applied to each of them to output the detail matrices $V_{dIR}^{1:r}$ and $V_{dVIS}^{1:r}$, the detail images $I_{dIR}^{1:r}$ and $I_{dVIS}^{1:r}$, and the two base images $I_{bIR}^{r}$ and $I_{bVIS}^{r}$. In our fusion method, $I_{bIR}^{r}$ and $I_{bVIS}^{r}$ are each Retinex-enhanced, and a weighted average strategy is applied to the enhanced images $\hat{I}_{bIR}^{r}$ and $\hat{I}_{bVIS}^{r}$ to obtain the fused base image $I_{bf}$.
For each pair of detail matrices, the nuclear-norm-based fusion strategy [14] is used to fuse the matrices column by column, and the result is reconstructed to obtain the fused detail image $I_{df1}$. The $r$ detail images $I_{dIR}^{1:r}$ are summed to obtain one detail image $I_{dIR}^{r}$, and the $r$ detail images $I_{dVIS}^{1:r}$ are summed to obtain one detail image $I_{dVIS}^{r}$. Both $I_{dIR}^{r}$ and $I_{dVIS}^{r}$ contain more obvious structures and features. Weights based on multiple visual salient features are then applied to $I_{dIR}^{r}$ and $I_{dVIS}^{r}$ to obtain the fused detail image $I_{df2}$. The final fused detail image $I_{df}$ is obtained as a weighted average of $I_{df1}$ and $I_{df2}$. Once the fused detail image and base image are obtained, the fused image is obtained by $I_f = I_{bf} + I_{df}$. In the next subsections, the details of the fusion strategies will be presented.

B. FUSION OF BASE IMAGES BASED ON RETINEX ENHANCEMENT
Given the two base images $I_{bIR}^{r}$ and $I_{bVIS}^{r}$, the Gaussian mask $G$ is calculated first, and the enhanced base images $\hat{I}_{bIR}^{r}$ and $\hat{I}_{bVIS}^{r}$ are then obtained by applying Eq.(4) to each base image. Because the base image mainly contains the background information of the source image, a simple weighted average is usually sufficient to achieve a good fusion effect. The fused base image is obtained by
$$I_{bf} = w_{bIR}\,\hat{I}_{bIR}^{r} + w_{bVIS}\,\hat{I}_{bVIS}^{r}$$
where $w_{bIR}$ and $w_{bVIS}$ denote the corresponding weights of the two enhanced base images.

C. FUSION OF DETAIL PARTS BASED ON NUCLEAR-NORM AND MULTI-VISUAL WEIGHT INFORMATION
The detail image contains more visual salient features than the base image. The nuclear-norm-based fusion strategy for the detail images has been shown to be effective in [14]. Furthermore, the importance of visual salient features in detail image fusion has been verified in [28]-[31]. Better fusion effects can be obtained by considering the nuclear norm and multiple visual salient features simultaneously. Three key indicators that reflect the visual effect of structures, namely the clarity, contrast, and edge-corner saliency of images, are used to construct the multiple visual fusion weights.
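The column-by-column nuclear-norm fusion of detail matrices can be sketched as follows. This is a hedged reading of the strategy attributed to [14]: each column is reshaped to an n x n patch, the nuclear norm of each patch serves as its activity level, and the two columns are averaged with weights proportional to those norms. The function name and the equal-weight fallback for two all-zero columns are assumptions made for this illustration.

```python
import numpy as np

def nuclear_norm_fuse(V_ir, V_vis, n):
    """Fuse two detail matrices column by column with nuclear-norm weights."""
    fused = np.empty_like(V_ir, dtype=float)
    for j in range(V_ir.shape[1]):
        # Nuclear norm = sum of singular values of the reshaped n x n patch.
        w_ir = np.linalg.norm(V_ir[:, j].reshape(n, n), ord="nuc")
        w_vis = np.linalg.norm(V_vis[:, j].reshape(n, n), ord="nuc")
        total = w_ir + w_vis
        a = 0.5 if total == 0 else w_ir / total   # assumed fallback when both are zero
        fused[:, j] = a * V_ir[:, j] + (1.0 - a) * V_vis[:, j]
    return fused

# Usage with random stand-ins for the detail matrices of 4 x 4 patches.
rng = np.random.default_rng(0)
V_ir = rng.standard_normal((16, 5))
V_vis = rng.standard_normal((16, 5))
V_fused = nuclear_norm_fuse(V_ir, V_vis, n=4)
```

A column whose patch has the larger nuclear norm dominates the fused column, so the source with more structure in that patch contributes more.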
1) Image clarity
The human eye is sensitive to image clarity. Several focus measures were studied in [32] as measures of image clarity, and the improved Laplace energy sum was shown to perform better. It can be formulated as For the sake of convenience, $I(x, y)$ is denoted as $I_{xy}$ in the following. The clarity of images is defined as where $N_1$ and $M_1$ are positive integers.
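As a rough illustration of a Laplacian-energy focus measure, the sketch below computes the modified Laplacian at each pixel and sums its square over a $(2N_1+1) \times (2M_1+1)$ window. It follows the common sum-modified-Laplacian form from the focus-measure literature and is an assumption about the general shape of the measure, not a reproduction of the exact definition used in this paper; the step size, the squaring, the border handling, and the function names are all illustrative.

```python
import numpy as np

def modified_laplacian(img, step=1):
    """|2I - left - right| + |2I - up - down| with edge replication."""
    p = np.pad(img.astype(float), step, mode="edge")
    c = p[step:-step, step:-step]
    left, right = p[step:-step, :-2 * step], p[step:-step, 2 * step:]
    up, down = p[:-2 * step, step:-step], p[2 * step:, step:-step]
    return np.abs(2 * c - left - right) + np.abs(2 * c - up - down)

def sum_modified_laplacian(img, n1=1, m1=1):
    """Sum of squared modified Laplacian over a (2*n1+1) x (2*m1+1) window."""
    ml2 = modified_laplacian(img) ** 2
    h, w = ml2.shape
    p = np.pad(ml2, ((n1, n1), (m1, m1)), mode="constant")
    out = np.zeros_like(ml2)
    for dy in range(2 * n1 + 1):
        for dx in range(2 * m1 + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out

# Usage: a single bright pixel gives the highest clarity at its own position.
image = np.zeros((7, 7))
image[3, 3] = 1.0
clarity = sum_modified_laplacian(image, n1=1, m1=1)
```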
2) Local contrast of image
Compared with changes in a single pixel, changes in a local group of pixels are easier for the human eye to detect. Both the local image gradient and the local image contrast are key quantities for assessing the spatial details of an image; therefore, the magnitude of the local image gradients is combined with the local image contrast to calculate the local contrast LC as in Eq.(9), where $\aleph_{I_{xy}}$ is an $s \times s$ neighborhood around $I_{xy}$, $\nabla I_{pq}$ is the spatial gradient of $I_{pq}$, and $\beta$ is a constant.
3) The edge-corner saliency of the image
The structure tensor can reflect the structure and spatial information of the image. Thus, the smooth region, the edge region, and the corner region of the image can be distinguished by using the structure tensor [33].
The linear structure tensor matrix at each position in the image is defined as: Normalize the matrices M and N to obtain their normalized counterparts. Then the edge-corner saliency is obtained by linearly combining the normalized M and N as follows: where $k \in [0, 1]$ is a constant.
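The way a structure tensor separates smooth, edge, and corner regions can be sketched as follows. The code builds the smoothed tensor from image gradients and computes its two eigenvalues: both are near zero in smooth regions, only the larger one is significant on edges, and both are significant at corners. The uniform smoothing window (in place of a Gaussian window of size h), the edge and corner measures, and the linear combination with k are hypothetical stand-ins for the quantities M and N above, whose exact definitions are not reproduced here.

```python
import numpy as np

def window_mean(a, h):
    """Mean over an h x h window (h odd) with edge replication."""
    r = h // 2
    p = np.pad(a, r, mode="edge")
    out = np.zeros_like(a, dtype=float)
    for dy in range(h):
        for dx in range(h):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (h * h)

def structure_tensor_eigenvalues(img, h=3):
    """Eigenvalues (lam1 >= lam2) of the smoothed 2 x 2 structure tensor."""
    gy, gx = np.gradient(img.astype(float))   # derivatives along rows, columns
    jxx, jxy, jyy = window_mean(gx * gx, h), window_mean(gx * gy, h), window_mean(gy * gy, h)
    trace = jxx + jyy
    root = np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2)
    return (trace + root) / 2.0, (trace - root) / 2.0

def edge_corner_saliency(img, k=0.5, h=3):
    """Hypothetical saliency: k * normalized edge measure + (1 - k) * corner measure."""
    lam1, lam2 = structure_tensor_eigenvalues(img, h)
    edge, corner = lam1 - lam2, lam2           # assumed measures, not the paper's M and N
    edge = edge / edge.max() if edge.max() > 0 else edge
    corner = corner / corner.max() if corner.max() > 0 else corner
    return k * edge + (1.0 - k) * corner

# Usage: a vertical step edge responds along the edge and not in flat areas.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
saliency = edge_corner_saliency(image, k=0.5, h=3)
```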

4) Multi-visual fusion weight
The fusion weight plays a key role in the image fusion effect. Define the single weight model as follows:  Finally, the fusion weights constructed by AF, EA, and LC are defined as:

5) Fusion and reconstruction
The final fused detail image is obtained by where $w \in [0, 1]$ is a constant, $I_{df1}$ is the result of fusing and reconstructing the detail matrices as in MDLatLRR [14], and the value of $I_{df2}$ at position $(x, y)$ is
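The final combination step can be illustrated with the short sketch below. It assumes that w multiplies the nuclear-norm result $I_{df1}$ and that $1 - w$ multiplies the multi-visual result $I_{df2}$; the text states only that the two results are weight averaged with a constant $w \in [0, 1]$, so this assignment, like the function name, is an assumption. The fused image is then formed as $I_f = I_{bf} + I_{df}$.

```python
import numpy as np

def fuse_final(I_df1, I_df2, I_bf, w=0.5):
    """Weighted average of the two fused detail images, added to the fused base."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    I_df = w * I_df1 + (1.0 - w) * I_df2   # assumed convention: w weights I_df1
    return I_bf + I_df

# Usage with random stand-ins for the three intermediate images.
rng = np.random.default_rng(0)
I_df1, I_df2, I_bf = rng.random((3, 6, 6))
I_f = fuse_final(I_df1, I_df2, I_bf, w=0.5)
```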

V. EXPERIMENT RESULT AND ANALYSIS
A. EXPERIMENTAL SETTING
To verify the advantages of the proposed fusion algorithm, we adopted the dataset from [34]. The test dataset contains 21 aligned infrared and visible image pairs. A sample of these infrared and visible images is shown in Fig.5. The four selected groups of source images have different resolutions, while the infrared and visible images of the same scene have the same resolution. The scene features of each group differ. The ''Street'' group is a road scene at night, including vehicles, street lamps, pedestrians, etc. Compared with the other groups, the ''Forest'' group has a lower resolution and relatively simple scene information. The illumination of the ''Man'' images is uneven and weak at the edges. The visible images of the ''Soldier'' group have poor visibility due to smoke. All the parameters needed in IMDLatLRR are the same as those used in [14]: a sliding window of size 16 × 16 with stride 1 is used to divide the source images into image patches, the decomposition level varies between 1 and 4 (r = 1, 2, 3, 4), the nuclear norm is used in the fusion of detail matrices, and $w_{bIR} = w_{bVIS} = 0.5$. Following [35], the value of k in Eq.13 is set to 0.5 in our experiment. The remaining parameters of the proposed algorithm are examined in a group of experiments that evaluate their robustness.
All the experiments are implemented in MATLAB R2017b on a 3.6 GHz Intel(R) Core(TM) i7-7700 CPU with 16 GB RAM.
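The effect of the stride on the amount of data to process can be estimated with a short calculation. The sketch below counts the patches produced by a 16 × 16 sliding window; the 256 × 256 image size is a hypothetical example and not one of the test images.

```python
def patch_count(height, width, n=16, s=1):
    """Number of n x n windows that fit in the image when moving s pixels at a time."""
    return ((height - n) // s + 1) * ((width - n) // s + 1)

# Hypothetical 256 x 256 image, window 16 x 16.
count_stride_1 = patch_count(256, 256, n=16, s=1)   # 241 * 241 = 58081 patches
count_stride_4 = patch_count(256, 256, n=16, s=4)   # 61 * 61 = 3721 patches
```

For this example, stride 4 yields roughly 16 times fewer patches than stride 1, so each observed data matrix has correspondingly fewer columns.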

B. VERIFY THE PARAMETER ROBUSTNESS OF THE ALGORITHM
The Gaussian filtering approach is used in Retinex-based enhancement. In addition, there are four parameters that may affect the algorithm's performance, including the scale σ used in the Retinex-based enhancement of the base image, the size of neighborhood (denoted as s) used to calculate the LC, the size of the Gaussian filter window (denoted as h) used to calculate EA, and the fusion weight (w) in Eq.16.
Four groups of experiments are conducted to evaluate the robustness of the algorithm with respect to these four parameters. In Fig.6, each quality metric is almost a flat line at all four decomposition levels. In fact, the variance of each quality metric at each level is less than 0.001. The results indicate that the influence of the Gaussian filter parameter used in base image enhancement on the fusion results is almost negligible.
Fig.7 shows that, except for CC, SSIM, and VIF, the metrics decrease slightly as s increases, especially AG and SF when the decomposition level is 3 or 4. According to Fig.7, a suitable neighborhood size for calculating LC is 3 × 3 or 5 × 5.
The results in Fig.8 also show that, with the exceptions of CC, SSIM, and VIF, other metrics slightly decrease with the increase of h, especially for AG and SF when the decomposition level equals 3 or 4. This is because with the increase of h, the role of the filter will be weakened. According to Fig.8, the most suitable size of the Gaussian filter window used to calculate EA is 3 or 5.
As shown in Fig.9, as w increases, SSIM and SD increase and reach their optimal values at w = 1, while SF and AG decrease and have their optimal values at w = 0; the other three indicators increase very slightly. Considering all the indicators together, w = 0.5 is the most appropriate choice.
As noted in Figs.6-9, with the exceptions of SSIM and CC, the metrics increase as the number of decomposition levels grows. This is because the detail information is not fully extracted by IMDLatLRR at a shallow decomposition level (i.e. level 1), so the base parts contain more source image information, and the simple averaging of the base images ensures that more structural information of the source images is preserved. As the decomposition level increases, the detail parts contain more base information (e.g. luminance, contour), and the fusion strategy for the detail images changes the image structure information. The relation between the trend of SSIM and the decomposition level is consistent with the results in [14].

C. ABLATION EXPERIMENT
We conduct ablation experiments to explore the effectiveness of the two main components of the proposed method (base image enhancement and multi-visual weighted fusion). The proposed method is implemented in two stages. In the first stage, only base image enhancement is combined with MDLatLRR (MDLatLRR + Retinex); multi-visual weighted fusion is then added in the second stage. In Tab. 1, compared with MDLatLRR at the same level, only SSIM is slightly reduced, while all other metrics are clearly improved by both MDLatLRR + Retinex and the proposed method. This indicates that combining Retinex enhancement of the base image with multi-visual fusion of the detail images can improve the contrast of the fused image, highlight image features, and better match human vision while retaining the structure of the source images. Furthermore, compared with MDLatLRR + Retinex, the proposed method leaves some metrics almost unchanged or only slightly improved when the decomposition level is 1 or 2, improves most metrics when the level is 3, and slightly improves all metrics when the level is 4. This indicates that as the number of decomposition levels increases (from 1 to 4), the multi-visual fusion of detail images contributes more, because the details become clearer and the salient features are better enhanced.

D. ALGORITHM COMPARISON EXPERIMENT
The parameters used in this experiment are set as follows: σ = 80, h = 5, s = 5, w = 0.5.

1) Subjective evaluation
The fused results of four pairs of low-quality source images, which are poorly illuminated or contain unclear thermal targets, are shown in Figs.10-13. These results are obtained by 9 existing fusion methods and our algorithm. MDLatLRR uses L16, the nuclear norm, and a decomposition level of 4. For the sliding window technique, the stride is set to 1 and 4. Fig.13 shows the fusion results of the source image pair ''Soldier''. From Fig.13 we can see that the three deep learning-based methods (CNN, NSST-PAPCNN, and DRF) fail to remove the haze from the visible image, and the people lying down can hardly be seen. Moreover, the fusion results of GTF and VSM-WLS are also not clear enough. Focusing on the tree trunk (red boxes), the result of the proposed method is clearly the sharpest.
On the other hand, Figs.10-13 show that the fused images obtained by MDLatLRR and the proposed method capture more salient features, and that for the same method the results with stride = 1 and stride = 4 are almost indistinguishable. Furthermore, compared with MDLatLRR, the visual quality of the fused images produced by the proposed method is slightly better.
2) Evaluation metrics
The optimal and suboptimal values are marked in bold red and bold blue font with underline, respectively.
In Tab. 2, the proposed method achieves the best results in three criteria (EN, SD, and VIF) and the second-best values in SF and AG when stride = 1. For stride = 4, the proposed method achieves the best results in SF and AG and the second-best values in EN and VIF. Although our method does not achieve the best value in CC, it obtains a comparable result. The quantitative results show that the proposed fusion method delivers better fusion performance than the compared methods. Tab. 2 also shows that there is very little difference between the results of MDLatLRR under stride = 1 and stride = 4.
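For readers unfamiliar with the no-reference metrics named above, the sketch below gives commonly used definitions of entropy (EN), standard deviation (SD), spatial frequency (SF), and average gradient (AG) for an 8-bit grayscale image. These are the usual textbook forms; the exact normalizations behind Tab. 2 are not stated in the text, so the formulas and function names here are assumptions. CC, SSIM, and VIF also require the source images and are omitted.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy (bits) of the gray-level histogram, assuming range [0, 256)."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """SD: spread of gray levels around the mean."""
    return float(np.std(img.astype(float)))

def spatial_frequency(img):
    """SF: sqrt(RF^2 + CF^2) from horizontal and vertical first differences."""
    img = img.astype(float)
    rf2 = np.mean((img[:, 1:] - img[:, :-1]) ** 2)
    cf2 = np.mean((img[1:, :] - img[:-1, :]) ** 2)
    return float(np.sqrt(rf2 + cf2))

def average_gradient(img):
    """AG: mean of sqrt((dx^2 + dy^2) / 2) over the image."""
    img = img.astype(float)
    dx = img[:-1, 1:] - img[:-1, :-1]
    dy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))

# Usage: an image split into a black half and a white half.
half = np.zeros((4, 4))
half[:, 2:] = 255
scores = {
    "EN": entropy(half),
    "SD": standard_deviation(half),
    "SF": spatial_frequency(half),
    "AG": average_gradient(half),
}
```

Higher values of these four metrics are generally read as more information, contrast, and detail in the fused image.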
3) Computational cost
Tab. 3 lists the average running time of the different methods on four images of different sizes. As can be seen from Tab. 3, the GTF, CVT, NSCT, and VSM-WLS methods have short running times, but their performance is ordinary. The CNN, NSST-PAPCNN, and DRF algorithms are based on deep learning models without training. Their testing phase does not require much time, so their running time is relatively short, but their image fusion effect is limited. Compared with MDLatLRR, the proposed method takes more time because it has two additional steps, namely base image enhancement and the calculation of multi-visual weights.
Both MDLatLRR and the proposed method are image decomposition methods based on representation learning. When stride = 1, the running time of the proposed method is similar to that of the sparse representation algorithm CSR. Compared with the other methods, the proposed method is comparatively slow, but its fusion performance is the best. For scenarios where speed matters, stride = 4 can be used, at the cost of fusion performance slightly inferior to that with stride = 1. In this sense, the running time of the proposed algorithm is still competitive. The running time of the proposed method may be reduced to some extent by using efficient programming languages like Python and C.

VI. CONCLUSION
In this paper, we developed a new MDLatLRR-based framework for fusing infrared and visible images by combining Retinex-based enhancement and multiple visual weights. First, the infrared and visible images were decomposed by multi-level LatLRR to extract the detail parts and base parts of the input images at several representation levels. The clarity, local contrast, and edge-corner saliency of the image were calculated to construct the visual weights for detail part fusion, and the final fused detail image was obtained as a weighted average of the nuclear-norm fusion result and the multi-visual weight fusion result. Retinex-based enhancement was applied to the base parts before they were fused using the averaging strategy. The proposed method was evaluated both subjectively and objectively in a number of experiments. The experimental results demonstrate that, compared with MDLatLRR, the benefit of enhancing the base image is obvious, and that as the number of decomposition levels increases, the multi-visual fusion of detail images contributes more to the fusion performance. The performance of the proposed method is superior to that of the compared methods. However, the proposed method also has limitations, such as the difficulty of determining the combination of multi-visual weight coefficients and the insufficient investigation of the selection of Retinex enhancement filters. In the future, we will conduct further research on the following aspects:
1) Use other measures of multi-visual information for image fusion, such as edge information obtained with fractional-order-differentiation-based fusion metrics [47].
2) Use optimization methods to determine the combination of the different multi-visual weight coefficients.
3) Investigate how different filters in the Retinex enhancement algorithm affect the fusion result.
4) Develop a fast fusion scheme for color image fusion based on MDLatLRR and Retinex enhancement.
VOLUME 10, 2022