IR-MSDNet: Infrared and Visible Image Fusion Based On Infrared Features and Multiscale Dense Network

Infrared (IR) and visible images are heterogeneous data, and their fusion is an important research topic in the remote sensing field. In the last decade, deep networks have been widely used in image fusion due to their ability to preserve high-level semantic information. However, because IR images have lower resolution, deep learning-based methods may fail to retain their salient features. In this article, a novel IR and visible image fusion method based on IR Features & Multiscale Dense Network (IR-MSDNet) is proposed to preserve the content and key target features from both visible and IR images in the fused image. It comprises an encoder, a multiscale decoder, a traditional processing unit, and a fused unit, and can capture rich background details in visible images and prominent target details in IR features. When the dense and multiscale features are fused, the background details are obtained by an attention strategy and then combined with complementary edge features. IR features are extracted by traditional quadtree decomposition and Bezier interpolation, and further intensified by refinement. Finally, both the decoded multiscale features and IR features are used to reconstruct the final fused image. Experimental comparison with other state-of-the-art fusion methods validates the superiority of the proposed IR-MSDNet in both subjective and objective evaluation metrics. An additional objective evaluation on the object detection (OD) task further verifies that the proposed IR-MSDNet greatly enhances the details in the fused images, which yields the best OD results.


I. INTRODUCTION
Remote sensing image fusion has been studied for decades, because complementary information from multisource remote sensing images in the fused image is of great help to various remote sensing applications such as surveillance and object detection (OD) [1], [2]. The fusion of infrared (IR) and visible images is one of the important tasks in the remote sensing field. Its main purpose is to extract features from multisource images and then fuse them to generate fused images with prominent IR targets and rich background details. Recently, many fusion methods have been proposed for this purpose, which can be broadly divided into traditional methods [3]-[7] and deep learning-based methods [8]-[12].
In addition to the spatial domain method, most traditional methods are based on signal processing techniques. These signal processing methods mainly include multiscale-based methods [3] and learning-based methods, which have achieved good results. Different from multiscale transformation, learning approaches are usually based on representation, such as sparse representation [7] and dictionary learning [4]. Although these direct methods can avoid information loss during image fusion, they are usually complicated and time consuming, especially for online learning.
Recent advances in deep convolutional neural networks (CNNs) in remote sensing have provided better potential for image fusion in learning and extracting high-level semantic information than traditional methods. Prabhakar et al. [8] proposed a novel CNN-based fusion framework for a multiexposure image fusion task. A fusion framework that utilizes multilevel deep features for image fusion was proposed in [9]. Li et al. [12] proposed a fusion network using dense blocks in an autoencoder manner. Ma et al. [11] introduced a novel method for IR and visible image fusion using a generative adversarial network (GAN). Although the fusion performance of these CNN or GAN methods is better than that of existing methods, there are still some drawbacks in IR and visible image fusion.
First, common deep networks cannot efficiently extract salient features of IR images, because subsampling at each layer weakens or smooths the IR features, which are then submerged in the multiscale features of visible images. Second, because of the characteristics of IR image features, most network frameworks are not necessarily suitable for IR and visible image fusion. Finally, IR features need to be further enhanced for the subsequent image reconstruction.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
To overcome these drawbacks, a novel IR and visible image fusion method based on IR Features & Multiscale Dense Network (IR-MSDNet) is proposed, which makes full use of the respective advantages of deep learning and traditional handcrafted features, especially IR feature extraction, to obtain a better fusion result. It preserves full background details and key target features from both visible and IR images.
The major contributions of this article are as follows:
1) In IR-MSDNet, an efficient encoder is designed for both visible and IR images, which is capable of preserving both the target details in IR images and the rich background details in visible images, and a multiscale decoder reconstructs the initial fused image.
2) IR features are extracted by traditional quadtree decomposition [13] and Bezier interpolation [14], and further enhanced by refinement. These IR features are combined with the initial fused image to produce the final fused image directly. In this way, IR features are not submerged in the multiscale features of visible images.
3) To the best of our knowledge, this is the first time that deep learning has been combined with traditional methods for remote sensing image fusion, opening up a new perspective on visible and IR image fusion.
The rest of this article is organized as follows. Section II describes the related work. Section III introduces the proposed IR-MSDNet in detail. Section IV presents the experiments and discusses the results. Section V concludes this article.

II. RELATED WORK

A. Traditional Fusion Methods
The spatial domain approach can be roughly divided into pixel-based, block-based, and region-based approaches. Pixel-based image fusion methods extract image features while preserving the spatial consistency of the final fused images; examples include dense scale-invariant feature transform (DSIFT) [19], image matting (IM) [20], and guided filtering (GF) [21]. In block-based image fusion methods [22], [23], the images are divided into the same number of blocks, and the blocks are then fused by fusion rules. The number and size of blocks directly affect the fusion results. The region-based method depends on image segmentation, so its performance also depends on the efficiency of image segmentation [24], [25].
The multiscale transformation methods, as typical representatives of traditional fusion methods, are usually used to decompose the images into multiscale representations and then fuse the multiscale representations according to certain fusion rules. Finally, the fused image is obtained by the inverse transformation of the multiscale representations. The Laplacian pyramid [15], discrete wavelet transform [16], dual-tree complex wavelet transform [17], and curvelet transform [18] are examples of multiscale transformation methods.
In short, to improve fusion quality, traditional fusion methods generally require more manual intervention, and the fusion rules adopted are relatively complex; therefore, problems such as low efficiency and high computational cost are inevitable.

B. Deep Learning Based Fusion Methods
Deep learning has attracted extensive attention since its appearance and has been successfully applied to a wide range of remote sensing applications, such as image fusion [26]. Liu et al. [27] first utilized CNNs as the backbone to achieve a rich fusion result based on a decision map indicating the rules of image fusion. Nonetheless, the training strategy of this method is limited to multifocus images. Li et al. [12] proposed a novel autoencoder network for image fusion, which includes an encoder, fusion layers, and a decoder. The encoder and decoder are trained on all input images, and the deep features extracted by the encoder are then adaptively fused. Zhang et al. [10] proposed a general end-to-end fusion network, which is simple and effective in producing fused images but, being designed to generalize across different image types, is not specialized for IR images. Ma et al. [11] introduced a GAN architecture for IR and visible image fusion. During training, the source image features were concatenated and fed to the generator network to obtain the fused image. However, the strong adversarial ability of the GAN may sometimes cause the IR image content to suppress the visible image after fusion. In short, the abovementioned deep networks are general-purpose designs and cannot efficiently extract salient features from IR images, as these features are weakened by downsampling due to the lower resolution of IR images.
III. PROPOSED IR-MSDNET
IR-MSDNet, as shown in Fig. 1, comprises an encoder, a multiscale decoder, a traditional processing unit, and a fused unit. Suppose $I_V$ and $I_R$ represent the visible and IR images, respectively, which have been preregistered according to [9] and are fed to the encoder and the traditional processing unit. The encoder is used to extract dense multiscale features from the visible and IR images, respectively, for the initial image fusion. The multiscale decoder is designed to reconstruct the initial fused image with richer background detail. In addition, to increase the detail of the initial fused image, the traditional processing unit is designed to extract edge features from the visible and IR images, and especially IR features that focus on the target details of the IR images. Because the traditional processing method focuses on the target detail of the IR image when extracting the IR feature, the visible image can be used to refine the IR feature. In the fused unit, the final fused image is generated by fusing the initial fused image and the IR feature.
Therefore, it can be seen from Fig. 1 that, in the proposed IR-MSDNet, visible and IR image fusion includes two main processes. The first is to construct an encoder and decoder network to realize the initial fusion of the visible and IR images; the second is to further fuse the IR image features extracted by the traditional methods with the initial fused image, to compensate for the loss of IR image details caused by the CNN.

A. Encoder
The encoder contains different modules for feature extraction, including the dense module, the multiscale module, the feature fusion module, the channel-wise attention (CWA) [28] module, and the fused feature bank. To accept a visible or IR image of arbitrary size, the image is first processed by a convolutional layer with a kernel size of 3 × 3 and a stride of 1. Before feature fusion, different multiscale features are extracted from the visible and IR images, respectively, through the dense module and the multiscale module. Details of the dense module, multiscale module, and CWA module of the encoder are shown in Fig. 2.
The dense module [shown in Fig. 2(a)] is made up of three cascaded convolutional layers; that is, the output of the previous layer is the input of the next layer. A 3 × 3 convolution kernel is used to extract coarse features in this dense module. In the multiscale module [shown in Fig. 2(b)], the size of the convolution kernels varies from 5 × 5 and 3 × 3 to 1 × 1, which not only preserves the details from the dense module but also extracts features from coarse to fine. These multiscale features are therefore necessary for image fusion [29].
In the fusion module, the $\ell_1$-norm and soft-max strategy [8] is chosen for fusing the multiscale features, as shown in Fig. 3. To improve fusion efficiency, a block-based averaging method is adopted to avoid any misregistration between multiscale feature maps, making the fusion consistent. Let $\phi_k^n(x, y)$ be the multiscale features at position $(x, y)$, where $k \in \{I_R, I_V\}$ corresponds to the IR image or the visible image, and $n \in \{1, 3, 5\}$ denotes the scale corresponding to a certain kernel size. According to [30], the $\ell_1$-norm of $\phi_k^n(x, y)$ is defined as the activity level measurement for fusing multiscale features, which gives the initial activity level measurement $\alpha_k$. The final activity level measurement for the entire multiscale features is then obtained by block-based averaging of $\alpha_k$, where $p$ represents the block size; a small value is recommended, such as $p = 1$.
Let $f_k^n$ denote the fused feature map, computed as $w_k^n * \phi_k^n$, where $w_k^n$ represents the weight map and $*$ represents the convolution operation. Then, the sum $f_{I_R}^n(x, y) + f_{I_V}^n(x, y)$ is the result of the dense multiscale fused features.
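The fusion step described above can be illustrated with a minimal NumPy sketch. Several details here are assumptions rather than the authors' exact formulation: the block average is taken over a $(2p+1) \times (2p+1)$ window, the weight maps are the normalized activity levels of the two sources, and the weights are applied element-wise (a simplification of the $*$ operation in the text). The function name `l1_softmax_fuse` is hypothetical.

```python
import numpy as np

def l1_softmax_fuse(phi_ir, phi_vis, p=1):
    """Fuse two (C, H, W) feature stacks using l1-norm activity weights."""
    def activity(phi):
        # Initial activity level: l1-norm across channels at each (x, y).
        alpha = np.abs(phi).sum(axis=0).astype(float)
        # Final activity level: block average over an assumed (2p+1)x(2p+1) window.
        padded = np.pad(alpha, p, mode="edge")
        h, w = alpha.shape
        out = np.zeros_like(alpha)
        for dy in range(2 * p + 1):
            for dx in range(2 * p + 1):
                out += padded[dy:dy + h, dx:dx + w]
        return out / (2 * p + 1) ** 2

    a_ir, a_vis = activity(phi_ir), activity(phi_vis)
    total = a_ir + a_vis + 1e-12  # small constant avoids division by zero
    w_ir, w_vis = a_ir / total, a_vis / total
    # Weighted sum of the two sources; weight maps broadcast over channels.
    return w_ir * phi_ir + w_vis * phi_vis
```

With this normalization the two weight maps sum to one at every position, so fusing a feature stack with itself returns it unchanged, and a source with zero activity contributes nothing.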
After the above feature fusion, to further exploit the dense multiscale fused features, CWA is applied to the fused features to acquire different channel weights, which enriches the enhanced features with both more salient high-level features and more channel information. In the CWA module [shown in Fig. 2(c)], CWA is carried out by a series of processes: global pooling, a fully connected (FC) layer, ReLU activation, a second FC layer, and sigmoid activation. Finally, the combined channel weights are multiplied by the fused features to obtain the processed features. Suppose $F$ denotes the fused features and $\hat{F}$ denotes the enhanced salient representation obtained by CWA. Here, $\beta_1$ and $\beta_2$ are the weights of the two FC layers $fc_1$ and $fc_2$, respectively; $g$ is the global pooling operator; $\beta$ is the combined total weight for channel attention; $\delta$ denotes the ReLU function; and $C(\cdot)$ denotes the channel-wise multiplication between the fused feature map and the total weight $\beta$.
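The sequence of CWA operations listed above (global pooling, FC, ReLU, FC, sigmoid, channel-wise multiplication) can be sketched in NumPy as follows. This is an illustrative sketch only: global average pooling is assumed for $g$, bias terms are omitted, and the function name and weight shapes are hypothetical.

```python
import numpy as np

def channel_wise_attention(F, beta1, beta2):
    """F: (C, H, W) fused features; beta1: (C_r, C); beta2: (C, C_r)."""
    g = F.mean(axis=(1, 2))                          # global average pooling -> (C,)
    hidden = np.maximum(beta1 @ g, 0.0)              # first FC layer + ReLU
    beta = 1.0 / (1.0 + np.exp(-(beta2 @ hidden)))   # second FC layer + sigmoid
    # Channel-wise multiplication of the fused features by the weights.
    return beta[:, None, None] * F, beta
```

Each channel weight lies in (0, 1), so the module rescales channels without changing the spatial size of the feature maps.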
In image fusion, edges are among the most important information in the fused image. However, in a CNN, edge features are easily blurred or smoothed by the pooling operation, so the edge details of the image reconstructed after feature fusion are not rich. Therefore, in addition to the CNN-fused features processed by CWA, edge features extracted by traditional methods from the visible and IR images are also added to our model. To facilitate the subsequent decoder processing, these edge features are combined by an OR operation (discussed in the following sections) and then processed through a fixed convolutional layer of size 3 × 3, as shown in Fig. 1. In the fused feature bank, the two types of features above are concatenated to build up a rich and detailed encoder.

B. Multiscale Decoder
At the end of the encoder, complementary edge features are concatenated with the rich fused features from the attention module in the fused feature bank. The initial fused image is then reconstructed in the multiscale decoder.
In the proposed multiscale decoder, there are five learnable convolution layers, each used with a ReLU activation function. To avoid vanishing gradients, which sometimes occur in many networks, a skip connection strategy is used for smooth training. Thus, three layers are designed with skip connections. In this decoder, including the multiscale layers, a unified 3 × 3 kernel size has been implemented. Similar to the encoder module, convolution kernels varying from 5 × 5 and 3 × 3 to 1 × 1 have been chosen to preserve the details of all scales. The details of the multiscale decoder are shown in Fig. 1. All the multiscale feature maps are concatenated before the final layers to reconstruct a rich initial fused image.

C. Training Encoder and Decoder Networks
In IR-MSDNet, the encoder (except the fusion layer) and decoder networks are the parts that are trained. In the training stage, the aim is to obtain the optimal weights of the autoencoder network, so that it is capable of rich deep feature extraction and detailed reconstruction.
During training, the key is to train the network to reconstruct the initial fused image from the visible and IR source images. The dense and multiscale modules extract the features from the visible and IR images, respectively, and the multiscale features are then concatenated and fed to the CWA module. Finally, the features from the CWA module and the edge features are concatenated, forming the fused feature bank, which is fed to the multiscale decoder for decoding.
Training enables the encoder and decoder to obtain their final parameters and weights by minimizing the loss functions. Let the total loss $L_T$ be the sum of two kinds of loss functions: the structural similarity (SSIM) loss and the pixel loss. Here, $O_p$ and $I$ denote the fused image and the source images, respectively; $\|\cdot\|$ is the Frobenius norm; SSIM means the structural similarity of two images; and $\lambda$ represents the tradeoff parameter in the total loss used to build the final output images from the source images. In addition, this tradeoff parameter affects the efficiency of early training of the network: a suitable $\lambda$ leads to the optimal weights with fast convergence. Its effect is explained briefly in Section IV-C.
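Under the definitions above, one plausible written form of the loss is given below. This is a sketch under stated assumptions: the text states only that the total loss is the sum of an SSIM loss and a pixel loss with tradeoff parameter $\lambda$, so the placement of $\lambda$ on the SSIM term and the form $1 - \mathrm{SSIM}$ are assumptions.

```latex
\begin{aligned}
L_T &= \lambda \, L_{\mathrm{SSIM}} + L_P, \\
L_{\mathrm{SSIM}} &= 1 - \mathrm{SSIM}(O_p, I), \\
L_P &= \lVert O_p - I \rVert_F .
\end{aligned}
```

With this form, a larger $\lambda$ makes the SSIM term dominate the early gradient, which would be consistent with $\lambda$ influencing the speed of early convergence.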

D. Traditional Processing
Unlike the feature extraction in the encoder described above, the traditional processing unit contains two modules. The first module is responsible for extracting edge features from the visible and IR images, respectively, while the second module directly extracts IR features from the IR image through a series of small processing steps.
Edge features are extracted by Canny edge detectors [31] from both the visible and IR images and then combined by an OR operation, as shown in Fig. 4. To feed the edge features to the encoder, the combined edge feature is convolved with a kernel of size 3 × 3 and forwarded to the fused feature bank. The second and most important module in the traditional processing unit is IR feature extraction, which is also the core of our scheme, as shown in Fig. 1. To extract IR features directly, a series of processes is applied: quadtree decomposition, Bezier interpolation, Gaussian filtering, and IR feature refinement.
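The edge branch (OR combination followed by a 3 × 3 convolution) can be sketched as follows. The sketch assumes that the two binary edge maps have already been produced by an edge detector such as Canny; the function name `combine_edges`, the zero padding, and the kernel passed in are hypothetical choices for illustration.

```python
import numpy as np

def combine_edges(edge_vis, edge_ir, kernel):
    """OR-combine two binary edge maps, then apply a 3x3 kernel (zero padded)."""
    combined = np.logical_or(edge_vis > 0, edge_ir > 0).astype(float)
    padded = np.pad(combined, 1, mode="constant")
    h, w = combined.shape
    out = np.zeros((h, w))
    # Sliding-window sum (cross-correlation, as in CNN convolution layers).
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return combined, out
```

The OR operation keeps an edge pixel if it is present in either source, so edges visible in only one modality are retained in the combined map.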
Initially, the quadtree decomposition technique [13] is adopted to focus on the approximate outline of the IR target. Because quadtree decomposition is time-efficient, it helps to select appropriate control points, with which much of the noise can be suppressed. In quadtree decomposition, the threshold $T_{quad}$ and the area size are two vital parameters. $T_{quad}$ controls whether an area is decomposed further or not. Typically, a small upper limit is designated to prevent the variation. The location of a control point is usually expressed as $(a, b)$. These coordinates are evaluated consistently from each individual area in the quadtree framework.
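A minimal sketch of a threshold-controlled quadtree decomposition is shown below. The split criterion (the intensity range of an area exceeding $T_{quad}$), the minimum area size, and the restriction to square images with a power-of-two side are assumptions made for illustration; the function name `quadtree_blocks` is hypothetical.

```python
import numpy as np

def quadtree_blocks(img, t_quad, min_size=4):
    """Return leaf areas as (row, col, size) for a square power-of-two image."""
    leaves = []

    def split(r, c, size):
        block = img[r:r + size, c:c + size]
        # Decompose further only if the area varies by more than t_quad
        # and is still larger than the minimum area size.
        if size > min_size and block.max() - block.min() > t_quad:
            half = size // 2
            for dr in (0, half):
                for dc in (0, half):
                    split(r + dr, c + dc, half)
        else:
            leaves.append((r, c, size))

    split(0, 0, img.shape[0])
    return leaves
```

Smooth background regions stay as large areas, while areas containing a bright target are decomposed down to the minimum size, which concentrates the areas (and hence the control points) around the target outline.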
The second step is to artificially construct the background by Bezier interpolation [14]. Bezier interpolation is well suited to reconstructing a large-scale matrix, which is an image in our case. The method first interpolates through some identified control points, and the interpolation can then be adapted to estimate the contour of the object.
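For illustration, a single bicubic Bezier patch over a 4 × 4 grid of control points can be evaluated in the standard matrix form $A \, O \, R \, O^{T} B^{T}$. The sketch below assumes this standard form, with $O$ taken as the cubic Bernstein basis matrix and $(a, b) \in [0, 1]^2$ as the normalized position inside an area; it is not taken verbatim from the source.

```python
import numpy as np

# Cubic Bernstein basis in power form, for the row vector [t^3, t^2, t, 1].
O = np.array([[-1.0,  3.0, -3.0, 1.0],
              [ 3.0, -6.0,  3.0, 0.0],
              [-3.0,  3.0,  0.0, 0.0],
              [ 1.0,  0.0,  0.0, 0.0]])

def bezier_patch(R, a, b):
    """Evaluate a bicubic Bezier patch with 4x4 control points R at (a, b)."""
    A = np.array([a ** 3, a ** 2, a, 1.0])
    B = np.array([b ** 3, b ** 2, b, 1.0])
    return float(A @ O @ np.asarray(R, dtype=float) @ O.T @ B)
```

The patch passes through its four corner control points, and a constant set of control points yields a constant surface, which is the behavior expected of a smooth background estimate.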
After that, the Bezier plane of each area can be reconstructed through approximation of the x and y coordinates and grey values. These approximations directly correspond to 16 control points. Here, $(a, b)$ indicates the position of an interpolated point; $(A, B)$ indicates the variable interpolation factors, which are connected to $(a, b)$; $O$ indicates the constant interpolation factor matrix; $R$ indicates the 4 × 4 matrix with 16 control points; and $T$ stands for vector or matrix transpose. The variable interpolation factor $A$ is defined as $A = [a^3, a^2, a^1, a^0]$, and $B$ is defined analogously in terms of $b$.
Although the IR background ($I_{BIR}$) might be almost perfectly restored by joining the Bezier planes of the individual areas in the quadtree framework, the linear combination of all Bezier planes may hinder some key IR objects. Because different control points are used in the stitched areas, the IR background is smoothed by a Gaussian filter.
Here, $s$ and $\Phi$ represent the size and sigma parameter of the Gaussian filter, respectively. In most cases, the linked Bezier areas are very similar, so a minor degree of smoothing is sufficient to produce a flattened background. After that, the flattened IR background image $I_{FBIR}$ is obtained. Then, the bright IR features can be easily obtained as the difference between the background image and the IR image $I_R$.
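The smoothing and subtraction steps can be sketched in NumPy as follows, using the kernel size $s = 11$ and sigma $\Phi = 5$ reported in Section IV-C as defaults. The edge padding, the separable implementation, and the clipping of negative differences to zero (keeping only pixels brighter than the background) are assumptions; the function names are hypothetical, and the input `bg` stands for the background already reconstructed from the Bezier planes.

```python
import numpy as np

def gaussian_kernel1d(size=11, sigma=5.0):
    """Normalized 1-D Gaussian kernel of odd length `size`."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth_background(bg, size=11, sigma=5.0):
    """Flatten the reconstructed background with a separable Gaussian filter."""
    k = gaussian_kernel1d(size, sigma)
    padded = np.pad(bg.astype(float), size // 2, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def bright_ir_features(ir, bg, size=11, sigma=5.0):
    """Bright IR features: IR image minus the flattened background, clipped at 0."""
    flat_bg = smooth_background(bg, size, sigma)
    return np.clip(ir.astype(float) - flat_bg, 0.0, None)
```

On a uniform background, a single hot pixel in the IR image survives the subtraction with its full contrast while the rest of the feature map stays at zero.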
To further refine the IR features, the IR features are subtracted from the product of the estimated background (the difference between $I_R$ and $I_V$) and a suitable minimizing ratio $\alpha$. Consequently, most of the useless background details are removed, whereas the beneficial IR features are preserved.
Here, $\alpha$ is the parameter that controls the background degradation factor within the range [0, 1], and $\alpha = 0.6$ in our experiments. After the refinement and enhancement of the IR features, data fusion with the initial fused image can be carried out.

E. Data Fusion
In the fused unit, the final fused image $I_{\text{Final Fused}}$ is obtained by pixel-level fusion of the initial fused image $I_{\text{Initial Fused}}$ with the $IR_{\text{Final}}$ features.
The $IR_{\text{Final}}$ features are obtained by suppressing the initial IR features while preserving the visible information, to prevent the fused image from suffering from overexposure. Here, ∀ denotes the feature suppression ratio, which is calculated using Avg, the mean of the highest 0.5% of values of the sum of the initial fused image and the IR feature image. This process is similar to an average scaling of the grey intensities. After these refinement steps, the final IR features are ready for fusion. Figs. 5 and 6 show the complete process from IR feature extraction to the final fusion.
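The quantity Avg described above can be computed as in the following sketch. Only the Avg statistic is shown; how it enters the suppression ratio is not recoverable from the text and is therefore not reproduced. The function name `top_fraction_mean` is hypothetical.

```python
import numpy as np

def top_fraction_mean(initial_fused, ir_features, fraction=0.005):
    """Mean of the highest `fraction` of values in (initial_fused + ir_features)."""
    total = (initial_fused.astype(float) + ir_features.astype(float)).ravel()
    k = max(1, int(round(fraction * total.size)))  # number of pixels in the top 0.5%
    return float(np.sort(total)[-k:].mean())
```

Averaging over the brightest 0.5% of pixels rather than taking the single maximum makes the statistic less sensitive to isolated outlier pixels.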

IV. EXPERIMENTS AND ANALYSIS
In this section, two remote sensing benchmark datasets are used in our experiments to verify the effectiveness and superiority of the proposed IR-MSDNet, first subjectively and then objectively using quantitative quality metrics, in comparison with other state-of-the-art fusion methods. An OD task is further performed to verify that the proposed IR-MSDNet greatly enhances the details in the fused images, which yields the best OD results.

A. Datasets
The two remote sensing benchmark datasets are TNO [32] and the Aerial Image dataset [4], which are also commonly used by other algorithms [6], [9]-[11]. The TNO image fusion dataset comprises multispectral (visual and IR) nighttime imagery of different military scenarios, registered with different multiband camera systems. The Aerial Image dataset of visible and IR images was captured by a remote sensing platform. The size of each image in the Aerial Image dataset is 512 × 512. Fig. 7 shows twenty pairs with different scenes from the Aerial Image dataset, while Fig. 8 shows ten visible and IR image pairs from the TNO dataset.
In this article, seven quantitative quality metrics are selected for objective evaluation: entropy (En) [34]; Qabf [35], which measures the quality of visual information obtained from the fused image; FMIw and FMIdct [36], which compute fast mutual information (FMI); a modified structural similarity, SSIMa [37]; MS-SSIM [38], a modified structural similarity that focuses only on structural information; and the standard deviation (SD) [39], which further analyzes the quality of the fused image.
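Two of the simpler metrics above, entropy (En) and standard deviation (SD), can be sketched in NumPy as follows. The sketch assumes 8-bit grayscale images and the usual Shannon entropy of the grey-level histogram; the function names are hypothetical, and the other metrics are not reproduced here.

```python
import numpy as np

def entropy(img, levels=256):
    """Shannon entropy (in bits) of the grey-level histogram of an integer image."""
    hist = np.bincount(img.ravel().astype(np.int64), minlength=levels).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]  # empty bins contribute nothing
    return float(-(p * np.log2(p)).sum())

def standard_deviation(img):
    """Standard deviation of the pixel intensities."""
    return float(np.std(img.astype(float)))
```

Higher values of both metrics indicate a fused image with a wider spread of grey levels, which is commonly read as richer information and higher contrast.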

C. Related Parameters
Compared with visible image datasets, IR image datasets are relatively small. Therefore, in this article, the larger publicly available MS-COCO dataset [40] is used for training the network, following the convention set by other methods [9], [12]. The MS-COCO images are resized to 256 × 256 and converted to grayscale for training. The learning rate, batch size, and number of epochs are set to $1 \times 10^{-4}$, 2, and 4, respectively. The parameters for IR feature extraction, namely the threshold $T_{quad}$ and the area size in quadtree decomposition, the kernel size $s = 11$ and sigma $\Phi = 5$ in Gaussian smoothing, the background suppression ratio, and the IR feature suppression ratio, are set throughout the article as in [41].
Fig. 9 shows the evaluation of the loss function parameter $\lambda \in \{1, 10, 100, 1000\}$, as discussed in the last paragraph of Section III. It is observed that the network converges faster when $\lambda$ is set to larger values. However, after 40 000 iterations, the optimal weights are achieved no matter which loss weight is chosen. Based on these observations, $\lambda$ is set to 1000; the weights obtained after training with this setting give the best values for the quality metrics En, Qabf, FMIw, and SSIMa.
Fig. 9 caption: the x-axis indicates iterations; each point represents 100 iterations. Blue: λ = 1; red: λ = 10; green: λ = 100; yellow: λ = 1000.
Fig. 10 and Table I show, graphically and numerically, the performance on the TNO dataset with different modules of IR-MSDNet. Four cases are compared: without multiscale features and CWA, without edge features, without IR features, and the full IR-MSDNet. The effectiveness of each module can be seen from the metric values in the four cases.
Without IR features, all metrics decline, which implies that IR features are the most important features. Moreover, edge features are found to contribute more than multiscale features and CWA. By taking all features into the fusion, IR-MSDNet gives the best fusion results.

D. Comparison and Analysis of Subjective Evaluations
To comprehensively evaluate the performance of IR-MSDNet, traditional and deep learning-based methods (as mentioned in Section IV-B) are included in the comparisons. Moreover, twenty image pairs from the TNO dataset and the Aerial Image dataset are evaluated here. Figs. 11 (3)-(10) and 12 (3)-(10) are the fused images of "umbrella" from TNO and "Warehouse" from the Aerial Image dataset.
Clearly, the proposed IR-MSDNet differs from the other compared methods mainly in the target region and in the background detail region. To facilitate observation, a small region of the IR target is marked with a green frame, and the background detail region is marked with a red frame. It can be seen that all the traditional fusion methods capture the target information in the IR images but also contain some noise, which leads to blurring and less salient targets in the fused images. The fused images in Fig. 11 (3) and (4) and in Fig. 12 (4) and (5) contain several blocking artifacts, which appear as ringing around the salient features. The deep learning-based methods all give better visual perception than the traditional methods, except Fig. 12 (7), which produces a blur effect on the Aerial Image dataset.
In short, it can be seen from Figs. 11 and 12 that the proposed IR-MSDNet preserves more precise intensity information from the IR image and captures more texture from the visible image on both datasets.

E. Comparison and Analysis of Objective Evaluations
Objective evaluations of the proposed IR-MSDNet and all compared fusion methods are given in Table II for the TNO dataset and in Table III for the Aerial Image dataset. Metric values in boldface indicate the optimal results, and underlined values indicate the suboptimal ones. It can be seen from Table II that the objective evaluations are consistent with the conclusions from the subjective evaluations. In particular, on the TNO dataset, the proposed IR-MSDNet has optimal values for four of the seven metrics and suboptimal values for the other three. It also has the optimal and suboptimal values, with the least average value for FMIdct, on the Aerial Image dataset, as can be seen from Table III. Performance may be slightly inferior on some datasets, but for Qabf the proposed IR-MSDNet is considerably superior to the other methods, which means that IR-MSDNet can produce fused images with rich details. The IR-MSDNet has significant advantages in image fidelity, which is consistent [...] 3.64 s, 0.31 s, and 3.82 s, respectively. Although IR-MSDNet is slightly slower than the other deep methods, its performance is better than theirs.

F. Objective Evaluations on OD
Furthermore, to emphasize the effectiveness of IR-MSDNet, it is applied to OD in remote sensing. RPN [2] is chosen as the basic state-of-the-art detector on the KAIST [42] multispectral (IR, visible) benchmark dataset. The log-average miss rate (MR) metric [2], the standard measure for OD, is used for quantitative evaluation. The log-average MR is computed by averaging the miss rates over different false positives per image (FPPI) points sampled evenly in log-space. Since MR specifies the rate of undetected objects (e.g., persons and vehicles) as false negatives, a lower value represents more robust OD. In Table IV, the RPN detector is used to detect different objects in visible images, IR images, and fused images generated by different fusion methods, i.e., JSRSD [4], VggML [9], DeepFuse [8], FusionGAN [11], DenseFuse [12], IFCNN [10], two streams (feature fusion from visible and IR images) [2], and the proposed IR-MSDNet. It can be seen from Table IV that the lowest average value of 27.45% is achieved by the proposed IR-MSDNet, which means that IR-MSDNet greatly enhances the details in the fused images and enables the RPN detector to obtain the best OD results. Thus, the proposed IR-MSDNet is not only competitive in the image fusion task but can also be applied to OD tasks on multispectral (IR and visible) images.
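The log-average MR computation can be sketched as follows. The sketch assumes the widely used protocol of nine FPPI reference points evenly spaced in log-space over $[10^{-2}, 10^{0}]$ and a geometric mean of the sampled miss rates; the text specifies only that the points are evenly spaced in log-space, so the range, the number of points, and the function name are assumptions.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9, lo=1e-2, hi=1.0):
    """Log-average miss rate over FPPI reference points evenly spaced in log-space."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    order = np.argsort(fppi)
    fppi, miss_rate = fppi[order], miss_rate[order]
    refs = np.logspace(np.log10(lo), np.log10(hi), num_points)
    sampled = []
    for r in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference point;
        # if the curve has no such point, count the miss rate as 1.
        idx = np.searchsorted(fppi, r, side="right") - 1
        sampled.append(miss_rate[idx] if idx >= 0 else 1.0)
    sampled = np.clip(np.array(sampled), 1e-10, 1.0)
    # Averaging in log-space gives the geometric mean of the sampled miss rates.
    return float(np.exp(np.mean(np.log(sampled))))
```

A detector whose miss rate is 0.3 at every operating point obtains a log-average MR of 0.3, and lower values indicate fewer missed objects across the sampled FPPI range.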

V. CONCLUSION
In this article, a novel and effective deep architecture named IR-MSDNet is proposed to learn robust and discriminative salient representations for IR and visible image fusion. IR-MSDNet is based on specially designed IR features and a multiscale dense network with attention. It mainly contains an encoder, a multiscale decoder, and a fused unit. The dense and multiscale features, fused by the $\ell_1$-norm strategy and further enhanced by the attention module, can capture more scale-related features. Edge features are concatenated with the features output by the CWA module to form more detailed features for fusion. The final fused image is reconstructed from both the decoded multiscale features and the IR features extracted by traditional methods. Taking advantage of both deep learning and traditional methods, IR-MSDNet fully inherits the IR target information and the rich visible background details in the fused image. In the experiments, both subjective and objective quality metrics are used to compare the proposed IR-MSDNet with other fusion methods. The results show that the proposed IR-MSDNet achieves state-of-the-art fusion performance. A further experiment on multispectral (RGB-infrared) OD validates that the proposed method greatly enhances the details in the fused images, which are crucial for obtaining the best OD results.