Multiscale Progressive Fusion of Infrared and Visible Images

Infrared and visible image fusion aims to generate more informative images of a given scene by combining multimodal images with complementary information. Although recent learning-based approaches have achieved impressive fusion performance, developing an effective fusion algorithm that preserves complementary information without bias toward either source image remains a significant challenge. In this work, we propose a multiscale progressive fusion (MPFusion) algorithm that extracts and progressively fuses multiscale features of infrared and visible images. The proposed algorithm consists of two networks, IRNet and FusionNet, which extract the intrinsic features of infrared and visible images, respectively. We transfer the multiscale information of the infrared image from IRNet to FusionNet to generate an informative fusion result. To this end, we develop the multi-dilated residual block (MDRB) and the progressive fusion block (PFB), which progressively combine the multiscale features from IRNet with those from FusionNet to fuse complementary features effectively and adaptively. Furthermore, we exploit edge-guided attention maps to preserve complementary edge information in the source images during fusion. Experimental results on several datasets demonstrate that the proposed algorithm outperforms state-of-the-art infrared and visible image fusion algorithms in both quantitative and qualitative comparisons.


I. INTRODUCTION
Image fusion is a technique that combines multiple images captured from different sensors to generate a more informative image of a given scene that can facilitate subsequent processing [1], [2], [3], [4], [5]. A pair of infrared and visible images is the most commonly used combination of modalities because the images captured in the two wavelengths contain complementary information on a scene from different aspects, and thereby provide more robust and informative results together [2]. In particular, whereas visible images contain scene textures that facilitate human visual perception, their quality is easily affected by environmental conditions, such as illumination or weather. In contrast, because infrared images capture the thermal radiation of objects, they are robust against environmental conditions but have poor scene textures [6]. Because of their practical usefulness and importance, infrared and visible image fusion techniques have been applied in various applications, including object tracking [7], [8], salient object detection [9], [10], [11], and surveillance [12]. Figure 1 shows an example in which visible and infrared image fusion improves object detection performance.
The key challenge in image fusion is to develop effective feature extraction for each image and appropriate fusion rules to integrate the extracted features into the fused image. Various algorithms have recently been proposed to address this challenge, and they can be broadly classified as model-based or learning-based [2]. Model-based algorithms extract image features based on different mathematical theories and then determine appropriate fusion rules on the basis of the extracted features [13], [14], [15], [16], [17], [18], [19], [20]. However, extracting faithful features with such manually designed models is difficult, and the resulting fusion rules are often complicated and computationally demanding.
With recent advances in deep learning, deep learning-based algorithms that employ convolutional neural networks (CNNs) [21], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32] or generative adversarial networks (GANs) [22], [33], [34], [35], [36], [37] have been developed most actively. CNNs can extract high-level features from source images more effectively than traditional feature engineering, which is essential for generating informative fused images. Therefore, CNN-based fusion algorithms learn to extract informative features and fuse them by characterizing the complex relations between the source images and the fused images. However, despite the powerful ability of CNNs to extract visual features, CNN-based algorithms may fail to preserve complementary information in either of the source images, thereby generating biased fusion results [38]. Furthermore, single-scale feature extraction [23], [24], [25] can hardly utilize global and local information simultaneously, which leads to a loss of spatial information in the source images. GAN-based algorithms generate fused images that preserve the pixel value distributions of both infrared and visible images. Although they have achieved improved performance, they have also exhibited a limited ability to highlight discriminative regions in source images and have generated undesirable artifacts and noise [22], [33], [34], [35]. Figure 2 shows an example of a pair of infrared and visible images and the fusion results obtained by SEDRFuse [21] and DDcGAN [22], which are representative CNN- and GAN-based algorithms, respectively. The result of SEDRFuse in Figure 2(c) is biased toward the infrared image, whereas that of DDcGAN in Figure 2(d) is biased toward the visible image. In contrast, the proposed algorithm achieves a desirable balance between the two source images by better preserving the dominant infrared objects and the rich visible details.
Recently, several transformer-based fusion algorithms [39], [40], [41], [42], [43] have been developed to capture interdomain long-range dependencies with self-attention mechanisms. For example, in [39] and [40], local and global features respectively extracted by CNNs and transformers were integrated to take advantage of both models. In [41] and [42], both self-attention and cross-attention were utilized in pure transformers without CNNs. However, transformer-based fusion algorithms generally demand considerable computational resources to capture long-range dependencies, limiting their applicability to high-resolution images.
In this work, to address the aforementioned limitations of conventional algorithms and better preserve the complementary information in source images, we propose a multiscale progressive fusion algorithm, called MPFusion, for infrared and visible image pairs. The proposed algorithm is composed of two networks: IRNet, which extracts multiscale features from infrared images, and FusionNet, which extracts features from visible images and then progressively fuses the features extracted from both images. To this end, we develop the multi-dilated residual block (MDRB) and the progressive fusion block (PFB) to fuse the multiscale features extracted from IRNet with those from FusionNet. In addition, we develop edge-guided attention maps to faithfully preserve the complementary edge information in the source images during fusion. Experimental results show that the proposed MPFusion algorithm substantially outperforms state-of-the-art infrared and visible image fusion algorithms [21], [22], [29], [30], [31], [32], [33], [38], [44], [45], [46] on several datasets.
The main contributions of this work are summarized as follows:
• We propose the MPFusion algorithm for infrared and visible image fusion, which extracts and progressively fuses multiscale features of the source images; thus, it can preserve both global and local information in the source images.
• We develop two new blocks, called MDRB and PFB, which are designed to improve fusion performance by effectively exploiting multiscale features. Specifically, MDRB progressively extracts the intrinsic features of the source images, whereas PFB progressively fuses their complementary information.
• We develop an adaptive channel fusion strategy that adaptively combines infrared and visible features to better exploit the statistical characteristics of the source images.
• We experimentally show that the proposed MPFusion algorithm outperforms state-of-the-art fusion algorithms on multiple datasets.

The remainder of this paper is organized as follows. Section II briefly reviews related work. Section III describes the proposed MPFusion algorithm for infrared and visible image fusion, and Section IV discusses the experimental results. Finally, Section V concludes the paper.

II. RELATED WORK

A. MODEL-BASED FUSION
Model-based algorithms have been developed based on different mathematical or algorithmic models for feature extraction and fusion rules [2]. For example, multiscale transform-based algorithms [13], [14] decompose each source image into multiscale representations, fuse them in a transform domain, and then obtain a fused image using the inverse multiscale transform. Sparse representation-based algorithms [15], [16], [17], [18] learn to construct overcomplete dictionaries to represent the fused image. Saliency-based algorithms [47], [48] estimate salient areas of the source images to improve the visual quality of the fused images by preserving important features in the source images. Finally, hybrid algorithms [49], [50] combine other model-based algorithms to improve fusion performance. However, model-based feature extraction complicates image fusion tasks, and considerable attention is required to ensure the completeness of features [30]. For a more detailed survey on model-based fusion, the reader is referred to [2].

B. LEARNING-BASED FUSION
Inspired by recent successes of deep learning in computer vision and image processing tasks, extensive research has been conducted on learning-based infrared and visible image fusion. Learning-based fusion algorithms can be broadly categorized into three groups based on how they extract and fuse the features of each image. Figure 3 compares the three architectures commonly used for learning-based image fusion. The two architectures in Figures 3(a) and (b) use CNNs or transformers comprising feature extraction, feature fusion, and image reconstruction, whereas that in Figure 3(c) uses GANs. Figure 3(a) shows an early fusion architecture, which performs feature extraction and feature fusion simultaneously, followed by image reconstruction, in an end-to-end manner. Owing to its effectiveness in removing the correlation between the two source images, many algorithms [25], [26], [29], [30], [38], [42], [44], [51], [52] using this architecture have recently been proposed. In particular, several studies [38], [44] have focused on designing network architectures for the effective extraction of useful features from source images and their fusion. However, because early fusion algorithms extract and fuse features simultaneously using a common block without considering the different modalities, the intrinsic features and complementary information of the source images may not be fully exploited. Thus, algorithms in this category may generate biased fused images.
The late fusion architecture in Figure 3(b) extracts the features of each source image separately using CNNs dedicated to each of the two modalities and then fuses them using a fusion scheme. Because late fusion algorithms fuse features extracted by independently trained networks, they can preserve the intrinsic features of each image. Most studies have focused on the design of elaborate network architectures for end-to-end fusion capable of both better feature extraction and better feature fusion [21], [23], [24], [32], [39], [40], [41], [43]. In addition, in [31], an algorithm was developed to generate weight maps for effective fusion using pretrained networks. The proposed MPFusion algorithm similarly performs feature extraction and fusion separately. However, it extracts and progressively fuses multiscale features [53] to preserve the complementary information in the source images more effectively, thereby avoiding bias toward either of the source images.
Finally, several GAN-based algorithms [22], [33], [34], [35], [36], [37] have recently been proposed, which generate a fused image by preserving the pixel value distributions of the source images through an adversarial game between a generator and a discriminator, as shown in Figure 3(c). In particular, as a single discriminator may fail to preserve the pixel value distributions of both images [33], a GAN architecture with two discriminators was developed to overcome this limitation [22]. In addition, based on the observation that GAN-based algorithms have a limited ability to highlight discriminative regions in source images, attempts have been made to incorporate attention mechanisms into GANs [34], [35]. However, because GAN-based algorithms also perform feature extraction and fusion jointly and implicitly, they may likewise generate fused images that are biased toward either of the source images.

III. PROPOSED ALGORITHM
Figure 4 shows an overview of the proposed MPFusion algorithm, which consists of two networks: IRNet extracts multiscale features of the infrared image, and FusionNet outputs the fused result of the infrared and visible images, denoted by I inf and I vis, respectively. The edge-guided attention maps E inf and E vis for I inf and I vis, respectively, are used to improve the fusion performance by preserving the edge information in the source images. The features extracted by IRNet are fed into FusionNet progressively through the PFB at each level. Note that IRNet is trained separately to extract the intrinsic features of the infrared image, and FusionNet is then trained with IRNet fixed. This training strategy improves the fusion performance and ensures the stable training of FusionNet, as will be discussed in Section IV-E.

A. EDGE-GUIDED ATTENTION MAPS
As mentioned previously, infrared and visible image fusion algorithms fuse source images in the feature domain rather than in the image domain. Thus, fine texture details in the source images may be lost during fusion due to unfaithful feature extraction, resulting in a blurry output. Note that the attention mechanism selectively focuses on important parts of the input data to improve the performance of CNNs, and image details are well represented by the gradient of the image [54], [55]. Based on this observation, we define the edge-guided attention map as the relative magnitude of the gradients of the infrared and visible images. More specifically, we obtain the edge-guided attention maps E inf and E vis for the infrared and visible images, respectively, as

E^inf = |∇I^inf| / (|∇I^inf| + |∇I^vis|),    (1)
E^vis = |∇I^vis| / (|∇I^inf| + |∇I^vis|),    (2)

where ∇ denotes the gradient operator and the division is element-wise. As shown in Figure 4, the edge-guided attention maps are concatenated with the corresponding source images; then, they are downsampled to construct multiscale inputs. The edge-guided attention maps force the network to focus more on the complementary edge information in the source images, thereby improving the fusion performance, as will be discussed in Section IV-E.
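As a concrete illustration, the relative-gradient-magnitude maps can be sketched in a few lines of NumPy. The gradient operator is not specified in the text, so central differences via np.gradient and a small stabilizing constant are assumptions here:

```python
import numpy as np

def edge_guided_attention(ir, vis, eps=1e-8):
    """Edge-guided attention maps as the relative gradient magnitudes of
    the infrared and visible images (gradient operator assumed: central
    differences; eps avoids division by zero in flat regions)."""
    gy_ir, gx_ir = np.gradient(ir.astype(np.float64))
    gy_vi, gx_vi = np.gradient(vis.astype(np.float64))
    mag_ir = np.hypot(gx_ir, gy_ir)
    mag_vi = np.hypot(gx_vi, gy_vi)
    denom = mag_ir + mag_vi + eps  # division is element-wise
    return mag_ir / denom, mag_vi / denom  # E_inf, E_vis

rng = np.random.default_rng(0)
ir, vis = rng.random((64, 64)), rng.random((64, 64))
e_inf, e_vis = edge_guided_attention(ir, vis)
# The two maps are complementary: they sum to ~1 at every pixel.
assert np.allclose(e_inf + e_vis, 1.0, atol=1e-5)
```

Because the maps sum to one element-wise, emphasizing an edge in one modality automatically de-emphasizes it in the other, which matches the complementary role described above.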

B. NETWORK ARCHITECTURE
As shown in Figure 4, both IRNet and FusionNet are multiscale networks that extract image features at multiple levels and successively add the features of the previous level to generate output images. Note that multiscale features of source images have been shown to be effective in infrared and visible image fusion [56], [57]. Each level of the networks is responsible for a particular aspect of the source images: a higher level captures local details, whereas a lower level captures global structures. At each level of the networks, the residual channel attention block (RCAB) [58] is used first to force the network to focus on more informative features by adaptively rescaling channel-wise features. Then, the features extracted by IRNet are fed into FusionNet through the PFB, which is composed of the identity mapping block (IB) in IRNet and the fusion block (FB) in FusionNet, to progressively fuse the features of both networks. The architecture of the proposed PFB will be described in detail in subsequent sections. Finally, the output images at each level are generated by applying a 1×1 convolutional layer. In this work, the number of network levels is fixed to N = 3; its effects will be discussed in Section IV-E.
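The construction of the multiscale inputs can be sketched as follows. The 2× average-pooling downsampler and the channel-wise concatenation of each image with its attention map are assumed details; the text only states that the concatenated inputs are downsampled to N = 3 levels:

```python
import numpy as np

def downsample2(x):
    """2x2 average pooling over an (H, W, C) array; assumes even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def multiscale_inputs(img, att, levels=3):
    """Concatenate a source image with its edge-guided attention map and
    build an N-level input pyramid (N = 3 in the paper); the pooling
    operator is an assumption."""
    x = np.stack([img, att], axis=-1)  # (H, W, 2): image + attention map
    pyramid = [x]
    for _ in range(levels - 1):
        x = downsample2(x)
        pyramid.append(x)
    return pyramid  # level 0: full resolution ... level N-1: coarsest

pyr = multiscale_inputs(np.zeros((64, 64)), np.zeros((64, 64)))
assert [p.shape for p in pyr] == [(64, 64, 2), (32, 32, 2), (16, 16, 2)]
```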
We feed the IRNet features unidirectionally into FusionNet for the progressive fusion of infrared and visible images. To this end, we develop the PFB to combine the information from both source images. Figure 5 shows the architecture of the proposed PFB, which consists of the IB in IRNet and the FB in FusionNet. The IB extracts features to reconstruct the input infrared image, whereas the FB extracts features of the input visible image and fuses them with the IB features. Both the IB and FB have three MDRBs to preserve both global and local information in the source images, a bottleneck layer for dimensionality reduction, and three convolutional layers.

C. ADAPTIVE CHANNEL FUSION
In Figure 5, the infrared features f inf out in the IB and the visible features f vis out in the FB are fused and then fed into the next layer of the FB. Addition or concatenation can be used to fuse these features, as in [59]. However, such straightforward approaches may fail to fully exploit the different characteristics of the source images, degrading the fusion performance, as will be discussed in Section IV-E. Thus, we develop an adaptive channel fusion strategy that combines the two features by exploiting the statistical characteristics of the source images during fusion. Specifically, we construct two weight maps α^inf, α^vis ∈ R^((N_MDRB × N_C)×1×1) for the infrared and visible images, respectively, where N_MDRB and N_C denote the number of MDRBs and the number of output channels of each MDRB, respectively. It has been observed that the pixel value distributions of the source images are essential for representing texture details in fusion [25]. Therefore, we use the histograms of the source images to construct the weight maps, so that their pixel value distributions are considered more effectively. More specifically, we employ a simple network with two fully connected (FC) layers followed by a sigmoid activation function, which takes the normalized histogram of each image as input and learns a weight map for that image. Figure 6 illustrates the architecture of the adaptive channel-weight generation network.
Next, after each MDRB, the PFB fuses the features of the IB with those of the FB and then feeds the fused features into the next layer of the FB. More specifically, let f inf out and f vis out denote the output features of the MDRB in the IB and FB, respectively; then, the input feature f fus out of the next layer of the FB is obtained by adaptive channel fusion as

f_out^fus = (α^inf ⊗ f_out^inf + α^vis ⊗ f_out^vis) / (α^inf + α^vis),    (3)

where ⊗ denotes channel-wise multiplication and the division is also channel-wise. This strategy enables FusionNet to fuse the infrared and visible features progressively and stably, while preserving the intrinsic features of each image.
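A minimal NumPy sketch of this adaptive channel fusion follows, under stated assumptions: the two FC matrices are random stand-ins for learned parameters, the ReLU between the FC layers is assumed, and the normalized channel-wise combination is our reading of the channel-wise multiplication and division described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_weights(img, w1, w2):
    """Weight map from the normalized 256-bin histogram of a source image,
    via two FC layers and a sigmoid (w1, w2 are random stand-ins for
    learned parameters; the inner ReLU is an assumption)."""
    hist, _ = np.histogram(img, bins=256, range=(0.0, 1.0))
    hist = hist / max(hist.sum(), 1)
    return sigmoid(w2 @ np.maximum(w1 @ hist, 0.0))  # shape (C,)

def adaptive_channel_fusion(f_ir, f_vi, a_ir, a_vi, eps=1e-8):
    """Channel-wise weighted fusion of (C, H, W) feature maps:
    f_fus = (a_ir * f_ir + a_vi * f_vi) / (a_ir + a_vi)."""
    a_ir, a_vi = a_ir[:, None, None], a_vi[:, None, None]
    return (a_ir * f_ir + a_vi * f_vi) / (a_ir + a_vi + eps)

C = 8
w1, w2 = rng.normal(size=(32, 256)), rng.normal(size=(C, 32))
ir_img, vi_img = rng.random((64, 64)), rng.random((64, 64))
f_ir, f_vi = rng.random((C, 64, 64)), rng.random((C, 64, 64))
fused = adaptive_channel_fusion(f_ir, f_vi,
                                channel_weights(ir_img, w1, w2),
                                channel_weights(vi_img, w1, w2))
assert fused.shape == (C, 64, 64)
```

Normalizing by the weight sum keeps each fused channel a convex combination of the two inputs, which is what makes the fusion stable regardless of how large the learned weights become.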

D. MDRB
For high-quality image generation, it is important to extract features that fully exploit the characteristics of the input image and to propagate them through the network without loss. The multiscale residual block (MSRB) [60], which uses convolution kernels of different sizes, has been frequently employed for feature extraction. However, MSRB requires high computational and memory complexities to increase the receptive field. In this work, inspired by MSRB, we develop MDRB to extract deep features at different scales by employing dilated convolution [61], which can expand the receptive field with the same number of parameters. Figure 7 shows the architecture of the proposed MDRB. MDRB adds the input features f t−1 to the output features of two shared bypass networks that use kernels with different dilation rates r, generating the output features f t. MDRB provides better fusion results than MSRB by faithfully preserving both global and local information with fewer parameters using dilated convolutions, as will be discussed in Section IV-E.
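The parameter/receptive-field tradeoff that motivates MDRB can be checked with simple kernel arithmetic (a generic sketch, not the exact MDRB configuration):

```python
def effective_kernel(k, r):
    """Effective kernel size of a dilated convolution: k + (k - 1)(r - 1).
    The parameter count stays that of a plain k x k kernel."""
    return k + (k - 1) * (r - 1)

def receptive_field(layers):
    """Receptive field of a stack of stride-1 (dilated) convolutions,
    each given as (kernel_size, dilation_rate)."""
    rf = 1
    for k, r in layers:
        rf += effective_kernel(k, r) - 1
    return rf

# A 3x3 kernel with dilation 2 covers a 5x5 area with only 9 parameters,
assert effective_kernel(3, 2) == 5
# and two such layers match the receptive field of a single 9x9 kernel,
assert receptive_field([(3, 2), (3, 2)]) == 9
# whereas plain 3x3 layers grow the receptive field more slowly.
assert receptive_field([(3, 1), (3, 1)]) == 5
```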

E. LOSS FUNCTIONS
To train IRNet and FusionNet, we define the IR loss L IR and fusion loss L fus , respectively, as will be described subsequently.

1) IR LOSS
To train IRNet, we define the IR loss L IR as the weighted sum of the data loss L id and the structure loss L s between an input infrared image and its estimated version:

L_IR = L_id + λ_s L_s,    (4)

where λ s is a hyper-parameter that balances the two losses.
We employ the ℓ2 norm for the infrared data loss:

L_id = Σ_k ‖Î_k^inf − I_k^inf‖_2^2,    (5)

where Î_k^inf and I_k^inf denote the estimated image and the corresponding input image, respectively, at the kth network level. The structure loss is defined as

L_s = Σ_k (1 − SSIM(Î_k^inf, I_k^inf)),    (6)

where SSIM(·) denotes the structural similarity index [62] between the two images.
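A minimal NumPy sketch of the IR loss follows; the per-level mean-squared ℓ2 penalty and the single-window SSIM (in place of the usual windowed SSIM index) are simplifying assumptions:

```python
import numpy as np

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    """Single-window SSIM over the whole image, assuming intensities in
    [0, 1] (a simplification of the windowed SSIM index)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def ir_loss(preds, targets, lam_s=100.0):
    """L_IR = L_id + lam_s * L_s, summed over the multiscale outputs
    (lam_s = 100 as in the training settings)."""
    l_id = sum(np.mean((p - t) ** 2) for p, t in zip(preds, targets))
    l_s = sum(1.0 - ssim_global(p, t) for p, t in zip(preds, targets))
    return l_id + lam_s * l_s
```

When every estimated level matches its input exactly, both terms vanish, so the loss is zero by construction.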

2) FUSION LOSS
We define the fusion loss L fus to train FusionNet as a weighted sum of the data loss L fd, spatial loss L sp, and perceptual loss L p:

L_fus = L_fd + λ_sp L_sp + λ_p L_p,    (7)

where λ sp and λ p are hyper-parameters that control the relative impacts of the three losses. The fusion data loss is defined as

L_fd = Σ_k ‖Î_k^fus − (w_inf I_k^inf + w_vis I_k^vis)‖_2^2,    (8)

where w inf and w vis denote the weights that control the contributions of the input infrared and visible images, respectively, to the fused image. We employ the spatial consistency loss [63] to preserve the spatial characteristics of the source images, which is given by

L_sp = (1/K) Σ_i Σ_{j∈Ω(i)} (|Y_i − Y_j| − |X_i − X_j|)^2,    (9)

where K denotes the number of regions, Ω(i) denotes the neighboring regions of region i, and Y_i and X_i denote the average intensities of region i in the fused image and in the weighted combination of the source images, respectively. Finally, to compare the high-level differences between the source images and the fused image, we employ the perceptual loss [64]

L_p = Σ_k ‖φ_k(Î^fus) − φ_k(w_inf I^inf + w_vis I^vis)‖_2^2,    (10)

where φ k denotes the feature map from the kth layer of the pretrained VGG-16 network [65].
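The data and spatial terms can be sketched as follows. The 4 × 4 region size and the use of the weighted source combination as the spatial reference are assumptions, since the text defers those details to the cited loss [63]:

```python
import numpy as np

def region_means(img, s=4):
    """Average intensity of each s x s region (assumes H, W divisible by s)."""
    h, w = img.shape
    return img.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def fusion_data_loss(fused, ir, vis, w_inf=0.5, w_vis=0.5):
    """L_fd: mean-squared distance to the weighted source combination."""
    return np.mean((fused - (w_inf * ir + w_vis * vis)) ** 2)

def spatial_consistency(fused, ref, s=4):
    """L_sp: preserve the intensity differences between neighboring regions
    of the reference in the fused image (region size s is an assumption)."""
    y, r = region_means(fused, s), region_means(ref, s)
    loss, count = 0.0, 0
    for axis in (0, 1):  # vertical and horizontal neighbor pairs
        dy = np.abs(np.diff(y, axis=axis))
        dr = np.abs(np.diff(r, axis=axis))
        loss += np.sum((dy - dr) ** 2)
        count += dy.size
    return loss / count

rng = np.random.default_rng(0)
ir, vis = rng.random((64, 64)), rng.random((64, 64))
target = 0.5 * ir + 0.5 * vis
assert fusion_data_loss(target, ir, vis) == 0.0
assert spatial_consistency(target, target) == 0.0
```

The spatial term penalizes only changes in local contrast between neighboring regions, not absolute intensities, which is why it complements rather than duplicates the data term.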

IV. EXPERIMENTAL RESULTS

A. TRAINING
We first train IRNet, which is then fixed while training FusionNet.
IRNet: We use the Adam optimizer [66] with a learning rate of 10 −4 and a batch size of 8 for 16 epochs. The hyper-parameter λ s in (4) is fixed to 100.
FusionNet: We also use the Adam optimizer with the same settings as in IRNet with a batch size of 4 for 25 epochs. The hyper-parameters λ sp and λ p in (7) are fixed to 0.05 and 0.5, respectively, and w inf and w vis in (8)-(10) are all set to 0.5.
Training dataset: We use only the KAIST dataset [67] for training, which contains 95,000 well-aligned color-thermal image pairs with a resolution of 640 × 512. We augment the dataset by converting the RGB color to grayscale and randomly cropping 20,000 256 × 256 patches.
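The augmentation pipeline can be sketched as follows; the BT.601 grayscale weights are an assumption, as the text does not state the conversion used:

```python
import numpy as np

rng = np.random.default_rng(7)

def to_grayscale(rgb):
    """RGB-to-grayscale conversion using ITU-R BT.601 luma weights
    (the exact conversion used in the paper is not specified)."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def random_crop_pair(a, b, size=256):
    """Crop the same random window from an aligned (visible, infrared) pair,
    so the patches stay registered."""
    h, w = a.shape[:2]
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    return a[y:y + size, x:x + size], b[y:y + size, x:x + size]

vis_rgb = rng.random((512, 640, 3))  # KAIST pairs are 640 x 512
ir = rng.random((512, 640))
pv, pi = random_crop_pair(to_grayscale(vis_rgb), ir)
assert pv.shape == (256, 256) and pi.shape == (256, 256)
```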

B. EXPERIMENTAL SETTINGS

1) DATASETS
Although we strictly use a single training dataset, we evaluate the performance of the proposed algorithm on various datasets to test its effectiveness and generalization ability.
KAIST [67]: The KAIST dataset contains 95,000 well-aligned image pairs with a resolution of 640 × 512, captured using dedicated camera hardware. We randomly chose 200 pairs that were not used for training.
TNO [68]: The TNO dataset contains multispectral nighttime scene images of various resolutions, ranging from 280 × 280 to 768 × 576, registered with multiband camera systems. We use the test set constructed by Li and Wu [38], which contains 20 image pairs.
RoadScene [30]: The RoadScene dataset contains aligned visible and infrared image pairs chosen by Xu et al. [30] from the FLIR dataset, which contains image pairs captured using real cameras. We use its test set, which contains 221 image pairs with resolutions of up to 563 × 459.

TABLE 1. Quantitative comparison of the fusion results on the KAIST, TNO, and RoadScene datasets using eight quality metrics. For each metric, the best result is shown in boldface, whereas the second-best is underlined. For each algorithm, the average ranking is reported.

C. QUANTITATIVE ASSESSMENT
We use eight frequently used objective quality metrics to evaluate the fusion performance: entropy (En) [69], total edge information (Q AB/F) [70], sum of the correlations of differences (SCD) [71], multiscale structural similarity (MS-SSIM) [72], mutual information for discrete cosine features (FMI dct) as well as wavelet features (FMI w) [73], natural image quality evaluator (NIQE) [74], and blind/referenceless image spatial quality evaluator (BRISQUE) [75]. The En, Q AB/F, SCD, MS-SSIM, FMI dct, and FMI w scores are computed between the fused image and each of the input visible and infrared images, and then averaged. As ground truths are unavailable for infrared and visible image fusion, we also use the two blind quality metrics NIQE and BRISQUE. Higher En, Q AB/F, SCD, MS-SSIM, FMI dct, and FMI w scores imply better results, whereas lower NIQE and BRISQUE scores indicate better performance.

Table 1 compares the quantitative performances. The proposed MPFusion algorithm provides the highest or second-highest Q AB/F, SCD, and MS-SSIM scores for each dataset, implying that the proposed algorithm can better preserve the structures and details of the input images via multiscale and progressive feature fusion. DDcGAN yields relatively high scores on the information theory-based metrics, i.e., En, FMI dct, and FMI w, but lower scores on the fidelity-based metrics Q AB/F and MS-SSIM. This is because the GAN-based DDcGAN tends to generate noise and artifacts in fused images, which increase the amount of information conveyed, as quantified by the information theory-based metrics, but degrade the visual quality of the resulting images. The results of DDcGAN indicate that each quality metric assesses a different aspect of image quality; one algorithm may outperform the others in terms of a single metric yet perform poorly in terms of others. Therefore, we evaluate the overall performance of the algorithms using a ranking-based assessment. Specifically, we obtain the ranking of each algorithm in each metric, and the average rankings are presented in the rightmost column of Table 1. The proposed algorithm consistently yields the best average rankings on all datasets with large margins, which confirms its effectiveness. In addition, the proposed algorithm exhibits similar tendencies across the quality metrics on all datasets, which confirms its superior generalization ability compared with the algorithms used for comparison.
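The ranking-based assessment can be reproduced with a short helper. The toy scores below are hypothetical, not values from Table 1:

```python
def average_rankings(scores, higher_better):
    """scores: {algorithm: {metric: value}}; higher_better: {metric: bool}.
    Returns each algorithm's ranking averaged over all metrics
    (1 = best, as in the rightmost column of Table 1; ties not handled)."""
    algos = list(scores)
    total = {a: 0.0 for a in algos}
    for m, hb in higher_better.items():
        ordered = sorted(algos, key=lambda a: scores[a][m], reverse=hb)
        for rank, a in enumerate(ordered, start=1):
            total[a] += rank
    return {a: total[a] / len(higher_better) for a in algos}

# Hypothetical two-metric example: A wins on En (higher is better),
# B wins on NIQE (lower is better), so both average to 1.5.
toy = {"A": {"En": 7.1, "NIQE": 4.0},
       "B": {"En": 6.8, "NIQE": 3.5}}
ranks = average_rankings(toy, {"En": True, "NIQE": False})
assert ranks["A"] == 1.5 and ranks["B"] == 1.5
```

Averaging ranks instead of raw scores puts metrics with different dynamic ranges on an equal footing, which is why it suits a heterogeneous metric set like this one.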
Finally, Figure 8 shows the box plots for the eight quality metrics in Table 1 using all test images in the TNO dataset. The red lines and crosses denote median values and outliers, respectively. The proposed MPFusion algorithm achieves the highest median Q AB/F, SCD, and MS-SSIM scores in Figures 8(b), (c), and (d), respectively. In addition, the proposed algorithm yields the smallest number of outliers, which indicates that it is more stable and robust than the conventional algorithms.

D. QUALITATIVE ASSESSMENT
Figure 9 compares the fusion results obtained by each algorithm on the KAIST dataset. GTF in Figure 9(c) loses the fine textures in the input images. In Figures 9(d), (e), (g), (h), and (k), VggML, DenseFuse, IFCNN, SEDRFuse, and DRF, respectively, yield relatively blurry results that lose texture details, e.g., the license plate in the second row. The GAN-based algorithms FusionGAN and DDcGAN in Figures 9(f) and (i), respectively, generate undesirable artifacts and noise that alter the image characteristics, e.g., around the car in the second row. IVFusion in Figure 9(l) over-enhances the contrast of the input images. U2Fusion and RFN-Nest in Figures 9(j) and (m), respectively, preserve fine details in the visible images but lose those in the infrared images, e.g., the trees in the third row. In contrast, the proposed algorithm in Figure 9(n) generates a fused image that faithfully preserves the fine textures of both input images without noticeable artifacts.

Figure 10 shows the fused images from the TNO dataset. GTF, IFCNN, SEDRFuse, and DRF in Figures 10(c), (g), (h), and (k), respectively, fail to effectively retain the complementary information in both input images; the fused images contain more information from the infrared images while losing visual information from the visible images. In Figures 10(d), (e), (j), and (m), VggML, DenseFuse, U2Fusion, and RFN-Nest provide fused images with fewer artifacts, but the results are blurred, losing texture information. FusionGAN, DDcGAN, and IVFusion in Figures 10(f), (i), and (l), respectively, generate severe noise components, degrading the image quality. On the contrary, the proposed algorithm in Figure 10(n) faithfully preserves the fine textures in the source images, e.g., the bushes in the third row.

Finally, Figure 11 shows the fusion results on the RoadScene dataset; they exhibit similar tendencies to the results in Figures 9 and 10. In Figures 11(c), (d), (e), and (g), GTF, VggML, DenseFuse, and IFCNN provide poor detail. FusionGAN, SEDRFuse, DDcGAN, DRF, and IVFusion in Figures 11(f), (h), (i), (k), and (l), respectively, provide fusion results with severe artifacts and noise that degrade the image quality. U2Fusion in Figure 11(j) preserves the texture information in the visible images but loses the background information in the infrared images, e.g., the clouds in the second row. RFN-Nest in Figure 11(m) loses the object contours in the fused images, e.g., the car and windows in the third row. In contrast, the proposed algorithm in Figure 11(n) provides fused images that preserve the fine details in each source image.

E. ABLATION STUDIES
We conduct several ablation studies to analyze the effects of the key components of the proposed algorithm on fusion performance. All experiments are performed on the TNO dataset [68], and we compute the average rankings using all the metrics employed in the previous section.

1) EDGE-GUIDED ATTENTION MAPS
To analyze the effectiveness of the edge-guided attention maps, we train the proposed networks using different settings. Table 2 compares the results. Without edge-guided attention maps, the results are poor because the complementary edge information in the source images cannot be fully exploited. Using an edge-guided attention map in only one of the two networks also degrades the performance because only the edge information of a single image is emphasized, which causes the networks to generate fusion results biased toward the image with the attention map. Finally, using edge-guided attention maps in both networks yields the best performance by selectively focusing on the complementary edge information. Figure 12 visually compares the fusion results. Using an edge-guided attention map in only one of the two networks generates results biased toward one of the source images, as shown in Figures 12(d) and (e), whereas using the maps in both networks achieves the best result in Figure 12(f) by forcing the networks to focus on the complementary edge information in the source images.

2) FUSION STRATEGIES
We analyze the effectiveness of the proposed adaptive channel fusion described in Section III-C by training the proposed networks with different fusion strategies. We choose five conventional handcrafted fusion strategies, as described in [59]. Table 3 compares the fusion performances. The proposed fusion strategy outperforms all the handcrafted fusion strategies, because it adaptively fuses features by considering the statistical characteristics of the source images using the input histograms.
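For reference, handcrafted fusion rules of this kind can be sketched element-wise; this is an illustrative set and may differ from the exact five strategies evaluated from [59]:

```python
import numpy as np

# Illustrative handcrafted feature-fusion rules applied to (C, H, W) maps.
def fuse_add(a, b):
    return a + b                # element-wise addition

def fuse_mean(a, b):
    return 0.5 * (a + b)        # element-wise average

def fuse_max(a, b):
    return np.maximum(a, b)     # element-wise maximum

def fuse_l1(a, b, eps=1e-8):
    """Weight each feature map by its l1 activity before averaging,
    a common activity-based handcrafted rule."""
    wa = np.abs(a).sum(axis=(-2, -1), keepdims=True)
    wb = np.abs(b).sum(axis=(-2, -1), keepdims=True)
    return (wa * a + wb * b) / (wa + wb + eps)

a, b = np.zeros((4, 8, 8)), np.ones((4, 8, 8))
assert np.allclose(fuse_mean(a, b), 0.5)
assert np.allclose(fuse_max(a, b), 1.0)
```

Unlike the adaptive strategy above, these rules use fixed, content-agnostic weights, which is the gap the histogram-driven weight maps are designed to close.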

3) LOSS FUNCTIONS
We train IRNet and FusionNet using different combinations of losses to analyze the effectiveness of each loss function. Table 4 quantitatively compares the results. First, using only (L id , L fd ) provides the worst performance. Second, L s improves the fusion performance. Third, the addition of either L sp or L p significantly improves the fusion performance. Finally, the combination of all the losses yields the best fusion performance by a large margin.

4) NETWORK LEVEL
To analyze the effect of the number of network levels, we train the proposed networks with different numbers of levels. Table 5 compares the fusion performances. As the network level N increases, the performance improves because more meaningful features are extracted. However, increasing the level excessively decreases the fusion performance. This is because fewer structural features can be extracted from excessively small images, causing the propagation of less informative features to the next level.

5) MDRB
To analyze the effectiveness of the proposed MDRB, we train the proposed networks using three feature extraction blocks: MSRB [60], short skip connection (SSC) [76], and the proposed MDRB. Table 6 compares the fusion performance of each feature extraction block. The proposed MDRB yields the best fusion performance because its larger receptive fields capture global information better than MSRB and SSC, at the cost of only slightly more parameters than SSC.
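The benefit of dilation can be quantified with the standard receptive-field recurrence for stacked stride-1 convolutions: each layer adds (k - 1) * d to the receptive field without adding parameters. The dilation rates (1, 2, 3) below are illustrative, not necessarily those used in the MDRB.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 convolutions:
    rf grows by (k - 1) * d at each layer."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

# Plain 3x3 stack vs. a multi-dilated stack with the same parameter count.
plain = receptive_field([3, 3, 3], [1, 1, 1])    # rf = 7
dilated = receptive_field([3, 3, 3], [1, 2, 3])  # rf = 13
```

Three dilated 3x3 layers nearly double the receptive field of a plain stack at identical parameter cost, which is the mechanism behind the MDRB's advantage over MSRB and SSC in Table 6.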
In addition, Table 7 compares the fusion performance according to the number of MDRBs. As the number of MDRBs increases, the performance improves because more salient features can be extracted. However, beyond a certain point, adding more MDRBs only saturates the performance while still increasing the computational and memory costs.

6) TRAINING STRATEGIES
As mentioned in Section III, IRNet is first trained separately, and then FusionNet is trained with IRNet fixed. To analyze the effectiveness of this separate training strategy, we train the proposed networks using different training strategies. Table 8 compares the fusion performances. The separate training strategy provides higher fusion performance than joint training because it focuses on extracting the intrinsic features of each source image, which better preserves the complementary information in the source images during fusion.
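The two-stage schedule can be sketched with toy scalar "networks": the IRNet parameter is fitted first, then frozen while the FusionNet parameter is fitted. Only the training schedule mirrors the paper; the models and objectives below are placeholders.

```python
def train_two_stage(x_ir, x_vis, y_ir, y_fused, lr=0.1, steps=200):
    """Two-stage schedule: fit IRNet's parameter, then freeze it and fit
    FusionNet's parameter. Scalar models stand in for the real networks."""
    w_ir, w_f = 0.0, 0.0
    # Stage 1: train IRNet alone on the infrared reconstruction task.
    for _ in range(steps):
        grad = 2 * (w_ir * x_ir - y_ir) * x_ir
        w_ir -= lr * grad
    # Stage 2: IRNet is frozen; only FusionNet's parameter is updated.
    for _ in range(steps):
        pred = w_f * x_vis + w_ir * x_ir  # FusionNet consumes frozen IRNet output
        grad = 2 * (pred - y_fused) * x_vis
        w_f -= lr * grad
    return w_ir, w_f
```

Because stage 2 never touches w_ir, the infrared-specific fit from stage 1 is preserved exactly, which is the intuition behind separate training protecting each source's intrinsic features.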

F. EFFECTS OF PARAMETERS w_inf AND w_vis ON FUSION PERFORMANCE
As discussed in Section III-E, the parameters w_inf and w_vis in (8)-(10) control the contributions of the infrared and visible images, respectively, to the fused image. We evaluate the effects of these parameters on the fusion performance. Figure 13 shows the fused images for several combinations of w_inf and w_vis. The fusion performance is considerably affected by these values. More specifically, when (w_inf, w_vis) = (0.4, 0.6), the fusion results contain information mainly from the visible images. However, as w_inf increases and w_vis decreases, the infrared images contribute to the fusion results more strongly. This indicates that the selection of w_inf and w_vis significantly affects the fusion performance. Therefore, to balance the two modalities and achieve the best fusion performance, we fix both w_inf and w_vis to 0.5 in this work.

G. COMPUTATIONAL COMPLEXITY
Table 9 compares the computational complexity in terms of the average runtime, the number of giga floating-point operations (GFLOPs) required to synthesize 200 paired KAIST images with a resolution of 640 × 512 on an Nvidia RTX 2080Ti GPU, and the number of network parameters. Although the proposed MPFusion consists of two networks, IRNet and FusionNet, it enables a graceful tradeoff between fusion performance and computational complexity.
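A typical way to obtain average-runtime numbers like those in Table 9 is to time repeated forward passes after a warmup phase. The harness below is a generic sketch (for GPU models, device synchronization before reading the timer would also be required); it is not the exact measurement protocol used in the paper.

```python
import time

def average_runtime(fn, inputs, warmup=2, repeats=5):
    """Average wall-clock runtime of fn per input, discarding warmup calls
    so that one-time setup costs do not skew the measurement."""
    for x in inputs[:warmup]:
        fn(x)  # warmup iterations, not timed
    start = time.perf_counter()
    n = 0
    for _ in range(repeats):
        for x in inputs:
            fn(x)
            n += 1
    return (time.perf_counter() - start) / n
```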

H. OBJECT DETECTION PERFORMANCE EVALUATION
To verify the effectiveness of the proposed MPFusion algorithm in improving the performance of computer vision tasks, we apply an object detection algorithm to the fusion results obtained by each algorithm. Specifically, we use a YOLOv4 model [78] pretrained on the COCO dataset [79] for the evaluation. In addition, we also compare object detection performance on the fusion results of SeAFusion [77], an infrared and visible image fusion algorithm dedicated to high-level vision tasks. Figure 14 shows examples of object detection results on the RoadScene dataset. The fusion results obtained by the proposed algorithm in Figure 14(o) yield improved detection performance compared with either the infrared or visible images in Figures 14(a) and (b), respectively. For example, the bicycles, which are not detected in the second row of Figures 14(a) and (b), are detected in Figure 14(o). In addition, the fusion results of the proposed algorithm yield better detection performance than all the competing algorithms. For example, the skateboard in the first row of Figure 14(b) is detected only in Figure 14(o). Figure 15 compares the precision-recall (PR) curves and reports the mean average precision (mAP) performance for the fusion results of each algorithm. A higher mAP value implies more accurate detection. The results show that the proposed algorithm achieves the best detection performance, i.e., the highest mAP score. Further, the proposed algorithm exhibits 22.45% and 9.20% higher mAP values than the input infrared and visible images, respectively. Finally, it is worth noting that the proposed algorithm yields better performance than the dedicated SeAFusion [77]. These results indicate that the fusion results of the proposed MPFusion algorithm consistently improve object detection performance. Hence, the proposed algorithm can have a positive impact on computer vision applications under severe environmental conditions.

I. FUSION RESULTS FOR RGB IMAGES
We developed the proposed algorithm to fuse single-channel visible and infrared images for consistency with the TNO dataset [68]. However, the proposed algorithm can also be applied to fuse three-channel RGB visible and infrared images. To this end, we compute the edge-guided attention maps in Figure 4 for each of the RGB channels of both the visible and infrared images. Figure 16 shows the fusion results for images in the KAIST and RoadScene datasets. The color information of the visible images and the edge information of the infrared images are accurately preserved in the fused images.
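A minimal sketch of the per-channel extension: an edge-based attention map is computed independently for each of the R, G, and B channels. The finite-difference edge measure below is an illustrative stand-in for the paper's edge-guided attention computation in Figure 4.

```python
import numpy as np

def per_channel_attention(rgb):
    """Compute a simple edge-based attention map independently for each
    RGB channel; returns one normalized map per channel, shape (H, W, 3)."""
    maps = []
    for c in range(3):
        ch = rgb[:, :, c].astype(float)
        gx = np.abs(np.diff(ch, axis=1, prepend=ch[:, :1]))
        gy = np.abs(np.diff(ch, axis=0, prepend=ch[:1, :]))
        e = gx + gy                       # gradient-magnitude proxy
        maps.append(e / (e.max() + 1e-8)) # normalize each channel to [0, 1]
    return np.stack(maps, axis=-1)
```

Processing each channel independently lets the attention respond to color edges that appear in only one channel, which is how chromatic structure in the visible image can be preserved during fusion.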

V. CONCLUSION
We proposed a multiscale progressive fusion algorithm, called MPFusion, for infrared and visible image fusion. The proposed MPFusion algorithm consists of two networks: IRNet extracts multiscale features of the infrared image, and FusionNet extracts multiscale features of the visible image and progressively fuses them with those from IRNet. Specifically, we developed the MDRB and the PFB to improve the fusion performance by progressively combining the multiscale features extracted from IRNet with those from FusionNet. Finally, we further improved the fusion performance by preserving the complementary edge information in the source images during fusion based on edge-guided attention maps. Extensive experiments demonstrated that the proposed algorithm outperforms state-of-the-art algorithms on several datasets. An important direction for future work is to develop more effective and sophisticated fusion schemes that can facilitate high-level vision tasks, such as segmentation, object detection, and re-identification.