Deep Unfolding Network for Multi-Band Images Synchronous Fusion

This study proposes a new deep unfolding network, MBF-Net, to solve the multi-band image synchronous fusion problem. Unlike other deep learning-based methods, our network architecture combines the ideas of model-driven and data-driven methods, so it is more interpretable. First, a new multi-band image synchronous fusion model is proposed. The source image in the data fidelity term and the prior regularization term are implicitly represented by the deep learning network and jointly learned from the training data. The proposed model is then solved using a half quadratic splitting (HQS) algorithm and unfolded into a deep fusion network. In addition, a new saliency loss function is proposed to retain thermal radiation information and enhance the fusion effect. Finally, experimental results on the TNO dataset demonstrate the effectiveness of the proposed MBF-Net.


I. INTRODUCTION
Owing to theoretical and technical limitations, images acquired by a single modal sensor cannot accurately and comprehensively describe scene information. For example, infrared sensors are sensitive to thermal radiation; therefore, thermal targets are prominent in infrared images, but these images lack rich textural detail. Visible sensors capture reflected light; thus, visible images contain rich textural structures, but their visual quality is poor in dark or harsh environmental conditions. Therefore, different sensors are complementary, and fusing images acquired by multiple sensors can produce an image that contains more information and facilitates other tasks [1], [2]. With technological development, new requirements have emerged, and image fusion is no longer limited to two images; the synchronous fusion of multiple images has become a new demand [3], [4], [5].
In the past few decades, many image fusion techniques have been proposed, which can be roughly divided into two categories: traditional methods and deep learning (DL)-based methods.
In recent years, optimization-based methods have been widely used in infrared and visible image fusion, and the data fidelity term and regularization term are the keys [19]. Current methods for constructing the data fidelity term mainly select known images so that the fusion result has a pixel intensity distribution similar to that of the source image; therefore, the fusion result is globally closer to the selected source image. The methods for constructing data fidelity terms can be roughly divided into two types. One is to select a source image or a preprocessed image as the data fidelity term. For example, in [19] and [20], the data fidelity term focuses only on the infrared image. Although the fusion result can then contain more salient information from the infrared image, the background of the fusion result lacks detailed texture information. Although this way of constructing the data fidelity term is simple, essential parts of the unselected source images are ignored and cannot be effectively transferred into the fusion result, thus failing to fully utilize the information of all source images. The other way to build a data fidelity term is to balance the weights of the two images in the fusion result through carefully set parameters. Nie et al. proposed a weighted fidelity term to fuse salient objects in infrared images and salient scenes in visible images [21]. Although this method can make full use of all source image information, it requires manual intervention. In addition, as the number of synchronously fused images increases, the difficulty of weight setting also increases, which makes it unsuitable for multi-band image synchronous fusion. Therefore, it is very important to construct an ideal data fidelity term for multi-band image synchronous fusion.
Owing to the ill-posed nature of most image fusion tasks, it is necessary to design prior regularization to obtain the desired fusion results. Most previous infrared and visible image fusion methods use a gradient sparse regularization term with the L1-norm, L2-norm, or L1/2-norm to preserve the detail information of the infrared and visible images. These regularization priors are empirically designed and are insufficient to model complex multi-band images. Therefore, it is very important to construct a suitable regularization term to represent the prior information of multi-band images for image fusion.
With the development of computer vision, DL-based methods have been introduced into the field of image fusion and have achieved more competitive performance than traditional methods [22], [23]. DL-based methods fuse multiple images into the desired result by learning a mapping network between the source and fused images. Although DL-based fusion methods exhibit good performance, a well-trained model requires powerful equipment and massive data support. Moreover, the network model is essentially a black box and lacks interpretability, which limits the practical application of DL-based fusion methods.
To address the above problems, we propose a multi-band image synchronous fusion network (MBF-Net). Compared with other empirically designed networks, the proposed MBF-Net is specially designed based on the multi-band image fusion model proposed in this study; thus, the network structure is more reasonable. The source image in the data fidelity term and the prior regularization term are implicitly represented by the deep learning network and jointly learned from the training data, thereby avoiding the errors introduced by handcrafted designs. In addition, a new saliency loss function is proposed to retain the thermal radiation information of the infrared images. Experimental results on the TNO dataset show that the method is competitive compared with other fusion methods.
Based on the above analysis, the main contributions of this study can be summarized as follows:
1. We propose a new multi-band image synchronous fusion model in which the source image in the data fidelity term and the prior regularization term are implicitly represented by the network and learned from the training data, avoiding errors caused by manual design.
2. We use half quadratic splitting (HQS) to solve the multi-band fusion model, and then the iterative process is unfolded into a deep network MBF-Net. Each part of the network is constructed strictly according to the solution process; therefore, the proposed network is more interpretable.
3. We propose a new saliency loss function to preserve the salient regions of infrared images, which solves the problem of insufficient brightness and blurred edges of the infrared image targets in the fusion results.
The remainder of this study is organized as follows. In Section II, we present the background of our study. Section III describes the proposed MBF-Net in detail, including the creation of the multi-band image fusion model, the network implementation, and the proposed loss function. In Section IV, the experimental results and the corresponding discussion are presented. Section V concludes the paper.

II. RELATED WORK
A. INFRARED AND VISIBLE IMAGE FUSION BASED ON TRADITIONAL METHODS
The MST-based method is the most representative traditional image fusion method. It usually relies on a specific transform model to decompose the images, then designs suitable fusion rules for the decomposed components, and finally reconstructs the fused image. Classical image decomposition methods include wavelet-based algorithms [24], [25], pyramid-based algorithms [26], [27], the non-subsampled contourlet transform [28], and the curvelet transform [29]. However, the fusion results may suffer from ringing artifacts because the classical filters do not preserve edges well. Therefore, some edge-preserving filters have been used for image fusion tasks [30], [31], [32]. Classical fusion rules include the ''average'' and ''max-absolute'' rules. However, these simple fusion rules tend to cause a loss of important information [33], so many new fusion methods have been proposed to better preserve the source image information [34], [35]. Although MST-based image fusion methods are simple, they rely on manual parameter settings and cannot adapt to complex and changing scenes.
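As a concrete illustration of these classical fusion rules, the following minimal sketch (in PyTorch; the function names and tensor layout are ours) implements the ''average'' and ''max-absolute'' rules; in practice the max-absolute rule is usually applied to decomposed sub-band coefficients rather than raw pixels:

import torch

def fuse_average(imgs):
    # ''average'' rule: pixel-wise mean of all source images or sub-bands
    return torch.stack(imgs, dim=0).mean(dim=0)

def fuse_max_absolute(imgs):
    # ''max-absolute'' rule: at each pixel, keep the coefficient whose
    # absolute value is the largest among all inputs
    stacked = torch.stack(imgs, dim=0)               # shape (n, H, W)
    idx = stacked.abs().argmax(dim=0, keepdim=True)  # index of the winning band
    return stacked.gather(0, idx).squeeze(0)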
Optimization-based methods have also been widely used in image fusion in recent years [36], [37]. They are mainly developed by modeling the physical processes behind the problem and capturing a priori domain knowledge. Optimization-based methods rely on data fidelity terms and regularization terms to estimate the fused images. The data fidelity term makes the fusion result converge to a particular source image [20] or to multiple source images simultaneously [21], thus preventing the fusion result from deviating too far from reality, while the regularization term provides a sparse prior [19] or a low-rank prior [4] for the fusion result.

B. INFRARED AND VISIBLE IMAGE FUSION BASED ON DEEP LEARNING METHODS
In recent years, deep learning-based methods have been widely used for infrared and visible image fusion. Many types of network architectures and network units have been adopted, including autoencoder networks [38], [39], [40], generative adversarial networks (GANs) [41], [42], [43], dense connections [7], [23], residual networks (ResNet) [44], attention mechanisms [45], [46], [47], and transformers [48], [49], [50]. GAN-based methods typically rely on the adversarial game between generators and discriminators to make the fused images approximate the probability distribution of the target. Ma et al. proposed FusionGan for infrared-visible image fusion to address the problem of missing real labels [22]. Many GAN variants have subsequently been used for image fusion to address the difficulty of GAN training [51], [52], [53], and the discriminator has also been extended from one to multiple to preserve more information [52], [54]. The attention mechanism is widely used for infrared and visible image fusion. Li et al. used spatial and channel attention in the feature fusion stage to assign higher weights to important spatial locations and channels [39]. Li et al. integrated a multi-scale attention mechanism into the generator and discriminator of a GAN, which preserves meaningful information and enhances important information [42]. Liu et al. designed an edge-guided attention mechanism based on multi-scale features, which utilizes coarse intermediate features to obtain an enhanced attention map of edge images, enabling the fusion results to retain more texture details and reduce undesired artifacts [55]. Because the transformer can focus on global information, several methods have applied it to image fusion. Li et al. first extracted local features with a CNN and then captured the long-range dependencies of the image with a transformer, aiming to combine the two kinds of features to produce satisfactory fusion results [56]. Liu et al. proposed fusing the extracted features with focal transformer blocks [57], which can be adaptively fused based on different features.

III. PROPOSED METHOD
A. MODEL DESCRIPTION
The model for image fusion usually contains two terms: a data fidelity term and a regularization term. The former usually uses the source image or a pre-processed image to constrain the fusion result so that it does not deviate significantly from reality. The latter uses a handcrafted prior as the regularization term. The gradient sparse regularization term is typically used in infrared and visible image fusion to provide the fusion result with more detail information. Thus, the model can be expressed as

\hat{F} = \arg\min_{F} \|F - u\|_p^p + \alpha J(F) \qquad (1)

where \|F - u\|_p^p is the data fidelity term, J(F) is the regularization term, F denotes the fused image, u denotes the source image or pre-processed image, p denotes the type of norm used, and α is a parameter that balances the data fidelity and regularization terms. Many current methods select a specific source image as u [19], [20] or construct u according to the maximum pixel value of the source images [4]. However, this only makes the fusion result close to the selected image in pixel value; the vital information of the unselected images cannot be well transmitted to the fusion result, so the fusion result may only partially contain the information of the source images.

We believe that different band images describe the same scene with different emphases. For example, infrared images focus more on salient target information, visible images contain rich detail information, and near-infrared images contain both. A single sensor cannot obtain the complete scene information, whereas a fused image is a more comprehensive representation of the scene. Therefore, there may be a mapping relationship between the fused image and the source images. To describe this mapping relationship and to make the fused result contain more comprehensive information, we let u = \sum_{i=1}^{n} W_i X_i and build the following model:

\min_{F, \{W_i\}} \left\| F - \sum_{i=1}^{n} W_i X_i \right\|_p^p + \alpha J(F) + \beta \sum_{i=1}^{n} K(W_i) \qquad (2)

where X_i ∈ R^{m×n} represents the input image with m rows and n columns, i ∈ [1, ..., n] indexes the band images, W_i ∈ R^{m×m} represents the transformation matrix corresponding to the ith band image, and β is a balance parameter. K(W_i) is the regularization term used to constrain W_i, the purpose of which is to allow the essential parts of each band image to be obtained. The L2-norm can be used to describe the energy similarity between two images. When the energy difference between the fusion result and the input is Gaussian distributed, the L2-norm is usually used; thus, for the data fidelity term, we set p = 2. For the regularization term, previous methods usually designed it manually based on certain prior assumptions [19]. However, current studies show that a data-driven deep prior has a stronger prior-fitting ability. Therefore, instead of designing J(·) and K(·) in a handcrafted manner, we learn J_deep(·) and K_deep(·), which act on F and W_i respectively, through the network. Thus, Eq. (2) can be written as

\min_{F, \{W_i\}} \left\| F - \sum_{i=1}^{n} W_i X_i \right\|_2^2 + \alpha J_{deep}(F) + \beta \sum_{i=1}^{n} K_{deep}(W_i) \qquad (3)
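To make the notation concrete, the short sketch below evaluates the data fidelity term of Eq. (2) on randomly generated inputs; the sizes and variable names are ours and are chosen only to mirror the definitions above:

import torch

m, cols, n_bands = 128, 128, 3                       # image size and number of bands
Xs = [torch.rand(m, cols) for _ in range(n_bands)]   # source images X_i (m rows, cols columns)
Ws = [torch.eye(m) for _ in range(n_bands)]          # transformation matrices W_i of size m x m
F = torch.rand(m, cols)                              # candidate fused image

u = sum(W @ X for W, X in zip(Ws, Xs))               # u = sum_i W_i X_i
data_fidelity = torch.sum((F - u) ** 2)              # ||F - sum_i W_i X_i||_2^2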

B. MODEL OPTIMIZATION
When the regularization term is not differentiable, the variable-splitting technique is typically used to decouple the objective function. Thus, by introducing the auxiliary variables V_i and Z, Eq. (3) can be reformulated as a constrained optimization problem:

\min_{F, \{W_i\}, Z, \{V_i\}} \left\| F - \sum_{i=1}^{n} W_i X_i \right\|_2^2 + \alpha J_{deep}(Z) + \beta \sum_{i=1}^{n} K_{deep}(V_i), \quad \text{s.t. } Z = F, \; V_i = W_i \qquad (4)

Then, by employing the HQS, the above constrained optimization problem can be transformed into an unconstrained optimization problem:

\min_{F, \{W_i\}, Z, \{V_i\}} \left\| F - \sum_{i=1}^{n} W_i X_i \right\|_2^2 + \alpha J_{deep}(Z) + \beta \sum_{i=1}^{n} K_{deep}(V_i) + \frac{b}{2}\|F - Z\|_2^2 + \sum_{i=1}^{n} \frac{a_i}{2}\|W_i - V_i\|_2^2 \qquad (5)

where a_i and b are the penalty parameters. Eq. (5) can be divided into four subproblems with respect to F, W_i, Z, and V_i:

F^{k+1} = \arg\min_{F} \left\| F - \sum_{i=1}^{n} W_i^k X_i \right\|_2^2 + \frac{b}{2}\|F - Z^k\|_2^2 \qquad (6)

W_i^{k+1} = \arg\min_{W_i} \left\| F^{k+1} - \sum_{j=1}^{n} W_j X_j \right\|_2^2 + \frac{a_i}{2}\|W_i - V_i^k\|_2^2 \qquad (7)

Z^{k+1} = \arg\min_{Z} \alpha J_{deep}(Z) + \frac{b}{2}\|F^{k+1} - Z\|_2^2 \qquad (8)

V_i^{k+1} = \arg\min_{V_i} \beta K_{deep}(V_i) + \frac{a_i}{2}\|W_i^{k+1} - V_i\|_2^2 \qquad (9)

where k denotes the iteration index.
In the closed-form solutions of Eqs. (6) and (7), it is difficult to compute the matrix inverse directly. Therefore, we use the gradient descent method to obtain an approximate solution. The derivatives with respect to F and W_i are

\frac{\partial}{\partial F} = 2\left(F - \sum_{i=1}^{n} W_i^k X_i\right) + b\left(F - Z^k\right) \qquad (10)

\frac{\partial}{\partial W_i} = -2\left(F^{k+1} - \sum_{j=1}^{n} W_j X_j\right) X_i^T + a_i\left(W_i - V_i^k\right) \qquad (11)

Thus, we obtain the iterative formulas for updating F^{k+1} and W_i^{k+1} as

F^{k+1} = F^k - \delta_1\left[2\left(F^k - \sum_{i=1}^{n} W_i^k X_i\right) + b\left(F^k - Z^k\right)\right] \qquad (12)

W_i^{k+1} = W_i^k - \delta_2\left[-2\left(F^{k+1} - \sum_{j=1}^{n} W_j^k X_j\right) X_i^T + a_i\left(W_i^k - V_i^k\right)\right] \qquad (13)

where δ_1 and δ_2 are the step sizes of gradient descent, and X_i^T is the transpose of the image X_i. It should be noted that the gradient descent optimization in Eqs. (12) and (13) requires multiple iterations. For simplicity, we use a single gradient descent step as an approximation of F^{k+1} and W_i^{k+1}, which is beneficial to the network implementation [58].
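The single gradient steps of Eqs. (12) and (13) can be sketched as follows (a minimal sketch based on the updates reconstructed above; the helper names are ours, and F, Z, Ws, Vs, and Xs follow the shapes of the previous sketch):

def update_F(F, Ws, Xs, Z, delta1, b):
    # one gradient step on the F subproblem, Eq. (12)
    u = sum(W @ X for W, X in zip(Ws, Xs))
    grad_F = 2 * (F - u) + b * (F - Z)
    return F - delta1 * grad_F

def update_W(F_next, Ws, Xs, Vs, delta2, a):
    # one gradient step on each W_i subproblem, Eq. (13)
    u = sum(W @ X for W, X in zip(Ws, Xs))
    return [W - delta2 * (-2 * (F_next - u) @ X.T + a_i * (W - V))
            for W, X, V, a_i in zip(Ws, Xs, Vs, a)]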
We can see that the Z subproblem in Eq. (8) is the proximal operator associated with the multi-band fused image prior J_deep(·). We use a network to learn the solver S_1(·) of this proximal operator directly. Then, the iterative formulation of the Z subproblem is

Z^{k+1} = S_1(F^{k+1}) \qquad (14)

In this way, the multi-band fused image prior is not modeled explicitly but is learned through the network, which introduces nonlinearity into the prior modeling and avoids the inaccuracy of handcrafted image priors. Similarly, the iterative formulation of the V_i subproblem is

V_i^{k+1} = S_2(W_i^{k+1}) \qquad (15)

where S_2(·) is the solver of Eq. (9).
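The paper only states that a ResNet-style network replaces the proximal operator; the exact architecture is not given, so the module below is merely a plausible placeholder for S_1(·) (and, analogously, S_2(·)):

import torch.nn as nn

class ProximalSolver(nn.Module):
    # residual CNN used as a learned proximal operator (placeholder for S_1 / S_2)
    def __init__(self, channels=1, width=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, x):
        # residual connection: the network only predicts a correction to its input
        return x + self.body(x)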

C. NETWORK IMPLEMENTATION
1) ITERATIONS UNFOLDING
The multi-band image fusion result can be reconstructed by iteratively solving the above subproblems for F, W_i, Z, and V_i.
However, this iterative process is time-consuming, and tuning a large number of parameters is not an easy task. To overcome these difficulties, we propose MBF-Net, in which the transformation matrices, hyperparameters, and data priors are automatically learned from the training data. Eqs. (12)-(15) were converted into the network structure diagram shown in Fig. 1. The diagram consists of nodes representing the different operations in the iterative solution of the model and edges representing the data flow between these nodes. The (k+1)-th iteration of the solution thus corresponds to the (k+1)-th stage of the network. The Encoder and Decoder structures each contain a convolutional layer with a kernel size of three, a batch normalization (BN) layer, and a rectified linear unit (ReLU). Thus, given a set of multi-band images, a fusion result is generated through a series of intermediate stages.
The proposed MBF-Net is constructed according to this graph. Given the outputs F^k, W_i^k, and Z^k of the kth iteration, Eq. (12) yields the fused output F^{k+1} of the (k+1)-th iteration; in the network, δ_1 and b in Eq. (12) become learnable parameters representing the gradient descent step size and the penalty parameter, respectively. Together with Z^{k+1} = S_1(F^{k+1}), the update blocks corresponding to the iterative processes of F^{k+1} and Z^{k+1} are shown in Fig. 2(a), where ResNet represents the proximal operator solver S_1(·) in Eq. (14); it has been confirmed in previous work that satisfactory results can be achieved by using a ResNet in place of the proximal operator [59].
Similarly, given the outputs W_i^k and V_i^k of the kth iteration and the output F^{k+1} of the (k+1)-th iteration, Eq. (13) yields the weight coefficients W_i^{k+1} of the (k+1)-th iteration; here, δ_2 and a_i become learnable parameters representing the gradient descent step size and the penalty parameters, respectively. The update blocks corresponding to the iterative processes of W_i^{k+1} and V_i^{k+1} are shown in Fig. 2(b).
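Putting the pieces together, one stage of the unfolded network (corresponding to Fig. 2) can be sketched as follows, reusing the helpers from the previous sketches; in the actual network, δ_1, δ_2, b, and a_i would be registered as learnable parameters (e.g., nn.Parameter), and S1 and S2 are the learned proximal solvers:

def unfolding_stage(F, Ws, Z, Vs, Xs, S1, S2, delta1, delta2, b, a):
    # F-update and Z-update (Fig. 2(a)): gradient step followed by the learned prox
    F_next = update_F(F, Ws, Xs, Z, delta1, b)
    Z_next = S1(F_next)
    # W-update and V-update (Fig. 2(b)): gradient step followed by the learned prox
    Ws_next = update_W(F_next, Ws, Xs, Vs, delta2, a)
    Vs_next = [S2(W) for W in Ws_next]
    return F_next, Ws_next, Z_next, Vs_next

Stacking N such stages (N = 5 in the experiments below) and initializing F^1 and W_i^1 as described next yields the complete forward pass.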

2) ITERATIONS INITIALIZATION
In the first iteration stage, W_i^1 and F^1 need to be initialized. Since we learn W_i by constructing a convolution kernel, we encode each source image and map it to a space consistent with the dimensions of the convolution kernel to facilitate the calculation. F^1 is then obtained from the encoded source images, where Encoder(·) represents the encoder; since we fuse three band images, n = 3, and vis, ir, and nir denote the visible, infrared, and near-infrared images, respectively.

3) LOSS FUNCTION
In the training phase, a loss function L is designed to constrain the fused images to retain important features of the source images. For multi-band image fusion, meaningful features include the thermal radiation information in infrared images and the gradient information in the different band images that characterizes texture details. Therefore, our loss function L contains a structural similarity (SSIM) loss L_ssim, a gradient loss L_gradient, and a saliency loss L_salient:

L = L_{ssim} + \lambda_1 L_{gradient} + \lambda_2 L_{salient} \qquad (19)

The SSIM loss L_ssim constrains the fused image and the source images to have similar structural information, the saliency loss L_salient constrains the fused image to have pixel intensities similar to those of the infrared image in salient regions, and the gradient loss L_gradient forces the fused image to contain finer-grained texture details. λ_1 and λ_2 are constants that balance the different loss terms. The SSIM loss is defined as

L_{ssim} = \sum_{i=1}^{n} \left(1 - SSIM(f, X_i)\right) \qquad (20)

where f represents the fusion result, X_i represents the ith source image, and SSIM(·) is the structural similarity operation. The saliency loss measures the difference between the fused image and the infrared image at the pixel level. Therefore, we define the saliency loss as

L_{salient} = \left\| map \odot (f - ir) \right\|_1 \qquad (21)

where f represents the fusion result, ir represents the infrared image, ⊙ denotes element-wise multiplication, and map is the weight map constructed by

map(i, j) = \frac{\max(X, 0)}{\max(ir) - mean(ir)} \qquad (22)

where X = ir(i, j) - mean(ir), ir(i, j) denotes the pixel value at point (i, j) of the infrared image, and mean(ir) denotes the mean pixel value of the infrared image. After this calculation, the weights of pixels whose values are smaller than the mean of the infrared image are 0, and the weights of pixels whose values are larger than the mean lie between 0 and 1; thus, the significant targets of the infrared image can be well extracted, as shown in Fig. 3. The SSIM loss causes the fused image to have structural information similar to that of the source images, and the saliency loss causes the pixel intensity distribution of the salient regions of the fused image to be close to that of the infrared image. However, such coarse-grained constraints may miss some texture details, so we introduce a gradient loss to force the fusion result to contain more fine-grained texture information, which is defined as

L_{gradient} = \left\| \, |\nabla f| - \max\left(|\nabla X_1|, \ldots, |\nabla X_n|\right) \right\|_1 \qquad (23)

where ∇ denotes the gradient operator, max(·) denotes the element-wise maximum operation, and |·| denotes the absolute value operation.
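A minimal sketch of the saliency and gradient losses is given below; the normalizations are our reading of the description above (mean-based rather than plain sums), and the SSIM term of Eq. (20) would come from an off-the-shelf SSIM implementation, so it is omitted here:

import torch

def saliency_weight_map(ir):
    # weight 0 for pixels below the infrared mean, a value in (0, 1] above it
    x = ir - ir.mean()
    return torch.clamp(x, min=0) / (ir.max() - ir.mean() + 1e-8)

def saliency_loss(fused, ir):
    # pixel-level difference to the infrared image, weighted by the saliency map
    return (saliency_weight_map(ir) * (fused - ir)).abs().mean()

def gradient_magnitude(img):
    # simple finite-difference gradient magnitude
    dx = (img[..., :, 1:] - img[..., :, :-1]).abs()[..., :-1, :]
    dy = (img[..., 1:, :] - img[..., :-1, :]).abs()[..., :, :-1]
    return dx + dy

def gradient_loss(fused, sources):
    # push the fused gradients towards the per-pixel maximum of the source gradients
    target = torch.stack([gradient_magnitude(s) for s in sources], dim=0).max(dim=0).values
    return (gradient_magnitude(fused) - target).abs().mean()

def total_loss(ssim_term, fused, sources, ir, lam1=3.0, lam2=30.0):
    # Eq. (19); the default weights follow the ablation study in Section IV
    return ssim_term + lam1 * gradient_loss(fused, sources) + lam2 * saliency_loss(fused, ir)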

IV. EXPERIMENTAL RESULTS
In this section, we describe the ablation and comparison experiments. We selected 22 sets of images (including registered long- and short-wave infrared and visible light images) from the TNO dataset [60] as training data. Each training image was cropped into patches with a stride of 64 pixels, and the size of each patch was 128 × 128. Finally, 13200 sets of patches were used as the input dataset. The proposed MBF-Net was trained on an NVIDIA GeForce RTX 3090 using PyTorch3.6. In the MBF-Net training, the number of epochs was 100, the batch size was 2, and the learning rate was set to 0.0001.

A. ABLATION EXPERIMENTS
In this paper, a saliency loss function is proposed for retaining salient targets, and two parameters, λ1 and λ2, are set in the total loss to balance the different loss functions. In addition, we set the number of iterations to N. Different parameter choices will have different effects on the final fusion results; therefore, we designed ablation experiments to explore the effects of different parameters on the final fusion results. Specifically, we first selected a typical set of multi-band images, then varied the values of the parameters and measured the changes in the images with a series of evaluation metrics. Finally, a comprehensive analysis of the effects of the parameters on each metric was performed to determine the final parameters.

1) THE INFLUENCE OF LOSS FUNCTION
We propose a saliency loss function to retain the significant targets of infrared images. For the total loss, we set two parameters, λ1 and λ2, to balance the different loss functions. To demonstrate the effectiveness of the proposed saliency loss function and to determine the values of λ1 and λ2, we conducted experiments with different values of λ1 and λ2, as shown in Fig. 4. When λ2 is kept constant, it can be intuitively seen that the salient infrared targets in the fusion results gradually dim as λ1 increases. When λ1 is kept constant, the brightness of the salient infrared targets gradually increases as λ2 increases. These experimental results demonstrate the effectiveness of the proposed saliency loss function. To choose appropriate values of λ1 and λ2, we selected the images with good subjective evaluation for objective metric evaluation. When λ1 = 0, the image lacks the detail information of the visible and near-infrared images. When λ1 = 1, the fusion result is better when λ2 is in the range of 20-40; beyond 40, the sky region becomes dim and the overall contrast of the image weakens. When λ1 = 2, the subjective evaluation of the image is better when λ2 is in the range of 30-50, and the same holds when λ1 = 3. We used red boxes to mark the fusion results with good subjective evaluation. Several metrics were used to comprehensively evaluate the fusion results, including the average gradient (AG), spatial frequency (SF), entropy (EN), peak signal-to-noise ratio (PSNR), and SSIM. These metrics assess the fusion results in terms of information quantity, information transfer, edge information, and local similarity. The objective metric evaluation is shown in Table 1. It can be seen that the best results are obtained when λ1 = 3 and λ2 = 30. Therefore, in subsequent experiments, λ1 was set to 3 and λ2 to 30.
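For reference, AG and SF reduce to simple closed forms; the sketch below follows their common definitions on a single-channel 2-D tensor (our implementation), while EN, PSNR, SSIM, and Qabf would typically come from standard evaluation toolboxes:

import torch

def average_gradient(img):
    # AG: mean magnitude of the local gradient, a proxy for image sharpness
    dx = img[:, 1:] - img[:, :-1]
    dy = img[1:, :] - img[:-1, :]
    g = torch.sqrt((dx[:-1, :] ** 2 + dy[:, :-1] ** 2) / 2)
    return g.mean().item()

def spatial_frequency(img):
    # SF: combines the row frequency (RF) and column frequency (CF)
    rf = torch.sqrt(torch.mean((img[:, 1:] - img[:, :-1]) ** 2))
    cf = torch.sqrt(torch.mean((img[1:, :] - img[:-1, :]) ** 2))
    return torch.sqrt(rf ** 2 + cf ** 2).item()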

2) THE INFLUENCE OF ITERATIONS NUMBER
The number of iterations, N, is another essential factor affecting the fusion performance of the algorithm. If N is too small, satisfactory fusion results may not be produced. If N is set too large, the running time of the algorithm increases significantly, and other undesirable effects may appear. To determine an appropriate value of N, we conducted experiments to study the effect of different numbers of iterations on the fusion results. The fusion results for different numbers of iterations are shown in Fig. 5. It can be seen that as the number of iterations grows, the detailed texture features become more obvious. To reflect the difference more intuitively, we conducted an objective evaluation, as shown in Fig. 6. As the number of iterations increased, the values of the SF, AG, and PSNR metrics showed an increasing trend and then stabilized at a certain level or even decreased, which is consistent with the subjective evaluation. The values of Qabf and SSIM exhibited a decreasing trend, and the EN value always remained at a certain level. This phenomenon may occur because, as the number of iterations increases, more detailed information is incorporated into the fusion results; when the number of iterations increases to a certain level, no more detailed information can be obtained from the source images, so the metric values stop growing or even decrease. Such degradation has been observed in other studies [61], possibly because of the difficulty of training deep networks. To balance the fusion effect and running time, we set N = 5. To observe the impact of each iteration, we visualized the intermediate results of each iteration for N = 5, as shown in Fig. 7. As the iterations progress, the fusion results gradually contain more information and become more consistent with human visual perception.

B. COMPARATIVE EXPERIMENTS
We qualitatively and quantitatively compared the proposed method with eight representative methods: GTF [20], RFN-Nest [38], U2Fusion [7], DenseFuse [23], MDDR [3], CUFD [62], FusionGan [22], and SeAFusion [6]. Except for MDDR, these methods are applicable only to two-band image fusion, so they were extended to three-band fusion in a sequential manner. Considering that different input orders of the source images produce different fusion results, the best fusion result was selected for discussion.

1) QUALITATIVE RESULTS
The fusion results of the different methods on the TNO dataset are shown in Figs. 8-14. Our method has three distinct advantages over existing methods. First, our method successfully retains rich texture details, such as the wall in Fig. 8, the car in Fig. 12, and the tree in Fig. 11, which helps with target identification. Second, our method largely preserves the salient targets of the infrared images, such as the targets in Figs. 8, 9, 12, 13, and 14, which facilitates target detection. Third, our results highlight the contrast of the images, which is more in line with human visual perception.
From Figs. 8, 9, 12, and 14, we can see that the fusion results of RFN-Nest, U2Fusion, and DenseFuse contain rich detail information, but they cannot highlight the salient targets well, and the U2Fusion images are particularly affected by noise. CUFD and FusionGan retain the thermal target information of the infrared images well, but the target edges are blurred. The GTF fusion results lack realism and are not rich in detail information. The MDDR and SeAFusion fusion results are better; however, compared with the proposed method, the salient targets in the MDDR results are dimmer, as seen in Figs. 8 and 14. SeAFusion is competitive in retaining thermal targets and detail information in the infrared regions; however, its fused images are brighter, and some information is missed, as shown in Figs. 9-12.

2) QUANTITATIVE RESULTS
To evaluate the effectiveness of the proposed method, several metrics were used to assess the fusion results comprehensively. The results are listed in Table 2, where the best values are shown in bold. As can be seen from the table, the method proposed in this study performs well on the AG, SF, Qabf, and PSNR metrics. The highest values in the AG and SF metrics indicate that the fusion results are rich in detail information, the high Qabf values indicate that the detail information of the source images is well transferred to the fusion results, and the good PSNR performance indicates that the proposed method has an advantage in image fidelity; these observations are consistent with the subjective evaluation. The proposed method also performs well on the EN metric, which measures the amount of information contained in the fusion results. U2Fusion often achieves larger values on this metric, and subjective analysis shows that its fusion results are more affected by noise; this is most likely because the noise widens the pixel distribution, resulting in higher EN values. The proposed method performs at a moderate level on SSIM, which measures the similarity between the fusion results and the source images in terms of brightness, contrast, and structure. The subjective evaluation shows that the proposed method achieves good visual results in terms of luminance and contrast and contains significant targets and rich detail information; however, the fusion results are less similar to the original images, which may cause our method to underperform on this metric. The average quantitative comparison results for the 25 pairs of images from TNO are given in Table 3, which objectively evaluates the fusion results. Our method achieved the highest values on the AG, SF, and Qabf metrics, and the other three metrics were at a middle-to-upper level. Tables 2 and 3 and Figs. 8-14 clearly illustrate that the proposed method performs better than the other methods in both subjective visual comparison and objective quantitative evaluation.

V. CONCLUSION
Efficient learning of the regularization prior and of the source image in the data fidelity term is important for multi-band image fusion. In this study, we proposed a new multi-band image fusion model and unfolded it into a network. The data prior and the source image of the data fidelity term are represented by the network and learned from the dataset. To preserve the significant targets of the infrared images, we also proposed a saliency loss function. Overall, our method can highlight salient targets while retaining useful details of the source images. Comparisons with other methods on an open dataset show that the proposed method is highly competitive in both qualitative and quantitative aspects.