End-to-End Infrared and Visible Image Fusion With Texture Details and Contrast Information

Infrared and visible image fusion combines information from different sensors to achieve a richer description of the same scene. To highlight the salient features of the infrared and visible images in the fused result and obtain a fused image with good performance, an end-to-end infrared and visible image fusion algorithm is proposed in this paper. A contrast attention module and a visible-image cascade are introduced in the generator, so that the fused image can focus on the detail information in the visible image and the contrast information in the infrared image. To retain more structural contour information from the source images, a contour loss is added to the content loss function. In addition, the contrast and detail information in the infrared and visible images are balanced by two discriminators, and a goal-guided reward function is introduced into the discriminators, which further encourages the generator to produce effective fused images. Finally, extensive fusion experiments on public datasets verify the advantages of the proposed algorithm over other classical algorithms, and ablation experiments demonstrate the effectiveness of the improved components.

Due to the complementary characteristics of infrared and visible images, fusing the two can yield rich detail information together with important thermal target information, producing ideal fusion results. Infrared and visible image fusion has been widely utilized in target detection [2], tracking [3], agricultural activities [4], military operations [5], and many other fields [6], [7].

For the fusion of infrared and visible images, the key is to preserve the significant contrast information of the infrared images and the rich texture detail of the visible images. To achieve this goal, many traditional infrared and visible image fusion algorithms have been proposed, which can be divided into six categories: feature-based methods [8], decision-based methods [9], sparse representation-based methods [10], transformation-based methods [11], subspace-based methods [12], and hybrid methods [13]. These have achieved good results in many application scenarios. In order to better [...] rules in traditional algorithms, but also solve the shortcomings of previous GAN-based fusion algorithms. Extensive experiments show that the proposed algorithm achieves visually significant fusion results, and because the trained model is used directly in the test stage, the algorithm also has high real-time performance. The main contributions are as follows:

(1) By introducing the contrast attention module and cascading the visible image with the network, not only is the contrast information of the fused image enhanced, but its texture details are also effectively enhanced.

(2) Based on the original loss function, a contour loss function is introduced so that the fused image and the source images have more similar edge and contour details.

(3) Two discriminators are employed to fully extract information from the infrared and visible images. A goal-guided reward function module is designed in the discriminator to improve its discriminative ability and thus the quality of the fused images produced by the generator.

(4) Comparative experiments and ablation experiments verify the good performance of the proposed algorithm from different perspectives.

The rest of this paper is organized as follows. Section 2 presents related work on image fusion, and Section 3 introduces the algorithm proposed in this paper. Section 4 is the experimental part, which validates the proposed algorithm on public data, compares it with other methods, and verifies the effectiveness of the key parts of the algorithm through ablation experiments. Conclusions are given in Section 5.

Deep learning is a subfield of machine learning that uses artificial neural networks (ANNs) as the main model architecture, aiming to fit an end-to-end data mapping in a data-driven manner. Deep learning fusion methods can not only mine the deep features of an image but also have good model learning ability. In recent years, they have become a new research direction in the field of infrared and visible image fusion [19]. Compared with traditional fusion algorithms, a deep learning network can reduce the errors that traditional algorithms introduce through manual feature extraction, fusion-rule construction, and reconstruction, further improving fusion performance and robustness.

Liu et al. [20] first used convolutional neural networks for learning and training to obtain feature maps, and then reconstructed the feature maps and source images to obtain fused images. Li et al. decomposed the infrared and visible images separately to obtain a base layer and a detail layer [15]. They used a pre-trained VGG network [21] [...] when multi-source image registration is poor.

The above methods either still require manual design of some fusion rules or manual design of ground truth to train the network. On the one hand, there will be some errors in [...] improve the clarity of fused images. With the development of deep learning technology, starting from the modal characteristics of infrared and visible images and the target task of image fusion, mining and utilizing the semantic information of images to improve fusion efficiency and quality in different scenarios is a new and worthwhile direction for deep learning fusion research.

GAN is a probabilistic generative model that makes the generated sample distribution obey the real data distribution through an adversarial game. The traditional GAN model consists of a generator G and a discriminator D. In order for G to learn the distribution P_g over the real data x, a noise variable with prior P_z(z) is first defined, and G maps it to the data space as G(z; θ_g). The discriminator D(x; θ_d) is a binary classification network used to judge whether its input comes from the real data or was generated by G. During training, G tries to "deceive" D by generating samples that look as real as possible, while D tries to identify the fake samples, and the two play against each other. In other words, the generative adversarial network is a minimax optimization problem whose optimal value is a saddle point at which the generator reaches its minimum and the discriminator reaches its maximum.
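In its standard form, this minimax objective (referred to below as Eq. (1)) can be written as:

\min_{G}\max_{D} V(D,G) = \mathbb{E}_{x\sim P_{data}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim P_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big] \tag{1}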
where P_data represents the distribution of the real training data, P_z(z) represents the prior on the input noise variable, D(x) represents the probability that the discriminator judges a sample to be real, and log D(x) represents the cross entropy (CE) of [1, 0] [...]

However, in the early stage of traditional GAN training, the generator performs poorly and the generated data distribution differs greatly from the real data distribution, so the discriminator identifies the generated data with high confidence. As a result, the gradient of the objective function in (1) back-propagated to the generator becomes very small, causing the vanishing-gradient problem. The Least Squares Generative Adversarial Network (LSGAN) [28] adopted in this paper replaces the loss of the traditional GAN model with a least-squares loss function. The decision boundary produced by the discriminator D penalizes generated samples that are far from it, which provides a larger gradient to the generator G during training and overcomes the vanishing-gradient problem of the original GAN model. The LSGAN objective functions are given in Eq. (2) and Eq. (3). [...]

Similarly, the infrared image and the fused image also pass through a discriminator, which judges whether the generated fused image has sufficient contrast and structural information compared with the infrared image. Both discriminators output a comparison value. The generator and discriminators alternate during the training phase until the desired effect is achieved.

In the testing phase, the discriminator is not utilized and only the generator is kept. The infrared image and the visible image are directly input into the trained generator to obtain the fused image.
The generator model G is an improved network based on VGG-16 [18] and is divided into a visible path and an infrared path. The two paths share the same basic structure: each contains three convolutional blocks, and each block has two to three convolutional layers with a max pooling layer at the end. First, the infrared image and the visible image pass through the three convolutional blocks of their respective paths in parallel. The two paths are then fed together into a fourth convolutional block, and the fused image is output after a fifth convolutional block. Before the two paths merge, in order to better extract the contrast features of the infrared image, a contrast attention module is added after the third convolutional block of the infrared path, which refines the feature maps in the network. Inspired by SE-Net [29], this module consists of convolutional layers, activation layers, and pooling layers, as shown in Figure 2.
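As an illustration only, the following PyTorch-style sketch shows one plausible arrangement of the two paths, the cascade of the visible image, and the merge into the joint blocks. Channel widths, padding, the output activation, and the way full resolution is restored are not specified in the text and are assumed here; the contrast attention module is left as a placeholder and is sketched after the next paragraph.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch, n_convs):
    """A VGG-style block: n_convs 3x3 convolutions (stride 1) followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
    layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)


class DualPathGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Visible path: the visible image is re-injected (cascaded) before every block,
        # hence the extra input channel (channel widths are assumptions).
        self.vis_blocks = nn.ModuleList([
            conv_block(1 + 1, 64, 2),
            conv_block(64 + 1, 128, 2),
            conv_block(128 + 1, 256, 3),
        ])
        # Infrared path with the same basic block structure.
        self.ir_blocks = nn.ModuleList([
            conv_block(1, 64, 2),
            conv_block(64, 128, 2),
            conv_block(128, 256, 3),
        ])
        # Contrast attention on the infrared features (sketched after the next paragraph).
        self.contrast_attention = nn.Identity()
        # Joint blocks after the two paths are concatenated.
        self.block4 = conv_block(256 + 256, 256, 3)
        self.block5 = nn.Sequential(
            nn.Conv2d(256, 64, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 1, 1),
            nn.Tanh(),  # assumes images normalized to [-1, 1]
        )

    def forward(self, ir, vis):
        x_ir, x_vis = ir, vis
        for ir_block, vis_block in zip(self.ir_blocks, self.vis_blocks):
            vis_ds = F.interpolate(vis, size=x_vis.shape[-2:])  # cascade the visible image
            x_vis = vis_block(torch.cat([x_vis, vis_ds], dim=1))
            x_ir = ir_block(x_ir)
        x_ir = self.contrast_attention(x_ir)
        fused = self.block5(self.block4(torch.cat([x_vis, x_ir], dim=1)))
        # The excerpt does not say how full resolution is restored; upsampling is assumed.
        return F.interpolate(fused, size=ir.shape[-2:], mode="bilinear", align_corners=False)
```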

The contrast attention features can be computed as follows: [...] where F is the output of the three convolutional blocks and the input of the contrast attention module, ReLU denotes the Leaky ReLU activation function, Conv_1×1 denotes a convolution of size 1 × 1 × 64 with the kernel size set to 3 and the stride set to 1, and S denotes the Sigmoid activation function. The contrast attention module identifies the contrast-salient features in the infrared images by learning the contextual information of the samples, enhancing the network's ability to express contrast-salient regions. To obtain the contrast attention feature, the average of the contrast feature map is first computed from the input F using average pooling; the non-linear interaction of this average is then learned with two convolutions and the Leaky ReLU activation function, resulting in a contrast attention feature map. Finally, the Sigmoid activation function maps the contrast attention feature into the [0, 1] interval to obtain the output contrast attention descriptor.
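A minimal sketch of this SE-style module, assuming global average pooling and channel-wise re-weighting of the infrared features (the text does not state exactly how the descriptor is applied); the channel counts are illustrative:

```python
import torch
import torch.nn as nn


class ContrastAttention(nn.Module):
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # average of the contrast feature map
        self.conv1 = nn.Conv2d(channels, reduced, 1)  # first convolution
        self.act = nn.LeakyReLU(0.2, inplace=True)    # Leaky ReLU activation
        self.conv2 = nn.Conv2d(reduced, channels, 1)  # second convolution
        self.sigmoid = nn.Sigmoid()                   # map the descriptor into [0, 1]

    def forward(self, feat):
        # feat: output of the infrared path's third convolutional block
        desc = self.sigmoid(self.conv2(self.act(self.conv1(self.pool(feat)))))
        return feat * desc                            # re-weight features by the attention descriptor
```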

In addition, in order to obtain more significant texture detail features from the visible image, the visible image is cascaded with each convolutional block of the visible path, correlating it with the deeper detail features of the visible image.

In order for the feature information of both the infrared and visible images to be reflected in the fused image, two discriminators with the same network structure are used to judge the quality of the generated fused image, as shown in Figure 1. D_visible discriminates the similarity between the generated fused image and the original visible image, and D_infrared discriminates the similarity between the generated fused image and the original infrared image. Each discriminator contains four convolutional blocks, a linear layer, and a reward function module. Every convolutional block consists of a convolutional layer, a switchable normalization (SN) layer, and a Leaky ReLU. The numbers of feature maps of the four convolutional layers are set to 32, 64, 128, and 256, respectively. All convolutional kernels are of size 3, and the stride of every convolutional layer is set to 2. Notably, SN is adopted in the discriminators, which not only speeds up network training but also yields well-normalized data.
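A sketch of one discriminator under the stated layer sizes; switchable normalization is not part of torch.nn, so BatchNorm2d is used as a stand-in, the flattened feature size assumes 224 × 224 inputs, and the reward-function module is omitted because its expression is not reproduced in this excerpt:

```python
import torch
import torch.nn as nn


def disc_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(out_ch),           # stand-in for switchable normalization (SN)
        nn.LeakyReLU(0.2, inplace=True))


class Discriminator(nn.Module):
    def __init__(self, in_ch=1, img_size=224):
        super().__init__()
        self.features = nn.Sequential(
            disc_block(in_ch, 32), disc_block(32, 64),
            disc_block(64, 128), disc_block(128, 256))
        feat_hw = img_size // 2 ** 4      # four stride-2 blocks
        self.linear = nn.Linear(256 * feat_hw * feat_hw, 1)

    def forward(self, x):
        return self.linear(self.features(x).flatten(1))  # one score per input image
```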

The setting of the reward function can improve the discriminator's discriminative ability, thereby improving the quality of the fused image. In this paper, a goal-guided reward function module is designed, and its specific expression is as follows.
where r represents the reward function and θ ∈ (0, ∞) is [...] can be expressed as [...] where D(·) represents the discriminator model function.

F_n represents the fused image, t is the threshold at which the discriminator judges the input image to be fake, n denotes the n-th image input to the discriminator, and N is the number of images input to the discriminator. The intensity loss is expressed as follows: [...] where F_n represents the fused image and I_ir represents the infrared image.

L^contour_Generator is the contour loss function proposed in this paper. It uses a feature description operator to map the infrared image, visible image, and fused image from pixel space to a shallow gradient space. By computing the shallow contour feature distance between the fused image and the source images, the contour feature distribution of the infrared and visible images is learned and transferred to the fused image. In this way, the contour information of the infrared target and the texture detail of the visible image are preserved. In this paper, the Laplacian operator is used to extract the contour information of the source images, and the Manhattan distance is used to calculate the distance between the gradients of the fused image and those of the infrared and visible images. The contour loss function L^contour_Generator is expressed as follows: [...] where H and W are the height and width of the input image, respectively, ∇ denotes the Laplacian operator, and ‖·‖_1 denotes the L1 norm.
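One plausible implementation of this loss, assuming the Laplacian responses of the fused image are compared with those of both source images and that the two terms are weighted equally (the exact expression and weights are not reproduced in this excerpt):

```python
import torch
import torch.nn.functional as F

# Discrete Laplacian kernel used to map images into the shallow gradient (contour) space.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)


def laplacian(img):
    """Apply the Laplacian operator to a batch of single-channel images (N, 1, H, W)."""
    return F.conv2d(img, _LAPLACIAN.to(img.device, img.dtype), padding=1)


def contour_loss(fused, ir, vis):
    """Mean Manhattan (L1) distance between the contours of the fused and source images."""
    lf = laplacian(fused)
    return (lf - laplacian(ir)).abs().mean() + (lf - laplacian(vis)).abs().mean()
```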

Two separate discriminator loss functions, L^ir_Discriminator and L^vis_Discriminator, work simultaneously for dual adversarial training. L^ir_Discriminator is used to discriminate the difference between the generated fused image and the infrared image, and L^vis_Discriminator is used to discriminate the difference between the generated fused image and the visible image. Through these two loss functions, the discriminator parameters can be updated well, and a fused image that balances infrared and visible features can be generated. The loss functions of the two discriminators are defined as follows: [...] where N is the total number of input images. [...] [26] and GANMcC [32] are used for comparison with the proposed method.

The correlation coefficient (CC) represents the correlation between the fused image and the source images; the larger its value, the better the quality of the fused image. CC is calculated as

CC = \frac{\sum_{i=1}^{M}\sum_{j=1}^{N}(s_{i,j}-\bar{s})(f_{i,j}-\bar{f})}{\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}(s_{i,j}-\bar{s})^{2}\,\sum_{i=1}^{M}\sum_{j=1}^{N}(f_{i,j}-\bar{f})^{2}}}

where s and f are the source image and the fused image of size M × N, respectively, and s̄ and f̄ are their mean values.

Mutual information (MI) measures the amount of information from the source images that is retained in the fused image; the larger the MI value, the better the fusion performance. The formula for calculating MI is as follows: [...]

Structural similarity (SSIM) expresses the similarity of structure and texture between the fused image and the source images, and is used to measure their structural similarity; the larger the SSIM value, the better the quality of the fused image. SSIM is calculated as

SSIM(s,f) = \frac{(2\mu_{s}\mu_{f}+C_{1})(2\sigma_{sf}+C_{2})}{(\mu_{s}^{2}+\mu_{f}^{2}+C_{1})(\sigma_{s}^{2}+\sigma_{f}^{2}+C_{2})}

where s and f denote the source image and the fused image, respectively, µ_s and µ_f are their mean values, σ²_s and σ²_f are their variances, σ_sf is the covariance between them, and C_1 and C_2 are constants used to stabilize the denominator.
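For reference, CC and SSIM can be computed for grayscale images as follows, using NumPy and scikit-image's structural_similarity; this is a convenience sketch, not the paper's evaluation code:

```python
import numpy as np
from skimage.metrics import structural_similarity


def correlation_coefficient(src, fused):
    """Pearson correlation between a source image and the fused image."""
    s = src.astype(np.float64) - src.mean()
    f = fused.astype(np.float64) - fused.mean()
    return (s * f).sum() / np.sqrt((s ** 2).sum() * (f ** 2).sum())


def ssim(src, fused):
    """Structural similarity between a source image and the fused image."""
    return structural_similarity(src, fused, data_range=fused.max() - fused.min())
```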

Q_ABF expresses, through local analysis, the degree to which the salient information of the source images is contained in the fused image; the higher its value, the better the quality of the fused image. The formula for calculating Q_ABF is as follows: [...] where W is the family of all windows and |W| is the cardinality of W, λ(ω) is a local weight, a and b are the pixels, and Q_0(·) is the local quality index. [...] Table 1. The data in the table also show that the proposed algorithm achieves significantly higher objective evaluation index values than the other algorithms, which further proves its effectiveness.

To sum up, for the fusion of infrared and visible images, the proposed algorithm performs well both subjectively and objectively. [...]

Figure 6 shows the results of the ablation experiment. It can be seen that in the first case, the contrast and detail retention of the fused image are poor, and some areas even lose details. In the second case, the retention of texture detail is improved compared with the first case, but the contrast information is still not obvious enough. The third case has much higher contrast than the first case, but detail retention is still insufficient. The fourth case performs best in terms of visual quality, contrast, and detail retention.

In addition, we use six evaluation metrics, including EN, SF, CC, MI, SSIM, and Q_ABF, to conduct quantitative experiments on the fusion results in the different cases; the results are shown in Table 2. As can be seen from Table 2, [...] the highest, which also shows that the improvements added in this paper are effective.

Therefore, it can be concluded that the design of the con- [...] The reason for this phenomenon is that the ability to extract feature information is poor due to the lack of the content loss. The third and fourth cases correspond to a lack of contrast detail and a lack of texture detail, respectively. Finally, when both the content loss and the adversarial loss are included, the fusion results not only have obvious contrast and rich texture details, but are also well balanced in the distribution of the fused image, with good visual quality.

Similarly, we use six evaluation metrics, including EN, SF, CC, MI, SSIM, and Q_ABF, to conduct quantitative experiments on the fusion results of the different cases; the results are shown in Table 3 and are consistent with Figure 7. It can be seen from Table 3 that all six index values of Case 5 are the best, while the quantitative performance of the other cases degrades to varying degrees.

From the above analysis, it can be concluded that content loss and adversarial loss are complementary, and together they produce well-performing fusion results.

In the discriminator, in order to improve its discriminative ability and thus encourage the generator to produce better fused images, a goal-guided reward function module is designed. This module can more accurately discriminate the images sent by the generator. Therefore [...]

[...] in the test experiment of each algorithm is calculated. From the results in Table 5, it can be seen that the proposed algorithm has relatively low computational efficiency [...]

In the future, we will further optimize the network structure to reduce the complexity of the model without losing the ability to retain the source images' features, and we will generalize the algorithm to other types of image fusion. Moreover, we can improve the way we verify the accuracy of the algorithm by adding a set of semantic tags to the infrared and visible images. In this way, the fused image can show the details of the elements in the source images more clearly, and the advantages of the algorithm can be more prominent.