Deep Learning L2 Norm Fusion for Infrared & Visible Images

Image fusion combines data from multiple images to improve information quality. Infrared images distinguish objects from their surroundings mainly through differences in thermal radiation, which works in all weather conditions, day or night. Visible images provide texture information with high visual precision and a level of detail that matches the human visual system. Combining the thermal radiation information of the infrared modality with the detailed texture of the visible modality is therefore advantageous. The presented algorithm uses the $\ell_{2}$ norm and a combination of residual networks to merge the complementary information from both image modalities. The encoder consists of convolutional layers with selected residual connections, in which the output of each layer is connected to every other layer. The $\ell_{2}$ norm approach is then used to fuse the two feature maps. Finally, the decoder reconstructs the fused image. The large mutual information value of 14.85084 indicates that more complementary information is retained in the fused image than in the infrared and visible images. The large entropy value of 6.92286 indicates more information content in the fused image, and the fused image carries more edge information. The proposed architecture collects more pixel values from both the infrared and visible images, and the fused image looks more natural because it contains more textural content. The proposed system achieves noteworthy performance compared with existing models.


I. INTRODUCTION
Advances in multi-sensor data fusion have supported a number of areas, including remote sensing, clinical imaging and modern military applications. Infrared (IR) images are captured with IR cameras that are sensitive to thermal radiation. As a result, they clearly show the distribution of heat signatures over the scene, but they have a poor dynamic range and lack detail. Visible (VI) images, on the other hand, usually show clear structures and details because VI sensors capture reflected light. However, the targets in VI images are vague when the scene is under low-light conditions or the targets are not clearly exposed. IR and VI image fusion combines the complementary details of the source images into a single informative image, which supports current applications of visible and infrared image fusion technology [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Jiju Poovvancheri.
Several hybrid models have already been implemented in the domain of infrared and visible image fusion. We divide them into three groups: deep learning-based methods [11]-[19], multi-scale transform (MST)-based methods [2]-[10], and other strategies [20]-[24]. MST-based fusion methods typically yield good results because multi-scale processing is consistent with the human visual system [25]. However, their fusion results for IR and VI images are unsatisfactory because the two images convey different information [20]: the thermal radiation in IR images is mainly reflected in pixel intensity, while the appearance of visible images is mainly reflected in the gradients. To address this issue, Ma et al. presented a fusion method called gradient transfer fusion (GTF) [20]. Their method formulates visible and IR image fusion as a minimization problem that aims to preserve the thermal radiation information of the IR image and the gradient information of the VI image. However, small-scale details are ignored in the fusion results, which can be attributed to two factors. First, the $\ell_{1}$-norm is used to compensate for the losses; second, GTF overlooks the pixel intensities of the VI image.
Notably, representation learning-based approaches have also gained considerable recognition. In the sparse domain, numerous fusion strategies have been proposed, including a Histogram of Oriented Gradients (HOG) and Sparse Representation (SR)-based fusion strategy [26], Co-sparse Analysis [27], and Joint Sparse Representation (JSR) [28]. In the low-rank domain, Li et al. [33] pioneered a low-rank representation (LRR) fusion technique. It uses LRR instead of SR to extract features from the images and then uses the $\ell_{1}$-norm together with the choose-max strategy to reconstruct the fused image. As deep learning (DL) became more prevalent, many DL-based fusion techniques were proposed. Convolutional neural networks (CNNs) were used to extract image features and reconstruct the fused image [22], [30]. These CNN-based approaches use only the outputs of the last layer as features, which loses a large amount of useful information captured by the intermediate layers. The lost features are crucial for fusion. Modern fusion approaches mainly extract deep, relevant features from large images by exploiting the computational capacity of deep learning architectures.
Technological advancements in imaging devices produce images with finer details, which is useful for further developments in industrial applications. Fusing image modalities gathers more features and therefore helps to generate enhanced images. Deep learning (DL) has produced state-of-the-art results in many computer vision and image processing applications owing to its strong capabilities in feature extraction and data representation, and it helps to extract deeper features from the image modalities. In most of the available literature on the subject, however, the fused image contains few texture details. Texture details are fine features contributed mostly by the visible image.
To address this problem, this paper proposes a model based on an auto-encoder network, in which the encoder extracts the critical features of an image and the decoder reconstructs the fused result. The encoding network is built from a CNN layer and residual layers, which produce feature maps in each layer. A suitable fusion strategy is adopted to obtain the fused feature map. Finally, the fused image is obtained by passing the fused feature map through five convolutional layers in the decoder network.

A. KEY CONTRIBUTIONS
This paper presents an efficient deep learning model for the fusion of visible and infrared images. The paper has the following highlights:
a. IR and VI images are fused in an efficient and accurate way.
b. The $\ell_{2}$-norm is used as the fusion strategy.
c. An auto-encoder network forms the deep learning model.
d. The fused output of IR and VI is obtained in the decoder network, which contains 5 convolutional layers.
Organization of Paper: The remainder of the paper is organised as follows. Section II describes related works on fusing IR and VI images. Section III presents the proposed approach with a brief explanation. Section IV gives the performance analysis of the presented approach on selected images and against different methods, and Section V concludes the paper.

II. RELATED WORKS
A number of fusion procedures have been presented in recent years, many of which rely heavily on DL. In contrast to multi-scale decomposition and representation learning approaches, deep learning-based approaches use a collection of images to learn useful features.
Liu et al. [12] put forward a CNN-based fusion strategy for IR and VI images, in which the weight map obtained from the network incorporates the pixel activity information of the source images. This model performs two key tasks: measuring activity levels and assigning weights. Compared with other methods, it achieves better visual and objective quality. Ma et al. put forward another fusion approach named FusionGAN [14], a fusion procedure based on the Generative Adversarial Network. The generator produces images that blend infrared thermal radiation information with visible gradient information. The discriminator forces the image created by the generator to contain more visible details. Because of the discriminator, FusionGAN's fusion results have more details than GTF's. Since adversarial training is unstable and difficult, FusionGAN's fusion results suffer from detail mismatch. They also tend to be smooth and fuzzy, particularly at the boundaries of targets, which is caused by optimising the $\ell_{2}$-norm.
Zhang et al. [16] proposed a CNN for image fusion named IFCNN, in which salient features are extracted from the images and fused by a fusion rule; the fused features are then passed to two convolutional layers to obtain the fused image. They also built multi-focus datasets based on RGB-D images that come with ground truth. This model generalizes to fusing various types of images. Ma et al. [19] suggested an approach that uses adversarial learning to retain image information. The complete model overcomes the drawbacks of conventional fusion approaches, such as the manual and complex design of activity-level measurement and fusion rules. It also allows the fused image to retain both the thermal radiation information and the abundant textural information of the visible image while sharpening the boundaries of infrared targets. Compared with other methods in the evaluation, this approach provides significantly better results.
VOLUME 10, 2022
Li et al. [15] suggested an image fusion technique based on ResNet and zero-phase component analysis (ZCA). These combined models are used to address the degradation of the fused output: ResNet extracts features from an image, and ZCA is then used to normalise them and obtain the weight maps. The final fused output is created using a weighted-average scheme, and the method performs better when evaluated with the Github dataset. Li et al. [13] built a deep learning method to produce an image with all of the requisite IR and VI features. They decompose the input images and fuse the base parts with a weighted-average strategy. DL is used to extract features, and the fused image is reconstructed by applying the $\ell_{1}$ norm and a weighted average to these features.
Xu et al. [18] suggested a conditional GAN with dual discriminators for generating fused images. To fuse VI and IR images of different resolutions, they used a dual-discriminator conditional generative adversarial network (DDcGAN). The fusion task is set up as a game between two discriminators and a generator, with the generator producing a realistic fused image to deceive the discriminators. The discriminators are trained to measure the structural difference between the probability distributions of the down-sampled fused images and the infrared images, as well as between the probability distributions of the gradients of the fused images and those of the infrared images. Compared with other models in the evaluation, this model performs better.
For the exposure fusion problem, Prabhakar et al. [32] suggested a CNN-based approach. The researchers used a basic CNN consisting of two convolutional layers in the encoding network and three convolutional layers in the decoding network. The encoder network encodes the two images, and the two resulting feature maps are fused by addition. The decoding network, which consists of three CNN layers, reconstructs the final fused image. Although this method achieves better performance, it still suffers from two main drawbacks: 1) the network architecture is very simple, so key features may not be extracted correctly; 2) the method uses only the outputs of the final layer of the encoding network, so useful details captured by the middle layers are lost, and this becomes more severe as the network becomes deeper.
To improve information transfer among the layers, Huang et al. [29] introduced a new network architecture that uses direct connections from any layer to every subsequent layer. This dense block architecture has three advantages: 1) it increases the flow of data and gradients through the network, making the network easier to train; 2) it preserves as much information as possible; and 3) dense connections have a regularizing effect, which reduces overfitting of the model.

III. PROPOSED APPROACH
Fig. 1 briefly describes the proposed model. The input images are taken from the MS-COCO 2014 repository and passed through the entire pipeline. Initially these images contain certain anomalies caused by the movement of the camera and objects. To remove those anomalies, all images are pre-registered and then passed to the encoding network, which is the most important step. The encoder is a block of convolutional layers with selected residual connections. The pre-registered images are used to train the network: the encoder generates the feature maps and the decoder network reconstructs the image from them. After training, the feature maps generated by the encoding network are passed to the next important step, the $\ell_{2}$-norm fusion. The $\ell_{2}$-norm measures the activity level of the two feature maps, and an averaging operation is performed on the feature maps. Finally, the fused feature maps are passed to the 5-layer CNN decoder network, which produces the fused image from the results of the previous stages.

A. DATASET
For the proposed system, the training images are taken from the MS-COCO 2014 (http://cocodataset.org/) repository [34], with the following settings:
a. It contains more than 80,000 instances of IR and visible images.
b. Images are resized to 256 × 256.
c. The learning rate is set to $1 \times 10^{-4}$.
d. Images are split into training (79,000) and validation (1000) sets.
e. The batch size is set to 4.

B. SOFTWARE AND PLATFORM
TensorFlow is an open-source software library created by the Google Brain team to conduct research in machine learning and deep neural networks. Fundamentally, it is a software library for numerical computation using data flow graphs. The nodes in the graph represent mathematical operations, and the edges represent the multidimensional data arrays (called tensors) that flow between them. In this work, we used TensorFlow, executed on Google Colaboratory with an NVIDIA GTX 1050Ti GPU.

C. PRE-REGISTRATION
Pre-registration is the initial stage, into which the dataset is fed. Beforehand, each pair of images (IR, VI) is converted to grey scale. The converted images are then analysed for outliers. These errors are caused by the movement of the camera and objects, and the pre-registration algorithm is used to rectify them. Pre-registration uses area-based and feature-based methods to find the best alignment between images. Area-based methods compare intensity values, while feature-based methods look for features such as corners, neighbourhoods and coordinates. The traditional pre-registration flowchart is shown in Fig. 2.

D. ENCODING NETWORK
The encoding network performs the major step in the whole approach. The encoding network has two main functions, namely, feature selection and feature map generation (Fig. 3).
The pre-registered image is fed to the feature selection stage, which consists of a convolutional layer (Conv 1) with a 3 × 3 filter and a convolutional layer (Conv 2) with a 5 × 5 filter. These layers extract coarse features. The number of filters is set to 16, and the 1 × 16 output is passed to the next stage, which generates the feature maps. This stage uses convolutional layers with selected residual connections to shorten the training procedure and 3 × 3 filters to extract the relevant features, generating 1 × 16 feature maps for each layer. In contrast to other CNNs, the output of every layer is sent to the subsequent layers, so that the deep features are retained and used for fusion. Residual networks are complex in nature, but they reduce the chance of overfitting, which is common in deep convolutional network models. A stride of 1 and a filter size of 3 allow more deep features to be collected from the selected image. The residual connections preserve deep features and ensure that all salient features are used. They also reduce overfitting as well as training and testing time, and improve visual perception [17]. The CNN layer with the 5 × 5 filter extracts some fine features and helps to reconstruct important features that are missed because of the fast convergence of the residual network explained earlier.
The features from this convolutional layer are combined with those from the residual network and driven to the decoding network for reconstruction.
The depth parameter of the convolutional layer is crucial because it relates to feature selection. As the number of layers increases, overfitting can be reduced and features can be represented more deeply. A convolutional layer that takes an input volume of size $W_1 \times H_1 \times D_1$ is described by four parameters:
a. Filter size ($F$): a 3 × 3 filter is used here.
b. Depth ($K$): a hyper-parameter of the output volume that corresponds to the number of filters used to represent the feature information.
c. Stride ($s$): relates to the sliding of the weight values over the image. When the stride is set to 1, the filter masks slide one pixel at a time across the entire image.
d. Zero padding ($P$): controls the spatial size of the output volume.
Thus an output volume of size $W_2 \times H_2 \times D_2$ is generated, where
$$W_2 = \frac{W_1 - F + 2P}{s} + 1, \qquad H_2 = \frac{H_1 - F + 2P}{s} + 1, \qquad D_2 = K.$$
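As a minimal sketch of how these parameters determine the output volume, the following Python function applies the standard output-size relation $W_2 = (W_1 - F + 2P)/s + 1$ (likewise for the height, with $D_2 = K$). The function name and the padding values in the example are assumptions made for illustration; the padding used in the proposed network is not stated in the text.

```python
def conv_output_shape(w1, h1, f, k, stride=1, padding=0):
    """Return (W2, H2, D2) for a convolution over a W1 x H1 x D1 input.

    f is the (square) filter size, k the number of filters,
    stride the step of the sliding window, padding the zero padding.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    w_span = w1 - f + 2 * padding
    h_span = h1 - f + 2 * padding
    if w_span < 0 or h_span < 0:
        raise ValueError("filter is larger than the padded input")
    # Floor division: positions where the filter would overhang are dropped.
    w2 = w_span // stride + 1
    h2 = h_span // stride + 1
    return w2, h2, k


# 256 x 256 input, 16 filters of size 3 x 3, stride 1, padding 1 (assumed)
print(conv_output_shape(256, 256, f=3, k=16, stride=1, padding=1))  # (256, 256, 16)
```

Under these assumed paddings (1 for the 3 × 3 filter, 2 for the 5 × 5 filter) and a stride of 1, the 256 × 256 spatial size of the input is preserved, so feature maps from both branches can be combined directly.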

E. DECODER NETWORK
This is the final stage of the proposed model. The decoder is a dense network comprising 5 convolutional layers with 3 × 3 filters. The 64 feature maps are passed to the decoder along with the features collected by the 5 × 5 filter, and the decoder reconstructs the fused image (Fig. 4). Each convolutional layer is followed by a ReLU layer, whose activation function is applied element-wise as max(0, x), that is, thresholding at 0. As a result, the size of the volume remains unchanged. The pooling layer performs down-sampling, resulting in a volume with spatial dimensions of 16 × 16 × 1.

F. $\ell_{2}$ NORM STRATEGY
Here the trained network is used with the $\ell_{2}$ norm fusion strategy, in which the $\ell_{2}$ norm measures the activity level. The activity map is derived at each feature point from the $\ell_{2}$ norm of the feature maps at that position. An averaging operation is then applied to the individual feature map to obtain the activity level measurement of the relevant features for fusion (Eq. 6). The activity maps and feature maps are used to create the final fused image. Softmax is widely used to obtain the fused coefficient map, and the final fused output is generated from these coefficients and the feature maps.

G. TRAINING PHASE
During the training stage, only the encoder and decoder networks are considered and the fusion layer is ignored. Training aims to build encoder and decoder networks capable of reconstructing the saliency map or the image. Once the encoder and decoder weights are fixed, a fusion technique is selected to combine the feature maps from the encoder. Fig. 5 depicts the overall training procedure of the auto-encoder network. The fusion layer can be modified to suit different applications. Layer 1 is the convolutional layer in the encoder network, which comprises 3 × 3 filters, as seen in Fig. 5. It is followed by the residual convolutional layers, and the output of each layer is cascaded into the subsequent layer. Five consecutive convolutional layers make up the decoder, which is used to reconstruct the input image. Table 1 gives the architectural details of the convolutional layers used in the auto-encoder network.
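The $\ell_{2}$ norm fusion step that is applied once the encoder and decoder weights are fixed can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the activity level is taken as the $\ell_{2}$ norm across channels at each position, averaged over a (2r+1) × (2r+1) block, and the fusion weights come from a two-way softmax over the averaged activity levels. The block radius `r`, the function names and the exact form of the softmax weighting are hypothetical choices.

```python
import math


def activity_map(features):
    """l2 norm across channels at every position; features[c][y][x]."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[math.sqrt(sum(features[c][y][x] ** 2 for c in range(channels)))
             for x in range(width)] for y in range(height)]


def block_average(amap, r=1):
    """Mean over a (2r+1) x (2r+1) neighbourhood, clipped at the borders."""
    height, width = len(amap), len(amap[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            vals = [amap[j][i]
                    for j in range(max(0, y - r), min(height, y + r + 1))
                    for i in range(max(0, x - r), min(width, x + r + 1))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def fuse(feat_ir, feat_vi, r=1):
    """Fuse two feature volumes with softmax weights from their activity."""
    a_ir = block_average(activity_map(feat_ir), r)
    a_vi = block_average(activity_map(feat_vi), r)
    fused = []
    for c in range(len(feat_ir)):
        plane = []
        for y in range(len(a_ir)):
            row = []
            for x in range(len(a_ir[0])):
                # Two-way softmax, shifted by the maximum for stability
                m = max(a_ir[y][x], a_vi[y][x])
                e_ir = math.exp(a_ir[y][x] - m)
                e_vi = math.exp(a_vi[y][x] - m)
                w_ir = e_ir / (e_ir + e_vi)
                row.append(w_ir * feat_ir[c][y][x]
                           + (1.0 - w_ir) * feat_vi[c][y][x])
            plane.append(row)
        fused.append(plane)
    return fused
```

Because the weights are computed per position and shared across channels, a location where one modality is more active (for example a warm target in the IR features) draws most of its fused value from that modality.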
The loss function is an error-minimizing function that reduces the deviation between the actual target and the predicted values. The $\ell_{1}$ loss is expressed as the sum of the absolute differences over $n$ samples. The mean square error (MSE) is used as a cost function for conv-net training in several articles and mainly deals with the perceived errors in the image. The $\ell_{2}$ loss function, or MSE, minimizes the squared difference between the predicted and actual target values. MSE or the $\ell_{2}$ loss is usually selected as the primary cost function for training CNNs, and it is preferred because it is easy to optimise. Compared with the $\ell_{1}$ norm, however, the $\ell_{2}$ error becomes considerably larger in the presence of noise.
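The difference in noise sensitivity between the two losses can be illustrated with a short Python sketch. The function names and sample values are made up for illustration, and both losses are averaged over the $n$ samples here (an assumption made so that the two numbers are directly comparable).

```python
def l1_loss(target, predicted):
    """Mean absolute difference between target and predicted values."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


def l2_loss(target, predicted):
    """Mean squared difference (MSE) between target and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


target = [1.0, 2.0, 3.0, 4.0]
clean = [1.5, 2.5, 2.5, 3.5]   # every prediction is off by 0.5
noisy = [1.5, 2.5, 2.5, 9.0]   # last prediction is a large outlier

print(l1_loss(target, clean), l2_loss(target, clean))  # 0.5 0.25
print(l1_loss(target, noisy), l2_loss(target, noisy))  # 1.625 6.4375
```

A single outlier raises the averaged $\ell_{1}$ loss by a factor of about 3, while the $\ell_{2}$ loss grows by a factor of about 26, because the error is squared before averaging.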
The Structural Similarity Index Measure (SSIM) is a quality index that gives information about the loss and distortion in an image. It contains three components: luminance distortion, contrast distortion, and loss of correlation [42]. SSIM is expressed as
$$SSIM_{x,f} = \frac{2\mu_x \mu_f + C_1}{\mu_x^2 + \mu_f^2 + C_1} \cdot \frac{2\sigma_x \sigma_f + C_2}{\sigma_x^2 + \sigma_f^2 + C_2} \cdot \frac{\sigma_{xf} + C_3}{\sigma_x \sigma_f + C_3}$$
where
$SSIM_{x,f}$ is the structural similarity of the input ($x$) and fused ($f$) images,
$\sigma_{xf}$ is the covariance of the input and fused images,
$\sigma_x$, $\sigma_f$ are the standard deviations of the input and fused images,
$\mu_x$, $\mu_f$ are the mean values of the input and fused images,
$C_1$, $C_2$, $C_3$ are constants used to stabilise the algorithm.
SSIM correlates well with human perception of image quality. It mainly deals with the structural similarity between two images and attempts to model the perceived changes in the structural information of the image.
The aim of the training process is to develop an auto-encoder network (encoder and decoder) that extracts and reconstructs features more accurately. Since the available infrared and visible training data are insufficient, gray-level images from the MS-COCO dataset are used for model training.
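A minimal Python sketch of the three-component SSIM, computed globally over two flattened gray-level images, is given below for illustration. The constants follow commonly used defaults ($C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$, $C_3 = C_2/2$ with dynamic range $L = 255$); these values, the use of a single global window and the function name are assumptions, since the text does not specify them.

```python
import math


def ssim_global(x, f, dynamic_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two equally sized pixel lists, using one global window."""
    if len(x) != len(f) or not x:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_f = sum(f) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_f = sum((b - mu_f) ** 2 for b in f) / n
    cov = sum((a - mu_x) * (b - mu_f) for a, b in zip(x, f)) / n
    sd_x, sd_f = math.sqrt(var_x), math.sqrt(var_f)

    # Stabilising constants (assumed default values)
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    c3 = c2 / 2.0

    luminance = (2 * mu_x * mu_f + c1) / (mu_x ** 2 + mu_f ** 2 + c1)
    contrast = (2 * sd_x * sd_f + c2) / (var_x + var_f + c2)
    structure = (cov + c3) / (sd_x * sd_f + c3)
    return luminance * contrast * structure
```

For identical inputs each of the three factors equals 1, so the index is 1; an image whose intensity pattern is reversed yields a negative covariance and therefore a much lower score. Practical implementations usually evaluate the index over local sliding windows and average the results.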

IV. EXPERIMENT RESULTS AND ANALYSIS
During the training and testing processes, the model is assessed using subjective and objective measures and compared with other models to examine how well it works. SSIM is also used in the training process, as described in Section III. Under these conditions, the proposed model produced a better reconstructed image. Images from the MS-COCO dataset [34] are used to train the model: around 79,000 images are used as input, and 1000 images are used to validate the auto-encoder network. SSIM is used to check the effectiveness of the algorithm. In general, the network converges as training proceeds, and training takes less time.
The optimal weights of the trained auto-encoder model are used for the fusion process. Twenty sets of visible (VI) and infrared (IR) images were used for the experimental analysis of the fusion algorithm. Fig. 6 depicts raw IR and VI images as well as their fused images; the fused images evidently contain more complementary information. To evaluate the effectiveness of the proposed algorithm, it is compared with similar approaches: the cross-bilateral filter (CBF) [5], gradient transfer and total variation minimization (GTF) [20], joint sparse representation (JSR) [28], DeepFuse [32], DenseFuse [17], weighted least square optimization (WLS) [7], JSR with saliency detection (JSRSD) [31], ResNet-ZCA [15] and FusionGAN [14].
The fused outputs generated by the CBF algorithm contain more noise than useful information from the two images. Because of these artefacts the relevant features are not clear, so CBF is not recommended for fusing visible and infrared images. The images generated by GTF and CBF hold more details from the IR image but less texture information, so the fused output is not suitable for daylight conditions. The fused images generated by JSR and JSRSD are also unsatisfactory because they contain more artefacts and the complementary information provided is not useful. Fused images from the other methods contain fewer artefacts and less noise, and more complementary information.
The fused images generated by CNN-based methods such as DeepFuse, DenseFuse, WLS, ResNet, FusionGAN and the presented method hold more relevant features and are consistent with human visual perception. Fused images from WLS provide more textural information than those from GTF and JSR. Compared with the other fused outputs, the FusionGAN-based fused image retains more information from the infrared image and appears darker. The proposed fusion method generates a fused image with more textural content that appears more realistic, while also retaining relevant information from the IR image. Based on the subjective evaluation of the selected fused images, we conclude that the proposed method retains more salient features from the infrared images and relevant textural information from the visible images.
Both visual perception and objective assessment are needed to study the effectiveness of the approach. Eleven performance measures were used to evaluate the implemented fusion method and the selected related approaches. They are as follows. Entropy (EN) determines how much information is retained in the fused image [35]; a high EN value reflects a large number of features in the fused image. Mutual Information (MI) estimates the features that are transferred to the fused image; a larger MI value indicates that more details are retained from the individual images in the fused image [37]. $Q_{abf}$ is another metric, which gives an idea of the quality of the visual information in the fused image [36]. The Sum of Correlation Differences (SCD) gives information about the correlation differences between the fused image and the individual images [38].
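The two information-theoretic measures, EN and MI, can be sketched in Python from grey-level histograms as shown below. This is an illustrative sketch, not the evaluation code used for the reported results: the function names are hypothetical, images are represented as flat lists of integer grey levels, and the fusion MI is assumed to be the sum of the mutual information between each source image and the fused image.

```python
import math
from collections import Counter


def entropy(pixels):
    """Shannon entropy (in bits) of the grey-level histogram."""
    n = len(pixels)
    counts = Counter(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(a, b):
    """Mutual information (in bits) from the joint grey-level histogram."""
    n = len(a)
    joint = Counter(zip(a, b))
    hist_a, hist_b = Counter(a), Counter(b)
    return sum(
        (c / n) * math.log2((c / n) / ((hist_a[u] / n) * (hist_b[v] / n)))
        for (u, v), c in joint.items()
    )


def fusion_mi(ir, vi, fused):
    """Assumed fusion metric: MI(IR, fused) + MI(VI, fused)."""
    return mutual_information(ir, fused) + mutual_information(vi, fused)


# Two equally likely grey levels carry exactly one bit of information.
print(entropy([0, 0, 255, 255]))  # 1.0
```

An 8-bit image has an entropy of at most 8 bits, which gives a scale for interpreting reported EN values.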
Other metrics commonly used for objective evaluation are $FMI_p$, $FMI_w$ and $FMI_{dct}$, which give the mutual information in a fused image with pixel, wavelet and discrete cosine features, respectively [39]. The quantity of edge pixels preserved in the fused image is indicated by the edge preservation index (EPI) [40]. Visual Information Fidelity (VIF) is a metric for assessing picture quality that provides details on information fidelity [41]. $SSIM_a$ and MS-SSIM are modified SSIMs that check the amount of structural similarity between the fused and individual images [43]. $SSIM_a$ gives the average similarity between the fused and individual images [42]. Table 2 shows the average metric values over all the images selected for the analysis. The best values are shown in red and the second-best values in blue. The proposed method achieves the best values for seven metrics (EN, MI, $Q_{abf}$, SCD, $FMI_p$, $FMI_w$, VIF), the second-best values for two metrics ($SSIM_a$, MS-SSIM) and comparable values for the other metrics. A high EN value indicates more information content in the fused image. Compared with other models, higher values of $Q_{abf}$ and SCD imply that the fused image contains less artificial noise and looks more realistic. The higher MI value shows that the fused image contains a large quantity of information from the IR and VI images.
The objective evaluation demonstrates that the suggested algorithm outperforms the other models in terms of fusion performance. The auto-encoder network with the $\ell_{2}$ norm as the fusion strategy can therefore be used as a tool for fusing infrared and visible images.

V. CONCLUSION
We formulate the task of fusing visible (VI) and infrared (IR) images as an $\ell_{2}$-norm minimization problem, with the intention of producing a fused image that resembles the IR image but contains more of the appearance details of the VI image. We proposed a residual architecture that includes both convolutional and residual layers in order to obtain an efficient fused image. To develop this model, the MS-COCO 2014 dataset containing both IR and VI images is fed into the pre-registration stage and then passed through the encoder network. Fusion is performed by the $\ell_{2}$ norm strategy, and the result is passed to the decoding network to reconstruct the fused image. In the evaluation, our model obtained much better fusion performance than other existing models. Future work will look at the semantic relationship and its correction to improve the algorithm's efficiency. This architecture can be used to address a variety of multi-sensor fusion challenges in medical imaging and in remote sensing applications. The work is also useful for other researchers who wish to explore, analyse and extend this model to make it even more efficient. They can take this model as an inspiration to build other integrated deep learning models for obtaining fused images.